2602.10177v2
# Towards Autonomous Mathematics Research
**Authors**: Tony Feng, Trieu H. Trinh, Garrett Bingham, Dawsen Hwang, Yuri Chervonyi, Junehyuk Jung, Joonkyung Lee, Carlo Pagano, Sang-hyun Kim, Federico Pasqualotto, Sergei Gukov, Jonathan N. Lee, Junsu Kim, Kaiying Hou, Golnaz Ghiasi, Yi Tay, YaGuang Li, Chenkai Kuang, Yuan Liu, Hanzhao Lin, Evan Zheran Liu, Nigamaa Nayakanti, Xiaomeng Yang, Heng-Tze Cheng, Demis Hassabis, Koray Kavukcuoglu, Quoc V. Le, Thang Luong, Project leads, Mathematicians. Work conducted under Google DeepMind
fengtony@google.com, thangluong@google.com. External affiliations: UC Berkeley (Tony Feng), Brown University (Junehyuk Jung), Yonsei University (Joonkyung Lee), Concordia University (Carlo Pagano), Korea Institute for Advanced Study (Sang-hyun Kim), UC San Diego (Federico Pasqualotto), Caltech (Sergei Gukov).
## Abstract
Recent advances in foundational models have yielded reasoning systems capable of achieving a gold-medal standard at the International Mathematical Olympiad. The transition from competition-level problem solving to professional research, however, requires navigating vast literature and constructing long-horizon proofs. In this work, we introduce Aletheia, a math research agent that iteratively generates, verifies, and revises solutions end-to-end in natural language. Specifically, Aletheia is powered by three main sources: (i) an advanced version of Gemini Deep Think for challenging reasoning problems, (ii) a novel inference-time scaling law that extends beyond Olympiad-level problems, and (iii) intensive tool use to navigate the complexities of mathematical research. We demonstrate the capability of Aletheia from Olympiad problems to PhD-level exercises and, most notably, through several distinct milestones in AI-assisted mathematics research: (a) a research paper [Feng2026] generated by AI without any human intervention, calculating certain structure constants in arithmetic geometry called eigenweights; (b) a research paper [LeeSeo2026] demonstrating human-AI collaboration in proving bounds on systems of interacting particles called independent sets; and (c) an extensive semi-autonomous evaluation erdos of 700 open problems on Bloom's Erdős Conjectures database, including autonomous solutions to four open questions. In order to help the public better understand the developments pertaining to AI and mathematics, we suggest quantifying standard levels of autonomy and novelty of AI-assisted results, as well as propose a novel concept of human-AI interaction cards for transparency. We conclude with reflections on human-AI collaboration in mathematics and share all prompts as well as model outputs at https://github.com/google-deepmind/superhuman/tree/main/aletheia.
## 1 Introduction
Recent years have witnessed rapid progress in the reasoning capabilities of natural language-based AI in competition mathematics luong2025robustmathematicalreasoning. Notably, AI models have already achieved gold-medal performance in the 2025 International Mathematical Olympiad (IMO) imo-gold, widely regarded as the world's premier mathematics competition. This milestone motivates a fundamental question regarding AI-driven scientific discovery: can AI autonomously discover and prove new mathematical theorems? Transitioning from contest to research mathematics presents significant challenges. Unlike self-contained competition problems, research questions require synthesizing advanced techniques from an extensive body of literature. This poses a major hurdle for foundation or large language models, which often hallucinate and exhibit only a superficial understanding of specialized topics, limitations likely stemming from a scarcity of relevant training data.
To explore this challenge, we developed Aletheia, a math research agent that can iteratively generate, verify, and revise solutions end-to-end in natural language. Aletheia is powered by three main sources: (i) an advanced version of Gemini Deep Think deepmind:gemini2p5deepthink for tackling extremely hard reasoning problems, (ii) a novel inference-time scaling law that extends from Olympiad-level problems to PhD-level exercises, and (iii) intensive tool use such as Google Search and web browsing to navigate the complexities of mathematical research. On Olympiad-level problems, Aletheia, without internet access, has achieved a new state-of-the-art performance of 95.1% accuracy on the advanced version of IMO-ProofBench luong2025robustmathematicalreasoning, as graded by human experts. This significantly outperforms the previous best result of 65.7%, achieved by Gemini Deep Think (IMO Gold) in July 2025; see the full leaderboard at https://imobench.github.io/. Similarly, on an internal benchmark (FutureMath Basic) of PhD-level exercises, Aletheia also demonstrated state-of-the-art performance. Notably, this paper presents and reflects on the initial wave of mathematical research papers achieved by Aletheia in collaboration with mathematicians (see our project page https://github.com/google-deepmind/superhuman/tree/main/aletheia for detailed prompts and model outputs):
1. Reliable autonomous research. A research paper [Feng2026] generated by AI without any human intervention, calculating certain structure constants in arithmetic geometry called eigenweights.
1. AI-guided collaboration. A research paper [LeeSeo2026] demonstrating human-AI collaboration in proving bounds on systems of interacting particles called independent sets.
1. An extensive semi-autonomous evaluation on the Erdős problems. A report erdos on a systematic deployment of our agent on the 700 open problems on Bloom's Erdős Conjectures database (https://www.erdosproblems.com/), including four open questions of Erdős that it resolves, verified semi-autonomously using both AI grading and human experts. Beyond solving Erdős-1051 completely, our model drove a generalization which resulted in a research paper [BKKKZ].
1. In addition, the agent contributed intermediate propositions on two further research papers, [FYZ4] and [ACGKMP], improving prior proofs by the human authors.
Given the current pace of improvement, it seems inevitable that AI will play an increasing role in mathematics research. However, for the vast majority of mathematics research results, only a few experts are equipped to properly evaluate their novelty and significance. This evaluation gap has enabled misinformation about AI-generated mathematics to spread unchecked in popular media. With a view towards transparent communication of future developments to the public, we propose codifying representations of AI-generated mathematics according to a standard taxonomy of “Autonomous Mathematics Research Levels”, analogous to the SAE Levels of Vehicle Autonomy.
| | Primarily Human (secondary AI input) | Human-AI Collaboration | Essentially Autonomous |
| --- | --- | --- | --- |
| Level 0: Negligible Novelty | | | Erdős-652, 654, 1040 erdos |
| Level 1: Minor Novelty | | | Erdős-1051 erdos |
| Level 2: Publishable Research † | Complexity Bounds [ACGKMP], Arithmetic Volumes [FYZ4] | Generalized Erdős-1051 [BKKKZ], Independence Polynomials [LeeSeo2026] | Eigenweights [Feng2026] |
| Level 3: Major Advance | | | |
| Level 4: Landmark Breakthrough | | | |
Table 1: Classification of all AI-assisted mathematics results encompassed in this work. Detailed descriptions of the categories, with discussion and examples, can be found in § 5.1. † Works listed as Level 2 in this table have been submitted for publication.
While the ultimate specifics of such a taxonomy should be left to the mathematical community, we suggest features that would be informative, such as separate axes for representing the mathematical significance and the degree of AI contribution. For concreteness, we sketch one possible taxonomy in § 5.1, and use it to organize our own AI-assisted mathematics results, as displayed in Table 1. In particular, this contextualizes the “Open” (according to ErdosProblems.com) Erdős problems that we solved, most of which turned out, despite being open for several decades, to be quite elementary. In fact, one such “Open” problem (Erdős-397, posed in 1980) was eventually discovered to be nearly identical to a problem on the 2012 Team Selection Test for the Chinese IMO team AoPS12, a (challenging) exam for high school students; we exclude it from our classification since we consider the solution to be already in the literature. It also clarifies that our autonomous results, while milestones for artificial intelligence, are not claimed to be “major advances” for mathematics (as defined in § 5.1).
## 2 Aletheia agent: From Olympiads to Research-level Mathematics
This section outlines the methodological framework utilized in the development of agents capable of addressing research-grade mathematics. There are many challenges involved in transferring capabilities from contest problem-solving to mathematics research. While IMO problems require ingenuity, their solutions usually span only a few pages and rely only on standard theorems from the high school curriculum. By contrast, research-level mathematics draws on advanced techniques from vast literature, with papers often spanning dozens of pages. Human mathematicians, even IMO medalists, usually take many years of postgraduate study to reach the frontier of mathematical research. While foundation models possess a large knowledge base from pretraining, their understanding of advanced subjects remains superficial due to data scarcity, and they are also prone to hallucinations.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Aletheia Problem-Solving Flowchart
### Overview
The image is a flowchart diagram illustrating a cyclical, iterative problem-solving process named "Aletheia." The system is explicitly labeled as "Powered by Gemini Deep Think." The diagram uses a classical, scroll-and-pillar aesthetic with a parchment-like background and a Greek key border. It depicts a workflow that generates, verifies, and refines solutions to a problem.
### Components/Axes
The diagram consists of six primary components connected by directional arrows. Each component is represented by a stylized icon with a text label.
**Components (in left-to-right flow order):**
1. **Problem**: Represented by a rolled parchment scroll on the far left.
2. **Generator**: Represented by a classical pillar structure.
3. **Candidate Solution**: Represented by a rolled parchment scroll.
4. **Verifier**: Represented by a circular medallion containing the scales of justice and a quill pen. This is the central decision node.
5. **Final Output**: Represented by a rolled parchment scroll on the far right, glowing with a yellow aura.
6. **Reviser**: Represented by a classical pillar structure located below the main flow.
**Connectors and Conditional Paths:**
* **Problem → Generator**: A straight, right-pointing arrow.
* **Generator → Candidate Solution**: A straight, right-pointing arrow.
* **Candidate Solution → Verifier**: A straight, right-pointing arrow.
* **Verifier → Final Output**: A right-pointing arrow labeled **"Correct"**. This path is accompanied by an icon of an olive branch.
* **Verifier → Reviser**: A downward-curving arrow labeled **"Minor fixes needed"**.
* **Reviser → Candidate Solution**: An upward-pointing arrow.
* **Verifier → Generator**: A large, red, upward-curving arrow that loops back to the Generator. This path is labeled **"Critically flawed"** and is accompanied by an icon of a broken column.
### Detailed Analysis
The process flow is as follows:
1. A **Problem** is input into the system.
2. The **Generator** creates an initial **Candidate Solution**.
3. This solution is passed to the **Verifier** for evaluation.
4. The Verifier assesses the solution and routes it based on one of three outcomes:
* **Outcome 1 (Success):** If the solution is deemed **"Correct"**, it proceeds to become the **Final Output**.
* **Outcome 2 (Minor Revision):** If the solution requires **"Minor fixes needed"**, it is sent to the **Reviser**. The Reviser modifies the solution and sends it back to the Candidate Solution stage for re-verification.
* **Outcome 3 (Major Failure):** If the solution is **"Critically flawed"**, the process loops all the way back to the **Generator** to create a new candidate solution from scratch.
### Key Observations
* **Cyclical and Iterative:** The core of the process is a feedback loop. Solutions are not accepted on the first pass unless perfect; they are refined or regenerated based on verification.
* **Tiered Error Handling:** The system distinguishes between minor errors (handled by the Reviser) and fundamental flaws (requiring a complete restart at the Generator). This suggests an efficiency optimization.
* **Central Authority of the Verifier:** The Verifier, symbolized by the scales of justice, is the sole decision-making component that determines the path of the workflow.
* **Visual Metaphors:** The classical imagery (scrolls, pillars, scales, olive branch, broken column) reinforces themes of truth (Aletheia is Greek for "truth"), wisdom, and structured process.
### Interpretation
This diagram represents a sophisticated, self-correcting AI or algorithmic pipeline for generating solutions. The "Powered by Gemini Deep Think" label indicates it is a conceptual model for how a large language model or reasoning engine might structure its problem-solving to ensure accuracy and reliability.
The process embodies a **Peircean investigative cycle** of hypothesis (Generator), testing (Verifier), and refinement (Reviser). The "Critically flawed" loop back to the Generator is particularly significant; it shows the system is designed to recognize when a line of reasoning is fundamentally unsalvageable and must be abandoned, preventing wasted effort on minor fixes. This mimics human expert reasoning, where one might discard a flawed approach entirely rather than patch it.
The **olive branch** symbol for "Correct" is a nuanced touch, suggesting that a correct solution brings peace or resolution to the problem. The **broken column** for "Critically flawed" implies structural failure. The entire system is geared towards moving from the uncertainty of a "Problem" to the illuminated clarity of a "Final Output," with the Verifier acting as the gatekeeper of truth.
</details>
Figure 1: Overview of Aletheia, a math research agent powered by Deep Think that can iteratively generate, verify, and revise solutions for research-level math problems.
To address these issues, we built a math research agent, internally codenamed Aletheia, on top of Gemini Deep Think. Aletheia includes three subagents, a (solution) Generator, a Verifier, and a Reviser, which interact continuously until the Verifier approves a solution or the attempts reach a preset (hyperparameter) limit. The subagent orchestration is sketched in Figure 1. In turn, each of the three subagents involves internal orchestration of calls to a Gemini base model.
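The generate-verify-revise loop described above can be sketched in Python. All names, the three-way verdict set, and the retry budget below are our own illustrative assumptions, not the actual implementation; the real subagents are themselves orchestrations of calls to a Gemini base model.

```python
# Minimal sketch of the Generator/Verifier/Reviser orchestration (assumed
# interface, for illustration only). The Verifier routes each candidate to
# one of three outcomes, mirroring Figure 1.

CORRECT, MINOR, FLAWED = "correct", "minor fixes needed", "critically flawed"

def solve(problem, generate, verify, revise, max_attempts=8):
    """Loop until the Verifier approves a candidate or the budget is spent."""
    candidate = generate(problem)
    for _ in range(max_attempts):
        verdict, feedback = verify(problem, candidate)
        if verdict == CORRECT:
            return candidate              # Verifier-approved solution
        if verdict == MINOR:
            # Minor fixes: the Reviser patches the candidate for re-verification.
            candidate = revise(problem, candidate, feedback)
        else:
            # Critically flawed: discard and regenerate from scratch.
            candidate = generate(problem)
    return None  # admit failure rather than return an unverified solution
```

Returning `None` on budget exhaustion reflects a property the paper emphasizes: the agent can admit failure instead of emitting an unverified answer.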
In contrast to AlphaGeometry Trinh2024-alphageometry-1, chervonyi2025goldmedalistperformancesolvingolympiad-alphageometry2 and AlphaProof Hubert2025, which use formal language, Aletheia operates end-to-end in natural language. At the same time, designs similar to Aletheia have been demonstrated elsewhere. For example, Huang-Yang huang2025winninggoldimo2025 showed that a manually crafted solver-verifier harness could boost GPT-5, Gemini 2.5 Pro, or Grok 4 to gold-medal performance on the 2025 IMO. Another example is the FullProof system built by Salafatinos bryan2026motivic, another mathematics-specific research agent incorporating a separate informal verifier mechanism. (The authors were not aware of this work until after the results of this paper were collected.)
### 2.1 Scaling Laws and the Evolution of Deep Think
<details>
<summary>AletheiaProofBench.png Details</summary>

### Visual Description
## Line Chart: Model Performance vs. Inference-Time Compute
### Overview
This is a line chart comparing the performance (Score in %) of two versions of an AI model called "Gemini Deep Think" against a third model, "Aletheia," as a function of increasing inference-time compute. The x-axis uses a logarithmic scale (base 2). The chart demonstrates how scaling compute resources impacts model accuracy on a given task.
### Components/Axes
* **Chart Type:** Line chart with markers.
* **X-Axis:**
* **Label:** "Inference-Time Compute (Log Scale)"
* **Scale:** Logarithmic, base 2.
* **Markers/Ticks:** 2⁰, 2¹, 2², 2³, 2⁴, 2⁵, 2⁶, 2⁷, 2⁸, 2⁹, 2¹⁰, 2¹¹.
* **Y-Axis:**
* **Label:** "Score (%)"
* **Scale:** Linear, from 30 to 90+.
* **Major Gridlines:** At 30, 40, 50, 60, 70, 80, 90.
* **Legend (Position: Bottom-right corner):**
1. **Blue line with circle markers:** "Gemini Deep Think (advanced version, Jan 2026)"
2. **Orange line with circle markers:** "Gemini Deep Think (IMO Gold, Jul 2025)"
3. **Red star marker:** "Aletheia"
* **Other Elements:** A light gray grid is present in the background.
### Detailed Analysis
**Data Series 1: Gemini Deep Think (advanced version, Jan 2026) - Blue Line**
* **Trend:** Shows a strong, consistent upward trend that begins to plateau after 2⁸. The line slopes upward steeply from 2⁰ to 2⁸, then fluctuates slightly at a high level.
* **Data Points (Approximate):**
* 2⁰: ~40%
* 2³: ~67%
* 2⁴: ~73%
* 2⁵: ~78%
* 2⁶: ~82%
* 2⁷: ~85%
* 2⁸: ~90%
* 2⁹: ~85%
* 2¹⁰: ~90%
* 2¹¹: ~88%
**Data Series 2: Gemini Deep Think (IMO Gold, Jul 2025) - Orange Line**
* **Trend:** Shows an initial slight decline, followed by a strong upward trend that also plateaus. The line dips at 2³ before rising steadily. It consistently scores lower than the advanced version.
* **Data Points (Approximate):**
* 2⁰: ~33%
* 2³: ~29% (Local minimum)
* 2⁴: ~43%
* 2⁶: ~58%
* 2⁷: ~59%
* 2⁸: ~63%
* 2⁹: ~66%
* 2¹⁰: ~68%
* 2¹¹: ~65%
**Data Point 3: Aletheia - Red Star**
* **Position:** Located at x=2⁸, y=~95%.
* **Note:** This is a single data point, not a series. It is positioned significantly above both lines at the same compute level (2⁸).
### Key Observations
1. **Performance Gap:** The "advanced version (Jan 2026)" consistently outperforms the "IMO Gold (Jul 2025)" version across all compute levels, indicating significant model improvement over approximately six months.
2. **Compute Scaling:** Both model versions show a clear positive correlation between inference-time compute and score, demonstrating the benefit of increased computational resources for this task.
3. **Diminishing Returns:** Both lines show signs of plateauing after 2⁸ (256) on the log scale, suggesting that beyond this point, additional compute yields smaller performance gains.
4. **Aletheia's Efficiency:** The Aletheia model achieves the highest observed score (~95%) at a moderate compute level (2⁸), outperforming even the advanced Gemini model at that same point. This suggests superior efficiency or capability for this specific task.
5. **Anomaly in Orange Line:** The dip in the IMO Gold version's performance at 2³ (from ~33% at 2⁰ to ~29%) is an outlier in its otherwise upward trend.
### Interpretation
The chart illustrates a classic scaling law in AI: performance improves with increased inference-time compute, but with diminishing returns. The data tells a story of rapid model iteration and improvement, as seen in the jump from the July 2025 to January 2026 versions of Gemini Deep Think.
The most significant finding is the position of Aletheia. It represents a different point in the performance-compute trade-off space. While the Gemini models follow a predictable scaling curve, Aletheia achieves a state-of-the-art result with less compute than the peak of the advanced Gemini model. This could indicate a more efficient architecture, a different training paradigm, or specialization for the benchmark measured by this "Score." The chart positions Aletheia not just as a high performer, but as a potentially more efficient solution, challenging the notion that more compute is always the primary path to higher scores.
</details>
(a) IMO-ProofBench Advanced (Olympiad Level).
<details>
<summary>AletheiaFutureMath.png Details</summary>

### Visual Description
## Line Chart: Inference-Time Compute vs. Score Percentage
### Overview
The image displays a line chart comparing the performance (Score %) of two AI modelsā"Gemini Deep Think" and "Aletheia"āas a function of increasing inference-time compute. The x-axis uses a logarithmic scale (base 2), while the y-axis is linear. The chart suggests a relationship between computational resources allocated during inference and model performance on a specific task or benchmark.
### Components/Axes
* **Chart Type:** Line chart with a logarithmic x-axis.
* **X-Axis:**
* **Label:** "Inference-Time Compute (Log Scale)"
* **Scale:** Logarithmic, base 2.
* **Markers/Ticks:** 2⁰, 2¹, 2², 2³, 2⁴, 2⁵, 2⁶, 2⁷, 2⁸, 2⁹, 2¹⁰, 2¹¹.
* **Y-Axis:**
* **Label:** "Score (%)"
* **Scale:** Linear.
* **Range:** 0 to approximately 45.
* **Major Ticks:** 0, 10, 20, 30, 40.
* **Legend:** Located in the bottom-right quadrant of the chart area.
* **Entry 1:** A blue line with circular markers, labeled "Gemini Deep Think (advanced version, Jan 2026)".
* **Entry 2:** A red star symbol, labeled "Aletheia".
* **Grid:** A light gray grid is present, aligning with the major ticks on both axes.
### Detailed Analysis
**Data Series: Gemini Deep Think (Blue Line with Circles)**
* **Trend:** The line shows an overall upward trend with significant fluctuations. It rises sharply initially, peaks, dips, and then resumes a strong upward climb at higher compute levels.
* **Data Points (Approximate):**
* At x = 2⁰: y ≈ 0%
* At x = 2³: y ≈ 19%
* At x = 2⁴: y ≈ 30% (Local Peak)
* At x = 2⁵: y ≈ 19% (Dip)
* At x = 2⁶: y ≈ 20.5%
* At x = 2⁷: y ≈ 17.5% (Lowest point after initial rise)
* At x = 2⁸: y ≈ 20.5%
* At x = 2⁹: y ≈ 22%
* At x = 2¹⁰: y ≈ 35%
* At x = 2¹¹: y ≈ 38% (Highest point for this series)
**Data Series: Aletheia (Red Star)**
* **Trend:** This is a single data point, not a continuous line. It represents a performance score at a specific compute level.
* **Data Point (Approximate):**
* At x = 2⁹: y ≈ 46% (Positioned significantly above the Gemini line at the same x-value).
### Key Observations
1. **Non-Linear Scaling:** Performance for Gemini Deep Think does not scale linearly with log-compute. There is a notable peak at 2⁴, followed by a regression, before a strong positive trend resumes after 2⁷.
2. **Performance Disparity at 2⁹:** At the compute level of 2⁹, the Aletheia model (red star, ~46%) dramatically outperforms the Gemini Deep Think model (blue circle, ~22%). The vertical gap is approximately 24 percentage points.
3. **Late-Stage Acceleration:** The Gemini model shows its most significant performance gains in the highest compute brackets (from 2⁹ to 2¹¹), jumping from ~22% to ~38%.
4. **Initial Volatility:** The performance between 2⁴ and 2⁸ is volatile, suggesting a region where increased compute does not reliably translate to better scores for this model version.
### Interpretation
This chart illustrates a comparative analysis of model efficiency and scaling laws. The data suggests that:
* **Model Architecture Matters:** Aletheia achieves a very high score (~46%) at a moderate compute level (2⁹), implying it may have a more efficient architecture or training paradigm for this specific task compared to the Gemini Deep Think version tested.
* **Scaling is Not Guaranteed:** The dip in Gemini's performance between 2⁴ and 2⁷ indicates that simply increasing inference-time compute can sometimes lead to worse outcomes, possibly due to overfitting to a certain compute regime or instability in the model's reasoning process at those scales.
* **High-Compute Potential:** The steep upward trajectory for Gemini from 2⁹ to 2¹¹ shows that substantial performance headroom exists at very high compute levels, though this comes at a significant computational cost.
* **Benchmark Context:** The "Score (%)" likely represents accuracy on a specific benchmark. The chart argues that for this benchmark, Aletheia is currently the more compute-efficient solution at the 2⁹ level, while Gemini Deep Think may have a higher ceiling if given orders of magnitude more compute (2¹¹). The "Jan 2026" label on the Gemini series hints at this being a snapshot in time, with model capabilities evolving rapidly.
</details>
(b) FutureMath Basic (Ph.D. Level Exercises).
Figure 2: The latest advanced version of Deep Think, as of Jan 2026, has significantly outperformed the IMO-Gold version (Jul 2025) on Olympiad-level problems. The inference-time scaling law also transfers to PhD-level exercises. Aletheia makes further leaps in terms of reasoning quality with lower inference-time compute. All results were graded by human experts.
We first discuss some empirical findings about inference time scaling laws for Olympiad-level and PhD-level problems, which underlie the development of Deep Think. To investigate the inference scaling behavior, we evaluated different advanced versions of Gemini Deep Think deepmind:gemini2p5deepthink. Since Deep Think leverages parallel thinking to explore multiple ideas at once, we were able to flexibly adjust the amount of compute spent at inference time. To ensure that results were not cherry-picked, we ran the model exactly once for each problem at each compute scale, with tool use disabled, and the outputs were carefully graded according to the grading guideline.
For the Olympiad level, we leveraged the advanced subset of IMO-ProofBench luong2025robustmathematicalreasoning, which consists of 30 problems similar in difficulty and problem style to those found at the IMO. This was the benchmark used during our IMO 2025 preparation. We observed that the inference-time compute could be increased by orders of magnitude while yielding substantial gains in accuracy, as graded by human experts, before eventually plateauing (Figure 2(a)). Ultimately, an advanced version of Gemini Deep Think achieved a gold-medal standard at the IMO in July 2025, perfectly solving five out of six problems imo-gold.
Building upon the key recipes developed for the IMO-gold model and incorporating a suite of novel technical improvements, we trained a stronger model, Gemini Deep Think (advanced version, Jan 2026), that significantly improved inference efficiency. The compute required to achieve equivalent performance on IMO-ProofBench was reduced by approximately two orders of magnitude (100x). Specifically, utilizing extreme scale and simple prompting, Gemini Deep Think (advanced version, Jan 2026), without internet access, demonstrated the potential to solve exceptionally difficult reasoning tasks such as IMO 2025 Problem 6 (see Appendix B), which our previous IMO-gold model failed to solve. Notably, the latest model also solved IMO 2024 Problem 3 with a minor mistake (achieved at $2^7$ scale) and completely solved Problem 5 (at $2^8$ scale); see Appendix C. We caution that the model's knowledge cutoff falls between IMO 2024 and IMO 2025, implying possible exposure to the IMO 2024 problems. Additionally, as IMO 2025 has been public for six months, we do not claim the problem was strictly unseen. We report these findings to transparently demonstrate the potential of our new Gemini Deep Think model.
To further assess the readiness of our models for research-grade mathematics, we evaluated on FutureMath Basic, an internal benchmark designed to test PhD-level mathematics knowledge. We give an example of a FutureMath benchmark problem in § A. Figure 2(b) demonstrates a similar scaling law for PhD-level problems, though with significantly lower accuracy compared to the competition problems. Corroborating these quantitative results, expert mathematicians who assessed the model noted that its susceptibility to mistakes and hallucinations inhibited its effectiveness on the longer, more complex reasoning required for research. With these findings, we concluded that inference-time scaling alone would not be sufficient, and we shifted our focus towards further improving and adapting our systems for research-level math.
### 2.2 Developing Agentic Harnesses for Research-Level Math
The design of Aletheia is motivated by an empirical observation: decoupling a reasoning modelās final output from its intermediate thinking tokens, and adding well-chosen prompt scaffolding, enables the model to recognize flaws it initially overlooked during generation. One possible explanation for this phenomenon is that the training process incentivizes the model to guess or bluff. Another hypothesis is that an extended thinking trace might act as misleading āsupportingā context, artificially inflating the conditional probability of an erroneous solution.
Regardless of the underlying mechanism, we observed that explicitly separating out the verification step is effective in practice. Figure 2 depicts Aletheia's performance relative to Gemini Deep Think, the agent that achieved the 2025 IMO Gold Medal standard imo-gold. Due to its dynamic nature, Aletheia's total inference compute cannot be precisely controlled, but the results indicate that Aletheia outperforms Deep Think at similar (and indeed, larger) compute scales. On IMO-ProofBench Advanced, Aletheia achieved a 95.1% overall score, surpassing Deep Think across all tested compute scales using the same base model (Figure 2(a)). Moreover, on the 29 out of 30 problems where Aletheia actually returned a solution, its conditional accuracy reached 98.3%. On FutureMath Basic (Figure 2(b)), Aletheia again outperformed Deep Think at all compute scales on the same base model. Notably, it returned solutions for fewer than 60% of these problems, and its conditional accuracy on the answered subset exceeded 82%. Thanks to its verification mechanism, Aletheia often admits failure to solve a problem, a key feature that improved the efficiency of our human-AI research collaborations.
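As a back-of-the-envelope consistency check on how the overall and conditional IMO-ProofBench scores relate (our own arithmetic, assuming each of the 30 problems is weighted equally):

```latex
% 95.1% overall credit spread over all 30 problems, concentrated on the
% 29 problems that actually received an answer:
0.951 \times 30 \approx 28.5
\qquad\Longrightarrow\qquad
\frac{28.5}{29} \approx 0.983 = 98.3\%
```

That is, the single unanswered problem accounts for essentially the entire gap between the 95.1% overall score and the 98.3% conditional accuracy.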
### 2.3 Importance of Tool Use
Beyond its foundational natural language verifier, Aletheia relies heavily on tool use to navigate the complexities of mathematical research. We find that a foundation model operating without such tools is prone to frequent hallucinations, primarily in the form of spurious citations (see Figure 3 for an example) and, to a lesser extent, computational inaccuracies. In contrast to the structured, bounded nature of IMO problems, mathematical research requires synthesizing published literature to address broader, open-ended questions. We observed that simply granting internet access is insufficient; however, when our model was extensively trained for tool use, the integration of Google Search and web browsing led to a substantial reduction in obvious citation hallucinations with fictitious titles and/or authors. This shifted citation errors to subtler hallucinations, for example where the reference exists but its results are not quoted properly (as illustrated in Figure 4).
Perhaps unexpectedly, our results indicate that the integration of Python as a tool yielded only marginal improvements in mitigating computational hallucinations. This suggests that the baseline proficiency of Gemini in these specific tasks is already high; consequently, further performance gains may require more specialized or sophisticated tools beyond standard code execution.
Sample Interaction
Prompt: Prove or disprove: the pretzel knot $P(-3,5,13)$ has infinite order in the smooth concordance group. Model output: … Proof. This is derived from Theorem 3.1 in [C. Livingston and S. Naik, “Ozsváth-Szabó and Rasmussen invariants of some pretzel knots”, Algebraic & Geometric Topology, 13(2) (2013), 1115-1124]. …
Figure 3: A hallucinated paper from the (truncated) output of a model without internet search capability. The red text refers to a completely fabricated paper.
Sample Interaction
Prompt: If $1 = d_1 < \cdots < d_{\tau(n)} = n$ are the divisors of $n$, then let $\tau_{\perp}(n)$ count the number of $i$ for which $(d_i, d_{i+1}) = 1$. Is it true that $\tau_{\perp}(n)/\tau(n) \to \infty$ for almost all $n$? … Model output: A classical result by Galambos (1976) on the distribution of prime factors states that for almost all integers $n$ (on a set of asymptotic density 1), the prime factors satisfy the condition
$$p_j > p_1 p_2 \cdots p_{j-1} \quad \text{for all } j > j_0, \tag{1}$$
where $j_0$ is a fixed integer. … References 1. Galambos, J. (1976). The sequences of prime divisors of integers.
Figure 4: When trained for tool use, the model tends not to fabricate papers, but can still cite results incorrectly. In this example, the referenced paper of Galambos exists, but the claimed "classical result" cannot be found there. Prompt and model output truncated.
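As an aside, the divisor statistic in the prompt of Figure 4 (the count of consecutive divisor pairs of $n$ that are coprime) is easy to compute by brute force. The following is an illustrative sketch, not part of any benchmark tooling:

```python
from math import gcd

def divisors(n):
    """All divisors of n in increasing order (brute force)."""
    return [d for d in range(1, n + 1) if n % d == 0]

def coprime_consecutive_divisors(n):
    """Count indices i with gcd(d_i, d_{i+1}) = 1 among the sorted divisors."""
    d = divisors(n)
    return sum(1 for i in range(len(d) - 1) if gcd(d[i], d[i + 1]) == 1)

# Divisors of 12 are 1, 2, 3, 4, 6, 12; the coprime consecutive pairs
# are (1,2), (2,3), and (3,4).
print(coprime_consecutive_divisors(12), len(divisors(12)))  # 3 6
```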
## 3 Summary of Mathematical Research Results
Below we will briefly summarize the papers with substantial contributions from Aletheia that have already been publicly released. Due to timing, these results were obtained by Aletheia with a base model that was intermediate between the two base models depicted in Figure 2.
To be clear, the final versions of these papers were all written by human authors, starting from Aletheia's outputs. This was due to the principle held by the human authors involved that research papers should be authored exclusively by humans, even when AI contributions reach a level that would traditionally merit co-authorship. The primary reason is that authorship of a mathematics paper implies accountability for all of its contents, and this is a responsibility that only humans can bear. We emphasize that beyond mathematical correctness, authors also assume accountability for important aspects of exposition such as adequacy and accuracy of attributions, which cannot be guaranteed by formal verification.
### 3.1 Milestone A: Reliable Autonomous Research
The paper Eigenweights for Arithmetic Hirzebruch Proportionality [Feng2026] resolves a problem that arose in the work of Feng–Yun–Zhang [FYZ4] investigating an extension of the celebrated Hirzebruch Proportionality Principle [Hir58].
The backdrop. Hirzebruch's original Proportionality Principle expresses the Chern numbers of an automorphic vector bundle on a compact locally symmetric space as a multiple of the corresponding Chern numbers on the compact dual variety, with a proportionality constant that can be interpreted as a value of the $L$-function of the associated Gross motive [Gro97]. This formula was later generalized to non-compact locally symmetric spaces by Mumford [Mum77].
Recent work by Feng–Yun–Zhang [FYZ4] studies a variant of this principle, called "Arithmetic Hirzebruch Proportionality", relating the "arithmetic volume" of Chern classes on moduli spaces of shtukas to differential operators applied to the $L$-function of Gross motives. It can be seen as a generalization of Gaitsgory–Lurie's resolution of Weil's Tamagawa Number Conjecture [GL14]. The precise differential operator appearing in Arithmetic Hirzebruch Proportionality is governed by certain fundamental structure constants called eigenweights. In [FYZ4], the authors calculated some examples of eigenweights, but did not know how to determine all of them in closed form.
The story of the paper. In fact, this project began almost incidentally, while tracking partial progress on a problem that was submitted to an internal benchmark: the computation of a particular family of eigenweights (whose answer was already known to Feng–Yun–Zhang). Initially no model could solve it, but clear progress was visible as inference-time computation was scaled, until the eventual IMO Gold model found the correct answer. Upon examining the AI-generated solution, the authors of [FYZ4] preferred it over their original one, leading them to replace the original proofs in their paper.
This success encouraged us to task Aletheia with calculating the eigenweights in general. Without any human intervention, Aletheia found an elegant way to do this, using techniques from a different field of mathematics (namely, algebraic combinatorics) unfamiliar to the authors of [FYZ4], and in the process resolved several questions left open in [FYZ4]. This is the subject of the paper [Feng2026], which is notable in that all of the mathematical content is fully generated by AI. While our ultimate goal is to empower human mathematicians with AI tools, we note that the ability to generate correct results without human intervention is an important milestone for the reliability and trustworthiness of natural language models.
### 3.2 Milestone B: Lower Bounds for Multivariate Independence Polynomials
The paper Lower bounds for multivariate independence polynomials and their generalisations [LeeSeo2026] establishes new mathematical inequalities that link the world of physics with the logic of discrete mathematics.
The backdrop. In physics, scientists often model how gas molecules occupy a space by imagining them as points on a network. Because these molecules naturally repel one another, no two molecules can occupy "neighboring" spots that are directly connected. Mathematicians call these valid arrangements independent sets, and they are a cornerstone of understanding complex systems in both physics and mathematics.
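To make the notion concrete: an independent set selects vertices of a graph so that no chosen pair is joined by an edge, and the (multivariate) independence polynomial is a weighted sum over exactly these sets. A minimal brute-force enumeration, purely illustrative and unrelated to the proofs in [LeeSeo2026]:

```python
from itertools import combinations

def independent_sets(vertices, edges):
    """Enumerate all independent sets of a graph, including the empty set."""
    result = []
    for k in range(len(vertices) + 1):
        for subset in combinations(vertices, k):
            chosen = set(subset)
            # Keep the subset only if no edge has both endpoints chosen.
            if not any(u in chosen and v in chosen for u, v in edges):
                result.append(subset)
    return result

# A 4-cycle: four sites on a square, where neighboring sites repel.
V = [0, 1, 2, 3]
E = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(len(independent_sets(V, E)))  # 7: {}, four singletons, {0,2}, {1,3}
```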
The story of the paper. Joonkyung Lee and Jaehyeon Seo first used Gemini 2.5 Deep Think to prove the key inequality needed to obtain a further generalization of a well-known result by Sah–Sawhney–Stoner–Zhao [sah2019number] on the lower bound for the number of (weighted) independent sets. Other models were also able to solve the problem: Gemini Deep Think, Aletheia, and, with some back and forth, GPT-5 Pro. Encouraged by this success, they tackled an even harder problem: modeling systems where two different types of molecules interact at the same time, such that different types do not repel each other. For this deeper challenge, Aletheia provided a high-level roadmap of insightful ideas that the authors then turned into a complete, rigorous proof. (Other models were tasked with the same prompt, but Aletheia's output was the only one that the authors found useful.)
The interesting feature here is that the workflow was somewhat reversed from the usual narrative of human-AI collaboration. Typical human-AI workflows involve decomposing a project into granular technical queries for the model to resolve. But in this project, Aletheia actually gave the "bigger picture" strategy for a deep result, such as suggesting the use of specific "dual sets", leaving the human authors to fill in the rigorous execution. In some parts, the authors retained only the statements provided by the AI and produced the proofs independently, as if they were inspired by the model's vision.
### 3.3 Milestone C: The Erdős Problems
Erdős was a prolific Hungarian mathematician who left a vast legacy of papers and unsolved conjectures. In 2023, Thomas Bloom launched ErdosProblems.com, a centralized repository designed to catalog these conjectures and track their resolutions. At the time of this writing, the database tracks 1,179 problems, with 483 (41%) classified as solved. We stress, however, that the "Open" status of a problem in this database does not always reflect the true state of the literature.
Recently, AI has managed to solve a few problems marked "Open" in Bloom's database. While some of these solutions were later discovered to already exist in the literature, others appear to be genuinely novel. To systematically document these advancements, Terence Tao [tao2026erdosai] established a community wiki dedicated to tracking AI-assisted progress on Erdős's conjectures.
Because only success cases tend to be reported in public forums, these results do not provide a complete picture of AI capability. In order to understand this better, we initiated a case study on the Erdős problems in early December 2025. From December 2–9, we deployed Aletheia against the 700 Erdős problems then marked as "Open" in Bloom's database. Aletheia's informal verifier mechanism streamlined the initial pool of candidates: of the 700 original prompts, the model returned 212 responses as potentially correct. These candidates were then evaluated by a team of mathematicians. Most were not experts in the relevant problem domain, so we prioritized narrowing the pool of candidate solutions quickly (possibly at the cost of making noisier judgments) to a manageable size for our smaller core of domain experts. This step was essentially completed by December 21. Then our internal domain experts vetted the solutions carefully, consulting external experts when correctness was ascertained but novelty was unclear. In some cases, minor inaccuracies in the solutions were corrected.
Our ultimate findings were that 63 solutions were technically correct, but only 13 solutions correctly addressed the intended problem statement (either by invoking the literature, or by a novel argument). The remaining 50 of Aletheia's correct solutions were technically valid but mathematically vacuous, because the problem statements were interpreted in a way that did not capture Erdős's intent, often (but not always) leading to trivial solutions. Finally, 12 of the responses were marked ambiguous, for example due to open-endedness of the question itself. In summary, out of the 200 solution candidates that we were definitively able to mark correct or incorrect, 137 (68.5%) of the responses were fundamentally flawed, while 63 (31.5%) were technically correct, of which only 13 (6.5%) were meaningfully correct. These 13 meaningfully correct results clustered naturally into four categories which we felt should be distinguished; see Table 2 for a summary.

Autonomous Resolution. On these problems, Aletheia found the first correct solution (as far as we can tell) in a mathematically substantive way. These include Erdős-652 and Erdős-1051.

Partial AI Solution. On these problems, there were multiple questions and Aletheia found the first correct solution to one of the questions. These include Erdős-654 and Erdős-1040.

Independent Rediscovery. On these problems, Aletheia found a correct solution, but human auditors subsequently found an independent solution already in the literature. These include Erdős-397, Erdős-659, Erdős-935, and Erdős-1089. The solutions appear to have been independently rediscovered by our model: we scanned the logs of Aletheia's reasoning trace to ensure that the solution was not pulled directly from the literature solution. It is of course possible that the solution was indirectly ingested from the literature solution, either implicitly through intermediate sources or during pretraining. This highlights a new danger that accompanies AI-generated mathematics: it is susceptible to "subconscious plagiarism" by reproducing knowledge acquired during pretraining, without attribution.

Literature Identification. On these problems, Aletheia found that a solution was already explicitly in the literature, despite the problem being marked "Open" on Bloom's website at the time of model deployment. These include Erdős-333, Erdős-591, Erdős-705, Erdős-992, and Erdős-1105.
To be clear, we make no claims of novelty for the latter two categories. The "4" autonomous solutions cited above refer to Erdős-652, Erdős-654, Erdős-1040, and Erdős-1051. In the estimation of our experts, none of the four individually rises to the level of a research paper. The solution to Erdős-1051 was generalized further, in a collaborative effort by Aletheia together with human mathematicians and Gemini Deep Think, to produce the research paper [BKKKZ].
| Category | Description | Erdős problems |
| --- | --- | --- |
| Autonomous Resolution | Autonomous novel solution. | 652*, 1051 |
| Partial AI Solution | Solved some part of a multi-part problem. | 654, 1040 |
| Independent Rediscovery | Found a correct solution later discovered to exist in the literature. | 397*, 659*, 935, 1089 |
| Literature Identification | Identified that the problem was already solved in the literature. | 333*, 591, 705, 992, 1105 |
Table 2: Taxonomy of Aletheia results on Erdős problems. *Independently obtained by other parties after our initial evaluations were conducted, but prior to the publication of this work.
Contextualizing the results. A disclaimer is necessary regarding the novelty of these results on Erdős problems. While we made considerable efforts to review the literature, it is certainly possible that we missed earlier solutions to these problems by human mathematicians. Therefore, our initial classification into categories is, at best, an upper bound on novelty. It is subject to revision after further investigation by the public. Indeed, previous AI-assisted work on Erdős problems 1026, 397, 333, and 281 was discovered, after initial announcements of novelty, to be redundant with the literature. (For Erdős-281, we note that the AI solution is distinct from the previously existing literature solution.) To the outside observer, this may present a misleading impression of mathematics research: in the authors' experience, it is very unusual for human-generated results to be redundant in this manner (in the modern era of communication). One reason why it seems to be happening so frequently with AI-generated work on Erdős problems is that the solutions are so simple that they would not necessarily attract attention if they originated from humans. For instance, Erdős-1089 is answered by an offhand remark in a 1981 paper [Bannai81], where the authors seemed unaware that they had resolved an Erdős problem. Our takeaway from this experience is that many open Erdős problems remained unresolved out of obscurity rather than difficulty. We stress that the mathematical significance of such resolutions can only be accurately evaluated by expert mathematicians, even if the correctness can be ascertained by non-mathematicians or formal verifiers. See [erdos] for further discussion.
### 3.4 Bounds for Polynomial Dyadics
The paper Strongly Polynomial Policy Iteration for $L_\infty$ Robust MDPs by Asadi–Chatterjee–Goharshady–Karrabi–Montaseri–Pagano [ACGKMP] proves complexity bounds for certain algorithms arising in Machine Learning and Game Theory called "Robust Markov Decision Processes (MDPs)". For this paper, Aletheia provided an improvement to a crucial mathematical ingredient.
The backdrop. This work was originally a project of Asadi–Chatterjee–Goharshady–Karrabi–Montaseri, a team of computer scientists at ISTA. They were able to prove the desired strongly-polynomial time bound on Robust MDPs conditionally on a number-theoretic assertion: that specific bounded combinations of numbers lie in polynomially many dyadic intervals. They reached out to number theorists for help, and eventually the problem arrived at Pagano, who, after some effort, proved the desired complexity bound using a relatively advanced result from Number Theory called Siegel's Lemma (a tool that does not immediately appear related to the question). Pagano shared the problem with several colleagues, and some of them eventually found variants of the argument leading to the same complexity bound. The result was sufficient for the intended algorithmic application, but the optimality of the bounds was unclear.
The story of the paper. Pagano also contributed the problem to an internal benchmark, where the IMO Gold version of Deep Think provided a valid solution. However, Aletheia devised an independent argument (also making creative use of Siegel's Lemma) that achieved the best bound among all human and autonomous attempts. In particular, this significantly improved the bound established originally by Pagano and the other mathematicians. Therefore, this argument was adopted for the eventual publication.
## 4 Analysis
### 4.1 Ablation studies
To contextualize Aletheia's performance, we conducted ablation studies using the Gemini Deep Think agent, operating at the IMO Gold scale with the identical underlying base model. We evaluated Deep Think using the same prompts that generated the research results detailed in § 3.
We first evaluate the 13 Erdős problems that Aletheia successfully solved, as listed in § 3.3. According to human expert grading (natural language verification by human experts inherently involves some subjectivity; see Footnote 3 of [erdos]), Gemini Deep Think correctly solved 8 of these 13 problems, while operating at almost exactly twice the average compute per problem as Aletheia. These results are summarized in Table 3.
Table 3: Performance of Gemini Deep Think (IMO scale) on the 13 Erdős problems successfully solved by Aletheia (§ 3.3). Both agents utilized the same base model, with Deep Think consuming roughly 2× the average compute per problem.
Next we compare performance on the specific prompts underlying the research papers documented in § 3. As summarized in Table 4, Deep Think successfully reproduced the correct answer for [FYZ4] but failed across all three prompts for [Feng2026]. For the prompts in [LeeSeo2026], it resolved the first prompt but failed on the second and far more crucial prompt. For [BKKKZ], the interaction was too messy to replicate exactly, but Deep Think essentially succeeded. Lastly, for [ACGKMP], Deep Think derived a satisfactory upper bound, although it was not as sharp as Aletheia's. On this set of prompts, the total compute used by the two agents turned out to be comparable. The results are summarized in Table 4.
| [FYZ4] | [Feng2026] | [LeeSeo2026] | [BKKKZ] | [ACGKMP] |
| --- | --- | --- | --- | --- |
| ✓ | ✗ | partial | ✓ | partial |
Table 4: Success of Gemini Deep Think (IMO scale) at reproducing the autonomous assistance utilized for our research papers. Both agents utilized the same base model and consumed similar total compute.
### 4.2 Weaknesses of AI
The results of this paper should not be interpreted as suggesting that AI can consistently solve research-level mathematics questions. In fact, our anecdotal experience is the opposite: success cases are rare, and an apt intuition for autonomous capabilities (and limitations) may currently be important for finding such cases. The papers [Feng2026, LeeSeo2026, ACGKMP] grew out of spontaneous positive outcomes in a wider benchmarking effort on research-level problems; for most of these problems, no autonomous progress was made.
The case study [erdos] on the Erdős problems offers a more quantitative picture. Table 5 shows the success rate on the 200 solution candidates that we were able to mark Correct or Incorrect (filtered from 700 open problems). In total, 31.5% of the solutions were technically correct under some interpretation of the question, but only 6.5% were meaningfully correct in addressing what we deemed to be the intended interpretation.
| Category | Count | Percentage |
| --- | --- | --- |
| Fundamentally Flawed | 137 | 68.5% |
| Technically Correct | 63 | 31.5% |
| Meaningfully Correct (subset) | 13 | 6.5% |
| Total Candidates | 200 | 100.0% |
Table 5: Solution accuracy on 200 AI-generated responses, as graded by humans.
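The arithmetic behind Table 5 can be checked directly; note that the 13 meaningfully correct solutions are a subset of the 63 technically correct ones, so only the first two rows sum to 200:

```python
# Counts from the Erdős case study: 200 graded candidates in total.
total = 200
flawed, technically_correct, meaningful_subset = 137, 63, 13

assert flawed + technically_correct == total  # the subset row is not additive

for label, count in [("fundamentally flawed", flawed),
                     ("technically correct", technically_correct),
                     ("meaningfully correct", meaningful_subset)]:
    print(f"{label}: {count} ({100 * count / total:.1f}%)")
```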
Thus, even with its verifier mechanism, the model is still highly prone to errors compared to humans. (We note, however, that the base model used on the Erdős problems is substantially weaker than the January 2026 base model referenced in § 3.) Furthermore, whenever there is room for ambiguity, the model exhibits a tendency to misinterpret the question in the way that is easiest to answer, even when such an interpretation would be obviously unintended to a human expert. This aligns with the well-known tendencies toward "specification gaming" and "reward hacking" in machine learning. Finally, we reiterate that hallucination is still a common failure mode. Even with internet search capability to check references, the model tends to fabricate or misrepresent results from legitimate references in order to assert a solution (see Figure 4 for an example).
We identify several qualitative gaps between current autonomous mathematics research and human-generated research. To date, autonomous results have been relatively brief and elementary in comparison to typical human papers. Furthermore, success cases seem to arise from clever technical manipulations or vast knowledge retrieval, rather than what mathematicians would consider to be genuine creativity, although the latter concept is admittedly subjective.
## 5 Representing AI contributions to mathematics
Mathematics provides a domain that is, at least superficially, analogous to game-playing, making it a highly attractive target for AI researchers. In mathematics there is a near-universal consensus on starting points (mathematical axioms) and atomic moves (valid mathematical deductions), and a rich landscape of objectives (precise theorems and conjectures). However, the extreme difficulty and specialization of frontier mathematics research makes performance evaluation exceptionally challenging. For a given research problem, there may be only a handful of mathematicians in the world equipped to evaluate a potential solution, and such evaluations can take tremendous effort. (For example, in mathematics the peer review process typically exceeds a year, and it is common for research papers to surpass 100 pages in length.) Even given a solution that is known to be correct, perhaps through formal verification, it could be difficult for an outside observer to distinguish a major advance from a triviality.
This evaluation gap has enabled a surge in hyperbolic or misleading messaging regarding AI-generated mathematical content. Instead of using academic publishing venues with vetting and review by professional mathematicians, such claims are often propagated through social media channels. While AI proponents have obvious incentives to exaggerate the mathematical capabilities of AI, a somewhat perverse incentive has also arisen for mathematicians to overstate the contribution of AI to their own work, because doing so leads to a significant increase in publicity and attention. Moreover, misinformation propagates easily in this domain because the audience is ill-equipped to evaluate the claims.
Since we are developing tools for AI-assisted mathematics research, we also felt a responsibility to contribute to the discussion on responsible documentation, evaluation, and communication of AI-generated results. In order to inform our thinking, we held a discussion with several mathematicians to understand the concerns of the mathematical community regarding AI. The mathematicians consulted (beyond those among the authors of this paper) included Jarod Alper†, Kevin Barreto, Sourav Chatterjee, Otis Chodosh, Michael Hutchings, Daniel Litt, Ciprian Manolescu, Mona Merling, Johannes Schmitt, Ravi Vakil, and Shengtong Zhang (†affiliated with Google DeepMind). One of the dominant themes of the discussion was the desire for transparency, both about the role of the AI and about the technical importance of AI-assisted results. Motivated by this, we will define one possible framework for describing autonomous mathematics, and contextualize our own results within it for illustration. This should be considered only a starting point in a broader discussion of responsible scientific communication. If any such standards are to be universally adopted, they must arise organically from the mathematical community.
### 5.1 Autonomous Mathematics Levels
Autonomous driving is classified according to standard "Levels" of autonomy defined by the Society of Automotive Engineers (SAE). A meaningful taxonomy of autonomous mathematics should have at least two dimensions, with one axis for the level of autonomy and another for the level of mathematical significance. (A similar example of such a division is in Epoch AI's FrontierMath open problems, which are rated as "moderately interesting," "solid result," "major advance," or "breakthrough" [epochai:2026:frontiermathopen].) Maintaining a coarse division while preserving essential distinctions would minimize subjective disputes. We divide our own results into three broad levels of autonomy, described in Table 6.
| H | Primarily Human (secondary AI input) | Core mathematical content is human-generated; AI contributions, if any, are minor or auxiliary (e.g., literature search, basic calculations, writing routine proofs which the human author could have supplied). |
| --- | --- | --- |
| C | Human-AI Collaboration | A substantive human-AI collaboration where both parties contribute in an essential way. |
| A | Essentially Autonomous | The core mathematical content is fully AI-generated without any essential human intervention. (Human authors are still ultimately responsible for the final paper content.) |
Table 6: Axis 1: Level of Autonomy. This axis quantifies the relative contribution between the human and AI agents.
#### 5.1.1 Discussion and Examples for Level of Autonomy.
In these terms, we consider the paper [Feng2026] to be of Level A, as the human contribution is only in expository aspects, aside from posing the original questions. The results in [erdos] are also of Level A insofar as the role of humans was only to correct minor inaccuracies. We consider the papers [LeeSeo2026] and [BKKKZ] to be of Level C: both humans and AI contributed essential aspects of the papers. We consider the papers [FYZ4] and [ACGKMP] to be of Level H: while the written papers have non-trivial contributions from AI, these contributions form a minor portion of the paper and the human authors could have supplied those parts themselves.
Even with these broad subdivisions, there can be some ambiguity. A possible situation is that interacting with AI leads human mathematicians to a key idea or reference, but the AI itself does not truly make its own mathematical contribution. In other words, the AI contributes only via "inspiration". Although the research may not have happened without AI assistance, we would consider such instances as Level H.
#### 5.1.2 Discussion and Examples for Level of Significance.
We emphasize at the outset that these Levels fail to capture the depth and diversity of human mathematics research, and are only meant to contextualize autonomous contributions. Given the inherent subjectivity of mathematical significance, and the proven difficulty of comparing the work of (human) mathematicians, we felt that the Levels of Mathematical Significance should be coarse enough that they do not differentiate human-generated research. A possible exception is that there is near-universal consensus on the top 5 journals in mathematics (in alphabetical order: Acta Math., Annals of Math., Inventiones Math., Journal of the AMS, and Publ. Math. IHES; we expect that any proposed list of "top $N$ journals" for a different value of $N$ would be met with significant contention, as would any attempted internal ranking of these five), so it could make sense to anchor one level to them. One would expect to find major advances published in these journals; as far as we are aware, no autonomous result has yet come close to that standard. A final category could be reserved for truly landmark breakthroughs of once-in-a-generation stature. Examples might include the proof of Fermat's Last Theorem [wiles1995modular, taylor1995ring] or Perelman's proof of the Poincaré Conjecture [perelman2002entropy, perelman2003ricci, perelman2003finite]. Though it may seem evident, it is worth emphasizing that the Level of Mathematical Significance can only be evaluated by mathematicians who are, moreover, experts in the particular domain of the problem.
| 0 | Negligible novelty | Suitable for Olympiad contests or Ph.D. student exercises. |
| --- | --- | --- |
| 1 | Minor novelty | A new result, but the techniques and novelty do not meet the standard for a professional research paper. |
| 2 | Publication Grade | Publishable in a research journal in good standing (that is not specifically intended for students or expository articles), following established peer-review practices. |
| 3 | Major Advance | At the highest level of typical human results; published in the top 5 general-interest journals in mathematics. |
| 4 | Landmark Breakthrough | A truly landmark breakthrough result of generational importance and influence. |
Table 7: Axis 2: Mathematical Significance of Result. This axis measures how significant the mathematical result is, regardless of how it was produced.
In these terms, we view our solutions to Erdős-652, 654, 935, 1040 as Level A0, and our solution to Erdős-1051 as Level A1. We view [Feng2026] as Level A2, [BKKKZ] and [LeeSeo2026] as Level C2, and [ACGKMP] and [FYZ4] as Level H2. We reiterate that the variation within Level 2 is deliberately wide, to encompass the vast majority of papers written by professional mathematicians, ranging from very minor to extremely influential works. This appears to be the only way to avoid subjective disputes about human research. Because of the exceptional width of this category, the public should not understand a paper of Level C2 or A2 as indicating that AI tools have reached the level of human mathematicians. For example, we view our own results within Level 2 [ACGKMP, BKKKZ, Feng2026, FYZ4, LeeSeo2026] as being very spread out in significance and difficulty.
### 5.2 Documenting Human-AI Collaboration
Another theme of concern in the discussion with mathematicians was how best to document AI-assisted mathematics, both in the interest of helping the community achieve effective usage of AI tools, and also in regulating the promotion of AI involvement as an attention-getting mechanism. Several examples of AI-assisted mathematics papers have recently appeared in the literature [bryan2026motivic, jang2025point, schmitt2025extremal], each featuring its own method of documenting the AI contributions. In our work, the interactions were clean enough to be neatly captured by a "Human-AI Interaction (HAI) Card", which we display below for the papers [Feng2026, LeeSeo2026, BKKKZ]. This terminology is inspired by the "model card" concept for describing AI models.
Human-AI Interaction Card for [Feng2026]
**Human:** Query the computation of eigenweights (certain structure constants arising in arithmetic geometry) for groups of Type A.

**Aletheia:** Fully correct solution using Atiyah–Bott localization, Schur polynomial manipulation, Frobenius character identities, and the Murnaghan–Nakayama rule.

**Human:** Proceed to query eigenweights for groups of Type C.

**Aletheia:** Fully correct solution, applying the same tools in a variant manner.

**Human:** Proceed to query eigenweights for groups of Type D.

**Aletheia:** Fully correct solution, applying the same tools in a variant manner.
Raw prompts and outputs: https://github.com/google-deepmind/superhuman/tree/main/aletheia
Human-AI Interaction Card for [LeeSeo2026]
**Human:** Query a multivariate inequality on independent sets of graphs (generalization of [sah2019number]).

**Gemini 3 Deep Think:** Fully correct solution.

**Human:** Query challenging extension to semiproper colourings.

**Aletheia:** Proof outline with crucial ideas, including dual sets and their log-convexity, reduction techniques, and several key lemmas.

**Human:** Work out details of the outline, devising new (simpler) proofs.
Raw prompts and outputs: https://github.com/google-deepmind/superhuman/tree/main/aletheia
Human-AI Interaction Card for [BKKKZ]
**Human:** Query Erdős-1051 as stated in [ErdosGraham1980], involving summation in the $a_n a_{n+1}$ case.

**Aletheia:** Correct solution.

**Human:** State and prove a generalization to the $a_n \cdots a_{n+d-1}$ case, and request a "meaningful, interesting generalization".

**Gemini 3 Deep Think:** Proposes a generalization to the $a_n^{w_0} \cdots a_{n+d-1}^{w_{d-1}}$ case; provides non-rigorous heuristics.

**Human:** Further generalize the theorem by changing the hypothesis $\lim = \infty$ to the weaker condition $\limsup = \infty$; complete a rigorous proof; identify examples demonstrating that the results are optimal.
Raw prompts and outputs: https://github.com/google-deepmind/superhuman/tree/main/aletheia
In other instances, it may not be realistic to capture the interactions with the same template. While the specific presentation may be left to an author's discretion, the field would benefit from a common baseline for transparency. For Level C or Level A results, where AI input is deemed essential, a possible baseline would be to expose at least the most important raw prompts and outputs that contain the essential new insights generated by AI. Specific standards are left for the community of mathematicians to decide. We note that, in a related direction, Tao has initiated an online discussion of "best practices for incorporating AI usage" tao2024bestpractices for mathematicians.
## 6 Reflections on the Impact of AI in Mathematics
To date, hype notwithstanding, the impact of artificial intelligence on pure mathematics research has been limited. While our results do solve some problems that seem to have eluded experts, they do not indicate that artificial intelligence has matched, or will match, the capabilities of human mathematicians. Rather, they illustrate how certain comparative advantages of AI models over humans can be useful for certain kinds of problems. This perhaps clarifies the directions where human researchers can expect the most impact from AI in the near future.
A first observation is that AI models exhibit a form of intelligence that diverges significantly from that of human scientists. In any specific subject, frontier models have much shallower knowledge than a domain expert, but they also possess superhuman breadth of knowledge, which could be the key to unlocking certain problems. The simple fact that artificial intelligence differs from human intelligence presents the possibility that it is better suited for solving some types of problems, for example those requiring vast memory, computation, or breadth of knowledge.
Another comparative strength of AI is that it is not constrained by human physical limitations. It is likely that many open questions lie within the reach of existing techniques, but are not resolved because of limited time and attention from the right experts, as demonstrated by our results on the Erdős problems erdos. This reinforces the point that AI is bottlenecked by very different factors compared to humans, which can be an advantage in the right context.
## 7 Related Work
Advances in Mathematical Reasoning
Recent advances in large language models have yielded systems capable of sophisticated mathematical reasoning deepmind:gemini2p5deepthink, openai:2025:gpt5_2, moving beyond standard benchmarks to competition-level problems. Notably, these models have achieved gold-medal standards at the International Mathematical Olympiad (IMO) imo-gold and demonstrated proficiency on difficult benchmarks like IMO-ProofBench luong2025robustmathematicalreasoning. New, more challenging benchmarks continue to emerge as the field shifts to focus on research-level math glazer2024frontiermath, nie2025uqassessinglanguagemodels, tao2026erdosai.
AI-Assisted Research Results
Already, AI has been successfully applied to assist mathematical research. Examples include research on extremal descendant integrals on moduli spaces [schmitt2025extremal], motivic classes of genus zero maps [bryan2026motivic], and convergence rates for Nesterov's method [jang2025point]. Beyond these specific cases, AI has been used in other works across discrete analysis, convex analysis, and combinatorics [ivanisvili2025counterexamplemajorityoptimalitynicd, salim2025acceleratingmathematicalresearchlanguage]. Other studies show AI is useful for solving accessible conjectures or assisting with problems that require mathematical exploration [feldman2025godeltestlargelanguage, georgiev2025mathematicalexplorationdiscoveryscale]. Even more broadly, woodruff2026acceleratingscientificresearchgemini demonstrates a multitude of techniques and case studies for AI-assisted research across several mathematics-adjacent domains, including theoretical computer science, economics, and physics.
Frameworks for Human-AI Collaboration
A variety of human-AI collaborative implementations have emerged, ranging from fully agentic architectures with self-verification and refinement huang2025winninggoldimo2025, bryan2026motivic, to exploration methods such as AlphaEvolve novikov2025alphaevolvecodingagentscientific, wang2025thetaevolvetesttimelearningopen, yuksekgonul2026learningdiscovertesttime, to more collaboration-focused methods such as interactive protocols suggesting how to integrate AI as a research assistant li2025advancingmathematicalresearchhumanai, henkel2025mathematiciansassistantintegratingai. Despite the surge in attention bubeck2025earlyscienceaccelerationexperiments, zheng2025automationautonomysurveylarge, the community is still in the early stages of adapting AI for mathematical research, making it essential to establish transparent standards for how AI-generated mathematics is documented, evaluated, and communicated.
## 8 Conclusion
Ultimately, we believe that AI will become a tool that enhances rather than replaces mathematicians. Currently, natural language models struggle to reason reliably without human intervention to correct mistakes and hallucinations, while formal verification systems are not yet capable of even formulating the questions of interest on most research frontiers. For this reason, we have introduced specialized math reasoning agents, incorporating informal natural language verification, to help mathematicians harness the benefits of AI.
Acknowledgments.
Special thanks to the following experts for mathematical discussions on the work: Jarod Alper, Kevin Barreto, Thomas Bloom, Sourav Chatterjee, Otis Chodosh, Michael Hutchings, Seongbin Jeon, Youngbeom Jin, Aiden Yuchan Jung, Jiwon Kang, Jimin Kim, Vjekoslav Kovač, Daniel Litt, Ciprian Manolescu, Mona Merling, Agustin Moreno, Carl Schildkraut, Johannes Schmitt, Insuk Seo, Jaehyeon Seo, Cheng-Chiang Tsai, Ravi Vakil, Zhiwei Yun, Shengtong Zhang, Wei Zhang, Yufei Zhao. We also would like to thank Adam Brown, Alex Davies, Thomas Hubert, Pushmeet Kohli, Vinay Ramasesh, Eugénie Rives, Benoit Schillings, Divy Thakkar for feedback and support of the work.
## References
## Appendix A Case Study: FutureMath Basic
This is an example of a FutureMath Basic problem that Aletheia solved using the search tool. Gemini Deep Think did not make much progress on this problem.
FM-Grad-011:
Let a positive integer $q\ge 3$ be given, and define $S=\{1,2,\dots,q\}$. For a positive integer $N\ge 2$ and $x=(x_1,\dots,x_N)\in S^N$, define
$$
U_N(x)=\frac{1}{2N}\sum_{1\le i,j\le N}\mathbf{1}(x_i=x_j).
$$
For a real number $β>0$, define
$$
Z_{N,β}=\sum_{x=(x_1,\dots,x_N)\in S^N}e^{β U_N(x)}.
$$
Find the maximum value of $β$ for which the following equality holds:
$$
\lim_{N\to\infty}\frac{1}{N}\log Z_{N,β}=\frac{β}{2q}+\log q.
$$
Short answer: $\frac{2(q-1)}{q-2}\log(q-1)$
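Before presenting the model's solution, we note that the short answer is easy to probe numerically. The sketch below is our own illustrative code (the helper names `h` and `max_gap` are not from the model output): it restricts the free energy to the one-parameter family $p=(x,\tfrac{1-x}{q-1},\dots,\tfrac{1-x}{q-1})$ and checks on a grid that the uniform point stops being the global maximizer near the claimed threshold $β_{\max}=\frac{2(q-1)}{q-2}\log(q-1)$.

```python
import math

def h(x, q, beta):
    # Free energy along p = (x, y, ..., y) with y = (1 - x)/(q - 1):
    # (beta/2) * sum(p_k^2) - sum(p_k * log(p_k)).
    return (beta / 2) * (x**2 + (1 - x)**2 / (q - 1)) \
        - x * math.log(x) - (1 - x) * math.log((1 - x) / (q - 1))

def max_gap(q, beta, steps=20_000):
    # Largest excess of h over its value at the uniform point x = 1/q.
    base = h(1 / q, q, beta)
    return max(h(i / steps, q, beta) for i in range(1, steps)) - base

for q in (3, 5, 10):
    beta_max = 2 * (q - 1) / (q - 2) * math.log(q - 1)
    assert max_gap(q, 0.95 * beta_max) < 1e-6   # uniform point still wins
    assert max_gap(q, 1.05 * beta_max) > 1e-4   # a non-uniform point wins
```

The tolerances are loose grid-search margins, not part of the mathematical claim.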
Aletheia solution:
Large Deviation Analysis of the Phase Transition in the Mean-Field Potts Model
We determine the maximum inverse temperature $β$ for which the free energy density of the $q$ -state mean-field Potts model is determined by the uniform empirical distribution. Using large deviation theory, we rigorously reduce the optimization of the free energy functional to a one-dimensional problem and solve the resulting system of algebraic equations to identify the exact critical temperature of the first-order phase transition.
Introduction and Formulation via Empirical Measures
Let $S=\{1,2,\dots,q\}$ for an integer $q\ge 3$. We consider the partition function
$$
Z_{N,β}=\sum_{x\in S^N}e^{β U_N(x)},
$$
where the energy function is given by $U_N(x)=\frac{1}{2N}\sum_{i,j=1}^N\mathbf{1}(x_i=x_j)$.
For a configuration $x=(x_1,\dots,x_N)\in S^N$, let $N_k(x)=\sum_{i=1}^N\mathbf{1}(x_i=k)$ denote the occupation number of state $k$. The empirical probability measure of the configuration $x$ is the vector $p(x)=(p_1(x),\dots,p_q(x))\in\Delta_q$, where $p_k(x)=N_k(x)/N$ and $\Delta_q=\{p\in\mathbb{R}^q : p_k\ge 0,\ \sum_k p_k=1\}$ is the $(q-1)$-dimensional probability simplex. The energy can be expressed purely in terms of the empirical measure:
$$
U_N(x)=\frac{N}{2}\sum_{k=1}^q p_k(x)^2.
$$
By Sanov's Theorem (see Ellis, 1985), the sequence of empirical measures satisfies a large deviation principle with rate function given by the negative of the Shannon entropy, $H(p)=-\sum_{k=1}^q p_k\log p_k$. Applying Varadhan's Lemma, the asymptotic free energy density is the supremum over the simplex of the difference between the energy and the entropy:
$$
\lim_{N\to\infty}\frac{1}{N}\log Z_{N,β}=\sup_{p\in\Delta_q}F_β(p), \tag{2}
$$
where the free energy functional is
$$
F_β(p)=\frac{β}{2}\sum_{k=1}^q p_k^2-\sum_{k=1}^q p_k\log p_k.
$$
The Uniform Distribution and Symmetry Breaking
At the uniform distribution $p^*=(1/q,1/q,\dots,1/q)$ , the value of the functional is
$$
F_β(p^*)=\frac{β}{2}\sum_{k=1}^q\frac{1}{q^2}-\sum_{k=1}^q\frac{1}{q}\log\frac{1}{q}=\frac{β}{2q}+\log q.
$$
The equality stated in the problem holds if and only if $p^*$ is the global maximizer of $F_β(p)$ on $\Delta_q$.
To identify all possible local maximizers of $F_β$, we optimize $F_β(p)$ subject to $\sum_{k=1}^q p_k=1$. Using a Lagrange multiplier $λ$, the critical points in the interior of $\Delta_q$ satisfy the stationarity condition
$$
β p_k-\log p_k-1=λ \quad \text{for all } k=1,\dots,q.
$$
Define the function $f(z)=β z-\log z$. Since $f''(z)=z^{-2}>0$, $f(z)$ is strictly convex on $(0,1)$. Thus, any horizontal line $f(z)=λ+1$ can intersect the graph of $f(z)$ at most twice. This implies that the components of any critical point $p$ can take at most two distinct values.
Consequently, any critical point must be of the form $p=(x,y,\dots,y)$ up to a permutation of the coordinates, where $x$ appears $m$ times and $y$ appears $q-m$ times. A standard second-order stability analysis of the Hessian of $F_β$ (Costeniuc, Ellis, and Touchette, 2005) reveals that local maxima can only occur when $m=1$ or $m=q$ . Therefore, apart from the uniform distribution $p^*$ , the only candidate for a global maximum is a distribution with one large component and $q-1$ small components: $p=(x,y,y,\dots,y)$ with $x+(q-1)y=1$ .
Eliminating $y=\frac{1-x}{q-1}$, the objective function restricted to this symmetric subspace becomes a function of $x\in(0,1)$:
$$
h(x)=\frac{β}{2}\left(x^2+\frac{(1-x)^2}{q-1}\right)-x\log x-(1-x)\log\left(\frac{1-x}{q-1}\right).
$$
Determination of the Critical Inverse Temperature
For $q\ge 3$, the model undergoes a first-order phase transition (Wu, 1982). As $β$ increases, a secondary local maximum appears at some $x>1/q$ and eventually overtakes the uniform state. The critical inverse temperature $β_{\max}$ is defined as the point at which this secondary peak matches the height of the uniform state. This requires finding $x_c\in(1/q,1)$ and $β_{\max}$ such that:
1. Stationarity: $h'(x_c)=0$
1. Equal heights: $h(x_c)=h(1/q)$
The derivative of $h(x)$ is
$$
h'(x)=β\left(\frac{qx-1}{q-1}\right)-\log\left(\frac{x(q-1)}{1-x}\right).
$$
Setting $h'(x_c)=0$ yields
$$
β=\frac{q-1}{qx_c-1}\log\frac{x_c(q-1)}{1-x_c}. \tag{3}
$$
To determine $x_c$, we substitute this expression for $β$ into the equal-height condition $h(x_c)-h(1/q)=0$. Using the algebraic identity $\frac{1}{2}\left(x_c^2+\frac{(1-x_c)^2}{q-1}\right)-\frac{1}{2q}=\frac{(qx_c-1)^2}{2q(q-1)}$, the condition becomes
$$
\frac{1}{2}\left(\frac{qx_c-1}{q}\right)\log\frac{x_c(q-1)}{1-x_c}-x_c\log x_c-(1-x_c)\log\left(\frac{1-x_c}{q-1}\right)-\log q=0.
$$
This system of equations can be solved exactly. One can verify algebraically that $x_c=\frac{q-1}{q}$ is the unique non-trivial solution in the interval $(1/q,1]$ . With $x_c=\frac{q-1}{q}$ , we have $1-x_c=\frac{1}{q}$ , and the argument of the logarithm in (3) becomes $(q-1)^2$ . Substituting $x_c$ into (3) gives:
$$
β_{\max}=\frac{q-1}{q\left(\frac{q-1}{q}\right)-1}\log\left((q-1)^2\right)=\frac{2(q-1)}{q-2}\log(q-1).
$$
To verify the equal-height condition at this critical point, we compute $h(x_c)-h(1/q)$ with $β_{\max}$ and $x_c=\frac{q-1}{q}$:
$$
h(x_c)-h(1/q)=\frac{β_{\max}}{2}\left(\frac{q^2-3q+3}{q(q-1)}-\frac{1}{q}\right)-\frac{q-2}{q}\log(q-1).
$$
Using $\frac{q^2-3q+3}{q(q-1)}-\frac{1}{q}=\frac{(q-2)^2}{q(q-1)}$, this becomes
$$
\frac{1}{2}\left(\frac{2(q-1)}{q-2}\log(q-1)\right)\frac{(q-2)^2}{q(q-1)}-\frac{q-2}{q}\log(q-1)=0,
$$
which holds identically.
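As a floating-point sanity check (our addition, reusing the name `h` for the restricted free energy from the solution above), both conditions can be verified at $x_c=\frac{q-1}{q}$ with $β_{\max}=\frac{2(q-1)}{q-2}\log(q-1)$ for several values of $q$:

```python
import math

def h(x, q, beta):
    # The restricted free energy h(x) from the solution above.
    return (beta / 2) * (x**2 + (1 - x)**2 / (q - 1)) \
        - x * math.log(x) - (1 - x) * math.log((1 - x) / (q - 1))

for q in (3, 4, 7, 45):
    beta = 2 * (q - 1) / (q - 2) * math.log(q - 1)  # claimed beta_max
    xc = (q - 1) / q
    # Equal heights: h(x_c) = h(1/q).
    assert math.isclose(h(xc, q, beta), h(1 / q, q, beta), rel_tol=1e-9)
    # Stationarity: the central-difference estimate of h'(x_c) vanishes.
    eps = 1e-6
    assert abs(h(xc + eps, q, beta) - h(xc - eps, q, beta)) / (2 * eps) < 1e-6
```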
Conclusion
For $β\le β_{\max}$, the uniform distribution $p^*=(1/q,\dots,1/q)$ is the global maximizer of the free energy functional $F_β(p)$, and the limit of the free energy density is $\frac{β}{2q}+\log q$. For $β>β_{\max}$, the system undergoes a first-order phase transition; a non-uniform configuration yields a strictly larger free energy density, violating the equality.
Thus, the maximum value is
$$
β_{\max}=\frac{2(q-1)}{q-2}\log(q-1).
$$
References
1. Costeniuc, M., Ellis, R. S., and Touchette, H. (2005). Complete analysis of phase transitions and ensemble equivalence for the Curie–Weiss–Potts model. Journal of Mathematical Physics, 46(6), 063301.
1. Ellis, R. S. (1985). Entropy, Large Deviations, and Statistical Mechanics. Springer-Verlag. (Theorem II.7.1, Application to the Curie–Weiss model).
1. Wu, F. Y. (1982). The Potts model. Reviews of Modern Physics, 54(1), 235–268.
## Appendix B Case Study: Solving IMO 2025 Problem 6
We consider the problem statement for IMO 2025 P6:
Consider a $2025\times 2025$ grid of unit squares. Matilda wishes to place on the grid some rectangular tiles, possibly of different sizes, such that each side of every tile lies on a grid line and every unit square is covered by at most one tile.
Determine the minimum number of tiles Matilda needs to place so that each row and each column of the grid has exactly one unit square that is not covered by any tile.
We initialized the session by providing Gemini Deep Think (advanced version, Jan 2026) with the raw problem statement only, excluding any context regarding its origin (IMO) or specific formatting constraints. Consistent with standard competition rules, the model was evaluated in a restricted environment without access to the internet or external tools. We ran the model at extreme scale (equivalent to $2^{12}$ in Figure 2a) and obtained the following response:
To find the absolute minimum number of rectangular tiles Matilda needs to place, we can model this problem using combinatorial geometry.
Since each row and each column must contain exactly one uncovered unit square, there are exactly $N=2025$ uncovered squares in total. We can treat these uncovered squares as "holes" acting as obstacles in our $2025\times 2025$ grid. Because no two holes share a row or a column, their coordinates perfectly correspond to an $N\times N$ permutation matrix.
Our goal is to choose the optimal permutation of holes to minimize the number of non-overlapping rectangular tiles required to perfectly cover the remaining $N^2-N$ squares.
1. Theoretical Lower Bound
When partitioning a rectilinear grid with $N$ isolated holes into rectangles, each hole introduces internal reflex (concave) corners into the free space. Every reflex corner must be resolved by the boundaries (cuts) of the rectangular tiles.
By analyzing the maximum number of non-intersecting internal cuts that can connect these concave corners, a known theorem in discrete geometry (recently featured in the European Girls' Mathematical Olympiad) dictates that for any arrangement of holes corresponding to a permutation, the minimum number of rectangular tiles $T$ required to partition the remaining grid is bounded by:
$$
T\ge N+a+b-3
$$
where $a$ and $b$ are the lengths of the longest monotonically increasing and decreasing subsequences of the permutation formed by the holes' coordinates. Geometrically, these subsequences represent the maximum number of holes stepping diagonally down-and-right ($a$) and down-and-left ($b$).
According to the Erdős–Szekeres Theorem (via Dilworth's Theorem), any permutation sequence of length $N$ must contain an increasing subsequence of length $a$ and a decreasing subsequence of length $b$ such that:
$$
a\times b\ge N
$$
For Matilda's grid, $N=2025$, meaning $a\times b\ge 2025$. To minimize the total number of tiles $T$, Matilda must select a hole permutation that minimizes the sum $a+b$. Because $2025$ is a perfect square ($45^2$), by the AM-GM inequality, the sum $a+b$ is absolutely minimized when the lengths are equal: $a=45$ and $b=45$.
Substituting these optimal lengths into our lower bound formula establishes the mathematical minimum possible number of tiles:
$$
T\ge 2025+45+45-3=2112
$$
2. The Optimal Tiling Construction
Matilda can flawlessly achieve this global minimum of exactly 2112 tiles by utilizing a highly symmetric "macro-block" arrangement that limits both $a$ and $b$ strictly to 45.
Let $k=45$. She conceptually divides her $2025\times 2025$ grid into a macroscopic $k\times k$ grid, where each "macro-block" $B_{u,v}$ (for $1\le u,v\le k$) is itself a smaller $k\times k$ grid of unit squares.
Inside each block $B_u,v$ , she places its single uncovered hole at the following relative coordinates:
- Relative Row: $v$
- Relative Column: $k+1-u$
This beautifully shifted cyclical stepping pattern guarantees exactly one hole in every absolute row and absolute column of the entire $2025\times 2025$ grid. Because of this interlaced design, the holes dynamically step away from the inner corners of the blocks, cleanly bounding large, unbroken orthogonal channels of empty space.
The remaining squares can be seamlessly partitioned with zero wasted cuts into exactly 2112 strictly non-overlapping rectangles:
1. Central Tiles: Safely nestled entirely in the unbroken spaces bounded between the diagonal chains of holes, there are exactly $(k-1)^2$ massive, completely empty squares. Each tile perfectly fills a $k\times k$ area. Number of central tiles $=44^2=1936$ (each of size $45\times 45$)
1. Edge Tiles: The remaining boundary space cleanly extends from the outermost holes to the 4 absolute edges of the overall grid, forming a staggered staircase perimeter. This area can be completely wrapped and perfectly flushed out using exactly $k-1$ unbroken rectangular boundary tiles along each of the 4 outer walls. These cascade symmetrically in varying widths and lengths. Number of edge tiles $=4\times 44=176$ (consisting of 4 sets of cascading strip dimensions such as $1\times 45,2\times 45,\dots,44\times 45$)
Summing the completely disjoint tiles used in this provably optimal, zero-waste arrangement gives:
$$
\text{Total Tiles}=1936\ \text{(central squares)}+176\ \text{(edge rectangles)}=2112
$$
Because this perfectly matches the absolute theoretical lower bound, it is mathematically impossible to cover the space efficiently using any fewer tiles.
Final Answer: The minimum number of tiles Matilda needs to place is 2112.
It is important to note that the model successfully identified the correct construction and outlined a valid proof sketch. This represents a significant improvement over our previous IMO Gold model, which failed to make meaningful progress on this problem even with extreme scaling. However, despite this conceptual breakthrough, our unofficial expert assessment determined that the preliminary attempt merited only a partial score of 1–3 points. The solution relied heavily on citing advanced, and in this case likely non-existent, results without proof. While such citation is often encouraged and necessary in research mathematics, the IMO strictly requires solutions to be self-contained and derived using elementary techniques, rendering the unproven citations inadmissible.
To address this, we explicitly prompted the model to adhere to the competition format. We instructed it to treat the task as an IMO problem, thereby enforcing the requirement for elementary, self-contained logic:
Please rewrite the solution below to be suitable for an IMO competition. The current draft relies on advanced theorems that may not be accepted without proof; strictly replace these with standard, elementary Olympiad techniques. Ensure the logic is self-contained and rigorous enough to receive a perfect score using only tools familiar to high school contestants.
Problem: <the problem statement>
Solution: <the candidate model solution>
We ran our model with the new refinement prompt at an inference scale of $2^8$ (corresponding to Figure 2), yielding the following correct solution:
To achieve a perfect score on an IMO-level problem, the solution must be absolutely rigorous, self-contained, and built entirely upon elementary tools. We can bypass advanced discrete geometry black-boxes by modeling the grid as a rectilinear partition, utilizing a corner-counting topological formula, and applying a classic Pigeonhole Principle argument (the Erdős–Szekeres Theorem).
1. The Rectilinear Partition Formula
Let $N=2025$. The grid contains exactly $N$ uncovered $1\times 1$ squares ("holes"), with exactly one hole in each row and column. Our goal is to perfectly tile the remaining free space with the absolute minimum number of rectangular tiles, $T$.
The final arrangement partitions the $NĆ N$ grid into exactly $F=T+N$ smaller rectangles (the $T$ tiles and $N$ holes). The boundaries of these rectangles form a set of maximal straight line segments. Let $H$ be the number of maximal internal horizontal segments, $V$ be the number of maximal internal vertical segments, and $I$ be the number of strictly internal crossings (where a horizontal and vertical segment intersect).
We establish an exact formula for $T$ by counting the $90^\circ$ corners of the rectangles. Every rectangle has exactly 4 corners, yielding $4(T+N)$ corners in total. These corners occur strictly at the intersections of the grid's bounding frame and the internal segments. We classify all such points:
1. Grid Corners: The 4 absolute corners of the $N\times N$ grid each provide exactly 1 rectangle corner. (Total: $4$)
1. Internal Crossings: Each of the $I$ strict crossings of maximal segments provides exactly 4 rectangle corners. (Total: $4I$)
1. T-Junctions (Endpoints): Every maximal internal segment has exactly 2 endpoints. An endpoint must terminate perpendicularly on the outer grid boundary or on another maximal internal segment, forming a T-junction. (Note: An "L-junction" where two endpoints meet without continuing is geometrically impossible. This would leave a $270^\circ$ reflex angle in the space, which cannot be partitioned by rectangles without a segment emanating from the corner.) Furthermore, no two maximal internal segment endpoints can meet at the same T-junction (collinear endpoints would merge into a single maximal segment, and perpendicular endpoints would form an impossible L-junction or an X-crossing). Thus, there are exactly $2H+2V$ distinct endpoints, each forming a unique T-junction. A T-junction provides exactly 2 rectangle corners. (Total: $2(2H+2V)=4H+4V$)
Summing these completely accounts for all rectangle corners:
$$
4(T+N)=4+4I+4H+4V \implies T=H+V+I+1-N
$$
2. Segment Bounds and Chords
For each $y\in\{1,\dots,N-1\}$, the internal horizontal grid line $y$ separates row $y$ and row $y+1$. The hole in row $y$ has its bottom edge on line $y$, occupying the interval $[c_y-1,c_y]$, where $c_y$ is its column. The hole in row $y+1$ has its top edge on line $y$, occupying $[c_{y+1}-1,c_{y+1}]$. Because holes are valid rectangles in the partition, both edges must be covered by maximal horizontal segments on line $y$.
- If both are covered by the same maximal horizontal segment, let $u_y=0$ .
- If they are covered by different maximal horizontal segments, let $u_y=1$ .
In either valid case, line $y$ contains at least $1+u_y$ maximal horizontal segments. Summing over all internal horizontal lines gives $H\ge N-1+\sum_{y=1}^{N-1}u_y$. Symmetrically, defining $w_x\in\{0,1\}$ for the vertical grid lines gives $V\ge N-1+\sum_{x=1}^{N-1}w_x$. Let $X_{x,y}\in\{0,1\}$ indicate if an internal crossing occurs strictly at grid point $(x,y)$, so $I\ge\sum X_{x,y}$. Substituting these bounds into our formula yields:
$$
T\ge N-1+\sum u_y+\sum w_x+\sum X_{x,y}
$$
3. Forcing Penalties via Monotonic Subsequences
The column coordinates of the holes $(c_1,\dots,c_N)$ form a permutation of $\{1,\dots,N\}$. Assign each hole $i$ a pair $(L_i,D_i)$, representing the lengths of the longest increasing and decreasing subsequences ending at $i$. Because all pairs must be distinct, the maximum lengths $a$ and $b$ must satisfy $a\times b\ge N$. Since $N=2025$, the AM-GM inequality guarantees $a+b\ge 2\sqrt{2025}=90$.
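The distinct-pairs argument is easy to illustrate empirically. In the sketch below (our illustrative code, not part of the model's solution; `lis_lds_pairs` is a helper name we introduce), an $O(n^2)$ dynamic program computes $(L_i,D_i)$ for a random permutation and confirms that the pairs are pairwise distinct, which forces $a\times b\ge N$ by pigeonhole:

```python
import random

def lis_lds_pairs(perm):
    # (L_i, D_i): lengths of the longest increasing / decreasing
    # subsequences ending at position i, via O(n^2) dynamic programming.
    n = len(perm)
    L, D = [1] * n, [1] * n
    for i in range(n):
        for j in range(i):
            if perm[j] < perm[i]:
                L[i] = max(L[i], L[j] + 1)
            else:  # perm[j] > perm[i], since entries are distinct
                D[i] = max(D[i], D[j] + 1)
    return list(zip(L, D))

random.seed(0)
N = 500
pairs = lis_lds_pairs(random.sample(range(N), N))
assert len(set(pairs)) == N          # all (L_i, D_i) pairs are distinct
a = max(l for l, _ in pairs)
b = max(d for _, d in pairs)
assert a * b >= N                    # the Erdos-Szekeres bound follows
```

Distinctness holds because for $i<j$ either $L_j>L_i$ or $D_j>D_i$, exactly as in the text.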
Consider the longest increasing subsequence (LIS). Let its row indices be $y_1<y_2<\dots<y_a$. Because the subsequence is maximal, for any adjacent pair $k\in\{1,\dots,a-1\}$, the open bounding box bounded by $x\in(c_{y_k},c_{y_{k+1}})$ and $y\in(y_k,y_{k+1})$ must be entirely empty of holes.
As we iterate through the rows $y\in[y_k,y_{k+1}]$, the hole column $c_y$ cannot fall strictly inside this empty box. Thus, it must jump from $\le c_{y_k}$ to $\ge c_{y_{k+1}}$. There must exist a specific row $y^*_k\in[y_k,y_{k+1}-1]$ where $c_{y^*_k}\le c_{y_k}$ and $c_{y^*_k+1}\ge c_{y_{k+1}}$. Symmetrically, there must exist a specific column $x^*_k\in[c_{y_k},c_{y_{k+1}}-1]$ where the hole row $r_x$ jumps from $\le y_k$ to $\ge y_{k+1}$, meaning $r_{x^*_k}\le y_k$ and $r_{x^*_k+1}\ge y_{k+1}$.
Now observe the grid point $(x^*_k,y^*_k)$ :
- If $u_{y^*_k}=0$, a single horizontal segment covers both hole edges. This segment spans from at least the right edge of the left hole ($c_{y^*_k}\le c_{y_k}\le x^*_k$) to at least the right edge of the right hole ($c_{y^*_k+1}\ge c_{y_{k+1}}\ge x^*_k+1$). Thus, it spans strictly from $\le x^*_k$ to $\ge x^*_k+1$, passing completely through $(x^*_k,y^*_k)$ in its interior.
- If $w_{x^*_k}=0$, identical logic bounds a single vertical segment from $\le y^*_k$ to $\ge y^*_k+1$, passing completely through $(x^*_k,y^*_k)$ in its interior.
- If both are $0$, these continuous segments strictly cross at $(x^*_k,y^*_k)$, forcing $X_{x^*_k,y^*_k}=1$.
Thus, for each of the $a-1$ LIS gaps, we strictly incur a penalty of $u_{y^*_k}+w_{x^*_k}+X_{x^*_k,y^*_k}\ge 1$.
Applying identical logic to the longest decreasing subsequence (LDS) of length $b$ provides $b-1$ gaps. For the $m$-th gap, the coordinates must jump in the opposite direction. This isolates a row $y^{**}_m$ where $c_{y^{**}_m}>c_{y^{**}_m+1}$ and a column $x^{**}_m$ where $r_{x^{**}_m}>r_{x^{**}_m+1}$, similarly incurring a penalty of $u_{y^{**}_m}+w_{x^{**}_m}+X_{x^{**}_m,y^{**}_m}\ge 1$.
Crucially, LIS gaps strictly demand $c_y<c_{y+1}$, whereas LDS gaps strictly demand $c_y>c_{y+1}$. Because these conditions are mutually exclusive, the sets of row indices $\{y^*_k\}$ and $\{y^{**}_m\}$ are strictly disjoint. The column indices are correspondingly disjoint. Consequently, all $a+b-2$ evaluated coordinates are completely unique, guaranteeing zero double-counting.
$$
\sum u_y+\sum w_x+\sum X_{x,y}\ge(a-1)+(b-1)=a+b-2
$$
Substituting this into our main bound establishes the absolute theoretical minimum:
$$
T\ge N-1+a+b-2=N+a+b-3
$$
$$
T\ge 2025+90-3=2112
$$
4. The Optimal Construction
Matilda can perfectly achieve this absolute minimum of exactly 2112 tiles by utilizing a symmetric "macro-block" arrangement that restricts both $a$ and $b$ to exactly $45$.
Let $k=45$. She conceptually divides her $2025\times 2025$ grid into a $k\times k$ array of macro-blocks, each being a $k\times k$ subgrid. For the macro-block at macro-row $u$ and macro-column $v$ (where $1\le u,v\le k$), she places its single uncovered hole at absolute coordinates:
- Row: $(u-1)k+v$
- Column: $(v-1)k+(k+1-u)$
This elegantly shifted cyclical pattern guarantees exactly one hole in every row and column, forming $k$ parallel diagonal chains. Because the holes cyclically step away from the block boundaries, they bound unbroken orthogonal channels of free space, completely eliminating wasted segment cuts.
The remaining free squares seamlessly partition into exactly 2112 non-overlapping rectangles:
1. Central Tiles: Nestled entirely in the unbroken spaces bounded between the diagonal chains of holes are massive, completely empty squares. Each perfectly fills an area of $(k-1)\times(k-1)$ spanning across macro-block boundaries. (Number of central tiles = $(k-1)^2=44^2=1936$)
1. Edge Tiles: The boundary space cleanly extends from the outermost holes to the 4 absolute edges of the grid. This staircase perimeter wraps perfectly using exactly $k-1$ cascading rectangular boundary tiles flush along each of the 4 outer walls. (Number of edge tiles = $4\times 44=176$)
Summing the completely disjoint tiles used in this zero-waste arrangement gives:
$$
\text{Total Tiles}=1936+176=2112
$$
Because this flawlessly matches the rigorously proven mathematical lower bound, it is impossible to cover the space efficiently using fewer tiles.
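The macro-block construction can also be checked computationally. The sketch below is our own code (not part of the model's solution; `lis_length` is a patience-sorting helper we introduce): it builds the hole permutation from the coordinates above and confirms that it is a genuine permutation whose longest increasing and decreasing subsequences both have length exactly $k=45$, matching the lower bound $2025+45+45-3=2112$.

```python
from bisect import bisect_left

def lis_length(seq):
    # Length of the longest strictly increasing subsequence (patience sorting).
    tails = []
    for v in seq:
        i = bisect_left(tails, v)
        if i == len(tails):
            tails.append(v)
        else:
            tails[i] = v
    return len(tails)

k = 45
N = k * k
col_of_row = [0] * (N + 1)  # column of the unique hole in each row (1-indexed)
for u in range(1, k + 1):          # macro-row
    for v in range(1, k + 1):      # macro-column
        col_of_row[(u - 1) * k + v] = (v - 1) * k + (k + 1 - u)

perm = col_of_row[1:]
assert sorted(perm) == list(range(1, N + 1))   # one hole per row and column
a = lis_length(perm)
b = lis_length([-c for c in perm])             # LDS = LIS of the negation
assert (a, b) == (k, k)                        # a = b = 45 as claimed
```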
## Appendix C IMO 2024 Case Studies
In this section, we analyze the performance of Gemini Deep Think (advanced version, Jan 2026) on variants of problems from the 2024 International Mathematical Olympiad. While we acknowledge the potential for data contamination given that the model's knowledge cutoff post-dates IMO 2024, the significant scaling required to successfully solve these problems implies the presence of robust reasoning capabilities rather than mere memorization.
### C.1 Problem 3: Success at $2^7$ Scale
We observe that Gemini Deep Think (advanced version, Jan 2026) solves a variant of IMO 2024 Problem 3 with a minor mistake when inference compute is scaled to $2^7$. We added a comment at the point where the model made a mistake.
The problem statement and the model-generated solution are provided below:
Problem: PB-Advanced-021 (Robustified IMO 2024 P3) from luong2025robustmathematicalreasoning
Infinitely many boys and girls stand in a line alternatingly. For a given positive integer $N$ , they choose positive integers according to the following rules:
(1) The first $N$ people in the line choose any positive integers they want.
(2) For each $m>N$, the $m$-th person chooses the number that is one more than the number of times the number chosen by the $(m-1)$-th person appears among the numbers chosen by the first $m-2$ people.
Let $b_i$ be the number chosen by the $i$ -th boy, and $g_i$ be the number chosen by the $i$ -th girl. Prove that at least one of the following holds:
(i) There exists a sufficiently large integer $M$ such that $\{b_n\}_{n\ge M}$ is a periodic sequence.
(ii) There exists a sufficiently large integer $M$ such that $\{g_n\}_{n\ge M}$ is a periodic sequence.
Model-Generated Solution
To determine the eventual behavior of the numbers chosen by the boys ( $\{b_n\}$ ) and girls ( $\{g_n\}$ ), we analyze the combined sequence of all chosen numbers, which we denote as $A=(a_1,a_2,\dots)$ . The boys and girls correspond to the subsequences at the odd and even indices.
Step 1: Reformulating the Generation Rule
Let $c_m(x)$ denote the number of times the integer $x$ appears in the prefix $a_1,a_2,\dots,a_m$. For any $m>N$, rule (2) states that $a_m$ is $1$ plus the number of appearances of $a_{m-1}$ in $a_1,\dots,a_{m-2}$. Because $a_{m-1}$ appears exactly once at the $(m-1)$-th position, adding $1$ perfectly accounts for this last appearance. Thus, the rule elegantly simplifies to:
$$
a_m=c_{m-1}(a_{m-1}) \quad \text{for } m>N
$$
Every term after the $N$ -th is exactly the current occurrence count of the immediately preceding term.
Step 2: Bounding the Frequencies of Large Numbers
Let $M_0=\max(N,\max_{1\le i\le N}a_i)$. Any integer $y>M_0$ does not appear in the first $N$ steps. Therefore, for $m>N$, $y$ is generated at step $m$ ($a_m=y$) if and only if $c_{m-1}(a_{m-1})=y$. This occurs exactly when some element's count reaches $y$. Because counts strictly increment by $1$, each distinct element reaches a count of $y$ at most once. Thus, the total number of times $y$ appears up to step $m$ is exactly the number of distinct elements $x$ whose count has reached at least $y$ by step $m-1$:
$$
c_m(y)=|\{x\ge 1\mid c_{m-1}(x)\ge y\}| \quad \text{for all } y>M_0
$$
We prove by induction on $m\ge N$ that $c_m(y)\le M_0$ for all $y>M_0$.
- Base case ($m=N$): $c_N(y)=0\le M_0$.
- Inductive step: Assume $c_{m-1}(z)\le M_0$ for all $z>M_0$. If an element $x$ satisfies $c_{m-1}(x)\ge y>M_0$, it must be that $x\le M_0$ (since any $x>M_0$ has $c_{m-1}(x)\le M_0<y$). Thus, only the positive integers $1,\dots,M_0$ can satisfy the condition. Since there are exactly $M_0$ such integers, $c_m(y)\le M_0$.
Consequently, no element $>M_0$ can ever appear more than $M_0$ times globally.
Step 3: Characterizing the Infinitely Occurring Elements
Let $S$ be the set of elements appearing infinitely many times in $A$ . Because elements $>M_0$ appear finitely many times, $S\subseteq\{1,\dots,M_0\}$ , making $S$ finite. If $S$ were empty, the sequence would take infinitely many distinct values. Each time a new element appears, its count becomes $1$ , generating $1$ as the subsequent term. This forces $1$ to appear infinitely often, a contradiction. Thus, $S$ is non-empty. Let $L=|S|\ge 1$ .
Let $M_1=\max_{x\notin S}c_\infty(x)$ . Because elements outside $S$ appear finitely many times, $M_1$ is finite. Authors' comment: The claim is trivial, but the reasoning only works when applied to a finite set. Specifically, $\max_{x\notin S}c_\infty(x)$ equals $\max\left(\max_{x\notin S,\,x\le M_0}c_\infty(x),\ \max_{x>M_0}c_\infty(x)\right)$ . The first term is finite because it is a maximum over a finite set, while the second is $\le M_0$ by Step 2. Let $M_2=\max(M_0,M_1)$ . Any integer $V>M_2$ can only be reached by the counts of elements in $S$ . Because all $L$ elements in $S$ appear infinitely many times, they all eventually reach a count of $V$ . Thus, exactly $L$ elements reach count $V$ , meaning every integer $V>M_2$ appears exactly $L$ times in the sequence. During its $L$ appearances, $V$ 's occurrence count increments through $1,2,\dots,L$ . These counts are generated as the immediately following terms. Since there are infinitely many integers $V>M_2$ , the numbers $1,2,\dots,L$ are generated infinitely many times. Thus, $S=\{1,2,\dots,L\}$ .
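As a quick empirical sanity check (again not part of the proof; the prefix $(2,2,2)$ is our own choice, for which the simulation exhibits $S=\{1,2\}$ and $L=2$), one can run the simplified rule and confirm that every sufficiently large value appears exactly $L$ times:

```python
from collections import Counter

# Simulate a_m = c_{m-1}(a_{m-1}) from the prefix (2, 2, 2), so N = 3.
a = [2, 2, 2]
counts = Counter(a)
for _ in range(5000):
    nxt = counts[a[-1]]        # count of the previous term
    a.append(nxt)
    counts[nxt] += 1

# Here S = {1, 2} and L = 2.  A value V >= 3 is emitted exactly when the
# count of 1 or the count of 2 passes V, so every value that both counts
# have already passed appears exactly L = 2 times.
L = 2
finished = range(3, min(counts[1], counts[2]))
assert all(counts[V] == L for V in finished)
```

Only the counts of the elements of $S$ grow without bound, so each large value is reached once by each of them, matching the "exactly $L$ appearances" claim above.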
Step 4: Strict Alternation Between Small and Large Numbers
We can choose a sufficiently large step $M_3$ such that for all $m>M_3$ :
1. The counts of all elements in $S$ strictly exceed $M_2\ge L$ .
1. No element $x\notin S$ with $x\le M_2$ ever appears again.
For any $m>M_3$ :
- If $a_{m-1}\in S$ : Its count is $>M_2\ge L$ . Thus, $a_m=c_{m-1}(a_{m-1})>L$ , meaning $a_m\notin S$ .
- If $a_{m-1}\notin S$ : Since $m>M_3$ , $a_{m-1}>M_2$ . Its global frequency is exactly $L$ , so its current count is $\le L$ . Thus, $a_m=c_{m-1}(a_{m-1})\le L$ , meaning $a_m\in S$ .
The sequence strictly alternates between elements in $S$ and large elements outside $S$ . Because boys and girls stand alternately, one gender eventually always chooses from $S$ , while the other chooses elements $>M_2$ . Without loss of generality, assume the girls choose from $S$ . Let $s_n\in S$ be the $n$ -th such choice.
Step 5: Eventual Periodicity via Deterministic Transitions
Let $v_n\in\mathbb{N}^L$ be the vector of occurrence counts of $\{1,\dots,L\}$ just after $s_n\in S$ is appended to the sequence. The subsequent term is a large number $V=v_n(s_n)\notin S$ . The term after that is the girl's next choice $s_{n+1}$ , which is the count of $V$ . Because $V>M_2$ , only elements in $S$ have counts $\ge V$ . Thus:
$$
s_{n+1}=|\{x\in S\mid v_n(x)\ge v_n(s_n)\}|
$$
This means $s_{n+1}$ is precisely the rank of the newly incremented value within $v_n$ . Then $s_{n+1}$ is appended, giving $v_{n+1}=v_n+e_{s_{n+1}}$ (where $e_j$ is the $j$ -th standard basis vector), forming a self-contained dynamical system.
We show the differences between the counts in $v_n$ remain strictly bounded. For $1\le k\le L-1$ , let $U_k=\{1,\dots,k\}$ . Let $S_k(n)$ be the sum of the $k$ largest components of $v_n$ , and $E_k(n)$ be the sum of the first $k$ components of $v_n$ . Define the difference $D_k(n)=S_k(n)-E_k(n)\ge 0$ . When we transition to $v_{n+1}=v_n+e_{s_{n+1}}$ :
- $E_k(n+1)-E_k(n)=\mathbf{1}(s_{n+1}\le k)$ .
- $S_k$ increases by $1$ if and only if the newly incremented element becomes one of the top $k$ values in $v_{n+1}$ , which occurs iff its rank is $\le k$ . Since $s_{n+2}$ is its rank, $S_k(n+1)-S_k(n)=\mathbf{1}(s_{n+2}\le k)$ .
This yields the telescoping difference $D_k(n+1)-D_k(n)=\mathbf{1}(s_{n+2}\le k)-\mathbf{1}(s_{n+1}\le k)$ . Thus, $D_k(n)\le D_k(0)+1=:B_k$ , remaining globally bounded.
Let $y_1\ge y_2\ge\dots\ge y_L$ be the sorted values of $v_n$ , and $T_k$ be the set of indices of the top $k$ elements.
- If $T_k\ne U_k$ , then matching elements in $T_k\setminus U_k$ and $U_k\setminus T_k$ gives $D_k(n)\ge y_k-y_{k+1}$ . Thus, the gap $y_k-y_{k+1}\le B_k$ .
- If $T_k=U_k$ and $y_k-y_{k+1}>0$ , we cannot have $s_{n+1}\in U_k$ . If we did, the incremented element would widen the gap over outside elements, keeping $T_k=U_k$ and yielding $s_{n+2}\in U_k$ permanently. This would trap $\{s_m\}$ inside $U_k$ , starving the elements $>k$ and contradicting that all $L$ elements appear infinitely often. Thus, $s_{n+1}\notin U_k$ , incrementing an element outside $U_k$ , which prevents $y_k-y_{k+1}$ from increasing.
Therefore, the gap $y_k-y_{k+1}$ is permanently bounded by $\max(B_k,1)$ for all $k$ . Because all gaps between sorted counts are bounded, $\max(v_n)-\min(v_n)$ is globally bounded.
The system's state space, defined by $X_n=(s_{n+1},\,v_n-\min(v_n)\mathbf{1})$ , is therefore finite. Since transitions are deterministic, the sequence of states $X_n$ , and thus the sequence of choices $\{s_n\}$ , eventually enters a periodic cycle.
Since $\{s_n\}$ corresponds to either $\{g_n\}_{n\ge M}$ or $\{b_n\}_{n\ge M}$ for sufficiently large $M$ , at least one of the two sequences is eventually periodic.
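The deterministic transition of Step 5 can be written down directly. The sketch below is illustrative only: it implements one step of the $(s_n,v_n)$ system and detects the cycle of the normalized state; the starting state $(1,[5,6])$ is a sample state with $L=2$ taken from a simulated run, not something derived in the proof.

```python
def step(s, v):
    """One transition of the self-contained system: v lists the counts
    of 1..L just after the small term s was appended (so v[s-1] already
    includes that occurrence)."""
    V = v[s - 1]                              # the large term generated next
    s_next = sum(1 for c in v if c >= V)      # next small term = rank of V
    v_next = list(v)
    v_next[s_next - 1] += 1                   # s_next is appended in turn
    return s_next, v_next

# Detect periodicity of the finite normalized state X_n = (s, v - min(v)).
s, v = 1, [5, 6]                              # sample reachable state, L = 2
seen, period = {}, None
for n in range(1000):
    key = (s, tuple(c - min(v) for c in v))
    if key in seen:
        period = n - seen[key]
        break
    seen[key] = n
    s, v = step(s, v)
```

For this state the normalized state returns to itself after two steps, so the small-term subsequence is eventually periodic with period $2$ (alternating $1,2,1,2,\dots$), matching the conclusion above.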
### C.2 Problem 5: Success at $2^8$ Scale
Solving the variant of IMO 2024 Problem 5 required more compute, achieving success only at the $2^8$ scale.
On Solution Novelty: The model's approach differs significantly from the visual heuristics found in standard online resources. Common solutions, such as those in chen2024imo, rely on constructing specific shapes (e.g., "staircase" patterns or "happy triangles") to define winning paths, but the model does not use these visual constructions. Instead, it treats the grid as a state space, defining dynamic sets of "safe" and "trapped" columns to derive a general "Safe Probing" algorithm. This shift from pattern-based path-finding to abstract state-tracking suggests that the model derived an alternative strategy from first principles, rather than simply recalling shape-based lemmas found online. Most notably, the model's approach bypasses the case splitting based on the first-row stone, a step typical of most public solutions.
The problem statement and the model-generated solution are provided below:
PB-Advanced-023 (Robustified IMO 2024 P5) from luong2025robustmathematicalreasoning. The problem statement below has a minor typo when compared to the problem statement from luong2025robustmathematicalreasoning; namely, "there must be at most one stone in each column" should read "there must be at most one selected cell in each column".
On a table of size $3002\times 3001$ , a stone is placed on the leftmost cell of the first row. James and Peter play a game on this table. Peter selects $3000$ cells, under the rule that he must choose one from each row except the first and last rows (i.e., the $1$ st and $3002$ nd rows), and there must be at most one stone in each column. James knows this rule too, but he doesn't know where Peter placed the stones. The goal of James is to move the stone to the last row, avoiding the cells selected by Peter. The stone can only move to adjacent cells on the table. If the stone enters a cell selected by Peter, James receives a penalty of 1 point, and the stone returns to its initial position (i.e., the leftmost cell). Find the smallest positive integer $n$ such that there exists a method for James to achieve his goal before receiving a penalty of $n$ points.
Model-Generated Solution
To determine the smallest positive integer $n$ such that James can guarantee reaching the last row before receiving $n$ penalty points, we analyze the game as an adversarial contest. James wants to reach row 3002 while minimizing his penalties, whereas Peter chooses the 3000 selected cells (which we will call traps) to maximize James's penalties.
(Note: The problem states "at most one stone in each column," referring to Peter's selected cells. We will use the word "trap" for Peter's cells to distinguish them from the "stone" James moves.)
The grid has 3002 rows and 3001 columns. Peter places exactly 3000 traps following these rules:
1. Row 1 and Row 3002 contain 0 traps.
1. Each intermediate row (rows 2 through 3001) contains exactly 1 trap.
1. Each column contains at most 1 trap. (Since there are 3001 columns and 3000 traps, exactly one column is completely empty, and all others contain exactly one trap).
When James steps on a trap, he receives 1 penalty point, teleports back to $(1,1)$ , and permanently learns the location of that trap.
1. Lower Bound: Peter can force at least 2 penalties ( $n\ge 3$ )
Because James does not know the trap locations initially, Peter can anticipate any deterministic path James takes and place traps to ensure James gets hit at least twice:
- First Penalty: To reach the bottom, James's initial path $P_1$ must eventually cross row 2. Let the very first cell he steps on in row 2 be $(2,c_1)$ . Peter places the row 2 trap at $(2,c_1)$ . James hits it, receives his 1st penalty, and is sent back to $(1,1)$ .
- Second Penalty: Knowing $(2,c_1)$ is a trap, James generates a new path $P_2$ . To cross the board, $P_2$ must eventually reach row 3. Because orthogonal movement prevents James from stepping onto $(3,c_1)$ without first stepping onto $(2,c_1)$ or onto an adjacent cell already in row 3, the first cell $P_2$ visits in row 3 must be some $(3,c_2)$ with $c_2\ne c_1$ . Since $c_2\ne c_1$ , Peter can validly place the row 3 trap at $(3,c_2)$ without violating the "at most one trap per column" rule. James hits it, receiving his 2nd penalty.
Thus, James cannot guarantee a win with 0 or 1 penalty. He must be prepared to take at least 2 penalties, meaning $n>2$ , i.e., $n\ge 3$ .
2. Upper Bound: James can always win with at most 2 penalties ( $n\le 3$ )
James can use a structured row-by-row probing strategy to ensure he never receives a 3rd penalty.
Let $S$ be the set of columns where traps have already been discovered. Let $A$ be the set of "available" columns where no traps have been found. Because each column holds at most one trap, columns in $A$ are guaranteed to have no traps in any of the rows explored so far.
Starting at row $k=2$ , James processes the board row by row:
1. Choosing a column to skip ( $x$ ):
- If James has 0 penalties, he picks any column $x\in A$ .
- If James has 1 penalty (which occurred at some column $c_1\in S$ ), he chooses $x\in A$ to be the column in $A$ that is horizontally closest to $c_1$ . This ensures that all columns strictly between $x$ and $c_1$ belong to $S$ .
1. Safe Probing: For the current row $k$ , James safely probes every other available column $y\in A\setminus\{x\}$ . He does this by walking across the completely safe row 1 to column $y$ , then straight down column $y$ to $(k,y)$ . (This vertical walk is safe because $y\in A$ , meaning column $y$ has no traps above row $k$ .)
1. Deduction (0 penalties): If James probes all $y$ and none are traps, he deduces that the single trap for row $k$ must be at $(k,x)$ . He adds $x$ to $S$ and moves to the next row taking 0 penalties.
1. Taking a Penalty: If $(k,y)$ *is* a trap, James receives a penalty, adds $y$ to $S$ , and the turn ends. Because he avoids stepping on $(k,x)$ and stops probing row $k$ immediately upon finding its trap, he takes at most 1 penalty per row.
The "Drop to Finish" Maneuver: Suppose James receives his 2nd penalty overall at row $k$ , stepping on the trap at $y$ . He teleports to $(1,1)$ . He now has enough structural information to immediately reach the bottom without further probing:
- Step 1: Safely drop down column $x$ . James walks along row 1 to column $x$ , and goes straight down to row $k$ . Why is this safe? $x\in A$ , so it has no traps above row $k$ . Furthermore, the trap for row $k$ is known to be at $y$ . Since $y\ne x$ , the cell $(k,x)$ is safe.
- Step 2: Safely walk horizontally to column $c_1$ . James walks horizontally along row $k$ from $x$ to $c_1$ . Why is this safe? Because $x$ was chosen as the closest column in $A$ to $c_1$ , every column strictly between them belongs to $S$ . Columns in $S$ had their single traps discovered in rows strictly above $k$ . Therefore, their cells in row $k$ are definitively safe. The cell $(k,c_1)$ is also safe for the same reason.
- Step 3: Safely drop down column $c_1$ to the finish. James is now at $(k,c_1)$ . Why is this safe? Column $c_1$ is the column where he took his 1st penalty (at some row $r_1<k$ ). Because each column has at most one trap, the entirety of column $c_1$ strictly below row $r_1$ is completely empty. James walks straight down this empty highway to row 3002 and wins!
(Note: If James finishes probing all rows with $\le 1$ penalty, the single remaining column in $A$ will be completely free of traps, and he can walk straight down it from row 1 to row 3002.)
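The strategy can be stress-tested on small boards. The following sketch is our own illustrative simulation (not the model's output); it abstracts movement away and tracks only the probing logic. Each intermediate row holds one trap in a distinct column, James probes as described, and after a second penalty the asserts check that the "drop to finish" path is indeed trap-free:

```python
import random

def simulate(mid_rows, seed):
    """Run the safe-probing strategy on a board with rows 2..mid_rows+1
    (one trap each) and columns 0..mid_rows (one column trap-free).
    Returns James's penalty count; asserts verify the escape path."""
    rng = random.Random(seed)
    cols = list(range(mid_rows + 1))
    placement = rng.sample(cols, mid_rows)        # trap column of each row
    trap = {r + 2: c for r, c in enumerate(placement)}
    trap_row = {c: r for r, c in trap.items()}    # column -> its trap row

    A, S = set(cols), set()
    penalties, c1 = 0, None
    for k in sorted(trap):                        # process rows top-down
        x = min(A) if penalties == 0 else min(A, key=lambda c: abs(c - c1))
        hit = next((y for y in sorted(A - {x}) if trap[k] == y), None)
        if hit is None:                           # deduction: trap is at (k, x)
            S.add(x); A.remove(x)
            continue
        penalties += 1                            # probe at (k, hit) was a trap
        S.add(hit); A.remove(hit)
        if penalties == 1:
            c1 = hit
        else:                                     # 2nd penalty: drop to finish
            assert trap_row.get(x, k + 1) > k     # column x clear down to row k
            for c in range(min(x, c1) + 1, max(x, c1)):
                assert c in S and trap_row[c] < k # row-k walk from x to c1
            assert trap_row[c1] < k               # (k, c1) itself is safe
            return penalties                      # column c1 is empty below r1
    assert min(A) not in trap_row                 # <=1 penalty: a free column
    return penalties

results = [simulate(8, seed) for seed in range(200)]
```

By construction James never probes after his 2nd penalty, so the simulation's value is in the assertions: they confirm that the columns the maneuver relies on are safe exactly as argued in Steps 1-3 above.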
Conclusion
James's strategy inherently limits his maximum possible penalties to exactly 2. Thus, he achieves his goal before his accumulated penalties reach 3.
Final Answer: The smallest positive integer $n$ such that there exists a method for James to achieve his goal before receiving a penalty of $n$ points is 3.