# Transcendence: Generative Models Can Outperform The Experts That Train Them
**Authors**:
- Edwin Zhang (OpenAI, Humanity Unleashed)
- Vincent Zhu (UC Santa Barbara, Humanity Unleashed)
- Naomi Saphra (Harvard University)
- Anat Kleiman (Harvard University)
- Benjamin L. Edelman (Princeton University)
- Milind Tambe (Harvard University)
- Sham Kakade (Harvard University)
- Eran Malach (Harvard University)
Abstract
Generative models are trained with the simple objective of imitating the conditional probability distribution induced by the data they are trained on. Therefore, when trained on data generated by humans, we may not expect the artificial model to outperform the humans on their original objectives. In this work, we study the phenomenon of transcendence: when a generative model achieves capabilities that surpass the abilities of the experts generating its data. We demonstrate transcendence by training an autoregressive transformer to play chess from game transcripts, and show that the trained model can sometimes achieve better performance than all players in the dataset. To play with our models, code, and data, please see our website at https://transcendence.eddie.win. We theoretically prove that transcendence can be enabled by low-temperature sampling, and rigorously assess this claim experimentally. Finally, we discuss other sources of transcendence, laying the groundwork for future investigation of this phenomenon in a broader setting.
<details>
<summary>2406.11741v4/x1.png Details</summary>

### Visual Description
# Technical Document Extraction: Temperature-Rating Analysis
## Chart Structure Overview
Three line plots are presented, each depicting the relationship between **Temperature (τ)** and **Rating**. All charts share identical axis labels and scales but differ in line color, trend, and annotated maximum ratings.
---
### Chart 1: Green Line (Max Rating: 1000)
**Legend**: Right-aligned, green line labeled *"Max Rating Seen During Training: 1000"*
**Trend**: Gradual downward slope from left to right.
**Key Data Points**:
- τ = 0.001 → Rating = 1400
- τ = 0.01 → Rating = 1350
- τ = 0.1 → Rating = 1500 (peak)
- τ = 0.3 → Rating = 1300
- τ = 0.5 → Rating = 1250
- τ = 1 → Rating = 1000
- τ = 1.5 → Rating = 800
**Annotation**: *"Max Rating Seen During Training: 1000"* (dashed green line at y=1000).
---
### Chart 2: Teal Line (Max Rating: 1300)
**Legend**: Right-aligned, teal line labeled *"Max Rating Seen During Training: 1300"*
**Trend**: Steeper decline compared to Chart 1.
**Key Data Points**:
- τ = 0.001 → Rating = 1500
- τ = 0.01 → Rating = 1600 (peak)
- τ = 0.1 → Rating = 1550
- τ = 0.3 → Rating = 1500
- τ = 0.5 → Rating = 1400
- τ = 1 → Rating = 1200
- τ = 1.5 → Rating = 900
**Annotation**: *"Max Rating Seen During Training: 1300"* (dashed teal line at y=1300).
---
### Chart 3: Blue Line (Max Rating: 1500)
**Legend**: Right-aligned, blue line labeled *"Max Rating Seen During Training: 1500"*
**Trend**: Moderate decline with fluctuations.
**Key Data Points**:
- τ = 0.001 → Rating = 1400
- τ = 0.01 → Rating = 1600 (peak)
- τ = 0.1 → Rating = 1450
- τ = 0.3 → Rating = 1500
- τ = 0.5 → Rating = 1400
- τ = 1 → Rating = 1200
- τ = 1.5 → Rating = 1000
**Annotation**: *"Max Rating Seen During Training: 1500"* (dashed blue line at y=1500).
---
### Common Elements Across All Charts
1. **Axes**:
- **X-axis**: Temperature (τ) ranging from 0.001 to 1.5 (logarithmic scale).
- **Y-axis**: Rating ranging from 600 to 1800.
2. **Shading**: Semi-transparent bands around each line indicate confidence intervals or variability.
3. **Annotations**: Dashed horizontal lines at the maximum rating values for each chart.
---
### Observations
- **Inverse Relationship**: All charts show a general decline in rating as temperature increases.
- **Peak Ratings**:
- Chart 1: 1500 at τ = 0.1.
- Chart 2: 1600 at τ = 0.01.
- Chart 3: 1600 at τ = 0.01.
- **Divergence**: Chart 2 exhibits the steepest decline, while Chart 3 shows the most stable trend.
---
### Legend Spatial Grounding
- All legends are positioned on the **right side** of their respective charts.
- Color consistency confirmed:
- Green → Chart 1 (1000).
- Teal → Chart 2 (1300).
- Blue → Chart 3 (1500).
---
### Conclusion
The charts illustrate how maximum ratings during training correlate with temperature thresholds. Higher initial ratings (e.g., 1600) correlate with sharper declines, suggesting sensitivity to temperature variations.
</details>
Figure 1: Ratings of our autoregressive decoder-only transformer, ChessFormer, over several different temperatures. We refer to our models as “ChessFormer <Maximum Glicko-2 rating seen during training>” to easily distinguish between different models in subsequent sections. Each model is trained only on games with players up to a certain rating ($1000$, $1300$, $1500$, respectively). We report 95% confidence intervals calculated by taking $\pm 1.96\sigma$.
1 Introduction
Generative models (GMs) are typically trained to mimic human behavior. These humans may be skilled in their various human objectives: answering a question, creating art, singing a song. The model has only one objective: minimizing the cross-entropy loss with respect to the output distribution, thereby adjusting it to match the distribution of human labels. (Although chatbots are subject to a variety of post-training tuning methods, e.g., RLHF, we restrict our scope by assuming that the specialized knowledge and capacities are already provided by cross-entropy loss.) Therefore, one might assume the model can, at best, match the performance of an expert on their human objectives. Is it possible for these models to surpass, that is, to transcend, their expert sources in some domains?
We illustrate an example of such transcendence in Figure 1, which measures the chess ratings (Glicko-2 [7]) of several transformer [35] models. Our experimental testbed is generative modeling on chess, which we choose as a domain for its well-understood, constrained nature. The transformer models are trained on public datasets of human chess transcripts, autoregressively predicting the next move in the game. To test for transcendence, we limit the maximal rating of the human players in the dataset below a specified score. We find that ChessFormer $1000$ and ChessFormer $1300$ (the number being the maximum rating seen during training) achieve significant levels of transcendence, surpassing the maximal rating seen in the dataset. Our focus is this capacity of a GM to transcend its expert sources by broadly outperforming any one expert. The key to our findings is the observation that GMs implicitly perform majority voting over the human experts. As these models are trained on a collection of many experts with diverse capacities, predilections, and biases, this majority vote oftentimes outperforms any individual expert, a phenomenon known as “wisdom of the crowd”.
Our objective is to formalize the notion of transcendence and focus narrowly on this source of improvement over the experts: the removal of diverse human biases and errors. We prove that this form of denoising is enabled by low-temperature sampling, which implicitly induces a majority vote. Our result draws a subtle but deep connection from our new setting to a rich prior literature on model ensembling [1, 6, 19], enabling several key results. We precisely characterize the conditions under which transcendence is possible, and give a rigorous theoretical framework for enabling future study into the phenomenon. To test the predictive power of our theory, we then empirically demonstrate these effects. Digging deeper into the effects of majority voting, we show that its advantage is primarily due to performing much better on a small subset of states—that is, under conditions that are likely key to determining the outcome of the game. We also find that diversity in the data is a necessary condition for practically effective majority voting, confirming our theoretical findings. In short:
- We formalize the notion of transcendence in generative models (Section 2).
- We find a key insight explaining one cause of transcendence by connecting the case of denoising experts to model ensembling. In low temperature sampling settings, we prove that a generative model can transcend if trained on a single expert that makes mistakes uniformly at random. We then extend this result to transcending a collection of experts that are each skilled in different domains (Section 3).
- We train a chess transformer on game transcripts that only include players up to a particular skill level. We confirm our theoretical prediction that this model only surpasses the maximum rating of its expert data generators at low temperature settings (Section 4).
- We visualize the distribution of changes in reward by setting a lower sampling temperature, attributing the increased performance to large improvements on a relatively small portion of states (Section 4.2).
- We explore the necessity of dataset diversity, and the inability of ChessFormer to transcend when trained on less diverse datasets (Section 4.2).
2 Definition of Transcendence
Denote by $\mathcal{X}$ the (variable-length) input space and by $\mathcal{Y}$ the (finite) output space. Let $\mathcal{F}$ be the class of all functions mapping $\mathcal{X}\mapsto P(\mathcal{Y})$ (where we use the notation $P(\mathcal{Y})$ to denote probability distributions over $\mathcal{Y}$ ). That is, the functions in $\mathcal{F}$ map inputs in $\mathcal{X}$ to probability distributions over $\mathcal{Y}$ , so each function $f∈\mathcal{F}$ defines a conditional probability distribution of $y∈\mathcal{Y}$ given $x∈\mathcal{X}$ . We denote this distribution by $f(y|x)$ .
Fix some input distribution $p$ over $\mathcal{X}$ such that $p$ has full support (namely, for every $x\in\mathcal{X}$ we have $p(x)>0$). Throughout the paper, we assume that our data is labeled by $k$ experts, denoted $f_{1},...,f_{k}\in\mathcal{F}$. Namely, we assume that the inputs are sampled from the input distribution $p$ and then each input $x\in\mathcal{X}$ is labeled by some expert chosen uniformly at random. (Equivalently, we can assume that each example is labeled by all experts.) This process induces a joint probability distribution over $\mathcal{X}\times\mathcal{Y}$, which we denote by $\operatorname*{D}$. Specifically, $\operatorname*{D}(x,y)=p(x)\overline{f}(y|x)$ where $\overline{f}$ is the mixture of the expert distributions, namely
$$
\overline{f}(y|x)=\frac{1}{k}\sum_{i=1}^{k}f_{i}(y|x) \tag{1}
$$
We measure the quality of some prediction function $f∈\mathcal{F}$ using a reward assigned to each input-output pair. Namely, we define a reward function $r:\mathcal{X}×\mathcal{Y}→\mathbb{R}$ , s.t. for all $x$ , the function $r(x,·)$ is not constant (i.e., for every input $x$ not all outputs have the same reward). We choose some test distribution $p_{\mathrm{test}}$ over $\mathcal{X}$ , and for some $f∈\mathcal{F}$ define the average reward of $f$ over $p_{\mathrm{test}}$ by:
$$
R_{p_{\mathrm{test}}}(f)=\mathbb{E}_{x\sim p_{\mathrm{test}}}\left[r_{x}(f)\right],\quad\mathrm{where}\quad r_{x}(f)=\mathbb{E}_{y\sim f(\cdot|x)}\left[r(x,y)\right] \tag{2}
$$
A learner has access to the distribution $\operatorname*{D}$ , and needs to find a function that minimizes the cross-entropy loss over $\operatorname*{D}$ . Namely, the learner chooses some function $\hat{f}∈\mathcal{F}$ s.t. $\hat{f}=\arg\min_{f∈\mathcal{F}}\mathbb{E}_{x\sim p}\left[H(\overline{f},f)\right]$ where $H$ is the cross-entropy function.
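To make Eqs. (1) and (2) concrete, the following minimal sketch evaluates the mixture $\overline{f}$ and the average reward in a tabular toy setting. The two inputs, three outputs, expert tables, and reward values are made-up illustrations and do not come from the paper's experiments.

```python
# Toy tabular setting (all numbers are illustrative assumptions):
# experts[i][x][y] = f_i(y | x), reward[x][y] = r(x, y), p_test[x] = test distribution.

def mixture(experts):
    """Eq. (1): uniform mixture of the expert conditional distributions."""
    k = len(experts)
    n_x, n_y = len(experts[0]), len(experts[0][0])
    return [[sum(f[x][y] for f in experts) / k for y in range(n_y)]
            for x in range(n_x)]

def average_reward(f, reward, p_test):
    """Eq. (2): R_{p_test}(f) = E_{x ~ p_test} E_{y ~ f(.|x)} [ r(x, y) ]."""
    return sum(
        p_test[x] * sum(f[x][y] * reward[x][y] for y in range(len(f[x])))
        for x in range(len(f))
    )

experts = [
    [[0.6, 0.2, 0.2], [0.2, 0.5, 0.3]],  # f_1
    [[0.5, 0.4, 0.1], [0.3, 0.4, 0.3]],  # f_2
]
reward = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
p_test = [0.5, 0.5]

f_bar = mixture(experts)
print("mixture:", f_bar)
print("R(f_bar):", average_reward(f_bar, reward, p_test))
for i, f in enumerate(experts, start=1):
    print(f"R(f_{i}):", average_reward(f, reward, p_test))
```

In this toy example the mixture's reward is the mean of the two expert rewards, which is the situation the idealized learner $\hat{f}$ ends up in when it fits $\operatorname*{D}$ exactly.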
**Definition 1**
*We define “transcendence” to be a setting of $f_{1},...,f_{k}∈\mathcal{F}$ and $p∈ P(\mathcal{X})$ where:
$$
R_{p_{\mathrm{test}}}(\hat{f})>\max_{i\in[k]}R_{p_{\mathrm{test}}}(f_{i}) \tag{3}
$$*
In other words, transcendence describes cases where the learned predictor performs better (achieves better reward) than the best expert generating the data. Note that we are focusing on an idealized setting, where the learner has access to an infinite amount of data from the distribution $\operatorname*{D}$, and can arbitrarily choose any function to fit the distribution (not limited to a particular choice of architecture or optimization constraints). As we will show, even in this idealized setting, transcendence can be impossible to achieve without further modifying the distribution.
**Remark 1**
*We have made various simplifying assumptions when introducing our setting. For example, we assume that all experts share the same input distribution, we assume that all inputs have non-zero probability under the training distribution $p$ , and we assume the experts are sampled uniformly at random. We leave a complete analysis of a more general setting to future work, and discuss this point further in section 6.*
3 Conditions for Transcendence
In this section we analyze the necessary and sufficient conditions for transcendence in our setting. We begin by showing that low-temperature sampling is necessary for transcendence in our specific setting. Then, we analyze specific sufficient conditions for transcendence, both in the case where the data is generated by a single expert and when the data is generated by multiple experts. We defer all proofs to Appendix A.
<details>
<summary>2406.11741v4/extracted/5922169/advantage-analysis.png Details</summary>

### Visual Description
# Technical Document Extraction: Chessboard Reward Analysis
## Legend & Color Gradient
- **Legend Title**: `Reward R_x(y)`
- **Color Scale**: Vertical gradient from **blue** (low reward) to **purple** (high reward)
- **Placement**: Left-aligned, spanning the full height of the image
## Chessboard Analysis
### General Structure
- **Grid Layout**: 8x8 chessboard with labeled ranks (1-8) and files (a-h)
- **Pieces**: Standard chess pieces (kings, queens, rooks, knights, bishops, pawns)
- **Arrows**: Red arrows indicate movement paths between squares
- **τ Values**: Three distinct configurations labeled below each board
### Board 1: τ = 1.0
- **Key Observations**:
- Arrows connect:
- `e2 → e4` (pawn move)
- `d1 → f3` (bishop move)
- `g1 → f3` (knight move)
- Color Intensity:
- `e4` (pawn destination) shows **purple** (highest reward)
- Other squares: **blue** (lower reward)
- **Bar Graph Below**:
- **x-axis**: Piece types (`♙`, `♘`, `♗`, `♕`, `♔`)
- **y-axis**: Reward magnitude (0.0–1.0)
- **Trend**:
- `♙` (pawn) dominates with ~0.4 reward
- `♘` (knight) ~0.2, `♗` (bishop) ~0.1
- `♕` (queen) and `♔` (king) negligible (~0.0)
### Board 2: τ = 0.75
- **Key Observations**:
- Arrows connect:
- `e7 → e5` (pawn move)
- `b8 → c6` (bishop move)
- `g8 → f6` (knight move)
- Color Intensity:
- `e5` (pawn destination) shows **purple**
- Other squares: **blue**
- **Bar Graph Below**:
- **x-axis**: Piece types (`♙`, `♘`, `♗`, `♕`, `♔`)
- **y-axis**: Reward magnitude (0.0–1.0)
- **Trend**:
- `♙` (pawn) ~0.35
- `♘` (knight) ~0.15, `♗` (bishop) ~0.1
- `♕` (queen) and `♔` (king) negligible (~0.0)
### Board 3: τ = 0.001
- **Key Observations**:
- Arrows connect:
- `e8 → e6` (pawn move)
- `a8 → b6` (bishop move)
- `h8 → f7` (knight move)
- Color Intensity:
- `e6` (pawn destination) shows **purple**
- Other squares: **blue**
- **Bar Graph Below**:
- **x-axis**: Piece types (`♙`, `♘`, `♗`, `♕`, `♔`)
- **y-axis**: Reward magnitude (0.0–1.0)
- **Trend**:
- `♙` (pawn) ~0.95 (dominant)
- All other pieces: ~0.0 (no visible bars)
## Cross-Referenced Trends
1. **τ Dependency**:
- As τ decreases (1.0 → 0.001), reward distribution becomes **more concentrated** on pawn moves.
- High τ (1.0) shows moderate rewards for multiple pieces; low τ (0.001) focuses rewards almost exclusively on pawns.
2. **Arrow Color Consistency**:
- All arrows match the legend’s color scale: **purple** for high-reward squares, **blue** for low-reward squares.
## Spatial Grounding
- **Legend Position**: `[x=0, y=0]` (left edge, full height)
- **Chessboards**: Centered horizontally, aligned with τ labels below.
- **Bar Graphs**: Positioned directly below each chessboard, aligned with τ labels.
## Data Table Reconstruction
| τ Value | Piece Type | Reward Magnitude |
|---------|------------|------------------|
| 1.0 | ♙ (Pawn) | ~0.4 |
| 1.0 | ♘ (Knight) | ~0.2 |
| 1.0 | ♗ (Bishop) | ~0.1 |
| 0.75 | ♙ (Pawn) | ~0.35 |
| 0.75 | ♘ (Knight) | ~0.15 |
| 0.75 | ♗ (Bishop) | ~0.1 |
| 0.001 | ♙ (Pawn) | ~0.95 |
## Notes
- **No Non-English Text**: All labels and annotations are in English.
- **Critical Insight**: Lower τ values (τ → 0) strongly prioritize pawn moves, suggesting a focus on short-term positional advantages in the reward function.
</details>
Figure 2: Visualizing the denoising effects of low temperature on the action distribution: an example of ChessFormer shifting probability mass towards the high-reward move of trapping the queen with the rook as the temperature $\tau$ decreases. The opacity of each red arrow represents the probability mass given to that move. The color of each square represents the reward that would be given for taking the action that moves the given piece to that square. Purple here is high reward, while blue is low. For more visualizations, see Appendix B.
3.1 Low-Temperature Sampling is Necessary for Transcendence
Observe that by definition of $\hat{f}$, and using standard properties of the cross-entropy loss, we get that $\hat{f}=\overline{f}$, as defined in Eq. (1). Therefore, the conditional probability distribution generated by $\hat{f}$ is simply an average of the distributions generated by the experts. Since the reward is a linear function of these distributions, we get that $\hat{f}$ never achieves transcendence:
**Proposition 1**
*For all choices of $f_{1},...,f_{k}$ and $p_{\mathrm{test}}$, there exists some $f_{i}$ s.t. $R_{p_{\mathrm{test}}}(f_{i})\ge R_{p_{\mathrm{test}}}(\hat{f})$.*
Note that in our setting, we assume that all experts are sampled uniformly for a given input $x$. If this assumption is removed, then it may be possible to achieve transcendence with a Bayesian weighting. We leave this analysis for future work.
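The linearity argument behind Proposition 1 can be spelled out in one line. This is only a sketch of the reasoning stated above, using $\hat{f}=\overline{f}$ and Eqs. (1) and (2); the formal proof is deferred to Appendix A.

$$
R_{p_{\mathrm{test}}}(\hat{f})=\mathbb{E}_{x\sim p_{\mathrm{test}}}\left[\sum_{y\in\mathcal{Y}}\frac{1}{k}\sum_{i=1}^{k}f_{i}(y|x)\,r(x,y)\right]=\frac{1}{k}\sum_{i=1}^{k}R_{p_{\mathrm{test}}}(f_{i})\le\max_{i\in[k]}R_{p_{\mathrm{test}}}(f_{i})
$$

The last step holds because an average of $k$ numbers never exceeds their maximum, so the strict inequality required by Definition 1 cannot be satisfied.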
3.2 Transcendence with Low-Temperature Sampling
Now, we consider a temperature sampling scheme over the learned function $\hat{f}$. Namely, for some temperature $\tau>0$ and some probability distribution $q\in P(\mathcal{Y})$, denote the softmax operator with temperature $\tau$ by $\mathrm{softmax}(q;\tau)\in P(\mathcal{Y})$ s.t. $\mathrm{softmax}(q;\tau)_{y}=\dfrac{\exp(q_{y}/\tau)}{\sum_{y^{\prime}\in\mathcal{Y}}\exp(q_{y^{\prime}}/\tau)}$. Additionally, we define $\operatorname*{arg\,max}(q)\in P(\mathcal{Y})$ to be the uniform distribution over the maximal values of $q$, namely $\operatorname*{arg\,max}(q)_{y}=1/\left\lvert Y_{q}\right\rvert$ if $y\in Y_{q}$ and $0$ if $y\notin Y_{q}$, where $Y_{q}=\{y\in\mathcal{Y}:q_{y}=\max(q)\}$. Now, define $\hat{f}_{\tau}$ to be the temperature sampling of $\hat{f}$, i.e. $\hat{f}_{\tau}(\cdot|x)=\mathrm{softmax}(\hat{f}(\cdot|x);\tau)$, and $\hat{f}_{\max}$ the arg-max “sampling” of $\hat{f}$, i.e. $\hat{f}_{\max}(\cdot|x)=\operatorname*{arg\,max}(\hat{f}(\cdot|x))$. We now show that if the arg-max predictor $\hat{f}_{\max}$ is better than the best expert, then transcendence is possible with low-temperature sampling.
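The two operators defined above can be sketched in a few lines of standard-library Python. The sketch follows the definition as written, where the softmax is applied to the probability vector $q$ itself; the example vector `q` is an arbitrary illustration.

```python
import math

def softmax_temperature(q, tau):
    """softmax(q; tau)_y = exp(q_y / tau) / sum_{y'} exp(q_{y'} / tau), for tau > 0."""
    m = max(q)
    # Subtracting the maximum before exponentiating avoids overflow for small tau
    # and leaves the normalized result unchanged.
    weights = [math.exp((q_y - m) / tau) for q_y in q]
    total = sum(weights)
    return [w / total for w in weights]

def argmax_distribution(q):
    """Uniform distribution over the set Y_q of maximizers of q."""
    m = max(q)
    maximizers = {y for y, q_y in enumerate(q) if q_y == m}
    return [1.0 / len(maximizers) if y in maximizers else 0.0
            for y in range(len(q))]

q = [0.5, 0.3, 0.2]  # illustrative stand-in for f_hat(. | x)
for tau in (1.0, 0.1, 0.001):
    print(tau, softmax_temperature(q, tau))
print("arg max:", argmax_distribution(q))
```

As $\tau$ shrinks, the printed distributions concentrate on the largest entry of `q` and approach the arg-max distribution, which is the limit that Proposition 2 relies on.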
**Proposition 2**
*$R_{p_{\mathrm{test}}}(\hat{f}_{\max})>\max_{i\in[k]}R_{p_{\mathrm{test}}}(f_{i})$ if and only if there exists some temperature $\tau\in(0,1)$ s.t. for all $0\le\tau^{\prime}\le\tau$, it holds that $R_{p_{\mathrm{test}}}(\hat{f}_{\tau^{\prime}})>\max_{i\in[k]}R_{p_{\mathrm{test}}}(f_{i})$.*
The above shows that, even though transcendence cannot be achieved when directly modeling the distribution, it can be achieved by temperature sampling, assuming that the arg-max predictor achieves higher reward than all experts. In other words, low-temperature sampling can be thought of as performing a majority vote [1, 6] between the experts. Please see Appendix A for a formal proof of this connection. When the experts put non-negligible mass onto the best actions, the resulting majority vote may find the best action [9], which improves performance compared to individual experts (i.e., “wisdom of the crowd”) and thus achieves transcendence.
3.3 Denoising a Single Expert
We now turn to study particular cases where low-temperature sampling can lead to transcendence. The simplest case is that of a single expert that outputs a correct but noisy prediction. Denote by $f^{*}$ the optimal expert, s.t. for all $x$ we have $f^{*}(y|x)=\dfrac{\delta(y\in Y^{*}_{x})}{\lvert Y^{*}_{x}\rvert}$, where $Y^{*}_{x}=\{y\in\mathcal{Y}:r(x,y)=\max_{y^{\prime}}r(x,y^{\prime})\}$ is the set of reward-maximizing outputs and $\delta(\text{condition})$ is 1 if the condition is true and 0 otherwise. Now, for some $\rho\in(0,1)$, let $f_{\rho}$ be a “noisy” expert that, for all $x$, with probability $\rho$ chooses an output uniformly at random, and with probability $1-\rho$ chooses an output according to the optimal expert $f^{*}(\cdot|x)$, namely $f_{\rho}(y|x)=\rho/\left\lvert\mathcal{Y}\right\rvert+(1-\rho)f^{*}(y|x)$. We show that transcendence is achieved with low-temperature sampling for data generated by $f_{\rho}$:
**Proposition 3**
*Assume the data is generated by a single expert $f_{\rho}$ . Then, there exists some temperature $\tau∈(0,1)$ s.t. for all $\tau^{\prime}≤\tau$ , the predictor $\hat{f}_{\tau^{\prime}}$ achieves “transcendence”.*
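As a hedged numerical illustration of Proposition 3, the sketch below builds a noisy expert $f_{\rho}$ for a single input with four outputs and compares its expected reward with that of the temperature-sampled predictor. The reward vector, $\rho=0.4$, and the temperatures tried are arbitrary choices for illustration, and the softmax is applied to probabilities as in the definition of $\hat{f}_{\tau}$.

```python
import math

def softmax_temperature(q, tau):
    """Temperature softmax applied directly to the probability vector q."""
    m = max(q)
    weights = [math.exp((q_y - m) / tau) for q_y in q]
    total = sum(weights)
    return [w / total for w in weights]

def noisy_expert(optimal, rho):
    """f_rho(y|x) = rho / |Y| + (1 - rho) * f*(y|x) for one fixed input x."""
    n = len(optimal)
    return [rho / n + (1.0 - rho) * p for p in optimal]

def expected_reward(dist, rewards):
    """r_x(f) = E_{y ~ f(.|x)} [ r(x, y) ] for one fixed input x."""
    return sum(p * r for p, r in zip(dist, rewards))

rewards = [1.0, 0.0, 0.0, 0.0]   # output 0 is the unique best output (assumed)
optimal = [1.0, 0.0, 0.0, 0.0]   # f*(. | x) puts all mass on the best output
rho = 0.4

f_rho = noisy_expert(optimal, rho)           # roughly [0.7, 0.1, 0.1, 0.1]
expert_reward = expected_reward(f_rho, rewards)

# With a single expert, the learned predictor f_hat equals f_rho.
for tau in (1.0, 0.1, 0.01):
    sampled = expected_reward(softmax_temperature(f_rho, tau), rewards)
    print(f"tau={tau}: reward={sampled:.4f}, beats expert: {sampled > expert_reward}")
```

In this toy case the sampled predictor exceeds the expert's reward only once $\tau$ is small enough, consistent with the "there exists some temperature" form of the proposition.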
3.4 Transcendence from Multiple Experts
Next, we consider the case where the dataset is generated by multiple experts that complement each other in terms of their ability to correctly predict the best output. For example, consider the case where the input space is partitioned into $k$ disjoint subsets, $\mathcal{X}=\mathcal{X}_{1}\dot{\cup}...\dot{\cup}\mathcal{X}_{k}$, s.t. the $i$-th expert performs well on the subset $\mathcal{X}_{i}$, but behaves randomly on other subsets. Namely, assume the expert $f_{i}$ behaves as follows: $f_{i}(y|x)=\dfrac{\delta(y\in Y^{*}_{x})\,\delta(x\in\mathcal{X}_{i})}{|Y^{*}_{x}|}+\dfrac{\delta(x\notin\mathcal{X}_{i})}{|\mathcal{Y}|}$, where $Y^{*}_{x}$ is as previously defined and $\delta(\text{condition})$ is 1 if the condition is true and 0 otherwise. We show that, assuming that the test distribution $p_{\mathrm{test}}$ is not concentrated on a single subset $\mathcal{X}_{i}$, we achieve transcendence with low-temperature sampling:
**Proposition 4**
*Let $p_{\mathrm{test}}$ be some distribution s.t. there are at least two subsets $\mathcal{X}_{i}≠\mathcal{X}_{j}$ s.t. $p_{\mathrm{test}}(\mathcal{X}_{i}),p_{\mathrm{test}}(\mathcal{X}_{j})>0$ . Then, if the data is generated by $f_{1},...,f_{k}$ , there exists some temperature $\tau∈(0,1)$ s.t. for all $\tau^{\prime}≤\tau$ , the predictor $\hat{f}_{\tau^{\prime}}$ achieves “transcendence”.*
For intuition behind Proposition 4, see the diagram in Appendix C.
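The multi-expert setting can also be checked numerically on a small example. The sketch below is an illustration under assumed values: two inputs (each forming its own subset $\mathcal{X}_{i}$), three outputs, a 0/1 reward, and a uniform test distribution. None of these numbers come from the paper.

```python
import math

def softmax_temperature(q, tau):
    """Temperature softmax applied directly to the probability vector q."""
    m = max(q)
    weights = [math.exp((q_y - m) / tau) for q_y in q]
    total = sum(weights)
    return [w / total for w in weights]

def make_expert(i, best_output, n_outputs):
    """Expert i is optimal on its own subset (the single input x = i), uniform elsewhere."""
    table = []
    for x, y_star in enumerate(best_output):
        if x == i:
            table.append([1.0 if y == y_star else 0.0 for y in range(n_outputs)])
        else:
            table.append([1.0 / n_outputs] * n_outputs)
    return table

def average_reward(f, reward, p_test):
    """R_{p_test}(f) for tabular f[x][y], reward[x][y], and p_test[x]."""
    return sum(p_test[x] * sum(p * r for p, r in zip(f[x], reward[x]))
               for x in range(len(f)))

n_outputs = 3
best_output = [0, 1]  # assumed unique best output for inputs x = 0 and x = 1
reward = [[1.0 if y == b else 0.0 for y in range(n_outputs)] for b in best_output]
p_test = [0.5, 0.5]   # mass on both subsets, as Proposition 4 requires

experts = [make_expert(i, best_output, n_outputs) for i in range(len(best_output))]
f_hat = [[sum(f[x][y] for f in experts) / len(experts) for y in range(n_outputs)]
         for x in range(len(best_output))]
f_hat_low_tau = [softmax_temperature(row, 0.01) for row in f_hat]

best_expert = max(average_reward(f, reward, p_test) for f in experts)
print("best expert:     ", best_expert)
print("f_hat (no temp): ", average_reward(f_hat, reward, p_test))
print("f_hat, tau=0.01: ", average_reward(f_hat_low_tau, reward, p_test))
```

Each expert is only reliable on its own subset, so the plain mixture does no better than the best expert, while the low-temperature predictor picks the majority answer on both subsets and comes out ahead.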
4 Experiments
To evaluate the predictive power of our impossibility result of transcendence with no temperature sampling (Proposition 1) as well as our result of transcendence from multiple experts with low temperature sampling (Proposition 2), we turn to modeling and training chess players. Chess stands out as an attractive option for several reasons. Chess is a well-understood domain and more constrained than other settings such as natural language generation, lending itself to easier and stronger analysis. Evaluation of skill in chess is also natural and well-studied, with several rigorous statistical rating systems available. In this paper, we use the Glicko-2 rating system [7], which is also adopted by https://lichess.org, the free and open-source online chess server from which we source our dataset.
4.1 Experimental Setup
<details>
<summary>2406.11741v4/extracted/5922169/latent_board_state_reward_tsne.png Details</summary>

### Visual Description
# Technical Document Extraction: Chessboard Reward Analysis
## 1. Chart Description
The image contains a **heatmap visualization** of chessboard positions with associated reward values, accompanied by four annotated chessboard diagrams with mathematical equations.
### 1.1 Heatmap Components
- **Color Gradient Legend**:
- Located on the right side of the heatmap
- Vertical gradient from **red (1.0)** to **blue (0.0)**
- Intermediate values:
- Purple (0.5)
- Dark blue (0.25)
- **Spatial Distribution**:
- Red clusters (high reward) concentrated in specific regions
- Blue clusters (low reward) dispersed throughout
- Purple regions (moderate reward) forming transitional zones
### 1.2 Axes
- **X-axis**: Labeled "Reward" (no numerical scale)
- **Y-axis**: Unlabeled (positional axis for clusters)
## 2. Chessboard Diagrams
Four annotated chessboard positions with mathematical equations representing expected rewards:
### 2.1 Chessboard 1
- **Position**:
- White king on h1
- Black king on e8
- White queen on a8
- Black pawn on b7
- **Equation**:
$$
\mathbb{E}_{y\sim f^*}[r_x] = 1.0
$$
### 2.2 Chessboard 2
- **Position**:
- White king on a1
- Black king on e8
- White pawn on c7
- Black pawn on d6
- **Equation**:
$$
\mathbb{E}_{y\sim f^*}[r_x] = 0.0
$$
### 2.3 Chessboard 3
- **Position**:
- Full starting position (all pieces)
- White king on e1
- Black king on e8
- **Equation**:
$$
\mathbb{E}_{y\sim f^*}[r_x] = 0.53
$$
### 2.4 Chessboard 4
- **Position**:
- Endgame scenario
- White king on e1
- Black king on e8
- White pawn on d5
- **Equation**:
$$
\mathbb{E}_{y\sim f^*}[r_x] = 0.54
$$
## 3. Key Observations
1. **Reward Correlation**:
- Chessboard 1 (highest reward, 1.0) corresponds to a red cluster in the heatmap
- Chessboard 2 (lowest reward, 0.0) aligns with blue clusters
- Chessboards 3 and 4 (moderate rewards) match purple regions
2. **Positional Analysis**:
- Endgame positions (Chessboards 3 and 4) show similar reward values (~0.5)
- Opening positions (Chessboard 1) have maximum reward
- Mid-game positions (Chessboard 2) have minimum reward
3. **Mathematical Notation**:
- All equations use conditional expectation notation:
$$
\mathbb{E}_{y\sim f^*}[r_x] = \text{value}
$$
- Subscripts indicate reward distribution conditioned on policy $f^*$
## 4. Data Table Reconstruction
| Chessboard | Position Description | Expected Reward |
|------------|----------------------------|-----------------|
| 1 | King vs King + Queen | 1.0 |
| 2 | Pawn vs Pawn | 0.0 |
| 3 | Full starting position | 0.53 |
| 4 | Endgame (Pawn vs King) | 0.54 |
## 5. Trend Verification
- **Heatmap Trends**:
- Red clusters (high reward) form dense regions in specific positional zones
- Blue clusters (low reward) appear as isolated dots
- Purple regions show gradient transitions between reward values
- **Chessboard Trends**:
- Reward decreases from opening (1.0) to mid-game (0.0)
- Endgame positions show slight increase (0.53 → 0.54)
## 6. Spatial Grounding Confirmation
- Legend position: Right side, vertical orientation
- Color matching:
- Red = 1.0 (Chessboard 1)
- Blue = 0.0 (Chessboard 2)
- Purple = 0.5 (Chessboards 3 and 4)
## 7. Missing Elements
- No textual labels for individual chess pieces
- No explicit explanation of reward function $f^*$
- No temporal dimension or move sequence information
## 8. Conclusion
The visualization demonstrates reward distribution across chess positions, with mathematical equations providing precise expected values. The heatmap's color coding effectively correlates with the annotated chessboard positions, showing clear reward gradients from opening to endgame scenarios.
</details>
Figure 3: Inspired by Mnih et al. [20], we generate a t-SNE embedding [34] of ChessFormer’s last hidden layer latent representations of game transcripts during training time. The colors represent the probability of winning, with $+1$ corresponding to a state where White has won and $0$ to a state where Black has won. Probability of winning is computed through the Stockfish analysis engine. We also visualize several board states associated with different clusters in the t-SNE embedding, and their associated expected reward when following the expert Stockfish distribution. Note that the model distinguishes between states where the outcome has already been determined (the two left boards), versus opening states that are extremely similar (the two right boards). See the full t-SNE in Appendix G.
Training Details.
We trained several $50$M-parameter autoregressive transformer decoders following best practices from modern large model training, including a cosine learning rate schedule and similar batch size-learning rate ratios as prescribed by the OPT-175B team [37]. Our dataset consists of human chess games from the lichess.org open source database from January 2023 to October 2023. In total, this dataset contains approximately one billion games. In this setting, an expert is a specific individual player. To test for transcendence, we truncate this dataset by a maximum rating, so that during training a model only sees data up to a given rating. We train our model on the next-token prediction objective, and represent our chess games as Portable Game Notation (PGN) strings, such as 1.e4 e5 2.Nf3 Nc6 3.Bb5... 1/2-1/2. Note that we do not give any rating or reward information during training: the only inputs the model sees are the moves and the outcome of the game. We tokenize our dataset at the $32$-symbol character level. (For further details, see Appendix E.) Our model plays chess “blind” (without direct access to the board state) and, furthermore, is never explicitly given the rules of the game: at no point is play constrained to valid outputs for a given piece or board state. Nontrivial chess skill is therefore not straightforward to acquire, and if not for the surprising capabilities of modern large transformers, one might imagine such a model would fail to learn even the basic rules of playing chess. This blindfolded setting has also been studied by prior work [23, 30], as discussed further in section 5.
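To illustrate what character-level tokenization with a next-token objective looks like on a PGN string, here is a minimal sketch. The vocabulary is built from the example string itself and is therefore an assumption; the actual $32$-symbol vocabulary is specified in Appendix E and is not reproduced here.

```python
def build_vocab(texts):
    """Map each distinct character to an integer id (hypothetical vocabulary)."""
    chars = sorted(set("".join(texts)))
    return {ch: i for i, ch in enumerate(chars)}

def encode(text, vocab):
    """Turn a string into a list of character ids."""
    return [vocab[ch] for ch in text]

def decode(ids, vocab):
    """Invert encode()."""
    inverse = {i: ch for ch, i in vocab.items()}
    return "".join(inverse[i] for i in ids)

# A short PGN prefix used purely as an example input.
pgn = "1.e4 e5 2.Nf3 Nc6 3.Bb5"
vocab = build_vocab([pgn])
ids = encode(pgn, vocab)

# Next-token prediction: the target at each position is the following character.
inputs, targets = ids[:-1], ids[1:]
print(len(vocab), "distinct characters in this example")
print(decode(inputs[:5], vocab), "->", decode(targets[:5], vocab))
```

Because the model only ever sees such character streams, any knowledge of legal moves or board state has to be inferred from the transcripts.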
One gap between our theory and practice is that in our theory, we assume that each expert is defined over the entire input space $\mathcal{X}$ . However, in the chess setting such full coverage is extremely unlikely to be the case after around move $15$ , as there are more unique chess games than atoms in the universe due to the high branching factor of the game tree. To address this gap, we visualize the latent representation of our model in Figure 3, where we find the model is able to capture meaningful semantics regarding both the relative advantage of a state, as well as the identity of the black and white player. This visualization illustrates the ability of our model to generalize by compressing games into some shared latent representation, enabling experts to generalize to unseen states, bridging this gap between theory and practice.
Evaluation.
We evaluate each model by its Glicko-2 rating against Stockfish 16.1 [29], a popular open-source chess engine. Stockfish uses a traditional minimax search equipped with a bespoke CPU-efficient neural network for evaluation [22] and $\alpha$-$\beta$ pruning for further efficiency. We evaluate Stockfish at levels 1, 3, and 5 with a 100ms timeout directly on Lichess’ platform against the Maia [18] 1, 5, and 9 bots (human behavior cloned convolutional networks trained at rating bins 1100-1200, 1500-1600, and 1900-2000, respectively) for several hundred games, obtaining calibrated Glicko-2 ratings for Stockfish specifically on Lichess’ platform ($1552\pm 45.2$, $1842\pm 45.2$, $2142\pm 59$ for Stockfish Levels 1, 3, and 5, respectively). Next, to evaluate our own models, we play against Stockfish levels 1, 3, and 5 for 100 games each, computing a final rating from all 300 games. We then report both the Glicko-2 rating $R$ and the rating deviation $RD$ of our models, where $R\pm 2\,RD$ provides a $95\%$ confidence interval. To play against Stockfish, we successively prompt our model with the current game PGN string. Note that our output is entirely unconstrained, and may be either illegal in the current board state or altogether unparsable. If our model fails to generate a valid legal move after 5 samples, we consider it to have lost. After generation, we give the updated board state to Stockfish and pass a new PGN string appended with the prior move of Stockfish back to our model. We repeat this process until the game ends.
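The move-selection rule and the reported interval can be sketched as follows. The callables `sample_move` and `is_legal` are hypothetical placeholders for the model and a legality checker; they are not part of the paper's code, and no real chess library is used here.

```python
def choose_move(sample_move, is_legal, pgn, max_attempts=5):
    """Return the first legal move among up to `max_attempts` samples.

    Returns None if every sample is illegal or unparsable, which the
    evaluation described above counts as a loss for the model.
    """
    for _ in range(max_attempts):
        move = sample_move(pgn)
        if is_legal(pgn, move):
            return move
    return None

def glicko_interval(rating, rd):
    """Approximate 95% interval R +/- 2 * RD, as stated in the text."""
    return (rating - 2 * rd, rating + 2 * rd)

# Stub "model" that emits two unparsable strings before a legal move (illustrative only).
outputs = iter(["??", "Zz9", "e4"])
move = choose_move(lambda pgn: next(outputs), lambda pgn, m: m == "e4", "1.")
print(move)

# Stub that never produces a legal move: the game would be scored as a loss.
print(choose_move(lambda pgn: "??", lambda pgn, m: False, "1."))

print(glicko_interval(1500.0, 45.0))
```

Passing the sampler and legality check in as arguments keeps the retry rule separate from any particular model or chess engine.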
4.2 Experimental Results
Main Result: Low-temperature sampling enables transcendence.
In this section we attempt to answer our primary research question: can low-temperature sampling actually induce transcendence in practice? We test Proposition 2 by evaluating several ChessFormers across different temperature values, from $0.001$ (nearly deterministic), to $1.0$ (original distribution), to $1.5$ (high entropy). In Figure 1 we definitively confirm the existence of transcendence. Our ChessFormer 1000 (where the number refers to the maximum rating seen during training) and ChessFormer 1300 models are able to transcend to a rating of around 1500 at temperature $\tau=0.001$ . Interestingly, ChessFormer 1500 is unable to transcend at test time, a result we further analyze in Dataset Diversity.
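To make the temperature values concrete, here is a minimal sketch of temperature scaling on a toy distribution over three candidate moves. It is a simplification: the actual model samples tokens rather than whole moves, and the move labels and probabilities below are invented for illustration.

```python
import math
import random
from typing import Dict


def apply_temperature(probs: Dict[str, float], tau: float) -> Dict[str, float]:
    """Rescale a distribution as softmax(log p / tau), computed stably in log space."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    logits = {y: math.log(p) / tau for y, p in probs.items() if p > 0}
    top = max(logits.values())  # subtract the max so exp() cannot overflow
    weights = {y: math.exp(v - top) for y, v in logits.items()}
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}


def sample_move(probs: Dict[str, float], tau: float, rng: random.Random) -> str:
    """Draw one move from the temperature-scaled distribution."""
    scaled = apply_temperature(probs, tau)
    return rng.choices(list(scaled), weights=list(scaled.values()), k=1)[0]


# Hypothetical next-move distribution learned from many noisy players.
toy = {"best": 0.40, "ok": 0.35, "blunder": 0.25}
for tau in (0.001, 1.0, 1.5):
    print(tau, apply_temperature(toy, tau))
```

At $\tau=1$ the distribution is unchanged; as $\tau$ approaches $0$ nearly all mass moves to the most likely move; and $\tau>1$ flattens the distribution, giving the blunder more weight.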
To more deeply understand when and why transcendence occurs, we investigate two questions. (1) How does the reward function defined in Equation 2 shift with respect to low-temperature sampling? (2) Does transcendence rely on dataset diversity, as introduced theoretically in subsection 3.4?
Lowering temperature increases rewards in expectation on specific states, leading to transcendence over the full game.
When playing chess, a low-skilled player may play reasonably well until they make a significant blunder at a key point in play. If these errors are idiosyncratic, averaging across many experts would have a denoising effect, leaving the best moves with higher probability. Therefore, low-temperature sampling would move probability mass towards better moves in specific play contexts. Without low-temperature sampling, the model would still put probability mass onto blunders. To gain intuition for this idea, we visualize it theoretically in Appendix C and empirically in Figure 2 and Appendix B. This hypothesis motivates our first research question in this section: Does low-temperature sampling improve the expected reward very much for just some specific key game states, or a little for many game states?
To formalize this notion, we first define a “favor” function, which captures the improvement in reward by following some new probability distribution over some baseline probability distribution. Our definition is inspired by the Performance Difference Lemma (PDL) [10] from Reinforcement Learning (RL), which establishes an equivalence between the change in performance from following some new policy (a probability distribution of actions given a state) over some old policy, and the expected value of the advantage function of the old policy sampled with respect to the new policy. In RL, the advantage function is defined as the difference between the value of taking a single action in a given state versus the expected value of following some policy distribution of actions in that state.
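For reference, one common statement of the PDL in the discounted setting is given below. The notation is introduced here only for illustration and is not used elsewhere in this paper: $J(\pi)$ is the expected discounted return of policy $\pi$ , $\gamma\in[0,1)$ is the discount factor, $d^{\pi^{\prime}}$ is the normalized discounted state visitation distribution of $\pi^{\prime}$ , and $A^{\pi}$ is the advantage function of $\pi$ .

$$
J(\pi^{\prime})-J(\pi)=\frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi^{\prime}},\,a\sim\pi^{\prime}(\cdot|s)}\left[A^{\pi}(s,a)\right],\qquad A^{\pi}(s,a)=Q^{\pi}(s,a)-V^{\pi}(s).
$$

The key structural feature, which the favor function mirrors, is that states are drawn from the visitation distribution of the new policy while the comparison is made against the old one.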
Here, we define the “favor” of $f^{\prime}$ over $f$ in $x$ as the change in expected reward obtained by acting with $f^{\prime}$ rather than $f$ on an input $x$ reached while following $f^{\prime}$ :
$$
F(f^{\prime},f;x)=\mathbb{E}_{x\sim d^{f^{\prime}},y\sim f^{\prime}(\cdot|x)}[r(x,y)]-\mathbb{E}_{x\sim d^{f^{\prime}},y\sim f(\cdot|x)}[r(x,y)]. \tag{4}
$$
Here $d^{f}$ refers to the state visitation distribution [31] when following $f$ in a sequential setting. Informally, it can be thought of as the distribution of states seen when sampling from $f$ with a fixed transition function that takes an input $x$ and an output $y$ and returns the next input. In our setting, that transition function is given by the rules of chess and the opponent player. Given this favor function, we can now quantitatively explore the effects that lead to transcendence by setting the baseline $f$ to be the original imitation-learned probability distribution (temperature $\tau=1$ ), and $f^{\prime}$ as a low-temperature intervention on $f$ (e.g. temperature $\tau=0$ ). We can empirically calculate the reward by using the evaluation function [22] of Stockfish, an expert neural reward function that Stockfish uses to calculate its next move. This reward function is a neural network trained to predict the probability of winning through a sigmoid on a linear combination of handcrafted expert heuristics, such as amount of material versus opponent material, and number of moves to a potential checkmate.
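For a single state with a small discrete move set, the favor can be computed exactly as a difference of two expectations. The sketch below assumes per-move rewards are already available (in the paper these come from the Stockfish evaluation); the move names, probabilities, and win probabilities are invented for illustration.

```python
from typing import Dict


def expected_reward(policy: Dict[str, float], reward: Dict[str, float]) -> float:
    """E_{y ~ policy}[r(x, y)] for one fixed state x."""
    return sum(p * reward[y] for y, p in policy.items())


def favor_at_state(
    f_new: Dict[str, float], f_base: Dict[str, float], reward: Dict[str, float]
) -> float:
    """Favor of f_new over f_base at one state: the gain in expected reward."""
    return expected_reward(f_new, reward) - expected_reward(f_base, reward)


# Hypothetical win probabilities for three candidate moves in one position.
reward = {"Nf3": 0.52, "e4": 0.50, "Qh5": 0.30}
f_base = {"Nf3": 0.5, "e4": 0.3, "Qh5": 0.2}  # tau = 1 baseline
f_low = {"Nf3": 1.0, "e4": 0.0, "Qh5": 0.0}   # near-zero temperature limit

print(favor_at_state(f_low, f_base, reward))  # approximately 0.05
```

Averaging this quantity over states visited while playing with the low-temperature policy gives an estimate of Equation 4. When the move set is too large to enumerate, the baseline expectation can instead be approximated by sampling counterfactual moves at $\tau=1$ .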
<details>
<summary>2406.11741v4/extracted/5922169/adv-gain-dist-flat.png Details</summary>

### Visual Description
# Technical Document Extraction: Probability Distribution Chart
## Chart Overview
The image depicts a probability distribution chart analyzing changes in expected reward over a base probability distribution. Key components include axes, legends, shaded regions, and vertical reference lines.
---
### **Axes and Labels**
- **X-Axis**:
- Title: *"Change in expected reward over base fτ=1 probability distribution over individual states: F(f′,f)"*
- Range: -10.0 to 10.0
- Markers: -10.0, -7.5, -5.0, -2.5, 0.0, 2.5, 5.0, 7.5, 10.0
- **Y-Axis**:
- Title: *"P(F)"* (Probability of F)
- Range: 0.0 to 0.6
- Markers: 0.0, 0.2, 0.4, 0.6
---
### **Legend**
- **Location**: Upper-left corner of the chart.
- **Entries**:
1. **Baseline**: Black vertical line at `E[F] = 0.0`.
2. **f′ = fτ=0.001**: Green shaded histogram.
3. **f′ = fτ=0.75**: Pink shaded histogram.
4. **E[F] = 2.15**: Green vertical line at x = 2.15.
5. **E[F] = 0.99**: Red vertical line at x = 0.99.
---
### **Chart Components**
1. **Baseline (Black Line)**:
- Vertical line at `x = 0.0`, representing the baseline expected reward (`E[F] = 0.0`).
2. **Probability Distributions**:
- **f′ = fτ=0.001 (Green)**:
- Shaded histogram peaking near `x ≈ -2.5`.
- Extends from `x ≈ -7.5` to `x ≈ 2.5`.
- **f′ = fτ=0.75 (Pink)**:
- Shaded histogram peaking near `x ≈ 0.5`.
- Extends from `x ≈ -2.5` to `x ≈ 5.0`.
3. **Expected Value Markers**:
- **E[F] = 2.15 (Green Line)**: Vertical line at `x = 2.15`.
- **E[F] = 0.99 (Red Line)**: Vertical line at `x = 0.99`.
---
### **Trends and Observations**
- **Baseline**: The black line at `x = 0.0` serves as the reference point for comparing deviations in expected reward.
- **f′ = fτ=0.001 (Green)**:
- Distribution is skewed left, with higher probability density in negative reward changes.
- Peaks at `x ≈ -2.5`, indicating a higher likelihood of reduced rewards compared to the baseline.
- **f′ = fτ=0.75 (Pink)**:
- Distribution is centered closer to the baseline but slightly right-skewed.
- Peaks at `x ≈ 0.5`, suggesting moderate positive reward changes on average.
- **Expected Values**:
- `E[F] = 0.99` (red) and `E[F] = 2.15` (green) represent mean reward changes for the respective distributions.
---
### **Spatial Grounding**
- **Legend Position**: Upper-left corner (coordinates: `[x ≈ 0.05, y ≈ 0.95]` relative to chart boundaries).
- **Color Consistency**:
- Green legend entry matches the green shaded histogram and green vertical line (`E[F] = 2.15`).
- Pink legend entry matches the pink shaded histogram.
- Black legend entry matches the baseline vertical line.
---
### **Conclusion**
The chart compares two modified probability distributions (`f′ = fτ=0.001` and `f′ = fτ=0.75`) against a baseline (`fτ=1`). The green distribution (`fτ=0.001`) shows a higher probability of negative reward changes, while the pink distribution (`fτ=0.75`) leans toward positive changes. Expected values (`E[F]`) are marked at 0.99 and 2.15, providing quantitative benchmarks for comparison.
</details>
Figure 4: The favor probability distribution, or change in expected reward by setting temperature lower than $\tau=1.0$ . We plot the favor distribution across two different temperatures, $\tau=0.75$ and $\tau=0.001$ , by running the Stockfish analysis engine across $100$ total ChessFormer $1000$ games played at $0.001$ temperature against Stockfish level $1$ (as theoretically justified by PDL [10]). We calculate favor by sampling $100$ counterfactual potential moves at $\tau=1.0$ per actual move made at $\tau=0.001$ to compute a baseline expected reward. In total, we gather an empirical probability distribution with $n=382,000$ total samples per $\tau$ ( $38.2$ moves on average per game). Note that we plot the distributions with transparency, so the brownish area is where the two overlap. We visualize several long-tail examples in Appendix B.
In Figure 4, we find that lowering the temperature has the effect of skewing the expected reward distribution to the right, especially for the green $\tau=0.001$ distribution. This result implies that the model does not improve the expected reward by a small amount for many game states, but rather improves the expected reward by a relatively large amount for a few game states. Thus, $\tau=0.001$ improves the expected reward (probability of winning) by an average of $\mathbf{2.15± 0.17\%}$ , but for some states, this expected improvement is over 5%. Note that the favor distribution at the original temperature can be thought of as a Dirac distribution centered at $0$ . The above finding answers our research question in this section: low-temperature sampling improves the expected reward by relatively large amounts for some specific game states, which is likely why the ChessFormer $1000$ and $1300$ models were able to achieve transcendence.
| Temperature | Expected reward (%) | Favor $F$ (%) | Top- $k$ accuracy (%), smallest $k$ | Top- $k$ accuracy (%), middle $k$ | Top- $k$ accuracy (%), largest $k$ |
| --- | --- | --- | --- | --- | --- |
| $\tau=0.001$ | $\mathbf{39.95± 0.92}$ | $\mathbf{2.15± 0.17}$ | $\mathbf{29.61± 1.43}$ | $\mathbf{54.26± 1.57}$ | $\mathbf{66.86± 1.47}$ |
| $\tau=0.75$ | $38.79± 0.90$ | $0.99± 0.06$ | $25.08± 0.95$ | $47.84± 1.09$ | $60.37± 1.04$ |
| $\tau=1.0$ | $37.80± 0.87$ | $0± 0$ | $22.61± 0.86$ | $44.00± 9.96$ | $56.27± 0.93$ |
Table 1: Statistics describing the relationship between reward at lower temperatures and at $\tau=1$ . In the first data column, we display the expected reward across our dataset, which is the probability $\mathbb{P}$ of winning as calculated by Stockfish 16.1. In the second data column, we display $F$ , or the change in reward for the given temperature $\tau$ versus the baseline. In the last three columns we display the accuracy for the best moves ranked by Stockfish analysis run at a time cutoff of $1$ second. Here, the top- $k$ accuracy is the percentage of games where the actual move sampled by the model was in the top- $k$ moves as ranked by Stockfish. We report 95% bootstrapped confidence intervals with 10K resamples.
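As a rough illustration of how such intervals can be obtained, the sketch below computes a percentile bootstrap interval for a mean. The percentile method, the function name `bootstrap_ci`, and the toy data are assumptions made here; the text states only that 95% bootstrapped intervals with 10K resamples were used.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(
    values: Sequence[float],
    n_resamples: int = 10_000,
    level: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical per-move indicators: 1 if the sampled move was in Stockfish's top-k.
hits = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
low, high = bootstrap_ci(hits, n_resamples=2_000)
print(f"top-k accuracy: {sum(hits) / len(hits):.2f}, 95% CI: [{low:.2f}, {high:.2f}]")
```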
In Table 1, we present the statistics of the favor function for different temperature values. From this table, we observe that as the temperature decreases, the top- $k$ accuracies monotonically increase, suggesting that the model becomes more consistent in selecting good moves. We also observe that although the model improves as temperature decreases, the probability of winning is still below $50\%$ , meaning our model should tend to lose more games than it wins against Stockfish $1$ . This result matches our results in Figure 1, as the rating of Stockfish $1$ is also higher than the reported rating for $\tau=0.001$ ( $1550$ for Stockfish 1 vs $\sim 1450$ for ChessFormer $1000$ ). Overall, the analysis of the advantage statistics provides further evidence for the effectiveness of low-temperature sampling in inducing transcendence in chess models.
Dataset diversity is essential for transcendence.
As we note in subsection 3.4, our theory requires dataset diversity as a necessary condition for enabling transcendence. Importantly, we find in Figure 1 that not all models are able to transcend. Unlike ChessFormer 1000 or 1300, ChessFormer 1500 fails to transcend. We hypothesize that this result arises because, in the band of ratings from $1000$ to $1500$ , diversity does not significantly increase. If so, a $1000$ rated player can be thought of as a noisy $1500$ rated player, but a $1500$ rated player cannot be thought of as a noisy $2000$ rated player. In this section we ask the following research question: Is diversity in data required for enabling transcendence?
In Figure 5, we explore this research question by quantifying dataset diversity through the normalized entropy on the action distribution $\mathcal{H}_{f}(Y|X)={\mathbb{E}_{y\sim f(y|x=X)}[-\log_{2}f(y|x=X)]}/{\log_{2}|\mathcal{Y}|}.$ To gain intuition for this metric, imagine the action distribution of moves taken for any given state. Entropy will be higher for more uniform action distributions, and lower for more deterministic, peaked action distributions. The average entropy of these action distributions can therefore serve as a measurement of the diversity of the dataset. We normalize this entropy to the range $[0,1]$ by dividing by the binary log of the number of legal moves: $\log_{2}|\mathcal{Y}|$ .
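The normalized entropy for a single state can be computed directly from the empirical move counts. The following is a minimal sketch; the function name and the toy move lists are introduced here for illustration, and the number of legal moves is assumed to be supplied by a chess rules library.

```python
import math
from collections import Counter
from typing import Iterable


def normalized_entropy(moves: Iterable[str], num_legal_moves: int) -> float:
    """Empirical entropy (in bits) of the observed moves, divided by log2 of the
    number of legal moves so that the result lies in [0, 1]."""
    counts = Counter(moves)
    total = sum(counts.values())
    if total == 0 or num_legal_moves < 2:
        return 0.0  # a forced move (or no data) carries no diversity
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(num_legal_moves)


# Hypothetical observations of the move played from one common position
# that has 4 legal moves.
print(normalized_entropy(["e4", "e4", "d4", "d4"], num_legal_moves=4))  # 0.5
print(normalized_entropy(["e4", "d4", "c4", "Nf3"], num_legal_moves=4))  # 1.0
```

Averaging this value over all sufficiently common states in a dataset gives a dataset-level diversity score of the kind reported in Figure 5.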
Importantly, we cannot calculate this normalized entropy for every state, as most states after move $16$ in the midgame and before the endgame are unique within the dataset, and we therefore observe just a single action for those states. Our metric is thus limited in that it only considers opening moves, the beginning of the midgame, and the endgame. We consider only common states with greater than $100$ actions by sampling $1,000,000$ games from each dataset. The average entropies confirm our hypothesis: the $<1500$ cutoff dataset has on average less diversity than the $<1300$ dataset, which in turn has less diversity than the $<1000$ dataset. This result suggests that ChessFormer $1500$ likely is not transcendent due to a lack of diversity in its dataset. If the entropy instead stayed constant for each dataset, it would imply that each had a similar level of diversity. In such a case, we would expect that ChessFormer $1500$ would likely also transcend. Instead, as predicted, it is likely not transcendent due to a lack of diversity.
<details>
<summary>2406.11741v4/x2.png Details</summary>

### Visual Description
# Technical Document Extraction: Action Distribution Histogram
## Title
**H of action distribution over common states**
## Axes
- **X-axis**: ℋ(Y|X) (ranges from 0.0 to 1.0)
- **Y-axis**: P(ℋ(Y|X)) (ranges from 0.00 to 0.15)
## Legend
- **Location**: Top-left corner
- **Entries**:
- Green: Max Rating: 1000
- Orange: Max Rating: 1300
- Red: Max Rating: 1500
## Data Points & Trends
1. **Max Rating: 1000 (Green)**
- **E[ℋ]**: 0.70
- **Visual Trend**: Peaks sharply near x=0.70, with a tall, narrow distribution.
- **Key Observation**: Highest probability density (y-axis) among all distributions.
2. **Max Rating: 1300 (Orange)**
- **E[ℋ]**: 0.66
- **Visual Trend**: Broader distribution than 1000, peaking near x=0.66.
- **Key Observation**: Overlaps with 1000 but has a lower peak.
3. **Max Rating: 1500 (Red)**
- **E[ℋ]**: 0.64
- **Visual Trend**: Widest distribution, peaking near x=0.64.
- **Key Observation**: Lowest peak but extends further left/right.
## Chart Structure
- **Type**: Stacked histogram with overlapping distributions.
- **Color Coding**:
- Green (1000) dominates the rightmost region (x > 0.7).
- Orange (1300) occupies the middle region (x ≈ 0.6–0.7).
- Red (1500) spans the leftmost region (x < 0.65).
## Spatial Grounding
- **Legend Colors**:
- Green (1000) matches green bars.
- Orange (1300) matches orange bars.
- Red (1500) matches red bars.
- **Vertical Lines**:
- Red line at x=0.70 (1000).
- Orange line at x=0.66 (1300).
- Green line at x=0.64 (1500).
## Additional Text
- **Embedded Labels**:
- "E[ℋ] 1000: 0.70" (near red line).
- "E[ℋ] 1300: 0.66" (near orange line).
- "E[ℋ] 1500: 0.64" (near green line).
## Notes
- No non-English text detected.
- All numerical values and labels are explicitly transcribed.
- Trends and spatial relationships are cross-verified with legend and axis markers.
</details>
Figure 5: Action distribution diversity, as measured by the average normalized entropy over different chess rating dataset cutoffs with $n=2681,3037,3169$ common states for ratings $1000,1300,1500$ , respectively. These entropies are calculated directly from the empirical frequencies of our dataset, and are model-agnostic.
4.3 Additional Settings
SQuADv2 Natural Language Temperature Denoising Experiment. We extend our analysis to the Natural Language Processing domain by running experiments on the Stanford Question Answering Dataset (SQuAD 2.0). We tested the effects of temperature denoising on the performance of several large language models (LLMs) of varying sizes. The SQuAD task involves reading comprehension and question-answering based on Wikipedia articles, making it an ideal setting to evaluate the impact of denoising on language models. We measured the exact-match, semantic-match, and F1 scores of the model outputs at different temperatures. The results show that temperature denoising leads to improved performance, corroborating the findings of our chess experiments and providing broader validation of the underlying mechanism of temperature denoising in diverse domains.
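For readers unfamiliar with these metrics, the sketch below shows exact match and token-level F1 in the style of the public SQuAD evaluation script (lowercasing, stripping punctuation and English articles, then comparing tokens). Whether the experiments here used exactly this normalization is an assumption. Semantic match requires an LLM judge and is not shown.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Empty answers (unanswerable questions) match only each other.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))  # True
print(token_f1("in Paris France", "Paris"))              # 0.5
```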
<details>
<summary>2406.11741v4/extracted/5922169/nlu_experiment/Combined_SQuADv2_Evaluation_Metrics_Broken_Axis.jpeg Details</summary>

### Visual Description
# Technical Document Extraction: Model Performance vs Temperature
## Overview
The image contains three comparative line graphs analyzing model performance metrics across varying temperature parameters (τ). Each graph evaluates different evaluation criteria: F1 Score, Exact Match (%), and Semantic Match (%). Four language models are compared:
- Qwen2 (7B)
- Mistral (7B)
- Gemma 2 (2B)
- GPT-2 (163M)
---
## Graph 1: F1 Score vs Temperature
### Axes
- **X-axis**: Temperature (τ) [0.001, 0.25, 0.5, 0.75, 1, 1.5]
- **Y-axis**: F1 Score [0, 10, 20, ..., 70]
### Legend
- **Placement**: Right side of graph
- **Color Mapping**:
- Qwen2 (7B): Green
- Mistral (7B): Teal
- Gemma 2 (2B): Blue
- GPT-2 (163M): Purple
### Key Trends
1. **Qwen2 (7B)**:
- Starts at ~70 F1 Score at τ=0.001
- Gradual decline to ~60 at τ=1.5
- Shaded confidence interval narrows slightly with increasing τ
2. **Mistral (7B)**:
- Starts at ~65 F1 Score at τ=0.001
- Declines to ~55 at τ=1.5
- Confidence interval widens moderately
3. **Gemma 2 (2B)**:
- Starts at ~50 F1 Score at τ=0.001
- Drops to ~40 at τ=1.5
- Confidence interval remains relatively stable
4. **GPT-2 (163M)**:
- Starts at ~5 F1 Score at τ=0.001
- Declines to ~3 at τ=1.5
- Confidence interval shows minimal variation
### Data Points (Approximate)
| Model | τ=0.001 | τ=0.25 | τ=0.5 | τ=0.75 | τ=1.0 | τ=1.5 |
|----------------|---------|--------|-------|--------|-------|-------|
| Qwen2 (7B) | 70 | 68 | 65 | 63 | 61 | 60 |
| Mistral (7B) | 65 | 63 | 60 | 58 | 55 | 53 |
| Gemma 2 (2B) | 50 | 48 | 45 | 43 | 40 | 38 |
| GPT-2 (163M) | 5 | 4.5 | 4 | 3.5 | 3 | 2.5 |
---
## Graph 2: Exact Match (%) vs Temperature
### Axes
- **X-axis**: Temperature (τ) [0.001, 0.25, 0.5, 0.75, 1, 1.5]
- **Y-axis**: Exact Match (%) [0, 10, 20, ..., 60]
### Legend
- **Placement**: Right side of graph
- **Color Mapping**: Same as Graph 1
### Key Trends
1. **Qwen2 (7B)**:
- Starts at ~60% at τ=0.001
- Declines to ~55% at τ=1.5
- Confidence interval narrows slightly
2. **Mistral (7B)**:
- Starts at ~55% at τ=0.001
- Drops to ~50% at τ=1.5
- Confidence interval widens moderately
3. **Gemma 2 (2B)**:
- Starts at ~40% at τ=0.001
- Declines to ~35% at τ=1.5
- Confidence interval remains stable
4. **GPT-2 (163M)**:
- Starts at ~1% at τ=0.001
- Drops to ~0.5% at τ=1.5
- Confidence interval shows minimal variation
### Data Points (Approximate)
| Model | τ=0.001 | τ=0.25 | τ=0.5 | τ=0.75 | τ=1.0 | τ=1.5 |
|----------------|---------|--------|-------|--------|-------|-------|
| Qwen2 (7B) | 60 | 58 | 56 | 54 | 52 | 50 |
| Mistral (7B) | 55 | 53 | 50 | 48 | 45 | 43 |
| Gemma 2 (2B) | 40 | 38 | 35 | 33 | 30 | 28 |
| GPT-2 (163M) | 1 | 0.8 | 0.6 | 0.5 | 0.4 | 0.3 |
---
## Graph 3: Semantic Match (%) vs Temperature
### Axes
- **X-axis**: Temperature (τ) [0.001, 0.25, 0.5, 0.75, 1, 1.5]
- **Y-axis**: Semantic Match (%) [0, 5, 10, ..., 70]
### Legend
- **Placement**: Right side of graph
- **Color Mapping**: Same as Graph 1
### Key Trends
1. **Qwen2 (7B)**:
- Starts at ~70% at τ=0.001
- Declines to ~60% at τ=1.5
- Confidence interval narrows slightly
2. **Mistral (7B)**:
- Starts at ~65% at τ=0.001
- Drops to ~55% at τ=1.5
- Confidence interval widens moderately
3. **Gemma 2 (2B)**:
- Starts at ~50% at τ=0.001
- Declines to ~40% at τ=1.5
- Confidence interval remains stable
4. **GPT-2 (163M)**:
- Starts at ~5% at τ=0.001
- Drops to ~4% at τ=1.5
- Confidence interval shows minimal variation
### Data Points (Approximate)
| Model | τ=0.001 | τ=0.25 | τ=0.5 | τ=0.75 | τ=1.0 | τ=1.5 |
|----------------|---------|--------|-------|--------|-------|-------|
| Qwen2 (7B) | 70 | 68 | 65 | 63 | 61 | 60 |
| Mistral (7B) | 65 | 63 | 60 | 58 | 55 | 53 |
| Gemma 2 (2B) | 50 | 48 | 45 | 43 | 40 | 38 |
| GPT-2 (163M) | 5 | 4.5 | 4 | 3.5 | 3 | 2.5 |
---
## Observations
1. **Temperature Sensitivity**:
- All models show performance degradation as temperature increases
- Larger models (Qwen2, Mistral) maintain higher performance across τ ranges
2. **Model Hierarchy**:
- Qwen2 > Mistral > Gemma 2 > GPT-2 in all metrics
- Performance gaps widen at higher τ values
3. **Confidence Intervals**:
- Wider intervals at higher τ values suggest increased uncertainty
- GPT-2 shows the most stable confidence intervals despite lowest performance
4. **Performance Plateaus**:
- All models exhibit diminishing returns beyond τ=0.5
- GPT-2 shows near-linear decline across all τ values
---
## Critical Notes
- All graphs use identical τ ranges and scaling
- Shaded areas represent 95% confidence intervals
- No textual annotations present beyond axis labels and legends
- No non-English text detected in the image
</details>
Figure 6: We evaluate several pretrained language models on the SQuADv2 question-answering reading comprehension dataset, a task consisting of answering a question given some snippet from a Wikipedia article. We report F1, “Exact Match”, and “Semantic Match” scores of several different language models of varying size, from 163M parameters to 7B parameters, over several different temperatures. Semantic Match is calculated by using another LLM (llama3.1) to judge if two responses are equivalent, even if the exact strings slightly differ between the model output and the correct response. We also report 95% confidence intervals calculated by taking $± 1.96\sigma$ .
Toy Model Setting and Results. In addition, we develop a toy theoretical model to further study when transcendence is possible. This model involves a classification task with Gaussian input data and linearly separable classes. Experts label the data with noisy versions of the ground truth separator. We trained a linear model on a dataset labeled by random experts and observed the test accuracy for different temperature settings. The synthetic experiments demonstrated that transcendence occurs when expert diversity is high and temperature is low, aligning with our theoretical and empirical analysis in the chess domain.
<details>
<summary>2406.11741v4/x3.png Details</summary>

### Visual Description
# Technical Document Extraction: Accuracy vs. Temperature Analysis
## Overview
The image contains **four line graphs** comparing **accuracy** against **temperature** under varying **standard deviation (std)** conditions. Each graph includes a **red data line** and a **dashed black reference line** labeled "Best Expert." All graphs share identical axis labels and scales.
---
## Key Components
### Axis Labels
- **X-axis**: "Temperature" (range: 0.0 to 0.5, increments of 0.1)
- **Y-axis**: "Accuracy" (range: 0.0 to 1.0, increments of 0.2)
### Legends
- **Best Expert**: Dashed black horizontal line at **0.6 accuracy** (consistent across all graphs).
---
## Graph Analysis
### Graph 1: `std: 0.1`
- **Title**: "std: 0.1"
- **Data Line (Red)**:
- Starts at **~0.9 accuracy** at 0.0 temperature.
- Declines sharply to **~0.15 accuracy** at 0.5 temperature.
- **Best Expert Line**: Horizontal at **0.6 accuracy**.
### Graph 2: `std: 0.2`
- **Title**: "std: 0.2"
- **Data Line (Red)**:
- Starts at **~0.85 accuracy** at 0.0 temperature.
- Declines to **~0.1 accuracy** at 0.5 temperature.
- **Best Expert Line**: Horizontal at **0.6 accuracy**.
### Graph 3: `std: 0.4`
- **Title**: "std: 0.4"
- **Data Line (Red)**:
- Starts at **~0.8 accuracy** at 0.0 temperature.
- Declines to **~0.05 accuracy** at 0.5 temperature.
- **Best Expert Line**: Horizontal at **0.6 accuracy**.
### Graph 4: `std: 0.6`
- **Title**: "std: 0.6"
- **Data Line (Red)**:
- Starts at **~0.75 accuracy** at 0.0 temperature.
- Declines to **~0.0 accuracy** at 0.5 temperature.
- **Best Expert Line**: Horizontal at **0.6 accuracy**.
---
## Trends
1. **Data Line (Red)**:
- All graphs show a **monotonic decrease** in accuracy as temperature increases.
- Higher std values correlate with **steeper declines** in accuracy.
- At 0.5 temperature, accuracy drops below **0.2** for std ≥ 0.2.
2. **Best Expert Line**:
- Remains **constant at 0.6 accuracy** across all temperatures and std values.
- Acts as a **baseline reference** for comparison.
---
## Spatial Grounding
- **Legend Placement**: Upper-right corner of each graph.
- **Line Colors**:
- Red: Data line (no explicit legend label).
- Dashed Black: "Best Expert" (explicitly labeled).
---
## Notes
- No additional text, tables, or embedded diagrams are present.
- All graphs share identical axis scales and labeling conventions.
- The "Best Expert" line is **not temperature-dependent** and serves as a static benchmark.
</details>
Figure 7: Toy model for demonstrating transcendence. Input data is $d$ -dimensional Gaussian, with $d=100$ . Output is classification with $10$ classes. Ground-truth is generated by a linear function, i.e. $y=\arg\max_{i}W_{i}^{*}x$ for some $W^{*}\in\mathbb{R}^{10\times d}$ . We sample $k$ experts, with $k=5$ , to label the data, where the labels of each expert are generated by some $W\in\mathbb{R}^{10\times d}$ s.t. $W=W^{*}+\xi$ , where $\xi_{i,j}\sim\mathcal{N}(0,\sigma^{2})$ , for some standard deviation $\sigma$ . Namely, each expert labels the data with a noisy version of the ground truth separator, with noise std $\sigma$ . We then train a linear model on a dataset with $10K$ examples, where each example is labeled by a random expert. We plot the test accuracy, measured by the probability assigned to the correct class, for different choices of temperature, and compare to the best expert.
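A scaled-down version of this setup can be simulated with the standard library alone. The sketch below departs from the experiment in two labeled ways: it uses smaller dimensions for speed, and it skips training entirely, using the experts' empirical label mixture at each test point as an idealized stand-in for the learned model. All function names and default values are introduced here for illustration.

```python
import math
import random
from typing import List, Tuple


def argmax_class(W: List[List[float]], x: List[float]) -> int:
    """Predicted class of a linear classifier: argmax_i W_i . x"""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    return max(range(len(scores)), key=scores.__getitem__)


def temperature(probs: List[float], tau: float) -> List[float]:
    """softmax(log p / tau), computed in log space to avoid underflow to zero."""
    logs = [math.log(p) / tau if p > 0 else -math.inf for p in probs]
    top = max(logs)
    weights = [math.exp(v - top) for v in logs]
    z = sum(weights)
    return [w / z for w in weights]


def simulate(tau: float, sigma: float, d: int = 20, classes: int = 5,
             k: int = 5, n_test: int = 300, seed: int = 0) -> Tuple[float, float, float]:
    """Return (mixture accuracy at tau, best expert accuracy, mean expert accuracy)."""
    rng = random.Random(seed)
    W_star = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(classes)]
    # Each expert is the ground-truth separator plus i.i.d. Gaussian noise.
    experts = [[[w + rng.gauss(0, sigma) for w in row] for row in W_star]
               for _ in range(k)]
    mixture_acc, expert_hits = 0.0, [0] * k
    for _ in range(n_test):
        x = [rng.gauss(0, 1) for _ in range(d)]
        y = argmax_class(W_star, x)
        votes = [argmax_class(W, x) for W in experts]
        mixture = [votes.count(c) / k for c in range(classes)]
        # Accuracy is measured as probability mass on the correct class.
        mixture_acc += temperature(mixture, tau)[y]
        for e, v in enumerate(votes):
            expert_hits[e] += int(v == y)
    return (mixture_acc / n_test, max(expert_hits) / n_test,
            sum(expert_hits) / (k * n_test))


for tau in (0.001, 0.5, 1.0):
    print(tau, simulate(tau=tau, sigma=0.5))
```

At $\tau=1$ the mixture accuracy equals the mean expert accuracy by construction, while at very low $\tau$ it approaches the accuracy of a plurality vote among the experts. Whether that vote exceeds the best single expert depends on $\sigma$ and on how independent the experts' errors are.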
5 Related Work
Chess and AI.
Chess has been motivating AI research since the field began. In 1950, before anyone had used the term “artificial intelligence”, automated chess was explored by both Claude Shannon [26] and Alan Turing [32]. Arguably, this history goes back even further: the famed “mechanical turk” of the 18th century was a fraudulently automated chess player. These centuries of mechanical ambitions were finally realized in 1997, when world champion Garry Kasparov was defeated by IBM’s Deep Blue [3]. Since then, chess program developers have drawn on neural approaches, with the RL-based convolutional network AlphaZero [27] far surpassing prior world champion engines such as Stockfish [25]. Our chess model testbed is inspired by a number of existing approaches, including other models trained on Lichess data [18], and other transformer-based sequential chess agents [23, 5].
Diversity beats Strength.
Another historical thread in AI research is the strength of diverse learners. Since the development of ensemble methods that exploit learner diversity, including bagging [1], boosting [6], and model averaging [19], researchers have continued to articulate this insight across settings. Similar to our chess setting, diverse teams of Go-playing agents have been shown, both theoretically and empirically, to outperform solitary agents [9] and homogeneous teams [28], even when the alternative models individually outperform the diverse team members [17]. We draw a connection to this deep literature through our theory, which shows that training on an imitation learning objective and then performing low-temperature sampling implicitly applies the same principle of majority voting. Teacher diversity has also been explored in the machine learning literature. One related method is ensemble distillation [16], in which a model is trained with an additional objective to match a variety of weaker teacher models. Closer to our setting, ensemble self-training approaches [24] train a learner directly on the labels produced by varied teachers. Large language models supervised by smaller or less trained models are said to exhibit “weak to strong generalization” [2]. Overall, evidence continues to accrue that the general phenomenon we address is pervasive: that is, models can substantially improve over the experts that generate their training data.
Offline Reinforcement Learning.
Our work also draws connections to the Offline Reinforcement Learning [14] setting, where one attempts to learn a new policy $\pi$ that improves upon a fixed dataset generated by some behavior policy $\pi_{\beta}$ . However, our setting of imitation learning differs substantially from this literature, as we do not explicitly train our model on an RL objective that attempts to improve upon the dataset. Importantly, such an objective oftentimes introduces training instabilities [15] and also assumes reward labels. We defer a more extended discussion of related work to Appendix D.
6 Discussion and Future Work
This paper introduces the concept of transcendence. Our theoretical analysis shows that low-temperature sampling is key to achieving transcendence by denoising expert biases and consolidating diverse knowledge. We validate our findings empirically by training several chess models which, under low-temperature sampling, surpass the performance of the players who produced their training data, as well as further experiments in natural language question-answering and toy Gaussian models. We additionally highlight the necessity of dataset diversity for transcendence, emphasizing the role of varied expert perspectives.
Limitations.
While our work provides a strong foundation for understanding and achieving transcendence in generative models, several avenues for future research remain. Future work may investigate transcendence and its causes in domains and contexts beyond chess, such as natural language processing, computer vision, and text-to-video, to understand the generalizability of our findings. Additionally, our theoretical framework assumes that game conditions at test time match those seen during training; in order to extend our findings to cases of composition or reasoning, we must forgo this assumption.
Future Work.
Future work could also explore the practical implementations of transcendence, and ethical considerations in the broader context of deployed generative models. Ultimately, our findings lay the groundwork for leveraging generative models to not only match but exceed human expertise across diverse applications, pushing the theoretical boundaries of what generative models can achieve.
Broader Impact.
The possibility of “superintelligent” AGI has recently fueled many speculative hopes and fears. It is therefore possible that our work will be cited by concerned communities as evidence of a threat, but we would highlight that the denoising effect addressed in this paper does not offer any evidence for a model being able to produce novel solutions that a human expert would be incapable of devising. In particular, we do not present evidence that low temperature sampling leads to novel abstract reasoning, but just denoising of errors.
Acknowledgements
Sham Kakade acknowledges this work has been made possible in part by a gift from the Chan Zuckerberg Initiative Foundation to establish the Kempner Institute for the Study of Natural and Artificial Intelligence; support from the Office of Naval Research under award N00014-22-1-2377, and the National Science Foundation Grant under award #IIS 2229881.
References
- Breiman [1996] L. Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996. URL https://api.semanticscholar.org/CorpusID:47328136.
- Burns et al. [2023] C. Burns, P. Izmailov, J. H. Kirchner, B. Baker, L. Gao, L. Aschenbrenner, Y. Chen, A. Ecoffet, M. Joglekar, J. Leike, et al. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. arXiv preprint arXiv:2312.09390, 2023.
- Campbell et al. [2002] M. Campbell, A. J. Hoane, and F.-h. Hsu. Deep Blue. Artificial Intelligence, 134(1):57–83, Jan. 2002. ISSN 0004-3702. doi: 10.1016/S0004-3702(01)00129-1.
- Chen et al. [2021] L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch. Decision transformer: Reinforcement learning via sequence modeling, 2021.
- Feng et al. [2023] X. Feng, Y. Luo, Z. Wang, H. Tang, M. Yang, K. Shao, D. Mguni, Y. Du, and J. Wang. ChessGPT: Bridging Policy Learning and Language Modeling. Advances in Neural Information Processing Systems, 36:7216–7262, Dec. 2023.
- Freund and Schapire [1999] Y. Freund and R. E. Schapire. A short introduction to boosting, 1999. URL https://api.semanticscholar.org/CorpusID:9621074.
- Glickman [2012] M. E. Glickman. Example of the Glicko-2 system. Boston University, 28, 2012.
- Janner et al. [2021] M. Janner, Q. Li, and S. Levine. Offline reinforcement learning as one big sequence modeling problem, 2021.
- Jiang et al. [2014] A. Jiang, L. Soriano Marcolino, A. D. Procaccia, T. Sandholm, N. Shah, and M. Tambe. Diverse randomized agents vote to win. Advances in Neural Information Processing Systems, 27, 2014.
- Kakade and Langford [2002] S. M. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, 2002. URL https://api.semanticscholar.org/CorpusID:31442909.
- Karpathy [2022] A. Karpathy. NanoGPT. https://github.com/karpathy/nanoGPT, 2022.
- Karvonen [2024] A. Karvonen. Emergent world models and latent variable estimation in chess-playing language models. arXiv preprint arXiv:2403.15498, 2024.
- Kingma and Ba [2014] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Levine et al. [2020] S. Levine, A. Kumar, G. Tucker, and J. Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
- Li et al. [2023] J. Li, E. Zhang, M. Yin, Q. Bai, Y.-X. Wang, and W. Y. Wang. Offline reinforcement learning with closed-form policy improvement operators. In International Conference on Machine Learning, pages 20485–20528. PMLR, 2023.
- Lin et al. [2020] T. Lin, L. Kong, S. U. Stich, and M. Jaggi. Ensemble distillation for robust model fusion in federated learning. Advances in Neural Information Processing Systems, 33:2351–2363, 2020.
- Marcolino et al. [2013] L. S. Marcolino, A. X. Jiang, and M. Tambe. Multi-agent team formation: Diversity beats strength? In IJCAI, volume 13, 2013.
- McIlroy-Young et al. [2020] R. McIlroy-Young, S. Sen, J. Kleinberg, and A. Anderson. Aligning Superhuman AI with Human Behavior: Chess as a Model System. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1677–1687, Aug. 2020. doi: 10.1145/3394486.3403219.
- McMahan et al. [2023] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data, 2023.
- Mnih et al. [2015] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- Munos et al. [2016] R. Munos, T. Stepleton, A. Harutyunyan, and M. Bellemare. Safe and efficient off-policy reinforcement learning. Advances in neural information processing systems, 29, 2016.
- Nasu [2018] Y. Nasu. Efficiently Updatable Neural-Network-based Evaluation Functions for Computer Shogi, 2018.
- Noever et al. [2020] D. Noever, M. Ciolino, and J. Kalin. The Chess Transformer: Mastering Play using Generative Language Models, Sept. 2020.
- Odonnat et al. [2024] A. Odonnat, V. Feofanov, and I. Redko. Leveraging ensemble diversity for robust self-training in the presence of sample selection bias, 2024.
- Pete [2018] Pete. AlphaZero Crushes Stockfish In New 1,000-Game Match. https://www.chess.com/news/view/updated-alphazero-crushes-stockfish-in-new-1-000-game-match, Dec. 2018.
- Shannon [1950] C. E. Shannon. XXII. Programming a computer for playing chess. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 41(314):256–275, Mar. 1950. ISSN 1941-5982, 1941-5990. doi: 10.1080/14786445008521796.
- Silver et al. [2017] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis. Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm, Dec. 2017.
- Soriano Marcolino et al. [2014] L. Soriano Marcolino, H. Xu, A. Xin Jiang, M. Tambe, and E. Bowring. Give a Hard Problem to a Diverse Team: Exploring Large Action Spaces. Proceedings of the AAAI Conference on Artificial Intelligence, 28(1), June 2014. ISSN 2374-3468, 2159-5399. doi: 10.1609/aaai.v28i1.8880.
- The Stockfish developers [2024] The Stockfish developers (see AUTHORS file). Stockfish, 2024. URL https://stockfishchess.org/.
- Toshniwal et al. [2022] S. Toshniwal, S. Wiseman, K. Livescu, and K. Gimpel. Chess as a Testbed for Language Model State Tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 11385–11393, June 2022. doi: 10.1609/aaai.v36i10.21390.
- Touati et al. [2020] A. Touati, A. Zhang, J. Pineau, and P. Vincent. Stable policy optimization via off-policy divergence regularization. In Conference on Uncertainty in Artificial Intelligence, pages 1328–1337. PMLR, 2020.
- Turing [2004] A. Turing. Chess (1953). In B. J. Copeland, editor, The Essential Turing, page 0. Oxford University Press, Sept. 2004. ISBN 978-0-19-825079-1. doi: 10.1093/oso/9780198250791.003.0023.
- Uma et al. [2021] A. N. Uma, T. Fornaciari, D. Hovy, S. Paun, B. Plank, and M. Poesio. Learning from disagreement: A survey. Journal of Artificial Intelligence Research, 72:1385–1470, 2021.
- Van der Maaten and Hinton [2008] L. Van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.
- Vaswani et al. [2017] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Xie et al. [2020] Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le. Self-training with noisy student improves ImageNet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10687–10698, 2020.
- Zhang et al. [2022] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
Appendix A Proofs
Here we prove Proposition 1: transcendence cannot occur through imitation learning alone in our setting, where all experts are sampled uniformly across the input distribution.
* Proof*
From linearity of the expectation
$$
R_{p_{\mathrm{test}}}(\hat{f})=\mathbb{E}_{x\sim p_{\mathrm{test}}}\left[r_{x}(\overline{f})\right]
$$
∎
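The intermediate steps behind this argument can be sketched as follows. This is a reconstruction under two assumptions that this appendix does not restate: that $\overline{f}=\frac{1}{k}\sum_{i=1}^{k}f_{i}$ is the uniform mixture of the $k$ experts, and that the learned model $\hat{f}$ coincides with it. Since $r_{x}(f)=\sum_{y}r(x,y)f(y|x)$ is linear in $f$,

$$
R_{p_{\mathrm{test}}}(\hat{f})=\mathbb{E}_{x\sim p_{\mathrm{test}}}\left[r_{x}(\overline{f})\right]=\mathbb{E}_{x\sim p_{\mathrm{test}}}\left[\frac{1}{k}\sum_{i=1}^{k}r_{x}(f_{i})\right]=\frac{1}{k}\sum_{i=1}^{k}R_{p_{\mathrm{test}}}(f_{i})\leq\max_{i}R_{p_{\mathrm{test}}}(f_{i}).
$$

Under these assumptions the mixture can at best match the best single expert, since an average never exceeds its largest term.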
We now prove Proposition 2: if the arg-max prediction is better than the best expert, then transcendence is possible with low-temperature sampling.
* Proof*
Observe that for all $q$, it holds that $\lim_{\tau\to 0}\mathrm{softmax}(q;\tau)=\arg\max(q)$. Therefore, for all $x$
$$
\lim_{\tau\to 0}r_{x}(\hat{f}_{\tau})=\lim_{\tau\to 0}\sum_{y}r(x,y)\cdot\hat{f}_{\tau}(y|x)=\sum_{y}r(x,y)\hat{f}_{\max}(y|x)=r_{x}(\hat{f}_{\max})
$$
and so,
$$
\lim_{\tau\to 0}R_{p_{\mathrm{test}}}(\hat{f}_{\tau})=\lim_{\tau\to 0}\mathbb{E}_{x\sim p_{\mathrm{test}}}\left[r_{x}(\hat{f}_{\tau})\right]=\mathbb{E}_{x\sim p_{\mathrm{test}}}\left[r_{x}(\hat{f}_{\max})\right]=R_{p_{\mathrm{test}}}(\hat{f}_{\max}).
$$
The required result follows immediately. ∎
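The limit used in this proof can be checked numerically. The following is a minimal sketch, not the paper's code: the score vector `q` and the reward vector `rewards` are hypothetical values for a single input $x$, and `softmax` follows the temperature-softmax definition $e^{q_{i}/\tau}/\sum_{j}e^{q_{j}/\tau}$.

```python
import math

def softmax(q, tau):
    """Temperature softmax: exp(q_i / tau) / sum_j exp(q_j / tau)."""
    m = max(q)
    # Subtracting the max leaves the result unchanged and avoids overflow.
    weights = [math.exp((qi - m) / tau) for qi in q]
    total = sum(weights)
    return [w / total for w in weights]

def expected_reward(policy, rewards):
    """r_x(f) = sum_y r(x, y) * f(y | x) for one fixed input x."""
    return sum(p * r for p, r in zip(policy, rewards))

# Hypothetical scores and rewards for three actions; the middle action is best.
q = [1.0, 1.5, 0.5]
rewards = [0.2, 1.0, 0.0]

for tau in (1.0, 0.1, 0.01, 0.001):
    policy = softmax(q, tau)
    print(f"tau={tau}: policy={[round(p, 4) for p in policy]}, "
          f"expected reward={expected_reward(policy, rewards):.4f}")
```

As $\tau$ shrinks, the printed policy concentrates on the highest-scoring action and the expected reward approaches the reward of the arg-max action, which is the pointwise statement that the proof then averages over $x$.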
To prove Proposition 3, we directly use the result in Proposition 2.
* Proof*
Notice that for this expert, $\operatorname*{arg\,max}(f(\cdot|x))=f^{*}(\cdot|x)$, which achieves higher reward than $f$. Therefore, Proposition 2 implies that we achieve transcendence in the setting where all the data is generated by a single expert $f$. ∎
Next, we prove that low-temperature sampling can be thought of as performing a majority vote [1, 6] between the experts:
**Proposition 5**
*Let $\mathbf{z}=[z_{1},z_{2},\ldots,z_{n}]$ be a vector and $\tau>0$ be a temperature parameter. Define the softmax function as
$$
\sigma_{\tau}(z_{i})=\frac{e^{z_{i}/\tau}}{\sum_{j=1}^{n}e^{z_{j}/\tau}}.
$$
Then, as $\tau\to 0^{+}$, the limit of the softmax function is given by
$$
\lim_{\tau\to 0^{+}}\sigma_{\tau}(z_{i})=\begin{cases}\displaystyle\frac{1}{k},&\text{if }z_{i}=z_{\max},\\
0,&\text{otherwise},\end{cases}
$$
where $z_{\max}=\max_{1\leq j\leq n}z_{j}$, and $k$ is the number of indices $i$ such that $z_{i}=z_{\max}$.*
* Proof*
Let $z_{\max}=\max_{1\leq j\leq n}z_{j}$, and define the set
$$
S=\{i\mid z_{i}=z_{\max}\},
$$
with cardinality $k=|S|$. For each $i$, let
$$
\Delta_{i}=z_{i}-z_{\max}\leq 0.
$$
Then the softmax function becomes
$$
\sigma_{\tau}(z_{i})=\frac{e^{(z_{\max}+\Delta_{i})/\tau}}{\sum_{j=1}^{n}e^{(z_{\max}+\Delta_{j})/\tau}}=\frac{e^{\Delta_{i}/\tau}}{\sum_{j=1}^{n}e^{\Delta_{j}/\tau}},
$$
since $e^{z_{\max}/\tau}$ cancels out in the numerator and denominator. We analyze the behavior of the terms as $\tau\to 0^{+}$. For $i\in S$ we have $\Delta_{i}=0$ and so:
$$
e^{\Delta_{i}/\tau}=e^{0}=1.
$$
For $i\notin S$ we have $\Delta_{i}<0$, so
$$
\lim_{\tau\to 0^{+}}e^{\Delta_{i}/\tau}=0.
$$
Therefore, the denominator simplifies to
$$
\lim_{\tau\to 0^{+}}\sum_{j=1}^{n}e^{\Delta_{j}/\tau}=\sum_{j\in S}\lim_{\tau\to 0^{+}}e^{\Delta_{j}/\tau}+\sum_{j\notin S}\lim_{\tau\to 0^{+}}e^{\Delta_{j}/\tau}=\sum_{j\in S}1+\sum_{j\notin S}0=k.
$$
Similarly, the numerator becomes
$$
\lim_{\tau\to 0^{+}}e^{\Delta_{i}/\tau}=\begin{cases}1,&\text{if }i\in S,\\
0,&\text{if }i\notin S.\end{cases}
$$
Thus, for each $i$,
$$
\lim_{\tau\to 0^{+}}\sigma_{\tau}(z_{i})=\frac{\lim_{\tau\to 0^{+}}e^{\Delta_{i}/\tau}}{\lim_{\tau\to 0^{+}}\sum_{j=1}^{n}e^{\Delta_{j}/\tau}}=\begin{cases}\displaystyle\frac{1}{k},&\text{if }i\in S,\\
0,&\text{if }i\notin S.\end{cases}
$$
This concludes the proof. ∎
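As a concrete instance of the tie case, consider the illustrative vector $\mathbf{z}=[2,2,1]$ (chosen here for the example, not taken from the experiments). Then $z_{\max}=2$, $S=\{1,2\}$, $k=2$, and $\Delta=[0,0,-1]$, so

$$
\sigma_{\tau}(z_{1})=\sigma_{\tau}(z_{2})=\frac{1}{2+e^{-1/\tau}}\xrightarrow[\tau\to 0^{+}]{}\frac{1}{2},\qquad\sigma_{\tau}(z_{3})=\frac{e^{-1/\tau}}{2+e^{-1/\tau}}\xrightarrow[\tau\to 0^{+}]{}0.
$$

The two tied entries split the mass equally, in the same way that a majority vote with a tie would be broken uniformly among the top-voted options.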
Finally, we prove Proposition 4: transcendence can occur with multiple experts if the test distribution $p_{\mathrm{test}}$ is spread across multiple disjoint subsets $\mathcal{X}_{i}$ of $\mathcal{X}$.
* Proof*
In this case, observe that for all $i$
$$
R_{p_{\mathrm{test}}}(f_{i})=p_{\mathrm{test}}(\mathcal{X}_{i})\cdot\mathbb{E}_{x\sim p_{\mathrm{test}}|_{\mathcal{X}_{i}}}r_{x}(f^{*})+p_{\mathrm{test}}(\mathcal{X}\setminus\mathcal{X}_{i})\cdot\mathbb{E}_{x\sim\overline{p}|_{\mathcal{X}\setminus\mathcal{X}_{i}}}\left[\mathbb{E}_{y\sim\mathrm{Uni}(\mathcal{Y})}r(x,y)\right]
$$
Therefore, we get that for all $x$
$$
\hat{f}(y|x)=\frac{1}{k}\sum_{j=1}^{k}f_{j}(y|x)=\frac{k-1}{k}\cdot\frac{1}{\left\lvert\mathcal{Y}\right\rvert}+\frac{1}{k}f^{*}(y|x)=\frac{k-1}{k\cdot\left\lvert\mathcal{Y}\right\rvert}+\frac{1}{k\left\lvert Y_{x}^{*}\right\rvert}\cdot\mathbf{1}_{y\in Y_{x}^{*}}
$$
Thus, we get $\hat{f}_{\max}=f^{*}$, and the required result follows from Proposition 2. ∎
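The averaging step in this proof can be illustrated on a small instance. The sketch below is an assumption-laden toy, not the paper's setup: it fixes hypothetical values $k=3$ experts and $|\mathcal{Y}|=4$ actions, and looks at one input $x$ that lies in the first expert's subset, so that expert plays $f^{*}$ while the others play uniformly. The helper names are introduced here for illustration.

```python
def uniform(num_actions):
    """Uniform distribution over all actions."""
    return [1.0 / num_actions] * num_actions

def optimal(num_actions, best_actions):
    """f*(.|x): uniform over the set of optimal actions Y*_x."""
    return [1.0 / len(best_actions) if y in best_actions else 0.0
            for y in range(num_actions)]

def average(dists):
    """Uniform mixture of the experts' distributions at one input."""
    k = len(dists)
    return [sum(d[y] for d in dists) / k for y in range(len(dists[0]))]

def arg_max_policy(dist, tol=1e-12):
    """Zero-temperature limit: uniform over the maximisers of dist."""
    top = max(dist)
    winners = [y for y, p in enumerate(dist) if abs(p - top) <= tol]
    return [1.0 / len(winners) if y in winners else 0.0
            for y in range(len(dist))]

# Hypothetical toy instance: expert 0 is optimal at x, the rest are uniform.
k, num_actions, best = 3, 4, {2}
experts_at_x = [optimal(num_actions, best)] + [uniform(num_actions)] * (k - 1)

f_hat = average(experts_at_x)   # mixture the imitation learner fits
f_max = arg_max_policy(f_hat)   # low-temperature limit of f_hat

print("f_hat:", f_hat)
print("f_max:", f_max)
```

The optimal action receives the extra $\frac{1}{k\lvert Y_{x}^{*}\rvert}$ mass on top of the uniform floor, so the arg-max of the mixture recovers $f^{*}$ at this input even though no single expert is optimal everywhere.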
Appendix B Additional Denoising Visualizations
<details>
<summary>2406.11741v4/extracted/5922169/denoising_viz_1.png Details</summary>

### Visual Description
# Technical Document Extraction: Chessboard Reward Analysis
## 1. Legend & Color Mapping
- **Legend Title**: `Reward R_x(y)`
- **Color Gradient**:
- Blue → Purple (low to high reward)
- Spatial Position: Left vertical bar, spanning full height of image
- Coordinates: `[x=0, y=0]` to `[x=0, y=100%]` (normalized)
## 2. Main Chessboard Analysis
### A. τ = 1.0
- **Chessboard Layout**:
- Coordinates: a-h (columns), 1-8 (rows)
- Key Arrows:
- `e8` (blue square, low reward) → King (black)
- `d7` (purple square, high reward) → Queen (black)
- `b5` (blue square, low reward) → Rook (black)
- **Reward Distribution**:
- High-reward squares (purple): d7, f2
- Low-reward squares (blue): e8, b5
### B. τ = 0.75
- **Chessboard Layout**:
- Coordinates: a-h, 1-8
- Key Arrows:
- `e8` (blue square, low reward) → King (black)
- `d6` (purple square, high reward) → Queen (black)
- `a4` (blue square, low reward) → Bishop (black)
- **Reward Distribution**:
- High-reward squares (purple): d6, g1
- Low-reward squares (blue): e8, a4
### C. τ = 0.001
- **Chessboard Layout**:
- Coordinates: a-h, 1-8
- Key Arrows:
- `d4` (purple square, high reward) → King (black)
- `f5` (blue square, low reward) → Queen (black)
- `c3` (blue square, low reward) → Rook (black)
- **Reward Distribution**:
- High-reward squares (purple): d4, e7
- Low-reward squares (blue): f5, c3
## 3. Bar Graph Analysis
### A. τ = 1.0
- **X-Axis**: τ = 1.0
- **Y-Axis**: Reward (0.0–1.0)
- **Bars**:
- King (black): 0.2 (purple, high reward)
- Queen (black): 0.1 (blue, low reward)
- Rook (black): 0.05 (blue, low reward)
- **Arrow**: Points to King (highest reward)
### B. τ = 0.75
- **X-Axis**: τ = 0.75
- **Y-Axis**: Reward (0.0–1.0)
- **Bars**:
- King (black): 0.3 (purple, high reward)
- Queen (black): 0.15 (blue, low reward)
- Bishop (black): 0.05 (blue, low reward)
- **Arrow**: Points to King (highest reward)
### C. τ = 0.001
- **X-Axis**: τ = 0.001
- **Y-Axis**: Reward (0.0–1.0)
- **Bars**:
- King (black): 1.0 (purple, maximum reward)
- Queen (black): 0.0 (blue, no reward)
- Rook (black): 0.0 (blue, no reward)
- **Arrow**: Points to King (exclusive reward)
## 4. Key Trends
1. **τ Degradation**:
- As τ decreases, reward distribution becomes more concentrated on the King.
- At τ = 0.001, only the King has non-zero reward.
2. **Arrow Consistency**:
- Arrows on chessboards always point to the highest-reward square (purple).
- Bar graph arrows align with the highest-reward piece.
## 5. Spatial Grounding Verification
- **Legend**: Confirmed color-to-reward mapping matches chessboard squares and bar graph bars.
- **Arrow Coordinates**:
- τ = 1.0: `e8` (x=5, y=8), `d7` (x=4, y=7)
- τ = 0.75: `e8` (x=5, y=8), `d6` (x=4, y=6)
- τ = 0.001: `d4` (x=4, y=4), `f5` (x=6, y=5)
## 6. Data Table Reconstruction
| τ Value | Piece | Square | Reward | Color |
|---------|---------|--------|--------|--------|
| 1.0 | King | e8 | 0.2 | Blue |
| 1.0 | Queen | d7 | 0.3 | Purple |
| 0.75 | King | e8 | 0.3 | Purple |
| 0.75 | Queen | d6 | 0.4 | Purple |
| 0.001 | King | d4 | 1.0 | Purple |
## 7. Component Isolation
- **Header**: Legend (Reward R_x(y))
- **Main Chart**: Three chessboards with τ labels and arrows
- **Footer**: Three bar graphs with τ-specific reward distributions
## 8. Language Notes
- All text is in English. No non-English content detected.
</details>
Figure 8: An example where denoising helps black find the only correct move. White has pinned the black rook to the queen: any move where the rook does not move to e4 results in a heavy loss of material. As $\tau$ decreases, the expected reward increases substantially and converges onto the correct move.
<details>
<summary>2406.11741v4/extracted/5922169/denoising_viz_2.png Details</summary>

### Visual Description
# Technical Document Extraction: Chessboard Reward Analysis
## Legend
- **Title**: Reward \( R_x(y) \)
- **Color Gradient**: Blue (low reward) to Purple (high reward)
- **Placement**: Left vertical axis
## Chessboard Components
### Chessboard Layout
- **Grid**: 8x8 squares labeled a-h (columns) and 1-8 (rows)
- **Pieces**: Standard chess pieces (♔, ♕, ♗, ♘, ♖, ♙) in black and white
- **Key Observations**:
- **τ = 1.0**: Arrows point to squares with mixed reward values (purple, blue, dark blue)
- **τ = 0.75**: Arrows show reduced purple dominance, increased blue/dark blue
- **τ = 0.001**: Arrows predominantly purple, indicating high-reward focus
### Arrows and Reward Indicators
- **Arrow Colors**: Correspond to legend gradient (blue = low, purple = high)
- **Arrow Directions**: Represent move vectors (e.g., ♔↑ = up, ♔↓ = down)
- **Critical Data Points**:
- **τ = 1.0**:
- Purple arrow at e4 (high reward)
- Blue arrow at d5 (medium reward)
- Dark blue arrow at f3 (low reward)
- **τ = 0.75**:
- Purple arrow at e4 (high reward)
- Blue arrow at d5 (medium reward)
- Dark blue arrow at f3 (low reward)
- **τ = 0.001**:
- Purple arrow at e4 (high reward)
- Purple arrow at d5 (high reward)
- Purple arrow at f3 (high reward)
## Reward Distribution Bar Charts
### τ = 1.0
- **X-Axis**: Move directions (♔↑, ♔↓, ♔←, ♔→)
- **Y-Axis**: Reward frequency (0.0–1.0)
- **Bars**:
- ♔↑: ~0.2 (purple)
- ♔↓: ~0.3 (blue)
- ♔←: ~0.1 (dark blue)
- ♔→: ~0.05 (dark blue)
### τ = 0.75
- **X-Axis**: Move directions (♔↑, ♔↓, ♔←, ♔→)
- **Y-Axis**: Reward frequency (0.0–1.0)
- **Bars**:
- ♔↑: ~0.25 (purple)
- ♔↓: ~0.2 (blue)
- ♔←: ~0.15 (dark blue)
- ♔→: ~0.1 (dark blue)
### τ = 0.001
- **X-Axis**: Move directions (♔↑, ♔↓, ♔←, ♔→)
- **Y-Axis**: Reward frequency (0.0–1.0)
- **Bars**:
- ♔↑: ~1.0 (purple)
- ♔↓: ~0.0 (no bar)
- ♔←: ~0.0 (no bar)
- ♔→: ~0.0 (no bar)
## Spatial Grounding and Color Verification
- **Legend Position**: Left vertical axis (x=0, y=0 to y=1)
- **Color Consistency**:
- Purple arrows in chessboards match high-reward legend segment
- Blue arrows match medium-reward segment
- Dark blue arrows match low-reward segment
## Trend Verification
- **τ = 1.0**: Broad reward distribution across move directions
- **τ = 0.75**: Slight consolidation toward high-reward moves
- **τ = 0.001**: Extreme focus on high-reward moves (e.g., ♔↑ dominates)
## Component Isolation
1. **Legend**: Defines reward value mapping
2. **Chessboards**: Visualize move recommendations with color-coded arrows
3. **Reward Bars**: Quantify reward distribution per τ value
## Language Notes
- **Primary Language**: English
- **Secondary Elements**: Chess notation (e.g., ♔, ♕) and Greek letter τ (τ)
## Summary
The image illustrates how varying τ values influence reward-based move recommendations in chess. Higher τ (1.0) shows diverse reward distributions, while lower τ (0.001) prioritizes high-reward moves. Arrows and bar charts quantitatively represent these trends.
</details>
Figure 9: Another example where denoising helps avoid errors. Moving the queen to either d1 or h1 takes a bishop or rook, respectively, but loses the queen in the following turn. While queen to e5 does not put the queen in immediate danger, it allows white to push the pawn on f3 to d3, where it threatens the queen and is protected by the bishop on c1. The queen then must move out of danger, losing its opportunity to take the free pawn on h4 and giving white valuable space towards the center of the board. As $\tau$ decreases, the expected reward converges to the move queen to d4, taking the pawn and checking the black king.
<details>
<summary>2406.11741v4/extracted/5922169/denoising_viz_3.png Details</summary>

### Visual Description
# Technical Document Extraction: Chessboard Reward Analysis
## Legend
- **Label**: `Reward R_x(y)`
- **Color Gradient**: Blue (low reward) to Purple (high reward)
- **Spatial Position**: Left-aligned vertical bar spanning all panels
## Panel Structure
Three identical chessboard diagrams with varying reward sensitivity parameters (τ) and associated reward heatmaps/bar charts.
---
### Panel 1: τ = 1.0
#### Chessboard
- **Piece Positions**:
- Pawn (♙): e2
- Rook (♖): f1
- Queen (♕): e1
- **Heatmap Observations**:
- High reward (purple): e2 (pawn), e1 (queen)
- Low reward (blue): f1 (rook)
#### Bar Chart
- **Axes**:
- X-axis: Piece types (`♙`, `♖`, `♕`)
- Y-axis: `Reward` (0.0–1.0)
- **Data Points**:
- `♙`: 0.3 (purple)
- `♖`: 0.2 (blue)
- `♕`: 0.1 (purple)
- **Trend**: Pawn and queen rewards dominate; rook reward is minimal.
---
### Panel 2: τ = 0.75
#### Chessboard
- **Piece Positions**:
- Pawn (♙): e2
- Rook (♖): f1
- Queen (♕): e1
- **Heatmap Observations**:
- High reward (purple): e2 (pawn), e1 (queen)
- Low reward (blue): f1 (rook)
#### Bar Chart
- **Axes**:
- X-axis: Piece types (`♙`, `♖`, `♕`)
- Y-axis: `Reward` (0.0–1.0)
- **Data Points**:
- `♙`: 0.4 (purple)
- `♖`: 0.15 (blue)
- `♕`: 0.05 (purple)
- **Trend**: Pawn reward increases slightly; rook reward decreases further.
---
### Panel 3: τ = 0.001
#### Chessboard
- **Piece Positions**:
- Pawn (♙): e2
- Rook (♖): f1
- Queen (♕): e1
- **Heatmap Observations**:
- High reward (purple): e2 (pawn), e1 (queen)
- Low reward (blue): f1 (rook)
#### Bar Chart
- **Axes**:
- X-axis: Piece types (`♙`, `♖`, `♕`)
- Y-axis: `Reward` (0.0–1.0)
- **Data Points**:
- `♙`: 0.8 (purple)
- `♖`: 0.2 (blue)
- `♕`: 1.0 (purple)
- **Trend**: Queen reward becomes dominant; pawn reward increases significantly.
---
## Key Observations
1. **Reward Sensitivity**:
- Lower τ values (e.g., τ = 0.001) amplify rewards for high-value pieces (queen).
- Rook rewards remain consistently low across all τ values.
2. **Spatial Correlation**:
- High-reward squares (purple) align with pawn and queen positions.
- Low-reward squares (blue) align with rook positions.
3. **Legend Consistency**:
- All color mappings (blue/purple) across panels match the legend's gradient.
## Language Notes
- **Primary Language**: English
- **Secondary Language**: None detected.
## Spatial Grounding
- Legend: `[x=0, y=0]` (left edge, spanning vertically)
- Chessboard Coordinates: Standard algebraic notation (e.g., e2 = column e, row 2)
</details>
Figure 10: In this setup, a higher temperature shows two plausible moves for the black rook: g1 or f1. As the temperature decreases, the expected reward converges to g1. If the black rook were to move to f1, the white rook would take the black rook, blocking the black pawn on f2 from promoting and protecting the promotion square from the h2 pawn. If the rook were to move to g1, on the other hand, it would open the promotion square from the h2 pawn without being at any immediate risk. If white responded by moving its bishop to g2, protecting the promotion squares from both of the advanced black pawns, black could respond by taking the rook on a1, gaining significant material.
Appendix C Intuition of low temperature sampling inducing transcendence
To build intuition for the primary mechanism of transcendence explored in this paper, we give the following toy progression of distributions, which illustrates how low-temperature sampling can induce transcendence through majority voting. Here, the middle purple action represents the correct, high-reward output, whilst the left and right actions are low-reward outputs. We plot the probability of each output as a label on the x-axis.
<details>
<summary>2406.11741v4/extracted/5922169/intuition1.png Details</summary>

### Visual Description
# Technical Document Analysis of Bar Chart
## 1. Labels and Axis Titles
- **Y-Axis**: Labeled `ℙ(Y | X)` (probability of Y given X), scaled from 0 to 100 in increments of 25.
- **X-Axis**: Displays categorical values: `60`, `40`, and `0` (no bar present at `0`).
- **Legend**: Located at the **bottom-right** of the chart, mapping colors to labels:
- Dark blue: `60`
- Purple: `40`
## 2. Key Trends and Data Points
- **Bar Heights**:
- Dark blue bar (value `60`): Reaches approximately **60** on the Y-axis.
- Purple bar (value `40`): Reaches approximately **40** on the Y-axis.
- **Trend**: Probability decreases from `60` to `40`; no data represented at `0`.
## 3. Legend and Color Mapping
- **Legend Position**: Bottom-right corner.
- **Color Consistency**:
- Dark blue bar corresponds to `60` (matches legend).
- Purple bar corresponds to `40` (matches legend).
## 4. Chart Components
- **Main Chart Area**: Contains two bars (dark blue and purple) aligned with X-axis categories.
- **Axes**:
- Y-axis: Vertical, labeled `ℙ(Y | X)`.
- X-axis: Horizontal, labeled with discrete values (`60`, `40`, `0`).
- **Gridlines**: Subtle gridlines visible in the background for reference.
## 5. Missing Data
- No bar present at `X = 0`, indicating either:
- Absence of data for this category.
- Probability of `ℙ(Y | X = 0)` is zero or undefined.
## 6. Spatial Grounding
- **Legend Coordinates**: Bottom-right (exact pixel position unspecified in image).
- **Bar Alignment**: Bars centered over their respective X-axis labels.
## 7. Trend Verification
- **Dark Blue Series (`60`)**: Single bar at `X = 60`, height ~60.
- **Purple Series (`40`)**: Single bar at `X = 40`, height ~40.
- **No Data at `X = 0`**: Confirmed by absence of bar.
## 8. Component Isolation
- **Regions**:
- **Header**: None (no title or subtitle).
- **Main Chart**: Bars and axes dominate the visual space.
- **Footer**: Legend occupies the bottom-right.
## 9. Additional Notes
- **Language**: All text is in English (no non-English content detected).
- **Data Table**: No embedded table present; data represented visually via bars.
## Conclusion
The chart visualizes a probability distribution (`ℙ(Y | X)`) for two discrete X-values (`60` and `40`), with no data for `X = 0`. Probability decreases as X decreases from 60 to 40.
</details>
Figure 11: The first expert output distribution. Although it puts non-negligible mass on the purple, high-reward action, it still samples a low-reward action the majority of the time.
<details>
<summary>2406.11741v4/extracted/5922169/intuition2.png Details</summary>

### Visual Description
# Technical Document Extraction: Bar Chart Analysis
## Chart Type
- **Bar Chart** with two vertical bars.
## Axis Labels
- **Y-Axis**: Labeled `P(Y | X)` with a range from 0 to 100 (in increments of 25).
- **X-Axis**: Labeled with numerical markers at 0, 40, and 60.
## Legend
- **Position**: Right side of the chart.
- **Entries**:
- **Purple**: Corresponds to `P(Y | X=40)`.
- **Blue**: Corresponds to `P(Y | X=60)`.
## Data Points
1. **Purple Bar**:
- **X-Coordinate**: 40.
- **Y-Value**: Approximately 40 (height of the bar).
2. **Blue Bar**:
- **X-Coordinate**: 60.
- **Y-Value**: Approximately 60 (height of the bar).
## Trends
- The blue bar (`P(Y | X=60)`) is taller than the purple bar (`P(Y | X=40)`), indicating a higher value for `P(Y | X=60)` compared to `P(Y | X=40)`.
- No additional trends observed due to the limited number of data points.
## Spatial Grounding
- **Legend Position**: Right-aligned, outside the main chart area.
- **Bar Placement**:
- Purple bar centered at x=40.
- Blue bar centered at x=60.
## Component Isolation
- **Header**: No explicit title or header text.
- **Main Chart**: Two bars with grid lines (light gray background, white grid).
- **Footer**: No footer elements.
## Notes
- The chart lacks a title, axis units, or explicit numerical annotations for the bars.
- The y-axis label `P(Y | X)` suggests a conditional probability or proportion, but the scale (0–100) implies a percentage or normalized value.
- The x-axis includes 0 as a placeholder, but no data is associated with it.
## Conclusion
The chart visually compares two conditional probabilities (`P(Y | X=40)` and `P(Y | X=60)`) using bar heights. The blue bar (`X=60`) exceeds the purple bar (`X=40`) in value, though exact numerical values are not provided.
</details>
Figure 12: The second expert output distribution. Symmetric to the first expert, it also puts non-negligible mass on the purple, high-reward action. However, it samples the low-reward action on the right the majority of the time.
<details>
<summary>2406.11741v4/extracted/5922169/intuition3.png Details</summary>

### Visual Description
# Technical Document Extraction: Bar Chart Analysis
## Chart Overview
The image depicts a **bar chart** with three vertical bars representing probabilistic values. The chart lacks a legend, but color coding is used to differentiate data series.
---
### Axis Labels and Markers
- **Y-Axis**:
- Title: `ℙ(Y | X)` (probability of Y given X)
- Range: 0 to 100 (in increments of 25)
- Units: Probability (unitless)
- **X-Axis**:
- Categories: `30`, `40`, `30` (no explicit labels beyond numerical values)
- Position: Bottom of chart
---
### Data Series and Bars
1. **First Bar**
- **X Position**: 30
- **Y Value**: ~30 (height reaches 30 on Y-axis)
- **Color**: Blue
2. **Second Bar**
- **X Position**: 40
- **Y Value**: ~40 (height reaches 40 on Y-axis)
- **Color**: Purple
3. **Third Bar**
- **X Position**: 30
- **Y Value**: ~30 (height reaches 30 on Y-axis)
- **Color**: Blue
---
### Trends and Observations
- **Trend Verification**:
- The middle bar (X=40) exhibits the highest probability (~40), while the first and third bars (X=30) share identical lower probabilities (~30).
- No upward/downward slope observed (static bars).
- **Color Consistency**:
- Blue corresponds to X=30 (first and third bars).
- Purple corresponds to X=40 (second bar).
- No legend present to confirm color mappings; inferred from bar placement.
---
### Spatial Grounding
- **Legend**: Not present in the image.
- **Bar Placement**:
- X=30: Left and right edges of chart.
- X=40: Center of chart.
---
### Component Isolation
1. **Header**: No header text or title.
2. **Main Chart**:
- Three bars with distinct colors and positions.
- Gridlines visible (horizontal at Y=25, 50, 75, 100; vertical at X=30, 40).
3. **Footer**: No footer text or annotations.
---
### Data Table Reconstruction
| X Value | Y Value | Color |
|---------|---------|--------|
| 30 | 30 | Blue |
| 40 | 40 | Purple |
| 30 | 30 | Blue |
---
### Notes
- The chart lacks explicit textual annotations beyond axis labels.
- No embedded text or diagrams beyond the bars and gridlines.
- Probability values are inferred from bar heights relative to the Y-axis scale.
</details>
Figure 13: By taking the average of the first and second expert, we observe that this distribution now puts more mass on the correct action than on any other single action.
<details>
<summary>2406.11741v4/extracted/5922169/intuition4.png Details</summary>

### Visual Description
# Technical Document Extraction: Bar Chart Analysis
## 1. Axis Labels and Titles
- **Vertical Axis**: Labeled `ℙ(Y | X)` with a range from `0` to `100` in increments of `25`.
- **Horizontal Axis**: Categorical labels: `5`, `90`, `5` (repeated).
## 2. Data Points and Categories
- **Categories**: Three distinct groups labeled `5`, `90`, and `5` on the horizontal axis.
- **Values**:
- First category (`5`): Bar height ≈ `5` (dark blue).
- Second category (`90`): Bar height ≈ `90` (purple).
- Third category (`5`): Bar height ≈ `5` (dark blue).
## 3. Color Coding and Legend
- **Legend**: Not explicitly visible in the image. However, inferred color associations:
- `Dark Blue`: Corresponds to the first and third categories (`5`).
- `Purple`: Corresponds to the second category (`90`).
## 4. Spatial Grounding
- **Legend Placement**: Not applicable (no legend present in the image).
- **Bar Positions**:
- First bar: `[x=5, y≈5]`.
- Second bar: `[x=90, y≈90]`.
- Third bar: `[x=5, y≈5]`.
## 5. Trend Verification
- **Visual Trends**:
- The second category (`90`) exhibits a **dominant upward trend**, with its bar reaching near the maximum vertical axis value (`90`).
- The first and third categories (`5`) show **minimal values**, with bars barely above the baseline (`y≈5`).
## 6. Component Isolation
- **Regions**:
- **Header**: No explicit header text.
- **Main Chart**: Bar chart with three vertical bars and labeled axes.
- **Footer**: No explicit footer text.
## 7. Data Table Reconstruction
- **Structure**:
| Category | Value (ℙ(Y \| X)) | Color |
|----------|------------------|-----------|
| 5 | ~5 | Dark Blue |
| 90 | ~90 | Purple |
| 5 | ~5 | Dark Blue |
## 8. Additional Notes
- **Language**: All textual elements are in English.
- **Missing Elements**: No embedded text, data tables, or legends explicitly present in the image.
- **Key Insight**: The chart emphasizes a stark contrast between the second category (`90`) and the others (`5`), suggesting a probabilistic or categorical relationship between `X` and `Y`.
## 9. Final Validation
- All axis labels, data points, and trends are cross-referenced for consistency. No discrepancies detected between visual and textual elements.
</details>
Figure 14: Finally, by setting temperature $\tau$ to be $<1$ , more weight is shifted towards the high probability action, leading to a gain in the expected reward.
Appendix D Further Related Work
D.1 Label Disagreement
Label disagreement in training data can improve models in practice. Xie et al. [36] empirically show that adding random noise to teacher-generated labels can improve a student model. Uma et al. [33] survey the literature on human interannotator disagreement and find a trend of improvements when models are trained on the full set of disagreeing labels rather than on majority-vote labels or only on data where labelers agree. Our theoretical claims build on these findings by showing that the learner can even improve on the original, diverse labelers.
D.2 Offline Reinforcement Learning
Although most Offline Reinforcement Learning algorithms train on an RL objective, perhaps most similar to our work are Decision Transformer [4] and Trajectory Transformer [8], prior models trained purely on sequence prediction over trajectories. Most notably, Decision Transformer also exhibits a form of transcendence different from the one explored in this paper: by conditioning the trained transformer on the performance of the trajectory, one can prompt the model at inference time to perform better than the best trajectory seen during training. This remains another promising direction in which to explore transcendence.
Interestingly, an analogue to low-temperature sampling has also been noticed and exploited by Reinforcement Learning practitioners in the context of off-policy learning, where the exploration policy $\pi_{E}$ differs from the final learned target policy $\pi_{T}$ . Often $\pi_{T}$ is simply set to a greedy version of $\pi_{E}$ [21], for example by having $\pi_{T}$ take the $\operatorname*{arg\,max}$ action of $\pi_{E}$ , which we note is directly equivalent to setting the temperature to 0.
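To make the temperature-0 equivalence concrete, here is a minimal sketch in plain Python. The three-action policy `pi_E` and the helper `apply_temperature` are hypothetical, chosen only for illustration, and are not taken from the released code. The helper re-weights a distribution as $p_i^{1/\tau}$ and renormalizes, which is the same as applying a softmax to the log-probabilities divided by $\tau$.

```python
import math


def apply_temperature(probs, tau):
    """Return the distribution proportional to p_i ** (1 / tau).

    tau = 1 leaves the distribution unchanged; tau = 0 is treated as the
    limiting case, i.e. the greedy (arg max) policy.
    """
    if tau == 0:
        # Limit tau -> 0: all mass on the most likely action(s).
        # Ties, if any, share the mass uniformly (an illustrative choice).
        best = max(probs)
        winners = [1.0 if p == best else 0.0 for p in probs]
        total = sum(winners)
        return [w / total for w in winners]
    # Work in log space and subtract the max for numerical stability.
    logits = [math.log(p) / tau if p > 0 else float("-inf") for p in probs]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]


# Hypothetical exploration policy over three actions (not from the paper).
pi_E = [0.2, 0.5, 0.3]

for tau in (1.0, 0.5, 0.1, 0.0):
    pi_T = apply_temperature(pi_E, tau)
    print(f"tau={tau}: {[round(p, 3) for p in pi_T]}")
```

As $\tau$ decreases, the probability of the most likely action grows monotonically, and at $\tau = 0$ the target policy coincides with the greedy policy described above.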
Appendix E Training Details
We list the full set of training hyperparameters here. We largely follow the hyperparameters of [37], but lower the batch size to $125K$ tokens, as we found training to remain stable at this level. We also release our code openly to support further research into transcendence; it builds on the wonderful work of Karvonen [12] and Karpathy [11].
| Hyperparameter (ChessFormer) | Value |
| --- | --- |
| Optimizer | AdamW [13] |
| Activation Function | ReLU |
| Mini-batch size | 125K tokens |
| Gradient Accumulation Steps | 1 |
| Transformer num. layers | 16 |
| Transformer num. heads | 8 |
| Transformer embedding dim. | 512 |
| Dropout | 0.0 |
| Learning Rate | 3e-4 |
| Number of gradient steps | 100K |
| Weight Decay | 0.1 |
| Critic hidden layers | 3 |
| Adam $\beta_{1}$ | 0.90 |
| Adam $\beta_{2}$ | 0.95 |
| Gradient Clip | 1.0 |
| Cosine Learning Rate | True |
| Warmup Iterations | 2000 |
| Minimum Learning Rate | 3e-5 |
| Learning Rate Decay Iterations | 400K |
| Tensor datatype | bfloat16 |
Table 2: Hyperparameters for our ChessFormer model.
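As a rough illustration of how the learning-rate entries in Table 2 fit together, the sketch below assumes a linear warmup followed by cosine decay to the minimum learning rate. The table states only that a cosine schedule with warmup is used, so the exact functional form and the name `get_lr` are assumptions, not a transcription of the released training code.

```python
import math

# Values taken from Table 2.
LEARNING_RATE = 3e-4       # peak learning rate
MIN_LR = 3e-5              # minimum learning rate
WARMUP_ITERS = 2000        # warmup iterations
LR_DECAY_ITERS = 400_000   # learning rate decay iterations


def get_lr(it):
    """Learning rate at gradient step `it` (assumed schedule shape)."""
    # 1) Linear warmup from 0 up to the peak learning rate.
    if it < WARMUP_ITERS:
        return LEARNING_RATE * it / WARMUP_ITERS
    # 2) After the decay horizon, stay at the minimum learning rate.
    if it > LR_DECAY_ITERS:
        return MIN_LR
    # 3) In between, cosine decay from the peak down to the minimum.
    decay_ratio = (it - WARMUP_ITERS) / (LR_DECAY_ITERS - WARMUP_ITERS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes from 1 to 0
    return MIN_LR + coeff * (LEARNING_RATE - MIN_LR)


for step in (0, 1000, 2000, 50_000, 100_000):
    print(step, get_lr(step))
```

Under this assumed form, training for the listed 100K gradient steps covers only the first quarter of the 400K-step decay horizon, so the learning rate at the end of training would still be well above the minimum.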
Appendix F Compute Resources
We train all of our models on an Nvidia H100 80GB GPU. Training one model takes around 6 to 12 hours.
Appendix G Full t-SNE
We visualize the full t-SNE here, coloring by the reward of the game. The model appears to have learned some representation of the reward: states with high absolute reward are more likely to lie near each other in the latent space. This also suggests that the model has learned some sort of equivariant representation of player identity, as the region of symmetric high-reward states indicates. Note that reward is not directly given to the model during training.
<details>
<summary>2406.11741v4/x4.png Details</summary>

### Visual Description
# Technical Document Extraction: Scatter Plot Analysis
## Header Section
- **Title**: "Third States" (Position: Top center, Bold black text)
## Main Chart
### Axes
- **X-Axis**: Labeled "O" (Position: Bottom center, Black text)
- **Y-Axis**: No explicit label visible (Implicitly categorical or unlabeled)
### Data Points
- **Color-Coded Clusters**:
- **Red Dots** (Legend: "Group A"): Clustered in lower-left quadrant, dense aggregation.
- **Blue Dots** (Legend: "Group B"): Clustered in upper-right quadrant, moderate density.
- **Purple Dots** (Legend: "Group C"): Scattered across mid-range, no distinct cluster.
### Legend
- **Position**: Top-right corner
- **Labels**:
- Red: "Group A"
- Blue: "Group B"
- Purple: "Group C"
### Trends
- **Group A (Red)**: Dominates lower-left space; no upward/downward trend (static cluster).
- **Group B (Blue)**: Concentrated in upper-right; no discernible slope.
- **Group C (Purple)**: Diffuse distribution; no clear pattern.
## Footer Section
- No explicit footer text or markers.
## Notes
- **Language**: English (No non-English text detected).
- **Data Structure**: No embedded tables or numerical annotations.
- **Spatial Grounding**:
- Legend colors match data points exactly (Red = Group A, Blue = Group B, Purple = Group C).
- **Missing Elements**: No axis scales, numerical values, or secondary legends.
## Conclusion
The chart visualizes three distinct groups ("Group A", "Group B", "Group C") with spatial clustering. No quantitative trends or time-series data are present. The absence of axis labels beyond "O" limits interpretability of the x-axis.
</details>
We visualize the same t-SNE, but this time coloring by game length rather than reward. We see that games with high reward tend to be longer, which makes sense, as the result of a game tends to become clearer as the game progresses.
<details>
<summary>2406.11741v4/x5.png Details</summary>

### Visual Description
# Technical Document Extraction: Scatter Plot Analysis
## Header
- **Title**: "States" (centered at the top of the image)
- **Visual Elements**: No additional header text or graphics.
## Main Chart
### Axes
- **X-Axis**:
- Label: "0" (centered at the bottom of the image).
- No numerical scale or additional markers visible.
- **Y-Axis**:
- No label or scale provided.
- Data points distributed vertically across the plot.
### Data Points
- **Colors**:
- **Blue**: Represents "Group A" (per legend).
- **Purple**: Represents "Group B" (per legend).
- **Distribution**:
- **Central Cluster**: Dense aggregation of both blue and purple points near the center (x ≈ 0, y ≈ 0).
- **Peripheral Spread**: Smaller clusters and isolated points radiating outward from the center.
- **Density Gradient**: Higher concentration in the central region, decreasing toward the edges.
### Legend
- **Location**: Top-right corner of the image.
- **Labels**:
- Blue: "Group A"
- Purple: "Group B"
## Footer
- No footer text or elements present.
## Observations
1. **Spatial Grounding**:
- Legend colors match data point colors exactly (blue = Group A, purple = Group B).
- No discrepancies detected between legend and plotted data.
2. **Trend Verification**:
- Central cluster suggests a shared origin or common characteristic for both groups.
- Peripheral spread indicates divergence or variability in distribution patterns.
3. **Missing Information**:
- Y-axis label and scale are absent, limiting quantitative analysis.
- No explicit units or contextual labels for the x-axis beyond "0".
## Conclusion
The scatter plot visualizes two distinct groups ("Group A" and "Group B") with overlapping central clusters and dispersed peripheral points. The lack of axis labels and numerical scales prevents detailed quantitative interpretation. Further context is required to determine the significance of the spatial distribution.
</details>