# Intuition emerges in Maximum Caliber models at criticality
**Authors**: Lluís Arola-Fernández
> Instituto de Física Interdisciplinar y Sistemas Complejos IFISC (CSIC-UIB), Campus UIB, 07122 Palma de Mallorca, Spain; Departament d’Enginyeria Informàtica i Matemàtiques, Universitat Rovira i Virgili, 43007 Tarragona, Catalonia, Spain (current address).
(September 26, 2025)
## Abstract
Whether large predictive models merely parrot their training data or produce genuine insight lacks a physical explanation. This work reports a primitive form of intuition that emerges as a metastable phase of next-token prediction under future path-entropy maximization. The intuition mechanism is discovered via mind-tuning, the minimal principle that imposes Maximum Caliber in predictive models with a temperature-like control parameter $\lambda$. Training on random walks in deterministic mazes reveals a rich phase diagram: imitation (low $\lambda$), rule-breaking hallucination (high $\lambda$), and a fragile in-between window exhibiting strong protocol-dependence (hysteresis) and multistability, where models spontaneously discover novel goal-directed strategies. These results are captured by a mechanistic low-dimensional theory and frame intuition as an emergent property at the critical balance between memorizing what is and wondering what could be.
Introduction.— The rise of large-scale predictive models is reshaping artificial intelligence and transforming science and society. This progress is built upon a dominant scaling paradigm: pre-training autoregressive neural networks [1] with enormous parameter counts on vast volumes of data [2] using massive compute resources [3]. When coupled with powerful search at inference time [4], this approach has yielded impressive performance in complex games [5], medical diagnosis [6] and algorithmic discovery [7]. Yet, the brute-force solution does not match the elegant efficiency of natural intelligence, which discovers intuitive shortcuts and novel, creative strategies from sparse data without rewards [8]. This contrast sharpens a foundational debate: are these models showing sparks of artificial general intelligence (AGI) [9], or are they “stochastic parrots” [10] that leverage vast experience to create an illusion of thought [5, 11]? While often addressed via complex reasoning benchmarks [12], the paradigm’s limits can be distilled into a simple Gedankenexperiment (Fig. 1).
<details>
<summary>Figure1.png Details</summary>

A black-and-white maze enclosed in a square boundary: thick black walls, white corridors, multiple dead-ends and branching points, with no labels and no explicitly marked exit.
</details>
Figure 1: Gedankenexperiment on emergent reasoning. A minimal environment abstracts a reasoning task into its essential components: a constrained space (a maze) and a hidden, optimal solution (to escape). The reader’s own intuition immediately grasps the task, yet a standard predictive model trained on random walk trajectories (i.e., non-intelligent data without rewards) will never discover it.
This work provides a physical explanation for this leap. We introduce mind-tuning, a simple principle that balances next-token prediction against future path-entropy maximization with a temperature-like parameter $\lambda$ . To our knowledge, mind-tuning is the minimal implementation of the Maximum Caliber (MaxCal) principle [13, 14, 15] compatible with autoregressive training. It reveals the emergence of a fragile metastable phase, within a narrow temperature window between imitation and hallucination regimes, that is reminiscent of intuition.
While our intuition mechanism points toward a horizon of diverse futures to explore, the prevailing paradigm remains blind, fixated only on predicting the next token. Constrained path-entropy maximization is already implicit in intrinsic motivation frameworks [16] like Causal Entropic Forces [17], Active Inference [18], Empowerment [19], or the Maximum Occupancy Principle [20]. Yet, a physical basis for such emergent behavior in pure predictive models has remained elusive. The metastable regime reported here, bounded by distinct entropy- and energy-driven transitions with strong hysteresis and multistability, explains why emergent reasoning is both rare and protocol-dependent. Furthermore, the high-dimensional mechanisms behind this phenomenology are captured analytically by a low-dimensional theory.
This perspective casts intelligence as a state of computational matter [21], building on a rich history of minimal models for emergent cognitive behavior, from Hopfield’s memory [22] and Kuramoto’s synchronization [23] to phenomena in deep learning like double-descent [24], grokking [25], neural collapse [26], symmetry-breaking [27], and collective learning [28], often analyzed through spin-glass analogies [29] and phase diagrams [30, 28]. The phase-transition picture is key to research showing that intelligent systems may operate near a critical point, at the “edge of chaos” [31, 32, 33]. At criticality, fluctuations and system responsiveness peak [31, 34], creating the ideal conditions for the leap from mimicry to insight. In the learning problem, our theory points toward a critical scaling axis driven by the system’s intrinsic dynamics and suggests that current models operate in a suboptimal imitation phase, lacking the intuition that a physical mechanism unlocks.
Mind-tuning.— We focus on reasoning problems solvable by generating trajectories $z=(x_{0},a_{0},x_{1},a_{1},\dots)$ . The system’s behavior is governed by a policy $\pi_{\theta,\beta}$ , a neural network with parameters $\theta$ that maps a data history $h_{t}=(x_{0},a_{0},\dots,x_{t})$ to a probability distribution over a discrete set of actions $\mathcal{A}$ via a softmax function
$$
\pi_{\theta,\beta}(a_{t}\!\mid\!h_{t})=\frac{e^{\beta\,\ell_{\theta}(h_{t},a_{t})}}{\sum_{a^{\prime}\in\mathcal{A}}e^{\beta\,\ell_{\theta}(h_{t},a^{\prime})}}, \tag{1}
$$
where $\ell_{\theta}$ are the network’s output logits and $\beta$ controls the policy’s stochasticity. This general setting includes state–decision spaces, standard autoregressive models in which histories contain tokens, and other representations (see SM Sec. S1 for implementation details).
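As a concrete illustration (a sketch, not the paper’s implementation), the softmax policy of Eq. (1) can be written numerically; the logit vector below is a placeholder for the network outputs $\ell_{\theta}(h_{t},\cdot)$:

```python
import numpy as np

def policy(logits, beta=1.0):
    """Softmax policy of Eq. (1): pi(a | h_t) proportional to exp(beta * logit_a).

    `logits` stands in for the network outputs ell_theta(h_t, .);
    beta controls the policy's stochasticity (large beta approaches argmax).
    """
    z = beta * np.asarray(logits, dtype=float)
    z -= z.max()              # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()
```

Increasing `beta` sharpens the distribution toward the greedy action, approaching the deterministic limit.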
To isolate the intuition mechanism, we assume an offline, imperfect setting [35]: the model never interacts with the environment, has no external rewards, and learns from a dataset $\mathcal{D}$ of non-optimal histories. How can a purely predictive model discover a better solution than what it has seen? By biasing prediction toward futures with high causal path diversity, as prescribed by the Maximum Caliber (MaxCal) principle [13]: among all dynamics consistent with known constraints, prefer those that maximize the entropy of trajectories.
The most unbiased learning objective that imposes MaxCal is the free-energy-like functional
$$
\mathcal{F}_{\lambda,\beta,\tau}(\theta)=\mathcal{E}_{\beta}(\theta)-\lambda\mathcal{H}_{\tau,\beta}(\theta), \tag{2}
$$
where $\lambda\!\geq\!0$ is an effective temperature controlling the energy–entropy trade-off. The first term is the standard Cross-Entropy, or negative log-likelihood, $\mathcal{E}$, measuring the cost of imitating the training data
$$
\mathcal{E}_{\beta}(\theta)=\left\langle-\log\pi_{\theta,\beta}(a_{t}|h_{t})\right\rangle_{(h_{t},a_{t})\in\mathcal{D}}. \tag{3}
$$
This energy $\mathcal{E}$ is traded against the causal path-entropy $\mathcal{H}$ , a Shannon entropy of self-generated futures up to a horizon of length $\tau$
$$
\mathcal{H}_{\tau,\beta}(\theta)=\left\langle\frac{1}{\tau}\left\langle-\ln P(z_{\text{future}}|h_{t})\right\rangle_{z_{\text{future}}\sim\pi_{\theta,\beta}}\right\rangle_{h_{t}\in\mathcal{D}}. \tag{4}
$$
Eq. (4) is estimated over the cone of futures induced by the model itself (see SM Sec. S2B for entropy calculations), making the objective function inherently subjective and self-referential, as the internal beliefs dynamically shape the learning landscape. The gradient update
$$
\theta(t+1)\leftarrow\theta(t)+\eta[{-\nabla_{\theta}\mathcal{E}_{\beta}(\theta)}+{\lambda\nabla_{\theta}\mathcal{H}_{\tau,\beta}(\theta)}] \tag{5}
$$
frames learning as a competition between prediction and causal entropic forces acting on the system’s degrees of freedom, i.e. the network weights. To our knowledge, this self-contained mechanism is the minimal MaxCal implementation compatible with prevalent offline auto-regressive training. Unlike surprise-minimization [36, 37], here the entropic term rewards keeping plausible futures open, pulling toward the adjacent possible [38], without environment interaction [19, 20, 39]. The framework also admits a Bayesian interpretation [40, 41]: standard auto-regressive training uses flat priors. In mind-tuning, instead, the data likelihood filters an optimistic entropic prior over futures with high diversity.
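To make Eqs. (2)–(4) concrete, here is a toy, tabular sketch (not the paper’s code): `step` is a hypothetical stand-in for the deterministic environment dynamics used to roll out self-generated futures, and the path-entropy is a Monte Carlo average of $-\ln P(z_{\text{future}}|h_{t})/\tau$ over sampled rollouts; see SM Secs. S1–S2 for the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, data):
    """Energy term E of Eq. (3): mean -log pi(a|s) over dataset pairs."""
    p = softmax(logits)
    return -np.mean([np.log(p[s, a]) for s, a in data])

def path_entropy(logits, states, step, tau=5, n_samples=64):
    """Entropy term H of Eq. (4): Monte Carlo estimate of the causal
    path-entropy over futures sampled from the model's own policy."""
    p = softmax(logits)
    total = 0.0
    for s0 in states:
        logps = []
        for _ in range(n_samples):
            s, logp = s0, 0.0
            for _ in range(tau):
                a = rng.choice(p.shape[1], p=p[s])
                logp += np.log(p[s, a])
                s = step(s, a)          # internal model of the dynamics
            logps.append(logp)
        total += -np.mean(logps) / tau
    return total / len(states)

def free_energy(logits, data, states, step, lam, tau=5):
    """Mind-tuning objective of Eq. (2): F = E - lambda * H."""
    return cross_entropy(logits, data) - lam * path_entropy(logits, states, step, tau)
```

Gradient descent on this objective then realizes the competition of Eq. (5); for a uniform two-action policy, both terms evaluate exactly to $\ln 2$.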
Experiments.— We test this principle in the minimal sandbox of the Gedankenexperiment (Fig. 1). A model is trained on constrained random-walk trajectories, which respect the maze walls but contain no intelligent strategies for escaping. Sweeping the parameter $\lambda$ yields a rich phase diagram, with clear transitions in both genotype (Fig. 2 A) and phenotype (Fig. 2 B) metrics.
<details>
<summary>Figure2.png Details</summary>

(A) Line graph of cross-entropy E_lambda and path-entropy H_lambda vs. lambda (log scale), with an inset showing fluctuations that peak near lambda = 0.1. (B) Normalized MFPT and WHR vs. lambda, with a shaded p_intuition region peaking near lambda = 0.1. (C-E) Example maze trajectories: Imitation (sparse paths), Intuition (direct, efficient paths), Hallucination (dense, chaotic paths).
</details>
Figure 2: Experimental phase diagram. Sweeping $\lambda$ reveals three behavioral phases. (A) Genotype metrics: Cross-Entropy ($\mathcal{E}$) and causal path-entropy ($\mathcal{H}$). Inset: steady-state fluctuations $\sigma$ over different initial realizations as a function of $\lambda$. (B) Phenotype metrics: Mean First Passage Time (MFPT), Wall Hit Ratio (WHR) and intuition likelihood (see SM Sec. S4B). (C-E) Example trajectories for each phase: (C) Imitation, (D) Intuition, and (E) Hallucination.
For low $\lambda$ , the system is in an imitation phase: cross-entropy is low, path-entropy is low, and trajectories reproduce the suboptimal random walks from the data, leading to a high Mean First Passage Time (MFPT) to the exit (Fig. 2 C). For high $\lambda$ , the entropic term dominates and the system enters a hallucination phase: cross- and path-entropy are high; maze rules are broken to maximize path diversity, and the Wall Hit Ratio (WHR) increases sharply (Fig. 2 E). Between these two regimes lies a narrow intuition phase, where the trade-off between $\mathcal{E}$ and $\mathcal{H}$ yields an emergent strategy: the model discovers the shortest legal path to the exit (Fig. 2 D), achieving minimal MFPT with zero WHR. The separation between the fluctuation peaks of $\mathcal{E}$ and $\mathcal{H}$ (Fig. 2 A inset) reveals distinct entropy- and energy-driven phase boundaries.
<details>
<summary>Figure3.png Details</summary>

(A) Forward (solid) and backward (dashed) sweeps of E_lambda and H_lambda vs. lambda (log scale), with a random-baseline inset. (B, C) MFPT and WHR for the forward and backward sweeps, respectively. (D) Mean network weight vs. lambda for both sweep directions, with a random-baseline inset.
</details>
Figure 3: Hysteresis and protocol-dependence. Comparing a forward (solid) and backward (dashed) sweep of $\lambda$ reveals that the intuitive state is stable once found. (A) Hysteresis loop in genotype metrics ( $\mathcal{E},\mathcal{H}$ ). (B, C) Phenotype for the forward and backward sweeps, respectively, with the forward sweep showing a wider intuition window. (D) The mean network weight $\langle w\rangle$ acts as an order parameter capturing the system’s bistability. Insets show baselines without protocol.
Operationally, this critical learning phase maximizes future path-entropy with minimal cross-entropy, enabling novel, goal-directed behavior at inference without interaction or explicit rewards. Reaching this phase depends on data quality and model complexity, requiring a sufficiently large future horizon and adequate model capacity (see SM Sec. S3 for a parametric study). The fragility of the mechanism is tied to multistability, as observed when applying adiabatic protocols that smoothly sweep the control parameter $\lambda$ (Fig. 3). A large hysteretic loop appears in the genotype metrics (A), which has behavioral consequences in the phenotype: while a forward sweep from $\lambda\approx 0$ opens the intuition window, with low MFPT and low WHR (B), a backward sweep starting from high $\lambda$ does not reach the desired phase (C). The bistability is captured by an effective order parameter, the mean network weight, which remains in an ordered intuitive state once the system has been guided there (D). The adiabatic protocol shows that a self-referential fine-tuning from imitation to controlled imagination allows the system to stabilize in a metastable phase, a process that motivates the term mind-tuning.
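The adiabatic protocol can be sketched as a warm-started sweep: each stage fine-tunes at a fixed $\lambda$ starting from the previous optimum, so the trajectory through parameter space becomes path-dependent. `train_at` below is a hypothetical placeholder for one such fine-tuning stage, not the paper’s training code.

```python
import numpy as np

def adiabatic_sweep(train_at, lambdas, theta0):
    """Warm-started sweep over the control parameter lambda.

    train_at(lam, theta) is a placeholder for fine-tuning the model at
    fixed lambda from initial parameters theta; warm-starting each stage
    from the previous optimum is what makes the protocol path-dependent.
    """
    theta, history = theta0, []
    for lam in lambdas:
        theta = train_at(lam, theta)   # fine-tune at fixed lambda
        history.append((lam, theta))
    return history

forward = np.logspace(-3, 1, 20)       # from imitation toward intuition
backward = forward[::-1]               # starts deep in hallucination
```

Comparing the two sweep directions on the same schedule is what exposes a hysteresis loop like that of Fig. 3.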
Effective theory.— The phenomenology of mind-tuning emerges from a high-dimensional, multistable free-energy landscape. We capture the essential mechanism in a scalar order parameter $m\in[0,1]$ , representing the model’s rationality, and define a Boltzmann policy with an effective potential $U_{m}(a)$ :
$$
p_{m,\beta}(a|h_{t})=\frac{e^{-\beta U_{m}(a)}}{\sum_{a^{\prime}\in\mathcal{A}}e^{-\beta U_{m}(a^{\prime})}}. \tag{6}
$$
Actions, or decisions, are classified into optimal $a^{*}$, rational-but-suboptimal $a^{r}$, and non-rational $a^{n}$; $m_{D}$ is a free parameter representing the training data’s rationality. The effective costs,
$$
U(a^{*})=0,\qquad U(a^{r})=\frac{\max(0,m-m_{D})}{1-m_{D}},\qquad U(a^{n})=m, \tag{7}
$$
are designed to create a trade-off: as the model’s rationality $m$ improves beyond the data’s, the cost of suboptimal-but-legal actions grows, forcing a choice between true optimality and rule-breaking. For the simple Markovian maze with a small state-space, the free energy $\mathcal{F}_{\lambda}(m)$ can be computed analytically (see SM Sec. S4A). For a given $\lambda$ , one can also explore the learning dynamics in this landscape by sampling rationality states $m$ from the equilibrium distribution $P(m)\propto e^{-\hat{\beta}\mathcal{F}(m)}$ , where the inverse temperature $\hat{\beta}$ controls the exploration-exploitation trade-off, modeling stochasticity during gradient descent.
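A minimal numerical sketch of this effective theory: the costs of Eq. (7), the Boltzmann policy of Eq. (6) over the three action classes, and Metropolis sampling of the rationality $m$ from $P(m)\propto e^{-\hat{\beta}\mathcal{F}(m)}$. The free energy passed to the sampler is a placeholder; the analytical $\mathcal{F}_{\lambda}(m)$ is derived in SM Sec. S4A.

```python
import numpy as np

def costs(m, m_D):
    """Effective costs of Eq. (7) for the three action classes
    (optimal a*, rational-but-suboptimal a^r, non-rational a^n)."""
    return np.array([0.0, max(0.0, m - m_D) / (1.0 - m_D), m])

def boltzmann_policy(m, m_D, beta):
    """Effective Boltzmann policy of Eq. (6) over the action classes."""
    w = np.exp(-beta * costs(m, m_D))
    return w / w.sum()

def sample_m(F, beta_hat, n_steps=2000, step=0.05, seed=0):
    """Metropolis sampling of m from P(m) ~ exp(-beta_hat * F(m)),
    modeling stochastic exploration of the free-energy landscape."""
    rng = np.random.default_rng(seed)
    m = 0.5
    for _ in range(n_steps):
        m_new = float(np.clip(m + rng.normal(0.0, step), 0.0, 1.0))
        if rng.random() < np.exp(-beta_hat * (F(m_new) - F(m))):
            m = m_new
    return m
```

With a single-well placeholder free energy the sampler concentrates at its minimum; with a bistable $\mathcal{F}_{\lambda}(m)$, the chain’s starting point and $\hat{\beta}$ determine which basin it settles into.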
This effective theory qualitatively reproduces the experimental phase diagram, including the transitions in both genotypic (Fig. 4 A) and phenotypic metrics (Fig. 4 B). The underlying mechanism is revealed by exploring the minima of the free-energy landscape, found by solving $\partial\mathcal{F}_{\lambda}(m)/\partial m=0$. This analysis confirms a smooth, entropy-driven transition followed by an abrupt, first-order energy-driven one, creating a bistable region where intuition ($m>m_{D}$) and hallucination ($m\ll m_{D}$) coexist (Fig. 4 C). Intriguingly, the theory further predicts a more elusive inspiration phase: a third stable solution with $m\approx 1$, associated with a state of true creative insight. This strategy abruptly departs from data and represents internalized understanding. Unlike the subtle intuitive state, which often requires a high inference $\beta$ to be executed without error, this inspired solution would be robust even with a noisy policy. Yet, it is hidden within a tiny basin of attraction masked by the dominant hallucination phase (see SM Sec. S4C). These predictions point to a very rich phase diagram, where intuition may be the trigger of even more exotic phenomena.
<details>
<summary>Figure4.png Details</summary>

(A) Theoretical cross-entropy and path-entropy vs. lambda (log scale), with an inset of their first derivatives peaking near lambda = 1. (B) MFPT and WHR vs. lambda, with a shaded p_intuition region. (C) Rationality minima m*, m**, m*** of the free-energy landscape vs. lambda, with the data rationality m_D = 0.7 marked.
</details>
Figure 4: Theoretical predictions. The low-dimensional model reproduces the experimental findings. (A) Theoretical $\mathcal{E}$ and $\mathcal{H}$ vs. $\lambda$ . (B) Corresponding MFPT and WHR. (C) Minima of the free-energy landscape vs the control parameter $\lambda$ . The plot reveals coexisting stable states ( $m^{*},m^{**},m^{***}$ ) and a first-order transition where the global minimum jumps discontinuously, explaining the observed hysteresis.
Accessing these different cognitive phases requires navigating a complex landscape. Indeed, the observed hysteresis and the success of the adiabatic protocol are explained by this multistability. The analytical phase diagram (Fig. 4 C) shows that slowly increasing $\lambda$ is a safe route to guide the system into the intuition basin of attraction. In Bayesian terms, it first grounds the model with the data likelihood before introducing the entropic prior. Reaching more exotic phases in the landscape, like the predicted inspiration state, would likely demand more complex, non-equilibrium protocols.
Discussion.— High-quality human data can carry an implicit drive toward path diversity, and optimization itself can induce entropic pressures that improve generalization [42], yielding an “intelligent simulator” from curated experience. This view predicts that current models should spontaneously increase their causal path entropy with scale. Our framework makes this drive explicit and grounded in MaxCal, providing a shortcut to intuition that encodes implicit search into model weights to reduce the need for expensive search at inference [43]. These results point toward a hidden axis, training-time imagination, that may be key to unlocking out-of-distribution generalization in offline predictive models [35].
Our results are demonstrated in a minimal sandbox, a choice that is deliberate. The maze is the simplest non-trivial setting where the mechanism can be isolated and reproduced analytically. Many reasoning tasks can be viewed as navigation through a “conceptual maze” where a key insight unlocks a vastly larger state-space [17, 19, 20, 21]. This argument promises applications in control [17, 20], reasoning [8, 44], and planning [44]. Stefan Zweig’s The Royal Game [45] provides a compelling literary analogue: a prisoner achieves chess mastery by first studying games (imitation) and then playing against himself in his mind (imagination). His triumph occurs at the edge of madness, a state mirroring intuition coexisting with hallucination in our phase diagram.
Yet, scaling mind-tuning to real-world cases faces significant challenges. Computationally, estimating path-entropy for long horizons is hard due to the combinatorial explosion of futures [13]. This requires designing clever sampling strategies [17, 46], perhaps inspired by dreaming, hierarchical reasoning [44], and unconventional methods and architectures [47, 48]. Theoretically, a full characterization of the phase diagrams and universality classes is needed to design optimal tuning protocols [49]. For uncharted domains, identifying the right spaces for entropy maximization can be difficult, and the offline theory may need data augmentation from environment interaction [18]. Moreover, tuning $\lambda$ for future diversity in practice can turn into an alignment problem, trading benefits for safety [50]. Despite these challenges, this work takes a high-risk, high-reward route to reframing intelligence not merely as compression and computation, but as a physical phenomenon emerging at criticality.
Acknowledgments.— The author thanks many colleagues at IFISC and URV for enriching discussions. This work has been partially supported by the María de Maeztu project CEX2021-001164-M funded by the MICIU/AEI/10.13039/501100011033 and by Programa Maria Goyri URV.
## References
- Vaswani et al. [2017] A. Vaswani et al., Attention is all you need, in Adv. in Neural Info. Processing Systems, Vol. 30 (2017).
- Kaplan et al. [2020] J. Kaplan et al., Scaling laws for neural language models (2020), arXiv:2001.08361 [cs.LG] .
- Hoffmann et al. [2022] J. Hoffmann et al., Training compute-optimal large language models, arXiv preprint (2022), 2203.15556 .
- DeepSeek-AI et al. [2025] DeepSeek-AI et al., Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning (2025), arXiv:2501.12948 [cs.CL] .
- Shojaee et al. [2024] P. Shojaee et al., The illusion of thinking, arXiv preprint (2024), 2401.00675 .
- Brodeur et al. [2024] P. G. Brodeur et al., Superhuman performance of a large language model on the reasoning tasks of a physician, arXiv preprint (2024), 2412.10849 .
- Novikov et al. [2025] A. Novikov et al., Alphaevolve: A coding agent for scientific and algorithmic discovery (2025), arXiv:2506.13131 .
- Chollet [2019] F. Chollet, On the measure of intelligence, arXiv preprint (2019), 1911.01547 .
- Bubeck et al. [2023] S. Bubeck et al., Sparks of artificial general intelligence: Early experiments with gpt-4, (2023), 2303.12712 .
- Bender et al. [2021] E. M. Bender et al., On the dangers of stochastic parrots: Can language models be too big?, in Proceedings ACM (2021) pp. 610–623.
- Mitchell and Krakauer [2023] M. Mitchell and D. C. Krakauer, The debate over understanding in ai’s large language models, PNAS 120, e2215907120 (2023).
- Liang et al. [2022] P. Liang et al., Holistic evaluation of language models, arXiv preprint (2022), 2211.09110 .
- Jaynes [1980] E. T. Jaynes, The minimum entropy production principle, Annual Review of Physical Chemistry 31, 579 (1980).
- Pressé et al. [2013] S. Pressé, K. Ghosh, J. Lee, and K. A. Dill, Principles of maximum entropy and maximum caliber in statistical physics, Reviews of Modern Physics 85, 1115 (2013).
- Dixit et al. [2018] P. D. Dixit et al., Perspective: Maximum caliber is a general variational principle for dynamical systems, The Journal of Chemical Physics 148, 010901 (2018).
- Kiefer [2025] A. B. Kiefer, Intrinsic motivation as constrained entropy maximization, arXiv preprint (2025), 2502.02962 .
- Wissner-Gross and Freer [2013] A. D. Wissner-Gross and C. E. Freer, Causal entropic forces, Physical Review Letters 110, 168702 (2013).
- Wen [2025] B. Wen, The missing reward: Active inference in the era of experience (2025), arXiv:2508.05619 .
- Klyubin et al. [2005] A. S. Klyubin, D. Polani, and C. L. Nehaniv, Empowerment: A universal agent-centric measure of control, in 2005 IEEE CEC, Vol. 1 (2005) pp. 128–135.
- Ramirez-Ruiz et al. [2024] J. Ramirez-Ruiz et al., Complex behavior from intrinsic motivation to occupy future action-state path space, Nature Communications 15, 5281 (2024).
- Friston et al. [2022] K. J. Friston et al., Designing ecosystems of intelligence from first principles, arXiv preprint (2022), 2212.01354 .
- Hopfield [1982] J. J. Hopfield, Neural networks and physical systems with emergent collective computational abilities, Proceedings of the National Academy of Sciences 79, 2554 (1982).
- Kuramoto [1975] Y. Kuramoto, Self-entrainment of a population of coupled non-linear oscillators, in International Symposium on Mathematical Problems in Theoretical Physics (Springer, 1975) pp. 420–422.
- Belkin et al. [2019] M. Belkin, D. Hsu, S. Ma, and S. Mandal, Reconciling modern machine-learning practice and the classical bias–variance trade-off, PNAS 116, 15849 (2019).
- Power et al. [2022] A. Power et al., Grokking: Generalization beyond overfitting in small neural networks, arXiv (2022), 2201.02177 .
- Papyan et al. [2020] V. Papyan, X. Y. Han, and D. L. Donoho, Prevalence of neural collapse during the terminal phase of deep learning training, PNAS 117, 24927 (2020).
- Liu et al. [2025] Z. Liu, Y. Xu, T. Poggio, and I. Chuang, Parameter symmetry potentially unifies deep learning theory, arXiv preprint (2025), 2502.05300 .
- Arola-Fernández and Lacasa [2024] L. Arola-Fernández and L. Lacasa, Effective theory of collective deep learning, Phys. Rev. Res. 6, L042040 (2024).
- Carleo et al. [2019] G. Carleo et al., Machine learning and the physical sciences, Rev. Mod. Phys. 91, 045002 (2019).
- Lewkowycz et al. [2020] A. Lewkowycz et al., The large learning rate phase of deep learning: the catapult mechanism (2020), arXiv:2003.02218 [stat.ML] .
- Muñoz [2018] M. A. Muñoz, Colloquium: Criticality and dynamical scaling in living systems, Rev. Mod. Phys. 90, 031001 (2018).
- Zhang et al. [2025] S. Zhang et al., Intelligence at the edge of chaos (2025), arXiv:2410.02536 [cs.AI] .
- Jiménez-González et al. [2025] P. Jiménez-González, M. C. Soriano, and L. Lacasa, Leveraging chaos in the training of artificial neural networks (2025), arXiv:2506.08523 [cs.LG] .
- Arola-Fernández et al. [2020] L. Arola-Fernández et al., Uncertainty propagation in complex networks: From noisy links to critical properties, Chaos: An Interdisciplinary Journal of Nonlinear Science 30, 023129 (2020).
- Levine et al. [2020] S. Levine, A. Kumar, G. Tucker, and J. Fu, Offline reinforcement learning: Tutorial, review, and perspectives on open problems (2020), arXiv:2005.01643 [cs.LG] .
- Heins et al. [2024] C. Heins et al., Collective behavior from surprise minimization, PNAS 121, e2320239121 (2024).
- Friston [2010] K. Friston, The free-energy principle: A unified brain theory?, Nature Reviews Neuroscience 11, 127 (2010).
- Kauffman [2000] S. A. Kauffman, Investigations (Oxford Univ. Pr., 2000).
- Eysenbach and Levine [2022] B. Eysenbach and S. Levine, Maximum entropy rl (provably) solves some robust rl problems (2022), arXiv:2103.06257 [cs.LG] .
- Jaynes [1957] E. T. Jaynes, Information theory and statistical mechanics, The Physical Review 106, 620 (1957).
- Zdeborová and Krzakala [2016] L. Zdeborová and F. Krzakala, Statistical physics of inference: thresholds and algorithms, Adv. in Phys. 65, 453–552 (2016).
- Ziyin et al. [2025] L. Ziyin, Y. Xu, and I. Chuang, Neural thermodynamics i: Entropic forces in deep and universal representation learning (2025), arXiv:2505.12387 [cs.LG] .
- Belcak et al. [2025] P. Belcak et al., Small language models are the future of agentic ai (2025), arXiv:2506.02153 .
- Wang et al. [2025] G. Wang et al., Hierarchical reasoning models, arXiv preprint (2025), 2506.21734 .
- Zweig [1943] S. Zweig, The Royal Game (Viking Press, 1943).
- Aguilar et al. [2022] J. Aguilar et al., Sampling rare trajectories using stochastic bridges, Phys. Rev. E 105, 064138 (2022).
- Labay-Mora et al. [2025] Labay-Mora et al., Theoretical framework for quantum associative memories, Quantum Science and Technology 10, 035050 (2025).
- Brunner et al. [2025] D. Brunner et al., Roadmap on neuromorphic photonics (2025), arXiv:2501.07917 [cs.ET] .
- Manzano et al. [2024] G. Manzano et al., Thermodynamics of computations with absolute irreversibility, unidirectional transitions, and stochastic computation times, Phys. Rev. X 14, 021026 (2024).
- Arenas et al. [2011] A. Arenas et al., The joker effect: Cooperation driven by destructive agents, J. of Theo. Bio. 279, 113–119 (2011).
- Maddison et al. [2017] C. J. Maddison, A. Mnih, and Y. W. Teh, The concrete distribution: A continuous relaxation of discrete random variables, in ICLR (2017) 1611.00712 .
- Williams [1992] R. J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine Learning 8, 229 (1992).
## Supplementary Material for: “Intuition emerges in Maximum Caliber models at criticality”
## Appendix A S1. Experimental Setup and Hyperparameters
The experimental setting is a minimal yet non-trivial environment for testing emergent reasoning. It consists of a deterministic $24\times 24$ maze with periodic boundary conditions, where an agent must find the path to a designated exit. This controlled testbed provides a tractable state space for analyzing the learning dynamics. The agent’s behavior is determined by a policy network that maps the current state (2D position $x_{t}$ ) to a probability distribution over the four cardinal actions: $\mathcal{A}=\{\text{Up, Down, Right, Left}\}$ . For auto-regressive training, a simple deterministic function $f(x_{t},a)$ maps the last action to the next state.
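The setup above can be sketched in a few lines of PyTorch; the layer sizes match the hyperparameters listed in this appendix, but names such as `Policy`, `MOVES`, and `step`, and the convention that “Up” decrements the second coordinate, are illustrative assumptions rather than the exact released implementation.

```python
# Illustrative sketch (not the exact released code): a policy MLP over the
# four cardinal actions, and the deterministic transition f(x_t, a) on a
# periodic grid. Action ordering and sign conventions are assumptions.
import torch
import torch.nn as nn

GRID = 24  # 24x24 maze with periodic boundary conditions

class Policy(nn.Module):
    """Maps a 2D position to a distribution over {Up, Down, Right, Left}."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 4))

    def forward(self, x, beta=1.0):
        # beta acts as an inverse temperature sharpening the policy
        return torch.softmax(beta * self.net(x), dim=-1)

# deterministic transition f(x_t, a); rows follow the action order above
MOVES = torch.tensor([[0, -1], [0, 1], [1, 0], [-1, 0]])

def step(x, a):
    return (x + MOVES[a]) % GRID
```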
The training dataset $\mathcal{D}$ is intentionally non-optimal. In our main experiments, it contains $N=100$ trajectories, each of length $T=60$ steps, generated by constrained random walks. These walkers respect the maze walls (i.e., never collide with them) but otherwise move randomly, exhibiting no goal-directed behavior. This design ensures that the optimal exit strategy is not present in the training data, forcing the model to discover it.
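A minimal generator for such a dataset might look as follows; the `WALLS` set and grid layout here are toy placeholders, not the actual maze used in the experiments.

```python
# Hedged sketch: wall-respecting random walks on a periodic grid. The WALLS
# set below is a toy example; the real maze layout is in the released code.
import random

GRID = 24
MOVES = {'U': (0, -1), 'D': (0, 1), 'R': (1, 0), 'L': (-1, 0)}
WALLS = {((3, 3), (3, 4))}  # forbidden cell-to-cell transitions

def try_move(x, m):
    nx = ((x[0] + MOVES[m][0]) % GRID, (x[1] + MOVES[m][1]) % GRID)
    legal = (x, nx) not in WALLS and (nx, x) not in WALLS
    return legal, nx

def random_walk(start=(0, 0), T=60, rng=random):
    traj, x = [start], start
    for _ in range(T):
        legal, nx = False, x
        while not legal:                       # resample until a legal move
            legal, nx = try_move(x, rng.choice('UDRL'))
        x = nx
        traj.append(x)
    return traj

dataset = [random_walk() for _ in range(100)]  # N=100 trajectories, T=60 steps
```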
The model parameters $\theta$ are optimized by minimizing the free-energy functional $\mathcal{F}_{\lambda,\beta,\tau}(\theta)$ (Eq. (2) in the main text) via the Adam optimizer. The results presented in the main text (Fig. 2) are averaged over 20 independent training runs, each with a different random weight initialization, to ensure statistical robustness. The key hyperparameters used in the main experiments are: a policy network structured as a multi-layer perceptron (MLP) with one hidden layer of 128 neurons and ReLU activation; a learning rate of $1\times 10^{-3}$ ; 300 training epochs per $\lambda$ value; and a future horizon of $\tau=40$ steps in the entropy calculation.
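In schematic PyTorch, the optimization reduces to gradient descent on $\mathcal{F}=\mathcal{E}-\lambda\mathcal{H}$. For brevity, this sketch replaces the full $\tau$-step path-entropy of Sec. S2 with a crude one-step entropy proxy, and the data are random stand-ins; only the loop structure, optimizer, and learning rate follow the text.

```python
# Schematic training loop for F = E - lambda*H with Adam (lr = 1e-3).
# The one-step entropy proxy stands in for the path-entropy of Sec. S2.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
policy = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 4))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

states = torch.rand(32, 2)            # toy (state, action) pairs
actions = torch.randint(0, 4, (32,))
lam = 0.1                             # the control parameter lambda

def free_energy():
    logits = policy(states)
    E = F.cross_entropy(logits, actions)        # imitation term
    logp = F.log_softmax(logits, dim=-1)
    H = -(logp.exp() * logp).sum(-1).mean()     # one-step entropy proxy
    return E - lam * H

losses = []
for epoch in range(300):              # 300 epochs per lambda value
    opt.zero_grad()
    loss = free_energy()
    loss.backward()
    opt.step()
    losses.append(loss.item())
```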
The policy stochasticities are set to $\beta=1$ for training, $\beta=5$ for entropy calculation (imagination), and $\beta=10$ at inference time. A high imagination $\beta$ (compared to the training $\beta$ ) is beneficial for discovering hidden solutions that maximize causal entropy (i.e., finding the exit) with a finite $\tau$ and sparse data. A high inference $\beta$ is necessary to induce intuitive behavior in practice. In the intuition phase, the agent finds a superior solution but must execute its policy quite deterministically to follow the optimal path in the minimum time.
For problems that are not Markovian or where the data representation does not contain full state information (e.g., data are sequences of moves or the agent only sees its local environment), a more advanced neural network is required. Transformers are the standard for modeling long, non-Markovian sequences of tokens. Our framework naturally extends to these sequential autoregressive architectures, albeit at the cost of more parameters and computational effort.
Code availability.— PyTorch source code to reproduce the results of this paper is publicly available on GitHub: https://github.com/mystic-blue/mind-tuning.
## Appendix B S2. Calculation of Objective Functionals
The mind-tuning objective function $\mathcal{F}_{\lambda,\beta,\tau}(\theta)=\mathcal{E}_{\beta}(\theta)-\lambda\mathcal{H}_{\tau,\beta}(\theta)$ consists of two key terms. Below we detail their calculation.
### B.1 A. Cross-Entropy Estimation
The cross-entropy term $\mathcal{E}_{\beta}(\theta)$ , defined in Eq. (3) of the main text, measures the model’s ability to imitate the training data. It is estimated by averaging the negative log-likelihood of the actions taken in the dataset $\mathcal{D}$ given the preceding histories:
$$
\hat{\mathcal{E}}_{\beta}(\theta)=\frac{1}{|\mathcal{D}|}\sum_{(h_{t},a_{t})\in\mathcal{D}}[-\log\pi_{\theta,\beta}(a_{t}|h_{t})]
$$
where $|\mathcal{D}|$ is the total number of state-action pairs in the training set. This term encourages the policy to assign high probability to the trajectories observed during training.
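As a direct transcription of the estimator above (a sketch, assuming `pi(h)` returns the vector of action probabilities for history `h`):

```python
# Plain-Python transcription of the cross-entropy estimator; pi(h) is
# assumed to return action probabilities for history h.
import math

def cross_entropy_hat(pi, data):
    """Average negative log-likelihood over (history, action) pairs in D."""
    nll = [-math.log(pi(h)[a]) for h, a in data]
    return sum(nll) / len(nll)
```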
### B.2 B. Causal Path-Entropy: Analytic Calculation for Markovian Systems
For systems with fully-observed, discrete, and reasonably small state spaces $\mathcal{V}$ , such as our maze environment, the path-entropy can be computed analytically. Since the system is Markovian ( $h_{t}=x_{t}$ ), we can define a policy-dependent transition matrix $M_{\pi}$ . The element $(M_{\pi})_{x^{\prime},x}$ gives the probability of transitioning from state $x$ to state $x^{\prime}$ under the current policy $\pi_{\theta,\beta}$ . Specifically, $(M_{\pi})_{x^{\prime},x}=\sum_{a\in\mathcal{A}}\pi_{\theta,\beta}(a|x)\delta_{x^{\prime},f(x,a)}$ , where $f(x,a)$ is the deterministic function that returns the next state.
Given a starting state $x_{start}$ , we can compute the probability distribution over future states $\vec{\rho}_{k}$ at any time step $k$ by evolving an initial occupancy vector (a point mass at $x_{start}$ ) via the recursion $\vec{\rho}_{k+1}=M_{\pi}\vec{\rho}_{k}$ . The conditional path-entropy for a trajectory starting at $x_{start}$ is then the time-averaged Shannon entropy of the policy, weighted by the occupancy probability at each future state:
$$
\mathcal{H}_{\tau,\beta}(\theta|x_{start})=\frac{1}{\tau}\sum_{k=0}^{\tau-1}\sum_{x\in\mathcal{V}}(\rho_{k})_{x}\left[-\sum_{a\in\mathcal{A}}\pi_{\theta,\beta}(a|x)\log\pi_{\theta,\beta}(a|x)\right].
$$
The total functional $\mathcal{H}_{\tau,\beta}(\theta)$ is the expectation of Eq. (S2) over all starting states in the training dataset $\mathcal{D}$ . This entire calculation is fully differentiable with respect to the network parameters $\theta$ , allowing for efficient gradient-based optimization. This exact method was used to produce all experimental and theoretical results in this work. Its primary computational cost scales with the size of the state space $|\mathcal{V}|$ , making it suitable for our testbed.
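For concreteness, the recursion can be sketched on a toy Markov world (a 5-state ring with two actions); the maze version only changes `f`, the state space, and the policy.

```python
# Sketch of the analytic path-entropy: build M_pi from the policy, evolve
# the occupancy vector, and average local policy entropies. Toy 5-state ring.
import numpy as np

S, A = 5, 2
f = lambda x, a: (x + 1) % S if a == 0 else (x - 1) % S  # deterministic f(x, a)
pi = np.full((S, A), 0.5)                                 # uniform policy pi(a|x)

# (M_pi)_{x', x} = sum_a pi(a|x) * delta(x', f(x, a))
M = np.zeros((S, S))
for x in range(S):
    for a in range(A):
        M[f(x, a), x] += pi[x, a]

def path_entropy(x_start, tau):
    rho = np.zeros(S); rho[x_start] = 1.0               # point mass at x_start
    h_local = -(pi * np.log(pi)).sum(axis=1)            # policy entropy per state
    total = 0.0
    for _ in range(tau):
        total += rho @ h_local
        rho = M @ rho                                   # rho_{k+1} = M_pi rho_k
    return total / tau
```

For the uniform two-action policy this returns $\log 2$ exactly, independent of the horizon, which provides a quick sanity check of an implementation.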
### B.3 C. Causal Path-Entropy: Monte Carlo Estimation for High-Dimensional Systems
For high-dimensional or continuous state spaces, or for non-Markovian sequence models like Transformers, the analytic approach becomes intractable. In these cases, $\mathcal{H}$ must be estimated via Monte Carlo sampling. For each starting history $h_{start}$ in a training mini-batch, we generate $K$ independent future trajectories (rollouts) of length $\tau$ by autoregressively sampling actions from the policy. The estimator for the path-entropy functional is:
$$
\hat{\mathcal{H}}_{\tau,\beta}(\theta)\approx\frac{1}{|\mathcal{B}|}\sum_{h_{start}\in\mathcal{B}}\left(\frac{1}{K\tau}\sum_{k=1}^{K}\sum_{j=0}^{\tau-1}\left[-\ln\pi_{\theta,\beta}(a_{j}^{(k)}|h_{j}^{(k)})\right]_{h_{start}}\right).
$$
To ensure that gradients can be backpropagated through the sampling process, especially for discrete action spaces, reparameterization techniques are required. A standard method is the Gumbel-Softmax trick [51], which provides a continuous, differentiable approximation to the sampling procedure. Alternatively, the gradient of the entropic objective can be estimated using policy gradient methods like REINFORCE [52], though this often suffers from high variance.
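A hedged sketch of this estimator with the Gumbel-Softmax relaxation is below; the linear `policy` and the relaxed state update `x + a @ MOVES` are toy stand-ins for a real autoregressive model, kept only to show how gradients flow through the sampled rollout.

```python
# Monte Carlo path-entropy with the Gumbel-Softmax trick (toy sketch).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
policy = torch.nn.Linear(2, 4)                # logits over 4 actions
MOVES = torch.tensor([[0., -1.], [0., 1.], [1., 0.], [-1., 0.]])

def mc_path_entropy(x0, K=8, tau=10, beta=5.0):
    ent = torch.zeros(())
    for _ in range(K):                        # K independent rollouts
        x = x0.clone()
        for _ in range(tau):
            logits = beta * policy(x)
            # differentiable one-hot sample (0.5 is the relaxation temperature)
            a = F.gumbel_softmax(logits, 0.5, hard=True)
            logp = F.log_softmax(logits, dim=-1)
            ent = ent - (a * logp).sum()      # accumulates -log pi(a_j | h_j)
            x = x + a @ MOVES                 # relaxed transition keeps the graph
    return ent / (K * tau)

H_hat = mc_path_entropy(torch.zeros(2))
H_hat.backward()                              # gradients reach the policy
```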
## Appendix C S3. Parametric Dependencies of the Intuition Phase
The emergence of the fragile intuition phase is a critical phenomenon highly sensitive to the model, data, and learning protocol parameters. Below, we detail the key dependencies we investigated.
### C.1 A. Future Horizon $\tau$
The future horizon $\tau$ dictates the timescale of the model’s “imagination”. Our experiments show that the intuition phase only emerges for a sufficiently long horizon (Fig. S1).
For a small $\tau$ , the model is myopic; the long-term entropic gain from escaping the maze is not visible, so the model defaults to minimizing cross-entropy and remains in the imitation phase. As $\tau$ increases, the model can foresee the vast expansion of possible futures that awaits outside the maze, creating a strong entropic incentive to find an exit. For intermediate horizons, we often observe a cheating phase—a local minimum in the free-energy landscape where the model learns to take a single illegal step through a wall. This strategy is a compromise: it incurs a small penalty for rule-breaking but gains a significant medium-term entropic advantage. Only for large $\tau$ does the incentive to find a legal path to maximal freedom dominate (i.e., virtue over vice).
Figure S1: Dependence on Future Horizon $\tau$ . Phase diagram of the genotypic (top) and phenotypic (bottom) metrics as a function of $\lambda$ for different future horizons. The intuition window (sharp dip in MFPT and zero WHR, shaded blue) appears and stabilizes only for a long horizon ( $\tau=40$ ). (A) A short horizon ( $\tau=1$ ) yields only imitation and hallucination. (B) An intermediate horizon ( $\tau=20$ ) can lead to a cheating strategy, which is worse than the true intuitive solution (C).
### C.2 B. Model Capacity
The capacity of the policy network, controlled by the number of hidden neurons, also shapes the emergence of the intuition phase (Fig. S2). A model with insufficient capacity has high bias and lacks the representational power to learn the complex, mixed strategy required to balance maze constraints with goal-directed exploration. It cannot simultaneously represent the world model and the entropic drive, so the intuition phase does not emerge. Conversely, a model with excessive capacity relative to the task complexity is prone to overfitting. It may perfectly memorize the noisy random walks from the training data or discover trivial, non-generalizable solutions (e.g., exploiting specific numerical artifacts) to maximize entropy. The intuition phase occupies a “sweet spot” where model capacity is well-matched to the problem, enabling generalization from sparse data rather than mere memorization or unconstrained hallucination.
Figure S2: Dependence on Model Capacity. Emergence of the intuition phase as a function of the number of neurons in the hidden layer. (A) A model with insufficient capacity (e.g., 8 neurons) cannot learn the required behavior. The intuition phase is robust for models with sufficient capacity (e.g., 64 (B) or 128 neurons (C)), which are powerful enough to discover the solution but not so powerful that they immediately overfit.
### C.3 C. Maze Complexity
We evaluated the framework on several environments of increasing complexity (Fig. S3). In simpler environments (e.g., a straight corridor), the escape task is trivial because the data trajectories are very close to the optimal solution. The intuition window is consequently wide and appears at lower values of $\lambda$ . As maze complexity increases, finding the optimal path becomes a harder constraint-satisfaction problem. The cross-entropy term $\mathcal{E}$ more strongly penalizes deviations from valid paths. To overcome this, a stronger entropic pressure (a higher $\lambda$ ) is required to motivate the search for the distant exit. As a result, the intuition window narrows and shifts in the phase diagram, indicating that a more precise tuning of the energy-entropy balance is needed for more difficult problems. In some cases, the intuition window may disappear entirely, and specific protocols, such as an adiabatic sweep in $\lambda$ , are then required to reach the phase.
Figure S3: Dependence on Maze Complexity. Position and width of the intuition window (measured by MFPT) for environments of varying difficulty. (A,B) For a simple corridor, the window is wide and appears at low $\lambda$ . (C) For the more complex maze used in the main text, the window is narrower, reflecting the increased difficulty. (D) For even more complex problems, the intuition window can disappear, necessitating specific protocols to reach the desired phase.
## Appendix D S4. Detailed Theory and Further Predictions
Here we expand on the theory from the main text, providing the explicit analytical forms for the free energy functional. We also clarify the calculation of the intuition likelihood ( $p_{\text{intuition}}$ ) and discuss the existence of a more elusive inspiration phase as a further prediction of the theory.
### D.1 A. The Effective Free Energy Functional
For the Markovian maze environment with a small state-space, the terms of the free energy functional $\mathcal{F}_{\lambda}(m)=\mathcal{E}(m)-\lambda\mathcal{H}(m)$ can be computed analytically as a function of the rationality order parameter $m$ . Note that this theory is effective: the $\beta$ of the analytical policy is distinct from the experimental one, since the former controls only three minimal analytical costs while the latter modulates the entire logit vector.
The effective cross-entropy, $\mathcal{E}(m)$ , is the expectation of the negative log-likelihood over the state distribution of the given data, $\rho_{\mathcal{D}}(s)$ . For a single state $s$ , the cross-entropy is $E_{s}(m)=\log Z_{m}(s)+\beta\langle U_{m}(a)\rangle_{a\sim\text{uniform}}$ , where the average is over legal moves from $s$ . Averaging over the data distribution gives
$$
\mathcal{E}(m)=\left\langle\log\left(\sum_{a\in\mathcal{A}}e^{-\beta U_{m}(a|s)}\right)+\beta\frac{\sum_{a^{\prime}\in\mathcal{A}_{\text{legal}}(s)}U_{m}(a^{\prime}|s)}{|\mathcal{A}_{\text{legal}}(s)|}\right\rangle_{s\sim\rho_{\mathcal{D}}},
$$
where $U_{m}(a|s)$ is the cost of action $a$ in state $s$ (which depends on whether the move is optimal, suboptimal, or a wall collision) and $\mathcal{A}_{\text{legal}}(s)$ is the set of valid moves from $s$ .
The effective path-entropy, $\mathcal{H}(m)$ , is the time-averaged Shannon entropy of the policy $p_{m,\beta}$ over trajectories of length $\tau$ starting from an initial state distribution $\rho_{0}$ (in our case, a single point at the maze start). It is calculated using the policy-dependent transition matrix $M_{m}$ :
$$
\mathcal{H}(m)=\frac{1}{\tau}\sum_{k=0}^{\tau-1}\left(\sum_{s\in\mathcal{V}}(\rho_{k})_{s}\cdot h_{m}(s)\right),
$$
where $\vec{\rho}_{k}=(M_{m})^{k}\vec{\rho}_{0}$ is the state occupancy vector at time $k$ , and $h_{m}(s)$ is the local policy entropy at state $s$ :
$$
h_{m}(s)=-\sum_{a\in\mathcal{A}}p_{m,\beta}(a|s)\log p_{m,\beta}(a|s).
$$
These analytical expressions are used to generate the theoretical plots in the main text.
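The path-entropy average can likewise be sketched numerically. The sketch below makes one simplifying assumption not in the text: the local entropy $h_m(s)$ is computed from row $s$ of the transition matrix, which coincides with the policy entropy only when distinct actions lead to distinct successor states (wall collisions that map several actions to the same state would break this).

```python
import numpy as np

def path_entropy(M, rho0, tau):
    """H(m) = (1/tau) * sum_{k=0}^{tau-1} <h_m(s)>_{rho_k}, with rho_k = rho_0 M^k.

    M is the row-stochastic transition matrix induced by the policy; h_m(s) is
    taken as the Shannon entropy of row s (see the assumption in the lead-in)."""
    M = np.asarray(M, dtype=float)
    # Entropy per state; np.where guards log(0) (the -M*log term vanishes there).
    h = (-M * np.log(np.where(M > 0, M, 1.0))).sum(axis=1)
    rho = np.asarray(rho0, dtype=float)
    H = 0.0
    for _ in range(tau):
        H += rho @ h        # contribution of the occupancy at step k
        rho = rho @ M       # rho_{k+1} = rho_k M
    return H / tau

# Two-state examples: a maximally random chain and a deterministic one.
H_random = path_entropy([[0.5, 0.5], [0.5, 0.5]], [1.0, 0.0], tau=5)  # log 2
H_det = path_entropy([[0.0, 1.0], [0.0, 1.0]], [1.0, 0.0], tau=5)     # 0
```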
### D.2 B. Calculating the Intuition Likelihood ( $p_{\text{intuition}}$ )
The intuition metric, visualized as the cyan region in the plots, quantifies the model’s ability to spontaneously follow the optimal path. In experiments, this intuition likelihood is measured as the fraction of independent trials where the system displays the optimal solution at inference (minimal MFPT with zero WHR).
The same empirical criterion can be applied to the effective theory. More interestingly, the intuition likelihood can also be calculated analytically if the optimal route is known. We define it as the joint probability of generating the true shortest path to the exit, for a horizon of $q$ steps (where $q$ depends on the maze topology). Let the optimal path be the sequence of states $z^{*}=(s_{0}^{*},s_{1}^{*},\dots,s_{q}^{*})$ , where $s_{0}^{*}$ is the starting position, and let $a_{t}^{*}$ be the optimal action to transition from $s_{t}^{*}$ to $s_{t+1}^{*}$ . The intuition likelihood for a given rationality level $m$ is:
$$
p_{\text{intuition}}(m)=\prod_{t=0}^{q-1}p_{m,\beta}(a_{t}^{*}|s_{t}^{*})
$$
Since the system can be multistable, the final value reported in the figure for a given $\lambda$ is the Boltzmann-weighted average of this likelihood over all coexisting free energy minima ( $m^{*},m^{**},\dots$ ):
$$
\langle p_{\text{intuition}}\rangle_{\lambda}=\sum_{i}w_{i}(\lambda)\cdot p_{\text{intuition}}(m_{i})\quad\text{where}\quad w_{i}(\lambda)=\frac{e^{-\hat{\beta}\mathcal{F}_{\lambda}(m_{i})}}{\sum_{j}e^{-\hat{\beta}\mathcal{F}_{\lambda}(m_{j})}},
$$
where $\hat{\beta}$ is an inverse temperature controlling the sampling of minima. At high $\hat{\beta}$ (low thermal noise), the system predominantly samples the global minimum, reproducing the steady-state results of the main experiments. In the numerical experiments, each run starts from a random weight initialization, and gradient descent acts as a local search that can fall into any of the attracting states. The likelihood metric is therefore zero in the imitation and hallucination phases (where the probability of following the optimal path is negligible) and peaks sharply in the narrow intuition window, provided the policy’s inference $\beta$ is sufficiently high.
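The path product and the Boltzmann-weighted average translate directly into code. A minimal sketch, with a hypothetical near-deterministic policy and hypothetical free-energy values for two coexisting minima:

```python
import numpy as np

def p_intuition(policy, optimal_path):
    """Product of p_{m,beta}(a_t* | s_t*) along the known shortest path.

    policy       : dict mapping (state, action) -> probability (hypothetical values)
    optimal_path : list of (s_t*, a_t*) pairs for t = 0 .. q-1
    """
    return float(np.prod([policy[sa] for sa in optimal_path]))

def boltzmann_average(p_values, F_values, beta_hat):
    """Boltzmann-weighted average over coexisting minima, w_i ∝ exp(-beta_hat F_i)."""
    F = np.asarray(F_values, dtype=float)
    w = np.exp(-beta_hat * (F - F.min()))   # shift by F.min() for numerical stability
    w /= w.sum()
    return float(w @ np.asarray(p_values, dtype=float))

# Hypothetical 3-step optimal path with probability 0.9 per optimal action:
policy = {(0, "R"): 0.9, (1, "R"): 0.9, (2, "D"): 0.9}
p = p_intuition(policy, [(0, "R"), (1, "R"), (2, "D")])         # 0.9**3
avg = boltzmann_average([p, 0.0], F_values=[0.0, 2.0], beta_hat=5.0)
```

At high `beta_hat` the weight of the global minimum dominates, recovering the steady-state behavior described above.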
### D.3 C. From Intuition to Inspiration: Further Predictions of the Effective Theory
The intuition phase represents a significant discovery: the model finds a hidden, optimal solution that branches smoothly from the data-driven imitation phase. It is a better strategy, but not a radical departure. Intriguingly, the theory predicts the existence of a distinct, more profound cognitive phase: inspiration. Inspiration is not a continuous improvement but an abrupt jump to a qualitatively different state of insight. It corresponds to the emergence of a new, globally optimal minimum in the free-energy landscape with rationality parameter $m\approx 1$ . A model in the inspiration phase does not merely approximate the optimal policy; it *knows* the solution is correct. This internalized understanding would manifest through a key operational signature: the model could execute the optimal strategy robustly even with a stochastic policy (low inference $\beta$ ), distinguishing it from the more tentative intuitive state.
The theory predicts that the imagination temperature $\beta_{\text{dream}}$ —the policy stochasticity in the entropy term—is a key parameter for accessing these states (Fig. S4). At low $\beta_{\text{dream}}$ , the intuition phase ( $m>m_{D}$ ) is unstable. It emerges in a stable window only for sufficiently large $\beta_{\text{dream}}$ . At even higher values, this stable intuition branch can bifurcate into two locally stable solutions: the familiar intuition phase and this hidden inspiration phase ( $m\approx 1$ ). Both can coexist while the hallucination phase ( $m\ll m_{D}$ ) remains the global attractor. Observing this more exotic inspiration phase in practice would likely require careful tuning protocols, potentially starting from the intuition phase and employing non-equilibrium dynamics.
<details>
<summary>FigureS4.png Details</summary>

### Visual Description: Line Plots with Multiple Metrics Across β_dream Values
### Overview
The image presents three sets of line plots (A, B, C) corresponding to β_dream values of 2.0, 5.0, and 20.0. Each set contains three subplots:
1. Cross-entropy vs. path-entropy curves
2. Normalized MFPT/WHR vs. λ (log scale)
3. Rationality minima vs. λ (log scale)
All plots share a logarithmic x-axis (λ) and metric-based y-axes.
---
### Components/Axes
#### Common Elements
- **X-axis**: λ (log scale, 10⁻⁴ to 10⁴)
- **Y-axes**:
- Subplot 1: Expected metric values (0.0–1.4)
- Subplot 2: Normalized MFPT/WHR (0.0–1.0)
- Subplot 3: Rationality minima (0.0–1.0)
- **Legends**:
- Red solid: Cross-entropy (⟨E⟩_λ)
- Blue dashed: Path-entropy (⟨H⟩_λ)
- Black solid: MFPT (Mean First Passage Time)
- Red dashed: WHR
- Blue shaded: Intuition region (subplot 2 only)
- Purple dashed: m_d = 0.7 threshold (subplot 3 only)
#### Spatial Grounding
- **Legend placement**: Upper-right corner of each subplot
- **Intuition shading**: Centered in subplot 2 (λ ≈ 10⁻¹ to 10¹)
- **Critical points insets**: Top-right corner of subplot 1
---
### Detailed Analysis
#### Subplot 1: Entropy Curves
- **β = 2.0 (A)**:
- Cross-entropy (red) rises sharply at λ ≈ 10⁻¹, plateauing at 1.35.
- Path-entropy (blue) remains flat until λ ≈ 10⁻², then increases to 1.25.
- Critical points: E peaks at λ ≈ 10⁻¹ (derivative ≈ 0.15), H peaks at λ ≈ 10⁻² (derivative ≈ 0.05).
- **β = 5.0 (B)**:
- Cross-entropy rises at λ ≈ 10⁻², reaching 1.3.
- Path-entropy increases at λ ≈ 10⁻³, peaking at 1.15.
- Critical points: E derivative ≈ 0.2, H derivative ≈ 0.1.
- **β = 20.0 (C)**:
- Cross-entropy rises abruptly at λ ≈ 10⁻¹, reaching 1.3.
- Path-entropy increases at λ ≈ 10⁻², peaking at 1.1.
- Critical points: E derivative ≈ 0.6, H derivative ≈ 0.3.
#### Subplot 2: MFPT/WHR
- **β = 2.0 (A)**:
- MFPT (black) drops from 1.0 to 0.2 at λ ≈ 10⁻¹.
- WHR (red) remains flat until λ ≈ 10⁻¹, then rises to 0.8.
- Intuition region (blue) spans λ ≈ 10⁻¹ to 10¹.
- **β = 5.0 (B)**:
- MFPT drops sharply at λ ≈ 10⁻², reaching 0.2.
- WHR rises at λ ≈ 10⁻², peaking at 0.6.
- Intuition region expands to λ ≈ 10⁻² to 10².
- **β = 20.0 (C)**:
- MFPT drops to 0.1 at λ ≈ 10⁻².
- WHR rises to 0.4 at λ ≈ 10⁻².
- Intuition region spans λ ≈ 10⁻² to 10³.
#### Subplot 3: Rationality Minima
- **β = 2.0 (A)**:
- Global minima (black) drops to 0.2 at λ ≈ 10⁻¹.
- Second minima (black dashed) remains at 0.6.
- m_d = 0.7 threshold (purple) is horizontal.
- **β = 5.0 (B)**:
- Global minima drops to 0.1 at λ ≈ 10⁻².
- Second minima (black dashed) appears at λ ≈ 10⁻¹.
- m_d = 0.7 threshold intersects second minima.
- **β = 20.0 (C)**:
- Global minima drops to 0.05 at λ ≈ 10⁻².
- Second/third minima (black dashed/dotted) are negligible.
- m_d = 0.7 threshold aligns with global minima.
---
### Key Observations
1. **Entropy Sensitivity**:
- Higher β_dream increases sensitivity to λ, with sharper entropy transitions.
- Critical point derivatives (E, H) scale with β_dream (e.g., β=20: E derivative ≈ 0.6 vs. β=2: 0.15).
2. **MFPT/WHR Dynamics**:
- MFPT decreases monotonically with λ, while WHR increases after a threshold.
- Intuition regions expand with β_dream, suggesting broader alignment between metrics and model behavior.
3. **Rationality Minima**:
- Global minima improve (lower values) with higher β_dream, especially at λ ≈ 10⁻².
- Second/third minima vanish at β=20, indicating a single dominant solution.
---
### Interpretation
- **β_dream Role**:
- Higher β_dream amplifies the impact of λ on entropy and metric values, reflecting stronger regularization or constraint enforcement.
- The intuition regions (blue shading) likely represent λ ranges where the model’s learned representations align optimally with the metric.
- **Practical Implications**:
- For β=20, the sharp drop in global minima (subplot 3) suggests λ ≈ 10⁻² is optimal for minimizing rationality loss.
- The disappearance of higher-order minima at β=20 implies a unique global solution dominates, reducing overfitting risks.
- **Anomalies**:
- In β=5 (B), the intuition region overlaps with the second minima in subplot 3, hinting at a trade-off between intuition and metric alignment.
- Cross-entropy plateaus at ~1.3 across all β values, indicating a saturation point in metric discrimination.
</details>
Figure S4: Dependence of theoretical predictions on the imagination temperature $\beta_{\text{dream}}$ . The theoretical phase diagram is shown for increasing values of $\beta_{\text{dream}}=\{2.0,5.0,20.0\}$ . This parameter controls the policy stochasticity in the self-referential entropy calculation. (A-C) As $\beta_{\text{dream}}$ increases, the system’s phase diagram (bottom row) changes. Higher values of this temperature can also reveal more complex phase structures, including the emergence of the inspiration phase, as discussed in the text. Insets in the first row (here and in the main text) show the numerical first derivatives of the cross-entropy and path-entropy at low sampling temperature in equilibrium (thus for global attractors). The separation between the peaks of these discontinuities (B,C) signals the entropy- and energy-driven transitions that delimit the intuition window.