# Intuition emerges in Maximum Caliber models at criticality
**Authors**: Lluís Arola-Fernández
> Instituto de Física Interdisciplinar y Sistemas Complejos IFISC (CSIC-UIB), Campus UIB, 07122 Palma de Mallorca, Spain; Departament d'Enginyeria Informàtica i Matemàtiques, Universitat Rovira i Virgili, 43007 Tarragona, Catalonia, Spain (current address).
(September 26, 2025)
## Abstract
Whether large predictive models merely parrot their training data or produce genuine insight lacks a physical explanation. This work reports a primitive form of intuition that emerges as a metastable phase of next-token prediction under future path-entropy maximization. The intuition mechanism is discovered via mind-tuning, the minimal principle that imposes Maximum Caliber in predictive models with a control temperature-like parameter $\lambda$ . Training on random walks in deterministic mazes reveals a rich phase diagram: imitation (low $\lambda$ ), rule-breaking hallucination (high $\lambda$ ), and a fragile in-between window exhibiting strong protocol-dependence (hysteresis) and multistability, where models spontaneously discover novel goal-directed strategies. These results are captured by a mechanistic low-dimensional theory and frame intuition as an emergent property at the critical balance between memorizing what is and wondering what could be.
Introduction.— The rise of large-scale predictive models is reshaping artificial intelligence and transforming science and society. This progress is built upon a dominant scaling paradigm: pre-training autoregressive neural networks [1] with enormous parameter counts on vast volumes of data [2] using massive compute resources [3]. When coupled with powerful search at inference time [4], this approach has yielded impressive performance in complex games [5], medical diagnosis [6] and algorithmic discovery [7]. Yet, the brute-force solution does not match the elegant efficiency of natural intelligence, which discovers intuitive shortcuts and novel, creative strategies from sparse data without rewards [8]. This contrast sharpens a foundational debate: are these models showing sparks of artificial general intelligence (AGI) [9], or are they "stochastic parrots" [10] that leverage vast experience to create an illusion of thought [5, 11]? While often addressed via complex reasoning benchmarks [12], the paradigm's limits can be distilled into a simple Gedankenexperiment (Fig. 1).
Figure 1: Gedankenexperiment on emergent reasoning. A minimal environment abstracts a reasoning task into its essential components: a constrained space (a maze) and a hidden, optimal solution (to escape). The reader's own intuition immediately grasps the task, yet a standard predictive model trained on random walk trajectories (i.e., non-intelligent data without rewards) will never discover it.
This work provides a physical explanation for this leap. We introduce mind-tuning, a simple principle that balances next-token prediction against future path-entropy maximization with a temperature-like parameter $\lambda$ . To our knowledge, mind-tuning is the minimal implementation of the Maximum Caliber (MaxCal) principle [13, 14, 15] compatible with autoregressive training. It reveals the emergence of a fragile metastable phase, within a narrow temperature window between imitation and hallucination regimes, that is reminiscent of intuition.
While our intuition mechanism points toward a horizon of diverse futures to explore, the prevailing paradigm remains blind, fixated only on predicting the next token. Constrained path-entropy maximization is already implicit in intrinsic motivation frameworks [16] like Causal Entropic Forces [17], Active Inference [18], Empowerment [19], or the Maximum Occupancy Principle [20]. Yet, a physical basis for such emergent behavior in pure predictive models has remained elusive. The metastable regime reported here, bounded by distinct entropy- and energy-driven transitions with strong hysteresis and multistability, explains why emergent reasoning is both rare and protocol-dependent. Furthermore, the high-dimensional mechanisms behind this phenomenology are captured analytically by a low-dimensional theory.
This perspective casts intelligence as a state of computational matter [21], building on a rich history of minimal models for emergent cognitive behavior, from Hopfield's memory [22] and Kuramoto's synchronization [23] to phenomena in deep learning like double-descent [24], grokking [25], neural collapse [26], symmetry-breaking [27], and collective learning [28], often analyzed through spin-glass analogies [29] and phase diagrams [30, 28]. The phase-transition picture is key to research showing that intelligent systems may operate near a critical point, at the "edge of chaos" [31, 32, 33]. At criticality, fluctuations and system responsiveness peak [31, 34], creating the ideal conditions for the leap from mimicry to insight. In the learning problem, our theory points toward a critical scaling axis driven by the system's intrinsic dynamics and suggests that current models operate in a suboptimal imitation phase, lacking the intuition that a physical mechanism unlocks.
Mind-tuning.— We focus on reasoning problems solvable by generating trajectories $z=(x_{0},a_{0},x_{1},a_{1},\dots)$ . The system's behavior is governed by a policy $\pi_{\theta,\beta}$ , a neural network with parameters $\theta$ that maps a data history $h_{t}=(x_{0},a_{0},\dots,x_{t})$ to a probability distribution over a discrete set of actions $\mathcal{A}$ via a softmax function
$$
\pi_{\theta,\beta}(a_{t}\!\mid\!h_{t})=\frac{e^{\beta\,\ell_{\theta}(h_{t},a_{t})}}{\sum_{a^{\prime}\in\mathcal{A}}e^{\beta\,\ell_{\theta}(h_{t},a^{\prime})}}, \tag{1}
$$
where $\ell_{\theta}$ are the network's output logits and $\beta$ controls the policy's stochasticity. This general setting includes state–decision spaces, standard autoregressive models whose histories contain tokens, and other representations (see SM Sec. S1 for implementation details).
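As a concrete illustration, the softmax policy of Eq. (1) can be sketched in a few lines of numpy. The four-action logit vector below is a hypothetical stand-in for the network output $\ell_{\theta}(h_{t},\cdot)$, not the paper's actual architecture.

```python
import numpy as np

def softmax_policy(logits, beta=1.0):
    """Eq. (1): Boltzmann/softmax distribution over actions from logits.

    beta -> 0 gives a uniform (maximally stochastic) policy;
    large beta concentrates the mass on the argmax action.
    """
    z = beta * np.asarray(logits, dtype=float)
    z -= z.max()              # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Hypothetical logits for one maze step with 4 actions (up, down, left, right)
logits = np.array([2.0, 1.0, 0.5, 0.5])
p = softmax_policy(logits, beta=1.0)
```

At `beta=0.0` the same function returns the uniform distribution, which is the fully stochastic limit of the policy.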
To isolate the intuition mechanism, we assume an offline, imperfect setting [35]: the model never interacts with the environment, has no external rewards, and learns from a dataset $\mathcal{D}$ of non-optimal histories. How can a purely predictive model discover a better solution than what it has seen? By biasing prediction toward futures with high causal path diversity, as prescribed by the Maximum Caliber (MaxCal) principle [13]: among all dynamics consistent with known constraints, prefer those that maximize the entropy of trajectories.
The most unbiased learning objective that imposes MaxCal is the free-energy-like functional
$$
\mathcal{F}_{\lambda,\beta,\tau}(\theta)=\mathcal{E}_{\beta}(\theta)-\lambda\mathcal{H}_{\tau,\beta}(\theta), \tag{2}
$$
where $\lambda\!\geq\!0$ is an effective temperature controlling the energy–entropy trade-off. The first term is the standard Cross-Entropy or negative log-likelihood ( $\mathcal{E}$ ), measuring the cost of imitating the training data
$$
\mathcal{E}_{\beta}(\theta)=\left\langle-\log\pi_{\theta,\beta}(a_{t}|h_{t})\right\rangle_{(h_{t},a_{t})\in\mathcal{D}}. \tag{3}
$$
This energy $\mathcal{E}$ is traded against the causal path-entropy $\mathcal{H}$ , a Shannon entropy of self-generated futures up to a horizon of length $\tau$
$$
\mathcal{H}_{\tau,\beta}(\theta)=\left\langle\frac{1}{\tau}\left\langle-\ln P(z_{\text{future}}|h_{t})\right\rangle_{z_{\text{future}}\sim\pi_{\theta,\beta}}\right\rangle_{h_{t}\in\mathcal{D}}. \tag{4}
$$
Eq. (4) is estimated over the cone of futures induced by the model itself (see SM Sec. S2B for entropy calculations), making the objective function inherently subjective and self-referential, as the internal beliefs dynamically shape the learning landscape. The gradient update
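The path-entropy of Eq. (4) can be estimated by Monte Carlo rollouts of the model's own policy. The sketch below assumes hypothetical `policy` and `step` callables standing in for the model's belief and the environment's transition; the self-looping toy environment and uniform policy are only a sanity check, not the maze experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

def path_entropy(policy, step, h0, tau=5, n_samples=100):
    """Monte Carlo estimate of Eq. (4): per-step entropy of
    self-generated futures of length tau starting from history h0.

    policy(state) -> probability vector over actions (model's belief)
    step(state, a) -> next state (assumed deterministic environment)
    """
    total = 0.0
    for _ in range(n_samples):
        s, logp = h0, 0.0
        for _ in range(tau):
            p = policy(s)
            a = rng.choice(len(p), p=p)
            logp += np.log(p[a])    # log P(z_future | h_t) factorizes
            s = step(s, a)
        total += -logp              # minus log-likelihood of this rollout
    return total / (n_samples * tau)  # average per step, as in Eq. (4)

# Sanity check: a state-independent uniform policy over 4 actions
uniform = lambda s: np.ones(4) / 4
H = path_entropy(uniform, lambda s, a: s, h0=0)
# For the uniform policy every rollout gives ln(4) per step, H ≈ 1.386
```

In the full objective this estimate would additionally be averaged over histories $h_{t}\in\mathcal{D}$, as the outer bracket of Eq. (4) prescribes.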
$$
\theta(t+1)\leftarrow\theta(t)+\eta[{-\nabla_{\theta}\mathcal{E}_{\beta}(\theta)}+{\lambda\nabla_{\theta}\mathcal{H}_{\tau,\beta}(\theta)}] \tag{5}
$$
frames learning as a competition between prediction and causal entropic forces acting on the system's degrees of freedom, i.e. the network weights. To our knowledge, this self-contained mechanism is the minimal MaxCal implementation compatible with prevalent offline auto-regressive training. Unlike surprise-minimization [36, 37], here the entropic term rewards keeping plausible futures open, pulling toward the adjacent possible [38], without environment interaction [19, 20, 39]. The framework also admits a Bayesian interpretation [40, 41]: standard auto-regressive training uses flat priors. In mind-tuning, instead, the data likelihood filters an optimistic entropic prior over futures with high diversity.
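The competition in Eq. (5) can be sketched on a one-parameter toy landscape. The two gradient functions below are purely illustrative (a quadratic pull toward the data optimum versus a quadratic entropic pull away from it), not the paper's network gradients; they only show how the update balances the two forces.

```python
def mind_tuning_step(theta, grad_E, grad_H, lam=0.1, eta=0.01):
    """Eq. (5): one gradient step trading prediction against the
    causal entropic force, scaled by the temperature-like lambda."""
    return theta + eta * (-grad_E(theta) + lam * grad_H(theta))

# Illustrative scalar landscape: E = theta**2 pulls toward the data
# optimum at 0; the entropic term -(theta-1)**2 pulls toward 1.
grad_E = lambda th: 2.0 * th
grad_H = lambda th: -2.0 * (th - 1.0)

theta = 0.5
for _ in range(2000):
    theta = mind_tuning_step(theta, grad_E, grad_H, lam=0.5, eta=0.01)
# The forces balance at theta* = lam / (1 + lam); here 0.5/1.5 = 1/3
```

The fixed point interpolates between pure imitation ($\lambda=0$, $\theta^{*}=0$) and pure entropy maximization ($\lambda\to\infty$, $\theta^{*}\to 1$), mirroring the energy–entropy trade-off of Eq. (2).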
Experiments.— We test this principle in the minimal sandbox of the Gedankenexperiment (Fig. 1). A model is trained on constrained random-walk trajectories, which respect the maze walls but contain no intelligent strategies for escaping. Sweeping the parameter $\lambda$ yields a rich phase diagram, with clear transitions in both genotype (Fig. 2 A) and phenotype (Fig. 2 B) metrics.
Figure 2: Experimental phase diagram. Sweeping $\lambda$ reveals three behavioral phases. (A) Genotype metrics: Cross-Entropy ( $\mathcal{E}$ ) and causal path-entropy ( $\mathcal{H}$ ). Inset: steady-state fluctuations $\sigma$ across different initial realizations as a function of $\lambda$ . (B) Phenotype metrics: Mean First Passage Time (MFPT), Wall Hit Ratio (WHR) and intuition likelihood (see SM Sec. S4B). (C-E) Example trajectories for each phase: (C) Imitation, (D) Intuition, and (E) Hallucination.
For low $\lambda$ , the system is in an imitation phase: cross-entropy is low, path-entropy is low, and trajectories reproduce the suboptimal random walks from the data, leading to a high Mean First Passage Time (MFPT) to the exit (Fig. 2 C). For high $\lambda$ , the entropic term dominates and the system enters a hallucination phase: cross- and path-entropy are high; maze rules are broken to maximize path diversity, and the Wall Hit Ratio (WHR) increases sharply (Fig. 2 E). Between these two regimes lies a narrow intuition phase, where the trade-off between $\mathcal{E}$ and $\mathcal{H}$ yields an emergent strategy: the model discovers the shortest legal path to the exit (Fig. 2 D), achieving minimal MFPT with zero WHR. The separation between the fluctuation peaks of $\mathcal{E}$ and $\mathcal{H}$ (Fig. 2 A inset) reveals distinct entropy- and energy-driven phase boundaries.
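The two phenotype metrics have simple operational definitions, sketched below on a toy one-dimensional corridor. The helper names (`phenotype_metrics`, `is_wall_move`) and the corridor itself are hypothetical illustrations, not the paper's evaluation code; here a blocked move is modeled as a repeated state.

```python
import numpy as np

def phenotype_metrics(trajectories, exit_state, is_wall_move):
    """MFPT: mean number of steps before first reaching exit_state.
    WHR:  fraction of moves that attempt to cross a wall."""
    fpts, hits, moves = [], 0, 0
    for traj in trajectories:
        for t, (s, s_next) in enumerate(zip(traj, traj[1:]), start=1):
            moves += 1
            hits += is_wall_move(s, s_next)
            if s_next == exit_state:
                fpts.append(t)   # first-passage time for this trajectory
                break
    mfpt = float(np.mean(fpts)) if fpts else float("inf")
    return mfpt, hits / moves

# Toy corridor with states 0..3 and the exit at 3; a move that leaves
# the state unchanged is counted as a wall hit in this illustration.
trajs = [[0, 1, 2, 3], [0, 0, 1, 2, 3]]
mfpt, whr = phenotype_metrics(trajs, exit_state=3,
                              is_wall_move=lambda s, t: s == t)
# First trajectory reaches the exit in 3 steps, the second in 4 with
# one blocked move, so MFPT = 3.5 and WHR = 1/7
```

In the imitation phase MFPT is high with WHR near zero; in the hallucination phase WHR rises sharply; the intuition phase combines minimal MFPT with zero WHR.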
Figure 3: Hysteresis and protocol-dependence. Comparing a forward (solid) and backward (dashed) sweep of $\lambda$ reveals that the intuitive state is stable once found. (A) Hysteresis loop in genotype metrics ( $\mathcal{E},\mathcal{H}$ ). (B, C) Phenotype for the forward and backward sweeps, respectively, with the forward sweep showing a wider intuition window. (D) The mean network weight $\langle w\rangle$ acts as an order parameter capturing the system's bistability. Insets show baselines without protocol.
Operationally, this critical learning phase maximizes future path-entropy with minimal cross-entropy, enabling novel, goal-directed behavior at inference without interaction or explicit rewards. Reaching this phase depends on data quality and model complexity, requiring a sufficiently large future horizon and adequate model capacity (see SM Sec. S3 for a parametric study). The fragility of the mechanism is tied to multistability, as observed when applying adiabatic protocols that smoothly sweep the control parameter $\lambda$ (Fig. 3). A large hysteretic loop appears in the genotype metrics (A), which has behavioral consequences in the phenotype: while a forward sweep from $\lambda\approx 0$ opens the intuition window, with low MFPT and low WHR (B), a backward sweep starting from high $\lambda$ does not reach the desired phase (C). The bistability is captured by an effective order parameter, the mean network weight, which remains in an ordered intuitive state once the system has been guided there (D). The adiabatic protocol shows that a self-referential fine-tuning from imitation to controlled imagination allows the system to stabilize in a metastable phase, a process that motivates the term mind-tuning.
Effective theory.— The phenomenology of mind-tuning emerges from a high-dimensional, multistable free-energy landscape. We capture the essential mechanism in a scalar order parameter $m\in[0,1]$ , representing the model's rationality, and define a Boltzmann policy with an effective potential $U_{m}(a)$ :
$$
p_{m,\beta}(a|h_{t})=\frac{e^{-\beta U_{m}(a)}}{\sum_{a^{\prime}\in\mathcal{A}}e^{-\beta U_{m}(a^{\prime})}}. \tag{6}
$$
Actions, or decisions, are classified as optimal $a^{*}$ , rational-but-suboptimal $a^{r}$ , or non-rational $a^{n}$ , and $m_{D}$ is a free parameter representing the training data's rationality. The effective costs,
$$
U(a^{*})=0,\qquad U(a^{r})=\frac{\max(0,\,m-m_{D})}{1-m_{D}},\qquad U(a^{n})=m, \tag{7}
$$
are designed to create a trade-off: as the model's rationality $m$ improves beyond the data's, the cost of suboptimal-but-legal actions grows, forcing a choice between true optimality and rule-breaking. For the simple Markovian maze with a small state-space, the free energy $\mathcal{F}_{\lambda}(m)$ can be computed analytically (see SM Sec. S4A). For a given $\lambda$ , one can also explore the learning dynamics in this landscape by sampling rationality states $m$ from the equilibrium distribution $P(m)\propto e^{-\hat{\beta}\mathcal{F}(m)}$ , where the inverse temperature $\hat{\beta}$ controls the exploration-exploitation trade-off, modeling stochasticity during gradient descent.
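Eqs. (6)–(7) and the equilibrium sampling can be sketched directly. Since the analytic $\mathcal{F}_{\lambda}(m)$ lives in the SM, the double-well landscape below is an illustrative stand-in chosen only to show how bistability appears in the sampled rationality states; the well positions and $\hat{\beta}$ value are assumptions.

```python
import numpy as np

def costs(m, m_D=0.7):
    """Eq. (7): effective costs of optimal, rational, non-rational actions."""
    return np.array([0.0, max(0.0, m - m_D) / (1.0 - m_D), m])

def boltzmann_policy(m, beta=2.0, m_D=0.7):
    """Eq. (6): Boltzmann policy over the three action classes."""
    w = np.exp(-beta * costs(m, m_D))
    return w / w.sum()

def sample_rationality(F, beta_hat, n=10000, seed=1):
    """Sample m from P(m) ∝ exp(-beta_hat * F(m)) on a grid, modeling
    the stochastic exploration of the landscape during learning."""
    rng = np.random.default_rng(seed)
    m_grid = np.linspace(0.0, 1.0, 201)
    logp = -beta_hat * np.array([F(m) for m in m_grid])
    p = np.exp(logp - logp.max())   # shift for numerical stability
    p /= p.sum()
    return rng.choice(m_grid, size=n, p=p)

# Illustrative double well with minima at m = 0.2 and m = 0.8
# (a stand-in for the analytic free energy of SM Sec. S4A)
F = lambda m: (m - 0.2) ** 2 * (m - 0.8) ** 2
samples = sample_rationality(F, beta_hat=2000.0)
# At low effective temperature the samples cluster around the two wells
```

A bimodal sample histogram is the signature of the coexisting phases discussed below; raising the effective temperature ($\hat{\beta}\to 0$) instead spreads the mass across the whole landscape.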
This effective theory qualitatively reproduces the experimental phase diagram, including the transitions in both genotypic (Fig. 4 A) and phenotypic metrics (Fig. 4 B). The underlying mechanism is revealed by exploring the minima of the free-energy landscape, found by solving $\partial\mathcal{F}_{\lambda}(m)/\partial m=0$ . This analysis confirms a smooth, entropy-driven transition followed by an abrupt, first-order energy-driven one, creating a bistable region where intuition ( $m>m_{D}$ ) and hallucination ( $m\ll m_{D}$ ) coexist (Fig. 4 C). Intriguingly, the theory further predicts a more elusive inspiration phase: a third stable solution with $m\approx 1$ , associated with a state of true creative insight. This strategy abruptly departs from the data and represents internalized understanding. Unlike the subtle intuitive state, which often requires a high inference $\beta$ to be executed without error, this inspired solution would be robust even with a noisy policy. Yet, it is hidden within a tiny basin of attraction masked by the dominant hallucination phase (see SM Sec. S4C). These predictions point to a very rich phase diagram, where intuition may be the trigger of even more exotic phenomena.
Figure 4: Theoretical predictions. The low-dimensional model reproduces the experimental findings. (A) Theoretical $\mathcal{E}$ and $\mathcal{H}$ vs. $\lambda$ . (B) Corresponding MFPT and WHR. (C) Minima of the free-energy landscape vs the control parameter $\lambda$ . The plot reveals coexisting stable states ( $m^{*},m^{**},m^{***}$ ) and a first-order transition where the global minimum jumps discontinuously, explaining the observed hysteresis.
Accessing these different cognitive phases requires navigating a complex landscape. Indeed, the observed hysteresis and the success of the adiabatic protocol are explained by this multistability. The analytical phase diagram (Fig. 4C) shows that slowly increasing $\lambda$ is a safe route to guide the system into the intuition basin of attraction: in Bayesian terms, it first grounds the model in the data likelihood before introducing the entropic prior. Reaching more exotic phases in the landscape, like the predicted inspiration state, would likely demand more complex, non-equilibrium protocols.
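As an illustration, the adiabatic route can be sketched as a warm-started sweep over a logarithmic $\lambda$ schedule. Here `train_step` is a hypothetical closure standing in for one epoch of mind-tuning at fixed $\lambda$, and the schedule endpoints are illustrative rather than the exact protocol used in the experiments.

```python
import numpy as np

def adiabatic_sweep(train_step, lam_min=1e-3, lam_max=1e2, n_lam=30, epochs_per_lam=300):
    """Slowly ramp the entropic weight lambda so the model is first grounded
    in the data likelihood before the path-entropy prior dominates.
    `train_step(lam)` is a hypothetical closure that runs one training epoch
    at fixed lambda and returns the current free energy F."""
    history = []
    for lam in np.geomspace(lam_min, lam_max, n_lam):   # logarithmic schedule
        for _ in range(epochs_per_lam):
            f = train_step(lam)
        # warm start: the model weights carry over between lambda values
        history.append((lam, f))
    return history
```

The key design choice is that the optimizer state and weights are never reset between $\lambda$ values, which is what keeps the system inside the intuition basin instead of jumping to the hallucination branch.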
Discussion.— High-quality human data can carry an implicit drive toward path diversity, and optimization itself can induce entropic pressures that improve generalization [42], yielding an "intelligent simulator" from curated experience. This view predicts that current models should spontaneously increase their causal path entropy with scale. Our framework makes this drive explicit and grounded in MaxCal, providing a shortcut to intuition that encodes implicit search into model weights to reduce the need for expensive search at inference [43]. These results point toward a hidden axis, training-time imagination, that may be key to unlocking out-of-distribution generalization in offline predictive models [35].
Our results are demonstrated in a minimal sandbox, a choice that is deliberate. The maze is the simplest non-trivial setting where the mechanism can be isolated and reproduced analytically. Many reasoning tasks can be viewed as navigation through a "conceptual maze" where a key insight unlocks a vastly larger state space [17, 19, 20, 21]. This analogy suggests applications in control [17, 20], reasoning [8, 44], and planning [44]. Stefan Zweig's The Royal Game [45] provides a compelling literary analogue: a prisoner achieves chess mastery by first studying games (imitation) and then playing against himself in his mind (imagination). His triumph occurs at the edge of madness, a state mirroring the coexistence of intuition and hallucination in our phase diagram.
Yet, scaling mind-tuning to real-world cases faces significant challenges. Computationally, estimating path-entropy over long horizons is hard due to the combinatorial explosion of futures [13]; this requires clever sampling strategies [17, 46], perhaps inspired by dreaming, hierarchical reasoning [44], or unconventional methods and architectures [47, 48]. Theoretically, a full characterization of the phase diagrams and universality classes is needed to design optimal tuning protocols [49]. For uncharted domains, identifying the right spaces for entropy maximization can be difficult, and the offline theory may need data augmentation from environment interaction [18]. Moreover, tuning $\lambda$ for future diversity in practice can turn into an alignment problem, trading benefits for safety [50]. Despite these challenges, this work takes a high-risk, high-reward route to reframing intelligence not merely as compression and computation, but as a physical phenomenon emerging at criticality.
Acknowledgments.— The author thanks many colleagues at IFISC and URV for enriching discussions. This work has been partially supported by the María de Maeztu project CEX2021-001164-M funded by the MICIU/AEI/10.13039/501100011033 and by Programa Maria Goyri URV.
## References
- Vaswani et al. [2017] A. Vaswani et al., Attention is all you need, in Adv. in Neural Info. Processing Systems, Vol. 30 (2017).
- Kaplan et al. [2020] J. Kaplan et al., Scaling laws for neural language models (2020), arXiv:2001.08361 [cs.LG] .
- Hoffmann et al. [2022] J. Hoffmann et al., Training compute-optimal large language models, arXiv preprint (2022), 2203.15556 .
- DeepSeek-AI et al. [2025] DeepSeek-AI et al., Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning (2025), arXiv:2501.12948 [cs.CL] .
- Shojaee et al. [2024] P. Shojaee et al., The illusion of thinking, arXiv preprint (2024), 2401.00675 .
- Brodeur et al. [2024] P. G. Brodeur et al., Superhuman performance of a large language model on the reasoning tasks of a physician, arXiv preprint (2024), 2412.10849 .
- Novikov et al. [2025] A. Novikov et al., Alphaevolve: A coding agent for scientific and algorithmic discovery (2025), arXiv:2506.13131 .
- Chollet [2019] F. Chollet, On the measure of intelligence, arXiv preprint (2019), 1911.01547 .
- Bubeck et al. [2023] S. Bubeck et al., Sparks of artificial general intelligence: Early experiments with gpt-4, (2023), 2303.12712 .
- Bender et al. [2021] E. M. Bender et al., On the dangers of stochastic parrots: Can language models be too big?, in Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (2021) pp. 610–623.
- Mitchell and Krakauer [2023] M. Mitchell and D. C. Krakauer, The debate over understanding in aiâs large language models, PNAS 120, e2215907120 (2023).
- Liang et al. [2022] P. Liang et al., Holistic evaluation of language models, arXiv preprint (2022), 2211.09110 .
- Jaynes [1980] E. T. Jaynes, The minimum entropy production principle, Ann. Rev. of Physical Chemistry 31, 579 (1980).
- Pressé et al. [2013] S. Pressé, K. Ghosh, J. Lee, and K. A. Dill, Principles of maximum entropy and maximum caliber in statistical physics, Reviews of Modern Physics 85, 1115 (2013).
- Dixit et al. [2018] P. D. Dixit et al., Perspective: Maximum caliber is a general variational principle for dynamical systems, The Journal of Chemical Physics 148, 010901 (2018).
- Kiefer [2025] A. B. Kiefer, Intrinsic motivation as constrained entropy maximization, arXiv preprint (2025), 2502.02962 .
- Wissner-Gross and Freer [2013] A. D. Wissner-Gross and C. E. Freer, Causal entropic forces, Physical Review Letters 110, 168702 (2013).
- Wen [2025] B. Wen, The missing reward: Active inference in the era of experience (2025), arXiv:2508.05619 .
- Klyubin et al. [2005] A. S. Klyubin, D. Polani, and C. L. Nehaniv, Empowerment: A universal agent-centric measure of control, in 2005 IEEE CEC, Vol. 1 (2005) pp. 128–135.
- Ramirez-Ruiz et al. [2024] J. Ramirez-Ruiz et al., Complex behavior from intrinsic motivation to occupy future action-state path space, Nature Communications 15, 5281 (2024).
- Friston et al. [2022] K. J. Friston et al., Designing ecosystems of intelligence from first principles, arXiv preprint (2022), 2212.01354 .
- Hopfield [1982] J. J. Hopfield, Neural networks and physical systems with emergent collective computational abilities, Proceedings of the National Academy of Sciences 79, 2554 (1982).
- Kuramoto [1975] Y. Kuramoto, Self-entrainment of a population of coupled non-linear oscillators, in International Symposium on Mathematical Problems in Theoretical Physics (Springer, 1975) pp. 420–422.
- Belkin et al. [2019] M. Belkin, D. Hsu, S. Ma, and S. Mandal, Reconciling modern machine-learning practice and the classical bias–variance trade-off, PNAS 116, 15849 (2019).
- Power et al. [2022] A. Power et al., Grokking: Generalization beyond overfitting in small neural networks, arXiv (2022), 2201.02177 .
- Papyan et al. [2020] V. Papyan, X. Y. Han, and D. L. Donoho, Prevalence of neural collapse during the terminal phase of deep learning training, PNAS 117, 24927 (2020).
- Liu et al. [2025] Z. Liu, Y. Xu, T. Poggio, and I. Chuang, Parameter symmetry potentially unifies deep learning theory, arXiv preprint (2025), 2502.05300 .
- Arola-FernĂĄndez and Lacasa [2024] L. Arola-FernĂĄndez and L. Lacasa, Effective theory of collective deep learning, Phys. Rev. Res. 6, L042040 (2024).
- Carleo et al. [2019] G. Carleo et al., Machine learning and the physical sciences, Rev. Mod. Phys. 91, 045002 (2019).
- Lewkowycz et al. [2020] A. Lewkowycz et al., The large learning rate phase of deep learning: the catapult mechanism (2020), arXiv:2003.02218 [stat.ML] .
- Muñoz [2018] M. A. Muñoz, Colloquium: Criticality and dynamical scaling in living systems, Rev. Mod. Phys. 90, 031001 (2018).
- Zhang et al. [2025] S. Zhang et al., Intelligence at the edge of chaos (2025), arXiv:2410.02536 [cs.AI] .
- Jiménez-González et al. [2025] P. Jiménez-González, M. C. Soriano, and L. Lacasa, Leveraging chaos in the training of artificial neural networks (2025), arXiv:2506.08523 [cs.LG] .
- Arola-FernĂĄndez et al. [2020] L. Arola-FernĂĄndez et al., Uncertainty propagation in complex networks: From noisy links to critical properties, Chaos: An Interdisciplinary Journal of Nonlinear Science 30, 023129 (2020).
- Levine et al. [2020] S. Levine, A. Kumar, G. Tucker, and J. Fu, Offline reinforcement learning: Tutorial, review, and perspectives on open problems (2020), arXiv:2005.01643 [cs.LG] .
- Heins et al. [2024] C. Heins et al., Collective behavior from surprise minimization, PNAS 121, e2320239121 (2024).
- Friston [2010] K. Friston, The free-energy principle: A unified brain theory?, Nature Reviews Neuroscience 11, 127 (2010).
- Kauffman [2000] S. A. Kauffman, Investigations (Oxford Univ. Pr., 2000).
- Eysenbach and Levine [2022] B. Eysenbach and S. Levine, Maximum entropy rl (provably) solves some robust rl problems (2022), arXiv:2103.06257 [cs.LG] .
- Jaynes [1957] E. T. Jaynes, Information theory and statistical mechanics, The Physical Review 106, 620 (1957).
- Zdeborová and Krzakala [2016] L. Zdeborová and F. Krzakala, Statistical physics of inference: thresholds and algorithms, Adv. in Phys. 65, 453–552 (2016).
- Ziyin et al. [2025] L. Ziyin, Y. Xu, and I. Chuang, Neural thermodynamics i: Entropic forces in deep and universal representation learning (2025), arXiv:2505.12387 [cs.LG] .
- Belcak et al. [2025] P. Belcak et al., Small language models are the future of agentic ai (2025), arXiv:2506.02153 .
- Wang et al. [2025] G. Wang et al., Hierarchical reasoning models, arXiv preprint (2025), 2506.21734 .
- Zweig [1943] S. Zweig, The Royal Game (Viking Press, 1943).
- Aguilar et al. [2022] J. Aguilar et al., Sampling rare trajectories using stochastic bridges, Phys. Rev. E 105, 064138 (2022).
- Labay-Mora et al. [2025] Labay-Mora et al., Theoretical framework for quantum associative memories, Quantum Science and Technology 10, 035050 (2025).
- Brunner et al. [2025] D. Brunner et al., Roadmap on neuromorphic photonics (2025), arXiv:2501.07917 [cs.ET] .
- Manzano et al. [2024] G. Manzano et al., Thermodynamics of computations with absolute irreversibility, unidirectional transitions, and stochastic computation times, Phys. Rev. X 14, 021026 (2024).
- Arenas et al. [2011] A. Arenas et al., The joker effect: Cooperation driven by destructive agents, J. of Theo. Bio. 279, 113–119 (2011).
- Maddison et al. [2017] C. J. Maddison, A. Mnih, and Y. W. Teh, The concrete distribution: A continuous relaxation of discrete random variables, in ICLR (2017) 1611.00712 .
- Williams [1992] R. J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine Learning 8, 229 (1992).
Supplementary Material for: "Intuition emerges in Maximum Caliber models at criticality"
## Appendix A S1. Experimental Setup and Hyperparameters
The experimental setting is a minimal yet non-trivial environment for testing emergent reasoning. It consists of a deterministic $24\times 24$ maze with periodic boundary conditions, where an agent must find the path to a designated exit. This controlled testbed provides a tractable state space for analyzing the learning dynamics. The agent's behavior is determined by a policy network that maps the current state (2D position $x_{t}$ ) to a probability distribution over the four cardinal actions: $\mathcal{A}=\{\text{Up, Down, Right, Left}\}$ . For auto-regressive training, a simple deterministic function $f(x_{t},a)$ maps the last action to the next state.
The training dataset $\mathcal{D}$ is intentionally non-optimal. In our main experiments, it contains $N=100$ trajectories, each of length $T=60$ steps, generated by constrained random walks. These walkers respect the maze walls (i.e., never collide with them) but otherwise move randomly, exhibiting no goal-directed behavior. This design ensures that the optimal exit strategy is not present in the training data, forcing the model to discover it.
The model parameters $\theta$ are optimized by minimizing the free-energy functional $\mathcal{F}_{\lambda,\beta,\tau}(\theta)$ (Eq. (2) in the main text) via the Adam optimizer. The results presented in the main text (Fig. 2) are averaged over 20 independent training runs, each with a different random weight initialization, to ensure statistical robustness. The key hyperparameters used in the main experiments are: a policy network structured as a multi-layer perceptron (MLP) with one hidden layer of 128 neurons and ReLU activation; a learning rate of $1\times 10^{-3}$ ; 300 training epochs per $\lambda$ value; and a future horizon of $\tau=40$ steps in the entropy calculation.
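For concreteness, the policy architecture above can be sketched framework-free (the released code is PyTorch): one hidden layer of 128 ReLU units mapping a 2D position to a distribution over four actions, with a temperature-like $\beta$ scaling the logits. The initialization scale and the `probs` interface are assumptions for illustration, not the released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

class Policy:
    """Minimal MLP policy sketch: 2D position -> distribution over 4 actions.
    Matches the stated architecture (one hidden layer, ReLU); weights and
    interface are illustrative assumptions."""
    def __init__(self, hidden=128):
        self.W1 = rng.normal(0.0, 0.1, (2, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.1, (hidden, 4))
        self.b2 = np.zeros(4)

    def probs(self, x, beta=1.0):
        h = np.maximum(x @ self.W1 + self.b1, 0.0)   # ReLU hidden layer
        logits = h @ self.W2 + self.b2
        z = beta * logits                            # beta: stochasticity knob
        z -= z.max(axis=-1, keepdims=True)           # numerically stable softmax
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)
```

Training then minimizes $\mathcal{F} = \mathcal{E} - \lambda\mathcal{H}$ over these parameters with Adam; higher $\beta$ sharpens the action distribution, which is how the different training, imagination, and inference regimes described next are obtained from one network.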
The policy stochasticities are set to $\beta=1$ for training, $\beta=5$ for entropy calculation (imagination), and $\beta=10$ at inference time. A high imagination $\beta$ (compared to the training $\beta$ ) is beneficial for discovering hidden solutions that maximize causal entropy (i.e., finding the exit) with a finite $\tau$ and sparse data. A high inference $\beta$ is necessary to induce intuitive behavior in practice: in the intuition phase, the agent finds a superior solution but must execute its policy near-deterministically to follow the optimal path in the minimum time.
For problems that are not Markovian or where the data representation does not contain full state information (e.g., data are sequences of moves or the agent only sees its local environment), a more advanced neural network is required. Transformers are the standard for modeling long, non-Markovian sequences of tokens. Our framework naturally extends to these sequential autoregressive architectures, albeit at the cost of more parameters and computational effort.
Code availability.â PyTorch source code to reproduce the results of this paper is publicly available on GitHub: https://github.com/mystic-blue/mind-tuning.
## Appendix B S2. Calculation of Objective Functionals
The mind-tuning objective function $\mathcal{F}_{\lambda,\beta,\tau}(\theta)=\mathcal{E}_{\beta}(\theta)-\lambda\mathcal{H}_{\tau,\beta}(\theta)$ consists of two key terms. Below we detail their calculation.
### B.1 A. Cross-Entropy Estimation
The cross-entropy term $\mathcal{E}_{\beta}(\theta)$ , defined in Eq. (3) of the main text, measures the modelâs ability to imitate the training data. It is estimated by averaging the negative log-likelihood of the actions taken in the dataset $\mathcal{D}$ given the preceding histories:
$$
\hat{\mathcal{E}}_{\beta}(\theta)=\frac{1}{|\mathcal{D}|}\sum_{(h_{t},a_{t})\in\mathcal{D}}[-\log\pi_{\theta,\beta}(a_{t}|h_{t})]
$$
where $|\mathcal{D}|$ is the total number of state-action pairs in the training set. This term encourages the policy to assign high probability to the trajectories observed during training.
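A minimal sketch of this estimator, assuming a `policy.probs` interface that returns the action distribution for a batch of states (as in the policy sketch of Sec. S1):

```python
import numpy as np

def cross_entropy(policy, dataset, beta=1.0):
    """Empirical estimate of E_beta(theta): mean negative log-likelihood of
    the dataset's actions. `dataset` is a list of (state, action_index)
    pairs; `policy.probs` is an assumed interface returning a distribution
    over the 4 actions for a batch of states."""
    nll = 0.0
    for x, a in dataset:
        p = policy.probs(np.asarray(x)[None, :], beta=beta)[0]
        nll -= np.log(p[a] + 1e-12)   # epsilon guards against log(0)
    return nll / len(dataset)
```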
### B.2 B. Causal Path-Entropy: Analytic Calculation for Markovian Systems
For systems with fully-observed, discrete, and reasonably small state spaces $\mathcal{V}$ , such as our maze environment, the path-entropy can be computed analytically. Since the system is Markovian ( $h_{t}=x_{t}$ ), we can define a policy-dependent transition matrix $M_{\pi}$ . The element $(M_{\pi})_{x^{\prime},x}$ gives the probability of transitioning from state $x$ to state $x^{\prime}$ under the current policy $\pi_{\theta,\beta}$ . Specifically, $(M_{\pi})_{x^{\prime},x}=\sum_{a\in\mathcal{A}}\pi_{\theta,\beta}(a|x)\delta_{x^{\prime},f(x,a)}$ , where $f(x,a)$ is the deterministic function that returns the next state.
Given a starting state $x_{start}$ , we can compute the probability distribution over future states $\vec{\rho}_{k}$ at any time step $k$ by evolving an initial occupancy vector (a point mass at $x_{start}$ ) via the recursion $\vec{\rho}_{k+1}=M_{\pi}\vec{\rho}_{k}$ . The conditional path-entropy for a trajectory starting at $x_{start}$ is then the time-averaged Shannon entropy of the policy, weighted by the occupancy probability at each future state:
$$
\mathcal{H}_{\tau,\beta}(\theta|x_{start})=\frac{1}{\tau}\sum_{k=0}^{\tau-1}\sum_{x\in\mathcal{V}}(\rho_{k})_{x}\left[-\sum_{a\in\mathcal{A}}\pi_{\theta,\beta}(a|x)\log\pi_{\theta,\beta}(a|x)\right].
$$
The total functional $\mathcal{H}_{\tau,\beta}(\theta)$ is the expectation of Eq. (S2) over all starting states in the training dataset $\mathcal{D}$ . This entire calculation is fully differentiable with respect to the network parameters $\theta$ , allowing for efficient gradient-based optimization. This exact method was used to produce all experimental and theoretical results in this work. Its primary computational cost scales with the size of the state space $|\mathcal{V}|$ , making it suitable for our testbed.
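The recursion above translates directly into code. In this sketch, `pi` is a $|\mathcal{V}|\times|\mathcal{A}|$ array of action probabilities and `f(s, a)` is the deterministic successor function; both are assumed interfaces, and unlike the differentiable PyTorch implementation this NumPy version only evaluates the functional.

```python
import numpy as np

def path_entropy(pi, f, states, actions, tau=40, start=0):
    """Analytic H_tau for a Markovian system: build the policy-dependent
    transition matrix M_pi, evolve the occupancy rho_{k+1} = M_pi rho_k from
    a point mass at `start`, and time-average the occupancy-weighted Shannon
    entropy of the policy."""
    n = len(states)
    M = np.zeros((n, n))                 # (M)_{x', x}: column-stochastic
    for s in range(n):
        for a in range(len(actions)):
            M[f(s, a), s] += pi[s, a]
    # Shannon entropy of pi(.|x) at each state
    local_H = -(pi * np.log(pi + 1e-12)).sum(axis=1)
    rho = np.zeros(n)
    rho[start] = 1.0                     # point mass at x_start
    H = 0.0
    for _ in range(tau):
        H += rho @ local_H               # occupancy-weighted policy entropy
        rho = M @ rho                    # evolve one step
    return H / tau
```

On a symmetric environment with a uniform policy this reduces to $\log|\mathcal{A}|$, a useful sanity check.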
### B.3 C. Causal Path-Entropy: Monte Carlo Estimation for High-Dimensional Systems
For high-dimensional or continuous state spaces, or for non-Markovian sequence models like Transformers, the analytic approach becomes intractable. In these cases, $\mathcal{H}$ must be estimated via Monte Carlo sampling. For each starting history $h_{start}$ in a training mini-batch $\mathcal{B}$ , we generate $K$ independent future trajectories (rollouts) of length $\tau$ by autoregressively sampling actions from the policy. The estimator for the path-entropy functional is:
$$
\hat{\mathcal{H}}_{\tau,\beta}(\theta)\approx\frac{1}{|\mathcal{B}|}\sum_{h_{start}\in\mathcal{B}}\left(\frac{1}{K\tau}\sum_{k=1}^{K}\sum_{j=0}^{\tau-1}\left[-\ln\pi_{\theta,\beta}(a_{j}^{(k)}|h_{j}^{(k)})\right]_{h_{start}}\right).
$$
To ensure that gradients can be backpropagated through the sampling process, especially for discrete action spaces, reparameterization techniques are required. A standard method is the Gumbel-Softmax trick [51], which provides a continuous, differentiable approximation to the sampling procedure. Alternatively, the gradient of the entropic objective can be estimated using policy gradient methods like REINFORCE [52], though this often suffers from high variance.
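A plain (non-differentiable) sketch of this Monte Carlo estimator follows; the `policy.probs` and environment `step` interfaces are assumptions, and a gradient-carrying version would replace the categorical sampling with the Gumbel-Softmax relaxation or use REINFORCE, as discussed above.

```python
import numpy as np

def mc_path_entropy(policy, step, starts, tau=40, K=16, beta=5.0, rng=None):
    """Monte Carlo estimate of H_tau: average negative log-probability of
    sampled actions over K rollouts of length tau from each start."""
    if rng is None:
        rng = np.random.default_rng(0)
    total, count = 0.0, 0
    for x0 in starts:
        for _ in range(K):                       # K independent rollouts
            x = np.asarray(x0, dtype=float)
            for _ in range(tau):                 # autoregressive rollout
                p = policy.probs(x[None, :], beta=beta)[0]
                a = rng.choice(len(p), p=p)      # sample an action
                total -= np.log(p[a] + 1e-12)    # accumulate -log pi(a|h)
                x = step(x, a)                   # world-model transition
                count += 1
    return total / count   # averaged over starts, rollouts, and time steps
```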
## Appendix C S3. Parametric Dependencies of the Intuition Phase
The emergence of the fragile intuition phase is a critical phenomenon highly sensitive to the model, data, and learning protocol parameters. Below, we detail the key dependencies we investigated.
### C.1 A. Future Horizon $\tau$
The future horizon $\tau$ dictates the timescale of the model's "imagination". Our experiments show that the intuition phase only emerges for a sufficiently long horizon (Fig. S1).
For a small $\tau$ , the model is myopic; the long-term entropic gain from escaping the maze is not visible, so the model defaults to minimizing cross-entropy and remains in the imitation phase. As $\tau$ increases, the model can foresee the vast expansion of possible futures that awaits outside the maze, creating a strong entropic incentive to find an exit. For intermediate horizons, we often observe a cheating phase: a local minimum in the free-energy landscape where the model learns to take a single illegal step through a wall. This strategy is a compromise: it incurs a small penalty for rule-breaking but gains a significant medium-term entropic advantage. Only for large $\tau$ does the incentive to find a legal path to maximal freedom dominate (i.e., virtue over vice).
<details>
<summary>FigureS1.png Details</summary>

Three panels (A: τ = 1; B: τ = 20; C: τ = 40) plot the cross-entropy and path-entropy (top axes, 0–1.4) and the normalized MFPT (teal) and WHR (red) (bottom axes, 0–1.1) against λ on a log scale from 10<sup>-3</sup> to 10<sup>2</sup>. The entropy curves are nearly identical across horizons, rising step-like near λ ≈ 10<sup>-1</sup>. At τ = 1 only the MFPT dips while the WHR stays high and fluctuating; at τ = 20 and τ = 40 both MFPT and WHR dip to minima near λ ≈ 10<sup>-2</sup> (about 0.2 and 0.1, respectively) before rising again, the dip being deepest at τ = 40.
</details>
Figure S1: Dependence on Future Horizon $\tau$ . Phase diagram of the genotypic (top) and phenotypic (bottom) metrics as a function of $\lambda$ for different future horizons. The intuition window (sharp dip in MFPT and zero WHR, shaded blue) appears and stabilizes only for a long horizon ( $\tau=40$ ). (A) A short horizon ( $\tau=1$ ) yields only imitation and hallucination. (B) An intermediate horizon ( $\tau=20$ ) can lead to a cheating strategy, which is worse than the true intuitive solution (C).
### C.2 B. Model Capacity
The capacity of the policy network, controlled by the number of neurons, is another decisive factor (Fig. S2). A model with insufficient capacity has high bias and lacks the representational power to learn the complex, mixed strategy required to balance maze constraints with goal-directed exploration. It cannot simultaneously represent the world model and the entropic drive, so the intuition phase does not emerge. Conversely, a model with excessive capacity relative to the task complexity is prone to overfitting: it may perfectly memorize the noisy random walks from the training data or discover trivial, non-generalizable solutions (e.g., exploiting specific numerical artifacts) to maximize entropy. The intuition phase occupies a "sweet spot" where model capacity is well-matched to the problem, enabling generalization from sparse data rather than mere memorization or unconstrained hallucination.
<details>
<summary>FigureS2.png Details</summary>

Three panels compare hidden-layer capacities (A: 8; B: 64; C: 128 neurons), plotting the cross-entropy and path-entropy (top axes, 0–1.4) and the normalized MFPT and WHR (bottom axes) against λ on a log scale from 10<sup>-2</sup> to 10<sup>2</sup>. In all panels the path-entropy rises to a plateau near 1.3, while MFPT and WHR drop sharply around λ ≈ 0.1 and then stabilize at low values. The cross-entropy is more variable at the larger capacities, with a pronounced spike near λ ≈ 0.1 for capacity 64.
</details>
Figure S2: Dependence on Model Capacity. Emergence of the intuition phase as a function of the number of neurons in the hidden layer. (A) A model with insufficient capacity (e.g., 8 neurons) cannot learn the required behavior. The intuition phase is robust for models with sufficient capacity (e.g., 64 (B) or 128 neurons (C)), which are powerful enough to discover the solution but not so powerful that they immediately overfit.
### C.3 C. Maze Complexity
We evaluated the framework on several environments of increasing complexity (Fig. S3). In simpler environments (e.g., a straight corridor), the escape task is trivial because the data trajectories are very close to the optimal solution; the intuition window is consequently wide and appears at lower values of $\lambda$ . As maze complexity increases, finding the optimal path becomes a harder constraint-satisfaction problem, and the cross-entropy term $\mathcal{E}$ more strongly penalizes deviations from valid paths. To overcome this, a stronger entropic pressure (a higher $\lambda$ ) is required to motivate the search for the distant exit. As a result, the intuition window narrows and shifts in the phase diagram, indicating that a more precise tuning of the energy-entropy balance is needed for harder problems. In some cases, the intuition window may disappear entirely and can only be reached through protocols like the adiabatic sweep.
<details>
<summary>FigureS3.png Details</summary>

### Visual Description
## Chart: Spatial Exploration Metrics in Different Environments
### Overview
The image presents a comparative analysis of spatial exploration behavior across four distinct environments: Corridor, Path, Maze, and Room. Each environment is represented by a heatmap visualizing spatial usage, accompanied by two charts displaying exploration metrics ("Metric value" and "MFPT/WHR (normalized)") plotted against a logarithmic scale of lambda (λ).
### Components/Axes
Each of the four panels (A, B, C, D) shares the same chart structure:
* **Top:** Heatmap representing spatial usage. Color intensity indicates frequency of visits, with blue representing low frequency and red representing high frequency.
* **Middle:** Line chart plotting "Metric value" against λ (log scale). Includes error bars representing fluctuations.
* **Bottom:** Line chart plotting "MFPT/WHR (normalized)" against λ (log scale). Three lines are displayed: MFPT (blue), WHR (red), and Intuition (green). Scatter points are used to represent the data, with lines showing the trend.
* **X-axis (all charts):** λ (log scale), ranging from approximately 10<sup>0</sup> to 10<sup>2</sup>.
* **Y-axis (Middle chart):** Metric value, ranging from approximately 0.8 to 1.4.
* **Y-axis (Bottom chart):** MFPT/WHR (normalized), ranging from approximately 0.2 to 0.8.
* **Legend (Bottom chart):** MFPT (blue line), WHR (red line), Intuition (green line).
* **Inset (Middle chart):** A small plot showing the fluctuation measure σ against η<sub>f</sub>(β<sup>-1</sup>).
### Detailed Analysis or Content Details
**Panel A: Corridor**
* **Heatmap:** Shows high usage concentrated along a central path, with lower usage on the sides.
* **Metric Value Chart:** The blue line representing the metric value shows a generally upward trend, starting around 0.95 at λ = 10<sup>0</sup> and reaching approximately 1.3 at λ = 10<sup>2</sup>. The error bars (fluctuations) are relatively small.
* **MFPT/WHR Chart:**
* MFPT (blue): Starts around 0.7 at λ = 10<sup>0</sup>, initially increases, then plateaus around 0.65-0.7.
* WHR (red): Starts around 0.5 at λ = 10<sup>0</sup>, decreases to approximately 0.45 at λ = 10<sup>1</sup>, then increases slightly to around 0.5 at λ = 10<sup>2</sup>.
* Intuition (green): Starts around 0.4 at λ = 10<sup>0</sup>, increases rapidly to a peak around 0.6 at λ = 10<sup>1</sup>, then decreases to approximately 0.5 at λ = 10<sup>2</sup>.
**Panels B (Path), C (Maze), and D (Room)** follow the same quantitative pattern as the Corridor, so only the differences are noted:
* **Heatmaps:** Path shows a single focused route of high usage; Maze shows a dispersed pattern, indicative of exploration within the maze structure; Room shows a relatively uniform usage pattern across the open area.
* **Metric Value Charts:** In all three, the metric value increases with λ, starting around 0.9 and reaching approximately 1.3 at λ = 10<sup>2</sup> (1.35 for Path), with slightly larger fluctuations in the Maze.
* **MFPT/WHR Charts:** MFPT (blue) starts around 0.7, increases to approximately 0.75 at λ = 10<sup>1</sup>, then plateaus; WHR (red) starts around 0.5, decreases to approximately 0.4 at λ = 10<sup>1</sup>, then increases slightly; Intuition (green) starts around 0.4, peaks near 0.6 (0.65 for Path) around λ = 10<sup>1</sup>, then decreases.
### Key Observations
* The "Metric value" consistently increases with λ across all environments, suggesting a positive correlation between λ and the measured metric.
* The "Intuition" metric consistently peaks around λ = 10<sup>1</sup> in all environments, then declines.
* The "WHR" metric generally decreases initially with increasing λ, then plateaus or slightly increases.
* The heatmaps reveal distinct spatial usage patterns for each environment, reflecting the constraints and opportunities for exploration.
### Interpretation
The data suggests that the parameter λ influences spatial exploration behavior in a consistent manner across different environments. The increasing "Metric value" with λ indicates that as λ increases, the agent explores the environment more thoroughly or efficiently. The peaking of "Intuition" at intermediate values of λ suggests an optimal level of exploration where the agent balances exploration and exploitation. The behavior of "WHR" suggests a trade-off between the time spent in familiar areas and the time spent exploring new areas.
The differences in heatmap patterns highlight the impact of environmental structure on exploration. The focused path in the "Path" environment contrasts with the dispersed pattern in the "Maze" environment, indicating that the agent adapts its exploration strategy to the specific constraints of each environment. The relatively uniform pattern in the "Room" suggests a lack of strong directional biases.
The inset plots showing σ vs η<sub>f</sub>(β<sup>-1</sup>) provide additional information about the fluctuations in the metric value, though their specific meaning requires further context. The consistent shape of these insets across all environments suggests a common underlying mechanism governing the fluctuations.
</details>
Figure S3: Dependence on Maze Complexity. Position and width of the intuition window (measured by MFPT) for environments of varying difficulty. (A,B) For a simple corridor, the window is wide and appears at low $\lambda$ . (C) For the more complex maze used in the main text, the window is narrower, reflecting the increased difficulty. (D) For even more complex problems, the intuition window can disappear, necessitating specific protocols to reach the desired phase.
## Appendix D S4. Detailed Theory and Further Predictions
Here we expand on the theory from the main text, providing the explicit analytical forms for the free energy functional. We also clarify the calculation of the intuition likelihood ( $p_{\text{intuition}}$ ) and discuss the existence of a more elusive inspiration phase as a further prediction of the theory.
### D.1 A. The Effective Free Energy Functional
For the Markovian maze environment with a small state-space, the terms of the free energy functional $\mathcal{F}_{\lambda}(m)=\mathcal{E}(m)-\lambda\mathcal{H}(m)$ can be computed analytically as a function of the rationality order parameter $m$ . Note that this theory is effective: the $\beta$ of the analytical policy is distinct from the experimental one, since the former controls only three minimal analytical costs while the latter modulates the entire logit vector.
The effective cross-entropy, $\mathcal{E}(m)$ , is the expectation of the negative log-likelihood over the state distribution of the given data, $\rho_{\mathcal{D}}(s)$ . For a single state $s$ , the cross-entropy is $E_{s}(m)=\log Z_{m}(s)+\beta\langle U_{m}(a)\rangle_{a\sim\text{uniform}}$ , where the average is over legal moves from $s$ . Summing over the data distribution gives
$$
\mathcal{E}(m)=\left\langle\log\left(\sum_{a\in\mathcal{A}}e^{-\beta U_{m}(a|s)}\right)+\beta\frac{\sum_{a^{\prime}\in\mathcal{A}_{\text{legal}}(s)}U_{m}(a^{\prime}|s)}{|\mathcal{A}_{\text{legal}}(s)|}\right\rangle_{s\sim\rho_{\mathcal{D}}},
$$
where $U_{m}(a|s)$ is the cost of action $a$ in state $s$ (which depends on whether the move is optimal, suboptimal, or a wall collision) and $\mathcal{A}_{\text{legal}}(s)$ is the set of valid moves from $s$ .
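As a concrete illustration, the per-state term $E_{s}(m)$ can be evaluated numerically. The three-level cost model below (optimal move $-m$, suboptimal move $0$, wall collision $+m$) is a hypothetical parametrization chosen for the sketch, not the exact costs used in the paper:

```python
import numpy as np

def state_cross_entropy(m, beta, costs_all, legal_mask):
    """E_s(m) = log Z_m(s) + beta * <U_m(a)>_{a ~ uniform over legal moves}."""
    log_Z = np.log(np.sum(np.exp(-beta * costs_all)))  # log partition over all actions
    mean_legal = costs_all[legal_mask].mean()          # uniform average over legal moves only
    return log_Z + beta * mean_legal

# Hypothetical costs for one state: one optimal move (-m), two suboptimal (0), one wall (+m).
m, beta = 0.5, 4.0
costs = np.array([-m, 0.0, 0.0, +m])
legal = np.array([True, True, True, False])            # the wall move is illegal
E_s = state_cross_entropy(m, beta, costs, legal)
```

The full $\mathcal{E}(m)$ then averages $E_{s}(m)$ over $s\sim\rho_{\mathcal{D}}$.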
The effective path-entropy, $\mathcal{H}(m)$ , is the time-averaged Shannon entropy of the policy $p_{m,\beta}$ over trajectories of length $\tau$ starting from an initial state distribution $\rho_{0}$ (in our case, a single point at the maze start). It is calculated using the policy-dependent transition matrix $M_{m}$
$$
\mathcal{H}(m)=\frac{1}{\tau}\sum_{k=0}^{\tau-1}\left(\sum_{s\in\mathcal{V}}(\rho_{k})_{s}\cdot h_{m}(s)\right),
$$
where $\vec{\rho}_{k}=(M_{m})^{k}\vec{\rho}_{0}$ is the state occupancy vector at time $k$ , and $h_{m}(s)$ is the local policy entropy at state $s$ :
$$
h_{m}(s)=-\sum_{a\in\mathcal{A}}p_{m,\beta}(a|s)\log p_{m,\beta}(a|s).
$$
These analytical expressions are used to generate the theoretical plots in the main text.
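The path-entropy average can likewise be sketched as a short recursion on the occupancy vector. The three-state ring policy below is a hypothetical stand-in for the policy-dependent transition matrix $M_{m}$ of the maze:

```python
import numpy as np

def path_entropy(P, rho0, tau):
    """H(m) = (1/tau) * sum_k sum_s rho_k[s] * h(s), with rho_{k+1} = rho_k @ P
    and h(s) the Shannon entropy of the policy row P[s]."""
    with np.errstate(divide="ignore", invalid="ignore"):
        h = -np.nansum(np.where(P > 0, P * np.log(P), 0.0), axis=1)  # local entropies h_m(s)
    rho, H = rho0.copy(), 0.0
    for _ in range(tau):
        H += rho @ h          # occupancy-weighted entropy at step k
        rho = rho @ P         # propagate the state distribution
    return H / tau

# Hypothetical 3-state ring with a uniform two-way policy: h(s) = log 2 at every state.
P = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
rho0 = np.array([1.0, 0.0, 0.0])   # a single starting point, as in the maze
H = path_entropy(P, rho0, tau=10)
```

Because every row has entropy $\log 2$ and the occupancy vector stays normalized, the time average equals $\log 2$ regardless of $\tau$.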
### D.2 B. Calculating the Intuition Likelihood ( $p_{\text{intuition}}$ )
The intuition metric, visualized as the cyan region in the plots, quantifies the model's ability to spontaneously follow the optimal path. In experiments, this intuition likelihood is measured as the fraction of independent trials where the system displays the optimal solution at inference (minimal MFPT with zero WHR).
The same empirical criterion can be applied to the effective theory. More interestingly, the intuition likelihood can also be calculated analytically if the optimal route is known. We define it as the joint probability of generating the true shortest path to the exit, for a horizon of $q$ steps (where $q$ depends on the maze topology). Let the optimal path be the sequence of states $z^{*}=(s_{0}^{*},s_{1}^{*},\dots,s_{q}^{*})$ , where $s_{0}^{*}$ is the starting position, and let $a_{t}^{*}$ be the optimal action to transition from $s_{t}^{*}$ to $s_{t+1}^{*}$ . The intuition likelihood for a given rationality level $m$ is:
$$
p_{\text{intuition}}(m)=\prod_{t=0}^{q-1}p_{m,\beta}(a_{t}^{*}|s_{t}^{*})
$$
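Evaluating this product is straightforward once the policy and the shortest path are known. The tiny dictionary policy below is purely illustrative (states, actions, and probabilities are hypothetical):

```python
def intuition_likelihood(policy, optimal_states, optimal_actions):
    """p_intuition(m) = prod_t p_{m,beta}(a*_t | s*_t) along the known shortest path."""
    p = 1.0
    for s, a in zip(optimal_states, optimal_actions):
        p *= policy[s][a]
    return p

# Hypothetical 3-step optimal path where the model assigns 0.9 to each optimal action.
policy = {0: {"R": 0.9, "D": 0.1},
          1: {"R": 0.9, "D": 0.1},
          2: {"D": 0.9, "R": 0.1}}
p_int = intuition_likelihood(policy, [0, 1, 2], ["R", "R", "D"])  # = 0.9**3
```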
Since the system can be multistable, the final value reported in the figure for a given $\lambda$ is the Boltzmann-weighted average of this likelihood over all coexisting free energy minima ( $m^{*},m^{**},\dots$ ):
$$
\langle p_{\text{intuition}}\rangle_{\lambda}=\sum_{i}w_{i}(\lambda)\cdot p_{\text{intuition}}(m_{i})\quad\text{where}\quad w_{i}(\lambda)=\frac{e^{-\hat{\beta}\mathcal{F}_{\lambda}(m_{i})}}{\sum_{j}e^{-\hat{\beta}\mathcal{F}_{\lambda}(m_{j})}},
$$
where $\hat{\beta}$ is an inverse temperature controlling the sampling of minima. At high $\hat{\beta}$ (low thermal noise), the system predominantly samples the global minimum, reproducing the steady-state results of the main experiments. In the numerical experiments, each run starts from a random weight initialization, and gradient descent acts as a local search that can fall into any of the attracting states. The likelihood metric is therefore zero in the imitation and hallucination phases (where the probability of following the optimal path is negligible) and peaks sharply in the narrow intuition window, provided the policy's inference $\beta$ is sufficiently high.
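The Boltzmann weighting over coexisting minima can be sketched directly. The free-energy values and per-minimum likelihoods below are hypothetical, chosen so that the global minimum carries the intuitive solution:

```python
import numpy as np

def boltzmann_average(F_minima, p_minima, beta_hat):
    """<p_intuition> = sum_i w_i * p_i with w_i proportional to exp(-beta_hat * F(m_i))."""
    F = np.asarray(F_minima, dtype=float)
    w = np.exp(-beta_hat * (F - F.min()))  # shift by the minimum for numerical stability
    w /= w.sum()
    return float(w @ np.asarray(p_minima, dtype=float))

# Hypothetical two-minima landscape: the F = 1.0 minimum is intuitive (p = 0.8),
# the F = 1.5 minimum is not (p = 0.0).
avg = boltzmann_average([1.0, 1.5], [0.8, 0.0], beta_hat=10.0)
```

At large $\hat{\beta}$ the average collapses onto the global minimum's likelihood, while $\hat{\beta}=0$ recovers the unweighted mean over minima.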
### D.3 C. From Intuition to Inspiration: Further Predictions of the Effective Theory
The intuition phase represents a significant discovery, in which the model finds a hidden, optimal solution that branches smoothly from the data-driven imitation phase. It is a better strategy, but not a radical departure. Intriguingly, the theory predicts the existence of a distinct, more profound cognitive phase: inspiration. Inspiration is not a continuous improvement but an abrupt jump to a qualitatively different state of insight. This corresponds to the emergence of a new, globally optimal minimum in the free-energy landscape, with a rationality parameter close to $m\approx 1$ . A model in the inspiration phase does not merely approximate the optimal policy; it knows the solution is correct. This internalized understanding would manifest through a key operational signature: the model could execute the optimal strategy robustly even with a stochastic policy (low inference $\beta$ ), distinguishing it from the more tentative intuitive state.
The theory predicts that the imagination temperature $\beta_{\text{dream}}$ (the policy stochasticity in the entropy term) is a key parameter for accessing these states (Fig. S4). At low $\beta_{\text{dream}}$ , the intuition phase ( $m>m_{D}$ ) is unstable. It emerges in a stable window only for sufficiently large $\beta_{\text{dream}}$ . At even higher values, this stable intuition branch can bifurcate into two locally stable solutions: the familiar intuition phase and the hidden inspiration phase ( $m\approx 1$ ). Both can coexist while the hallucination phase ( $m\ll m_{D}$ ) remains the global attractor. Observing this more exotic inspiration phase in practice would likely require careful tuning protocols, potentially starting from the intuition phase and employing non-equilibrium dynamics.
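The bifurcation described above can be probed numerically by locating the local minima of a sampled one-dimensional landscape. The double-well polynomial below is a hypothetical stand-in for $\mathcal{F}_{\lambda}(m)$ at high $\beta_{\text{dream}}$, not the paper's actual functional:

```python
import numpy as np

def local_minima(F_vals, m_grid):
    """Return (m, F) pairs for the interior local minima of a sampled 1-D landscape."""
    idx = [i for i in range(1, len(F_vals) - 1)
           if F_vals[i] < F_vals[i - 1] and F_vals[i] < F_vals[i + 1]]
    return [(m_grid[i], F_vals[i]) for i in idx]

# Hypothetical double-well landscape: wells near m = 0.2 (hallucination-like)
# and m = 0.9 (inspiration-like), mimicking coexisting attractors.
m = np.linspace(0.0, 1.0, 201)
F = (m - 0.2) ** 2 * (m - 0.9) ** 2
minima = local_minima(F, m)
```

Sweeping such a landscape while tracking the number and depth of minima is the basic operation behind the phase diagrams in Fig. S4.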
<details>
<summary>FigureS4.png Details</summary>

### Visual Description
## Chart: Expected Metric Values vs. Log Scale Lambda
### Overview
The image presents three panels (A, B, and C), each displaying a series of plots illustrating the relationship between Expected Metric Values (top), Normalized MFPT/WHR (middle), and Rationality Minima (bottom) against a log-scale lambda (λ). Each panel corresponds to a different value of β<sub>dream</sub> (2.0, 5.0, and 20.0, respectively). Each panel contains three main plots, with insets showing the first derivative of the expected metric values.
### Components/Axes
* **X-axis (all panels):** λ (log scale), ranging from 10^-4 to 10^1.
* **Y-axis (Panel A):** Expected Metric Values, ranging from 0.0 to 1.40.
* **Y-axis (Panel B):** Expected Metric Values, ranging from 0.0 to 0.30.
* **Y-axis (Panel C):** Expected Metric Values, ranging from 0.0 to 0.6.
* **Y-axis (Middle plots, all panels):** Normalized MFPT/WHR, ranging from 0.0 to 1.10.
* **Y-axis (Bottom plots, all panels):** Rationality Minima (m'), ranging from 0.0 to 0.2.
* **Legend (all panels):**
* MFPT (solid blue line)
* WHR (dashed red line)
* P Intuition (dotted green line)
* **Inset Plots (all panels):** First derivative vs. λ, with curves for the cross-entropy E (solid blue line) and the path-entropy H (dashed red line).
* **Panel Labels:** A (β<sub>dream</sub> = 2.0), B (β<sub>dream</sub> = 5.0), C (β<sub>dream</sub> = 20.0).
* **Bottom of each panel:** m* (global) (solid black line), m** (2nd) (dashed black line), m** (3rd) (dotted black line), and a horizontal threshold line at m = 0.7.
### Detailed Analysis or Content Details
**Panel A (β<sub>dream</sub> = 2.0):**
* **Expected Metric Values:** The MFPT line (blue) starts at approximately 1.35 and decreases to around 0.25. The WHR line (red) starts at approximately 1.10 and decreases to around 0.15. The P Intuition line (green) starts at approximately 0.45 and decreases to around 0.05.
* **Normalized MFPT/WHR:** The MFPT line (blue) starts at approximately 1.05 and decreases to around 0.2. The WHR line (red) starts at approximately 0.95 and decreases to around 0.1. The P Intuition line (green) starts at approximately 0.5 and decreases to around 0.02.
* **Rationality Minima:** m* (global) (black solid) starts at approximately 0.15 and decreases to around 0.01. m**(2nd) (black dashed) starts at approximately 0.10 and decreases to around 0.005. m**(3rd) (black dotted) starts at approximately 0.05 and remains near 0.002. The threshold line sits at m = 0.7.
* **Inset:** The first derivative of the cross-entropy E (blue) peaks around λ = 0.01, decreasing to near zero; the first derivative of the path-entropy H (red) peaks around λ = 0.1, decreasing to near zero.
**Panel B (β<sub>dream</sub> = 5.0):**
* **Expected Metric Values:** The MFPT line (blue) starts at approximately 0.25 and decreases to around 0.05. The WHR line (red) starts at approximately 0.20 and decreases to around 0.03. The P Intuition line (green) starts at approximately 0.10 and decreases to around 0.01.
* **Normalized MFPT/WHR:** The MFPT line (blue) starts at approximately 0.9 and decreases to around 0.1. The WHR line (red) starts at approximately 0.8 and decreases to around 0.05. The P Intuition line (green) starts at approximately 0.4 and decreases to around 0.01.
* **Rationality Minima:** m* (global) (black solid) starts at approximately 0.10 and decreases to around 0.005. m**(2nd) (black dashed) starts at approximately 0.05 and decreases to around 0.002. m**(3rd) (black dotted) starts at approximately 0.02 and remains near 0.001. The threshold line sits at m = 0.7.
* **Inset:** The first derivative of the cross-entropy E (blue) peaks around λ = 0.01, decreasing to near zero; the first derivative of the path-entropy H (red) peaks around λ = 0.1, decreasing to near zero.
**Panel C (β<sub>dream</sub> = 20.0):**
* **Expected Metric Values:** The MFPT line (blue) starts at approximately 0.50 and decreases to around 0.05. The WHR line (red) starts at approximately 0.40 and decreases to around 0.03. The P Intuition line (green) starts at approximately 0.20 and decreases to around 0.01.
* **Normalized MFPT/WHR:** The MFPT line (blue) starts at approximately 1.0 and decreases to around 0.2. The WHR line (red) starts at approximately 0.9 and decreases to around 0.1. The P Intuition line (green) starts at approximately 0.5 and decreases to around 0.02.
* **Rationality Minima:** m* (global) (black solid) starts at approximately 0.15 and decreases to around 0.01. m**(2nd) (black dashed) starts at approximately 0.10 and decreases to around 0.005. m**(3rd) (black dotted) starts at approximately 0.05 and remains near 0.002. The threshold line sits at m = 0.7.
* **Inset:** The first derivative of the cross-entropy E (blue) peaks around λ = 0.01, decreasing to near zero; the first derivative of the path-entropy H (red) peaks around λ = 0.1, decreasing to near zero.
### Key Observations
* As β<sub>dream</sub> increases (from A to C), the overall magnitude of the Expected Metric Values decreases.
* In all panels, the MFPT line consistently remains above the WHR line, which in turn remains above the P Intuition line.
* The Rationality Minima plots show a consistent pattern of decreasing values for m* (global), m**(2nd), and m**(3rd) as λ increases.
* The insets consistently show peaks in the first derivatives of the cross-entropy E and the path-entropy H, indicating the points of maximum change in the respective metrics.
### Interpretation
The charts demonstrate the impact of varying β<sub>dream</sub> on expected metric values, normalized MFPT/WHR, and rationality minima. The decreasing magnitude of the expected metric values with increasing β<sub>dream</sub> suggests a diminishing return or an increased cost associated with higher β<sub>dream</sub>. The consistent ordering of the MFPT, WHR, and P Intuition lines indicates a hierarchical relationship between these metrics, potentially reflecting different levels of cognitive effort or complexity. The decreasing rationality minima show that as λ increases the dominant solutions drift toward lower rationality values, with a rate of convergence that depends on the order of the minimum (global, 2nd, 3rd). The insets highlight the points where the system is most sensitive to changes in λ, and the threshold line at m = 0.7 provides a reference for assessing the rationality of the system. The consistent shape of the derivative plots across the panels suggests that the underlying dynamics remain qualitatively stable despite the changes in β<sub>dream</sub>, pointing to a complex interplay between these metrics and β<sub>dream</sub>.
</details>
Figure S4: Dependence of theoretical predictions on the imagination temperature $\beta_{\text{dream}}$ . The theoretical phase diagram is shown for increasing values of $\beta_{\text{dream}}=\{2.0,5.0,20.0\}$ . This parameter controls the policy stochasticity in the self-referential entropy calculation. (A-C) As $\beta_{\text{dream}}$ increases, the system's phase diagram (bottom row) changes. Higher values of this temperature can also reveal more complex phase structures, including the emergence of the inspiration phase, as discussed in the text. Insets in the first row (here and in the main text) measure the numerical first derivatives of both the cross-entropy and path-entropy at low sampling temperature at equilibrium (thus for global attractors). The separation between the peaks of the discontinuities (B,C) signals the entropy- and energy-driven transitions that delimit the intuition window.