# Statistical mechanics of extensive-width Bayesian neural networks near interpolation

Jean Barbier * 1 Francesco Camilli * 1 Minh-Toan Nguyen * 1 Mauro Pastore * 1 Rudy Skerk * 2 footnotetext: * Equal contribution. 1 The Abdus Salam International Centre for Theoretical Physics (ICTP), Strada Costiera 11, 34151 Trieste, Italy. 2 International School for Advanced Studies (SISSA), Via Bonomea 265, 34136 Trieste, Italy.

## Abstract

For three decades, statistical mechanics has provided a framework to analyse neural networks. However, the theoretically tractable models, e.g., perceptrons, random features models and kernel machines, or multi-index models and committee machines with few neurons, remain simple compared to those used in applications. In this paper we help reduce the gap between practical networks and their theoretical understanding through a statistical physics analysis of the supervised learning of a two-layer fully connected network with generic weight distribution and activation function, whose hidden layer is large but remains proportional to the input dimension. This makes it more realistic than infinitely wide networks, where no feature learning occurs, but also more expressive than narrow networks or those with fixed inner weights. We focus on Bayes-optimal learning in the teacher-student scenario, i.e., with a dataset generated by another network with the same architecture. We operate around interpolation, where the number of trainable parameters and the number of data are comparable and feature learning emerges. Our analysis uncovers a rich phenomenology with various learning transitions as the number of data increases. In particular, the more strongly the features (i.e., hidden neurons of the target) contribute to the observed responses, the less data is needed to learn them. Moreover, when the data is scarce, the model only learns non-linear combinations of the teacher weights, rather than “specialising” by aligning its weights with the teacher’s. Specialisation occurs only when enough data becomes available, but the specialised solution can be hard to find for practical training algorithms, possibly due to statistical-to-computational gaps.

# 1 Introduction
Understanding the expressive power and generalisation capabilities of neural networks is not only a stimulating intellectual activity, producing surprising results that seem to defy established common sense in statistics and optimisation (Bartlett et al., 2021), but also has important practical implications for cost-benefit planning whenever a model is deployed. E.g., from a fruitful research line that spanned three decades, we now know that deep fully connected Bayesian neural networks with $O(1)$ readout weights and $L_{2}$ regularisation behave as kernel machines (the so-called Neural Network Gaussian processes, NNGPs) in the heavily overparametrised, infinite-width regime (Neal, 1996; Williams, 1996; Lee et al., 2018; Matthews et al., 2018; Hanin, 2023), and so suffer from these models’ limitations. Indeed, kernel machines infer the decision rule by first embedding the data in a feature space fixed a priori, the renowned kernel trick, then operating linear regression/classification over the features. In this respect, they do not learn features (in the sense of statistics relevant for the decision rule) from the data, so they need larger and larger feature spaces and training sets to fit higher-order statistics of the data (Yoon & Oh, 1998; Dietrich et al., 1999; Gerace et al., 2021; Bordelon et al., 2020; Canatar et al., 2021; Xiao et al., 2023).
Many efforts have been devoted to studying Bayesian neural networks beyond this regime. In the so-called proportional regime, when the width is large and proportional to the training set size, recent studies showed how a limited amount of feature learning makes the network equivalent to optimally regularised kernels (Li & Sompolinsky, 2021; Pacelli et al., 2023; Camilli et al., 2023; Cui et al., 2023; Baglioni et al., 2024; Camilli et al., 2025). This could be a consequence of the fully connected architecture, as, e.g., convolutional neural networks learn more informative features (Naveh & Ringel, 2021; Seroussi et al., 2023; Aiudi et al., 2025; Bassetti et al., 2024). Another scenario is the mean-field scaling, i.e., when the readout weights are small: in this case too a Bayesian network can learn features in the proportional regime (Rubin et al., 2024a; van Meegen & Sompolinsky, 2024).
Here instead we analyse a fully connected two-layer Bayesian network trained end-to-end near the interpolation threshold, where the sample size $n$ scales like the number of trainable parameters: for input dimension $d$ and width $k$ , both large and proportional to each other, $n=\Theta(d^{2})=\Theta(kd)$ , a regime where non-trivial feature learning can happen. We consider i.i.d. Gaussian input vectors with labels generated by a teacher network with matching architecture, in order to study the Bayes-optimal learning of this neural network target function. Our results thus provide a benchmark for the performance of any model trained on the same dataset.
## 2 Setting and main results
### 2.1 Teacher-student setting
We consider supervised learning with a shallow neural network in the classical teacher-student setup (Gardner & Derrida, 1989). The data-generating model, i.e., the teacher (or target function), is thus a two-layer neural network itself, with readout weights ${\mathbf{v}}^{0}\in\mathbb{R}^{k}$ and internal weights ${\mathbf{W}}^{0}\in\mathbb{R}^{k\times d}$ , drawn entrywise i.i.d. from $P_{v}^{0}$ and $P^{0}_{W}$ , respectively; we assume $P^{0}_{W}$ to be centred while $P^{0}_{v}$ has mean $\bar{v}$ , and both priors have unit second moment. We denote the whole set of parameters of the target as ${\bm{\theta}}^{0}=({\mathbf{v}}^{0},{\mathbf{W}}^{0})$ . The inputs are i.i.d. standard Gaussian vectors ${\mathbf{x}}_{\mu}\in\mathbb{R}^{d}$ for $\mu\leq n$ . The responses/labels $y_{\mu}$ are drawn from a kernel $P^{0}_{\rm out}$ :
$$
\textstyle y_{\mu}\sim P^{0}_{\rm out}(\,\cdot\mid\lambda^{0}_{\mu}),\quad
\lambda^{0}_{\mu}:=\frac{1}{\sqrt{k}}{{\mathbf{v}}^{0\intercal}}\sigma(\frac{1
}{\sqrt{d}}{{\mathbf{W}}^{0}{\mathbf{x}}_{\mu}}). \tag{1}
$$
The kernel can be stochastic or model a deterministic rule if $P^{0}_{\rm out}(y\mid\lambda)=\delta(y-\mathsf{f}^{0}(\lambda))$ for some outer non-linearity $\mathsf{f}^{0}$ . The activation function $\sigma$ is applied entrywise to vectors and is required to admit an expansion in Hermite polynomials with Hermite coefficients $(\mu_{\ell})_{\ell\geq 0}$ , see App. A: $\sigma(x)=\sum_{\ell\geq 0}\frac{\mu_{\ell}}{\ell!}{\rm He}_{\ell}(x)$ . We assume it has vanishing 0th Hermite coefficient, i.e., that it is centred $\mathbb{E}_{z\sim\mathcal{N}(0,1)}\sigma(z)=0$ ; in App. D.5 we relax this assumption. The input/output pairs $\mathcal{D}=\{({\mathbf{x}}_{\mu},y_{\mu})\}_{\mu\leq n}$ form the training set for a student network with matching architecture.
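Since $({\rm He}_{\ell})$ are orthogonal with $\mathbb{E}[{\rm He}_{\ell}(z){\rm He}_{m}(z)]=\ell!\,\delta_{\ell m}$ for $z\sim\mathcal{N}(0,1)$, the coefficients are $\mu_{\ell}=\mathbb{E}_{z\sim\mathcal{N}(0,1)}[\sigma(z)\,{\rm He}_{\ell}(z)]$ and can be evaluated numerically. A minimal sketch using Gauss–Hermite quadrature in the probabilists' convention (the helper `hermite_coeffs` is ours, with $\tanh$ as example activation):

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

def hermite_coeffs(sigma, L=6, n_quad=80):
    """mu_l = E_{z~N(0,1)}[sigma(z) He_l(z)], probabilists' Hermite polynomials."""
    x, w = hermegauss(n_quad)          # nodes/weights for the weight exp(-x^2/2)
    w = w / np.sqrt(2 * np.pi)         # normalise to the standard Gaussian measure
    mus = []
    for l in range(L + 1):
        c = np.zeros(l + 1)
        c[l] = 1.0                     # coefficient vector selecting He_l
        mus.append(float(np.sum(w * sigma(x) * hermeval(x, c))))
    return np.array(mus)
```

For an odd, centred activation such as $\tanh$, all even coefficients vanish, consistent with the centring assumption $\mu_{0}=0$ above.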
Notice that the readouts ${\mathbf{v}}^{0}$ are only $k$ unknowns in the target compared to the $kd=\Theta(k^{2})$ inner weights ${\mathbf{W}}^{0}$ . Therefore, they can be equivalently considered quenched, i.e., either given and thus fixed in the student network defined below, or unknown and thus learnable, without changing the leading order of the information-theoretic quantities we aim for. E.g., in terms of mutual information per parameter $\frac{1}{kd+k}I(({\mathbf{W}}^{0},{\mathbf{v}}^{0});\mathcal{D})=\frac{1}{kd}I ({\mathbf{W}}^{0};\mathcal{D}\mid{\mathbf{v}}^{0})+o_{d}(1)$ . Without loss of generality, we thus consider ${\mathbf{v}}^{0}$ quenched and denote it ${\mathbf{v}}$ from now on. This equivalence holds at leading order and at equilibrium only, but not at the dynamical level, the study of which is left for future work.
The Bayesian student learns via the posterior distribution of the weights ${\mathbf{W}}$ given the training data (and ${\mathbf{v}}$ ), defined by
$$
dP({\mathbf{W}}\mid\mathcal{D}):=\mathcal{Z}(\mathcal{D})^{-1}dP_{W}({\mathbf{W}})\prod_{\mu\leq n}P_{\rm out}\big{(}y_{\mu}\mid\lambda_{\mu}({\mathbf{W}})\big{)}
$$
with post-activation $\lambda_{\mu}({\mathbf{W}}):=\frac{1}{\sqrt{k}}{\mathbf{v}}^{\intercal}\sigma( \frac{1}{\sqrt{d}}{{\mathbf{W}}{\mathbf{x}}_{\mu}})$ , posterior normalisation constant $\mathcal{Z}(\mathcal{D})$ (called the partition function), and prior $P_{W}$ assumed by the student. From now on, we focus on the Bayes-optimal case $P_{W}=P_{W}^{0}$ and $P_{\rm out}=P_{\rm out}^{0}$ , but the approach can be extended to account for a mismatch.
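For concreteness, in the Gaussian-channel case $P_{\rm out}(y\mid\lambda)\propto\exp(-\frac{1}{2\Delta}(y-\lambda)^{2})$ with a uniform (e.g., binary) prior absorbed into $\mathcal{Z}(\mathcal{D})$, the unnormalised log-posterior can be written down directly. A minimal sketch (the helper `log_posterior_unnorm` is illustrative, not the paper's code):

```python
import numpy as np

def log_posterior_unnorm(W, v, X, y, Delta, sigma=np.tanh):
    # Log of the unnormalised posterior for the Gaussian channel
    # P_out(y|lam) ~ exp(-(y - lam)^2 / (2 Delta)); a flat (e.g. binary
    # uniform) prior P_W is absorbed into the normalisation Z(D).
    k, d = W.shape
    lam = sigma(W @ X.T / np.sqrt(d)).T @ v / np.sqrt(k)   # n post-activations
    return -0.5 * np.sum((y - lam) ** 2) / Delta
```

On noiseless teacher-generated labels, the teacher's own weights maximise this quantity (value $0$), while unrelated weights score far lower.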
We aim at evaluating the expected generalisation error of the student. Let $({\mathbf{x}}_{\rm test},y_{\rm test}\sim P_{\rm out}(\,\cdot\mid\lambda^{0}_{ \rm test}))$ be a fresh sample (not present in $\mathcal{D}$ ) drawn using the teacher, where $\lambda_{\rm test}^{0}$ is defined as in (1) with ${\mathbf{x}}_{\mu}$ replaced by ${\mathbf{x}}_{\rm test}$ (and similarly for $\lambda_{\rm test}({\mathbf{W}})$ ). Given any prediction function $\mathsf{f}$ , the Bayes estimator for the test response reads $\hat{y}^{\mathsf{f}}({\mathbf{x}}_{\rm test},{\mathcal{D}}):=\langle\mathsf{f} (\lambda_{\rm test}({\mathbf{W}}))\rangle$ , where the expectation $\langle\,\cdot\,\rangle:=\mathbb{E}[\,\cdot\mid\mathcal{D}]$ is w.r.t. the posterior $dP({\mathbf{W}}\mid\mathcal{D})$ . Then, for a performance measure $\mathcal{C}:\mathbb{R}\times\mathbb{R}\mapsto\mathbb{R}_{\geq 0}$ the Bayes generalisation error is
$$
\displaystyle\varepsilon^{\mathcal{C},\mathsf{f}}:=\mathbb{E}_{{\bm{\theta}}^{
0},{\mathcal{D}},{\mathbf{x}}_{\rm test},y_{\rm test}}\mathcal{C}\big{(}y_{\rm
test
},\big{\langle}\mathsf{f}(\lambda_{\rm test}({\mathbf{W}}))\big{\rangle}\big{)}. \tag{2}
$$
An important case is the square loss $\mathcal{C}(y,\hat{y})=(y-\hat{y})^{2}$ with the choice $\mathsf{f}(\lambda)=\int dy\,y\,P_{\rm out}(y\mid\lambda)=:\mathbb{E}[y\mid\lambda]$ . The Bayes-optimal mean-square generalisation error follows:
$$
\displaystyle\varepsilon^{\rm opt} \displaystyle:=\mathbb{E}_{{\bm{\theta}}^{0},{\mathcal{D}},{\mathbf{x}}_{\rm
test
},y_{\rm test}}\big{(}y_{\rm test}-\big{\langle}\mathbb{E}[y\mid\lambda_{\rm
test
}({\mathbf{W}})]\big{\rangle}\big{)}^{2}. \tag{3}
$$
Our main example will be the case of linear readout with Gaussian label noise: $P_{\rm out}(y\mid\lambda)=\exp(-\frac{1}{2\Delta}(y-\lambda)^{2})/\sqrt{2\pi\Delta}$ . In this case, the generalisation error $\varepsilon^{\rm opt}$ takes a simpler form for numerical evaluation than (3), thanks to the concentration of “overlaps” entering it, see App. C.
We study the challenging extensive-width regime with quadratically many samples, i.e., a large size limit
$$
\displaystyle d,k,n\to+\infty\quad\text{with}\quad k/d\to\gamma,\quad n/d^{2}
\to\alpha. \tag{4}
$$
We denote this joint $d,k,n$ limit with these rates by “ ${\lim}$ ”.
In order to access $\varepsilon^{\mathcal{C},\mathsf{f}},\varepsilon^{\rm opt}$ and other relevant quantities, one can tackle the computation of the average log-partition function, or free entropy in statistical physics language:
$$
\textstyle f_{n}:=\frac{1}{n}\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\ln
\mathcal{Z}(\mathcal{D}). \tag{5}
$$
The mutual information between teacher weights and the data is related to the free entropy $f_{n}$ , see App. F. E.g., in the case of linear readout with Gaussian label noise we have $\lim\frac{1}{kd}I({\mathbf{W}}^{0};\mathcal{D}\mid{\mathbf{v}})=-\frac{\alpha} {\gamma}\lim f_{n}-\frac{\alpha}{2\gamma}\ln(2\pi e\Delta)$ . Considering the mutual information per parameter allows us to interpret $\alpha$ as a sort of signal-to-noise ratio, so that the mutual information defined in this way increases with it.
Notations: Bold is for vectors and matrices; $d$ is the input dimension, $k$ the width of the hidden layer, $n$ the size of the training set $\mathcal{D}$ , with asymptotic ratios given by (4); ${\mathbf{A}}^{\circ\ell}$ is the Hadamard power of a matrix; for a vector ${\mathbf{v}}$ , $({\mathbf{v}})$ is the diagonal matrix ${\rm diag}({\mathbf{v}})$ ; $(\mu_{\ell})$ are the Hermite coefficients of the activation function $\sigma(x)=\sum_{\ell\geq 0}\frac{\mu_{\ell}}{\ell!}{\rm He}_{\ell}(x)$ ; the norm $\|\,\cdot\,\|$ for vectors and matrices is the Frobenius norm.
### 2.2 Main results
The aforementioned setting is related to the recent paper Maillard et al. (2024a), with two major differences: that work considers Gaussian-distributed weights and a quadratic activation. These hypotheses allow numerous simplifications in the analysis, exploited in a series of works Du & Lee (2018); Soltanolkotabi et al. (2019); Venturi et al. (2019); Sarao Mannelli et al. (2020); Gamarnik et al. (2024); Martin et al. (2024); Arjevani et al. (2025). Thanks to this, Maillard et al. (2024a) maps the learning task onto a generalised linear model (GLM) where the goal is to infer a Wishart matrix from linear observations, which is analysable using known results on the GLM Barbier et al. (2019) and matrix denoising Barbier & Macris (2022); Maillard et al. (2022); Pourkamali et al. (2024); Semerjian (2024).
Our main contribution is a statistical mechanics framework for characterising the prediction performance of shallow Bayesian neural networks, able to handle arbitrary activation functions and different distributions of i.i.d. weights, both ingredients playing an important role for the phenomenology.
The theory we derive draws a rich picture with various learning transitions when tuning the sample rate $\alpha\approx n/d^{2}$ . For low $\alpha$ , feature learning occurs because the student tunes its weights to match non-linear combinations of the teacher’s, rather than aligning to those weights themselves. This phase is universal in the (centred, unit-variance) law of the i.i.d. teacher inner weights: our numerics, obtained with both binary and Gaussian inner weights, match the theory well, and the theory does not depend on this prior here. When increasing $\alpha$ , strong feature learning emerges through specialisation phase transitions, where the student aligns some of its weights with the teacher’s actual ones. In particular, when the readouts ${\mathbf{v}}$ in the target function have a non-trivial distribution, a whole sequence of specialisation transitions occurs as $\alpha$ grows, for the following intuitive reason. Different features in the data are related to the weights of the teacher neurons, $({\mathbf{W}}^{0}_{j}\in\mathbb{R}^{d})_{j\leq k}$ . The strength with which the responses $(y_{\mu})$ depend on the feature ${\mathbf{W}}_{j}^{0}$ is tuned by the corresponding readout through $|v_{j}|$ , which plays the role of a feature-dependent “signal-to-noise ratio”. Therefore, features/hidden neurons $j\in[k]$ corresponding to the largest readout amplitude $\max\{|v_{j}|\}$ are learnt first by the student when increasing $\alpha$ (in the sense that the teacher-student overlap ${\mathbf{W}}^{\intercal}_{j}{\mathbf{W}}^{0}_{j}/d>o_{d}(1)$ ), then features with the second-largest amplitude, and so on. If the readouts are continuous, an infinite sequence of specialisation transitions emerges in the limit (4). On the contrary, if the readouts are homogeneous (i.e., all take the same value), then a single transition occurs where almost all neurons of the student specialise jointly (possibly up to a vanishing fraction).
We predict specialisation transitions to occur for binary inner weights and generic activation, or for Gaussian ones and more-than-quadratic activation. We provide a theoretical description of these learning transitions and identify the order parameters (sufficient statistics) needed to deduce the generalisation error through scalar equations.
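A specialisation diagnostic of this kind can be computed directly from weight matrices via the per-neuron overlaps ${\mathbf{W}}^{\intercal}_{j}{\mathbf{W}}^{0}_{j}/d$. A small illustrative sketch (the helper `specialisation_overlaps` is ours; neurons are matched greedily per teacher neuron, which is a simplification of the permutation/sign ambiguity):

```python
import numpy as np

def specialisation_overlaps(W, W0):
    # |W_i . W0_j| / d for all student/teacher pairs, then the best student
    # match per teacher neuron; absolute values handle the sign ambiguity.
    d = W.shape[1]
    M = np.abs(W @ W0.T) / d
    return M.max(axis=0)
```

Overlaps of order $1$ signal a specialised student, while $O(1/\sqrt{d})$ overlaps correspond to the universal (non-specialised) phase.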
The picture that emerges is connected to recent findings in the context of extensive-rank matrix denoising Barbier et al. (2025). In that model, a recovery transition was also identified, separating a universal phase (i.e., independent of the signal prior) from a factorisation phase akin to specialisation in the present context. We believe that this picture and the one found in the present paper are not just similar, but a manifestation of the same fundamental mechanism inherent to the extensive rank of the matrices involved. Indeed, matrix denoising and neural networks share features with both matrix models Kazakov (2000); Brézin et al. (2016); Anninos & Mühlmann (2020) and planted mean-field spin glasses Nishimori (2001); Zdeborová & Krzakala (2016). This mixed nature requires blending techniques from both fields to tackle them. Consequently, the approach developed in Sec. 4 based on the replica method Mezard et al. (1986) is non-standard, as it crucially relies on the Harish Chandra–Itzykson–Zuber (HCIZ), or “spherical”, integral used in matrix models Itzykson & Zuber (1980); Matytsin (1994); Guionnet & Zeitouni (2002). Mixing spherical integration and the replica method has been previously attempted in Schmidt (2018); Barbier & Macris (2022) for matrix denoising, both papers yielding promising but quantitatively inaccurate or non-computable results. Another attempt to exploit a mean-field technique for matrix denoising (in that case a high-temperature expansion) is Maillard et al. (2022), which suffers from similar limitations. The more quantitative answer from Barbier et al. (2025) was made possible precisely thanks to the understanding that the problem behaves more as a matrix model or as a planted mean-field spin glass depending on the phase in which it lives. The two phases could then be treated separately and then joined using an appropriate criterion to locate the transition.
It would be desirable to derive a unified theory able to describe the whole phase diagram based on a single formalism. This is what the present paper provides through a principled combination of spherical integration and the replica method, yielding predictive formulas that are easy to evaluate. It is important to notice that the presence of the HCIZ integral, which is a high-dimensional matrix integral, in the replica formula presented in Result 2.1 suggests that effective one-body problems are not enough to capture the physics of the problem on their own, as is usually the case in standard mean-field inference and spin glass models. Indeed, the appearance of effective one-body problems to describe complex statistical models is usually related to the asymptotic decoupling of the finite marginals of the variables in the problem at hand in terms of products of the single-variable marginals. Therefore, we do not expect a standard cavity (or leave-one-out) approach based on single-variable extraction to be exact, while it is usually shown that the replica and cavity approaches are equivalent in mean-field models Mezard et al. (1986). This may explain why the approximate message-passing algorithms proposed in Parker et al. (2014); Krzakala et al. (2013); Kabashima et al. (2016) are, as stated by the authors, neither properly converging nor able to match their corresponding theoretical predictions based on the cavity method. Algorithms for extensive-rank systems should therefore combine ingredients from matrix denoising and standard message-passing, reflecting their hybrid mean-field/matrix model nature.
To address this, we adapt the GAMP-RIE (generalised approximate message-passing with rotational invariant estimator), introduced in Maillard et al. (2024a) for the special case of quadratic activation, to accommodate a generic activation function $\sigma$ . By construction, the resulting algorithm described in App. H cannot find the specialisation solution, i.e., a solution where at least $\Theta(k)$ neurons align with the teacher’s. Nevertheless, it matches the performance associated with the so-called universal solution/branch of our theory for all $\alpha$ , which describes a solution with overlap ${\mathbf{W}}^{\intercal}_{j}{\mathbf{W}}^{0}_{j}/d>o_{d}(1)$ for at most $o(k)$ neurons. As a side investigation, we show empirically that the specialisation solution is potentially hard to reach with popular algorithms for some target functions: the algorithms we tested either fail to find it and instead get stuck in a sub-optimal glassy phase (Metropolis-Hastings sampling for the case of binary inner weights), or may find it but in a training time increasing exponentially with $d$ (ADAM Kingma & Ba (2017) and Hamiltonian Monte Carlo (HMC) for the case of Gaussian weights). It would thus be interesting to settle whether GAMP-RIE has the best prediction performance achievable by a polynomial-time learner when $n=\Theta(d^{2})$ for such targets. For specific choices of the distribution of the readout weights, the evidence of hardness is not conclusive and requires further investigation.
#### Replica free entropy
Our first result is a tractable approximation for the free entropy. To state it, let us introduce two functions $\mathcal{Q}_{W}(\mathsf{v}),\hat{\mathcal{Q}}_{W}(\mathsf{v})\in[0,1]$ for $\mathsf{v}\in{\rm Supp}(P_{v})$ , which are non-decreasing in $|\mathsf{v}|$ . Let (see (43) in appendix for a more explicit expression of $g$ )
$$
g(x):=\sum_{\ell\geq 3}\frac{\mu_{\ell}^{2}}{\ell!}x^{\ell},\qquad
q_{K}(x,\mathcal{Q}_{W}):=\mu_{1}^{2}+\frac{\mu_{2}^{2}}{2}x+\mathbb{E}_{v\sim P_{v}}[v^{2}g(\mathcal{Q}_{W}(v))],\qquad
r_{K}:=\mu_{1}^{2}+\frac{\mu_{2}^{2}}{2}(1+\gamma\bar{v}^{2})+g(1),
$$
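These scalar functions are straightforward to evaluate once the Hermite coefficients are known. A minimal sketch for a discrete readout prior $P_v$ (helper names are ours; the series $g$ is truncated at the highest available coefficient):

```python
import numpy as np
from math import factorial

def g(x, mu):
    # truncated series g(x) = sum_{l>=3} x^l mu_l^2 / l!, with mu[l] the l-th Hermite coeff
    return sum(x ** l * mu[l] ** 2 / factorial(l) for l in range(3, len(mu)))

def q_K(x, QW, v_vals, v_probs, mu):
    # q_K(x, Q_W) = mu1^2 + mu2^2 x / 2 + E_{v~P_v}[v^2 g(Q_W(v))], discrete P_v
    Ev = np.sum(v_probs * v_vals ** 2 * np.array([g(q, mu) for q in QW]))
    return mu[1] ** 2 + mu[2] ** 2 * x / 2 + Ev

def r_K(gamma, vbar, mu):
    # r_K = mu1^2 + mu2^2 (1 + gamma vbar^2) / 2 + g(1)
    return mu[1] ** 2 + mu[2] ** 2 * (1 + gamma * vbar ** 2) / 2 + g(1.0, mu)
```

For instance, for $\sigma={\rm He}_{3}$ (whose only non-zero coefficient is $\mu_{3}=3!$) one gets $g(1)=6$.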
and the auxiliary potentials
$$
\psi_{P_{W}}(x):=\mathbb{E}_{w^{0},\xi}\ln\mathbb{E}_{w}\exp\big(-\tfrac{1}{2}xw^{2}+xw^{0}w+\sqrt{x}\,\xi w\big),
$$
where $w^{0},w\sim P_{W}$ and $\xi,u_{0},u\sim{\mathcal{N}}(0,1)$ all independent. Moreover, $\mu_{{\mathbf{Y}}(x)}$ is the limiting (in $d\to\infty$ ) spectral density of data ${\mathbf{Y}}(x)=\sqrt{x/(kd)}\,{\mathbf{S}}^{0}+{\mathbf{Z}}$ in the denoising problem of the matrix ${\mathbf{S}}^{0}:={\mathbf{W}}^{0\intercal}({\mathbf{v}}){\mathbf{W}}^{0}\in \mathbb{R}^{d\times d}$ , with ${\mathbf{Z}}$ a standard GOE matrix (a symmetric matrix whose upper triangular part has i.i.d. entries from $\mathcal{N}(0,(1+\delta_{ij})/d)$ ). Denote the minimum mean-square error associated with this denoising problem as ${\rm mmse}_{S}(x)=\lim_{d\to\infty}d^{-2}\mathbb{E}\|{\mathbf{S}}^{0}-\mathbb{ E}[{\mathbf{S}}^{0}\mid{\mathbf{Y}}(x)]\|^{2}$ (whose explicit definition is given in App. D.3) and its functional inverse by ${\rm mmse}_{S}^{-1}$ (which exists by monotonicity).
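The noisy observation ${\mathbf{Y}}(x)$ of this denoising problem can be sampled directly with the stated normalisations. A minimal sketch (helper names are ours):

```python
import numpy as np

def goe(d, rng):
    # symmetric matrix whose upper-triangular part has i.i.d. entries
    # from N(0, (1 + delta_ij) / d): off-diagonal variance 1/d, diagonal 2/d
    A = rng.normal(size=(d, d)) / np.sqrt(d)
    return (A + A.T) / np.sqrt(2)

def noisy_data(W0, v, x, rng):
    # Y(x) = sqrt(x / (k d)) S0 + Z,  with  S0 = W0^T diag(v) W0
    k, d = W0.shape
    S0 = W0.T @ np.diag(v) @ W0
    return np.sqrt(x / (k * d)) * S0 + goe(d, rng)
```

With this normalisation the GOE spectrum converges to a semicircle supported on $[-2,2]$.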
**Result 2.1 (Replica symmetric free entropy)**
*Let the functional $\tau(\mathcal{Q}_{W}):={\rm mmse}_{S}^{-1}(1-\mathbb{E}_{v\sim P_{v}}[v^{2} \mathcal{Q}_{W}(v)^{2}])$ . Given $(\alpha,\gamma)$ , the replica symmetric (RS) free entropy approximating ${\lim}\,f_{n}$ in the scaling limit (4) is ${\rm extr}\,f_{\rm RS}^{\alpha,\gamma}$ with RS potential $f^{\alpha,\gamma}_{\rm RS}=f^{\alpha,\gamma}_{\rm RS}(q_{2},\hat{q}_{2}, \mathcal{Q}_{W},\hat{\mathcal{Q}}_{W})$ given by
$$
\begin{aligned}
f^{\alpha,\gamma}_{\rm RS}:={}&\psi_{P_{\rm out}}(q_{K}(q_{2},\mathcal{Q}_{W});r_{K})+\frac{1}{4\alpha}(1+\gamma\bar{v}^{2}-q_{2})\hat{q}_{2}\\
&+\frac{\gamma}{\alpha}\mathbb{E}_{v\sim P_{v}}\Big[\psi_{P_{W}}(\hat{\mathcal{Q}}_{W}(v))-\frac{1}{2}\mathcal{Q}_{W}(v)\hat{\mathcal{Q}}_{W}(v)\Big]\\
&+\frac{1}{\alpha}\big[\iota(\tau(\mathcal{Q}_{W}))-\iota(\hat{q}_{2}+\tau(\mathcal{Q}_{W}))\big]. 
\end{aligned}\tag{6}
$$
The extremisation operation in ${\rm extr}\,f^{\alpha,\gamma}_{\rm RS}$ selects a solution $(q_{2}^{*},\hat{q}_{2}^{*},\mathcal{Q}_{W}^{*},\hat{\mathcal{Q}}_{W}^{*})$ of the saddle point equations, obtained from $\nabla f^{\alpha,\gamma}_{\rm RS}=\mathbf{0}$ , which maximises the RS potential.*
The extremisation of $f_{\rm RS}^{\alpha,\gamma}$ yields the system (76) in the appendix, solved numerically in a standard way (see provided code).
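Such saddle-point systems are typically solved by damped fixed-point iteration. A minimal generic sketch (the update map `F` stands in for the actual equations of system (76), which are not reproduced here):

```python
import numpy as np

def solve_fixed_point(F, x0, damping=0.5, tol=1e-12, max_iter=100_000):
    # Damped iteration x <- (1 - damping) F(x) + damping x; damping helps
    # convergence of replica saddle-point systems.
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = (1 - damping) * np.asarray(F(x), dtype=float) + damping * x
        if np.max(np.abs(x_new - x)) < tol:
            return x_new
        x = x_new
    return x
```

For instance, iterating the toy map $F(x)=\cos x$ converges to its unique fixed point.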
The order parameters $q_{2}^{*}$ and $\mathcal{Q}_{W}^{*}$ have a precise physical meaning that will become clear from the discussion in Sec. 4. In particular, $q_{2}^{*}$ measures the alignment of the student’s combination of weights ${\mathbf{W}}^{\intercal}({\mathbf{v}}){\mathbf{W}}/\sqrt{k}$ with the corresponding teacher’s ${\mathbf{W}}^{0\intercal}({\mathbf{v}}){\mathbf{W}}^{0}/\sqrt{k}$ , which is non-trivial with $n=\Theta(d^{2})$ data even when the student is not able to reconstruct ${\mathbf{W}}^{0}$ itself (i.e., to specialise). On the other hand, $\mathcal{Q}_{W}^{*}(\mathsf{v})$ measures the overlap between weights $\{{\mathbf{W}}_{i}^{0/\cdot}\mid v_{i}=\mathsf{v}\}$ (a different treatment for weights connected to different $\mathsf{v}$ ’s is needed because, as discussed earlier, the student will first learn, with less data, the weights connected to larger readouts). A non-trivial $\mathcal{Q}_{W}^{*}(\mathsf{v})\neq 0$ signals that the student learns something about ${\mathbf{W}}^{0}$ . Thus, the specialisation transitions are naturally defined, based on the extremiser of $f_{\rm RS}^{\alpha,\gamma}$ in the result above, as $\alpha_{\rm sp,\mathsf{v}}(\gamma):=\sup\,\{\alpha\mid\mathcal{Q}^{*}_{W}( \mathsf{v})=0\}$ . For non-homogeneous readouts, we call the specialisation transition $\alpha_{\rm sp}(\gamma):=\min_{\mathsf{v}}\alpha_{\rm sp,\mathsf{v}}(\gamma)$ . In this article, we report cases where the inner weights are discrete or Gaussian distributed. For activations different from a pure quadratic, $\sigma(x)\neq x^{2}$ , we predict the transition to occur in both cases (see Fig. 1 and 2). Then, $\alpha<\alpha_{\rm sp}$ corresponds to the universal phase, where the free entropy is independent of the choice of the prior over the inner weights.
Instead, $\alpha>\alpha_{\rm sp}$ is the specialisation phase where the prior $P_{W}$ matters, and the student aligns a finite fraction of its weights $({\mathbf{W}}_{j})_{j\leq k}$ with those of the teacher, which lowers the generalisation error.
Let us comment on why the special case $\sigma(x)=x^{2}$ with $P_{W}=\mathcal{N}(0,1)$ could be treated exactly with known techniques (spherical integration) in Maillard et al. (2024a); Xu et al. (2025). With $\sigma(x)=x^{2}$ the responses $(y_{\mu})$ depend on ${\mathbf{W}}^{0\intercal}({\mathbf{v}}){\mathbf{W}}^{0}$ only. If ${\mathbf{v}}$ has finite fractions of equal entries, a large invariance group prevents learning ${\mathbf{W}}^{0}$ and thus specialisation. Take as an example ${\mathbf{v}}=(1,\ldots,1,-1,\ldots,-1)$ with the first half filled with ones. Then, the responses are indistinguishable from those obtained using a modified matrix ${\mathbf{W}}^{0\intercal}{\mathbf{U}}^{\intercal}({\mathbf{v}}){\mathbf{U}}{ \mathbf{W}}^{0}$ where ${\mathbf{U}}=(({\mathbf{U}}_{1},\mathbf{0}_{d/2})^{\intercal},(\mathbf{0}_{d/2 },{\mathbf{U}}_{2})^{\intercal})$ is block diagonal with $d/2\times d/2$ orthogonal ${\mathbf{U}}_{1},{\mathbf{U}}_{2}$ and zeros on off-diagonal blocks. The Gaussian prior $P_{W}$ is rotationally invariant and, thus, does not break any invariance, so ${\mathbf{U}}_{1},{\mathbf{U}}_{2}$ are arbitrary. The resulting invariance group has $\Theta(d^{2})$ entropy (the logarithm of its volume), which is comparable to the leading order of the free entropy. Therefore, it cannot be broken using infinitesimal perturbations (or “side information”) and, consequently, prevents specialisation. This reasoning can be extended to $P_{v}$ with continuous support, as long as we can discretise it with a finite (possibly large) number of bins, take the limit (4) first, and then take the continuum limit of the binning afterwards. However, the picture changes if the prior breaks rotational invariance; e.g., with Rademacher $P_{W}$ , only signed permutation invariances survive, a symmetry with negligible entropy $o(d^{2})$ which, consequently, does not change the limiting thermodynamic (information-theoretic) quantities.
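The invariance argument can be checked numerically: a block-orthogonal ${\mathbf{U}}$ commutes with $({\mathbf{v}})$ when ${\mathbf{v}}$ is constant within blocks, so ${\mathbf{W}}^{0\intercal}{\mathbf{U}}^{\intercal}({\mathbf{v}}){\mathbf{U}}{\mathbf{W}}^{0}={\mathbf{W}}^{0\intercal}({\mathbf{v}}){\mathbf{W}}^{0}$. A small self-contained sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 20, 30
v = np.concatenate([np.ones(k // 2), -np.ones(k // 2)])  # homogeneous half-blocks

def rand_orth(m):
    # random orthogonal matrix via QR decomposition
    q, _ = np.linalg.qr(rng.normal(size=(m, m)))
    return q

# block-diagonal U = diag(U1, U2) commutes with diag(v), hence leaves S0 invariant
U = np.zeros((k, k))
U[:k // 2, :k // 2] = rand_orth(k // 2)
U[k // 2:, k // 2:] = rand_orth(k // 2)

W0 = rng.normal(size=(k, d))
S0 = W0.T @ np.diag(v) @ W0
S_rot = W0.T @ U.T @ np.diag(v) @ U @ W0   # identical to S0 up to round-off
```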
The large rotational invariance group is the reason why $\sigma(x)=x^{2}$ with $P_{W}=\mathcal{N}(0,1)$ can be treated using the HCIZ integral alone. Even when $P_{W}=\mathcal{N}(0,1)$ , the presence of any other term in the series expansion of $\sigma$ breaks invariances with large entropy: specialisation can then occur, thus requiring our theory. We mention that our theory seems inexact for $\sigma(x)=x^{2}$ with $P_{W}=\mathcal{N}(0,1)$ if applied naively, as it predicts ${\mathcal{Q}}_{W}(\mathsf{v})>0$ and therefore does not recover the rigorous result of Xu et al. (2025) (yet, it predicts a free entropy less than $1\%$ away from the truth). Indeed, when solving the extremisation of (6) in this case, we noticed that the difference between the RS free entropy of the correct universal solution, $\mathcal{Q}_{W}(\mathsf{v})=0$ , and the maximiser, predicting $\mathcal{Q}_{W}(\mathsf{v})>0$ , does not exceed $\approx 1\%$ : the RS potential is very flat as a function of $\mathcal{Q}_{W}$ . We thus cannot discard that the true maximiser of the potential is at $\mathcal{Q}_{W}(\mathsf{v})=0$ , and that we observe otherwise due to numerical errors; evaluating the spherical integrals $\iota(\,\cdot\,)$ in $f^{\alpha,\gamma}_{\rm RS}$ is challenging, in particular when $\gamma$ is small. In fact, for $\gamma\gtrsim 1$ we do find that $\mathcal{Q}_{W}(\mathsf{v})=0$ is always the maximiser for $\sigma(x)=x^{2}$ with $P_{W}=\mathcal{N}(0,1)$ . Nevertheless, the solution of Maillard et al. (2024a); Xu et al. (2025) is recovered from our equations by enforcing a vanishing overlap $\mathcal{Q}_{W}(\mathsf{v})=0$ , i.e., via its universal branch.
#### Bayes generalisation error
Another main result is an approximate formula for the generalisation error. Let $({\mathbf{W}}^{a})_{a\geq 1}$ be i.i.d. samples from the posterior $dP(\,\cdot\mid\mathcal{D})$ and ${\mathbf{W}}^{0}$ the teacher’s weights. Assuming that the joint law of $(\lambda_{\rm test}({\mathbf{W}}^{a},{\mathbf{x}}_{\rm test}))_{a\geq 0}=:( \lambda^{a})_{a\geq 0}$ for a common test input ${\mathbf{x}}_{\rm test}\notin\mathcal{D}$ is a centred Gaussian, our framework predicts its covariance. Our approximation for the Bayes error follows.
**Result 2.2 (Bayes generalisation error)**
*Let $q_{K}^{*}=q_{K}(q_{2}^{*},\mathcal{Q}_{W}^{*})$ where $(q_{2}^{*},\hat{q}_{2}^{*},\mathcal{Q}_{W}^{*},\hat{\mathcal{Q}}_{W}^{*})$ is an extremiser of $f_{\rm RS}^{\alpha,\gamma}$ as in Result 2.1. Assuming joint Gaussianity of the post-activations $(\lambda^{a})_{a\geq 0}$ , in the scaling limit (4) their mean is zero and their covariance is approximated by $\mathbb{E}\lambda^{a}\lambda^{b}=q_{K}^{*}+(r_{K}-q_{K}^{*})\delta_{ab}=:( \mathbf{\Gamma})_{ab}$ , see App. C. Assume $\mathcal{C}$ has the series expansion $\mathcal{C}(y,\hat{y})=\sum_{i\geq 0}c_{i}(y)\hat{y}^{i}$ . The Bayes error $\smash{\lim\,\varepsilon^{\mathcal{C},\mathsf{f}}}$ is approximated by
$$
\mathbb{E}_{(\lambda^{a})\sim\mathcal{N}(\mathbf{0},\mathbf{\Gamma})}\mathbb{E}_{y_{\rm test}\sim P_{\rm out}(\,\cdot\mid\lambda^{0})}\sum_{i\geq 0}c_{i}(y_{\rm test}(\lambda^{0}))\prod_{a=1}^{i}\mathsf{f}(\lambda^{a}).
$$
Letting $\mathbb{E}[\,\cdot\mid\lambda]=\int dy\,(\,\cdot\,)\,P_{\rm out}(y\mid\lambda)$ , the Bayes-optimal mean-square generalisation error $\smash{\lim\,\varepsilon^{\rm opt}}$ is approximated by
$$
\textstyle\mathbb{E}_{\lambda^{0},\lambda^{1}}\big{(}\mathbb{E}[y^{2}\mid
\lambda^{0}]-\mathbb{E}[y\mid\lambda^{0}]\mathbb{E}[y\mid\lambda^{1}]\big{)}. \tag{7}
$$*
This result assumed that $\mu_{0}=0$ ; see App. D.5 if this is not the case. Results 2.1 and 2.2 provide an effective theory for the generalisation capabilities of Bayesian shallow networks with generic activation. We call these “results” because, despite their excellent match with numerics, we do not expect these formulas to be exact: their derivation is based on an unconventional mix of spin glass techniques and spherical integrals, and requires approximations in order to deal with the fact that the degrees of freedom to integrate are large matrices of extensive rank. This is in contrast with simpler (vector) models (perceptrons, multi-index models, etc.), where replica formulas are routinely proved correct, see e.g. Barbier & Macris (2019); Barbier et al. (2019); Aubin et al. (2018).
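As a sanity check of (7) in the Gaussian-channel example, where $\mathbb{E}[y\mid\lambda]=\lambda$ and $\mathbb{E}[y^{2}\mid\lambda]=\lambda^{2}+\Delta$, formula (7) reduces to the closed form $r_{K}-q_{K}^{*}+\Delta$. A Monte Carlo sketch (the helper `bayes_mse_gaussian_channel` is illustrative, not the paper's code):

```python
import numpy as np

def bayes_mse_gaussian_channel(rK, qK, Delta, n_samples=400_000, seed=0):
    # Monte Carlo evaluation of eq. (7) for the Gaussian channel, where
    # E[y|lam] = lam and E[y^2|lam] = lam^2 + Delta; (lam0, lam1) are jointly
    # Gaussian with covariance Gamma = [[rK, qK], [qK, rK]].
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(np.array([[rK, qK], [qK, rK]]))
    lam = rng.normal(size=(n_samples, 2)) @ L.T
    return float(np.mean(lam[:, 0] ** 2 + Delta - lam[:, 0] * lam[:, 1]))
```

The Monte Carlo estimate should agree with $r_{K}-q_{K}^{*}+\Delta$ up to sampling error.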
Figure 1: Theoretical prediction (solid curves) of the Bayes-optimal mean-square generalisation error for Gaussian inner weights with ReLU(x) activation (blue curves) and Tanh(2x) activation (red curves), $d=150$, $\gamma=0.5$, with linear readout with Gaussian label noise of variance $\Delta=0.1$ and different $P_{v}$ laws. The dashed lines are the theoretical predictions associated with the universal solution, obtained by plugging ${\mathcal{Q}}_{W}(\mathsf{v})=0\ \forall\ \mathsf{v}$ in (6) and extremising w.r.t. $(q_{2},\hat{q}_{2})$ (the curve coincides with the optimal one before the transition $\alpha_{\rm sp}(\gamma)$ ). The numerical points are obtained with Hamiltonian Monte Carlo (HMC) with informative initialisation on the target (empty circles), with uninformative (random) initialisation (empty crosses), and with ADAM (thin crosses). Triangles are the error of GAMP-RIE (Maillard et al., 2024a) extended to generic activation, obtained by plugging estimator (109) in (3) in appendix. Each point has been averaged over 10 instances of the teacher and training set. Error bars are the standard deviation over instances. The generalisation error for a given training set is evaluated as $\frac{1}{2}\mathbb{E}_{{\mathbf{x}}_{\rm test}\sim\mathcal{N}(0,I_{d})}( \lambda_{\rm test}({\mathbf{W}})-\lambda_{\rm test}^{0})^{2}$ , using a single sample ${\mathbf{W}}$ from the posterior for HMC. For ADAM, with batch size fixed to $n/5$ and initial learning rate $0.05$ , the error corresponds to the lowest one reached during training, i.e., we use early stopping based on the minimum test loss over all gradient updates. Its generalisation error is then evaluated at this point and divided by two (for comparison with the theory). The average over ${\mathbf{x}}_{\rm test}$ is computed empirically from $10^{5}$ i.i.d. test samples. We exploit that, for typical posterior samples, the Gibbs error $\varepsilon^{\rm Gibbs}$ defined in (39) in App. 
C is linked to the Bayes-optimal error as $(\varepsilon^{\rm Gibbs}-\Delta)/2=\varepsilon^{\rm opt}-\Delta$ , see (40) in appendix. To use this formula, we are assuming the concentration of the Gibbs error w.r.t. the posterior distribution, in order to evaluate it from a single sample per instance. Left: Homogeneous readouts $P_{v}=\delta_{1}$ . Centre: 4-points readouts $P_{v}=\frac{1}{4}(\delta_{-3/\sqrt{5}}+\delta_{-1/\sqrt{5}}+\delta_{1/\sqrt{5} }+\delta_{3/\sqrt{5}})$ . Right: Gaussian readouts $P_{v}=\mathcal{N}(0,1)$ .
## 3 Theoretical predictions and numerical experiments
Let us compare our theoretical predictions with simulations. In Figs. 1 and 2, we report the theoretical curves from Result 2.2, focusing on the optimal mean-square generalisation error for networks with different $\sigma$ and a linear readout with Gaussian noise variance $\Delta$. The Gibbs error divided by $2$ is used to compute the optimal error, see Remark C.2 in App. C for a justification. In what follows, the error attained by ADAM is also divided by two, only for the purpose of comparison.
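The conversion from the measured Gibbs error to the Bayes-optimal one, $(\varepsilon^{\rm Gibbs}-\Delta)/2=\varepsilon^{\rm opt}-\Delta$, amounts to halving the excess error above the noise floor. A minimal sketch (the numerical values below are illustrative):

```python
def gibbs_to_opt(eps_gibbs: float, delta: float) -> float:
    """Map a measured Gibbs error to the Bayes-optimal error via
    (eps_gibbs - delta)/2 = eps_opt - delta."""
    return delta + (eps_gibbs - delta) / 2

# e.g. with noise variance Δ = 0.1, a measured Gibbs error of 0.3
# corresponds to a Bayes-optimal error of 0.2
print(gibbs_to_opt(0.3, 0.1))
```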
Figure 1 focuses on networks with Gaussian inner weights, various readout laws, for $\sigma(x)={\rm ReLU}(x)$ and ${\rm Tanh}(2x)$ . Informative (i.e., on the teacher) and uninformative (random) initialisations are used when sampling the posterior by HMC. We also run ADAM, always selecting its best performance over all epochs, and implemented an extension of the GAMP-RIE of Maillard et al. (2024a) for generic activation (see App. H). It can be shown analytically that GAMP-RIE’s generalisation error asymptotically (in $d$ ) matches the prediction of the universal branch of our theory (i.e., associated with $\mathcal{Q}_{W}(\mathsf{v})=0\ \forall\ \mathsf{v}$ ).
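The empirical estimate of the generalisation error used in the figures, $\frac{1}{2}\mathbb{E}_{{\mathbf{x}}_{\rm test}}(\lambda_{\rm test}({\mathbf{W}})-\lambda_{\rm test}^{0})^{2}$ over i.i.d. Gaussian test inputs, can be sketched as follows; the $1/\sqrt{d}$ and $1/\sqrt{k}$ normalisations and the Tanh activation are illustrative assumptions, not necessarily the paper's exact conventions:

```python
import numpy as np

def post_activation(W, v, X, sigma=np.tanh):
    """λ(x) = v·σ(Wx/√d)/√k for each row x of X (assumed normalisation)."""
    d = W.shape[1]
    return sigma(X @ W.T / np.sqrt(d)) @ v / np.sqrt(len(v))

def empirical_gen_error(W, W0, v, n_test=100_000, rng=None):
    """(1/2) E_x (λ_test(W) − λ_test(W⁰))², estimated from i.i.d.
    Gaussian test inputs."""
    if rng is None:
        rng = np.random.default_rng(0)
    X = rng.standard_normal((n_test, W.shape[1]))
    diff = post_activation(W, v, X) - post_activation(W0, v, X)
    return 0.5 * np.mean(diff**2)

rng = np.random.default_rng(1)
d, k = 50, 25
W0 = rng.standard_normal((k, d))
v = np.ones(k)
print(empirical_gen_error(W0, W0, v))  # a perfect student has zero error
```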
For ReLU activation and homogeneous readouts (left panel), informed HMC follows the specialisation branch (the solution of the saddle point equations with $\mathcal{Q}_{W}(\mathsf{v})\neq 0$ for at least one $\mathsf{v}$ ), while with uninformative initialisation it sticks to the universal branch, thus suggesting algorithmic hardness. We shall come back to this matter in the following. We note that the error attained by ADAM (divided by 2) is close to the performance associated with the universal branch, which suggests that ADAM is an effective Gibbs estimator for this $\sigma$ . For Tanh and homogeneous readouts, both the uninformative and informative points lie on the specialisation branch, while ADAM attains an error greater than twice the posterior sample’s generalisation error.
For non-homogeneous readouts (centre and right panels) the points associated with the informative initialisation lie consistently on the specialisation branch, for both ${\rm ReLU}$ and Tanh, while the uninformatively initialised samples have a slightly worse performance for Tanh. Non-homogeneous readouts improve the ADAM performance: for Gaussian readouts and high sampling ratio its half-generalisation error is consistently below the error associated with the universal branch of the theory.
Figure 2 concerns networks with Rademacher weights and homogeneous readout. The numerical points are of two kinds: the dots, obtained from Metropolis–Hastings sampling of the weight posterior, and the circles, obtained from the GAMP-RIE (App. H). We report analogous simulations for ${\rm ReLU}$ and ${\rm ELU}$ activations in Figure 7, App. H. The remarkable agreement between theoretical curves and experimental points in both phases supports the assumptions used in Sec. 4.
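A minimal sketch of such a Metropolis–Hastings sampler over binary inner weights, assuming a linear readout and a Gaussian likelihood of variance $\Delta$ (an illustrative choice; the paper's exact posterior and proposal schedule may differ): single entries of ${\mathbf{W}}$ are flipped and each flip is accepted with the usual Metropolis rule.

```python
import numpy as np

def energy(W, v, X, y, Delta, sigma=np.tanh):
    """Negative log-likelihood (up to constants) of labels y under a
    two-layer net with inner weights W and readout v."""
    d = W.shape[1]
    pred = sigma(X @ W.T / np.sqrt(d)) @ v / np.sqrt(len(v))
    return np.sum((y - pred) ** 2) / (2 * Delta)

def mh_sweep(W, v, X, y, Delta, rng):
    """One Metropolis sweep: propose k*d single sign flips of W."""
    k, d = W.shape
    E = energy(W, v, X, y, Delta)
    for _ in range(k * d):
        i, j = rng.integers(k), rng.integers(d)
        W[i, j] *= -1                       # propose a sign flip
        E_new = energy(W, v, X, y, Delta)
        if rng.random() < np.exp(min(0.0, E - E_new)):
            E = E_new                       # accept
        else:
            W[i, j] *= -1                   # reject: undo the flip
    return W

# Tiny teacher-student instance (illustrative sizes)
rng = np.random.default_rng(0)
k, d, n, Delta = 4, 10, 40, 0.5
W0 = rng.choice([-1.0, 1.0], size=(k, d))
v = np.ones(k)
X = rng.standard_normal((n, d))
y = np.tanh(X @ W0.T / np.sqrt(d)) @ v / np.sqrt(k) \
    + np.sqrt(Delta) * rng.standard_normal(n)
W = rng.choice([-1.0, 1.0], size=(k, d))    # random (uninformative) init
for _ in range(20):
    W = mh_sweep(W, v, X, y, Delta, rng)
```

From a random initialisation the energy typically relaxes towards posterior-typical values; initialising `W` near `W0` instead mimics the informative runs of the figures.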
Figure 2: Theoretical prediction (solid curves) of the Bayes-optimal mean-square generalisation error for binary inner weights and polynomial activations: $\sigma_{1}={\rm He}_{2}/\sqrt{2}$ , $\sigma_{2}={\rm He}_{3}/\sqrt{6}$ , $\sigma_{3}={\rm He}_{2}/\sqrt{2}+{\rm He}_{3}/6$ , with $\gamma=0.5$ , $d=150$ , linear readout with Gaussian label noise with $\Delta=1.25$ , and homogeneous readouts ${\mathbf{v}}=\mathbf{1}$ . Dots are optimal errors computed via Gibbs errors (see Fig. 1) by running a Metropolis-Hastings MCMC initialised near the teacher. Circles are the error of GAMP-RIE (Maillard et al., 2024a) extended to generic activation, see App. H. Points are averaged over 16 data instances. Error bars for MCMC are the standard deviation over instances (omitted for GAMP-RIE, but of the same order). Dashed and dotted lines denote, respectively, the universal (i.e. the $\mathcal{Q}_{W}(\mathsf{v})=0\ \forall\ \mathsf{v}$ solution of the saddle point equations) and the specialisation branches where they are metastable (i.e., a local maximiser of the RS potential but not the global one).
Figure 3 illustrates the learning mechanism for models with Gaussian weights and non-homogeneous readouts, revealing a sequence of phase transitions as $\alpha$ increases. The top panel shows the overlap function $\mathcal{Q}_{W}(\mathsf{v})$ in the case of Gaussian readouts for four different sampling ratios $\alpha$. In the bottom panel the readout assumes four different values with equal probabilities; the figure shows the evolution of the two relevant overlaps associated with the symmetric readout values $\pm 3/\sqrt{5}$ and $\pm 1/\sqrt{5}$. As $\alpha$ increases, the student weights start aligning with the teacher weights associated with the highest readout amplitude, marking the first phase transition. As these alignments strengthen when $\alpha$ further increases, a second transition occurs when the weights corresponding to the next largest readout amplitude are learnt, and so on. In this way, continuous readouts produce an infinite sequence of learning transitions, as supported by the upper part of Figure 3.
Even when dominating the posterior measure, we observe in simulations that the specialisation solution can be algorithmically hard to reach. With a discrete distribution of readouts (such as $P_{v}=\delta_{1}$ or Rademacher), simulations for binary inner weights exhibit it only when sampling with informative initialisation (i.e., the MCMC runs to sample ${\mathbf{W}}$ are initialised in the vicinity of ${\mathbf{W}}^{0}$ ). Moreover, even in cases where algorithms (such as ADAM or HMC for Gaussian inner weights) are able to find the specialisation solution, they manage to do so only after a training time increasing exponentially with $d$ , and for relatively small values of the label noise $\Delta$ , see discussion in App. I. For the case of the continuous distribution of readouts $P_{v}={\mathcal{N}}(0,1)$ , our numerical results are inconclusive on hardness, and deserve numerical investigation at a larger scale.
The universal phase is superseded at $\alpha_{\rm sp}$ by a specialisation phase, where the student’s inner weights start aligning with the teacher ones. This transition occurs for both binary and Gaussian priors over the inner weights, and it is different in nature from the perfect recovery threshold identified in Maillard et al. (2024a), which is the point where the student with Gaussian weights learns perfectly ${\mathbf{W}}^{0\intercal}({\mathbf{v}}){\mathbf{W}}^{0}$ (but not ${\mathbf{W}}^{0}$ ) and thus attains perfect generalisation in the case of purely quadratic activation and noiseless labels. For large $\alpha$ , the student somehow realises that the higher order terms of the activation’s Hermite decomposition are not label noise, but are informative on the decision rule. The two identified phases are akin to those recently described in Barbier et al. (2025) for matrix denoising. The model we consider is also a matrix model in ${\mathbf{W}}$ , with the amount of data scaling as the number of matrix elements. When data are scarce, the student cannot break the numerous symmetries of the problem, resulting in an “effective rotational invariance” at the source of the prior universality, with posterior samples having a vanishing overlap with ${\mathbf{W}}^{0}$ . On the other hand, when data are sufficiently abundant, $\alpha>\alpha_{\rm sp}$ , there is a “synchronisation” of the student’s samples with the teacher.
Figure 3: Top: Theoretical prediction (solid curves) of the overlap function $\mathcal{Q}_{W}(\mathsf{v})$ for different sampling ratios $\alpha$ for Gaussian inner weights, ReLU(x) activation, $d=150,\gamma=0.5$ , linear readout with $\Delta=0.1$ and $P_{v}=\mathcal{N}(0,1)$ . The shaded curves were obtained from HMC initialised informatively. Using a single sample ${\mathbf{W}}^{a}$ from the posterior, $\mathcal{Q}_{W}(\mathsf{v})$ has been evaluated numerically by dividing the interval $[-2,2]$ into 50 bins and by computing the value of the overlap associated with each bin. Each point has been averaged over 50 instances of the training set, and the shaded regions around them correspond to one standard deviation. Bottom: Theoretical prediction (solid curves) of the overlaps as a function of the sampling ratio $\alpha$ for Gaussian inner weights, Tanh(2x) activation, $d=150,\gamma=0.5$ , linear readout with $\Delta=0.1$ and $P_{v}=\frac{1}{4}(\delta_{-3/\sqrt{5}}+\delta_{-1/\sqrt{5}}+\delta_{1/\sqrt{5} }+\delta_{3/\sqrt{5}})$ . The shaded curves were obtained from informed HMC. Each point has been averaged over 10 instances of the training set, with one standard deviation depicted.
The phenomenology observed depends on the chosen activation function. In particular, expanding $\sigma$ in the Hermite basis reveals that the way its first three terms enter information-theoretic quantities is completely described by the order 0, 1 and 2 tensors later defined in (12), which combine the inner and readout weights. In the regime of quadratically many data, the order 0 and 1 tensors are recovered exactly by the student because of the overwhelming abundance of data compared to their dimension. The challenge is thus to learn the second-order tensor. On the contrary, we claim that learning any higher-order tensor can only happen when the student aligns its weights with ${\mathbf{W}}^{0}$ : before this “synchronisation”, these tensors play the role of an effective noise. This is the mechanism behind the specialisation transition. For an odd activation ( ${\rm Tanh}$ in Figure 1, $\sigma_{3}$ in Figure 2), where $\mu_{2}=0$ , the aforementioned order-2 tensor no longer contributes to learning. Indeed, we observe numerically that the generalisation error stays at a constant value for $\alpha<\alpha_{\rm sp}$ , whereas at the phase transition it suddenly drops. This is because the learning of the order-2 tensor is skipped entirely, and the only way to perform better is to learn all the other higher-order tensors through specialisation.
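This Hermite bookkeeping is easy to reproduce numerically. The sketch below (our own illustration; the helper `hermite_coeff` and the number of quadrature nodes are arbitrary choices) computes $\mu_{\ell}=\mathbb{E}[\sigma(Z)\mathrm{He}_{\ell}(Z)]$ by Gauss-Hermite quadrature, confirming that $\mu_{2}$ vanishes for an odd activation such as $\tanh$ but not for ReLU:

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

# Gauss-Hermite quadrature nodes/weights for the standard Gaussian measure
nodes, w = hermegauss(120)
w = w / np.sqrt(2 * np.pi)

def hermite_coeff(sigma, l):
    """mu_l = E[sigma(Z) He_l(Z)], probabilists' Hermite polynomials, Z ~ N(0,1)."""
    c = np.zeros(l + 1)
    c[l] = 1.0
    return float(w @ (sigma(nodes) * hermeval(nodes, c)))

relu = lambda x: np.maximum(x, 0.0)
for name, f in [("tanh", np.tanh), ("relu", relu)]:
    print(name, [round(hermite_coeff(f, l), 4) for l in range(4)])
```

For $\tanh$ all even coefficients vanish by symmetry, while for ReLU one finds $\mu_{2}=1/\sqrt{2\pi}\approx 0.399$, so the order-2 tensor does contribute.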
By extrapolating universality results to generic activations, we are able to use the GAMP-RIE of Maillard et al. (2024a), publicly available at Maillard et al. (2024b), to obtain a polynomial-time predictor for test data. Its generalisation error follows our universal theoretical curve even in the $\alpha$ regime where MCMC sampling experiences a computationally hard phase with worse performance (for binary weights), and in particular after $\alpha_{\rm sp}$ (see Fig. 2, circles). Extending this algorithm, initially proposed for quadratic activation, to a generic one is possible thanks to the identification of an effective GLM onto which the learning problem can be mapped (while the mapping is exact when $\sigma(x)=x^{2}$ as exploited by Maillard et al. (2024a)), see App. H. The key observation is that our effective GLM representation holds not only from a theoretical perspective when describing the universal phase, but also algorithmically.
Finally, we emphasise that our theory is consistent with Cui et al. (2023), which considers the simpler strongly over-parametrised regime $n=\Theta(d)$ rather than the interpolation one $n=\Theta(d^{2})$ : our generalisation curves at $\alpha\to 0$ match theirs at $\alpha_{1}:=n/d\to\infty$ , which is when the student learns perfectly the combinations ${\mathbf{v}}^{0\intercal}{\mathbf{W}}^{0}/\sqrt{k}$ (but nothing more).
## 4 Accessing the free entropy and generalisation error: replica method and spherical integration combined
The goal is to compute the asymptotic free entropy by the replica method Mezard et al. (1986), a powerful heuristic from spin glasses also used in machine learning Engel & Van den Broeck (2001), combined with the HCIZ integral. Our derivation is based on a Gaussian ansatz on the replicated post-activations of the hidden layer, which generalises Conjecture 3.1 of Cui et al. (2023), now proved in Camilli et al. (2025), where it is specialised to the case of linearly many data ( $n=\Theta(d)$ ). To obtain this generalisation, we will write the kernel arising from the covariance of the aforementioned post-activations as an infinite series of scalar order parameters derived from the expansion of the activation function in the Hermite basis, following an approach recently devised in Aguirre-López et al. (2025) in the context of the random features model (see also Hu et al. (2024) and Ghorbani et al. (2021)). Another key ingredient of our analysis will be a generalisation of an ansatz used in the replica method by Sakata & Kabashima (2013) for dictionary learning.
### 4.1 Replicated system and order parameters
The starting point in the replica method to tackle the data average is the replica trick:
$$
\lim\frac{1}{n}\mathbb{E}\ln{\mathcal{Z}}(\mathcal{D})=\lim\,\lim_{s\to 0^{+}}\frac{1}{ns}\ln\mathbb{E}\mathcal{Z}^{s}=\lim_{s\to 0^{+}}\lim\frac{1}{ns}\ln\mathbb{E}\mathcal{Z}^{s}
$$
assuming the limits commute. Recall that ${\mathbf{W}}^{0}$ are the teacher weights. Consider first $s\in\mathbb{N}^{+}$ and define the “replicas” of the post-activation as $\{\lambda^{a}({\mathbf{W}}^{a}):=\frac{1}{\sqrt{k}}{{\mathbf{v}}^{\intercal}} \sigma(\frac{1}{\sqrt{d}}{{\mathbf{W}}^{a}{\mathbf{x}}})\}_{a=0,\ldots,s}$ . We then directly obtain
$$
\mathbb{E}\mathcal{Z}^{s}=\mathbb{E}_{{\mathbf{v}}}\int\prod_{a=0}^{s}dP_{W}({\mathbf{W}}^{a})\Big[\mathbb{E}_{\mathbf{x}}\int dy\prod_{a=0}^{s}P_{\rm out}(y\mid\lambda^{a}({\mathbf{W}}^{a}))\Big]^{n}.
$$
The key is to identify the law of the replicas $\{\lambda^{a}\}_{a=0,\ldots,s}$ , which are dependent random variables due to the common random Gaussian input ${\mathbf{x}}$ , conditionally on $({\mathbf{W}}^{a})$ . Our key hypothesis is that $\{\lambda^{a}\}$ is jointly Gaussian, an ansatz we cannot prove but that we validate a posteriori thanks to the excellent match between our theory and the empirical generalisation curves, see Sec. 2.2. Similar Gaussian assumptions have been the crux of a whole line of recent works on the analysis of neural networks, and are now known under the name of “Gaussian equivalence” (Goldt et al., 2020; Hastie et al., 2022; Mei & Montanari, 2022; Goldt et al., 2022; Hu & Lu, 2023). This can also sometimes be heuristically justified based on Breuer–Major Theorems (Nourdin et al., 2011; Pacelli et al., 2023).
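As a minimal sanity check of this Gaussian ansatz (a sketch with sizes chosen by us, not the simulations of the paper), one can verify that the single-replica post-activation has nearly vanishing excess kurtosis for large $k$ and $d$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 200, 100, 40000            # illustrative sizes (ours, not the paper's)

W = rng.standard_normal((k, d))      # fixed inner weights
v = rng.standard_normal(k)           # fixed readouts
X = rng.standard_normal((n, d))      # fresh Gaussian inputs

# post-activation lambda = v^T sigma(Wx/sqrt(d)) / sqrt(k), here sigma = ReLU
lam = (np.maximum(X @ W.T / np.sqrt(d), 0.0) @ v) / np.sqrt(k)
lam = lam - lam.mean()

# excess kurtosis should vanish as k, d grow if lambda is asymptotically Gaussian
ex_kurt = np.mean(lam**4) / np.mean(lam**2) ** 2 - 3.0
print("excess kurtosis:", ex_kurt)
```

The residual value is of order $1/k$ plus Monte Carlo noise, consistent with the sum over $k$ weakly correlated neurons behaving like a central limit theorem.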
Given two replica indices $a,b\in\{0,\ldots,s\}$ we define the neuron-neuron overlap matrix $\Omega^{ab}_{ij}:={{\mathbf{W}}_{i}^{a\intercal}{\mathbf{W}}^{b}_{j}}/d$ with $i,j\in[k]$ . Recalling the Hermite expansion of $\sigma$ , by using Mehler’s formula, see App. A, the post-activations covariance $K^{ab}:=\mathbb{E}\lambda^{a}\lambda^{b}$ reads
$$
K^{ab}=\sum_{\ell\geq 1}\frac{\mu^{2}_{\ell}}{\ell!}Q_{\ell}^{ab}\ \ \text{with}\ \ Q_{\ell}^{ab}:=\frac{1}{k}\sum_{i,j\leq k}v_{i}v_{j}(\Omega^{ab}_{ij})^{\ell}. \tag{8}
$$
This covariance ${\mathbf{K}}$ is complicated but, as we argue below, simplifications occur as $d\to\infty$ . In particular, the first two overlaps $Q_{1}^{ab},Q_{2}^{ab}$ play a special role. We claim that the higher-order overlaps $(Q_{\ell}^{ab})_{\ell\geq 3}$ can be expressed in terms of simpler order parameters.
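Equation (8) itself can be checked directly by Monte Carlo for a single replica ($a=b$), where $K^{aa}$ is the variance of the post-activation. The sketch below (sizes, the $\tanh$ activation, and the series truncation are our illustrative choices) compares the truncated Hermite series evaluated on the empirical overlap matrix with an empirical average over fresh inputs:

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermegauss, hermeval

rng = np.random.default_rng(0)
d, k, L = 400, 200, 9                     # illustrative sizes; truncate series at L
sigma = np.tanh                           # odd activation: mu_0 = 0, so E[lambda] = 0

# Hermite coefficients mu_l = E[sigma(Z) He_l(Z)] by Gauss-Hermite quadrature
nodes, w = hermegauss(120)
w = w / np.sqrt(2 * np.pi)
mu = np.array([w @ (sigma(nodes) * hermeval(nodes, np.eye(L + 1)[l]))
               for l in range(L + 1)])

W = rng.standard_normal((k, d))
v = rng.standard_normal(k)
Omega = W @ W.T / d                       # same-replica overlap matrix Omega^{aa}

# theory, eq. (8): K^{aa} = sum_{l>=1} mu_l^2/l! * Q_l^{aa}
K_theory = sum(mu[l] ** 2 / factorial(l) * (v @ Omega**l @ v) / k
               for l in range(1, L + 1))

# Monte Carlo estimate of E[(lambda^a)^2] over fresh Gaussian inputs x
n_mc = 20000
X = rng.standard_normal((n_mc, d))
lam = (sigma(X @ W.T / np.sqrt(d)) @ v) / np.sqrt(k)
K_mc = np.mean(lam**2)
print(K_theory, K_mc)
```

The two estimates agree up to Monte Carlo error and $O(1/\sqrt{d})$ finite-size corrections.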
Figure 4: Hamiltonian Monte Carlo dynamics of the overlaps $Q_{\ell}=Q_{\ell}^{01}$ between student and teacher weights for $\ell\in[5]$ , with activation function ReLU(x), $d=200$ , $\gamma=0.5$ , linear readout with $\Delta=0.1$ and two choices of sampling ratio and readout distribution: $\alpha=1.0$ with $P_{v}=\delta_{1}$ (Left) and $\alpha=3.0$ with $P_{v}=\mathcal{N}(0,1)$ (Right). The teacher weights ${\mathbf{W}}^{0}$ are Gaussian. The dynamics is initialised informatively, i.e., on ${\mathbf{W}}^{0}$ . The overlap $Q_{1}$ always fluctuates around 1. Left: the overlaps $Q_{\ell}$ for $\ell\geq 3$ converge to 0 at equilibrium, while $Q_{2}$ is well estimated by the theory (orange dashed line). Right: at the higher sampling ratio $\alpha$ , the $Q_{\ell}$ for $\ell\geq 3$ are also non-zero and agree with their theoretical predictions (dashed lines). Insets show the mean-square generalisation error and its theoretical prediction.
### 4.2 Simplifying the order parameters
In this section we show how to drastically reduce the number of order parameters to track. Assume for the moment that the readout prior $P_{v}$ has discrete support $\mathsf{V}=\{\mathsf{v}\}$ ; this can be relaxed by binning a continuous support, as mentioned in Sec. 2.2. The overlaps in (8) can be written as
$$
Q_{\ell}^{ab}=\frac{1}{k}\sum_{\mathsf{v},\mathsf{v}^{\prime}\in\mathsf{V}}\mathsf{v}\,\mathsf{v}^{\prime}\sum_{\{i,j\leq k\,\mid\,v_{i}=\mathsf{v},\,v_{j}=\mathsf{v}^{\prime}\}}(\Omega_{ij}^{ab})^{\ell}. \tag{9}
$$
In the following, for $\ell\geq 3$ we discard the terms with $\mathsf{v}\neq\mathsf{v}^{\prime}$ in the above sum, assuming they are suppressed w.r.t. the diagonal ones. In other words, a neuron ${\mathbf{W}}^{a}_{i}$ of a student (replica) with readout value $v_{i}=\mathsf{v}$ is assumed to possibly align only with neurons of the teacher (or, by Bayes-optimality, of another replica) with the same readout. Moreover, in the resulting sum over the neuron indices $\{i,j\mid v_{i}=v_{j}=\mathsf{v}\}$ , we assume that for each $i$ a single index $j=\pi_{i}$ , with $\pi$ a permutation, contributes at leading order. Since the model is symmetric under permutations of hidden neurons, we can take $\pi$ to be the identity without loss of generality.
We now assume that for Hadamard powers $\ell\geq 3$ , the off-diagonal part of the overlap $({\bm{\Omega}}^{ab})^{\circ\ell}$ , obtained from typical weight matrices sampled from the posterior, is small enough for the matrix to be treated as diagonal in any quadratic form. Moreover, by exchangeability among neurons sharing the same readout value, we further assume that all diagonal elements $\{\Omega_{ii}^{ab}\mid i\in\mathcal{I}_{\mathsf{v}}\}$ concentrate onto the constant $\mathcal{Q}_{W}^{ab}(\mathsf{v})$ , where $\mathcal{I}_{\mathsf{v}}:=\{i\leq k\mid v_{i}=\mathsf{v}\}$ :
$$
(\Omega_{ij}^{ab})^{\ell}=\Big(\frac{1}{d}{\mathbf{W}}_{i}^{a\intercal}{\mathbf{W}}^{b}_{j}\Big)^{\ell}\approx\delta_{ij}\,\mathcal{Q}_{W}^{ab}(\mathsf{v})^{\ell} \tag{10}
$$
if $\ell\geq 3$ and $i$ or $j\in\mathcal{I}_{\mathsf{v}}$ . Approximate equality here holds up to a matrix of $o_{d}(1)$ norm. The same happens, e.g., for a standard Wishart matrix: the eigenvectors of the matrix itself and of its second Hadamard power are delocalised, while those of its Hadamard powers of order $\ell\geq 3$ are strongly localised; this is why $Q_{2}^{ab}$ will require a separate treatment. With these simplifications we can write
$$
Q_{\ell}^{ab}=\mathbb{E}_{v\sim P_{v}}[v^{2}\mathcal{Q}_{W}^{ab}(v)^{\ell}]+o_{d}(1)\quad\text{for}\ \ell\geq 3. \tag{11}
$$
This is verified numerically a posteriori as follows. Identity (11) is true (without the $o_{d}(1)$ ) for the predicted theoretical values of the order parameters by construction of our theory. Fig. 3 verifies the good agreement between the theoretical and experimental overlap profiles $\mathcal{Q}^{01}_{W}(\mathsf{v})$ for all $\mathsf{v}\in\mathsf{V}$ (which is statistically the same as $\smash{\mathcal{Q}^{ab}_{W}(\mathsf{v})}$ for any $a\neq b$ by the so-called Nishimori identity following from Bayes-optimality, see App. B), while Fig. 4 verifies the agreement at the level of the $(Q_{\ell}^{ab})$ . Consequently, (11) also holds for the experimental overlaps.
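Identity (11) can also be probed in a toy setting where the overlap profile is planted by hand (all numbers below, including the support $\mathsf{V}$ and the profile values, are illustrative choices of ours): drawing a student correlated neuron-by-neuron with the teacher at strength $\mathcal{Q}_{W}(\mathsf{v})$, the full double sum (9) is dominated by its diagonal for $\ell\geq 3$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 500, 250                          # illustrative sizes

# planted per-readout overlap profile Q_W(v) (values chosen by us)
v = rng.choice([1.0, 2.0], size=k)       # discrete readout support V = {1, 2}
q = np.where(v == 1.0, 0.3, 0.8)         # Q_W(1) = 0.3, Q_W(2) = 0.8

# teacher and correlated student: W^b_i = q_i W^a_i + sqrt(1 - q_i^2) * noise
Wa = rng.standard_normal((k, d))
Wb = q[:, None] * Wa + np.sqrt(1 - q[:, None] ** 2) * rng.standard_normal((k, d))
Omega = Wa @ Wb.T / d

for l in [3, 4, 5]:
    Q_full = (v @ Omega**l @ v) / k      # definition (9): all neuron pairs
    Q_diag = np.mean(v**2 * q**l)        # diagonal prediction (11)
    print(l, round(Q_full, 3), round(Q_diag, 3))
```

The off-diagonal entries of ${\bm\Omega}$ are $O(1/\sqrt{d})$, so their $\ell$-th powers are negligible in the sum for $\ell\geq 3$, in line with the discussion above.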
It is convenient to define the symmetric tensors ${\mathbf{S}}_{\ell}^{a}$ with entries
$$
S^{a}_{\ell;\alpha_{1}\ldots\alpha_{\ell}}:=\frac{1}{\sqrt{k}}\sum_{i\leq k}v_{i}W^{a}_{i\alpha_{1}}\cdots W^{a}_{i\alpha_{\ell}}. \tag{12}
$$
Indeed, the generic $\ell$ -th term of the series (8) can be written as the overlap $Q^{ab}_{\ell}=\langle{\mathbf{S}}^{a}_{\ell},{\mathbf{S}}^{b}_{\ell}\rangle/d^ {\ell}$ of these tensors (where $\langle\,,\,\rangle$ is the inner product), e.g., $Q_{2}^{ab}={\rm Tr}\,{\mathbf{S}}_{2}^{a}{\mathbf{S}}_{2}^{b}/d^{2}$ . Since the number of data points is $n=\Theta(d^{2})$ while the $({\mathbf{S}}_{1}^{a})$ are only $d$ -dimensional, these tensors are reconstructed perfectly (the same argument was used to argue that the readouts ${\mathbf{v}}$ can be quenched). We thus assume right away that at equilibrium the overlaps $Q_{1}^{ab}=1$ (or saturate to their maximum value; if tracked, the corresponding saddle point equations turn out to be trivial and do fix this). In other words, in the quadratic data regime the $\mu_{1}$ contribution to the Hermite decomposition of $\sigma$ is perfectly learnable, while higher-order ones play a non-trivial role. In contrast, Cui et al. (2023) study the regime $n=\Theta(d)$ , where $\mu_{1}$ is the only learnable term.
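The identification $Q_{\ell}^{ab}=\langle{\mathbf{S}}_{\ell}^{a},{\mathbf{S}}_{\ell}^{b}\rangle/d^{\ell}$ is an exact finite-$d$ algebraic identity, which the following sketch (with arbitrary sizes of our choosing) verifies for $\ell=1,2$:

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 60, 30                                  # arbitrary finite sizes
v = rng.standard_normal(k)
Wa = rng.standard_normal((k, d))
Wb = rng.standard_normal((k, d))
Omega = Wa @ Wb.T / d                          # overlap matrix Omega^{ab}

# ell = 1: S_1 = W^T v / sqrt(k), and <S_1^a, S_1^b>/d = Q_1^{ab}
S1a, S1b = Wa.T @ v / np.sqrt(k), Wb.T @ v / np.sqrt(k)
assert np.isclose(S1a @ S1b / d, (v @ Omega @ v) / k)

# ell = 2: S_2 = W^T diag(v) W / sqrt(k), and Tr(S_2^a S_2^b)/d^2 = Q_2^{ab}
S2a = Wa.T @ (v[:, None] * Wa) / np.sqrt(k)
S2b = Wb.T @ (v[:, None] * Wb) / np.sqrt(k)
assert np.isclose(np.trace(S2a @ S2b) / d**2, (v @ Omega**2 @ v) / k)
print("identities verified")
```

No asymptotics are involved here: the identity holds for any $d,k$, which is why the tensors $({\mathbf{S}}_{\ell}^{a})$ can replace the weights as degrees of freedom.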
Then, the average replicated partition function reads $\mathbb{E}\mathcal{Z}^{s}=\int d{\mathbf{Q}}_{2}d\bm{\mathcal{Q}}_{W}\exp(F_{S }\!+\!nF_{E})$ where $F_{E},F_{S}$ depend on ${\mathbf{Q}}_{2}=(Q_{2}^{ab})$ and $\bm{\mathcal{Q}}_{W}:=\{\mathcal{Q}_{W}^{ab}\mid a\leq b\}$ , where $\mathcal{Q}_{W}^{ab}:=\{\mathcal{Q}_{W}^{ab}(\mathsf{v})\mid\mathsf{v}\in \mathsf{V}\}$ .
The “energetic potential” is defined as
$$
e^{nF_{E}}:=\Big(\int dy\,d{\bm{\lambda}}\,\frac{\exp(-\frac{1}{2}{\bm{\lambda}}^{\intercal}{\mathbf{K}}^{-1}{\bm{\lambda}})}{((2\pi)^{s+1}\det{\mathbf{K}})^{1/2}}\prod_{a=0}^{s}P_{\rm out}(y\mid\lambda^{a})\Big)^{n}. \tag{13}
$$
It takes this form due to our Gaussian assumption on the replicated post-activations and is thus easily computed, see App. D.1.
The “entropic potential” $F_{S}$ taking into account the degeneracy of the order parameters is obtained by averaging delta functions fixing their definitions w.r.t. the “microscopic degrees of freedom” $({\mathbf{W}}^{a})$ . It can be written compactly using the following conditional law over the tensors $({\mathbf{S}}_{2}^{a})$ :
$$
\begin{aligned}
P(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W}):=V_{W}^{kd}(\bm{\mathcal{Q}}_{W})^{-1}\int\prod_{a=0}^{s}dP_{W}({\mathbf{W}}^{a})&\prod_{0\leq a\leq b\leq s}\prod_{\mathsf{v}\in\mathsf{V}}\prod_{i\in\mathcal{I}_{\mathsf{v}}}\delta(d\,\mathcal{Q}_{W}^{ab}(\mathsf{v})-{{\mathbf{W}}^{a\intercal}_{i}{\mathbf{W}}_{i}^{b}})\\
&\times\prod_{a=0}^{s}\delta({\mathbf{S}}^{a}_{2}-{\mathbf{W}}^{a\intercal}({\mathbf{v}}){\mathbf{W}}^{a}/\sqrt{k}),
\end{aligned} \tag{14}
$$
with the normalisation
$$
V_{W}^{kd}(\bm{\mathcal{Q}}_{W}):=\int\prod_{a=0}^{s}dP_{W}({\mathbf{W}}^{a})\prod_{0\leq a\leq b\leq s}\prod_{\mathsf{v}\in\mathsf{V}}\prod_{i\in\mathcal{I}_{\mathsf{v}}}\delta(d\,\mathcal{Q}_{W}^{ab}(\mathsf{v})-{{\mathbf{W}}^{a\intercal}_{i}{\mathbf{W}}_{i}^{b}}).
$$
The entropy, which is the challenging term to compute, then reads
$$
e^{F_{S}}:=V_{W}^{kd}(\bm{\mathcal{Q}}_{W})\int dP(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W})\prod_{0\leq a\leq b\leq s}\delta(d^{2}Q_{2}^{ab}-{\rm Tr}\,{\mathbf{S}}_{2}^{a}{\mathbf{S}}_{2}^{b}).
$$
### 4.3 Tackling the entropy: measure simplification by moment matching
The delta functions above fixing $Q_{2}^{ab}$ induce quartic constraints among the weight degrees of freedom $(W_{i\alpha}^{a})$, instead of quadratic ones as in standard settings. A direct computation thus seems out of reach. However, we will exploit the fact that the constraints are quadratic in the matrices $({\mathbf{S}}_{2}^{a})$. Consequently, shifting our focus towards $({\mathbf{S}}_{2}^{a})$ as the basic degrees of freedom to integrate, rather than $(W_{i\alpha}^{a})$, will allow us to move forward by simplifying their measure (14). Note that while the $(W_{i\alpha}^{a})$ are i.i.d. under their prior measure, the $({\mathbf{S}}_{2}^{a})$ have coupled entries, even for a fixed replica index $a$. This can be taken into account as follows.
Define $P_{S}$ as the probability density of a generalised Wishart random matrix, i.e., of $\tilde{\mathbf{W}}^{\intercal}({\mathbf{v}})\tilde{\mathbf{W}}/\sqrt{k}$ where $\tilde{\mathbf{W}}\in\mathbb{R}^{k\times d}$ is made of i.i.d. standard Gaussian entries. The simplification we consider consists in replacing (14) by the effective measure
$$
\textstyle\tilde{P}(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W}):=\frac{1}{\tilde{V}_{W}^{kd}}\prod_{a}^{0,s}P_{S}({\mathbf{S}}_{2}^{a})\prod_{a<b}^{0,s}e^{\frac{1}{2}\tau(\mathcal{Q}_{W}^{ab}){\rm Tr}\,{\mathbf{S}}^{a}_{2}{\mathbf{S}}^{b}_{2}}, \tag{15}
$$
where $\tilde{V}_{W}^{kd}=\tilde{V}_{W}^{kd}(\bm{\mathcal{Q}}_{W})$ is the proper normalisation constant, and
$$
\textstyle\tau(\mathcal{Q}_{W}^{ab}):=\text{mmse}_{S}^{-1}\big(1-\mathbb{E}_{v\sim P_{v}}[v^{2}\mathcal{Q}^{ab}_{W}(v)^{2}]\big). \tag{16}
$$
The rationale behind this choice goes as follows. The matrices $({\mathbf{S}}_{2}^{a})$ are, under the measure (14), $(i)$ generalised Wishart matrices, constructed from $(ii)$ non-Gaussian factors $({\mathbf{W}}^{a})$ which $(iii)$ are coupled between different replicas, thus inducing a coupling among the replicas $({\mathbf{S}}_{2}^{a})$. The proposed simplified measure captures all three aspects while remaining tractable, as we explain now. The first assumption is that in the measure (14) the details of the (centred, unit-variance) prior $P_{W}$ enter only through $\bm{\mathcal{Q}}_{W}$ at leading order. Due to the conditioning, we can thus relax it to a Gaussian (with the same first two moments) by universality, as is often the case in random matrix theory. $P_{W}$ will instead explicitly enter the entropy of $\bm{\mathcal{Q}}_{W}$ related to $V_{W}^{kd}$. Point $(ii)$ is thus taken care of by the conditioning. Then, the generalised Wishart prior $P_{S}$ encodes $(i)$ and, finally, the exponential tilt in $\tilde{P}$ induces the replica couplings of point $(iii)$. It remains to capture the dependence of measure (14) on $\bm{\mathcal{Q}}_{W}$. This is done by realising that
$$
\textstyle\int dP(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W})\,\frac{1}{d^{2}}{\rm Tr}\,{\mathbf{S}}_{2}^{a}{\mathbf{S}}_{2}^{b}=\mathbb{E}_{v\sim P_{v}}[v^{2}\mathcal{Q}_{W}^{ab}(v)^{2}]+\gamma\bar{v}^{2}.
$$
This identity is shown in App. D.2. The Lagrange multiplier $\tau(\mathcal{Q}_{W}^{ab})$ to plug into $\tilde{P}$, enforcing this moment-matching condition between the true and simplified measures as $s\to 0^{+}$, is (16); see App. D.3. For completeness, we provide in App. E alternatives to the simplification (15), whose analysis is left for future work.
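To build intuition, the diagonal case $a=b$ of this moment-matching identity can be checked by direct simulation: with a unit-variance prior, $\mathcal{Q}_{W}^{aa}(\mathsf{v})=1$, so the right-hand side reduces to $\mathbb{E}[v^{2}]+\gamma\bar{v}^{2}$. The sketch below is illustrative only: it assumes Gaussian weights (invoking universality), reads $({\mathbf{v}})$ as the diagonal matrix of readout weights, takes $\gamma=k/d$, and uses an arbitrary toy prior for $v$ and arbitrary sizes.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 100, 200                    # illustrative sizes; gamma = k/d
gamma = k / d
v = rng.normal(0.5, 1.0, size=k)   # toy readout weights with nonzero mean
m2, vbar = np.mean(v**2), np.mean(v)

# Monte Carlo estimate of (1/d^2) E Tr S^2 with S = W^T diag(v) W / sqrt(k)
n_trials = 100
acc = 0.0
for _ in range(n_trials):
    W = rng.standard_normal((k, d))          # i.i.d. Gaussian factors
    S = W.T @ (v[:, None] * W) / np.sqrt(k)  # generalised Wishart sample
    acc += np.sum(S * S) / d**2              # Tr S^2 for symmetric S
lhs = acc / n_trials

rhs = m2 + gamma * vbar**2                   # E[v^2] + gamma * vbar^2
print(lhs, rhs)                              # agree up to O(1/d) corrections
```

The two values match up to $O(1/d)$ finite-size corrections and Monte Carlo noise, consistent with the displayed identity.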
### 4.4 Final steps and spherical integration
Combining all our findings, the average replicated partition function is simplified as
$$
\textstyle\mathbb{E}\mathcal{Z}^{s}=\int d{\mathbf{Q}}_{2}\,d\bm{\mathcal{Q}}_{W}\,e^{nF_{E}+kd\ln V_{W}(\bm{\mathcal{Q}}_{W})-kd\ln\tilde{V}_{W}(\bm{\mathcal{Q}}_{W})}.
$$
The equality should be interpreted as holding at leading exponential order $\exp(\Theta(n))$ , assuming the validity of our previous measure simplification. All remaining steps but the last are standard:
$(i)$ Express the delta functions fixing $\bm{\mathcal{Q}}_{W}$ and ${\mathbf{Q}}_{2}$ in exponential form using their Fourier representation; this introduces additional Fourier conjugate order parameters $\hat{\mathbf{Q}}_{2},\hat{\bm{\mathcal{Q}}}_{W}$ of the same dimensions.
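Schematically, step $(i)$ uses the standard integral representation of the Dirac delta, applied to each scalar constraint, e.g.,
$$
\textstyle\delta\big(d\,\mathcal{Q}_{W}^{ab}(\mathsf{v})-{\mathbf{W}}^{a\intercal}_{i}{\mathbf{W}}_{i}^{b}\big)=\int\frac{d\hat{\mathcal{Q}}_{W}^{ab}(\mathsf{v})}{2\pi}\,\exp\Big(i\,\hat{\mathcal{Q}}_{W}^{ab}(\mathsf{v})\big(d\,\mathcal{Q}_{W}^{ab}(\mathsf{v})-{\mathbf{W}}^{a\intercal}_{i}{\mathbf{W}}_{i}^{b}\big)\Big),
$$
after which the conjugate variables are treated as order parameters on the same footing as $\bm{\mathcal{Q}}_{W}$ and ${\mathbf{Q}}_{2}$ and eventually fixed by saddle point.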
$(ii)$ Once this is done, the terms coupling different replicas of $({\mathbf{W}}^{a})$ or of $({\mathbf{S}}^{a})$ are all quadratic. Using the Hubbard–Stratonovich transformation (i.e., $\mathbb{E}_{{\mathbf{Z}}}\exp(\frac{d}{2}{\rm Tr}\,{\mathbf{M}}{\mathbf{Z}})= \exp(\frac{d}{4}{\rm Tr}\,{\mathbf{M}}^{2})$ for a $d\times d$ symmetric matrix ${\mathbf{M}}$ with ${\mathbf{Z}}$ a standard GOE matrix) therefore allows us to linearise all replica-replica coupling terms, at the price of introducing new Gaussian fields interacting with all replicas.
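The Gaussian identity used in step $(ii)$ can be verified by direct Monte Carlo. Here ${\mathbf{Z}}$ is a standard GOE matrix with off-diagonal variance $1/d$ and diagonal variance $2/d$ (the normalisation for which the stated identity holds); the symmetric test matrix ${\mathbf{M}}$ and all sizes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
A = rng.standard_normal((d, d))
M = 0.1 * (A + A.T)                     # small symmetric test matrix

# standard GOE samples: Z_ij ~ N(0, 1/d) off-diagonal, Z_ii ~ N(0, 2/d)
n = 200_000
G = rng.standard_normal((n, d, d)) / np.sqrt(d)
Z = (G + np.transpose(G, (0, 2, 1))) / np.sqrt(2)

# left-hand side: E_Z exp((d/2) Tr M Z); right-hand side: exp((d/4) Tr M^2)
lhs = np.mean(np.exp(0.5 * d * np.einsum('ij,nij->n', M, Z)))
rhs = np.exp(0.25 * d * np.trace(M @ M))
print(lhs, rhs)                         # agree up to Monte Carlo error
```

Keeping ${\mathbf{M}}$ small makes the lognormal average well concentrated, so a few hundred thousand samples suffice for percent-level agreement.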
$(iii)$ After these manipulations, we identify at leading exponential order an effective action $\mathcal{S}$ depending on the order parameters only, which allows a saddle point integration w.r.t. them as $n\to\infty$ :
$$
\textstyle\lim\frac{1}{ns}\ln\mathbb{E}\mathcal{Z}^{s}=\lim\frac{1}{ns}\ln\int d{\mathbf{Q}}_{2}\,d\hat{\mathbf{Q}}_{2}\,d\bm{\mathcal{Q}}_{W}\,d\hat{\bm{\mathcal{Q}}}_{W}\,e^{n\mathcal{S}}=\frac{1}{s}\,{\rm extr}\,\mathcal{S}.
$$
$(iv)$ Next, the replica limit $s\to 0^{+}$ of the previously obtained expression has to be considered. To do so, we make a replica symmetric assumption, i.e., we consider that at the saddle point, all order parameters entering the action $\mathcal{S}$ , and thus $K^{ab}$ too, take a simple form of the type $R^{ab}=r\delta_{ab}+q(1-\delta_{ab})$ . Replica symmetry is rigorously known to be correct in general settings of Bayes-optimal learning and is thus justified here, see Barbier & Panchenko (2022); Barbier & Macris (2019).
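Under this ansatz the $(s+1)\times(s+1)$ overlap matrices have only two distinct eigenvalues, $r-q$ with multiplicity $s$ and $r+sq$ with multiplicity one, which is what makes quantities such as $\det{\mathbf{K}}$ and ${\mathbf{K}}^{-1}$ in (13) explicit functions of $s$ that can be continued to $s\to 0^{+}$. A minimal numerical illustration (the values of $r$, $q$ and $s$ are arbitrary):

```python
import numpy as np

s = 3                      # replicas a = 1..s on top of a = 0: (s+1) x (s+1) matrix
r, q = 1.0, 0.4            # illustrative diagonal and off-diagonal overlaps
R = q * np.ones((s + 1, s + 1)) + (r - q) * np.eye(s + 1)

evals = np.sort(np.linalg.eigvalsh(R))
print(evals)               # r - q repeated s times, then r + s*q
```

With $r=1$, $q=0.4$, $s=3$ the spectrum is $\{0.6,0.6,0.6,2.2\}$, i.e., $r-q$ three times and $r+sq$ once.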
$(v)$ After all these steps, the resulting expression still includes two high-dimensional integrals related to the matrices ${\mathbf{S}}_{2}$. They can be recognised as the free entropies associated with the Bayes-optimal denoising of a generalised Wishart matrix, as described just above Result 2.1, for two different signal-to-noise ratios. The last step consists in evaluating these integrals using the HCIZ integral, whose form is tractable in this case, see Maillard et al. (2022); Pourkamali et al. (2024). These free entropies yield the last two terms $\iota(\,\cdot\,)$ in $f_{\rm RS}^{\alpha,\gamma}$, (6).
The complete derivation is in App. D and gives Result 2.1. From the physical meaning of the order parameters, this analysis also yields the post-activations covariance ${\mathbf{K}}$ and thus Result 2.2.
As a final remark, we emphasise a key difference between our approach and earlier works on extensive-rank systems. If, instead of taking the generalised Wishart $P_{S}$ as the base measure over the matrices $({\mathbf{S}}_{2}^{a})$ in the simplified $\tilde{P}$ with moment matching, one takes a factorised Gaussian measure, thus entirely discarding the dependencies among the entries of ${\mathbf{S}}_{2}^{a}$, one recovers the Sakata–Kabashima replica method Sakata & Kabashima (2013). Our ansatz thus captures important correlations that were neglected in Sakata & Kabashima (2013); Krzakala et al. (2013); Kabashima et al. (2016); Barbier et al. (2025) in the context of extensive-rank matrix inference. For completeness, we show in App. E that our ansatz indeed greatly improves the prediction compared to these earlier approaches.
## 5 Conclusion and perspectives
We have provided an effective, quantitatively accurate description of the optimal generalisation capability of a fully-trained two-layer neural network of extensive width with generic activation, when the sample size scales with the number of parameters. This setting has long resisted the mean-field approaches used, e.g., for committee machines Barkai et al. (1992); Engel et al. (1992); Schwarze & Hertz (1992; 1993); Mato & Parga (1992); Monasson & Zecchina (1995); Aubin et al. (2018); Baldassi et al. (2019).
A natural extension is to consider non-Bayes-optimal models, e.g., trained through empirical risk minimisation to learn a mismatched target function. The formalism we provide here can be extended to these cases by keeping track of additional order parameters. The extension to deeper architectures is also possible, in the vein of Cui et al. (2023); Pacelli et al. (2023), who analysed the over-parametrised proportional regime. Accounting for structured inputs is another direction: data with a covariance (Monasson, 1992; Loureiro et al., 2021a), mixture models (Del Giudice, P. et al., 1989; Loureiro et al., 2021b), hidden manifolds (Goldt et al., 2020), object manifolds and simplexes (Chung et al., 2018; Rotondo et al., 2020), etc.
Phase transitions in supervised learning are known in the statistical mechanics literature at least since Györgyi (1990), when the theory was limited to linear models. It would be interesting to connect the picture we have drawn here with Grokking, a sudden drop in generalisation error occurring during the training of neural nets close to interpolation, see Power et al. (2022); Rubin et al. (2024b).
A more systematic analysis of the computational hardness of the problem (as carried out for multi-index models in Troiani et al. (2025)) is an important step towards a full characterisation of the class of target functions that are fundamentally hard to learn.
A key novelty of our approach is to blend matrix models and spin glass techniques in a unified formalism. A limitation is then the restricted class of solvable matrix models (see Kazakov (2000); Anninos & Mühlmann (2020) for a list). Indeed, as explained in App. E, possible improvements to our approach need additional, finer order parameters than those appearing in Results 2.1, 2.2 (at least for inhomogeneous readouts ${\mathbf{v}}$). Taking them into account yields, when computing their entropy, matrix models which, to the best of our knowledge, are not currently solvable. We believe that obtaining asymptotically exact formulas for the log-partition function and generalisation error in the current setting and its relatives will require a major breakthrough in the field of multi-matrix models. This is an exciting direction to pursue at the crossroads of the fields of matrix models and high-dimensional inference and learning of extensive-rank matrices.
## Software and data
Experiments with Adam/HMC were performed through standard implementations in PyTorch/TensorFlow/NumPyro; the Metropolis–Hastings and GAMP-RIE routines were coded from scratch (the latter inspired by Maillard et al. (2024a)). GitHub repository to reproduce the results: https://github.com/Minh-Toan/extensive-width-NN
## Acknowledgements
J.B., F.C., M.-T.N. and M.P. were funded by the European Union (ERC, CHORAL, project number 101039794). Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them. M.P. thanks Vittorio Erba and Pietro Rotondo for interesting discussions and suggestions.
## References
- Aguirre-López et al. (2025) Aguirre-López, F., Franz, S., and Pastore, M. Random features and polynomial rules. SciPost Phys., 18:039, 2025. 10.21468/SciPostPhys.18.1.039. URL https://scipost.org/10.21468/SciPostPhys.18.1.039.
- Aiudi et al. (2025) Aiudi, R., Pacelli, R., Baglioni, P., Vezzani, A., Burioni, R., and Rotondo, P. Local kernel renormalization as a mechanism for feature learning in overparametrized convolutional neural networks. Nature Communications, 16(1):568, Jan 2025. ISSN 2041-1723. 10.1038/s41467-024-55229-3. URL https://doi.org/10.1038/s41467-024-55229-3.
- Anninos & Mühlmann (2020) Anninos, D. and Mühlmann, B. Notes on matrix models (matrix musings). Journal of Statistical Mechanics: Theory and Experiment, 2020(8):083109, aug 2020. 10.1088/1742-5468/aba499. URL https://dx.doi.org/10.1088/1742-5468/aba499.
- Arjevani et al. (2025) Arjevani, Y., Bruna, J., Kileel, J., Polak, E., and Trager, M. Geometry and optimization of shallow polynomial networks, 2025. URL https://arxiv.org/abs/2501.06074.
- Aubin et al. (2018) Aubin, B., Maillard, A., Barbier, J., Krzakala, F., Macris, N., and Zdeborová, L. The committee machine: Computational to statistical gaps in learning a two-layers neural network. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/84f0f20482cde7e5eacaf7364a643d33-Paper.pdf.
- Baglioni et al. (2024) Baglioni, P., Pacelli, R., Aiudi, R., Di Renzo, F., Vezzani, A., Burioni, R., and Rotondo, P. Predictive power of a Bayesian effective action for fully connected one hidden layer neural networks in the proportional limit. Phys. Rev. Lett., 133:027301, Jul 2024. 10.1103/PhysRevLett.133.027301. URL https://link.aps.org/doi/10.1103/PhysRevLett.133.027301.
- Baldassi et al. (2019) Baldassi, C., Malatesta, E. M., and Zecchina, R. Properties of the geometry of solutions and capacity of multilayer neural networks with rectified linear unit activations. Phys. Rev. Lett., 123:170602, Oct 2019. 10.1103/PhysRevLett.123.170602. URL https://link.aps.org/doi/10.1103/PhysRevLett.123.170602.
- Barbier (2020) Barbier, J. Overlap matrix concentration in optimal Bayesian inference. Information and Inference: A Journal of the IMA, 10(2):597–623, 05 2020. ISSN 2049-8772. 10.1093/imaiai/iaaa008. URL https://doi.org/10.1093/imaiai/iaaa008.
- Barbier & Macris (2019) Barbier, J. and Macris, N. The adaptive interpolation method: a simple scheme to prove replica formulas in Bayesian inference. Probability Theory and Related Fields, 174(3):1133–1185, Aug 2019. ISSN 1432-2064. 10.1007/s00440-018-0879-0. URL https://doi.org/10.1007/s00440-018-0879-0.
- Barbier & Macris (2022) Barbier, J. and Macris, N. Statistical limits of dictionary learning: Random matrix theory and the spectral replica method. Phys. Rev. E, 106:024136, Aug 2022. 10.1103/PhysRevE.106.024136. URL https://link.aps.org/doi/10.1103/PhysRevE.106.024136.
- Barbier & Panchenko (2022) Barbier, J. and Panchenko, D. Strong replica symmetry in high-dimensional optimal Bayesian inference. Communications in Mathematical Physics, 393(3):1199–1239, Aug 2022. ISSN 1432-0916. 10.1007/s00220-022-04387-w. URL https://doi.org/10.1007/s00220-022-04387-w.
- Barbier et al. (2019) Barbier, J., Krzakala, F., Macris, N., Miolane, L., and Zdeborová, L. Optimal errors and phase transitions in high-dimensional generalized linear models. Proceedings of the National Academy of Sciences, 116(12):5451–5460, 2019. 10.1073/pnas.1802705116. URL https://www.pnas.org/doi/abs/10.1073/pnas.1802705116.
- Barbier et al. (2025) Barbier, J., Camilli, F., Ko, J., and Okajima, K. Phase diagram of extensive-rank symmetric matrix denoising beyond rotational invariance. Physical Review X, 2025.
- Barkai et al. (1992) Barkai, E., Hansel, D., and Sompolinsky, H. Broken symmetries in multilayered perceptrons. Phys. Rev. A, 45:4146–4161, Mar 1992. 10.1103/PhysRevA.45.4146. URL https://link.aps.org/doi/10.1103/PhysRevA.45.4146.
- Bartlett et al. (2021) Bartlett, P. L., Montanari, A., and Rakhlin, A. Deep learning: a statistical viewpoint. Acta Numerica, 30:87–201, 2021. 10.1017/S0962492921000027. URL https://doi.org/10.1017/S0962492921000027.
- Bassetti et al. (2024) Bassetti, F., Gherardi, M., Ingrosso, A., Pastore, M., and Rotondo, P. Feature learning in finite-width Bayesian deep linear networks with multiple outputs and convolutional layers, 2024. URL https://arxiv.org/abs/2406.03260.
- Bordelon et al. (2020) Bordelon, B., Canatar, A., and Pehlevan, C. Spectrum dependent learning curves in kernel regression and wide neural networks. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 1024–1034. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/bordelon20a.html.
- Brézin et al. (2016) Brézin, E., Hikami, S., et al. Random matrix theory with an external source. Springer, 2016.
- Camilli et al. (2023) Camilli, F., Tieplova, D., and Barbier, J. Fundamental limits of overparametrized shallow neural networks for supervised learning, 2023. URL https://arxiv.org/abs/2307.05635.
- Camilli et al. (2025) Camilli, F., Tieplova, D., Bergamin, E., and Barbier, J. Information-theoretic reduction of deep neural networks to linear models in the overparametrized proportional regime. The 38th Annual Conference on Learning Theory (to appear), 2025.
- Canatar et al. (2021) Canatar, A., Bordelon, B., and Pehlevan, C. Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks. Nature Communications, 12(1):2914, 05 2021. ISSN 2041-1723. 10.1038/s41467-021-23103-1. URL https://doi.org/10.1038/s41467-021-23103-1.
- Chung et al. (2018) Chung, S., Lee, D. D., and Sompolinsky, H. Classification and geometry of general perceptual manifolds. Phys. Rev. X, 8:031003, Jul 2018. 10.1103/PhysRevX.8.031003. URL https://link.aps.org/doi/10.1103/PhysRevX.8.031003.
- Cui et al. (2023) Cui, H., Krzakala, F., and Zdeborova, L. Bayes-optimal learning of deep random networks of extensive-width. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 6468–6521. PMLR, 07 2023. URL https://proceedings.mlr.press/v202/cui23b.html.
- Del Giudice, P. et al. (1989) Del Giudice, P., Franz, S., and Virasoro, M. A. Perceptron beyond the limit of capacity. J. Phys. France, 50(2):121–134, 1989. 10.1051/jphys:01989005002012100. URL https://doi.org/10.1051/jphys:01989005002012100.
- Dietrich et al. (1999) Dietrich, R., Opper, M., and Sompolinsky, H. Statistical mechanics of support vector networks. Phys. Rev. Lett., 82:2975–2978, 04 1999. 10.1103/PhysRevLett.82.2975. URL https://link.aps.org/doi/10.1103/PhysRevLett.82.2975.
- Du & Lee (2018) Du, S. and Lee, J. On the power of over-parametrization in neural networks with quadratic activation. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1329–1338. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/du18a.html.
- Engel & Van den Broeck (2001) Engel, A. and Van den Broeck, C. Statistical mechanics of learning. Cambridge University Press, 2001. ISBN 9780521773072.
- Engel et al. (1992) Engel, A., Köhler, H. M., Tschepke, F., Vollmayr, H., and Zippelius, A. Storage capacity and learning algorithms for two-layer neural networks. Phys. Rev. A, 45:7590–7609, May 1992. 10.1103/PhysRevA.45.7590. URL https://link.aps.org/doi/10.1103/PhysRevA.45.7590.
- Gamarnik et al. (2024) Gamarnik, D., Kızıldağ, E. C., and Zadik, I. Stationary points of a shallow neural network with quadratic activations and the global optimality of the gradient descent algorithm. Mathematics of Operations Research, 50(1):209–251, 2024. 10.1287/moor.2021.0082. URL https://doi.org/10.1287/moor.2021.0082.
- Gardner & Derrida (1989) Gardner, E. and Derrida, B. Three unfinished works on the optimal storage capacity of networks. Journal of Physics A: Mathematical and General, 22(12):1983, jun 1989. 10.1088/0305-4470/22/12/004. URL https://dx.doi.org/10.1088/0305-4470/22/12/004.
- Gerace et al. (2021) Gerace, F., Loureiro, B., Krzakala, F., Mézard, M., and Zdeborová, L. Generalisation error in learning with random features and the hidden manifold model. Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124013, Dec 2021. ISSN 1742-5468. 10.1088/1742-5468/ac3ae6. URL http://dx.doi.org/10.1088/1742-5468/ac3ae6.
- Ghorbani et al. (2021) Ghorbani, B., Mei, S., Misiakiewicz, T., and Montanari, A. Linearized two-layers neural networks in high dimension. The Annals of Statistics, 49(2):1029 – 1054, 2021. 10.1214/20-AOS1990. URL https://doi.org/10.1214/20-AOS1990.
- Goldt et al. (2020) Goldt, S., Mézard, M., Krzakala, F., and Zdeborová, L. Modeling the influence of data structure on learning in neural networks: The hidden manifold model. Phys. Rev. X, 10:041044, Dec 2020. 10.1103/PhysRevX.10.041044. URL https://link.aps.org/doi/10.1103/PhysRevX.10.041044.
- Goldt et al. (2022) Goldt, S., Loureiro, B., Reeves, G., Krzakala, F., Mezard, M., and Zdeborová, L. The Gaussian equivalence of generative models for learning with shallow neural networks. In Bruna, J., Hesthaven, J., and Zdeborová, L. (eds.), Proceedings of the 2nd Mathematical and Scientific Machine Learning Conference, volume 145 of Proceedings of Machine Learning Research, pp. 426–471. PMLR, 08 2022. URL https://proceedings.mlr.press/v145/goldt22a.html.
- Guionnet & Zeitouni (2002) Guionnet, A. and Zeitouni, O. Large deviations asymptotics for spherical integrals. Journal of Functional Analysis, 188(2):461–515, 2002. ISSN 0022-1236. 10.1006/jfan.2001.3833. URL https://www.sciencedirect.com/science/article/pii/S0022123601938339.
- Guo et al. (2005) Guo, D., Shamai, S., and Verdú, S. Mutual information and minimum mean-square error in gaussian channels. IEEE Transactions on Information Theory, 51(4):1261–1282, 2005. 10.1109/TIT.2005.844072. URL https://doi.org/10.1109/TIT.2005.844072.
- Györgyi (1990) Györgyi, G. First-order transition to perfect generalization in a neural network with binary synapses. Phys. Rev. A, 41:7097–7100, Jun 1990. 10.1103/PhysRevA.41.7097. URL https://link.aps.org/doi/10.1103/PhysRevA.41.7097.
- Hanin (2023) Hanin, B. Random neural networks in the infinite width limit as Gaussian processes. The Annals of Applied Probability, 33(6A):4798 – 4819, 2023. 10.1214/23-AAP1933. URL https://doi.org/10.1214/23-AAP1933.
- Hastie et al. (2022) Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. Surprises in high-dimensional ridgeless least squares interpolation. The Annals of Statistics, 50(2):949 – 986, 2022. 10.1214/21-AOS2133. URL https://doi.org/10.1214/21-AOS2133.
- Hu & Lu (2023) Hu, H. and Lu, Y. M. Universality laws for high-dimensional learning with random features. IEEE Transactions on Information Theory, 69(3):1932–1964, 2023. 10.1109/TIT.2022.3217698. URL https://doi.org/10.1109/TIT.2022.3217698.
- Hu et al. (2024) Hu, H., Lu, Y. M., and Misiakiewicz, T. Asymptotics of random feature regression beyond the linear scaling regime, 2024. URL https://arxiv.org/abs/2403.08160.
- Itzykson & Zuber (1980) Itzykson, C. and Zuber, J. The planar approximation. II. Journal of Mathematical Physics, 21(3):411–421, 03 1980. ISSN 0022-2488. 10.1063/1.524438. URL https://doi.org/10.1063/1.524438.
- Kabashima et al. (2016) Kabashima, Y., Krzakala, F., Mézard, M., Sakata, A., and Zdeborová, L. Phase transitions and sample complexity in Bayes-optimal matrix factorization. IEEE Transactions on Information Theory, 62(7):4228–4265, 2016. 10.1109/TIT.2016.2556702. URL https://doi.org/10.1109/TIT.2016.2556702.
- Kazakov (2000) Kazakov, V. A. Solvable matrix models, 2000. URL https://arxiv.org/abs/hep-th/0003064.
- Kingma & Ba (2017) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization, 2017. URL https://arxiv.org/abs/1412.6980.
- Krzakala et al. (2013) Krzakala, F., Mézard, M., and Zdeborová, L. Phase diagram and approximate message passing for blind calibration and dictionary learning. In 2013 IEEE International Symposium on Information Theory, pp. 659–663, 2013. 10.1109/ISIT.2013.6620308. URL https://doi.org/10.1109/ISIT.2013.6620308.
- Lee et al. (2018) Lee, J., Sohl-dickstein, J., Pennington, J., Novak, R., Schoenholz, S., and Bahri, Y. Deep neural networks as Gaussian processes. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1EA-M-0Z.
- Li & Sompolinsky (2021) Li, Q. and Sompolinsky, H. Statistical mechanics of deep linear neural networks: The backpropagating kernel renormalization. Phys. Rev. X, 11:031059, Sep 2021. 10.1103/PhysRevX.11.031059. URL https://link.aps.org/doi/10.1103/PhysRevX.11.031059.
- Loureiro et al. (2021a) Loureiro, B., Gerbelot, C., Cui, H., Goldt, S., Krzakala, F., Mezard, M., and Zdeborová, L. Learning curves of generic features maps for realistic datasets with a teacher-student model. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 18137–18151. Curran Associates, Inc., 2021a. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/9704a4fc48ae88598dcbdcdf57f3fdef-Paper.pdf.
- Loureiro et al. (2021b) Loureiro, B., Sicuro, G., Gerbelot, C., Pacco, A., Krzakala, F., and Zdeborová, L. Learning Gaussian mixtures with generalized linear models: Precise asymptotics in high-dimensions. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 10144–10157. Curran Associates, Inc., 2021b. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/543e83748234f7cbab21aa0ade66565f-Paper.pdf.
- Maillard et al. (2022) Maillard, A., Krzakala, F., Mézard, M., and Zdeborová, L. Perturbative construction of mean-field equations in extensive-rank matrix factorization and denoising. Journal of Statistical Mechanics: Theory and Experiment, 2022(8):083301, Aug 2022. 10.1088/1742-5468/ac7e4c. URL https://dx.doi.org/10.1088/1742-5468/ac7e4c.
- Maillard et al. (2024a) Maillard, A., Troiani, E., Martin, S., Krzakala, F., and Zdeborová, L. Bayes-optimal learning of an extensive-width neural network from quadratically many samples, 2024a. URL https://arxiv.org/abs/2408.03733.
- Maillard et al. (2024b) Maillard, A., Troiani, E., Martin, S., Krzakala, F., and Zdeborová, L. Github repository ExtensiveWidthQuadraticSamples. https://github.com/SPOC-group/ExtensiveWidthQuadraticSamples, 2024b.
- Martin et al. (2024) Martin, S., Bach, F., and Biroli, G. On the impact of overparameterization on the training of a shallow neural network in high dimensions. In Dasgupta, S., Mandt, S., and Li, Y. (eds.), Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, volume 238 of Proceedings of Machine Learning Research, pp. 3655–3663. PMLR, 02–04 May 2024. URL https://proceedings.mlr.press/v238/martin24a.html.
- Mato & Parga (1992) Mato, G. and Parga, N. Generalization properties of multilayered neural networks. Journal of Physics A: Mathematical and General, 25(19):5047, Oct 1992. 10.1088/0305-4470/25/19/017. URL https://dx.doi.org/10.1088/0305-4470/25/19/017.
- Matthews et al. (2018) Matthews, A. G. D. G., Hron, J., Rowland, M., Turner, R. E., and Ghahramani, Z. Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1-nGgWC-.
- Matytsin (1994) Matytsin, A. On the large- $N$ limit of the Itzykson-Zuber integral. Nuclear Physics B, 411(2):805–820, 1994. ISSN 0550-3213. 10.1016/0550-3213(94)90471-5. URL https://www.sciencedirect.com/science/article/pii/0550321394904715.
- Mei & Montanari (2022) Mei, S. and Montanari, A. The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75(4):667–766, 2022. 10.1002/cpa.22008. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/cpa.22008.
- Mezard et al. (1986) Mezard, M., Parisi, G., and Virasoro, M. Spin Glass Theory and Beyond. World Scientific, 1986. 10.1142/0271. URL https://www.worldscientific.com/doi/abs/10.1142/0271.
- Monasson (1992) Monasson, R. Properties of neural networks storing spatially correlated patterns. Journal of Physics A: Mathematical and General, 25(13):3701, Jul 1992. 10.1088/0305-4470/25/13/019. URL https://dx.doi.org/10.1088/0305-4470/25/13/019.
- Monasson & Zecchina (1995) Monasson, R. and Zecchina, R. Weight space structure and internal representations: A direct approach to learning and generalization in multilayer neural networks. Phys. Rev. Lett., 75:2432–2435, Sep 1995. 10.1103/PhysRevLett.75.2432. URL https://link.aps.org/doi/10.1103/PhysRevLett.75.2432.
- Naveh & Ringel (2021) Naveh, G. and Ringel, Z. A self consistent theory of Gaussian processes captures feature learning effects in finite CNNs. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 21352–21364. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/b24d21019de5e59da180f1661904f49a-Paper.pdf.
- Neal (1996) Neal, R. M. Priors for Infinite Networks, pp. 29–53. Springer New York, New York, NY, 1996. ISBN 978-1-4612-0745-0. 10.1007/978-1-4612-0745-0_2. URL https://doi.org/10.1007/978-1-4612-0745-0_2.
- Nishimori (2001) Nishimori, H. Statistical Physics of Spin Glasses and Information Processing: An Introduction. Oxford University Press, 07 2001. ISBN 9780198509417. 10.1093/acprof:oso/9780198509417.001.0001.
- Nourdin et al. (2011) Nourdin, I., Peccati, G., and Podolskij, M. Quantitative Breuer–Major theorems. Stochastic Processes and their Applications, 121(4):793–812, 2011. ISSN 0304-4149. 10.1016/j.spa.2010.12.006. URL https://www.sciencedirect.com/science/article/pii/S0304414910002917.
- Pacelli et al. (2023) Pacelli, R., Ariosto, S., Pastore, M., Ginelli, F., Gherardi, M., and Rotondo, P. A statistical mechanics framework for Bayesian deep neural networks beyond the infinite-width limit. Nature Machine Intelligence, 5(12):1497–1507, 12 2023. ISSN 2522-5839. 10.1038/s42256-023-00767-6. URL https://doi.org/10.1038/s42256-023-00767-6.
- Parker et al. (2014) Parker, J. T., Schniter, P., and Cevher, V. Bilinear generalized approximate message passing—Part I: Derivation. IEEE Transactions on Signal Processing, 62(22):5839–5853, 2014. 10.1109/TSP.2014.2357776. URL https://doi.org/10.1109/TSP.2014.2357776.
- Potters & Bouchaud (2020) Potters, M. and Bouchaud, J.-P. A first course in random matrix theory: for physicists, engineers and data scientists. Cambridge University Press, 2020.
- Pourkamali et al. (2024) Pourkamali, F., Barbier, J., and Macris, N. Matrix inference in growing rank regimes. IEEE Transactions on Information Theory, 70(11):8133–8163, 2024. 10.1109/TIT.2024.3422263. URL https://doi.org/10.1109/TIT.2024.3422263.
- Power et al. (2022) Power, A., Burda, Y., Edwards, H., Babuschkin, I., and Misra, V. Grokking: Generalization beyond overfitting on small algorithmic datasets, 2022. URL https://arxiv.org/abs/2201.02177.
- Rotondo et al. (2020) Rotondo, P., Pastore, M., and Gherardi, M. Beyond the storage capacity: Data-driven satisfiability transition. Phys. Rev. Lett., 125:120601, Sep 2020. 10.1103/PhysRevLett.125.120601. URL https://link.aps.org/doi/10.1103/PhysRevLett.125.120601.
- Rubin et al. (2024a) Rubin, N., Ringel, Z., Seroussi, I., and Helias, M. A unified approach to feature learning in Bayesian neural networks. In High-dimensional Learning Dynamics 2024: The Emergence of Structure and Reasoning, 2024a. URL https://openreview.net/forum?id=ZmOSJ2MV2R.
- Rubin et al. (2024b) Rubin, N., Seroussi, I., and Ringel, Z. Grokking as a first order phase transition in two layer networks. In The Twelfth International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=3ROGsTX3IR.
- Sakata & Kabashima (2013) Sakata, A. and Kabashima, Y. Statistical mechanics of dictionary learning. Europhysics Letters, 103(2):28008, Aug 2013. 10.1209/0295-5075/103/28008. URL https://dx.doi.org/10.1209/0295-5075/103/28008.
- Sarao Mannelli et al. (2020) Sarao Mannelli, S., Vanden-Eijnden, E., and Zdeborová, L. Optimization and generalization of shallow neural networks with quadratic activation functions. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 13445–13455. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/9b8b50fb590c590ffbf1295ce92258dc-Paper.pdf.
- Schmidt (2018) Schmidt, H. C. Statistical physics of sparse and dense models in optimization and inference. PhD thesis, 2018. URL http://www.theses.fr/2018SACLS366.
- Schwarze & Hertz (1992) Schwarze, H. and Hertz, J. Generalization in a large committee machine. Europhysics Letters, 20(4):375, Oct 1992. 10.1209/0295-5075/20/4/015. URL https://dx.doi.org/10.1209/0295-5075/20/4/015.
- Schwarze & Hertz (1993) Schwarze, H. and Hertz, J. Generalization in fully connected committee machines. Europhysics Letters, 21(7):785, Mar 1993. 10.1209/0295-5075/21/7/012. URL https://dx.doi.org/10.1209/0295-5075/21/7/012.
- Semerjian (2024) Semerjian, G. Matrix denoising: Bayes-optimal estimators via low-degree polynomials. Journal of Statistical Physics, 191(10):139, Oct 2024. ISSN 1572-9613. 10.1007/s10955-024-03359-9. URL https://doi.org/10.1007/s10955-024-03359-9.
- Seroussi et al. (2023) Seroussi, I., Naveh, G., and Ringel, Z. Separation of scales and a thermodynamic description of feature learning in some CNNs. Nature Communications, 14(1):908, Feb 2023. ISSN 2041-1723. 10.1038/s41467-023-36361-y. URL https://doi.org/10.1038/s41467-023-36361-y.
- Soltanolkotabi et al. (2019) Soltanolkotabi, M., Javanmard, A., and Lee, J. D. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory, 65(2):742–769, 2019. 10.1109/TIT.2018.2854560. URL https://doi.org/10.1109/TIT.2018.2854560.
- Troiani et al. (2025) Troiani, E., Dandi, Y., Defilippis, L., Zdeborova, L., Loureiro, B., and Krzakala, F. Fundamental computational limits of weak learnability in high-dimensional multi-index models. In The 28th International Conference on Artificial Intelligence and Statistics, 2025. URL https://openreview.net/forum?id=Mwzui5H0VN.
- van Meegen & Sompolinsky (2024) van Meegen, A. and Sompolinsky, H. Coding schemes in neural networks learning classification tasks, 2024. URL https://arxiv.org/abs/2406.16689.
- Venturi et al. (2019) Venturi, L., Bandeira, A. S., and Bruna, J. Spurious valleys in one-hidden-layer neural network optimization landscapes. Journal of Machine Learning Research, 20(133):1–34, 2019. URL http://jmlr.org/papers/v20/18-674.html.
- Williams (1996) Williams, C. Computing with infinite networks. In Mozer, M., Jordan, M., and Petsche, T. (eds.), Advances in Neural Information Processing Systems, volume 9. MIT Press, 1996. URL https://proceedings.neurips.cc/paper/1996/file/ae5e3ce40e0404a45ecacaaf05e5f735-Paper.pdf.
- Xiao et al. (2023) Xiao, L., Hu, H., Misiakiewicz, T., Lu, Y. M., and Pennington, J. Precise learning curves and higher-order scaling limits for dot-product kernel regression. Journal of Statistical Mechanics: Theory and Experiment, 2023(11):114005, Nov 2023. 10.1088/1742-5468/ad01b7. URL https://dx.doi.org/10.1088/1742-5468/ad01b7.
- Xu et al. (2025) Xu, Y., Maillard, A., Zdeborová, L., and Krzakala, F. Fundamental limits of matrix sensing: Exact asymptotics, universality, and applications, 2025. URL https://arxiv.org/abs/2503.14121.
- Yoon & Oh (1998) Yoon, H. and Oh, J.-H. Learning of higher-order perceptrons with tunable complexities. Journal of Physics A: Mathematical and General, 31(38):7771–7784, 09 1998. 10.1088/0305-4470/31/38/012. URL https://doi.org/10.1088/0305-4470/31/38/012.
- Zdeborová & Krzakala (2016) Zdeborová, L. and Krzakala, F. Statistical physics of inference: thresholds and algorithms. Advances in Physics, 65(5):453–552, 2016. 10.1080/00018732.2016.1211393. URL https://doi.org/10.1080/00018732.2016.1211393.
## Appendix A Hermite basis and Mehler’s formula
Recall the Hermite expansion of the activation:
$$
\sigma(x)=\sum_{\ell=0}^{\infty}\frac{\mu_{\ell}}{\ell!}{\rm He}_{\ell}(x). \tag{17}
$$
We are expressing it on the basis of the probabilist’s Hermite polynomials, generated through
$$
{\rm He}_{\ell}(z)=\frac{d^{\ell}}{{dt}^{\ell}}\exp\big{(}tz-t^{2}/2\big{)}
\big{|}_{t=0}. \tag{18}
$$
The Hermite basis has the property of being orthogonal with respect to the standard Gaussian measure, which is the distribution of the input data:
$$
\int Dz\,{\rm He}_{k}(z){\rm He}_{\ell}(z)=\ell!\,\delta_{k\ell}, \tag{19}
$$
where $Dz:=dz\exp(-z^{2}/2)/\sqrt{2\pi}$ . By orthogonality, the coefficients of the expansions can be obtained as
$$
\mu_{\ell}=\int Dz{\rm He}_{\ell}(z)\sigma(z). \tag{20}
$$
Moreover,
$$
\mathbb{E}[\sigma(z)^{2}]=\int Dz\,\sigma(z)^{2}=\sum_{\ell=0}^{\infty}\frac{
\mu_{\ell}^{2}}{\ell!}. \tag{21}
$$
These coefficients for some popular choices of $\sigma$ are reported in Table 1 for reference.
Table 1: First Hermite coefficients of some activation functions reported in the figures. $\theta$ is the Heaviside step function.
| Activation | $\mu_{0}$ | $\mu_{1}$ | $\mu_{2}$ | $\mu_{3}$ | $\mu_{4}$ | $\cdots$ | $\mathbb{E}[\sigma(z)^{2}]$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ${\rm ReLU}(z)=z\theta(z)$ | $1/\sqrt{2\pi}$ | $1/2$ | $1/\sqrt{2\pi}$ | $0$ | $-1/\sqrt{2\pi}$ | $\cdots$ | $1/2$ |
| ${\rm ELU}(z)=z\theta(z)+(e^{z}-1)\theta(-z)$ | 0.16052 | 0.76158 | 0.26158 | -0.13736 | -0.13736 | $\cdots$ | 0.64494 |
| ${\rm Tanh}(2z)$ | 0 | 0.72948 | 0 | -0.61398 | 0 | $\cdots$ | 0.63526 |
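The coefficients (20) and the entries of Table 1 are easy to reproduce numerically. The following sketch (our own illustration, not from the paper) uses Gauss–Hermite quadrature in the probabilists' convention:

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

def hermite_coeffs(sigma, n_max=4, order=200):
    """mu_ell = int Dz He_ell(z) sigma(z) (Eq. 20), probabilists' convention."""
    x, w = hermegauss(order)        # quadrature nodes/weights for weight exp(-x^2/2)
    w = w / np.sqrt(2.0 * np.pi)    # normalise to the standard Gaussian measure Dz
    mus = []
    for ell in range(n_max + 1):
        c = np.zeros(ell + 1)
        c[ell] = 1.0                # coefficient vector selecting He_ell
        mus.append(float(np.sum(w * hermeval(x, c) * sigma(x))))
    return np.array(mus)

mu_tanh = hermite_coeffs(lambda z: np.tanh(2 * z))    # third row of Table 1
mu_relu = hermite_coeffs(lambda z: np.maximum(z, 0))  # first row of Table 1
```

With these conventions `mu_tanh` matches $(0,\,0.72948,\,0,\,-0.61398,\,0)$ and `mu_relu[1]` matches $1/2$; the kink of ReLU at the origin limits the quadrature to a few digits of accuracy there.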
The Hermite basis can be generalised to an orthogonal basis with respect to the Gaussian measure with generic variance:
$$
{\rm He}_{\ell}^{[r]}(z)=\frac{d^{\ell}}{dt^{\ell}}\exp\big{(}tz-t^{2}r/2\big{
)}\big{|}_{t=0}, \tag{22}
$$
so that, with $D_{r}z:=dz\exp(-z^{2}/2r)/\sqrt{2\pi r}$ , we have
$$
\int D_{r}z\,{\rm He}_{k}^{[r]}(z){\rm He}_{\ell}^{[r]}(z)=\ell!\,r^{\ell}
\delta_{k\ell}. \tag{23}
$$
From Mehler’s formula
$$
\frac{1}{2\pi\sqrt{r^{2}-q^{2}}}\exp\!\Big{[}-\frac{1}{2}(u,v)\begin{pmatrix}r
&q\\
q&r\end{pmatrix}^{-1}\begin{pmatrix}u\\
v\end{pmatrix}\Big{]}=\frac{e^{-\frac{u^{2}}{2r}}}{\sqrt{2\pi r}}\frac{e^{-
\frac{v^{2}}{2r}}}{\sqrt{2\pi r}}\sum_{\ell=0}^{+\infty}\frac{q^{\ell}}{\ell!r
^{2\ell}}{\rm He}_{\ell}^{[r]}(u){\rm He}_{\ell}^{[r]}(v), \tag{24}
$$
and by orthogonality of the Hermite basis, (8) readily follows by noticing that the variables $(h_{i}^{a}=({\mathbf{W}}^{a}{\mathbf{x}})_{i}/\sqrt{d})_{i,a}$ at given $({\mathbf{W}}^{a})$ are Gaussian with covariances $\Omega^{ab}_{ij}={\mathbf{W}}_{i}^{a\intercal}{\mathbf{W}}^{b}_{j}/d$ , so that
$$
\mathbb{E}[\sigma(h_{i}^{a})\sigma(h_{j}^{b})]=\sum_{\ell=0}^{\infty}\frac{(
\mu_{\ell}^{[r]})^{2}}{\ell!r^{2\ell}}(\Omega_{ij}^{ab})^{\ell},\qquad\mu_{
\ell}^{[r]}=\int D_{r}z\,{\rm He}^{[r]}_{\ell}(z)\sigma(z). \tag{25}
$$
Moreover, since $r=\Omega^{aa}_{ii}$ converges for large $d$ to the prior variance of ${\mathbf{W}}^{0}$ by Bayes-optimality, whenever $\Omega^{aa}_{ii}\to 1$ this formula specialises to the simpler case $r=1$ reported in the main text.
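As a numerical sanity check of (25) in the case $r=1$ (an illustration with our own choices $\sigma=\tanh(2\,\cdot)$ and correlation $\Omega=0.5$, not part of the derivation), one can compare a Monte Carlo estimate of $\mathbb{E}[\sigma(u)\sigma(v)]$ over correlated standard Gaussians with the truncated Hermite series:

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermegauss, hermeval

rng = np.random.default_rng(0)
sigma = lambda z: np.tanh(2 * z)

# Hermite coefficients mu_ell of the activation (Eq. 20)
x, w = hermegauss(200)
w = w / np.sqrt(2 * np.pi)
L = 16
mu = np.array([np.sum(w * hermeval(x, np.eye(L)[ell]) * sigma(x)) for ell in range(L)])

# left-hand side of (25) with r = 1: Monte Carlo over correlated standard Gaussians
omega = 0.5
u = rng.standard_normal(2_000_000)
v = omega * u + np.sqrt(1 - omega**2) * rng.standard_normal(u.size)
lhs = np.mean(sigma(u) * sigma(v))

# right-hand side: truncated series  sum_ell mu_ell^2 / ell! * omega^ell
rhs = sum(mu[ell]**2 / factorial(ell) * omega**ell for ell in range(L))
```

Since $\sum_{\ell}\mu_{\ell}^{2}/\ell!=\mathbb{E}[\sigma(z)^{2}]$ is finite, the tail of the series is controlled by $\Omega^{\ell}$ and the truncation at $\ell=16$ is far below the Monte Carlo noise floor.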
## Appendix B Nishimori identities
The Nishimori identities are a very general set of symmetries arising in inference in the Bayes-optimal setting as a consequence of Bayes’ rule. To introduce them, consider a test function $f$ of the teacher weights, collectively denoted by ${\bm{\theta}}^{0}$ , of $s-1$ replicas of the student’s weights $({\bm{\theta}}^{a})_{2\leq a\leq s}$ drawn conditionally i.i.d. from the posterior, and possibly also of the training set $\mathcal{D}$ : $f({\bm{\theta}}^{0},{\bm{\theta}}^{2},\dots,{\bm{\theta}}^{s};\mathcal{D})$ . Then
$$
\displaystyle\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\langle f({\bm{\theta}}
^{0},{\bm{\theta}}^{2},\dots,{\bm{\theta}}^{s};\mathcal{D})\rangle=\mathbb{E}_
{{\bm{\theta}}^{0},\mathcal{D}}\langle f({\bm{\theta}}^{1},{\bm{\theta}}^{2},
\dots,{\bm{\theta}}^{s};\mathcal{D})\rangle, \tag{26}
$$
where we have replaced the teacher’s weights with another replica from the student. The proof is elementary, see e.g. Barbier et al. (2019).
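The identity can be illustrated in the simplest Bayes-optimal model, a scalar Gaussian channel (our own toy example): with $\theta^{0}\sim\mathcal{N}(0,1)$ and $y=\theta^{0}+\sqrt{\Delta}\,\xi$, the posterior mean is $y/(1+\Delta)$, and (26) applied to $f=\theta^{0}\theta^{2}$ versus $f=\theta^{1}\theta^{2}$ predicts $\mathbb{E}[\theta^{0}\langle\theta\rangle]=\mathbb{E}[\langle\theta\rangle^{2}]$, both converging to $1/(1+\Delta)$:

```python
import numpy as np

rng = np.random.default_rng(1)
Delta, n = 0.5, 2_000_000

theta0 = rng.standard_normal(n)                        # teacher weight, prior N(0, 1)
y = theta0 + np.sqrt(Delta) * rng.standard_normal(n)   # observation through the channel

post_mean = y / (1 + Delta)         # posterior mean <theta> of this conjugate model

lhs = np.mean(theta0 * post_mean)   # E[theta^0 <theta>]     (teacher-replica overlap)
rhs = np.mean(post_mean**2)         # E[<theta>^2] = E<theta^1 theta^2> (replica-replica)
# both sides converge to 1/(1 + Delta)
```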
The Nishimori identities also have consequences for our replica symmetric ansatz for the free entropy. In particular, they constrain the asymptotic means of some order parameters. For instance
$$
\displaystyle m_{2}=\lim\frac{1}{d^{2}}\mathbb{E}_{\mathcal{D},{\bm{\theta}}^{
0}}\langle{\rm Tr}[{\mathbf{S}}_{2}^{a}{\mathbf{S}}_{2}^{0}]\rangle=\lim\frac{
1}{d^{2}}\mathbb{E}_{\mathcal{D}}\langle{\rm Tr}[{\mathbf{S}}_{2}^{a}{\mathbf{
S}}_{2}^{b}]\rangle=q_{2},\quad\text{for }a\neq b. \tag{27}
$$
Combined with the concentration of such order parameters, which can be proven in great generality in Bayes-optimal learning Barbier (2020); Barbier & Panchenko (2022), it fixes the values for some of them. For instance, we have that with high probability
$$
\displaystyle\frac{1}{d^{2}}{\rm Tr}[({\mathbf{S}}_{2}^{a})^{2}]\to r_{2}=\lim
\frac{1}{d^{2}}\mathbb{E}_{\mathcal{D}}\langle{\rm Tr}[({\mathbf{S}}_{2}^{a})^
{2}]\rangle=\lim\frac{1}{d^{2}}\mathbb{E}_{{\bm{\theta}}^{0}}{\rm Tr}[({
\mathbf{S}}_{2}^{0})^{2}]=\rho_{2}=1+\gamma\bar{v}^{2}. \tag{28}
$$
When the values of some order parameters are determined by the Nishimori identities (and their concentration), as for those fixed to $r_{2}=\rho_{2}$ , then the respective Fourier conjugates $\hat{r}_{2},\hat{\rho}_{2}$ vanish (meaning that the desired constraints were already asymptotically enforced without the need of additional delta functions). This is because the configurations that make the order parameters take those values exponentially (in $n$ ) dominate the posterior measure, so these constraints are automatically imposed by the measure.
## Appendix C Alternative representation for the optimal mean-square generalisation error
We recall that ${\bm{\theta}}^{0}=({\mathbf{v}}^{0},{\mathbf{W}}^{0})$ and similarly for ${\bm{\theta}}^{1}={\bm{\theta}},{\bm{\theta}}^{2},\ldots$ which are replicas, i.e., conditionally i.i.d. samples from $dP({\mathbf{W}},{\mathbf{v}}\mid\mathcal{D})$ (the reasoning below applies whether ${\mathbf{v}}$ is learnable or quenched, so in general we can consider a joint posterior over both). In this section we report the details on how to obtain Result 2.2 and how to write the generalisation error defined in (3) in a form more convenient for numerical estimation.
From its definition, the Bayes-optimal mean-square generalisation error can be recast as
$$
\displaystyle\varepsilon^{\rm opt}=\mathbb{E}_{{\bm{\theta}}^{0},{\mathbf{x}}_
{\rm test}}\mathbb{E}[y^{2}_{\rm test}\mid\lambda^{0}]-2\mathbb{E}_{{\bm{
\theta}}^{0},\mathcal{D},{\mathbf{x}}_{\rm test}}\mathbb{E}[y_{\rm test}\mid
\lambda^{0}]\langle\mathbb{E}[y\mid\lambda]\rangle+\mathbb{E}_{{\bm{\theta}}^{
0},\mathcal{D},{\mathbf{x}}_{\rm test}}\langle\mathbb{E}[y\mid\lambda]\rangle^
{2}, \tag{29}
$$
where $\mathbb{E}[y\mid\lambda]=\int dy\,y\,P_{\rm out}(y\mid\lambda)$ , and $\lambda^{0}$ , $\lambda$ are the random variables (random due to the test input ${\mathbf{x}}_{\rm test}$ , drawn independently of the training data $\mathcal{D}$ , and their respective weights ${\bm{\theta}}^{0},{\bm{\theta}}$ )
$$
\displaystyle\lambda^{0}=\lambda({\bm{\theta}}^{0},{\mathbf{x}}_{\rm test})=
\frac{{\mathbf{v}}^{0\intercal}}{\sqrt{k}}\sigma\Big{(}\frac{{\mathbf{W}}^{0}{
\mathbf{x}}_{\rm test}}{\sqrt{d}}\Big{)},\qquad\lambda=\lambda^{1}=\lambda({
\bm{\theta}},{\mathbf{x}}_{\rm test})=\frac{{\mathbf{v}}^{\intercal}}{\sqrt{k}
}\sigma\Big{(}\frac{{\mathbf{W}}{\mathbf{x}}_{\rm test}}{\sqrt{d}}\Big{)}. \tag{30}
$$
Recall that the bracket $\langle\,\cdot\,\rangle$ is the average w.r.t. the posterior and acts on ${\bm{\theta}}^{1}={\bm{\theta}},{\bm{\theta}}^{2},\ldots$ . Notice that the last term on the r.h.s. of (29) can be rewritten as
| | $\displaystyle\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D},{\mathbf{x}}_{\rm test} }\langle\mathbb{E}[y\mid\lambda]\rangle^{2}=\mathbb{E}_{{\bm{\theta}}^{0}, \mathcal{D},{\mathbf{x}}_{\rm test}}\langle\mathbb{E}[y\mid\lambda^{1}]\mathbb {E}[y\mid\lambda^{2}]\rangle,$ | |
| --- | --- | --- |
with superscripts being replica indices, i.e., $\lambda^{a}:=\lambda({\bm{\theta}}^{a},{\mathbf{x}}_{\rm test})$ .
In order to show Result 2.2 for a generic $P_{\rm out}$ we assume joint Gaussianity of the variables $(\lambda^{0},\lambda^{1},\lambda^{2},\ldots)$, with covariance $K^{ab}$, $a,b\in\{0,1,2,\ldots\}$. Indeed, in the limit “$\lim$”, our theory treats $(\lambda^{a})_{a\geq 0}$ as jointly Gaussian under the randomness of a common input, here ${\mathbf{x}}_{\rm test}$, conditionally on the weights $({\bm{\theta}}^{a})$. Their covariance depends on the weights $({\bm{\theta}}^{a})$ through the overlap order parameters introduced in the main text. In the large limit “$\lim$” these overlaps are assumed to concentrate under the quenched posterior average $\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\langle\,\cdot\,\rangle$ towards the non-random asymptotic values of the extremiser globally maximising the RS potential in Result 2.1, with the overlaps entering $K^{ab}$ through (42). This hypothesis is then confirmed by the excellent agreement between our theoretical predictions based on it and the experimental results. It directly implies the equation for $\lim\,\varepsilon^{\mathcal{C},\mathsf{f}}$ in Result 2.2 from definition (2). For the special case of the optimal mean-square generalisation error it yields
$$
\displaystyle\lim\,\varepsilon^{\rm opt}=\mathbb{E}_{\lambda^{0}}\mathbb{E}[y^
{2}_{\rm test}\mid\lambda^{0}]-2\mathbb{E}_{\lambda^{0},\lambda^{1}}\mathbb{E}
[y_{\rm test}\mid\lambda^{0}]\mathbb{E}[y\mid\lambda^{1}]+\mathbb{E}_{\lambda^
{1},\lambda^{2}}\mathbb{E}[y\mid\lambda^{1}]\mathbb{E}[y\mid\lambda^{2}] \tag{31}
$$
where, in the replica symmetric ansatz,
$$
\displaystyle\mathbb{E}[(\lambda^{0})^{2}]=K^{00},\quad\mathbb{E}[\lambda^{0}
\lambda^{1}]=\mathbb{E}[\lambda^{0}\lambda^{2}]=K^{01},\quad\mathbb{E}[\lambda
^{1}\lambda^{2}]=K^{12},\quad\mathbb{E}[(\lambda^{1})^{2}]=\mathbb{E}[(\lambda
^{2})^{2}]=K^{11}. \tag{32}
$$
For the dependence of the elements of ${\mathbf{K}}$ on the overlaps under this ansatz we refer the reader to (45), (46). In the Bayes-optimal setting, using the Nishimori identities (see App. B), one can show that $K^{01}=K^{12}$ and $K^{00}=K^{11}$ . Because of these identifications, we additionally have
$$
\displaystyle\mathbb{E}_{\lambda^{0},\lambda^{1}}\mathbb{E}[y_{\rm test}\mid
\lambda^{0}]\mathbb{E}[y\mid\lambda^{1}]=\mathbb{E}_{\lambda^{1},\lambda^{2}}
\mathbb{E}[y\mid\lambda^{1}]\mathbb{E}[y\mid\lambda^{2}]. \tag{33}
$$
Plugging the above in (31) yields (7).
Let us now prove a formula for the optimal mean-square generalisation error written in terms of the overlaps that will be simpler to evaluate numerically, which holds for the special case of linear readout with Gaussian label noise $P_{\rm out}(y\mid\lambda)=\exp(-\frac{1}{2\Delta}(y-\lambda)^{2})/\sqrt{2\pi\Delta}$ . The following derivation is exact and does not require any Gaussianity assumption on the random variables $(\lambda^{a})$ . For the linear Gaussian channel the means verify $\mathbb{E}[y\mid\lambda]=\lambda$ and $\mathbb{E}[y^{2}\mid\lambda]=\lambda^{2}+\Delta$ . Plugged in (29) this yields
$$
\displaystyle\varepsilon^{\rm opt}-\Delta=\mathbb{E}_{{\bm{\theta}}^{0},{
\mathbf{x}}_{\rm test}}\lambda^{2}_{\rm test}-2\mathbb{E}_{{\bm{\theta}}^{0},
\mathcal{D},{\mathbf{x}}_{\rm test}}\lambda^{0}\langle\lambda\rangle+\mathbb{E
}_{{\bm{\theta}}^{0},\mathcal{D},{\mathbf{x}}_{\rm test}}\langle\lambda^{1}
\lambda^{2}\rangle, \tag{34}
$$
whence we clearly see that the generalisation error depends only on the covariance of $\lambda_{\rm test}({\bm{\theta}}^{0})=\lambda^{0}({\bm{\theta}}^{0}),\lambda^{1}({\bm{\theta}}^{1}),\lambda^{2}({\bm{\theta}}^{2})$ under the randomness of the shared input ${\mathbf{x}}_{\rm test}$ at fixed weights, regardless of the validity of the Gaussian equivalence principle we assume in the replica computation. This covariance was already computed in (8); we recall it here for the reader’s convenience
$$
\displaystyle K({\bm{\theta}}^{a},{\bm{\theta}}^{b}):=\mathbb{E}\lambda^{a}
\lambda^{b}=\sum_{\ell=1}^{\infty}\frac{\mu_{\ell}^{2}}{\ell!}\frac{1}{k}\sum_
{i,j=1}^{k}v_{i}^{a}(\Omega^{ab}_{ij})^{\ell}v^{b}_{j}=\sum_{\ell=1}^{\infty}
\frac{\mu_{\ell}^{2}}{\ell!}Q_{\ell}^{ab}, \tag{35}
$$
where $\Omega^{ab}_{ij}:={\mathbf{W}}_{i}^{a\intercal}{\mathbf{W}}_{j}^{b}/d$ , and $Q_{\ell}^{ab}$ is as introduced in (8) for $a,b=0,1,2$ . We stress that $K({\bm{\theta}}^{a},{\bm{\theta}}^{b})$ is not the limiting covariance $K^{ab}$ , whose elements are given in (45), (46), but rather its finite-size counterpart. $K({\bm{\theta}}^{a},{\bm{\theta}}^{b})$ provides an efficient way to compute the generalisation error numerically, through the formula
$$
\displaystyle\varepsilon^{\rm opt}-\Delta \displaystyle=\mathbb{E}_{{\bm{\theta}}^{0}}K({\bm{\theta}}^{0},{\bm{\theta}}^
{0})-2\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\langle K({\bm{\theta}}^{0},{
\bm{\theta}}^{1})\rangle+\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\langle K({
\bm{\theta}}^{1},{\bm{\theta}}^{2})\rangle=\sum_{\ell=1}^{\infty}\frac{\mu_{
\ell}^{2}}{\ell!}\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\langle Q_{\ell}^{0
0}-2Q_{\ell}^{01}+Q^{12}_{\ell}\rangle. \tag{36}
$$
In the above, the posterior measure $\langle\,\cdot\,\rangle$ is taken care of by Monte Carlo sampling (when it equilibrates). In addition, as in the main text, we assume that in the large system limit the (numerically confirmed) identity (11) holds. Putting all ingredients together we get
$$
\displaystyle\varepsilon^{\rm opt}-\Delta=\mathbb{E}_{{\bm{\theta}}^{0},
\mathcal{D}} \displaystyle\Big{\langle}\mu_{1}^{2}(Q_{1}^{00}-2Q^{01}_{1}+Q^{12}_{1})+\frac
{\mu_{2}^{2}}{2}(Q_{2}^{00}-2Q^{01}_{2}+Q^{12}_{2}) \displaystyle+\mathbb{E}_{v\sim P_{v}}v^{2}\big{[}g(\mathcal{Q}_{W}^{00}(v))-2
g(\mathcal{Q}_{W}^{01}(v))+g(\mathcal{Q}_{W}^{12}(v))\big{]}\Big{\rangle}. \tag{37}
$$
In the Bayes-optimal setting one can use again the Nishimori identities that imply $\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\langle Q^{12}_{1}\rangle=\mathbb{E} _{{\bm{\theta}}^{0},\mathcal{D}}\langle Q^{01}_{1}\rangle$ , and analogously $\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\langle Q^{12}_{2}\rangle=\mathbb{E} _{{\bm{\theta}}^{0},\mathcal{D}}\langle Q^{01}_{2}\rangle$ and $\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\langle g(\mathcal{Q}^{12}_{W}(v)) \rangle=\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\langle g(\mathcal{Q}^{01}_{ W}(v))\rangle$ . Inserting these identities in (37) one gets
$$
\displaystyle\varepsilon^{\rm opt}-\Delta \displaystyle=\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\Big{\langle}\mu_{1}^{
2}(Q_{1}^{00}-Q^{01}_{1})+\frac{\mu_{2}^{2}}{2}(Q_{2}^{00}-Q^{01}_{2})+\mathbb
{E}_{v\sim P_{v}}v^{2}\big{[}g(\mathcal{Q}_{W}^{00}(v))-g(\mathcal{Q}_{W}^{01}
(v))\big{]}\Big{\rangle}. \tag{38}
$$
This formula relies on no assumption other than (11); in particular, nothing is assumed on the law of the $\lambda$ ’s. That it depends only on their covariance is simply a consequence of the quadratic nature of the mean-square generalisation error.
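The finite-size covariance (35) can also be checked directly against a Monte Carlo average over the shared input. The sketch below (our own toy sizes, with $\sigma=\tanh(2\,\cdot)$ so that $\mu_{0}=0$ and the series indeed starts at $\ell=1$) normalises the rows of the weight matrices so that $\Omega^{aa}_{ii}=1$ exactly:

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermegauss, hermeval

rng = np.random.default_rng(2)
d, k, L = 100, 50, 12
sigma = lambda z: np.tanh(2 * z)

# Hermite coefficients of the activation (mu_0 = 0 since tanh is odd)
x, w = hermegauss(150)
w = w / np.sqrt(2 * np.pi)
mu = np.array([np.sum(w * hermeval(x, np.eye(L)[l]) * sigma(x)) for l in range(L)])

def sample_W():  # rows normalised so that Omega^{aa}_{ii} = 1 exactly
    W = rng.standard_normal((k, d))
    return W * np.sqrt(d) / np.linalg.norm(W, axis=1, keepdims=True)

Wa, Wb = sample_W(), sample_W()
va, vb = rng.standard_normal(k), rng.standard_normal(k)

Omega = Wa @ Wb.T / d                      # overlap matrix Omega^{ab}_{ij}
series = sum(mu[l]**2 / factorial(l) * (va @ Omega**l @ vb) / k
             for l in range(1, L))         # Eq. (35), truncated

# direct Monte Carlo over the shared input x_test
X = rng.standard_normal((50_000, d))
lam_a = sigma(X @ Wa.T / np.sqrt(d)) @ va / np.sqrt(k)
lam_b = sigma(X @ Wb.T / np.sqrt(d)) @ vb / np.sqrt(k)
mc = float(np.mean(lam_a * lam_b))
```

Note that `Omega**l` is an elementwise power, matching $(\Omega^{ab}_{ij})^{\ell}$ in (35); since the entries of $\Omega^{ab}$ are $O(1/\sqrt{d})$ for independent replicas, the truncation at $\ell=12$ is negligible.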
**Remark C.1**
*Note that the derivation up to (36) did not assume Bayes-optimality (while (38) does). Therefore, one can consider it in cases where the true posterior average $\langle\,\cdot\,\rangle$ is replaced by one not verifying the Nishimori identities. This is the formula we use to compute the generalisation error of Monte Carlo-based estimators in the inset of Fig. 7. This is indeed needed to compute the generalisation in the glassy regime, where MCMC cannot equilibrate.*
**Remark C.2**
*Using the Nishimori identities of App. B and, again, that for the linear readout with Gaussian label noise $\mathbb{E}[y\mid\lambda]=\lambda$ and $\mathbb{E}[y^{2}\mid\lambda]=\lambda^{2}+\Delta$ , it is easy to check that the so-called Gibbs error
$$
\varepsilon^{\rm Gibbs}:=\mathbb{E}_{\bm{\theta}^{0},{\mathcal{D}},{\mathbf{x}
}_{\rm test},y_{\rm test}}\big{\langle}(y_{\rm test}-\mathbb{E}[y\mid\lambda_{
\rm test}({\bm{\theta}})])^{2}\big{\rangle} \tag{39}
$$
is related for this channel to the Bayes-optimal mean-square generalisation error through the identity
$$
\varepsilon^{\rm Gibbs}-\Delta=2(\varepsilon^{\rm opt}-\Delta). \tag{40}
$$
We exploited this relationship together with the concentration of the Gibbs error w.r.t. the quenched posterior measure $\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\langle\,\cdot\,\rangle$ when evaluating the numerical generalisation error of the Monte Carlo algorithms reported in the main text.*
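Relation (40) can be checked in a minimal scalar instance of the linear Gaussian channel (our own toy example, not the network model): with $\lambda^{0}\sim\mathcal{N}(0,1)$ and $y=\lambda^{0}+\sqrt{\Delta}\,\xi$, the posterior of $\lambda$ is $\mathcal{N}(y/(1+\Delta),\Delta/(1+\Delta))$, giving $\varepsilon^{\rm opt}-\Delta=\Delta/(1+\Delta)$ and $\varepsilon^{\rm Gibbs}-\Delta=2\Delta/(1+\Delta)$:

```python
import numpy as np

rng = np.random.default_rng(3)
Delta, n = 0.5, 2_000_000

lam0 = rng.standard_normal(n)                        # teacher pre-activation lambda^0
y = lam0 + np.sqrt(Delta) * rng.standard_normal(n)   # noisy label

m = y / (1 + Delta)                                  # posterior mean of lambda
var = Delta / (1 + Delta)                            # posterior variance
lam = m + np.sqrt(var) * rng.standard_normal(n)      # one Gibbs sample from the posterior

eps_opt = Delta + np.mean((lam0 - m) ** 2)           # Bayes-optimal mean-square error
eps_gibbs = Delta + np.mean((lam0 - lam) ** 2)       # Gibbs error, Eq. (39)
# eps_gibbs - Delta converges to 2 * (eps_opt - Delta) = 2 * Delta / (1 + Delta)
```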
## Appendix D Details of the replica calculation
### D.1 Energetic potential
The replicated energetic term under our Gaussian assumption on the joint law of the post-activations replicas is reported here for the reader’s convenience:
$$
F_{E}=\ln\int dy\int d{\bm{\lambda}}\frac{e^{-\frac{1}{2}{\bm{\lambda}}^{
\intercal}{\mathbf{K}}^{-1}{\bm{\lambda}}}}{\sqrt{(2\pi)^{s+1}\det{\mathbf{K}}
}}\prod_{a=0}^{s}P_{\rm out}(y\mid\lambda^{a}). \tag{41}
$$
After applying our ansatz (10) and using that $Q_{1}^{ab}=1$ in the quadratic-data regime, the covariance matrix ${\mathbf{K}}$ in replica space defined in (8) reads
$$
\displaystyle K^{ab} \displaystyle=\mu_{1}^{2}+\frac{\mu_{2}^{2}}{2}Q^{ab}_{2}+\mathbb{E}_{v\sim P_
{v}}v^{2}g(\mathcal{Q}_{W}^{ab}(v)), \tag{42}
$$
where the function
$$
g(x)=\sum_{\ell=3}^{\infty}\frac{\mu_{\ell}^{2}}{\ell!}x^{\ell}=\mathbb{E}_{(y
,z)|x}[\sigma(y)\sigma(z)]-\mu_{0}^{2}-\mu_{1}^{2}x-\frac{\mu_{2}^{2}}{2}x^{2}
,\qquad(y,z)\sim{\mathcal{N}}\left((0,0),\begin{pmatrix}1&x\\
x&1\end{pmatrix}\right). \tag{43}
$$
The energetic term $F_{E}$ is already expressed as a low-dimensional integral, but within the replica symmetric (RS) ansatz it simplifies considerably. Let us denote $\bm{\mathcal{Q}}_{W}(\mathsf{v})=(\mathcal{Q}_{W}^{ab}(\mathsf{v}))_{a,b=0}^{s}$ . The RS ansatz amounts to assuming that the saddle point is dominated by order parameters of the form (below $\bm{1}_{s}$ and ${\mathbb{I}}_{s}$ are the all-ones vector and identity matrix of size $s$ )
$$
\bm{\mathcal{Q}}_{W}(\mathsf{v})=\begin{pmatrix}\rho_{W}&m_{W}\bm{1}_{s}^{
\intercal}\\
m_{W}\bm{1}_{s}&(r_{W}-\mathcal{Q}_{W}){\mathbb{I}}_{s}+\mathcal{Q}_{W}\bm{1}_
{s}\bm{1}_{s}^{\intercal}\end{pmatrix}\iff\hat{\bm{\mathcal{Q}}}_{W}(\mathsf{v
})=\begin{pmatrix}\hat{\rho}_{W}&-\hat{m}_{W}\bm{1}_{s}^{\intercal}\\
-\hat{m}_{W}\bm{1}_{s}&(\hat{r}_{W}+\hat{\mathcal{Q}}_{W}){\mathbb{I}}_{s}-
\hat{\mathcal{Q}}_{W}\bm{1}_{s}\bm{1}_{s}^{\intercal}\end{pmatrix},
$$
where all the above parameters $\rho_{W}=\rho_{W}(\mathsf{v}),\hat{\rho}_{W},m_{W},\ldots$ depend on $\mathsf{v}$ , and similarly
$$
{\mathbf{Q}}_{2}=\begin{pmatrix}\rho_{2}&m_{2}\bm{1}_{s}^{\intercal}\\
m_{2}\bm{1}_{s}&(r_{2}-q_{2}){\mathbb{I}}_{s}+q_{2}\bm{1}_{s}\bm{1}_{s}^{
\intercal}\end{pmatrix}\iff\hat{{\mathbf{Q}}}_{2}=\begin{pmatrix}\hat{\rho}_{2
}&-\hat{m}_{2}\bm{1}_{s}^{\intercal}\\
-\hat{m}_{2}\bm{1}_{s}&(\hat{r}_{2}+\hat{q}_{2}){\mathbb{I}}_{s}-\hat{q}_{2}
\bm{1}_{s}\bm{1}_{s}^{\intercal}\end{pmatrix},
$$
where for future convenience we also reported the ansatz for the Fourier conjugates, though they are not needed for the energetic potential. (We will repeatedly use the Fourier representation of the delta function, $\delta(x)=\frac{1}{2\pi}\int d\hat{x}\exp(i\hat{x}x)$ . Because the integrals we will end up with are always eventually evaluated by saddle point, implying a deformation of the integration contour in the complex plane, tracking the imaginary unit $i$ in the delta functions is irrelevant; similarly, the normalisation $1/2\pi$ only contributes sub-leading terms to the integrals at hand. We therefore allow ourselves to formally write $\delta(x)=\int d\hat{x}\exp(r\hat{x}x)$ for a convenient constant $r$ : as the final integrals are evaluated by saddle point, the choice of $r$ ends up being irrelevant.) The RS ansatz, which is equivalent to an assumption of concentration of the order parameters in the high-dimensional limit, is known to be exact when analysing Bayes-optimal inference and learning, as in the present paper, see Nishimori (2001); Barbier (2020); Barbier & Panchenko (2022). Under the RS ansatz ${\mathbf{K}}$ acquires a similar form:
$$
\displaystyle{\mathbf{K}}=\begin{pmatrix}\rho_{K}&m_{K}\bm{1}_{s}^{\intercal}
\\
m_{K}\bm{1}_{s}&(r_{K}-q_{K}){\mathbb{I}}_{s}+q_{K}\bm{1}_{s}\bm{1}_{s}^{
\intercal}\end{pmatrix} \tag{44}
$$
with
$$
\displaystyle m_{K}=\mu_{1}^{2}+\frac{\mu_{2}^{2}}{2}m_{2}+\mathbb{E}_{v\sim P_{v}}v^{2}g(m_{W}(v)),\qquad q_{K}=\mu_{1}^{2}+\frac{\mu_{2}^{2}}{2}q_{2}+\mathbb{E}_{v\sim P_{v}}v^{2}g(\mathcal{Q}_{W}(v)),\\
\rho_{K}=\mu_{1}^{2}+\frac{\mu_{2}^{2}}{2}\rho_{2}+\mathbb{E}_{v\sim P_{v}}v^{2}g(\rho_{W}(v)),\qquad r_{K}=\mu_{1}^{2}+\frac{\mu_{2}^{2}}{2}r_{2}+\mathbb{E}_{v\sim P_{v}}v^{2}g(r_{W}(v)). \tag{45}
$$
In the RS ansatz it is thus possible to give a convenient low-dimensional representation of the multivariate Gaussian integral of $F_{E}$ in terms of white Gaussian random variables:
$$
\displaystyle\lambda^{a}=\xi\sqrt{q_{K}}+u^{a}\sqrt{r_{K}-q_{K}}\quad\text{for
}a=1,\dots,s,\qquad\lambda^{0}=\xi\sqrt{\frac{m_{K}^{2}}{q_{K}}}+u^{0}\sqrt{
\rho_{K}-\frac{m_{K}^{2}}{q_{K}}} \tag{47}
$$
where $\xi,(u^{a})_{a=0}^{s}$ are i.i.d. standard Gaussian variables. Then
$$
\displaystyle F_{E}=\ln\int dy\,\mathbb{E}_{\xi,u^{0}}P_{\rm out}\Big{(}y\mid
\xi\sqrt{\frac{m_{K}^{2}}{q_{K}}}+u^{0}\sqrt{\rho_{K}-\frac{m_{K}^{2}}{q_{K}}}
\Big{)}\prod_{a=1}^{s}\mathbb{E}_{u^{a}}P_{\rm out}(y\mid\xi\sqrt{q_{K}}+u^{a}
\sqrt{r_{K}-q_{K}}). \tag{48}
$$
The last product over the replica index $a$ contains identical factors thanks to the RS ansatz. Therefore, by expanding in $s\to 0^{+}$ we get
$$
\displaystyle F_{E}=s\int dy\,\mathbb{E}_{\xi,u^{0}}P_{\rm out}\Big{(}y\mid\xi
\sqrt{\frac{m_{K}^{2}}{q_{K}}}+u^{0}\sqrt{\rho_{K}-\frac{m_{K}^{2}}{q_{K}}}
\Big{)}\ln\mathbb{E}_{u}P_{\rm out}(y\mid\xi\sqrt{q_{K}}+u\sqrt{r_{K}-q_{K}})+
O(s^{2}). \tag{49}
$$
For the linear readout with Gaussian label noise $P_{\rm out}(y\mid\lambda)=\exp(-\frac{1}{2\Delta}(y-\lambda)^{2})/\sqrt{2\pi\Delta}$ the above gives
$$
\displaystyle F_{E}=-\frac{s}{2}\ln\big{[}2\pi(\Delta+r_{K}-q_{K})\big{]}-
\frac{s}{2}\frac{\Delta+\rho_{K}-2m_{K}+q_{K}}{\Delta+r_{K}-q_{K}}+O(s^{2}). \tag{50}
$$
In the Bayes-optimal setting the Nishimori identities enforce
$$
\displaystyle r_{2}=\rho_{2}=\lim_{d\to\infty}\frac{1}{d^{2}}\mathbb{E}{\rm Tr
}[({\mathbf{S}}_{2}^{0})^{2}]=1+\gamma\bar{v}^{2}\quad\text{and}\quad m_{2}=q_
{2}, \displaystyle r_{W}(\mathsf{v})=\rho_{W}(\mathsf{v})=1\quad\text{and}\quad m_{
W}(\mathsf{v})=\mathcal{Q}_{W}(\mathsf{v})\ \forall\ \mathsf{v}\in\mathsf{V}, \tag{51}
$$
which implies also that
$$
\displaystyle r_{K}=\rho_{K}=\mu_{1}^{2}+\frac{1}{2}r_{2}\mu_{2}^{2}+g(1),\quad m_{K}=q_{K}. \tag{52}
$$
Therefore the above simplifies to
$$
\displaystyle F_{E} \displaystyle=s\int dy\,\mathbb{E}_{\xi,u^{0}}P_{\rm out}(y\mid\xi\sqrt{q_{K}}
+u^{0}\sqrt{r_{K}-q_{K}})\ln\mathbb{E}_{u}P_{\rm out}(y\mid\xi\sqrt{q_{K}}+u
\sqrt{r_{K}-q_{K}})+O(s^{2}) \displaystyle=:s\,\psi_{P_{\rm{out}}}(q_{K}(q_{2},\mathcal{Q}_{W});r_{K})+O(s^
{2}). \tag{54}
$$
Notice that the energetic contribution to the free entropy has the same form as in the generalised linear model Barbier et al. (2019). For our running example of linear readout with Gaussian noise the function $\psi_{P_{\rm out}}$ reduces to
$$
\displaystyle\psi_{P_{\rm{out}}}(q_{K}(q_{2},\mathcal{Q}_{W});r_{K})=-\frac{1}
{2}\ln\big{[}2\pi e(\Delta+r_{K}-q_{K})\big{]}. \tag{56}
$$
In what follows we restrict ourselves to the replica symmetric ansatz in the Bayes-optimal setting; identifications such as those in (51), (52) are therefore assumed.
### D.2 Second moment of $P(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W})$
For the reader’s convenience we report here the measure
$$
\displaystyle P(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W}) \displaystyle=V_{W}^{kd}(\bm{\mathcal{Q}}_{W})^{-1}\int\prod_{a}^{0,s}dP_{W}({
\mathbf{W}}^{a})\delta({\mathbf{S}}^{a}_{2}-{\mathbf{W}}^{a\intercal}({\mathbf
{v}}){\mathbf{W}}^{a}/\sqrt{k})\prod_{a\leq b}^{0,s}\prod_{\mathsf{v}\in
\mathsf{V}}\prod_{i\in\mathcal{I}_{\mathsf{v}}}\delta({d}\,\mathcal{Q}_{W}^{ab
}(\mathsf{v})-{\mathbf{W}}^{a\intercal}_{i}{\mathbf{W}}_{i}^{b}). \tag{57}
$$
Recall $\mathsf{V}$ is the support of $P_{v}$ (assumed discrete for the moment). Recall also that we have quenched the readout weights to the ground truth. Indeed, as discussed in the main text, considering them learnable or fixed to the truth does not change the leading order of the information-theoretic quantities.
In this measure, one can compute the asymptotic of its second moment
$$
\displaystyle\int dP(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W})\frac{1}{d
^{2}}{\rm Tr}\,{\mathbf{S}}_{2}^{a}{\mathbf{S}}_{2}^{b} \displaystyle=V_{W}^{kd}(\bm{\mathcal{Q}}_{W})^{-1}\int\prod_{a}^{0,s}dP_{W}({
\mathbf{W}}^{a})\frac{1}{kd^{2}}{\rm Tr}[{\mathbf{W}}^{a\intercal}({\mathbf{v}
}){\mathbf{W}}^{a}{\mathbf{W}}^{b\intercal}({\mathbf{v}}){\mathbf{W}}^{b}] \displaystyle\qquad\qquad\times\prod_{a\leq b}^{0,s}\prod_{\mathsf{v}\in
\mathsf{V}}\prod_{i\in\mathcal{I}_{v}}\delta({d}\,\mathcal{Q}_{W}^{ab}(\mathsf
{v})-{\mathbf{W}}^{a\intercal}_{i}{\mathbf{W}}_{i}^{b}). \tag{58}
$$
The measure is coupled only through the latter $\delta$ ’s. We can decouple the measure at the cost of introducing Fourier conjugates whose values will then be fixed by a saddle point computation. The second moment computed will not affect the saddle point, hence it is sufficient to determine the value of the Fourier conjugates through the computation of $V_{W}^{kd}(\bm{\mathcal{Q}}_{W})$ , which rewrites as
$$
\displaystyle V_{W}^{kd}(\bm{\mathcal{Q}}_{W}) \displaystyle=\int\prod_{a}^{0,s}dP_{W}({\mathbf{W}}^{a})\prod_{a\leq b}^{0,s}
\prod_{\mathsf{v}\in\mathsf{V}}\prod_{i\in\mathcal{I}_{\mathsf{v}}}d\hat{B}^{
ab}_{i}(\mathsf{v})\exp\big{[}-\hat{B}^{ab}_{i}(\mathsf{v})({d}\,\mathcal{Q}_{
W}^{ab}(\mathsf{v})-{\mathbf{W}}^{a\intercal}_{i}{\mathbf{W}}_{i}^{b})\big{]} \displaystyle\approx\prod_{\mathsf{v}\in\mathsf{V}}\prod_{i\in\mathcal{I}_{
\mathsf{v}}}\exp\Big{(}d\,{\rm extr}_{(\hat{B}^{ab}_{i}(\mathsf{v}))}\Big{[}-
\sum_{a\leq b,0}^{s}\hat{B}^{ab}_{i}(\mathsf{v})\mathcal{Q}_{W}^{ab}(\mathsf{v
})+\ln\int\prod_{a=0}^{s}dP_{W}(w_{a})e^{\sum_{a\leq b,0}^{s}\hat{B}_{i}^{ab}(
\mathsf{v})w_{a}w_{b}}\Big{]}\Big{)}. \tag{59}
$$
In the last line we used a saddle point integration over $\hat{B}^{ab}_{i}(\mathsf{v})$ ; the approximate equality holds up to a multiplicative $\exp(o(n))$ factor. From the above, it is clear that the stationary $\hat{B}^{ab}_{i}(\mathsf{v})$ are such that
$$
\displaystyle\mathcal{Q}_{W}^{ab}(\mathsf{v})=\frac{\int\prod_{r=0}^{s}dP_{W}(
w_{r})w_{a}w_{b}\prod_{r\leq t,0}^{s}e^{\hat{B}_{i}^{rt}(\mathsf{v})w_{r}w_{t}
}}{\int\prod_{r=0}^{s}dP_{W}(w_{r})\prod_{r\leq t,0}^{s}e^{\hat{B}_{i}^{rt}(
\mathsf{v})w_{r}w_{t}}}=:\langle w_{a}w_{b}\rangle_{\hat{\mathbf{B}}(\mathsf{v
})}. \tag{60}
$$
Hence the stationary $\hat{B}_{i}^{ab}(\mathsf{v})=\hat{B}^{ab}(\mathsf{v})$ are homogeneous, i.e., independent of $i$ . With these notations, the trace moment of the ${\mathbf{S}}_{2}$ ’s at leading order becomes
$$
\displaystyle\int dP(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W}) \displaystyle\frac{1}{d^{2}}{\rm Tr}\,{\mathbf{S}}_{2}^{a}{\mathbf{S}}_{2}^{b}
=\frac{1}{kd^{2}}\sum_{i,l=1}^{k}\sum_{j,p=1}^{d}\langle W_{ij}^{a}v_{i}W_{ip}
^{a}W_{lj}^{b}v_{l}W_{lp}^{b}\rangle_{\{\hat{\mathbf{B}}(\mathsf{v})\}_{
\mathsf{v}\in\mathsf{V}}} \displaystyle=\frac{1}{k}\sum_{\mathsf{v}\in\mathsf{V}}\mathsf{v}^{2}\sum_{i
\in\mathcal{I}_{\mathsf{v}}}\Big{\langle}\Big{(}\frac{1}{d}\sum_{j=1}^{d}W_{ij
}^{a}W_{ij}^{b}\Big{)}^{2}\Big{\rangle}_{\hat{\mathbf{B}}(\mathsf{v})}+\frac{1
}{k}\sum_{j=1}^{d}\Big{\langle}\sum_{i=1}^{k}\frac{v_{i}(W_{ij}^{a})^{2}}{d}
\sum_{l\neq i,1}^{k}\frac{v_{l}(W_{lj}^{b})^{2}}{d}\Big{\rangle}_{\hat{\mathbf
{B}}(\mathsf{v})}. \tag{61}
$$
We have used the fact that $\smash{\langle\,\cdot\,\rangle_{\hat{\mathbf{B}}(\mathsf{v})}}$ is symmetric if the prior $P_{W}$ is, thus forcing us to match $j$ with $p$ when $i\neq l$ . Since, by the Nishimori identities, $\mathcal{Q}_{W}^{aa}(\mathsf{v})=1$ , it follows that $\hat{B}^{aa}(\mathsf{v})=0$ for any $a=0,1,\dots,s$ and $\mathsf{v}\in\mathsf{V}$ . Furthermore, the measure $\langle\,\cdot\,\rangle_{\hat{\mathbf{B}}(\mathsf{v})}$ is completely factorised over neuron and input indices. Hence every normalised sum can be assumed to concentrate onto its expectation by the law of large numbers. Specifically, we can write that with high probability as $d,k\to\infty$ ,
$$
\displaystyle\frac{1}{d}\sum_{j=1}^{d}W_{ij}^{a}W_{ij}^{b}\xrightarrow{}
\mathcal{Q}_{W}^{ab}(\mathsf{v})\ \forall\ i\in\mathcal{I}_{\mathsf{v}},\qquad
\frac{1}{k}\sum_{\mathsf{v},\mathsf{v}^{\prime}\in\mathsf{V}}\mathsf{v}\mathsf
{v}^{\prime}\sum_{j=1}^{d}\sum_{i\in\mathcal{I}_{\mathsf{v}}}\frac{(W_{ij}^{a}
)^{2}}{d}\sum_{l\in\mathcal{I}_{\mathsf{v}^{\prime}},l\neq i}\frac{(W_{lj}^{b}
)^{2}}{d}\approx\gamma\sum_{\mathsf{v},\mathsf{v}^{\prime}\in\mathsf{V}}\frac{
|\mathcal{I}_{\mathsf{v}}||\mathcal{I}_{\mathsf{v}^{\prime}}|}{k^{2}}\mathsf{v
}\mathsf{v}^{\prime}\to\gamma\bar{v}^{2}, \tag{62}
$$
where we used $|\mathcal{I}_{\mathsf{v}}|/k\to P_{v}(\mathsf{v})$ as $k$ diverges. Consequently, the second moment at leading order appears as claimed:
$$
\displaystyle\int dP(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W}) \displaystyle\frac{1}{d^{2}}{\rm Tr}\,{\mathbf{S}}_{2}^{a}{\mathbf{S}}_{2}^{b}
=\sum_{\mathsf{v}\in\mathsf{V}}P_{v}(\mathsf{v})\mathsf{v}^{2}\mathcal{Q}_{W}^
{ab}(\mathsf{v})^{2}+\gamma\bar{v}^{2}=\mathbb{E}_{v\sim P_{v}}v^{2}\mathcal{Q
}_{W}^{ab}(v)^{2}+\gamma\bar{v}^{2}. \tag{63}
$$
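As a sanity check, the leading-order moment (63) can be verified numerically on finite-size matrices. The sketch below assumes the form ${\mathbf{S}}_{2}=k^{-1/2}\sum_{i}v_{i}\mathbf{w}_{i}\mathbf{w}_{i}^{\intercal}$ , consistent with the trace in (61); the replica is drawn with a fixed scalar overlap $Q$ with the teacher, and all sizes and distributions are illustrative choices, not the paper's code:

```python
import numpy as np

# Hedged numerical check of the leading-order moment (63), assuming
# S_2 = k^{-1/2} * sum_i v_i w_i w_i^T (consistent with the trace in (61)).
rng = np.random.default_rng(0)
d, k, Q = 500, 250, 0.5
gamma = k / d
W0 = rng.standard_normal((k, d))                            # teacher weights
Wb = Q * W0 + np.sqrt(1 - Q**2) * rng.standard_normal((k, d))  # replica at overlap Q
v = 1.0 + rng.standard_normal(k)                            # readouts, P_v = N(1, 1)
S0 = (W0.T * v) @ W0 / np.sqrt(k)
Sb = (Wb.T * v) @ Wb / np.sqrt(k)
moment = np.trace(S0 @ Sb) / d**2
# prediction (63) with empirical E v^2 and vbar: E v^2 Q^2 + gamma * vbar^2
theory = np.mean(v**2) * Q**2 + gamma * np.mean(v) ** 2
```

At these sizes the two values agree to a few percent, with the diagonal ( $i=l$ ) term giving $\mathbb{E}v^{2}Q^{2}$ and the cross term giving $\gamma\bar{v}^{2}$ .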
Notice that the effective law $\tilde{P}(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W})$ in (15) is the least restrictive choice among the Wishart-type distributions with a trace moment fixed precisely to the one above. In more specific terms, it is the solution of the following maximum entropy problem:
$$
\displaystyle\inf_{P,\tau}\Big{\{}D_{\rm KL}(P\,\|\,P_{S}^{\otimes s+1})+\sum_
{a\leq b,0}^{s}\tau^{ab}\Big{(}\mathbb{E}_{P}\frac{1}{d^{2}}{\rm Tr}\,{\mathbf
{S}}_{2}^{a}{\mathbf{S}}_{2}^{b}-\gamma\bar{v}^{2}-\mathbb{E}_{v\sim P_{v}}v^{
2}\mathcal{Q}_{W}^{ab}(v)^{2}\Big{)}\Big{\}}, \tag{64}
$$
where $P_{S}$ is a generalised Wishart distribution (as defined above (15)), and $P$ is in the space of joint probability distributions over $s+1$ symmetric matrices of dimension $d\times d$ . The rationale behind the choice of $P_{S}$ as a base measure is that, in the absence of any other information, a statistician can always use a generalised Wishart measure for the ${\mathbf{S}}_{2}$ ’s if they assume universality in the law of the inner weights. This ansatz would yield the theory of Maillard et al. (2024a), which still describes non-trivial performance, achieved by the adaptation of GAMP-RIE presented in Appendix H.
Note that if $a=b$ then, by (51), the second moment above matches precisely $r_{2}=1+\gamma\bar{v}^{2}$ . This entails directly $\tau^{aa}=0$ , as the generalised Wishart prior $P_{S}$ already imposes this constraint.
### D.3 Entropic potential
We now use the results from the previous section to compute the entropic contribution $F_{S}$ to the free entropy:
$$
\displaystyle e^{F_{S}}:=V_{W}^{kd}(\bm{\mathcal{Q}}_{W})\int dP(({\mathbf{S}}
_{2}^{a})\mid\bm{\mathcal{Q}}_{W})\prod_{a\leq b}^{0,s}\delta(d^{2}Q_{2}^{ab}-
{{\rm Tr}\,{\mathbf{S}}_{2}^{a}{\mathbf{S}}_{2}^{b}}). \tag{65}
$$
The factor $V_{W}^{kd}(\bm{\mathcal{Q}}_{W})$ was already treated in the previous section. However, here it will contribute as a tilt of the overall entropic contribution, and the Fourier conjugates $\hat{\mathcal{Q}}_{W}^{ab}(\mathsf{v})$ will appear in the final variational principle.
Let us now proceed with the relaxation of the measure $P(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W})$ by replacing it with $\tilde{P}(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W})$ given by (15):
$$
\displaystyle e^{F_{S}}=V_{W}^{kd}(\bm{\mathcal{Q}}_{W})\int d\hat{{\mathbf{Q}
}}_{2}\exp\Big{(}-\frac{d^{2}}{2}\sum_{a\leq b,0}^{s}\hat{Q}^{ab}_{2}Q^{ab}_{2
}\Big{)}\frac{1}{\tilde{V}^{kd}_{W}(\bm{\mathcal{Q}}_{W})}\int\prod_{a=0}^{s}
dP_{S}({\mathbf{S}}_{2}^{a})\exp\Big{(}\sum_{a\leq b,0}^{s}\frac{\tau_{ab}+
\hat{Q}_{2}^{ab}}{2}{\rm Tr}\,{\mathbf{S}}_{2}^{a}{\mathbf{S}}_{2}^{b}\Big{)} \tag{66}
$$
where we have introduced another set of Fourier conjugates $\hat{\mathbf{Q}}_{2}$ for ${\mathbf{Q}}_{2}$ . As usual, the Nishimori identities impose $Q_{2}^{aa}=r_{2}=1+\gamma\bar{v}^{2}$ without the need for any Fourier conjugate; hence, similarly to $\tau^{aa}$ , $\hat{Q}_{2}^{aa}=0$ too. Furthermore, under the hypothesis of replica symmetry, we set $\tau^{ab}=\tau$ and $\hat{Q}_{2}^{ab}=\hat{q}_{2}$ for all $0\leq a<b\leq s$ .
Then, when the number of replicas $s$ tends to $0^{+}$ , we can recognise the free entropy of a matrix denoising problem. More specifically, using the Hubbard–Stratonovich transformation (i.e., $\mathbb{E}_{{\mathbf{Z}}}\exp(\frac{d}{2}{\rm Tr}\,{\mathbf{M}}{\mathbf{Z}})= \exp(\frac{d}{4}{\rm Tr}\,{\mathbf{M}}^{2})$ for a $d\times d$ symmetric matrix ${\mathbf{M}}$ with ${\mathbf{Z}}$ a standard GOE matrix) we get
$$
\displaystyle J_{n}(\tau,\hat{q}_{2}) \displaystyle:=\lim_{s\to 0^{+}}\frac{1}{ns}\ln\int\prod_{a=0}^{s}dP_{S}({
\mathbf{S}}_{2}^{a})\exp\Big{(}\frac{\tau+\hat{q}_{2}}{2}\sum_{a<b,0}^{s}{\rm
Tr
}\,{\mathbf{S}}_{2}^{a}{\mathbf{S}}_{2}^{b}\Big{)} \displaystyle=\frac{1}{n}\mathbb{E}\ln\int dP_{S}({\mathbf{S}}_{2})\exp\frac{1
}{2}{\rm Tr}\Big{(}\sqrt{\tau+\hat{q}_{2}}{\mathbf{Y}}{\mathbf{S}}_{2}-(\tau+
\hat{q}_{2})\frac{{\mathbf{S}}_{2}^{2}}{2}\Big{)}, \tag{67}
$$
where ${\mathbf{Y}}={\mathbf{Y}}(\tau+\hat{q}_{2})=\sqrt{\tau+\hat{q}_{2}}{\mathbf{S}}_{2}^{0}+{\bm{\xi}}$ with ${\bm{\xi}}/\sqrt{d}$ a standard GOE matrix, and the outer expectation is w.r.t. ${\mathbf{Y}}$ (or ${\mathbf{S}}_{2}^{0},{\bm{\xi}}$ ). Thanks to the fact that the base measure $P_{S}$ is rotationally invariant, the above can be solved exactly in the limit $n\to\infty,\,n/d^{2}\to\alpha$ (see e.g. Pourkamali et al. (2024)):
$$
\displaystyle J(\tau,\hat{q}_{2})=\lim J_{n}(\tau,\hat{q}_{2})=\frac{1}{\alpha
}\Big{(}\frac{(\tau+\hat{q}_{2})r_{2}}{4}-\iota(\tau+\hat{q}_{2})\Big{)},\quad
\text{with}\quad\iota(\eta):=\frac{1}{8}+\frac{1}{2}\Sigma(\mu_{{\mathbf{Y}}(
\eta)}). \tag{68}
$$
Here $\iota(\eta)=\lim I({\mathbf{Y}}(\eta);{\mathbf{S}}^{0}_{2})/d^{2}$ is the limiting mutual information between data ${\mathbf{Y}}(\eta)$ and signal ${\mathbf{S}}^{0}_{2}$ for the channel ${\mathbf{Y}}(\eta)=\sqrt{\eta}{\mathbf{S}}^{0}_{2}+{\bm{\xi}}$ , the measure $\mu_{{\mathbf{Y}}(\eta)}$ is the asymptotic spectral law of the rescaled observation matrix ${\mathbf{Y}}(\eta)/\sqrt{d}$ , and $\Sigma(\mu):=\int\ln|x-y|d\mu(x)d\mu(y)$ . Using free probability, the law $\mu_{{\mathbf{Y}}(\eta)}$ can be obtained as the free convolution of a generalised Marchenko-Pastur distribution (the asymptotic spectral law of ${\mathbf{S}}^{0}_{2}$ , which is a generalised Wishart random matrix) and the semicircular distribution (the asymptotic spectral law of ${\bm{\xi}}$ ), see Potters & Bouchaud (2020). We provide the code to obtain this distribution numerically in the attached repository. The function ${\rm mmse}_{S}(\eta)$ is obtained through a derivative of $\iota$ , using the so-called I-MMSE relation Guo et al. (2005); Pourkamali et al. (2024):
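For readers without access to the repository, the free convolution can be sketched numerically through the subordination relation $m(z)=m_{\mu}\big(z-t\,m(z)\big)$ for the Cauchy transform $m(z)=\int d\mu(x)/(z-x)$ when convolving a law $\mu$ with a semicircle of variance $t$ (see Potters & Bouchaud (2020)); the density is then $-\operatorname{Im}m(x+i0)/\pi$ . The following is an illustrative stand-in, not the paper's code: $\mu$ is represented by an empirical eigenvalue sample, and we check against the identity semicircle $(1)\boxplus$ semicircle $(1)=$ semicircle $(2)$ :

```python
import numpy as np

# Free additive convolution mu ⊞ semicircle(t) via the subordination relation
# m(z) = m_mu(z - t*m(z)) for the Cauchy transform m(z) = int dmu(x)/(z - x);
# the density is -Im m(x + i0)/pi. Illustrative sketch, not the paper's code.
def cauchy(evals, z):
    return np.mean(1.0 / (z - evals))

def free_conv_density(evals, x, t=1.0, eps=1e-3, n_iter=500):
    """Density of (empirical law of `evals`) ⊞ semicircle(variance t) at x."""
    z = x + 1j * eps
    m = -1j                                           # start in the lower half-plane
    for _ in range(n_iter):
        m = 0.5 * m + 0.5 * cauchy(evals, z - t * m)  # damped fixed point
    return -m.imag / np.pi

# sanity check: semicircle(1) ⊞ semicircle(1) = semicircle(2)
rng = np.random.default_rng(0)
d = 1000
A = rng.standard_normal((d, d))
goe = (A + A.T) / np.sqrt(2 * d)          # spectrum ~ semicircle of variance 1
evals = np.linalg.eigvalsh(goe)
rho0 = free_conv_density(evals, 0.0)      # exact value: 1/(pi*sqrt(2)) ~ 0.225
```

The damped iteration converges because the argument $z-t\,m$ stays well inside the upper half-plane, where the empirical Cauchy transform is smooth.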
$$
\displaystyle 4\frac{d}{d\eta}\iota(\eta)={\rm mmse}_{S}(\eta)=\frac{1}{\eta}
\Big{(}1-\frac{4\pi^{2}}{3}\int\mu^{3}_{{\mathbf{Y}}(\eta)}(y)dy\Big{)}. \tag{69}
$$
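Formula (69) can also be evaluated from a sampled spectrum: diagonalise ${\mathbf{Y}}(\eta)/\sqrt{d}$ , estimate the density $\mu_{{\mathbf{Y}}(\eta)}$ by a histogram, and insert $\int\mu^{3}$ into (69). A hedged Monte Carlo sketch follows; the construction of ${\mathbf{S}}_{2}^{0}$ assumes the generalised-Wishart form discussed above, the sizes and the value of $\eta$ are illustrative, and the histogram estimate of $\int\mu^{3}$ is crude:

```python
import numpy as np

# Monte Carlo sketch of mmse_S(eta) from the spectral formula (69).
rng = np.random.default_rng(1)
d, k, eta = 600, 300, 2.0
W0 = rng.standard_normal((k, d))
v = rng.standard_normal(k)                 # P_v with E v^2 = 1, vbar = 0
S0 = (W0.T * v) @ W0 / np.sqrt(k)          # generalised-Wishart-type signal
A = rng.standard_normal((d, d))
xi = (A + A.T) / np.sqrt(2)                # xi / sqrt(d) is a standard GOE matrix
Y = np.sqrt(eta) * S0 + xi
evals = np.linalg.eigvalsh(Y / np.sqrt(d))
dens, edges = np.histogram(evals, bins=60, density=True)
int_mu3 = np.sum(dens**3) * (edges[1] - edges[0])
mmse = (1.0 - (4 * np.pi**2 / 3) * int_mu3) / eta      # formula (69)
```

For a pure semicircle (no signal) the bracket vanishes and ${\rm mmse}_{S}=0$ ; a genuine signal widens the spectrum, lowers $\int\mu^{3}$ , and yields ${\rm mmse}_{S}\in(0,r_{2})$ .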
The normalisation $\frac{1}{ns}\ln\tilde{V}_{W}^{kd}(\bm{\mathcal{Q}}_{W})$ in the limit $n\to\infty,s\to 0^{+}$ can be simply computed as $J(\tau,0)$ .
For the other normalisation, following the same steps as in the previous section, we can simplify $V^{kd}_{W}(\bm{\mathcal{Q}}_{W})$ as follows:
$$
\displaystyle\frac{1}{ns}\ln V_{W}^{kd}(\bm{\mathcal{Q}}_{W})\approx\frac{
\gamma}{\alpha s}\sum_{\mathsf{v}\in\mathsf{V}}\frac{1}{k}\sum_{i\in\mathcal{I
}_{\mathsf{v}}}{\rm extr}\Big{[}-\sum_{a\leq b,0}^{s}\hat{\mathcal{Q}}^{ab}_{W
,i}(\mathsf{v})\mathcal{Q}^{ab}_{W}(\mathsf{v})+\ln\int\prod_{a=0}^{s}dP_{W}(w
_{a})e^{\sum_{a\leq b,0}^{s}\hat{\mathcal{Q}}^{ab}_{W,i}(\mathsf{v})w_{a}w_{b}
}\Big{]}, \tag{70}
$$
as $n$ grows, where extremisation is w.r.t. the hatted variables only. As in the previous section, $\hat{\mathcal{Q}}^{ab}_{W,i}(\mathsf{v})$ is homogeneous over $i\in\mathcal{I}_{\mathsf{v}}$ for a given $\mathsf{v}$ . Furthermore, thanks to the Nishimori identities, at the saddle point $\hat{\mathcal{Q}}_{W}^{aa}(\mathsf{v})=0$ and ${\mathcal{Q}}_{W}^{aa}(\mathsf{v})=1$ . This, together with standard steps and the RS ansatz, allows us to write the $d\to\infty,s\to 0^{+}$ limit of the above as
$$
\displaystyle\lim_{s\to 0^{+}}\lim\frac{1}{ns}\ln V_{W}^{kd}(\bm{\mathcal{Q}}_
{W})=\frac{\gamma}{\alpha}\mathbb{E}_{v\sim P_{v}}{\rm extr}\Big{[}-\frac{\hat
{\mathcal{Q}}_{W}(v)\mathcal{Q}_{W}(v)}{2}+\psi_{P_{W}}(\hat{\mathcal{Q}}_{W}(
v))\Big{]} \tag{71}
$$
with $\psi_{P_{W}}(\,\cdot\,)$ defined as in the main text. Gathering all these results directly yields
$$
\displaystyle\lim_{s\to 0^{+}}\lim\frac{F_{S}}{ns}={\rm extr}\Big{\{} \displaystyle\frac{\hat{q}_{2}(r_{2}-q_{2})}{4\alpha}-\frac{1}{\alpha}\big{[}
\iota(\tau+\hat{q}_{2})-\iota(\tau)\big{]}+\frac{\gamma}{\alpha}\mathbb{E}_{v
\sim P_{v}}\Big{[}\psi_{P_{W}}(\hat{\mathcal{Q}}_{W}(v))-\frac{\hat{\mathcal{Q
}}_{W}(v)\mathcal{Q}_{W}(v)}{2}\Big{]}\Big{\}}. \tag{72}
$$
Extremisation is w.r.t. $\hat{q}_{2},\hat{\mathcal{Q}}_{W}$ , while $\tau$ is to be understood as a function of $\mathcal{Q}_{W}=\{{\mathcal{Q}}_{W}(\mathsf{v})\mid\mathsf{v}\in\mathsf{V}\}$ through the moment matching condition:
$$
\displaystyle 4\alpha\,\partial_{\tau}J(\tau,0)=r_{2}-4\iota^{\prime}(\tau)=
\mathbb{E}_{v\sim P_{v}}v^{2}\mathcal{Q}_{W}(v)^{2}+\gamma\bar{v}^{2}, \tag{73}
$$
which is the $s\to 0^{+}$ limit of the moment matching condition between $P(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W})$ and $\tilde{P}(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W})$ . Simplifying using the value of $r_{2}=1+\gamma\bar{v}^{2}$ according to the Nishimori identities, and using the I-MMSE relation between $\iota(\tau)$ and ${\rm mmse}_{S}(\tau)$ , we get
$$
\displaystyle{\rm mmse}_{S}(\tau)=1-\mathbb{E}_{v\sim P_{v}}v^{2}\mathcal{Q}_{
W}(v)^{2}\quad\iff\quad\tau={\rm mmse}_{S}^{-1}\big{(}1-\mathbb{E}_{v\sim P_{v
}}v^{2}\mathcal{Q}_{W}(v)^{2}\big{)}. \tag{74}
$$
Since ${\rm mmse}_{S}$ is a monotonically decreasing function of its argument (and thus invertible), the above always admits a solution, which is unique for a given collection $\mathcal{Q}_{W}$ .
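Given this monotonicity, (74) can be solved by simple bisection. A minimal sketch follows; the true ${\rm mmse}_{S}$ requires the spectral computation above, so here a scalar stand-in ${\rm mmse}(\eta)=1/(1+\eta)$ (the Gaussian scalar-channel MMSE) is used purely for illustration:

```python
def invert_mmse(target, mmse, lo=1e-9, hi=1e9, tol=1e-10):
    """Solve mmse(eta) = target for eta by bisection (mmse is decreasing)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mmse(mid) > target:
            lo = mid          # mmse still too large -> need larger eta
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# stand-in mmse(eta) = 1/(1 + eta): mmse(3) = 0.25, so the inverse at 0.25 is 3
tau = invert_mmse(0.25, lambda eta: 1.0 / (1.0 + eta))
```
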
### D.4 RS free entropy and saddle point equations
Putting the energetic and entropic contributions together we obtain the variational replica symmetric free entropy potential:
$$
\displaystyle f^{\alpha,\gamma}_{\rm RS} \displaystyle:=\psi_{P_{\text{out}}}(q_{K}(q_{2},\mathcal{Q}_{W});r_{K})+\frac
{1}{4\alpha}(1+\gamma\bar{v}^{2}-q_{2})\hat{q}_{2}+\frac{\gamma}{\alpha}
\mathbb{E}_{v\sim P_{v}}\big{[}\psi_{P_{W}}(\hat{\mathcal{Q}}_{W}(v))-\frac{1}
{2}\mathcal{Q}_{W}(v)\hat{\mathcal{Q}}_{W}(v)\big{]} \displaystyle\qquad+\frac{1}{\alpha}\big{[}\iota(\tau(\mathcal{Q}_{W}))-\iota(
\hat{q}_{2}+\tau(\mathcal{Q}_{W}))\big{]}, \tag{75}
$$
which is then extremised w.r.t. $\{\hat{\mathcal{Q}}_{W}(\mathsf{v}),\mathcal{Q}_{W}(\mathsf{v})\mid\mathsf{v} \in\mathsf{V}\},\hat{q}_{2},q_{2}$ while $\tau$ is a function of ${\mathcal{Q}}_{W}$ through the moment matching condition (74). The saddle point equations are then
$$
\left[\begin{array}[]{@{}l@{\quad}l@{}}&{\mathcal{Q}}_{W}(\mathsf{v})=\mathbb{
E}_{w^{0},\xi}[w^{0}{\langle w\rangle}_{\hat{\mathcal{Q}}_{W}(\mathsf{v})}],\\
&\hat{\mathcal{Q}}_{W}(\mathsf{v})=\frac{1}{2\gamma}(q_{2}-\gamma\bar{v}^{2}-
\mathbb{E}_{v\sim P_{v}}v^{2}\mathcal{Q}_{W}(v)^{2})\partial_{{\mathcal{Q}}_{W
}(\mathsf{v})}\tau(\mathcal{Q}_{W})+2\frac{\alpha}{\gamma}\partial_{{\mathcal{
Q}}_{W}(\mathsf{v})}\psi_{P_{\text{out}}}(q_{K}(q_{2},\mathcal{Q}_{W});r_{K}),
\\
&q_{2}=r_{2}-\frac{1}{\hat{q}_{2}+\tau(\mathcal{Q}_{W})}(1-\frac{4\pi^{2}}{3}
\int\mu^{3}_{{\mathbf{Y}}(\hat{q}_{2}+\tau(\mathcal{Q}_{W}))}(y)dy),\\
&\hat{q}_{2}=4\alpha\,\partial_{q_{2}}\psi_{P_{\text{out}}}(q_{K}(q_{2},
\mathcal{Q}_{W});r_{K}),\end{array}\right. \tag{76}
$$
where, letting i.i.d. $w^{0},\xi\sim\mathcal{N}(0,1)$ , we define the measure
$$
\displaystyle\langle\,\cdot\,\rangle_{x}=\langle\,\cdot\,\rangle_{x}(w^{0},\xi
):=\frac{\int dP_{W}(w)(\,\cdot\,)e^{(\sqrt{x}\xi+xw^{0})w-\frac{1}{2}xw^{2}}}
{\int dP_{W}(w)e^{(\sqrt{x}\xi+xw^{0})w-\frac{1}{2}xw^{2}}}. \tag{77}
$$
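For discrete priors, the scalar measure (77) reduces to a finite sum. A small sketch: for the Rademacher prior $P_{W}=\mathrm{Unif}\{-1,+1\}$ the factor $e^{-xw^{2}/2}$ is constant over the support, and $\langle w\rangle_{x}=\tanh(\sqrt{x}\,\xi+xw^{0})$ in closed form, which serves as a check (the parameter values below are arbitrary):

```python
import numpy as np

# Numerical evaluation of the scalar measure (77) for a discrete prior P_W.
def mean_w(x, w0, xi, support, probs):
    h = np.sqrt(x) * xi + x * w0
    logw = h * support - 0.5 * x * support**2 + np.log(probs)
    logw -= logw.max()                     # shift for numerical stability
    p = np.exp(logw)
    return np.sum(support * p) / np.sum(p)

supp = np.array([-1.0, 1.0])
probs = np.array([0.5, 0.5])               # Rademacher prior
val = mean_w(0.7, 1.0, 0.3, supp, probs)
# closed form for this prior: tanh(sqrt(0.7)*0.3 + 0.7*1.0)
```
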
All the above formulae are easily specialised for the linear readout with Gaussian label noise using (56). We report here the saddle point equations in this case (recalling that $g$ is defined in (43)):
$$
\left[\begin{array}[]{@{}l@{\quad}l@{}}&{\mathcal{Q}}_{W}(\mathsf{v})=\mathbb{
E}_{w^{0},\xi}[w^{0}{\langle w\rangle}_{\hat{\mathcal{Q}}_{W}(\mathsf{v})}],\\
&\hat{\mathcal{Q}}_{W}(\mathsf{v})=\frac{1}{2\gamma}(q_{2}-\gamma\bar{v}^{2}-
\mathbb{E}_{v\sim P_{v}}v^{2}\mathcal{Q}_{W}(v)^{2})\partial_{{\mathcal{Q}}_{W
}(\mathsf{v})}\tau(\mathcal{Q}_{W})+\frac{\alpha}{\gamma}\frac{\mathsf{v}^{2}
\,g^{\prime}(\mathcal{Q}_{W}(\mathsf{v}))}{\Delta+\frac{1}{2}\mu_{2}^{2}(r_{2}
-q_{2})+g(1)-\mathbb{E}_{v\sim P_{v}}{v}^{2}g(\mathcal{Q}_{W}(v))},\\
&q_{2}=r_{2}-\frac{1}{\hat{q}_{2}+\tau}(1-\frac{4\pi^{2}}{3}\int\mu^{3}_{{
\mathbf{Y}}(\hat{q}_{2}+\tau(\mathcal{Q}_{W}))}(y)dy),\\
&\hat{q}_{2}=\frac{\alpha\mu_{2}^{2}}{\Delta+\frac{1}{2}\mu_{2}^{2}(r_{2}-q_{2
})+g(1)-\mathbb{E}_{v\sim P_{v}}v^{2}g(\mathcal{Q}_{W}(v))}.\end{array}\right. \tag{78}
$$
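Systems of saddle point equations like the above are typically solved by damped fixed-point iteration until the order parameters stop moving. A generic sketch of the scheme follows; the update map here is a toy stand-in ( $x\mapsto\cos x$ ), not the paper's equations:

```python
import math

def fixed_point(update, x0, damping=0.5, tol=1e-10, max_iter=10_000):
    """Damped iteration x <- (1 - damping)*x + damping*update(x)."""
    x = x0
    for _ in range(max_iter):
        x_new = (1.0 - damping) * x + damping * update(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# toy contraction: x = cos(x) has a unique fixed point near 0.739
root = fixed_point(math.cos, 0.0)
```

In practice one iterates the four updates of the system jointly, with the damping factor tuned to stabilise convergence near phase transitions.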
If one assumes that the overlaps appearing in (38) are self-averaging around the values that solve the saddle point equations (and maximise the RS potential), that is $Q^{00}_{1},Q_{1}^{01}\to 1$ (as assumed in this scaling), $Q_{2}^{00}\to r_{2},Q_{2}^{01}\to q_{2}^{*}$ , and ${\mathcal{Q}}_{W}^{00}(\mathsf{v})\to 1,{\mathcal{Q}}_{W}^{01}(\mathsf{v})\to{ \mathcal{Q}}_{W}^{*}(\mathsf{v})$ , then the limiting Bayes-optimal mean-square generalisation error for the linear readout with Gaussian noise case appears as
$$
\displaystyle\varepsilon^{\rm opt}-\Delta=r_{K}-q_{K}^{*}=\frac{\mu_{2}^{2}}{2
}(r_{2}-q_{2}^{*})+g(1)-\mathbb{E}_{v\sim P_{v}}v^{2}g(\mathcal{Q}^{*}_{W}(v)). \tag{79}
$$
This is the formula used to evaluate the theoretical Bayes-optimal mean-square generalisation error reported throughout the paper.
### D.5 Non-centred activations
Consider a non-centred activation function, i.e., $\mu_{0}\neq 0$ in (17). This reflects on the law of the post-activations, which will still be Gaussian, centred at
$$
\displaystyle\mathbb{E}_{\mathbf{x}}\lambda^{a}=\frac{\mu_{0}}{\sqrt{k}}\sum_{
i=1}^{k}v_{i}=:\mu_{0}\Lambda, \tag{80}
$$
and with the covariance given by (8) (we are assuming $Q_{W}^{aa}=1$ , otherwise, for $Q_{W}^{aa}=r$ , the formula can be generalised as explained in App. A, and that the readout weights are quenched). In the above, we have introduced the new mean parameter $\Lambda$ . Notice that, if the ${\mathbf{v}}$ ’s have a $\bar{v}=O(1)$ mean, then $\Lambda$ scales as $\sqrt{k}$ due to our choice of normalisation.
One can carry out the replica computation for a fixed $\Lambda$ . This new parameter, being quenched, does not affect the entropic term. It will only appear in the energetic term as a shift to the means, yielding
$$
F_{E}=F_{E}({\mathbf{K}},\Lambda)=\ln\int dy\int d{\bm{\lambda}}\frac{e^{-
\frac{1}{2}{\bm{\lambda}}^{\intercal}{\mathbf{K}}^{-1}{\bm{\lambda}}}}{\sqrt{(
2\pi)^{s+1}\det{\mathbf{K}}}}\prod_{a=0}^{s}P_{\rm{out}}(y\mid\lambda^{a}+\mu_
{0}\Lambda). \tag{81}
$$
Within the replica symmetric ansatz, the above turns into
$$
\displaystyle e^{F_{E}}=\int dy\,\mathbb{E}_{\xi,u^{0}}P_{\rm out}\Big{(}y\mid\mu_{0}\Lambda+\xi\sqrt{\frac{m_{K}^{2}}{q_{K}}}+u^{0}\sqrt{\rho_{K}-\frac{m_{K}^{2}}{q_{K}}}\Big{)}\prod_{a=1}^{s}\mathbb{E}_{u^{a}}P_{\rm out}(y\mid\mu_{0}\Lambda+\xi\sqrt{q_{K}}+u^{a}\sqrt{r_{K}-q_{K}}).
$$
Therefore, the simplification of the potential $F_{E}$ proceeds as in the centred activation case, yielding at leading order in the number $s$ of replicas
$$
\displaystyle\frac{F_{E}(r_{K},q_{K},\Lambda)}{s}=\int dy\,\mathbb{E}_{\xi,u^{0}}P_{\rm out}\Big{(}y\mid\mu_{0}\Lambda+\xi\sqrt{q_{K}}+u^{0}\sqrt{r_{K}-q_{K}}\Big{)}\ln\mathbb{E}_{u}P_{\rm out}(y\mid\mu_{0}\Lambda+\xi\sqrt{q_{K}}+u\sqrt{r_{K}-q_{K}})+O(s)
$$
in the Bayes-optimal setting. When $P_{\rm out}(y\mid\lambda)=f(y-\lambda)$ , one can verify that the contributions due to the means, containing $\mu_{0}$ , cancel each other. This is verified in our running example where $P_{\rm out}$ is the Gaussian channel:
$$
\frac{F_{E}(r_{K},q_{K},\Lambda)}{s}=-\frac{1}{2}\ln\big{[}2\pi(\Delta+r_{K}-q
_{K})\big{]}-\frac{1}{2}-\frac{\mu_{0}^{2}}{2}\frac{(\Lambda-\Lambda)^{2}}{
\Delta+r_{K}-q_{K}}+O(s)=-\frac{1}{2}\ln\big{[}2\pi(\Delta+r_{K}-q_{K})\big{]}
-\frac{1}{2}+O(s). \tag{82}
$$
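The cancellation in (82) can be checked directly: for the Gaussian channel, $\mathbb{E}_{u}P_{\rm out}(y\mid a+u\sqrt{r_{K}-q_{K}})=\mathcal{N}(y;a,\Delta+r_{K}-q_{K})$ in closed form, and the shift $\mu_{0}\Lambda$ drops out of $y-a$ sample by sample. A Monte Carlo sketch, with illustrative parameter values:

```python
import numpy as np

def FE_over_s(shift, q, r, Delta, rng):
    """Monte Carlo estimate of F_E/s for the Gaussian channel with mean shift."""
    xi = rng.standard_normal(100_000)
    u0 = rng.standard_normal(100_000)
    z = rng.standard_normal(100_000)
    y = shift + xi * np.sqrt(q) + u0 * np.sqrt(r - q) + np.sqrt(Delta) * z
    V = Delta + r - q
    a = shift + xi * np.sqrt(q)
    # ln E_u P_out(y | a + u*sqrt(r-q)) = ln N(y; a, V) in closed form
    return np.mean(-0.5 * np.log(2 * np.pi * V) - (y - a) ** 2 / (2 * V))

q, r, Delta = 0.4, 1.0, 0.1
f0 = FE_over_s(0.0, q, r, Delta, np.random.default_rng(0))
f5 = FE_over_s(5.0, q, r, Delta, np.random.default_rng(0))
# same seed -> same draws; the shift cancels in y - a, so f0 and f5 coincide
```

Both values also agree with the closed form $-\frac{1}{2}\ln[2\pi(\Delta+r_{K}-q_{K})]-\frac{1}{2}$ up to Monte Carlo error.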
## Appendix E Alternative simplifications of $P(({\mathbf{S}}^{a}_{2})\mid\bm{{\mathcal{Q}}}_{W})$ through moment matching
A crucial step that allowed us to obtain a closed-form expression for the model’s free entropy is the relaxation $\tilde{P}(({\mathbf{S}}^{a}_{2})\mid\bm{{\mathcal{Q}}}_{W})$ (15) of the true measure $P(({\mathbf{S}}^{a}_{2})\mid\bm{{\mathcal{Q}}}_{W})$ (14) entering the replicated partition function, as explained in Sec. 4. The specific form we chose (tilted Wishart distribution with a matching second moment) has the advantage of capturing crucial features of the true measure, such as the fact that the matrices ${\mathbf{S}}^{a}_{2}$ are generalised Wishart matrices with coupled replicas, while keeping the problem solvable with techniques from the random matrix theory of rotationally invariant ensembles. In this appendix, we report some alternative routes one can take to simplify, or potentially improve, the theory.
### E.1 A factorised simplified distribution
In the specialisation phase, one can assume that the only crucial feature to keep track of when relaxing $P(({\mathbf{S}}^{a}_{2})\mid\bm{{\mathcal{Q}}}_{W})$ (14) is the coupling between different replicas, which becomes more and more relevant as $\alpha$ increases. In this case, inspired by Sakata & Kabashima (2013); Kabashima et al. (2016), we can relax (14) through the Gaussian ansatz
$$
\displaystyle d\bar{P}(({\mathbf{S}}_{2}^{a})\mid\bm{{\mathcal{Q}}}_{W})=\prod
_{a=0}^{s}d{\mathbf{S}}^{a}_{2}\prod_{\alpha=1}^{d}\delta(S^{a}_{2;\alpha
\alpha}-\sqrt{k}\bar{v})\times\prod_{\alpha_{1}<\alpha_{2}}^{d}\frac{e^{-\frac
{1}{2}\sum_{a,b=0}^{s}S^{a}_{2;\alpha_{1}\alpha_{2}}\bar{\tau}^{ab}(\bm{{
\mathcal{Q}}}_{W})S^{b}_{2;\alpha_{1}\alpha_{2}}}}{\sqrt{(2\pi)^{s+1}\det(\bar
{\bm{\tau}}(\bm{{\mathcal{Q}}}_{W})^{-1})}}, \tag{83}
$$
where $\bar{v}$ is the mean of the readout prior $P_{v}$ , and $\bar{\bm{\tau}}(\bm{{\mathcal{Q}}}_{W}):=(\bar{\tau}^{ab}(\bm{{\mathcal{Q}}}_{ W}))_{a,b}$ is fixed by
$$
[\bar{\bm{\tau}}(\bm{{\mathcal{Q}}}_{W})^{-1}]_{ab}=\mathbb{E}_{v\sim P_{v}}v^
{2}{\mathcal{Q}}_{W}^{ab}(v)^{2}.
$$
In words, first, the diagonal elements of ${\mathbf{S}}_{2}^{a}$ are $d$ random variables whose $O(1)$ fluctuations cannot affect the free entropy in the asymptotic regime we are considering, being too few compared to $n=\Theta(d^{2})$ . Hence, we assume they concentrate to their mean. Concerning the $d(d-1)/2$ off-diagonal elements of the matrices $({\mathbf{S}}_{2}^{a})_{a}$ , they are zero-mean variables whose distribution at given $\bm{{\mathcal{Q}}}_{W}$ is assumed to be factorised over the input indices. The definition of $\bar{\bm{\tau}}(\bm{{\mathcal{Q}}}_{W})$ ensures matching with the true second moment (63).
(83) is considerably simpler than (15): under this ansatz, the entropic contribution to the free entropy becomes
$$
\displaystyle e^{\bar{F}_{S}}:=\int\prod_{a\leq b,0}^{s}d\hat{Q}_{2}^{ab}\,e^{
kd\ln V_{W}(\bm{\mathcal{Q}}_{W})+\frac{d^{2}}{4}{\rm Tr}\hat{\mathbf{Q}}^{
\intercal}_{2}{\mathbf{Q}}_{2}}\Big{[}\int\prod_{a=0}^{s}dS^{a}_{2}\,\frac{e^{
-\frac{1}{2}\sum_{a,b=0}^{s}S^{a}_{2}[\bar{\tau}^{ab}(\bm{{\mathcal{Q}}}_{W})+
\hat{Q}_{2}^{ab}]S^{b}_{2}}}{\sqrt{(2\pi)^{s+1}\det(\bar{\bm{\tau}}(\bm{{
\mathcal{Q}}_{W}})^{-1})}}\Big{]}^{d(d-1)/2} \displaystyle\qquad\qquad\qquad\times\int\prod_{a=0}^{s}\prod_{\alpha=1}^{d}dS
^{a}_{2;\alpha\alpha}\delta(S^{a}_{2;\alpha\alpha}-\sqrt{k}\bar{v})\,e^{-\frac
{1}{4}\sum_{a,b=0}^{s}\hat{Q}_{2}^{ab}\sum_{\alpha=1}^{d}S_{2;\alpha\alpha}^{a
}S_{2;\alpha\alpha}^{b}}, \tag{84}
$$
instead of (66). Integration over the diagonal elements $(S_{2;\alpha\alpha}^{a})_{\alpha}$ can be done straightforwardly, yielding
$$
\displaystyle e^{\bar{F}_{S}} \displaystyle=\int\prod_{a\leq b,0}^{s}d\hat{Q}_{2}^{ab}\,e^{kd\ln V_{W}(\bm{
\mathcal{Q}}_{W})+\frac{d^{2}}{4}{\rm Tr}\hat{\mathbf{Q}}_{2}^{\intercal}({
\mathbf{Q}}_{2}-\gamma\mathbf{1}\mathbf{1}^{\intercal}\bar{v}^{2})}\Big{[}\int
\prod_{a=0}^{s}dS^{a}_{2}\,\frac{e^{-\frac{1}{2}\sum_{a,b=0}^{s}S^{a}_{2}[\bar
{\tau}^{ab}(\bm{{\mathcal{Q}}}_{W})+\hat{Q}_{2}^{ab}]S^{b}_{2}}}{\sqrt{(2\pi)^
{s+1}\det(\bar{\bm{\tau}}(\bm{{\mathcal{Q}}}_{W})^{-1})}}\Big{]}^{d(d-1)/2}. \tag{85}
$$
The remaining Gaussian integral over the off-diagonal elements of ${\mathbf{S}}_{2}$ can be performed exactly, leading to
$$
\displaystyle e^{\bar{F}_{S}} \displaystyle=\int\prod_{a\leq b,0}^{s}d\hat{Q}_{2}^{ab}\,e^{kd\ln V_{W}(\bm{
\mathcal{Q}}_{W})+\frac{d^{2}}{4}{\rm Tr}\hat{\mathbf{Q}}_{2}^{\intercal}({
\mathbf{Q}}_{2}-\gamma\mathbf{1}\mathbf{1}^{\intercal}\bar{v}^{2})-\frac{d(d-1
)}{4}\ln\det[{\mathbb{I}}_{s+1}+\hat{\mathbf{Q}}_{2}\bar{\bm{\tau}}(\bm{{
\mathcal{Q}}}_{W})^{-1}]}. \tag{86}
$$
In order to proceed and perform the $s\to 0^{+}$ limit, we use the RS ansatz for the overlap matrices, combined with the Nishimori identities, as explained above. The only difference w.r.t. the approach detailed in Appendix D is the determinant in the exponent of the integrand of (86), which reads
$$
\displaystyle\ln\det[{\mathbb{I}}_{s+1}+\hat{\mathbf{Q}}_{2}\bar{\bm{\tau}}(
\bm{{\mathcal{Q}}}_{W})^{-1}]=s\ln[1+\hat{q}_{2}(1-\mathbb{E}_{v\sim P_{v}}v^{
2}\mathcal{Q}_{W}(v)^{2})]-s\hat{q}_{2}+O(s^{2}). \tag{87}
$$
After taking the replica and high-dimensional limits, the resulting free entropy is
$$
\displaystyle f_{\rm sp}^{\alpha,\gamma}={} \displaystyle\psi_{P_{\text{out}}}(q_{K}(q_{2},{\mathcal{Q}}_{W});r_{K})+\frac
{(1+\gamma\bar{v}^{2}-q_{2})\hat{q}_{2}}{4\alpha}+\frac{\gamma}{\alpha}\mathbb
{E}_{v\sim P_{v}}\big{[}\psi_{P_{W}}(\hat{\mathcal{Q}}_{W}(v))-\frac{1}{2}
\mathcal{Q}_{W}(v)\hat{{\mathcal{Q}}}_{W}(v)\big{]} \displaystyle\qquad-\frac{1}{4\alpha}\ln\big{[}1+\hat{q}_{2}(1-\mathbb{E}_{v
\sim P_{v}}v^{2}\mathcal{Q}_{W}(v)^{2})\big{]}, \tag{88}
$$
to be extremised w.r.t. $q_{2},\hat{q}_{2},\{{\mathcal{Q}}_{W}(\mathsf{v}),\hat{{\mathcal{Q}}}_{W}( \mathsf{v})\}$ . The main advantage of this expression over (75) is its simplicity: the moment-matching condition fixing $\bar{\bm{\tau}}(\bm{{\mathcal{Q}}}_{W})$ is straightforward (and has been solved explicitly in the final formula), and the result does not depend on the non-trivial (and numerically costly) function $\iota(\eta)$ , the mutual information of the associated matrix denoising problem, which has effectively been replaced by the much simpler problem of denoising independent Gaussian variables under Gaussian noise. Moreover, one can show, in the same fashion as in Appendix G, that the generalisation error predicted from this expression has the same large- $\alpha$ behaviour as the one obtained from (75). However, not surprisingly, being derived from an ansatz that ignores the Wishart-like nature of the matrices ${\mathbf{S}}_{2}^{a}$ , this expression does not reproduce the expected behaviour of the model in the universal phase, i.e., for $\alpha<\alpha_{\rm sp}(\gamma)$ .
Figure x9: Bayes-optimal generalisation error $\varepsilon^{\rm opt}$ versus $\alpha$ for the main-text theory (75) (“main text”), the factorised ansatz (88) (“sp”), and the universal theory (“uni”). All curves decrease with $\alpha$ ; “main text” and “sp” nearly coincide, especially for $\alpha>2$ , and both lie below “uni” over the whole plotted range.
Figure x10: free entropy $f$ versus $\alpha$ for the same three theories. All curves increase concavely from a common value at $\alpha=0$ , with “main text” above “sp”, and “sp” above “uni”, for all $\alpha>0$ .
<details>
<summary>x11.png Details</summary>

Overlaps versus `α ∈ [0, 7]` (Fig. 5, bottom left): theoretical curves (solid) with HMC results (dashed, `×` markers) and variability bands for `Q*_W(3/√5)` (blue), `Q*_W(1/√5)` (orange) and `q*_2` (green). The blue and green curves rise steeply at low `α` and saturate near `0.95`, while the orange curve stays close to `0` over the whole range; the green band is the widest, the orange the narrowest.
</details>
<details>
<summary>x12.png Details</summary>

Same quantities as in x11.png, from the Appendix E.1 theory (Fig. 5, bottom right): `Q*_W(3/√5)` (blue) and `q*_2` (green) rise steeply and plateau near `1.0` by `α ≈ 2–3`, whereas `Q*_W(1/√5)` (orange) remains near `0` until `α ≈ 4` and then rises only to about `0.45` at `α = 7`, with the widest error band of the three series.
</details>
Figure 5: Different theoretical curves and numerical results for ReLU(x) activation, $P_{v}=\frac{1}{4}(\delta_{-3/\sqrt{5}}+\delta_{-1/\sqrt{5}}+\delta_{1/\sqrt{5} }+\delta_{3/\sqrt{5}})$ , $d=150$ , $\gamma=0.5$ , with linear readout with Gaussian noise of variance $\Delta=0.1$ . Top left: Optimal mean-square generalisation error predicted by the theory reported in the main text (solid blue) versus the branch obtained from the simplified ansatz (83) (solid red); the green solid line shows the universal branch corresponding to $\mathcal{Q}_{W}\equiv 0$ , and empty circles are HMC results with informative initialisation. Top right: Theoretical free entropy curves (colors and linestyles as top left). Bottom: Predictions for the overlaps $\mathcal{Q}_{W}(\mathsf{v})$ and $q_{2}$ from the theory devised in the main text (left) and in Appendix E.1 (right).
To fix this issue, one can compare the predictions of the theory derived from this ansatz with the ones obtained by plugging ${\mathcal{Q}}_{W}(\mathsf{v})=0\ \forall\ \mathsf{v}$ (denoted ${\mathcal{Q}}_{W}\equiv 0$ ) in the theory devised in the main text (6),
$$
f_{\rm uni}^{\alpha,\gamma}:=\psi_{P_{\text{out}}}(q_{K}(q_{2},\mathcal{Q}_{W}
\equiv 0);r_{K})+\frac{1}{4\alpha}(1+\gamma\bar{v}^{2}-q_{2})\hat{q}_{2}-\frac
{1}{\alpha}\iota(\hat{q}_{2}), \tag{89}
$$
to be extremised now only w.r.t. the scalar parameters $q_{2}$ , $\hat{q}_{2}$ (one can easily verify that, for ${\mathcal{Q}}_{W}\equiv 0$ , $\tau({\mathcal{Q}}_{W})=0$ and the extremisation w.r.t. $\hat{{\mathcal{Q}}}_{W}$ in (6) gives $\hat{{\mathcal{Q}}}_{W}\equiv 0$ ). Notice that $f_{\rm uni}^{\alpha,\gamma}$ does not depend on the prior over the inner weights, which is why we call it “universal”. For consistency, the two free entropies $f_{\rm sp}^{\alpha,\gamma}$ , $f_{\rm uni}^{\alpha,\gamma}$ should be compared through a discrete variational principle: the free entropy of the model is predicted to be
$$
\bar{f}^{\alpha,\gamma}_{\rm RS}:=\max\{{\rm extr}f_{\rm uni}^{\alpha,\gamma},
{\rm extr}f_{\rm sp}^{\alpha,\gamma}\}, \tag{90}
$$
instead of the unified variational form (6). Quite generally, ${\rm extr}f_{\rm uni}^{\alpha,\gamma}>{\rm extr}f_{\rm sp}^{\alpha,\gamma}$ for low values of $\alpha$ , so that the behaviour of the model in the universal phase is correctly predicted. The curves cross at a critical value
$$
\bar{\alpha}_{\rm sp}(\gamma)=\sup\{\alpha\mid{\rm extr}f_{\rm uni}^{\alpha,
\gamma}>{\rm extr}f_{\rm sp}^{\alpha,\gamma}\}, \tag{91}
$$
instead of the value $\alpha_{\rm sp}(\gamma)$ reported in the main. This approach has been profitably adopted in Barbier et al. (2025) in the context of matrix denoising (this is also the approach we used in an earlier version of this paper, superseded by the present one, accessible on arXiv at this link), a problem sharing some of the challenges presented in this paper. In this respect, it provides a heuristic solution that quantitatively predicts the behaviour of the model in most of its phase diagram. Moreover, for any activation $\sigma$ with a second Hermite coefficient $\mu_{2}=0$ (e.g., all odd activations), the ansatz (83) yields the same theory as the one devised in the main text: in this case $q_{K}(q_{2},{\mathcal{Q}}_{W})$ entering the energetic part of the free entropy does not depend on $q_{2}$ , so that the extremisation selects $q_{2}=\hat{q}_{2}=0$ and the remaining parts of (88) match the ones of (6). Finally, (83) is consistent with the observation that specialisation never arises in the case of quadratic activation and Gaussian prior over the inner weights: in this case, one can check that the universal branch ${\rm extr}f_{\rm uni}^{\alpha,\gamma}$ is always higher than ${\rm extr}f_{\rm sp}^{\alpha,\gamma}$ , and thus never selected by (90). For a convincing check on the validity of this approach, and a comparison with the theory devised in the main text and numerical results, see Fig. 5, top left panel.
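The discrete variational principle (90) and the critical point (91) amount to a pointwise maximum of two extremised free-entropy curves and the location where the dominant branch switches. A minimal numerical sketch, where the two curves are hypothetical stand-ins shaped to mimic Fig. 5 (top right) rather than actual solutions of the saddle-point equations:

```python
import numpy as np

# Hypothetical stand-ins for the two extremised free entropies as functions of
# the sample rate alpha. The true curves come from extremising (88) and (89);
# these closed forms only reproduce the qualitative shapes of Fig. 5, top right.
def f_uni(alpha):
    return -0.59 + 0.21 * np.log1p(0.9 * alpha)   # universal branch

def f_sp(alpha):
    return -0.64 + 0.24 * np.log1p(1.0 * alpha)   # specialisation branch

alphas = np.linspace(0.0, 7.0, 7001)
uni, sp = f_uni(alphas), f_sp(alphas)

# Discrete variational principle (90): select the larger branch pointwise.
f_RS = np.maximum(uni, sp)

# Critical point (91): largest alpha where the universal branch still dominates.
dominated = alphas[uni > sp]
alpha_sp_bar = dominated.max() if dominated.size else 0.0
```

With these toy curves the universal branch dominates at small `α` and the specialisation branch takes over at the crossing, exactly the mechanism selecting $\bar{\alpha}_{\rm sp}(\gamma)$.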
However, despite its merits listed above, this Appendix’s approach presents some issues, both from the theoretical and practical points of view:
1. the final free entropy of the model is obtained by comparing curves derived from completely different ansätze for the distribution $P(({\mathbf{S}}^{a}_{2})\mid\bm{{\mathcal{Q}}}_{W})$ (Gaussian with coupled replicas, leading to $f_{\rm sp}$ , vs. pure generalised Wishart with independent replicas, leading to $f_{\rm uni}$ ), rather than within a unified theory as in the main text;
2. the predicted critical value $\bar{\alpha}_{\rm sp}(\gamma)$ seems to be systematically larger than the one observed in experiments (see Fig. 5, top right panel, and compare the crossing point of the “sp” and “uni” free entropies with the actual transition where the numerical points depart from the universal branch in the top left panel);
3. predictions for the functional overlap ${\mathcal{Q}}_{W}^{*}$ from this approach are in much worse agreement with experimental data than the ones from the theory presented in the main text (see Fig. 5, bottom panel, and compare with Fig. 3 in the main text);
4. in the cases we tested, the predictions for the generalisation error from the theory devised in the main text are in much better agreement with numerical simulations than the ones from this Appendix (see Fig. 6 for a comparison).
Therefore, the more elaborate theory presented in the main is not only more meaningful from the theoretical viewpoint, but also in overall better agreement with simulations.
<details>
<summary>x13.png Details</summary>

Optimal generalisation error `ε_opt` versus `α ∈ [0, 7]` (Fig. 6): three decreasing curves, "main text" (blue, with error bars), "sp" (red) and "uni" (green), all starting near `ε_opt ≈ 0.08` at `α = 0` and flattening as `α` grows. The "sp" curve drops below "main text" around `α ≈ 3.5`, while "uni" crosses above "main text" around `α ≈ 4.5` and remains the highest thereafter.
</details>
Figure 6: Generalisation error for ReLU activation and Rademacher readout prior $P_{v}$ of the theory reported in the main text (solid blue) versus the branch obtained from the simplified ansatz (83) (solid red); the green solid line shows $\mathcal{Q}_{W}\equiv 0$ (universal branch), and empty circles are HMC results with informative initialisation.
### E.2 Possible refined analyses with structured ${\mathbf{S}}_{2}$ matrices
In the main text, we kept track of the inhomogeneous profile of the readouts induced by the non-trivial distribution $P_{v}$ , which is ultimately responsible for the sequence of specialisation phase transitions occurring at increasing $\alpha$ . This was done through a functional order parameter ${\mathcal{Q}}_{W}(\mathsf{v})$ measuring how much the student’s hidden weights associated with the readout elements equal to $\mathsf{v}$ have aligned with the teacher’s. However, when writing $\tilde{P}(({\mathbf{S}}_{2}^{a})\mid\bm{{\mathcal{Q}}}_{W})$ we treated the tensor ${\mathbf{S}}_{2}^{a}$ as a whole, without considering the possibility that its “components”
$$
\displaystyle S_{2;\alpha_{1}\alpha_{2}}^{a}(\mathsf{v}):=\frac{\mathsf{v}}{
\sqrt{|\mathcal{I}_{\mathsf{v}}|}}\sum_{i\in\mathcal{I}_{\mathsf{v}}}W^{a}_{i
\alpha_{1}}W^{a}_{i\alpha_{2}} \tag{92}
$$
could follow different laws for different $\mathsf{v}\in\mathsf{V}$ . To do so, let us define
$$
\displaystyle Q_{2}^{ab}=\frac{1}{k}\sum_{\mathsf{v},\mathsf{v}^{\prime}}
\mathsf{v}\,\mathsf{v}^{\prime}\sum_{i\in\mathcal{I}_{\mathsf{v}},j\in\mathcal
{I}_{\mathsf{v^{\prime}}}}(\Omega_{ij}^{ab})^{2}=\sum_{\mathsf{v},\mathsf{v}^{
\prime}}\frac{\sqrt{|\mathcal{I}_{\mathsf{v}}||\mathcal{I}_{\mathsf{v^{\prime}
}}|}}{k}{\mathcal{Q}}_{2}^{ab}(\mathsf{v},\mathsf{v}^{\prime}),\quad\text{
where}\quad{\mathcal{Q}}_{2}^{ab}(\mathsf{v},\mathsf{v}^{\prime}):=\frac{1}{d^
{2}}{\rm Tr}\,{\mathbf{S}}_{2}^{a}(\mathsf{v}){\mathbf{S}}_{2}^{b}(\mathsf{v}^
{\prime})^{\intercal}. \tag{93}
$$
The generalisation of (63) then reads
$$
\displaystyle\int dP(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W}) \displaystyle\frac{1}{d^{2}}{\rm Tr}\,{\mathbf{S}}_{2}^{a}(\mathsf{v}){\mathbf
{S}}_{2}^{b}(\mathsf{v}^{\prime})^{\intercal}=\delta_{\mathsf{v}\mathsf{v}^{
\prime}}\mathsf{v}^{2}\mathcal{Q}_{W}^{ab}(\mathsf{v})^{2}+\gamma\,\mathsf{v}
\mathsf{v}^{\prime}\sqrt{P_{v}(\mathsf{v})P_{v}(\mathsf{v}^{\prime})} \tag{94}
$$
w.r.t. the true distribution $P(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W})$ reported in (14). Despite the already good match of the theory in the main with the numerics, taking into account this additional level of structure thanks to a refined simplified measure could potentially lead to further improvements. The simplified measure able to match these moment-matching conditions while taking into account the Wishart form (92) of the matrices $({\mathbf{S}}_{2}^{a}(\mathsf{v}))$ is
$$
\displaystyle d\bar{P}(({\mathbf{S}}_{2}^{a})\mid\bm{{\mathcal{Q}}}_{W})
\propto\prod_{\mathsf{v}\in\mathsf{V}}\prod_{a}dP_{S}^{\mathsf{v}}({\mathbf{S}
}_{2}^{a}(\mathsf{v}))\times\prod_{\mathsf{v}\in\mathsf{V}}\prod_{a<b}e^{\frac
{1}{2}\bar{\tau}^{ab}_{\mathsf{v}}(\bm{{\mathcal{Q}}}_{W}){\rm Tr}{\mathbf{S}}
_{2}^{a}(\mathsf{v}){\mathbf{S}}_{2}^{b}(\mathsf{v})}, \tag{95}
$$
where $P_{S}^{\mathsf{v}}$ is the law of a random matrix $\mathsf{v}\bar{{\mathbf{W}}}\bar{{\mathbf{W}}}^{\intercal}|\mathcal{I}_{ \mathsf{v}}|^{-1/2}$ with $\bar{\mathbf{W}}\in\mathbb{R}^{d\times|\mathcal{I}_{\mathsf{v}}|}$ having i.i.d. standard Gaussian entries. For properly chosen $(\bar{\tau}_{\mathsf{v}}^{ab})$ , (94) is verified for this simplified measure.
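For two independent replicas the overlap $\mathcal{Q}_{W}^{ab}(\mathsf{v})$ vanishes, so only the second, "universal" term $\gamma\,\mathsf{v}\mathsf{v}^{\prime}\sqrt{P_{v}(\mathsf{v})P_{v}(\mathsf{v}^{\prime})}$ of (94) should survive. A minimal Monte Carlo check of this moment, using the Wishart-like components (92) with the four-atom readout prior of Fig. 5 (sizes here, `d=400`, are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 400
gamma = 0.5
k = int(gamma * d)                       # hidden-layer width
# Readout prior of Fig. 5: four equiprobable atoms v in {-3,-1,1,3}/sqrt(5).
atoms = np.array([-3.0, -1.0, 1.0, 3.0]) / np.sqrt(5)
m = k // 4                               # |I_v| = P_v(v) * k with P_v(v) = 1/4

def S2(W, v, idx):
    """Wishart-like component (92): (v/sqrt(|I_v|)) sum_{i in I_v} W_i W_i^T."""
    Wv = W[idx]
    return (v / np.sqrt(len(idx))) * Wv.T @ Wv

v, vp = atoms[3], atoms[1]               # v = 3/sqrt(5), v' = -1/sqrt(5)
I_v = np.arange(3 * m, 4 * m)
I_vp = np.arange(1 * m, 2 * m)
samples = []
for _ in range(10):
    # Independent replicas => Q_W^{ab}(v) = 0, only the universal term remains.
    Wa = rng.standard_normal((k, d))
    Wb = rng.standard_normal((k, d))
    Sa, Sb = S2(Wa, v, I_v), S2(Wb, vp, I_vp)
    samples.append(np.sum(Sa * Sb) / d**2)   # (1/d^2) Tr S2^a(v) S2^b(v')^T
emp = np.mean(samples)
theory = gamma * v * vp * np.sqrt(0.25 * 0.25)
```

The empirical average should concentrate on `theory` up to $O(1/d)$ fluctuations, consistent with the independent-replica part of (94).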
However, the order parameters $({\mathcal{Q}}_{2}^{ab}(\mathsf{v},\mathsf{v}^{\prime}))$ are difficult to deal with if keeping a general form, as they not only imply coupled replicas $({\mathbf{S}}_{2}^{a}(\mathsf{v}))_{a}$ for a given $\mathsf{v}$ (a kind of coupling that is easily linearised with a single Hubbard-Stratonovich transformation, within the replica symmetric treatment justified in Bayes-optimal learning), but also a coupling for different values of the variable $\mathsf{v}$ . Linearising it would yield a more complicated matrix model than the integral reported in (D.3), because the resulting coupling field would break rotational invariance and therefore the model does not have a form which is known to be solvable, see Kazakov (2000).
A first idea to simplify $P(({\mathbf{S}}^{a}_{2})\mid\bm{{\mathcal{Q}}}_{W})$ (14) while taking into account the additional structure induced by (93), (94) and maintaining a solvable model, is to consider a generalisation of the relaxation (83). This entails dropping entirely the dependencies among matrix entries, induced by their Wishart-like form (92), for each ${\mathbf{S}}_{2}^{a}(\mathsf{v})$ . In this case, the moment constraints (94) can be exactly enforced by choosing the simplified measure
$$
\displaystyle d\bar{P}(({\mathbf{S}}_{2}^{a})\mid\bm{{\mathcal{Q}}}_{W})=\prod
_{\mathsf{v}\in\mathsf{V}}\prod_{a=0}^{s}d{\mathbf{S}}^{a}_{2}(\mathsf{v})
\prod_{\alpha=1}^{d}\delta(S^{a}_{2;\alpha\alpha}(\mathsf{v})-\mathsf{v}\sqrt{
|\mathcal{I}_{\mathsf{v}}|})\times\prod_{\mathsf{v}\in\mathsf{V}}\prod_{\alpha
_{1}<\alpha_{2}}^{d}\frac{e^{-\frac{1}{2}\sum_{a,b=0}^{s}S^{a}_{2;\alpha_{1}
\alpha_{2}}(\mathsf{v})\bar{\tau}_{\mathsf{v}}^{ab}(\bm{{\mathcal{Q}}}_{W})S^{
b}_{2;\alpha_{1}\alpha_{2}}(\mathsf{v})}}{\sqrt{(2\pi)^{s+1}\det(\bar{\bm{\tau
}}_{\mathsf{v}}(\bm{{\mathcal{Q}}}_{W})^{-1})}}. \tag{96}
$$
The parameters $(\bar{\tau}^{ab}_{\mathsf{v}}(\bm{{\mathcal{Q}}}_{W}))$ are then properly chosen to enforce (94) for all $0\leq a\leq b\leq s$ and $\mathsf{v},\mathsf{v}^{\prime}\in\mathsf{V}$ . Using this measure, the resulting entropic term, taking into account the degeneracy of the order parameters $({\mathcal{Q}}_{2}^{ab}(\mathsf{v},\mathsf{v}^{\prime}))$ and $({\mathcal{Q}}_{W}^{ab}(\mathsf{v}))$ , remains tractable through Gaussian integrals (the energetic term is obviously unchanged once we express $(Q_{2}^{ab})$ entering it using these new order parameters through the identity (93), and keeping in mind that nothing changes for higher order overlaps compared to the theory in the main). We leave for future work the analysis of this Gaussian relaxation and other possible simplifications of (95) leading to solvable models.
## Appendix F Linking free entropy and mutual information
It is possible to relate the mutual information (MI) of the inference task to the free entropy $f_{n}=\mathbb{E}\ln\mathcal{Z}$ introduced in the main. Indeed, we can write the MI as
$$
\frac{I({\mathbf{W}}^{0};\mathcal{D})}{kd}=\frac{\mathcal{H}(\mathcal{D})}{kd}
-\frac{\mathcal{H}(\mathcal{D}\mid{\mathbf{W}}^{0})}{kd}, \tag{97}
$$
where $\mathcal{H}(Y\mid X)$ is the conditional Shannon entropy of $Y$ given $X$ . It is straightforward to show that the free entropy is
$$
-\frac{\alpha}{\gamma}f_{n}=\frac{\mathcal{H}(\{y_{\mu}\}_{\mu\leq n}\mid\{{
\mathbf{x}}_{\mu}\}_{\mu\leq n})}{kd}=\frac{\mathcal{H}(\mathcal{D})}{kd}-
\frac{\mathcal{H}(\{{\mathbf{x}}_{\mu}\}_{\mu\leq n})}{kd}, \tag{98}
$$
by the chain rule for the entropy. On the other hand $\mathcal{H}(\mathcal{D}\mid{\mathbf{W}}^{0})=\mathcal{H}(\{y_{\mu}\}\mid{ \mathbf{W}}^{0},\{{\mathbf{x}}_{\mu}\})+\mathcal{H}(\{{\mathbf{x}}_{\mu}\})$ , i.e.,
$$
\frac{\mathcal{H}(\mathcal{D}\mid{\mathbf{W}}^{0})}{kd}\approx-\frac{\alpha}{
\gamma}\mathbb{E}_{\lambda}\int dyP_{\text{out}}(y\mid\lambda)\ln P_{\text{out
}}(y\mid\lambda)+\frac{\mathcal{H}(\{{\mathbf{x}}_{\mu}\}_{\mu\leq n})}{kd}, \tag{99}
$$
where $\lambda\sim{\mathcal{N}}(0,r_{K})$ , with $r_{K}$ given by (53) (assuming here that $\mu_{0}=0$ , see App. D.5 if the activation $\sigma$ is non-centred), and the equality holds asymptotically in the large-size limit. This allows us to express the MI as
$$
\frac{I({\mathbf{W}}^{0};\mathcal{D})}{kd}=-\frac{\alpha}{\gamma}f_{n}+\frac{
\alpha}{\gamma}\mathbb{E}_{\lambda}\int dyP_{\text{out}}(y|\lambda)\ln P_{
\text{out}}(y|\lambda). \tag{100}
$$
Specialising the equation to the Gaussian channel, one obtains
$$
\frac{I({\mathbf{W}}^{0};\mathcal{D})}{kd}=-\frac{\alpha}{\gamma}f_{n}-\frac{
\alpha}{2\gamma}\ln(2\pi e\Delta). \tag{101}
$$
Note that the choice of normalising by $kd$ is not accidental. Indeed, the number of parameters is $kd+k\approx kd$ . Hence with this choice one can interpret the parameter $\alpha$ as an effective signal-to-noise ratio.
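As a purely arithmetic illustration of (101), the conversion from free entropy to MI per parameter for the Gaussian readout channel can be written down directly; the free-entropy value below is a placeholder, not an output of the actual saddle-point equations:

```python
import numpy as np

# Illustration of (101) for the Gaussian channel with noise variance Delta.
# f_n is a hypothetical extremised free-entropy value, used only to exercise
# the conversion formula.
alpha, gamma, Delta = 2.0, 0.5, 0.1
f_n = -0.45                                       # placeholder for extr f_RS
H_cond = 0.5 * np.log(2 * np.pi * np.e * Delta)   # (1/2) ln(2*pi*e*Delta)
mi_per_param = -(alpha / gamma) * f_n - (alpha / gamma) * H_cond
```

Here `H_cond` is the differential entropy of ${\mathcal{N}}(0,\Delta)$ , i.e., minus the $\mathbb{E}_{\lambda}\int dyP_{\text{out}}\ln P_{\text{out}}$ term of (100) specialised to the Gaussian channel.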
**Remark F.1**
*The arguments of Barbier et al. (2025) to show the existence of an upper bound on the mutual information per variable in the case of discrete variables and the associated inevitable breaking of prior universality beyond a certain threshold in matrix denoising apply to the present model too. It implies, as in the aforementioned paper, that the mutual information per variable cannot go beyond $\ln 2$ for Rademacher inner weights. Our theory is consistent with this fact; this is a direct consequence of the analysis in App. G (see in particular (108)) specialised to binary prior over ${\mathbf{W}}$ .*
## Appendix G Large sample rate limit of $f_{\rm RS}^{\alpha,\gamma}$
In this section we show that when the prior over the weights ${\mathbf{W}}$ is discrete the MI can never exceed the entropy of the prior itself.
To do this, we first need to control the function $\rm mmse$ when its argument is large. By a saddle point argument, it is not difficult to show that the leading term of ${\rm mmse}_{S}(\tau)$ when $\tau\to\infty$ is of the type $C(\gamma)/\tau$ for a proper constant $C$ depending at most on the rectangularity ratio $\gamma$ .
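The $C(\gamma)/\tau$ decay can be illustrated on the simplest analogue, a scalar Bayes-optimal Gaussian channel, where ${\rm mmse}(\tau)=1/(1+\tau)$ so that $\tau\cdot{\rm mmse}(\tau)\to C$ with $C=1$. A Monte Carlo sketch of this scalar stand-in (not the matrix quantity ${\rm mmse}_{S}$ itself):

```python
import numpy as np

rng = np.random.default_rng(1)

def mmse_gauss(tau, n=200_000):
    # Scalar channel y = sqrt(tau)*s + z with s, z ~ N(0,1); the Bayes-optimal
    # (posterior-mean) estimator is sqrt(tau)/(1+tau) * y.
    s = rng.standard_normal(n)
    z = rng.standard_normal(n)
    y = np.sqrt(tau) * s + z
    s_hat = np.sqrt(tau) / (1.0 + tau) * y
    return np.mean((s - s_hat) ** 2)

# tau * mmse(tau) approaches a constant at large tau, mirroring the
# C(gamma)/tau behaviour of mmse_S used in this appendix.
ratios = [tau * mmse_gauss(tau) for tau in (50.0, 200.0, 1000.0)]
```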
We now notice that the equation for $\hat{\mathcal{Q}}_{W}(v)$ in (76) can be rewritten as
$$
\displaystyle\hat{\mathcal{Q}}_{W}(v)=\frac{1}{2\gamma}[{\rm mmse}_{S}(\tau)-{
\rm mmse}_{S}(\tau+\hat{q}_{2})]\partial_{{\mathcal{Q}}_{W}(v)}\tau+2\frac{
\alpha}{\gamma}\partial_{{\mathcal{Q}}_{W}(v)}\psi_{P_{\text{out}}}(q_{K}(q_{2
},\mathcal{Q}_{W});r_{K}). \tag{102}
$$
For $\alpha\to\infty$ we make the self-consistent ansatz $\mathcal{Q}_{W}(v)=1-o_{\alpha}(1)$ . As a consequence, by the moment matching condition (74), $1/\tau$ has to vanish as $o_{\alpha}(1)$ too. Using the very same equation, we are also able to evaluate $\partial_{\mathcal{Q}_{W}(v)}\tau$ as follows:
$$
\displaystyle\partial_{\mathcal{Q}_{W}(v)}\tau=\frac{-2v^{2}\mathcal{Q}_{W}(v)
}{{\rm mmse^{\prime}}(\tau)}\sim\tau^{2} \tag{103}
$$
as $\alpha\to\infty$ , where we have used ${\rm mmse}_{S}(\tau)\sim C(\gamma)/\tau$ to estimate the derivative. We use the same approximation for the two $\rm mmse$ ’s appearing in the fixed point equation for $\hat{\mathcal{Q}}_{W}(v)$ :
$$
\displaystyle\hat{\mathcal{Q}}_{W}(v)\sim\frac{\hat{q}_{2}}{2\gamma(\tau(\tau+
\hat{q}_{2}))}\tau^{2}+2\frac{\alpha}{\gamma}\partial_{{\mathcal{Q}}_{W}(v)}
\psi_{P_{\text{out}}}(q_{K}(q_{2},\mathcal{Q}_{W});r_{K}). \tag{104}
$$
From the last equation in (76) we see that $\hat{q}_{2}$ cannot diverge faster than $O(\alpha)$ . Thanks to the above approximation and the first equation of (76), this entails that $\mathcal{Q}_{W}(v)$ is approaching $1$ exponentially fast in $\alpha$ , which in turn implies $\tau$ is diverging exponentially in $\alpha$ . As a consequence
$$
\displaystyle\frac{\tau^{2}}{\tau(\tau+\hat{q}_{2})}\sim 1. \tag{105}
$$
Furthermore, one also has
$$
\frac{1}{\alpha}[\iota(\tau)-\iota(\tau+\hat{q}_{2})]=-\frac{1}{4\alpha}\int_{\tau}^{\tau+\hat{q}_{2}}{\rm mmse}_{S}(t)\,dt\approx-\frac{C(\gamma)}{4\alpha}\log\Big(1+\frac{\hat{q}_{2}}{\tau}\Big)\xrightarrow[]{\alpha\to\infty}0, \tag{106}
$$
as $\frac{\hat{q}_{2}}{\tau}$ vanishes with exponential speed in $\alpha$ .
Concerning the function $\psi_{P_{W}}$ , given that it is related to a Bayes-optimal scalar Gaussian channel and that its SNRs $\hat{\mathcal{Q}}_{W}(v)$ are all diverging, one can compute the integral by saddle point, which is inevitably attained at the ground truth:
$$
\psi_{P_{W}}(\hat{\mathcal{Q}}_{W}(v))-\frac{\hat{\mathcal{Q}}_{W}(v)\mathcal{Q}_{W}(v)}{2}\approx\mathbb{E}_{w^{0}}\ln\int dP_{W}(w)\mathbbm{1}(w=w^{0})+\mathbb{E}\Big[\big(\sqrt{\hat{\mathcal{Q}}_{W}(v)}\,\xi+\hat{\mathcal{Q}}_{W}(v)w^{0}\big)w^{0}-\frac{\hat{\mathcal{Q}}_{W}(v)}{2}(w^{0})^{2}\Big]-\frac{\hat{\mathcal{Q}}_{W}(v)(1-o_{\alpha}(1))}{2}=-\mathcal{H}(W)+o_{\alpha}(1). \tag{107}
$$
Considering that $\psi_{P_{\text{out}}}(q_{K}(q_{2},\mathcal{Q}_{W});r_{K})\xrightarrow[]{\alpha \to\infty}\psi_{P_{\text{out}}}(r_{K};r_{K})$ , and using (100), it is then straightforward to check that our RS version of the mutual information (MI) saturates to the entropy of the prior $P_{W}$ when $\alpha\to\infty$ :
$$
-\frac{\alpha}{\gamma}\,\text{extr}f_{\rm RS}^{\alpha,\gamma}+\frac{\alpha}{\gamma}\mathbb{E}_{\lambda}\int dy\,P_{\text{out}}(y|\lambda)\ln P_{\text{out}}(y|\lambda)\xrightarrow[]{\alpha\to\infty}\mathcal{H}(W). \tag{108}
$$
## Appendix H Extension of GAMP-RIE to arbitrary activation
Algorithm 1 GAMP-RIE for training shallow neural networks with arbitrary activation
Input: Fresh data point ${\mathbf{x}}_{\text{test}}$ with unknown associated response $y_{\text{test}}$ , dataset $\mathcal{D}=\{({\mathbf{x}}_{\mu},y_{\mu})\}_{\mu=1}^{n}$ .
Output: Estimator $\hat{y}_{\text{test}}$ of $y_{\text{test}}$ .
Estimate $y^{(0)}:=\mu_{0}{\mathbf{v}}^{\intercal}\bm{1}/\sqrt{k}$ as
$$
\hat{y}^{(0)}=\frac{1}{n}\sum_{\mu}y_{\mu};
$$
Estimate $\langle{\mathbf{W}}^{\intercal}{\mathbf{v}}\rangle/\sqrt{k}$ using (117).
Estimate the $\mu_{1}$ term in the Hermite expansion (111) as
$$
\hat{y}_{\mu}^{(1)}=\mu_{1}\frac{\langle{\mathbf{v}}^{\intercal}{\mathbf{W}}\rangle{\mathbf{x}}_{\mu}}{\sqrt{kd}};
$$
Compute
$$
\tilde{y}_{\mu}=\frac{y_{\mu}-\hat{y}^{(0)}-\hat{y}_{\mu}^{(1)}}{\mu_{2}/2};\qquad\tilde{\Delta}=\frac{\Delta+g(1)}{\mu_{2}^{2}/4};
$$
Input $\{({\mathbf{x}}_{\mu},\tilde{y}_{\mu})\}_{\mu=1}^{n}$ and $\tilde{\Delta}$ into Algorithm 1 in Maillard et al. (2024a) to estimate $\langle{\mathbf{W}}^{\intercal}({\mathbf{v}}){\mathbf{W}}\rangle$ ;
Output
$$
\hat{y}_{\text{test}}=\hat{y}^{(0)}+\mu_{1}\frac{\langle{\mathbf{v}}^{\intercal}{\mathbf{W}}\rangle{\mathbf{x}}_{\text{test}}}{\sqrt{kd}}+\frac{\mu_{2}}{2}\frac{1}{d\sqrt{k}}{\rm Tr}[({\mathbf{x}}_{\text{test}}{\mathbf{x}}_{\text{test}}^{\intercal}-{\mathbb{I}})\langle{\mathbf{W}}^{\intercal}({\mathbf{v}}){\mathbf{W}}\rangle].
$$
For simplicity, let us consider $P_{\rm out}(y\mid\lambda)=\exp(-\frac{1}{2\Delta}(y-\lambda)^{2})/\sqrt{2\pi\Delta}$ , which entails:
$$
y_{\mu}\mid({\bm{\theta}}^{0},{\mathbf{x}}_{\mu})\overset{\rm d}{=}\frac{{\mathbf{v}}^{\intercal}}{\sqrt{k}}\sigma\Big(\frac{{\mathbf{W}}^{0}{\mathbf{x}}_{\mu}}{\sqrt{d}}\Big)+\sqrt{\Delta}\,z_{\mu},\quad\mu=1,\dots,n, \tag{110}
$$
where $z_{\mu}$ are i.i.d. standard Gaussian random variables and $\overset{\rm d}{{}={}}$ means equality in law. Expanding $\sigma$ in the Hermite polynomial basis we have
$$
y_{\mu}\mid({\bm{\theta}}^{0},{\mathbf{x}}_{\mu})\overset{\rm d}{=}\mu_{0}\frac{{\mathbf{v}}^{\intercal}\bm{1}_{k}}{\sqrt{k}}+\mu_{1}\frac{{\mathbf{v}}^{\intercal}{\mathbf{W}}^{0}{\mathbf{x}}_{\mu}}{\sqrt{kd}}+\frac{\mu_{2}}{2}\frac{{\mathbf{v}}^{\intercal}}{\sqrt{k}}{\rm He}_{2}\Big(\frac{{\mathbf{W}}^{0}{\mathbf{x}}_{\mu}}{\sqrt{d}}\Big)+\dots+\sqrt{\Delta}\,z_{\mu}, \tag{111}
$$
where $\dots$ represents the terms beyond second order. Without loss of generality, for this choice of output channel we can set $\mu_{0}=0$ , as discussed in App. D.5. For low enough $\alpha$ it is reasonable to assume that the higher-order terms in $\dots$ cannot be learnt from quadratically many samples and, as a result, play the role of effective noise, which we assume independent of the first three terms. We shall see that this reasoning actually applies to the extension of the GAMP-RIE we derive, which plays the role of a “smart” spectral algorithm, regardless of the value of $\alpha$ . These terms therefore accumulate into an asymptotically Gaussian noise thanks to the central limit theorem (each is a projection of a centred function applied entry-wise to a vector with i.i.d. entries), with variance $g(1)$ (see (43)). We thus obtain the effective model
$$
y_{\mu}\mid({\bm{\theta}}^{0},{\mathbf{x}}_{\mu})\overset{\rm d}{=}\mu_{1}\frac{{\mathbf{v}}^{\intercal}{\mathbf{W}}^{0}{\mathbf{x}}_{\mu}}{\sqrt{kd}}+\frac{\mu_{2}}{2}\frac{{\mathbf{v}}^{\intercal}}{\sqrt{k}}{\rm He}_{2}\Big(\frac{{\mathbf{W}}^{0}{\mathbf{x}}_{\mu}}{\sqrt{d}}\Big)+\sqrt{\Delta+g(1)}\,z_{\mu}. \tag{112}
$$
The first term in this expression can be learnt with vanishing error given quadratically many samples (Remark H.1), thus can be ignored. This further simplifies the model to
$$
\bar{y}_{\mu}:=y_{\mu}-\mu_{1}\frac{{\mathbf{v}}^{\intercal}{\mathbf{W}}^{0}{\mathbf{x}}_{\mu}}{\sqrt{kd}}\overset{\rm d}{=}\frac{\mu_{2}}{2}\frac{{\mathbf{v}}^{\intercal}}{\sqrt{k}}{\rm He}_{2}\Big(\frac{{\mathbf{W}}^{0}{\mathbf{x}}_{\mu}}{\sqrt{d}}\Big)+\sqrt{\Delta+g(1)}\,z_{\mu}, \tag{113}
$$
where $\bar{y}_{\mu}$ is $y_{\mu}$ with the (asymptotically) perfectly learnt linear term removed, and the last equality in distribution is again conditional on $({\bm{\theta}}^{0},{\mathbf{x}}_{\mu})$ . From the formula
$$
\frac{{\mathbf{v}}^{\intercal}}{\sqrt{k}}{\rm He}_{2}\Big(\frac{{\mathbf{W}}^{0}{\mathbf{x}}_{\mu}}{\sqrt{d}}\Big)={\rm Tr}\frac{{\mathbf{W}}^{0\intercal}({\mathbf{v}}){\mathbf{W}}^{0}}{d\sqrt{k}}{\mathbf{x}}_{\mu}{\mathbf{x}}_{\mu}^{\intercal}-\frac{{\mathbf{v}}^{\intercal}\bm{1}_{k}}{\sqrt{k}}\approx\frac{1}{\sqrt{k}d}{\rm Tr}[({\mathbf{x}}_{\mu}{\mathbf{x}}_{\mu}^{\intercal}-{\mathbb{I}}_{d}){\mathbf{W}}^{0\intercal}({\mathbf{v}}){\mathbf{W}}^{0}], \tag{114}
$$
where $\approx$ exploits the concentration ${\rm Tr}\,{\mathbf{W}}^{0\intercal}({\mathbf{v}}){\mathbf{W}}^{0}/(d\sqrt{k})\to {\mathbf{v}}^{\intercal}\bm{1}_{k}/\sqrt{k}$ , as well as the Gaussian equivalence property that ${\mathbf{M}}_{\mu}:=({\mathbf{x}}_{\mu}{\mathbf{x}}_{\mu}^{\intercal}-{\mathbb {I}}_{d})/\sqrt{d}$ behaves like a GOE sensing matrix, i.e., a symmetric matrix whose upper triangular part has i.i.d. entries from $\mathcal{N}(0,(1+\delta_{ij})/d)$ (Maillard et al., 2024a). The model can thus be seen as a GLM with signal $\bar{\mathbf{S}}^{0}_{2}:={\mathbf{W}}^{0\intercal}({\mathbf{v}}){\mathbf{W}}^ {0}/\sqrt{kd}$ :
$$
y^{\rm GLM}_{\mu}=\frac{\mu_{2}}{2}{\rm Tr}[{\mathbf{M}}_{\mu}\bar{\mathbf{S}}^{0}_{2}]+\sqrt{\Delta+g(1)}\,z_{\mu}. \tag{115}
$$
Starting from this equation, the arguments of App. D and Maillard et al. (2024a), based on known results on the GLM (Barbier et al., 2019) and on matrix denoising (Barbier & Macris, 2022; Maillard et al., 2022; Pourkamali et al., 2024), allow us to obtain the free entropy of this matrix sensing problem. The result is consistent with the $\mathcal{Q}_{W}\equiv 0$ solution of the saddle point equations obtained from the replica method in App. D, which, as anticipated, corresponds to the case where the combinations of the signal associated with Hermite polynomials beyond the second are not learnt.
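The first equality in (114) is exact, and the GOE construction behind the sensing matrices is elementary; both are easy to check numerically. A minimal sketch in Python, with illustrative sizes of our choosing and writing $({\mathbf{v}})$ for $\mathrm{diag}({\mathbf{v}})$ , as we read the notation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 60, 30
W0 = rng.standard_normal((k, d))            # teacher inner weights
v = rng.choice([-1.0, 1.0], size=k)         # readouts
x = rng.standard_normal(d)                  # one input

# Exact part of (114): v^T He_2(W0 x / sqrt(d)) / sqrt(k), with He_2(u) = u^2 - 1
pre = W0 @ x / np.sqrt(d)
lhs = v @ (pre**2 - 1) / np.sqrt(k)
rhs = np.trace(W0.T @ np.diag(v) @ W0 @ np.outer(x, x)) / (d * np.sqrt(k)) \
      - v.sum() / np.sqrt(k)
assert abs(lhs - rhs) < 1e-10               # identity holds to machine precision

# GOE sensing matrix: symmetric, upper triangle i.i.d. N(0, (1+delta_ij)/d)
A = rng.standard_normal((d, d))
M = (A + A.T) / np.sqrt(2 * d)              # off-diagonal var 1/d, diagonal var 2/d

# One sample from the equivalent GLM; mu2 and the noise level are illustrative
mu2, Delta_eff = 0.5, 0.2                   # Delta_eff stands for Delta + g(1)
S2bar = W0.T @ np.diag(v) @ W0 / np.sqrt(k * d)
y_glm = mu2 / 2 * np.trace(M @ S2bar) + np.sqrt(Delta_eff) * rng.standard_normal()
```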
Note that, as supported by the numerics, the model actually admits specialisation when $\alpha$ is large enough, hence the above equivalence cannot hold on the whole phase diagram at the information-theoretic level. In fact, once specialisation occurs one cannot treat the $\dots$ terms in (111) as noise uncorrelated with the leading ones: the model is aligning with the actual teacher’s weights, so it learns all the successive terms at once.
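The Hermite data entering the expansion (111), namely $\mu_{0},\mu_{1},\mu_{2}$ and the variance of the discarded higher orders, can be evaluated numerically for any activation. A minimal sketch (the helper name `hermite_moments` is ours); for ReLU one can check analytically that $\mu_{0}=\mu_{2}=1/\sqrt{2\pi}$ and $\mu_{1}=1/2$ :

```python
import numpy as np

def hermite_moments(sigma, deg=120):
    """mu_k = E[He_k(z) sigma(z)] for z ~ N(0,1), k = 0, 1, 2, and nu = E[sigma(z)^2],
    via Gauss-Hermite quadrature for the probabilists' weight exp(-z^2/2)."""
    z, w = np.polynomial.hermite_e.hermegauss(deg)
    w = w / np.sqrt(2 * np.pi)            # normalise: weights now sum to 1
    s = sigma(z)
    mu0 = w @ s                           # E[sigma(z)]
    mu1 = w @ (z * s)                     # E[z sigma(z)]
    mu2 = w @ ((z**2 - 1) * s)            # E[He_2(z) sigma(z)]
    nu = w @ s**2                         # E[sigma(z)^2]
    return mu0, mu1, mu2, nu

mu0, mu1, mu2, nu = hermite_moments(np.tanh)   # any activation works here
resid = nu - mu0**2 - mu1**2 - mu2**2 / 2      # variance of the omitted orders
```

Since $\mathbb{E}[{\rm He}_{j}(z){\rm He}_{k}(z)]=k!\,\delta_{jk}$ , the residual `resid` equals $\sum_{k\geq 3}\mu_{k}^{2}/k!$ , which is the variance of the effective noise accumulated by the terms beyond second order.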
<details>
<summary>x14.png Details</summary>

Line graph of test error versus $\alpha$ for ReLU (red squares) and ELU (blue circles), with error bars and an inset zooming on $\alpha\in[1,2]$ . ELU shows the lower test error at small $\alpha$ ; both curves decrease with $\alpha$ and then drop sharply to near-zero error, at $\alpha\approx 2$ for ReLU and $\alpha\approx 3.5$ for ELU.
</details>
Figure 7: Theoretical prediction (solid curves) of the Bayes-optimal mean-square generalisation error for binary inner weights and ReLU, ELU activations, with $\gamma=0.5$ , $d=150$ , Gaussian label noise with $\Delta=0.1$ , and fixed readouts ${\mathbf{v}}=\mathbf{1}$ . Dashed lines are obtained from the solution of the fixed point equations (76) with all $\mathcal{Q}_{W}(\mathsf{v})=0$ . Circles are the test error of GAMP-RIE (Maillard et al., 2024a) extended to generic activation. The MCMC points initialised uninformatively (inset) are obtained using (36), to account for the lack of equilibration due to glassiness, which prevents using (38). Even in the possibly glassy region, the GAMP-RIE attains the universal branch performance. Data for GAMP-RIE and MCMC are averaged over 16 data instances, with error bars representing one standard deviation over instances.
We now assume that this mapping holds at the algorithmic level, namely, that we can process the data algorithmically as if they were coming from the identified GLM, and thus try to infer the signal $\bar{\mathbf{S}}_{2}^{0}={\mathbf{W}}^{0\intercal}({\mathbf{v}}){\mathbf{W}}^{ 0}/\sqrt{kd}$ and construct a predictor from it. Based on this idea, we propose Algorithm 1 that can indeed reach the performance predicted by the $\mathcal{Q}_{W}\equiv 0$ solution of our replica theory.
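The pipeline of Algorithm 1 can be sketched compactly as follows. This is only a sketch under our reading of the displays above: `denoise_S2` is a hypothetical stand-in for the matrix-sensing routine of Maillard et al. (2024a), and all parameter names are ours.

```python
import numpy as np

def gamp_rie_predict(X, y, x_test, mu1, mu2, Delta1, denoise_S2):
    """Sketch of the GAMP-RIE pipeline for a generic activation.
    X has shape (n, d) with rows x_mu; `denoise_S2` must return a (d, d)
    estimate of the quadratic signal from the rescaled residuals."""
    n, d = X.shape
    y0 = y.mean()                                    # estimate of mu_0 v^T 1 / sqrt(k)
    # Ridge-form estimator (117) of S1 = W^T v / sqrt(k):
    G = np.eye(d) + X.T @ X / (d * Delta1)
    S1 = np.linalg.solve(G, X.T @ y) / np.sqrt(d * Delta1)
    y1 = mu1 * (X @ S1) / np.sqrt(d)                 # learnt linear (mu_1) part
    y_tilde = (y - y0 - y1) / (mu2 / 2)              # residuals for the quadratic part
    S2 = denoise_S2(X, y_tilde)                      # black-box matrix-sensing step
    # Final predictor (last display of Algorithm 1):
    he2 = x_test @ S2 @ x_test - np.trace(S2)
    return y0 + mu1 * (S1 @ x_test) / np.sqrt(d) + (mu2 / 2) * he2
```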
**Remark H.1**
*In the linear data regime, where $n/d$ converges to a fixed constant $\alpha_{1}$ , only the first term in (111) can be learnt while the rest behaves like noise. By the same argument as above, the model is equivalent to
$$
y_{\mu}=\mu_{1}\frac{{\mathbf{v}}^{\intercal}{\mathbf{W}}^{0}{\mathbf{x}}_{\mu}}{\sqrt{kd}}+\sqrt{\Delta+\nu-\mu_{0}^{2}-\mu_{1}^{2}}\,z_{\mu}, \tag{116}
$$
where $\nu=\mathbb{E}_{z\sim{\mathcal{N}}(0,1)}\sigma^{2}(z)$ . This is again a GLM with signal ${\mathbf{S}}_{1}^{0}={\mathbf{W}}^{0\intercal}{\mathbf{v}}/\sqrt{k}$ and Gaussian sensing vectors ${\mathbf{x}}_{\mu}$ . Define $q_{1}$ as the limit of ${\mathbf{S}}_{1}^{a\intercal}{\mathbf{S}}_{1}^{b}/d$ where ${\mathbf{S}}_{1}^{a},{\mathbf{S}}_{1}^{b}$ are drawn independently from the posterior. With $k\rightarrow\infty$ , the signal converges in law to a standard Gaussian vector. Using known results on GLMs with Gaussian signal Barbier et al. (2019), we obtain the following equations characterising $q_{1}$ :
$$
q_{1}=\frac{\hat{q}_{1}}{\hat{q}_{1}+1},\qquad\hat{q}_{1}=\frac{\alpha_{1}}{1+\Delta_{1}-q_{1}},\quad\text{where}\quad\Delta_{1}=\frac{\Delta+\nu-\mu_{0}^{2}-\mu_{1}^{2}}{\mu_{1}^{2}}.
$$
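These scalar equations can be solved by direct fixed-point iteration; a minimal sketch (the helper name `solve_q1` is ours):

```python
def solve_q1(alpha1, Delta1, n_iter=500):
    """Fixed-point iteration for the overlap q_1 in the linear data regime."""
    q1 = 0.0
    for _ in range(n_iter):
        q1_hat = alpha1 / (1 + Delta1 - q1)   # channel update
        q1 = q1_hat / (q1_hat + 1)            # Gaussian-signal (prior) update
    return q1
```

As $\alpha_{1}$ grows, $q_{1}$ tends to $1$ , consistent with the first Hermite term being perfectly learnt once quadratically many samples are available.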
In the quadratic data regime, as $\alpha_{1}=n/d$ goes to infinity, the overlap $q_{1}$ converges to $1$ and the first term in (111) is learnt with vanishing error. Moreover, since ${\mathbf{S}}_{1}^{0}$ is asymptotically Gaussian, the linear problem (116) is equivalent to denoising the Gaussian vector $({\mathbf{v}}^{\intercal}{\mathbf{W}}^{0}{\mathbf{x}}_{\mu}/\sqrt{kd})_{\mu=0} ^{n}$ whose covariance is known as a function of ${\mathbf{X}}=({\mathbf{x}}_{1},\dots,{\mathbf{x}}_{n})\in\mathbb{R}^{d\times n}$ . This leads to the following simple MMSE estimator for ${\mathbf{S}}_{1}^{0}$ :
$$
\langle{\mathbf{S}}_{1}^{0}\rangle=\frac{1}{\sqrt{d\Delta_{1}}}\left(\mathbf{I}+\frac{1}{d\Delta_{1}}{\mathbf{X}}{\mathbf{X}}^{\intercal}\right)^{-1}{\mathbf{X}}{\mathbf{y}}, \tag{117}
$$
where ${\mathbf{y}}=(y_{1},\dots,y_{n})$ . Note that the derivation of this estimator does not assume the Gaussianity of ${\mathbf{x}}_{\mu}$ .*
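For intuition, an estimator of this ridge form can be exercised on synthetic data from the effective linear channel (116). A small self-contained experiment with illustrative sizes; we write the channel in the rescaled form $\bar{y}_{\mu}={\mathbf{S}}_{1}^{0\intercal}{\mathbf{x}}_{\mu}/\sqrt{d}+\sqrt{\Delta_{1}}\,z_{\mu}$ and use the posterior-mean prefactor for that convention, which may differ from the normalisation of (117) by constants:

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, Delta1 = 100, 2000, 0.5               # illustrative sizes and noise level
S1 = rng.standard_normal(d)                 # Gaussian signal (k -> infinity limit)
X = rng.standard_normal((d, n))             # columns are the inputs x_mu
y = X.T @ S1 / np.sqrt(d) + np.sqrt(Delta1) * rng.standard_normal(n)

# Posterior-mean (ridge) estimator for the rescaled channel:
S1_hat = np.linalg.solve(np.eye(d) + X @ X.T / (d * Delta1), X @ y) / (np.sqrt(d) * Delta1)

mse = np.mean((S1_hat - S1) ** 2)           # well below the prior variance 1
```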
**Remark H.2**
*The same argument can be easily generalised for general $P_{\text{out}}$ , leading to the following equivalent GLM in the universal ${\mathcal{Q}}_{W}^{*}\equiv 0$ phase of the quadratic data regime:
$$
y_{\mu}^{\rm GLM}\sim\tilde{P}_{\text{out}}(\cdot\mid{\rm Tr}[{\mathbf{M}}_{\mu}\bar{\mathbf{S}}^{0}_{2}]),\quad\text{where}\quad\tilde{P}_{\text{out}}(y|x):=\mathbb{E}_{z\sim\mathcal{N}(0,1)}P_{\text{out}}\Big(y\mid\frac{\mu_{2}}{2}x+z\sqrt{g(1)}\Big), \tag{118}
$$
and ${\mathbf{M}}_{\mu}$ are independent GOE sensing matrices.*
**Remark H.3**
*One can show that the system of equations $({\rm S})$ in (LABEL:NSB_equations_gaussian_ch) with $\mathcal{Q}_{W}(\mathsf{v})$ all set to $0$ (and consequently $\tau=0$ ) can be mapped onto the fixed point of the state evolution equations (92), (94) of the GAMP-RIE in Maillard et al. (2024a) up to changes of variables. This confirms that when such a system has a unique solution, which is the case in all our tests, the GAMP-RIE asymptotically matches our universal solution. Assuming the validity of the aforementioned effective GLM, a potential improvement for discrete weights could come from a generalisation of GAMP which, in the denoising step, would correctly exploit the discrete prior over inner weights rather than using the RIE (which is prior independent). However, the results of Barbier et al. (2025) suggest that optimally denoising matrices with discrete entries is hard, and that the RIE is the best efficient procedure to do so. Consequently, we tend to believe that improving GAMP-RIE in the case of discrete weights is out of reach without strong side information about the teacher, or without exploiting non-polynomial-time algorithms (see Appendix I).*
## Appendix I Algorithmic complexity of finding the specialisation solution
<details>
<summary>x15.png Details</summary>

Semi-log scatter plot (dimension on a linear x-axis, gradient updates on a logarithmic y-axis) for three values of the parameter $\epsilon^{*}=0.008,\,0.01,\,0.012$ , with error bars growing with the dimension and dashed linear fits of slopes 0.0146, 0.0138 and 0.0136 respectively. All three series are approximately linear on this scale, consistent with exponential growth of the number of gradient updates with the dimension; smaller $\epsilon^{*}$ requires more updates at every dimension.
</details>
<details>
<summary>x16.png Details</summary>

Log-log scatter plot of gradient updates versus dimension for $\epsilon^{*}=0.008,\,0.01,\,0.012$ , with error bars and dashed power-law fits of slopes 1.4451, 1.4692 and 1.5340 respectively. All three series are approximately linear on the log-log scale, consistent with power-law growth of the number of gradient updates with the dimension; here larger $\epsilon^{*}$ corresponds to more updates and a steeper fitted slope.
</details>
<details>
<summary>x17.png Details</summary>

### Visual Description
## Scatter Plot with Linear Fits: Gradient Updates vs. Dimension
### Overview
The image is a technical scatter plot with error bars and overlaid linear regression lines. It displays the relationship between the dimension of a system (x-axis) and the number of gradient updates required (y-axis, on a logarithmic scale) for three different values of a parameter denoted as ε* (epsilon star). The plot suggests a power-law or exponential relationship, as the log-scale y-axis versus linear x-axis yields approximately straight lines.
### Components/Axes
* **Chart Type:** Scatter plot with error bars and linear fit lines.
* **X-Axis:**
* **Label:** "Dimension"
* **Scale:** Linear, ranging from approximately 40 to 250.
* **Major Ticks:** 50, 75, 100, 125, 150, 175, 200, 225, 250.
* **Y-Axis:**
* **Label:** "Gradient updates (log scale)"
* **Scale:** Logarithmic (base 10). The visible major ticks are at 10² (100) and 10³ (1000).
* **Legend (Located in the top-left corner of the plot area):**
* **Blue dashed line:** "Linear fit: slope=0.0127"
* **Green dashed line:** "Linear fit: slope=0.0128"
* **Red dashed line:** "Linear fit: slope=0.0135"
* **Blue circle marker with error bar:** "ε* = 0.008"
* **Green square marker with error bar:** "ε* = 0.01"
* **Red triangle marker with error bar:** "ε* = 0.012"
* **Data Series:** Three distinct series, each represented by a specific color/marker combination and accompanied by a linear fit line of the same color.
### Detailed Analysis
**Trend Verification:** All three data series show a clear, consistent upward trend. As the Dimension increases, the number of Gradient updates increases. On this semi-log plot, the trends are approximately linear, indicating an exponential relationship between Gradient updates and Dimension.
**Data Points & Linear Fits (Approximate Values):**
The data points are plotted at discrete Dimension values. The following table reconstructs the approximate y-values (Gradient updates) read from the log-scale axis for each series. The error bars indicate the uncertainty or variance in the measurement.
| Dimension | ε* = 0.008 (Blue Circle) | ε* = 0.01 (Green Square) | ε* = 0.012 (Red Triangle) |
| :--- | :--- | :--- | :--- |
| **~45** | ~150 (Error bar: ~120-180) | ~140 (Error bar: ~110-170) | ~130 (Error bar: ~100-160) |
| **~60** | ~220 (Error bar: ~190-250) | ~200 (Error bar: ~170-230) | ~180 (Error bar: ~150-210) |
| **~80** | ~320 (Error bar: ~280-360) | ~290 (Error bar: ~250-330) | ~260 (Error bar: ~220-300) |
| **~100** | ~450 (Error bar: ~400-500) | ~410 (Error bar: ~360-460) | ~370 (Error bar: ~320-420) |
| **~120** | ~620 (Error bar: ~560-680) | ~570 (Error bar: ~510-630) | ~520 (Error bar: ~460-580) |
| **~140** | ~850 (Error bar: ~780-920) | ~780 (Error bar: ~710-850) | ~710 (Error bar: ~640-780) |
| **~160** | ~1150 (Error bar: ~1050-1250) | ~1050 (Error bar: ~950-1150) | ~960 (Error bar: ~860-1060) |
| **~180** | ~1500 (Error bar: ~1350-1650) | ~1380 (Error bar: ~1230-1530) | ~1250 (Error bar: ~1100-1400) |
| **~200** | ~1950 (Error bar: ~1750-2150) | ~1780 (Error bar: ~1580-1980) | ~1600 (Error bar: ~1400-1800) |
| **~220** | N/A | ~2250 (Error bar: ~2000-2500) | ~2050 (Error bar: ~1800-2300) |
| **~240** | N/A | N/A | ~2600 (Error bar: ~2300-2900) |
**Linear Fit Slopes:**
* Blue (ε* = 0.008): slope = 0.0127
* Green (ε* = 0.01): slope = 0.0128
* Red (ε* = 0.012): slope = 0.0135
### Key Observations
1. **Consistent Linear Scaling:** The primary observation is the strong linear relationship on the semi-log plot for all three series. This indicates that the number of gradient updates grows exponentially with the dimension.
2. **Slope Dependence on ε*:** The slope of the linear fit increases slightly with the value of ε*. The series for ε* = 0.012 (red) has the steepest slope (0.0135), while the slopes for ε* = 0.008 (blue) and ε* = 0.01 (green) are nearly identical (0.0127 vs. 0.0128).
3. **Increasing Variance:** The vertical spread of the error bars appears to increase with Dimension for all series, suggesting that the variance or uncertainty in the number of gradient updates grows as the problem size (dimension) increases.
4. **Data Range:** The blue series (ε* = 0.008) has data points only up to Dimension ~200, while the red series (ε* = 0.012) extends to Dimension ~240. This may indicate experimental limits or convergence behavior.
### Interpretation
This chart demonstrates a fundamental scaling law in an optimization or machine learning context. The "Gradient updates" likely represent the computational cost or training time required for an algorithm to converge. The "Dimension" represents the size or complexity of the model or problem.
The key finding is that this cost scales **exponentially with dimension**, as evidenced by the linear trend on a log-linear plot. This is a significant result: each fixed increment in dimension multiplies the required training effort by a roughly constant factor, so the cost compounds rapidly as models grow.
The parameter ε* appears to be a tolerance or step-size parameter. The data suggests that a **larger ε* (0.012) leads to a slightly faster increase in cost with dimension** (steeper slope) compared to smaller values. However, at any given dimension, the absolute number of updates is lower for larger ε* (the red points are consistently below the blue and green points). This presents a trade-off: a larger ε* may yield faster initial progress (fewer updates at low dimensions) but becomes relatively less efficient as the problem scales up.
The increasing error bars with dimension highlight that for larger, more complex problems, the performance of the algorithm becomes less predictable, with a wider range of possible outcomes for the number of updates required. This chart would be critical for researchers or engineers to predict computational budgets and choose appropriate hyperparameters (ε*) when scaling up their models.
</details>
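The slopes quoted in such semilog legends can be reproduced by an ordinary least-squares fit of log10(updates) against dimension. A minimal sketch with illustrative numbers (not the figure's actual data):

```python
import numpy as np

# Illustrative (dimension, gradient-update) measurements, as if read off a semilog plot.
dims = np.array([45, 80, 120, 160, 200])
updates = np.array([150, 320, 620, 1150, 1950])

# A straight line on a semilog plot means log10(updates) is linear in the dimension:
# log10(y) = intercept + slope * d, i.e. y = 10**intercept * 10**(slope * d).
slope, intercept = np.polyfit(dims, np.log10(updates), 1)

# The slope is the base-10 exponential growth rate per unit of dimension.
print(round(slope, 4))
```

The fitted slope is what the figure legends report; its inverse gives the number of extra dimensions needed to multiply the cost by 10.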
<details>
<summary>x18.png Details</summary>

### Visual Description
## Log-Log Plot of Gradient Updates vs. Dimension
### Overview
The image is a scientific scatter plot with error bars and linear regression fits, presented on a log-log scale. It illustrates the relationship between the dimension of a problem (x-axis) and the number of gradient updates required (y-axis) for three different values of a parameter denoted as ε* (epsilon star). The plot demonstrates a power-law relationship, as indicated by the linear trends on the logarithmic axes.
### Components/Axes
* **X-Axis:**
* **Label:** `Dimension (log scale)`
* **Scale:** Logarithmic.
* **Major Tick Marks:** `4 × 10¹`, `6 × 10¹`, `10²`, `2 × 10²`. These correspond to the numerical values 40, 60, 100, and 200.
* **Y-Axis:**
* **Label:** `Gradient updates (log scale)`
* **Scale:** Logarithmic.
* **Major Tick Marks:** `10²`, `10³`. These correspond to the numerical values 100 and 1000.
* **Legend (Located in the top-left corner):**
* The legend contains six entries, pairing data series symbols with their corresponding linear fit lines.
* **Data Series:**
1. Blue circle with error bars: `ε* = 0.008`
2. Green square with error bars: `ε* = 0.01`
3. Red triangle with error bars: `ε* = 0.012`
* **Linear Fit Lines:**
1. Blue dashed line: `Linear fit: slope=1.2884`
2. Green dashed line: `Linear fit: slope=1.3823`
3. Red dashed line: `Linear fit: slope=1.5535`
### Detailed Analysis
The plot shows three distinct data series, each following an upward linear trend on the log-log scale. This indicates a power-law relationship of the form `Gradient updates ∝ (Dimension)^slope`.
**Trend Verification:** All three data series (blue, green, red) show a clear upward slope from left to right, confirming that the number of gradient updates increases with dimension.
**Data Series and Approximate Points:**
1. **Blue Series (ε* = 0.008):**
* Trend: Slopes upward with the shallowest slope of the three fits (1.2884).
* Approximate Data Points (Dimension, Gradient updates):
* (40, ~150)
* (60, ~250)
* (100, ~450)
* (150, ~700)
* (200, ~1000)
2. **Green Series (ε* = 0.01):**
* Trend: Slopes upward with a moderate slope (1.3823).
* Approximate Data Points (Dimension, Gradient updates):
* (40, ~130)
* (60, ~220)
* (100, ~400)
* (150, ~650)
* (200, ~950)
3. **Red Series (ε* = 0.012):**
* Trend: Slopes upward with the steepest slope (1.5535).
* Approximate Data Points (Dimension, Gradient updates):
* (40, ~110)
* (60, ~180)
* (100, ~350)
* (150, ~600)
* (200, ~1200) - *Note: This final red point has a notably large error bar extending significantly above the trend line.*
**Error Bars:** All data points include vertical error bars, indicating variability or uncertainty in the measured "Gradient updates." The error bars generally increase in size with dimension, with the red series (ε* = 0.012) showing the largest uncertainty at the highest dimension (200).
### Key Observations
1. **Consistent Scaling Law:** All three conditions follow a clear power-law scaling, evidenced by the linear fits on the log-log plot.
2. **Slope Dependence on ε*:** The exponent (slope) of the power law increases with the parameter ε*. The relationship is: higher ε* leads to a steeper slope (1.2884 < 1.3823 < 1.5535).
3. **Crossover:** At the lowest dimension (40), Blue (ε*=0.008) requires the *most* updates and Red (ε*=0.012) the *fewest*; by dimension 200 the red series has overtaken the others, so the ordering partially inverts as dimension increases.
4. **Increasing Uncertainty:** The variability in the number of gradient updates (size of error bars) tends to grow with the problem dimension for all series.
### Interpretation
This plot likely comes from an optimization or machine learning context, analyzing how the computational cost (gradient updates) scales with the problem size (dimension) for different algorithmic settings (ε*).
* **What the data suggests:** The number of gradient updates needed to solve a problem scales as a power law with the problem's dimension. The exponent of this scaling law is not fixed but is controlled by the parameter ε*. A larger ε* results in a worse (steeper) scaling exponent, meaning the computational cost grows more rapidly with dimension.
* **Relationship between elements:** The parameter ε* acts as a tuning knob that trades off performance at low dimensions versus scalability. While a larger ε* (red) is more efficient for small problems (dimension ~40), its steeper scaling makes it less efficient for large-scale problems. Based on the plotted points, the crossover occurs near the top of the measured range, around dimension 150-200.
* **Notable Anomalies:** The final red data point at dimension 200 is a potential outlier. Its central value lies above the fitted line, and its very large error bar suggests high instability or variance in the measurement for that specific condition (high ε*, high dimension). This could indicate the onset of a different regime or numerical challenges.
* **Underlying Implication:** For large-scale applications (high dimension), choosing a smaller ε* (e.g., 0.008) is crucial for better asymptotic performance, despite it being slightly less efficient for tiny problems. The analysis provides a quantitative basis for selecting algorithmic parameters based on the expected problem scale.
</details>
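On a log-log plot the power-law exponent is likewise the slope of a linear fit, now in log-log coordinates. A minimal sketch, again with illustrative values rather than the figure's data:

```python
import numpy as np

# Illustrative points, as if read off a log-log plot (dimension, gradient updates).
dims = np.array([40, 60, 100, 150, 200])
updates = np.array([150, 250, 450, 700, 1000])

# A straight line on a log-log plot means a power law y = C * d**k,
# so log(y) is linear in log(d) with slope k (the power-law exponent).
k, log_c = np.polyfit(np.log(dims), np.log(updates), 1)
print(round(k, 3))
```

An exponent slightly above 1, as in the legends, means a mildly super-linear growth of cost with dimension.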
<details>
<summary>x19.png Details</summary>

### Visual Description
## Scatter Plot with Error Bars and Linear Fits: Gradient Updates vs. Dimension
### Overview
The image is a technical scatter plot with error bars and overlaid linear regression lines. It displays the relationship between "Dimension" (x-axis) and "Gradient updates" on a logarithmic scale (y-axis) for three different experimental conditions, denoted by different values of ε* (epsilon star). The plot includes a legend and grid lines for reference.
### Components/Axes
* **Chart Type:** Scatter plot with error bars and linear fit lines.
* **X-Axis:**
* **Label:** "Dimension"
* **Scale:** Linear scale.
* **Range:** Approximately 40 to 260.
* **Major Ticks:** 50, 100, 150, 200, 250.
* **Y-Axis:**
* **Label:** "Gradient updates (log scale)"
* **Scale:** Logarithmic scale (base 10).
* **Range:** Approximately 50 to 1000.
* **Major Ticks:** 10² (100), 10³ (1000).
* **Legend:** Located in the bottom-right quadrant of the chart area.
* **Blue dashed line:** "Linear fit: slope=0.0090"
* **Green dashed line:** "Linear fit: slope=0.0090"
* **Red dashed line:** "Linear fit: slope=0.0088"
* **Blue circle marker with error bars:** "ε* = 0.008"
* **Green square marker with error bars:** "ε* = 0.01"
* **Red triangle marker with error bars:** "ε* = 0.012"
* **Data Series:** Three distinct series, each represented by a specific color, marker shape, and error bars.
* **Grid:** A light gray grid is present for both major x and y ticks.
### Detailed Analysis
**Trend Verification:** All three data series show a clear, consistent upward trend. As the Dimension increases, the Gradient updates increase. The relationship appears linear on this log-linear plot, indicating an exponential relationship in linear space.
**Data Series & Linear Fits:**
1. **Blue Series (ε* = 0.008):**
* **Marker:** Blue circles.
* **Trend:** The highest of the three series across all dimensions. The data points follow a steady upward slope.
* **Linear Fit:** Blue dashed line with a reported slope of 0.0090.
* **Approximate Data Points (Dimension, Gradient updates):** (40, ~90), (60, ~110), (80, ~130), (100, ~160), (120, ~190), (140, ~220), (160, ~260), (180, ~300), (200, ~350), (220, ~400), (240, ~460).
* **Error Bars:** Present for all points. The vertical span (uncertainty) appears to increase slightly with dimension.
2. **Green Series (ε* = 0.01):**
* **Marker:** Green squares.
* **Trend:** Positioned between the blue and red series. Follows a similar upward slope.
* **Linear Fit:** Green dashed line with a reported slope of 0.0090.
* **Approximate Data Points (Dimension, Gradient updates):** (40, ~80), (60, ~100), (80, ~120), (100, ~140), (120, ~170), (140, ~200), (160, ~230), (180, ~270), (200, ~310), (220, ~360), (240, ~410).
* **Error Bars:** Present for all points. The span is generally smaller than the blue series' error bars at corresponding dimensions.
3. **Red Series (ε* = 0.012):**
* **Marker:** Red triangles.
* **Trend:** The lowest of the three series across all dimensions. Follows a similar upward slope.
* **Linear Fit:** Red dashed line with a reported slope of 0.0088.
* **Approximate Data Points (Dimension, Gradient updates):** (40, ~60), (60, ~75), (80, ~90), (100, ~110), (120, ~130), (140, ~150), (160, ~180), (180, ~210), (200, ~240), (220, ~280), (240, ~320).
* **Error Bars:** Present for all points. The span appears comparable to or slightly larger than the green series' error bars.
### Key Observations
1. **Consistent Ordering:** For any given Dimension, the Gradient updates are consistently highest for ε* = 0.008 (blue), followed by ε* = 0.01 (green), and lowest for ε* = 0.012 (red). This suggests an inverse relationship between the ε* parameter and the magnitude of gradient updates.
2. **Parallel Trends:** The three linear fit lines are nearly parallel, with slopes of 0.0090, 0.0090, and 0.0088. This indicates that the *rate of increase* of gradient updates with respect to dimension is very similar across the three conditions, despite their different absolute levels.
3. **Logarithmic Scale Effect:** The y-axis is logarithmic. The straight-line fits on this plot imply that Gradient updates grow *exponentially* with Dimension in linear space.
4. **Error Bar Pattern:** Error bars are present for all data points, indicating measured variability or confidence intervals. Their size relative to the data values seems fairly consistent across the plot, though absolute span increases with the y-value.
### Interpretation
This chart demonstrates a clear, positive, and approximately exponential relationship between the dimensionality of a problem (or model) and the number of gradient updates required during an optimization process. The key finding is the role of the parameter ε*:
* **What the data suggests:** A smaller ε* value (0.008) leads to a significantly higher number of gradient updates compared to larger ε* values (0.01, 0.012) for the same dimension. This could imply that ε* is a tolerance or step-size related parameter where a smaller value demands more precise (and thus more numerous) updates.
* **How elements relate:** The dimension (x-axis) is the independent variable driving the increase in updates. The ε* parameter acts as a scaling factor, shifting the entire curve up or down without drastically changing its shape (slope). The linear fits quantify this relationship, showing the growth rate is robust across conditions.
* **Notable implications:** The near-identical slopes (0.0090 vs. 0.0088) are striking. They suggest that the fundamental scaling law connecting dimension to optimization cost (in updates) is largely invariant to the ε* parameter within the tested range. The primary effect of ε* is on the constant multiplier, not the scaling exponent. This is valuable for predicting computational cost: if you know the cost for one dimension and one ε*, you can estimate it for another dimension with high confidence, and for another ε* by applying a roughly constant scaling factor. The error bars remind us that these are empirical measurements with inherent variability.
</details>
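The extrapolation argument above can be made concrete: under the semilog model cost(d) = A · 10^(slope · d), one reference measurement fixes A, and nearly parallel fits mean that changing ε* only rescales A, not the slope. A sketch with illustrative numbers (not taken from the figure):

```python
# On a semilog plot, cost(d) = A * 10**(slope * d).
slope = 0.0090               # illustrative semilog slope
d_ref, cost_ref = 100, 160.0  # illustrative reference measurement
A = cost_ref / 10 ** (slope * d_ref)

def predicted_cost(d):
    """Predicted gradient updates at dimension d under the exponential model."""
    return A * 10 ** (slope * d)

# Because nearly parallel fits share the slope, the ratio of costs at two
# dimensions is independent of A, hence of epsilon*.
ratio = predicted_cost(200) / predicted_cost(100)
print(round(predicted_cost(200)), round(ratio, 2))
```

Switching to another ε* then amounts to multiplying all predictions by one constant factor, estimated from a single measurement at the new tolerance.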
<details>
<summary>x20.png Details</summary>

### Visual Description
## Scatter Plot with Linear Fits: Gradient Updates vs. Dimension
### Overview
This is a log-log scatter plot with error bars and overlaid linear regression lines. It visualizes the relationship between the dimensionality of a problem (x-axis) and the number of gradient updates required (y-axis) for three different error tolerance levels (ε*). The plot demonstrates a power-law scaling relationship.
### Components/Axes
* **X-Axis:** "Dimension (log scale)". Major tick marks are at `4 × 10¹` (40), `6 × 10¹` (60), `10²` (100), and `2 × 10²` (200). The scale is logarithmic.
* **Y-Axis:** "Gradient updates (log scale)". Major tick marks are at `10²` (100) and `10³` (1000). The scale is logarithmic.
* **Legend (Top-Left Corner):**
* **Linear Fits (Dashed Lines):**
* Blue dashed line: "Linear fit: slope=1.0114"
* Green dashed line: "Linear fit: slope=1.0306"
* Red dashed line: "Linear fit: slope=1.0967"
* **Data Series (Points with Error Bars):**
* Blue circle with error bars: `ε* = 0.008`
* Green circle with error bars: `ε* = 0.01`
* Red circle with error bars: `ε* = 0.012`
* **Grid:** A light gray grid is present for both major and minor ticks on both axes.
### Detailed Analysis
The plot contains three data series, each with 5 data points corresponding to approximate dimensions of 40, 60, 100, 150, and 200. All series show a positive, near-linear trend on this log-log scale.
**Trend Verification & Data Points (Approximate):**
1. **Blue Series (ε* = 0.008):** The line slopes upward consistently. Points (Dimension, Gradient Updates):
* (40, ~80)
* (60, ~120)
* (100, ~200)
* (150, ~300)
* (200, ~400)
* *Associated Linear Fit Slope: 1.0114*
2. **Green Series (ε* = 0.01):** The line slopes upward, positioned slightly below the blue series. Points (Dimension, Gradient Updates):
* (40, ~70)
* (60, ~100)
* (100, ~170)
* (150, ~250)
* (200, ~350)
* *Associated Linear Fit Slope: 1.0306*
3. **Red Series (ε* = 0.012):** The line slopes upward, positioned the lowest of the three. Points (Dimension, Gradient Updates):
* (40, ~50)
* (60, ~80)
* (100, ~130)
* (150, ~200)
* (200, ~280)
* *Associated Linear Fit Slope: 1.0967*
**Error Bars:** All data points have vertical error bars indicating variability. The size of the error bars generally increases with dimension for all series, with the largest bars appearing at dimension 200.
### Key Observations
1. **Consistent Hierarchy:** For any given dimension, a smaller error tolerance (ε*) requires more gradient updates. The order is consistently Blue (ε*=0.008) > Green (ε*=0.01) > Red (ε*=0.012).
2. **Power-Law Scaling:** The linear fits on the log-log plot indicate a relationship of the form: `Gradient Updates ∝ (Dimension)^slope`. All slopes are slightly greater than 1 (ranging from ~1.01 to ~1.10).
3. **Slope vs. Tolerance:** The slope of the linear fit increases as the error tolerance (ε*) increases. The red series (highest ε*) has the steepest slope (1.0967).
4. **Increasing Variance:** The uncertainty (error bar size) in the number of gradient updates grows as the problem dimension increases.
### Interpretation
This plot likely comes from an analysis of optimization algorithms (e.g., in machine learning or numerical analysis). It demonstrates how the computational cost (measured in gradient updates) scales with the problem's dimensionality for different target accuracy levels.
* **Core Finding:** The number of gradient updates scales nearly linearly with dimension (slope ≈ 1), but with a slight super-linear component. This is a favorable scaling property, suggesting the algorithm is relatively efficient in high dimensions.
* **Trade-off:** There is a clear trade-off between solution accuracy and computational cost. Demanding a smaller error (lower ε*) shifts the entire cost curve upward.
* **Increasing Slope with Tolerance:** The observation that the slope increases with ε* is subtle but important. It suggests that for looser tolerances (higher ε*), the cost grows *slightly faster* with dimension, so the cost advantage of a coarse solution shrinks as the problem scales up, though the effect is small over the measured range.
* **Practical Implication:** When scaling a problem to higher dimensions, one must budget for a near-linear increase in computational effort. The large error bars at high dimensions also indicate that performance becomes less predictable as dimensionality grows.
</details>
Figure 8: Semilog (Left) and log-log (Right) plots of the number of gradient updates needed to achieve a test loss below the threshold $\varepsilon^{*}<\varepsilon^{\rm uni}$ . The student network was trained with ADAM, with the batch size optimised for each point. The dataset was generated from a teacher network with ReLU activation and parameters $\Delta=10^{-4}$ for the Gaussian noise variance of the linear readout, $\gamma=0.5$ , and $\alpha=5.0$ , for which $\varepsilon^{\rm opt}-\Delta=1.115\times 10^{-5}$ . Points are obtained by averaging over 10 teacher/data instances, with error bars representing the standard deviation. Each row corresponds to a different distribution of the readouts, kept fixed during training. Top: homogeneous readouts, for which the error of the universal branch is $\varepsilon^{\rm uni}-\Delta=1.217\times 10^{-2}$ . Centre: Rademacher readouts, for which $\varepsilon^{\rm uni}-\Delta=1.218\times 10^{-2}$ . Bottom: Gaussian readouts, for which $\varepsilon^{\rm uni}-\Delta=1.210\times 10^{-2}$ . The quality of the fits can be read from Table 2.
| Readouts | | Exponential fit $\chi^{2}$ | | | Power-law fit $\chi^{2}$ | | |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Homogeneous | | $\bm{5.57}$ | $\bm{9.00}$ | $\bm{21.1}$ | $32.3$ | $26.5$ | $61.1$ |
| Rademacher | | $\bm{4.51}$ | $\bm{6.84}$ | $\bm{12.7}$ | $12.0$ | $17.4$ | $16.0$ |
| Uniform $[-\sqrt{3},\sqrt{3}]$ | | $\bm{5.08}$ | $\bm{1.44}$ | ${4.21}$ | $8.26$ | $8.57$ | $\bm{3.82}$ |
| Gaussian | | $2.66$ | $\bm{0.76}$ | $3.02$ | $\bm{0.55}$ | $2.31$ | $\bm{1.36}$ |
Table 2: $\chi^{2}$ test for exponential and power-law fits of the time needed by ADAM to reach the thresholds $\varepsilon^{*}$ , for various priors on the readouts. Fits are displayed in Figure 8. Smaller values of $\chi^{2}$ (in bold, for a given threshold and readout distribution) indicate better compatibility with the corresponding hypothesis.
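The model comparison behind such a table can be sketched as follows: fit each hypothesis by least squares in its linearising coordinates, then compare variance-weighted sums of squared residuals. The synthetic data below (drawn from a power law, with assumed per-point standard deviations) stand in for the paper's measurements:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic measurements following a power law y = 2 * d**1.3 with 5% noise.
dims = np.array([40.0, 60.0, 100.0, 150.0, 200.0])
sigma = 0.05 * 2 * dims ** 1.3          # assumed per-point standard deviations
y = 2 * dims ** 1.3 + rng.normal(0, sigma)

def chi2(model_y):
    """Sum of squared residuals, weighted by the measurement variance."""
    return float(np.sum(((y - model_y) / sigma) ** 2))

# Power-law hypothesis y = C * d**k: linear fit of log y vs log d.
k, log_c = np.polyfit(np.log(dims), np.log(y), 1)
chi2_pow = chi2(np.exp(log_c) * dims ** k)

# Exponential hypothesis y = a * exp(b * d): linear fit of log y vs d.
b, log_a = np.polyfit(dims, np.log(y), 1)
chi2_exp = chi2(np.exp(log_a) * np.exp(b * dims))

# The generating law is a power law, so chi2_pow should come out smaller.
print(chi2_pow < chi2_exp)
```

The smaller χ² flags the more compatible hypothesis, mirroring the bolding convention in the table.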
<details>
<summary>x21.png Details</summary>

### Visual Description
## Scatter Plot with Error Bars and Fit Lines: Gradient Updates vs. Dimension
### Overview
The image is a technical scatter plot chart displaying the relationship between "Dimension" (x-axis) and "Gradient updates" (y-axis). It includes experimental data points with error bars and two theoretical fit lines: an exponential fit and a power law fit. The chart is designed to compare how well these two mathematical models describe the observed data.
### Components/Axes
* **Chart Type:** Scatter plot with error bars and overlaid trend lines.
* **X-Axis:**
* **Label:** "Dimension"
* **Scale:** Linear scale.
* **Markers/Ticks:** 40, 60, 80, 100, 120, 140, 160, 180, 200.
* **Y-Axis:**
* **Label:** "Gradient updates"
* **Scale:** Linear scale.
* **Markers/Ticks:** 0, 1000, 2000, 3000, 4000, 5000, 6000, 7000.
* **Legend:**
* **Position:** Top-left corner of the plot area.
* **Entries:**
1. `--- Exponential fit` (Red dashed line)
2. `--- Power law fit` (Green dashed line)
3. `• Data` (Blue circle marker with vertical error bar)
* **Grid:** A light gray grid is present for both major x and y ticks.
### Detailed Analysis
**Data Points (Blue with Error Bars):**
The data shows a clear increasing trend. The approximate values (y) and their associated error ranges (±) are estimated from the chart:
* Dimension 40: ~500 (± ~50)
* Dimension 60: ~700 (± ~100)
* Dimension 80: ~1000 (± ~150)
* Dimension 100: ~1200 (± ~200)
* Dimension 120: ~1500 (± ~250)
* Dimension 140: ~1900 (± ~300)
* Dimension 160: ~2500 (± ~400)
* Dimension 180: ~4000 (± ~1300)
* Dimension 200: ~5900 (± ~1700)
**Trend Verification:**
* **Data Trend:** The data points follow a curve that accelerates upward as dimension increases.
* **Exponential Fit (Red Dashed Line):** This line starts near the first data point and curves upward, closely following the data's accelerating trend. It passes within the error bars of most points, especially at higher dimensions (160-200).
* **Power Law Fit (Green Dashed Line):** This line also starts near the first data point but follows a less severe upward curve. It begins to diverge below the data points and the exponential fit line starting around dimension 120-140. By dimension 200, it is significantly below the data point and the exponential fit.
### Key Observations
1. **Positive Correlation:** There is a strong positive, non-linear correlation between Dimension and Gradient updates.
2. **Model Comparison:** The exponential fit appears to be a better model for the data than the power law fit, particularly for dimensions greater than 120.
3. **Increasing Variance:** The size of the error bars (uncertainty) increases substantially with dimension, indicating greater variability or measurement difficulty at higher dimensions.
4. **Divergence Point:** The two fit lines are nearly indistinguishable below dimension 80 but begin to clearly diverge after dimension 100.
### Interpretation
The chart demonstrates that the number of gradient updates required grows **more than linearly** with the dimension of the problem (likely a machine learning model parameter space). The superior fit of the exponential model suggests this growth is potentially exponential, which has significant implications for computational cost and training time as model complexity (dimension) scales.
The increasing error bars at higher dimensions could indicate that the process becomes inherently more stochastic or harder to measure precisely as the system grows larger. The clear divergence between the power law and exponential models provides a visual test for theoretical predictions; in this case, the data supports an exponential scaling law over a power law for the observed range. This type of analysis is crucial for understanding the scalability of algorithms and predicting resource requirements for large-scale systems.
</details>
<details>
<summary>x22.png Details</summary>

### Visual Description
## Scatter Plot with Error Bars and Fitted Curves: Dimension vs. Gradient Updates
### Overview
The image is a 2D scatter plot with error bars, displaying the relationship between "Dimension" (x-axis) and "Gradient updates" (y-axis). It includes two fitted trend lines: an exponential fit and a power law fit. The plot is presented on a white background with a light gray grid.
### Components/Axes
* **X-Axis:**
* **Label:** "Dimension"
* **Scale:** Linear, ranging from approximately 40 to 225.
* **Major Tick Marks:** 50, 75, 100, 125, 150, 175, 200, 225.
* **Y-Axis:**
* **Label:** "Gradient updates"
* **Scale:** Linear, ranging from approximately 50 to 750.
* **Major Tick Marks:** 100, 200, 300, 400, 500, 600, 700.
* **Legend:**
* **Position:** Top-left corner of the plot area.
* **Entries:**
1. `--- Exponential fit` (Red dashed line)
2. `--- Power law fit` (Green dashed line)
3. `• Data` (Blue dot with vertical error bar)
* **Data Series:**
* **Data Points:** 8 blue circular markers, each with vertical error bars representing uncertainty or variance.
* **Fitted Lines:** Two dashed curves (red and green) representing mathematical models fitted to the data.
### Detailed Analysis
**Data Points (Approximate Values):**
The following table lists the approximate coordinates for each data point (blue dot) and the extent of its error bars. Values are estimated from the grid.
| Dimension (x) | Gradient Updates (y) | Error Bar Range (y_min to y_max) |
| :--- | :--- | :--- |
| ~50 | ~80 | ~60 to ~100 |
| ~75 | ~120 | ~100 to ~140 |
| ~100 | ~160 | ~140 to ~180 |
| ~125 | ~220 | ~180 to ~260 |
| ~150 | ~260 | ~220 to ~300 |
| ~175 | ~280 | ~240 to ~320 |
| ~200 | ~360 | ~300 to ~420 |
| ~225 | ~480 | ~240 to ~720 |
**Fitted Curves:**
* **Exponential Fit (Red Dashed Line):** Starts near y=100 at x=50. It curves upward, passing slightly above most data points in the mid-range and ending near y=550 at x=225. The trend is accelerating growth.
* **Power Law Fit (Green Dashed Line):** Starts near y=90 at x=50. It follows a more gradual, concave-upward curve, passing close to the data points and ending near y=470 at x=225. The trend is also increasing but at a slower rate than the exponential model at higher dimensions.
**Trend Verification:**
* **Data Trend:** The blue data points show a clear, positive, non-linear trend. As Dimension increases, Gradient updates increase. The rate of increase appears to grow with dimension.
* **Error Bar Trend:** The vertical spread of the error bars generally increases with Dimension. The uncertainty is smallest at low dimensions and becomes very large at the highest dimension (x=225), where the error bar spans nearly 480 units.
### Key Observations
1. **Positive Correlation:** There is a strong positive correlation between Dimension and Gradient updates.
2. **Increasing Variance:** The uncertainty (error bar size) in the Gradient updates measurement grows substantially as Dimension increases, peaking dramatically at the final data point (Dimension ~225).
3. **Model Comparison:** Both the exponential and power law models capture the increasing trend. The exponential fit predicts higher values at the upper end of the dimension range compared to the power law fit.
4. **Outlier/Uncertainty:** The data point at Dimension ~225 is notable for its extremely large error bar, suggesting high variability or measurement difficulty at this scale.
### Interpretation
The data suggests that the computational cost or effort required for a process (measured in "Gradient updates") increases non-linearly with the problem's "Dimension." This is a common pattern in optimization and machine learning, where higher-dimensional spaces are more complex to navigate.
The choice between an exponential or power law model has implications. An exponential increase would indicate that costs become prohibitively expensive very quickly as dimensions grow. A power law increase, while still growing, suggests a more manageable, albeit still significant, scaling behavior. The large error at the highest dimension indicates that predictions become much less reliable at the extremes of the tested range, which could be due to increased stochasticity, numerical instability, or simply fewer samples at that scale.
The plot effectively communicates that dimensionality is a critical factor impacting the measured outcome, and that any model or system operating in this space must account for this non-linear scaling and the associated increase in uncertainty.
</details>
Figure 9: Same as in Fig. 8, but in linear scale for better visualisation, for homogeneous readouts (Left) and Gaussian readouts (Right), with threshold $\varepsilon^{*}=0.008$ .
<details>
<summary>x23.png Details</summary>

Line chart of the test loss (y-axis, 0 to 0.06, linear scale) versus gradient updates (x-axis, 0 to 6000) for dimensions $d=60$ to $d=180$ (colour gradient from light orange to dark red), with horizontal dashed reference lines at $\varepsilon^{\rm uni}$, $2\varepsilon^{\rm uni}$ and $\varepsilon^{\rm opt}$. All curves drop rapidly from $\approx 0.06$; the smallest dimensions ($d=60,80$) settle quickly at the lowest final loss, while larger dimensions develop an intermediate hump and plateau before slowly descending, with visibly wider one-standard-deviation bands. The loss at 6000 updates is ordered monotonically in $d$.
</details>
<details>
<summary>x24.png Details</summary>

Line chart of the test loss (y-axis, 0 to 0.06) versus gradient updates (x-axis, 0 to 2000) for dimensions $d=60$ to $d=180$, with horizontal dashed reference lines at $2\varepsilon^{\rm uni}\approx 0.024$, $\varepsilon^{\rm uni}\approx 0.012$ and $\varepsilon^{\rm opt}\approx 0.006$. All curves fall steeply within the first $\sim$100 updates and then descend more slowly and noisily. At any given time, larger $d$ sits at a higher loss and crosses the reference thresholds later; by 2000 updates the smaller dimensions cluster near or below $\varepsilon^{\rm opt}$ while the larger ones remain near $\varepsilon^{\rm uni}$.
</details>
<details>
<summary>x25.png Details</summary>

Line chart of the test loss (y-axis, 0 to 0.06) versus gradient updates (x-axis, 0 to 600) for dimensions $d=60$ to $d=180$, with horizontal dashed reference lines at $2\varepsilon^{\rm uni}$, $\varepsilon^{\rm uni}$ and $\varepsilon^{\rm opt}$ (the latter close to the $x$-axis). All curves decay smoothly and monotonically, with larger $d$ converging slightly later; by 600 updates all dimensions have reached a loss below $\approx 0.005$, approaching $\varepsilon^{\rm opt}$, with no plateaux or oscillations.
</details>
Figure 10: Trajectories of the generalisation error of neural networks trained with ADAM at fixed batch size $B=\lfloor n/4\rfloor$ , learning rate 0.05, for ReLU activation with parameters $\Delta=10^{-4}$ for the linear readout, $\gamma=0.5$ and $\alpha=5.0>\alpha_{\rm sp}$ ( $=0.22,0.12,0.02$ for homogeneous, Rademacher and Gaussian readouts respectively). The error $\varepsilon^{\rm uni}$ is the mean-square generalisation error associated to the universal solution with overlap $\mathcal{Q}_{W}\equiv 0$ . Left: Homogeneous readouts. Centre: Rademacher readouts. Right: Gaussian readouts. Readouts are kept fixed (and equal to the teacher’s) in all cases during training. Points on the solid lines are obtained by averaging over 5 teacher/data instances, and shaded regions around them correspond to one standard deviation.
We now provide empirical evidence on the computational complexity of attaining specialisation, namely of reaching $\mathcal{Q}_{W}(\mathsf{v})>0$ for at least one $\mathsf{v}$, or equivalently of beating the “universal” performance ( $\mathcal{Q}_{W}(\mathsf{v})=0$ for all $\mathsf{v}\in\mathsf{V}$ ) in terms of generalisation error. We tested whether two algorithms can find it in affordable computational time: ADAM with the batch size optimised for every dimension tested (the learning rate is automatically tuned), and Hamiltonian Monte Carlo (HMC), both trying to infer a two-layer teacher network with Gaussian inner weights.
#### ADAM
We focus on ReLU activation, with $\gamma=0.5$ , a Gaussian output channel with low label noise ( $\Delta=10^{-4}$ ) and $\alpha=5.0>\alpha_{\rm sp}$ ( $=0.22,0.12,0.02$ for homogeneous, Rademacher and Gaussian readouts respectively; we are thus deep in the specialisation phase in all the cases we report), so that the specialisation solution exhibits a very low generalisation error. We evaluate the learnt model at each gradient update, measuring the generalisation error with a 10-step moving average to smooth the curves. Defining $\varepsilon^{\rm uni}$ as the generalisation error associated to the overlap $\mathcal{Q}_{W}\equiv 0$ and fixing a threshold $\varepsilon^{\rm opt}<\varepsilon^{*}<\varepsilon^{\rm uni}$ , we let $t^{*}(d)$ be the time (in gradient updates) needed for the algorithm to cross the threshold for the first time. We optimise over different batch sizes $B_{p}=\left\lfloor\frac{n}{2^{p}}\right\rfloor,\quad p=2,3,\dots,\lfloor\log_{2}(n)\rfloor-1$ : for each batch size, the student network is trained until the moving average of the test loss drops below $\varepsilon^{*}$ and thus outperforms the universal solution (we checked that in such a scenario the student ultimately gets close to the performance of the specialisation solution), and the batch size requiring the fewest gradient updates is selected. We used the ADAM routine implemented in PyTorch.
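The batch-size grid and the threshold-crossing time $t^{*}(d)$ described above can be sketched as follows; this is a minimal illustration assuming the test-loss history is available as an array, and the function names are ours, not taken from the paper's code:

```python
import numpy as np

def candidate_batch_sizes(n):
    """Batch sizes B_p = floor(n / 2^p) for p = 2, ..., floor(log2 n) - 1,
    as in the protocol described in the text."""
    p_max = int(np.floor(np.log2(n))) - 1
    return [n // 2**p for p in range(2, p_max + 1)]

def first_crossing(test_losses, eps_star, window=10):
    """Return the first gradient update at which the `window`-step moving
    average of the test loss drops below eps_star, or None if it never does."""
    losses = np.asarray(test_losses, dtype=float)
    if len(losses) < window:
        return None
    moving_avg = np.convolve(losses, np.ones(window) / window, mode="valid")
    below = np.flatnonzero(moving_avg < eps_star)
    # Index of the last raw loss entering the first sub-threshold average.
    return None if below.size == 0 else int(below[0]) + window - 1
```

The selected batch size is then the one whose `first_crossing` is smallest across the grid.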
We test different distributions for the readout weights (kept fixed to ${\mathbf{v}}$ during training of the inner weights). We report all the values of $t^{*}(d)$ in Fig. 8 for various dimensions $d$ at fixed $(\alpha,\gamma)$ , providing an exponential fit $t^{*}(d)=\exp(ad+b)$ (left panel) and a power-law fit $t^{*}(d)=ad^{b}$ (right panel). We report the $\chi^{2}$ test for the fits in Table 2. We observe that for homogeneous and Rademacher readouts, the exponential fit is more compatible with the experiments, while for Gaussian readouts the comparison is inconclusive.
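The two competing fits can be obtained by ordinary least squares in log space, since $\log t^{*}$ is linear in $d$ for the exponential model and linear in $\log d$ for the power law. The sketch below is only illustrative (the fits reported in Table 2 may use weighted nonlinear least squares); the $\chi^{2}$ helper assumes per-point uncertainties are available:

```python
import numpy as np

def fit_exponential(d, t_star):
    """Fit log t* = a*d + b, i.e. t*(d) = exp(a*d + b); returns (a, b)."""
    a, b = np.polyfit(d, np.log(t_star), 1)
    return a, b

def fit_power_law(d, t_star):
    """Fit log t* = b*log d + log a, i.e. t*(d) = a * d**b; returns (a, b)."""
    b, log_a = np.polyfit(np.log(d), np.log(t_star), 1)
    return np.exp(log_a), b

def chi2(t_obs, t_fit, sigma):
    """Chi-squared of a fit given per-point uncertainties sigma."""
    return float(np.sum(((t_obs - t_fit) / sigma) ** 2))
```

Comparing `chi2` for the two models on the measured $t^{*}(d)$ then indicates which scaling is more compatible with the data.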
In Fig. 10, we report the test loss of ADAM as a function of the number of gradient updates used for training, for various dimensions and choices of the readout distribution (as before, the readouts are not learnt but fixed to the teacher’s). Here, we fix a single batch size for simplicity. For both homogeneous ( ${\mathbf{v}}=\bm{1}$ ) and Rademacher readouts (left and centre panels), the model experiences performance plateaux whose duration increases with the system size, in accordance with the exponential complexity reported above. The plateaux occur at values of the test loss comparable with twice the Bayes error predicted by the universal branch of our theory (recall the relationship between Gibbs and Bayes errors reported in App. C). The curves are smoother for Gaussian readouts.
#### Hamiltonian Monte Carlo
<details>
<summary>x26.png Details</summary>

Line chart of the overlap $q_{2}$ (y-axis, 0.80 to 0.95) versus HMC step (x-axis, 0 to 4000) for dimensions $d=120$ to $d=240$ (colour gradient from red to black), with horizontal dashed lines marking the Bayes-optimal prediction (“theory”, red, $\approx 0.95$ ) and the $\mathcal{Q}_{W}\equiv 0$ prediction (“universal”, black, $\approx 0.88$ ). All curves rise steeply above the universal line within the first few hundred steps and then increase slowly. The value reached by step 4000 decreases monotonically with $d$ , from $\approx 0.945$ at $d=120$ down to $\approx 0.915$ at $d=240$ , so only the smallest dimensions approach the theory line within the run.
</details>
<details>
<summary>x27.png Details</summary>

Same format as the previous panel, with $x$ -axis from 0 to 2000 HMC steps and $y$ -axis from 0.75 to 0.95; the universal line sits at $\approx 0.87$ and the theory line at $\approx 0.95$ . All curves start at the universal value, and both the rate of increase and the value reached by step 2000 decrease monotonically with $d$ : $d=120$ nearly reaches the theory line, whereas $d=240$ only attains $\approx 0.92$ and is still visibly rising at the end of the plot.
</details>
<details>
<summary>x28.png Details</summary>

Same format, with $x$ -axis from 0 to 2000 HMC steps and $y$ -axis from 0.80 to about 0.96; the universal line sits at $\approx 0.90$ and the theory line at $\approx 0.96$ . Here the curves for all dimensions are nearly indistinguishable: they rise sharply within the first $\sim$ 300 steps and converge together to $\approx 0.955$ , just below the theory line, with no visible dependence on $d$ .
</details>
Figure 11: Trajectories of the overlap $q_{2}$ in HMC runs initialised uninformatively for the polynomial activation $\sigma_{3}={\rm He}_{2}/\sqrt{2}+{\rm He}_{3}/6$ with parameters $\Delta=0.1$ for the linear readout, $\gamma=0.5$ and $\alpha=1.0$. Left: Homogeneous readouts. Centre: Rademacher readouts. Right: Gaussian readouts. Points on the solid lines are obtained by averaging over 10 teacher/data instances, and shaded regions around them correspond to one standard deviation. Notice that the $y$-axes are limited for better visualisation. For the left and centre plots, any threshold (horizontal line in the plot) between the prediction of the $\mathcal{Q}_{W}\equiv 0$ branch of our theory (black dashed line) and its prediction for the Bayes-optimal $q_{2}$ (red dashed line) crosses the curves at times $t^{*}(d)$ whose scaling with $d$ is more compatible with an exponential fit (see Fig. 12 and Table 3, where these fits are reported and $\chi^{2}$-tested). For the cases of homogeneous and Rademacher readouts, the value of the overlap at which the dynamics slows down (predicted by the $\mathcal{Q}_{W}\equiv 0$ branch) is in quantitative agreement with the theoretical predictions (lower dashed line). The theory is instead off by $\approx 1\%$ for the values of $q_{2}$ at which the runs ultimately converge.
The experiment is performed for the polynomial activation $\sigma_{3}={\rm He}_{2}/\sqrt{2}+{\rm He}_{3}/6$ with parameters $\Delta=0.1$ for the Gaussian noise in the linear readout, $\gamma=0.5$ and $\alpha=1.0>\alpha_{\rm sp}$ ($=0.26, 0.30, 0.02$ for homogeneous, Rademacher and Gaussian readouts respectively). Our HMC consists of $4000$ iterations for homogeneous readouts, or $2000$ iterations for Rademacher and Gaussian readouts. Each iteration uses an adaptive step size (initialised at $0.01$) and $10$ leapfrog steps. Instead of measuring the Gibbs error, whose relationship with $\varepsilon^{\rm opt}$ holds only at equilibrium (see the last remark in App. C), we measured the teacher-student $q_{2}$-overlap, which is meaningful at any HMC step and is informative about the learning. For a fixed threshold $q_{2}^{*}$ and dimension $d$, we measure $t^{*}(d)$ as the number of HMC iterations needed for the $q_{2}$-overlap between the HMC sample (obtained from uninformative initialisation) and the teacher weights ${\mathbf{W}}^{0}$ to cross the threshold. This criterion is again enough to assess that the student outperforms the universal solution.
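The threshold-crossing measurement just described can be sketched as follows. This is a minimal illustration on a synthetic overlap trajectory, not the actual experimental pipeline; the function name and the toy relaxation dynamics are our own assumptions.

```python
import numpy as np

def first_crossing_time(q2_traj, q2_star):
    """First HMC step at which the overlap trajectory reaches the
    threshold q2_star; None if it never does within the run."""
    above = np.flatnonzero(np.asarray(q2_traj) >= q2_star)
    return int(above[0]) if above.size else None

# Toy trajectory mimicking Fig. 11: overlap rising from ~0.80
# towards a plateau near 0.955 (purely illustrative dynamics).
steps = np.arange(2000)
q2_traj = 0.955 - 0.155 * np.exp(-steps / 150.0)
t_star = first_crossing_time(q2_traj, q2_star=0.93)
```

Repeating such a measurement over several dimensions $d$ yields the samples $t^{*}(d)$ that are then fitted against exponential and polynomial growth laws.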
As before, we test homogeneous, Rademacher and Gaussian readouts, reaching the same conclusions: while for homogeneous and Rademacher readouts exponential time is more compatible with the observations, the experiments remain inconclusive for Gaussian readouts (see Fig. 12). We report in Fig. 11 the values of the overlap $q_{2}$ measured along the HMC runs for different dimensions. Note that, as the HMC steps accumulate, all $q_{2}$ curves saturate to a value that is off by $\approx 1\%$ w.r.t. that predicted by our theory for the selected values of $\alpha,\gamma$ and $\Delta$. Whether this is a finite-size effect, or an effect not taken into account by the current theory, is an interesting question requiring further investigation; see App. E.2 for possible directions.
<details>
<summary>x29.png Details</summary>

### Visual Description
## Scatter Plot with Linear Fits: Dimension vs. Number of MC Steps
### Overview
The image is a scientific scatter plot with error bars and overlaid linear regression lines. It illustrates the relationship between a system's "Dimension" (x-axis) and the "Number of MC (Monte Carlo) steps" required (y-axis, on a logarithmic scale). Three distinct data series are plotted, each with its own linear fit, suggesting a study of how computational cost scales with system size under different conditions or parameters.
### Components/Axes
* **X-Axis:**
* **Label:** "Dimension"
* **Scale:** Linear scale.
* **Range & Ticks:** Values from 80 to 240, with major tick marks at intervals of 20 (80, 100, 120, 140, 160, 180, 200, 220, 240).
* **Y-Axis:**
* **Label:** "Number of MC steps (log scale)"
* **Scale:** Logarithmic (base 10) scale.
* **Range & Ticks:** Values from 10² (100) to just above 10³ (1000). Major grid lines are at 10² and 10³. Minor grid lines are visible between them.
* **Legend (Position: Top-Left Corner):**
* Contains three entries, each with a colored dashed line and a corresponding colored data point symbol.
* **Blue (dashed line, circle marker):** "Linear fit: slope=0.0167", "q₂² = 0.903"
* **Green (dashed line, square marker):** "Linear fit: slope=0.0175", "q₂² = 0.906"
* **Red (dashed line, triangle marker):** "Linear fit: slope=0.0174", "q₂² = 0.909"
* **Data Series:** Each series consists of data points with vertical error bars at each dimension value (80, 100, 120, 140, 160, 180, 200, 220, 240). The points are connected by their respective colored, dashed linear fit lines.
### Detailed Analysis
**Trend Verification:** All three data series show a clear, consistent upward trend. As the Dimension increases, the Number of MC steps increases. On this log-linear plot, the data points for each series follow a roughly straight line, indicating an exponential relationship between Dimension and MC steps.
**Data Point Extraction (Approximate Values from Log Scale):**
* **At Dimension = 80:**
* Blue: ~100 (10²)
* Green: ~120
* Red: ~150
* **At Dimension = 160:**
* Blue: ~300
* Green: ~400
* Red: ~500
* **At Dimension = 240:**
* Blue: ~800
* Green: ~900
* Red: ~1000 (10³)
**Linear Fit Parameters:**
* The slopes of the linear fits (on the log scale) are very similar: 0.0167 (Blue), 0.0175 (Green), and 0.0174 (Red). This indicates the exponential growth rate is nearly identical across the three conditions.
* The q₂² values (0.903, 0.906, 0.909) label the three series. In the context of the accompanying text they are the overlap thresholds $q_{2}^{*}$ whose crossing times are being measured, not goodness-of-fit statistics; the quality of the fits is assessed separately via $\chi^{2}$ tests (Table 3).
**Spatial Grounding & Color Matching:**
* The legend is positioned in the top-left quadrant of the plot area.
* The blue circle markers and dashed line correspond to the lowest set of data points and the fit with the smallest slope (0.0167).
* The green square markers and dashed line correspond to the middle set of data points.
* The red triangle markers and dashed line correspond to the highest set of data points and the fit with the largest q₂² value (0.909).
### Key Observations
1. **Consistent Hierarchy:** The red series consistently requires the most MC steps for any given dimension, followed by green, then blue. This ordering is maintained across the entire range.
2. **Parallel Trends:** The three linear fit lines are nearly parallel, suggesting the fundamental scaling law (exponential growth with dimension) is the same for all three cases, but with different constant prefactors (intercepts on the log scale).
3. **Increasing Variance:** The vertical error bars appear to grow slightly larger as the Dimension increases, indicating greater absolute uncertainty or variability in the number of MC steps for larger systems.
4. **Threshold Dependence:** Higher overlap thresholds q₂² sit higher on the plot: reaching a larger overlap requires more MC steps at every dimension.
### Interpretation
This chart demonstrates that the computational cost (measured in Monte Carlo steps) of simulating or analyzing a system grows exponentially with its dimensionality. This is a common phenomenon in complex systems and high-dimensional statistics, often referred to as the "curse of dimensionality."
The three series correspond to three overlap thresholds q₂² (0.903, 0.906, 0.909). While they all obey the same exponential scaling law (similar slopes), the highest threshold (red) is consistently the most computationally expensive to reach, and the lowest (blue) the least. The slight increase in error bar size with dimension suggests that not only does the mean cost increase, but the variability of the crossing time also grows for larger systems. This information is critical for resource planning in high-dimensional computational studies.
</details>
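The fits shown in these panels compare an exponential hypothesis (log t* linear in d, semi-log axes) against a power-law hypothesis (log t* linear in log d, log-log axes). A minimal sketch of that comparison on synthetic data follows; the generating rate 0.017 and the use of natural logarithms are our own assumptions, and the slopes quoted in the figure legends depend on the logarithm base used.

```python
import numpy as np

# Synthetic crossing times t*(d), generated with exponential growth.
d = np.array([80, 100, 120, 140, 160, 180, 200, 220, 240], dtype=float)
t_star = 10.0 * np.exp(0.017 * d)

# Exponential hypothesis: fit log t* against d (linear on semi-log axes).
slope_exp, intercept_exp = np.polyfit(d, np.log(t_star), 1)

# Power-law hypothesis: fit log t* against log d (linear on log-log axes).
slope_pow, intercept_pow = np.polyfit(np.log(d), np.log(t_star), 1)
```

On data that truly grows exponentially, the semi-log fit recovers the generating rate exactly, while the log-log fit leaves a systematic residual; a $\chi^{2}$ test on the residuals (as in Table 3) then discriminates the two hypotheses.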
<details>
<summary>x30.png Details</summary>

### Visual Description
## Log-Log Scatter Plot with Linear Fits: Scaling of Monte Carlo Steps vs. Dimension
### Overview
The image is a scientific log-log plot showing the relationship between the dimension of a problem (x-axis) and the number of Monte Carlo (MC) steps required (y-axis). Three distinct data series are plotted, each with error bars and a corresponding linear fit line. The plot is used to test a power-law relationship between the variables.
### Components/Axes
* **Chart Type:** Scatter plot with error bars and linear regression lines on a log-log scale.
* **X-Axis:**
* **Label:** `Dimension (log scale)`
* **Scale:** Logarithmic. Major tick marks are visible at `10^2` (100) and `2 x 10^2` (200). The axis spans approximately from 80 to 300.
* **Y-Axis:**
* **Label:** `Number of MC steps (log scale)`
* **Scale:** Logarithmic. Major tick marks are visible at `10^2` (100) and `10^3` (1000). The axis spans approximately from 80 to 4000.
* **Legend:** Located in the top-left corner of the plot area. It contains six entries, pairing a line style/color with a statistical parameter.
* **Entry 1:** `--- Linear fit: slope=2.4082` (Blue dashed line)
* **Entry 2:** `--- Linear fit: slope=2.5207` (Green dashed line)
* **Entry 3:** `--- Linear fit: slope=2.5297` (Red dashed line)
* **Entry 4:** `● q²_a = 0.903` (Blue circle marker)
* **Entry 5:** `■ q²_a = 0.906` (Green square marker)
* **Entry 6:** `▲ q²_a = 0.909` (Red triangle marker)
* **Data Series:** Three series of data points with vertical error bars.
* **Series 1 (Blue Circles):** Corresponds to `q²_a = 0.903` and the blue linear fit (slope=2.4082).
* **Series 2 (Green Squares):** Corresponds to `q²_a = 0.906` and the green linear fit (slope=2.5207).
* **Series 3 (Red Triangles):** Corresponds to `q²_a = 0.909` and the red linear fit (slope=2.5297).
* **Grid:** A light gray grid is present in the background.
### Detailed Analysis
* **Trend Verification:** All three data series show a clear, strong positive linear trend on the log-log plot. This would indicate a power-law relationship: `Number of MC steps ∝ (Dimension)^slope`. The lines are nearly parallel, with slopes increasing slightly with the parameter `q²_a`.
* **Data Point Extraction (Approximate Values):**
* **At Dimension ≈ 100:**
- Blue (q²=0.903): ~120 MC steps
- Green (q²=0.906): ~150 MC steps
- Red (q²=0.909): ~180 MC steps
* **At Dimension ≈ 200:**
- Blue (q²=0.903): ~800 MC steps
- Green (q²=0.906): ~1000 MC steps
- Red (q²=0.909): ~1200 MC steps
* **At Dimension ≈ 250 (rightmost points):**
- Blue (q²=0.903): ~1500 MC steps
- Green (q²=0.906): ~1900 MC steps
- Red (q²=0.909): ~2300 MC steps
* **Error Bars:** Vertical error bars are present on all data points, indicating the uncertainty or variance in the measured "Number of MC steps." The relative size of the error bars appears consistent across the dimension range for each series.
* **Linear Fit Parameters:**
* The fitted slope increases with `q²_a`: 2.4082 (blue) < 2.5207 (green) < 2.5297 (red).
* The `q²_a` values (0.903, 0.906, 0.909) label the three overlap thresholds defining the series; they are not goodness-of-fit statistics. The visual alignment of the points with the dashed lines nevertheless suggests the linear model on the log-log scale describes the data well.
### Key Observations
1. **Consistent Power-Law Scaling:** The data for all three conditions follows a power law with an exponent (slope) between approximately 2.41 and 2.53.
2. **Systematic Parameter Dependence:** There is a clear, monotonic relationship between the threshold `q²_a` and both the vertical offset (intercept) and the scaling exponent (slope) of the data. A higher `q²_a` leads to more MC steps at any given dimension and a slightly steeper increase with dimension.
3. **High-Quality Fits:** The visual alignment of points with the dashed lines indicates the linear fits are reliable models for the observed data on these axes.
4. **Log-Log Linearity:** The straight-line behavior on the log-log plot is the defining characteristic, confirming the multiplicative/power-law nature of the relationship.
### Interpretation
This plot likely comes from a computational physics or statistics context, analyzing the cost (in Monte Carlo steps) of an algorithm as the problem dimension increases.
* **What the data suggests:** Read on these axes, the number of computational steps scales polynomially with the problem dimension (`MC steps ∝ Dimension^k`, with k ≈ 2.4-2.5), worse than linear but better than cubic. The accompanying text compares this power-law reading against an exponential one on semi-log axes.
* **How elements relate:** The parameter `q²_a` is the overlap threshold whose crossing time is measured. Demanding a higher `q²_a` means demanding a better reconstruction, and comes at the direct cost of more MC steps; this cost grows slightly faster with problem size.
* **Notable trends/anomalies:** The most notable trend is the tight coupling between `q²_a`, the intercept, and the slope. There are no obvious outliers; the data is remarkably consistent. The slight increase in slope with `q²_a` is a subtle but important detail, indicating that the penalty for a higher threshold becomes marginally more severe for larger problems.
* **Underlying Implication:** The chart provides a quantitative trade-off analysis: one could predict the computational cost for a given dimension and desired threshold `q²_a`, balancing cost (MC steps) against the quality of the reconstruction the threshold demands.
</details>
<details>
<summary>x31.png Details</summary>

### Visual Description
## Scatter Plot with Linear Fits: Number of Monte Carlo Steps vs. Dimension
### Overview
The image is a scientific scatter plot on a semi-logarithmic scale (log scale on the y-axis). It displays the relationship between the "Dimension" of a system (x-axis) and the "Number of MC (Monte Carlo) steps" required (y-axis) for three different data series. Each data series is represented by colored points with error bars and is accompanied by a dashed linear fit line. The plot includes a legend in the bottom-right quadrant.
### Components/Axes
* **X-Axis:**
* **Label:** "Dimension"
* **Scale:** Linear scale.
* **Range:** 80 to 240.
* **Major Tick Marks:** 80, 100, 120, 140, 160, 180, 200, 220, 240.
* **Y-Axis:**
* **Label:** "Number of MC steps (log scale)"
* **Scale:** Logarithmic scale (base 10).
* **Range:** Approximately 100 (10²) to over 1000 (10³).
* **Major Tick Marks:** 10², 10³.
* **Legend (Bottom-Right):**
* **Position:** Located in the lower right area of the plot, inside the axes.
* **Content:** Contains six entries, pairing line styles with data point symbols and their corresponding statistical values.
* **Entries (from top to bottom):**
1. `---` (Blue dashed line): "Linear fit: slope=0.0136"
2. `---` (Green dashed line): "Linear fit: slope=0.0140"
3. `---` (Red dashed line): "Linear fit: slope=0.0138"
4. `●` (Blue circle with error bar): "q₂² = 0.897"
5. `■` (Green square with error bar): "q₂² = 0.904"
6. `▲` (Red triangle with error bar): "q₂² = 0.911"
### Detailed Analysis
The plot shows three distinct data series, each following a clear upward trend on the semi-log plot, indicating an exponential relationship between Dimension and the Number of MC steps.
1. **Blue Series (Circle markers, `●`):**
* **Trend:** The data points follow a straight line sloping upward from left to right on the semi-log plot.
* **Linear Fit:** Represented by the blue dashed line. The legend states the slope of the fit is **0.0136**.
* **Series Label:** q₂² = **0.897**, the overlap threshold defining this series (not a coefficient of determination).
* **Approximate Data Points (y-values are log-scale estimates):**
* Dimension 80: ~100 MC steps
* Dimension 160: ~300 MC steps
* Dimension 240: ~1000 MC steps
* **Error Bars:** Vertical error bars are present on each data point. Their size appears to increase slightly with dimension.
2. **Green Series (Square markers, `■`):**
* **Trend:** Follows a similar upward linear trend on the semi-log plot, positioned above the blue series.
* **Linear Fit:** Represented by the green dashed line. The legend states the slope is **0.0140**.
* **Series Label:** q₂² = **0.904**, the overlap threshold defining this series.
* **Approximate Data Points:**
* Dimension 80: ~120 MC steps
* Dimension 160: ~400 MC steps
* Dimension 240: ~1500 MC steps
* **Error Bars:** Vertical error bars are present, generally larger than those for the blue series at corresponding dimensions.
3. **Red Series (Triangle markers, `▲`):**
* **Trend:** Follows the highest upward linear trend on the semi-log plot.
* **Linear Fit:** Represented by the red dashed line. The legend states the slope is **0.0138**.
* **Series Label:** q₂² = **0.911**, the overlap threshold defining this series.
* **Approximate Data Points:**
* Dimension 80: ~150 MC steps
* Dimension 160: ~500 MC steps
* Dimension 240: ~2000 MC steps
* **Error Bars:** Vertical error bars are present and are the largest among the three series at each dimension.
### Key Observations
* **Consistent Scaling:** All three data series exhibit a strong linear relationship on the semi-log plot, confirming that the Number of MC steps grows exponentially with the Dimension.
* **Similar Slopes:** The fitted slopes for the three series are very close (0.0136, 0.0140, 0.0138), suggesting a similar exponential scaling factor across the different conditions (represented by the different q₂² values).
* **Hierarchy:** The series are consistently ordered: Red (highest MC steps) > Green > Blue (lowest MC steps) at every dimension point.
* **Increasing Variance:** The size of the error bars for all series tends to increase as the Dimension increases, indicating greater variability or uncertainty in the measurement at higher dimensions.
* **Threshold Labels:** The q₂² values (0.897, 0.904, 0.911) are the overlap thresholds defining the three series, not goodness-of-fit statistics; visually, the linear model fits the log-transformed data well in all three cases.
### Interpretation
This chart demonstrates a fundamental scaling law in the system being studied. The exponential increase in the required Monte Carlo steps with dimension is a classic signature of the "curse of dimensionality," where computational cost grows prohibitively fast as the problem size increases.
The three series correspond to three overlap thresholds q₂² (0.897, 0.904, 0.911). All thresholds exhibit the same fundamental scaling (similar slopes), but there is a clear, consistent hierarchy: the highest threshold (red triangles, q₂² = 0.911) requires the most computational effort (MC steps), while the lowest (blue circles, q₂² = 0.897) requires the least, since demanding a better reconstruction naturally costs more sampling. The increasing error bars suggest that crossing times become less predictable in higher dimensions, which is a common challenge in high-dimensional statistical analysis.
</details>
<details>
<summary>x32.png Details</summary>

### Visual Description
## Scatter Plot with Linear Fits: Dimension vs. Number of Monte Carlo Steps
### Overview
The image is a scatter plot on a log-log scale, displaying the relationship between "Dimension" (x-axis) and the "Number of MC steps" (y-axis). Three distinct data series are plotted, each with error bars and a corresponding linear fit line. The chart is used to test a power-law relationship between the variables.
### Components/Axes
* **Chart Type:** Scatter plot with error bars and linear regression lines.
* **X-Axis:**
* **Label:** `Dimension (log scale)`
* **Scale:** Logarithmic. Major tick marks are visible at `10^2` (100) and `2 x 10^2` (200). The axis spans approximately from 50 to 250.
* **Y-Axis:**
* **Label:** `Number of MC steps (log scale)`
* **Scale:** Logarithmic. Major tick marks are visible at `10^2` (100) and `10^3` (1000). The axis spans approximately from 80 to 2000.
* **Legend (Top-Left Corner):**
* **Blue Dashed Line:** `Linear fit: slope=1.9791`
* **Green Dashed Line:** `Linear fit: slope=2.0467`
* **Red Dashed Line:** `Linear fit: slope=2.0093`
* **Blue Circle Marker:** `q² = 0.897`
* **Green Square Marker:** `q² = 0.904`
* **Red Triangle Marker:** `q² = 0.911`
* **Data Series:** Each series consists of data points with vertical error bars, plotted at the same set of dimension values.
### Detailed Analysis
**Data Series & Trends:**
1. **Blue Series (Circles):**
* **Trend:** The data points follow a clear upward linear trend on the log-log plot. The fitted line has a slope of **1.9791**.
* **Data Points (Approximate):**
* Dimension ~60: MC steps ~90 (Error bar range: ~80-100)
* Dimension ~80: MC steps ~130 (Error bar range: ~115-145)
* Dimension ~100: MC steps ~180 (Error bar range: ~160-200)
* Dimension ~130: MC steps ~250 (Error bar range: ~220-280)
* Dimension ~160: MC steps ~350 (Error bar range: ~310-390)
* Dimension ~200: MC steps ~480 (Error bar range: ~420-540)
* Dimension ~250: MC steps ~650 (Error bar range: ~570-730)
* **Series Label:** `q² = 0.897`, the overlap threshold defining this series.
2. **Green Series (Squares):**
* **Trend:** The data points follow a clear upward linear trend, positioned above the blue series. The fitted line has a slope of **2.0467**.
* **Data Points (Approximate):**
* Dimension ~60: MC steps ~120 (Error bar range: ~105-135)
* Dimension ~80: MC steps ~170 (Error bar range: ~150-190)
* Dimension ~100: MC steps ~240 (Error bar range: ~210-270)
* Dimension ~130: MC steps ~340 (Error bar range: ~300-380)
* Dimension ~160: MC steps ~470 (Error bar range: ~410-530)
* Dimension ~200: MC steps ~650 (Error bar range: ~570-730)
* Dimension ~250: MC steps ~880 (Error bar range: ~770-990)
* **Series Label:** `q² = 0.904`, the overlap threshold defining this series.
3. **Red Series (Triangles):**
* **Trend:** The data points follow a clear upward linear trend, positioned highest on the chart. The fitted line has a slope of **2.0093**.
* **Data Points (Approximate):**
* Dimension ~60: MC steps ~150 (Error bar range: ~130-170)
* Dimension ~80: MC steps ~220 (Error bar range: ~190-250)
* Dimension ~100: MC steps ~310 (Error bar range: ~270-350)
* Dimension ~130: MC steps ~440 (Error bar range: ~380-500)
* Dimension ~160: MC steps ~620 (Error bar range: ~540-700)
* Dimension ~200: MC steps ~860 (Error bar range: ~750-970)
* Dimension ~250: MC steps ~1150 (Error bar range: ~1000-1300)
* **Series Label:** `q² = 0.911`, the overlap threshold defining this series.
### Key Observations
1. **Consistent Power-Law Scaling:** All three series exhibit a near-perfect linear relationship on the log-log plot, which would indicate a power-law relationship: `Number of MC steps ∝ Dimension^k`. The fitted slopes (`k`) are all very close to **2** (ranging from 1.9791 to 2.0467).
2. **Hierarchy of Series:** The red series (triangles) consistently requires the highest number of MC steps for a given dimension, followed by the green series (squares), and then the blue series (circles).
3. **Threshold Labels:** The `q²` values (0.897 to 0.911) are the overlap thresholds defining the series, not goodness-of-fit statistics; visually, the linear models track the data closely.
4. **Increasing Variance:** The vertical error bars appear to grow in absolute size as the dimension (and consequently the number of MC stops) increases, which is typical for multiplicative or proportional error.
### Interpretation
Read on its own, the chart suggests that the computational cost (measured in Monte Carlo steps) of the process scales roughly **quadratically** with the problem dimension (`~Dimension^2`); the accompanying text compares this power-law reading against an exponential one on semi-log axes.
The three series correspond to three overlap thresholds `q²`. The consistent vertical offset between them (red > green > blue) indicates a constant multiplicative factor in their cost: the highest threshold requires roughly 1.5 times more steps than the lowest at any given dimension, as demanding a better reconstruction costs more sampling.
The clear trends support this scaling on these axes. The increasing error bars suggest that the absolute uncertainty in the number of required steps grows with the problem size, an important consideration for resource allocation. Under the quadratic reading, doubling the dimension would approximately quadruple the required computational effort.
</details>
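The $\chi^{2}$ testing of the competing fits mentioned in the caption (Fig. 12, Table 3) can be sketched as follows. This is a minimal illustration on synthetic data; the generating law, the 5% error-bar level, and the seed are our own assumptions, not the paper's measurements.

```python
import numpy as np

# Hypothetical crossing times with multiplicative noise (fixed seed).
rng = np.random.default_rng(0)
d = np.array([80, 100, 120, 140, 160, 180, 200, 220, 240], dtype=float)
t_true = 10.0 * np.exp(0.017 * d)
sigma = 0.05 * t_true                      # assumed 5% error bars
t_obs = t_true * rng.normal(1.0, 0.05, d.size)

def chi2_per_dof(x, y, yerr):
    """Chi-square per degree of freedom of a straight-line fit to (x, y)."""
    coef = np.polyfit(x, y, 1)
    resid = (y - np.polyval(coef, x)) / yerr
    return float(np.sum(resid**2) / (x.size - 2))

# Error bars propagate to log space as d(log t) = dt / t.
chi2_exp = chi2_per_dof(d, np.log(t_obs), sigma / t_obs)          # semi-log fit
chi2_pow = chi2_per_dof(np.log(d), np.log(t_obs), sigma / t_obs)  # log-log fit
```

On exponentially growing synthetic data, the semi-log fit yields $\chi^{2}/\mathrm{dof}$ of order one while the log-log fit is penalised by its systematic residual, which is the kind of discrimination the reported tests perform.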
<details>
<summary>x33.png Details</summary>

### Visual Description
## Scatter Plot with Linear Fits: Number of Monte Carlo Steps vs. Dimension
### Overview
The image is a scatter plot with error bars and overlaid linear regression lines. It displays the relationship between the "Dimension" (x-axis) and the "Number of MC steps" (y-axis, on a logarithmic scale) for three different data series, each corresponding to a different value of a parameter denoted as \( q_2^2 \). The chart includes linear fits for each series, with their slopes reported in the legend.
### Components/Axes
* **Chart Type:** Scatter plot with error bars and linear regression lines.
* **X-Axis:**
* **Label:** "Dimension"
* **Scale:** Linear scale.
* **Range/Ticks:** Major ticks at 100, 120, 140, 160, 180, 200, 220, 240.
* **Y-Axis:**
* **Label:** "Number of MC steps (log scale)"
* **Scale:** Logarithmic scale (base 10).
* **Range/Ticks:** Major ticks at \(10^2\) (100) and \(10^3\) (1000). Minor ticks are present between them.
* **Legend (Position: Top-Left Corner):**
* Contains entries for three linear fits and three data series.
* **Linear Fit Entries:**
1. Blue dashed line: "Linear fit: slope=0.0048"
2. Green dashed line: "Linear fit: slope=0.0058"
3. Red dashed line: "Linear fit: slope=0.0065"
* **Data Series Entries (with markers and error bars):**
1. Blue circle marker: \( q_2^2 = 0.940 \)
2. Green square marker: \( q_2^2 = 0.945 \)
3. Red triangle marker: \( q_2^2 = 0.950 \)
### Detailed Analysis
The chart plots three distinct data series, each showing an upward trend. The data points are accompanied by vertical error bars indicating variability or uncertainty.
**1. Data Series for \( q_2^2 = 0.940 \) (Blue Circles, Blue Dashed Fit Line):**
* **Trend:** The data points show a clear upward trend as Dimension increases. The associated linear fit has the shallowest slope (0.0048).
* **Approximate Data Points (Dimension, Number of MC steps):**
* (100, ~90)
* (120, ~85)
* (140, ~120)
* (160, ~110)
* (180, ~130)
* (200, ~160)
* (220, ~150)
* (240, ~140)
* **Error Bars:** The error bars are substantial, often spanning a range of ±30 to ±50 steps around the central point.
**2. Data Series for \( q_2^2 = 0.945 \) (Green Squares, Green Dashed Fit Line):**
* **Trend:** This series also shows a consistent upward trend, steeper than the blue series. The linear fit slope is 0.0058.
* **Approximate Data Points (Dimension, Number of MC steps):**
* (100, ~120)
* (120, ~120)
* (140, ~160)
* (160, ~150)
* (180, ~170)
* (200, ~200)
* (220, ~210)
* (240, ~190)
* **Error Bars:** Error bars are large, similar in magnitude to the blue series.
**3. Data Series for \( q_2^2 = 0.950 \) (Red Triangles, Red Dashed Fit Line):**
* **Trend:** This series exhibits the steepest upward trend of the three. The linear fit has the highest slope (0.0065).
* **Approximate Data Points (Dimension, Number of MC steps):**
* (100, ~200)
* (120, ~200)
* (140, ~250)
* (160, ~220)
* (180, ~280)
* (200, ~350)
* (220, ~400)
* (240, ~350)
* **Error Bars:** The error bars for this series are the largest, especially at higher dimensions (e.g., at Dimension 220, the error bar spans from ~200 to ~800).
### Key Observations
1. **Positive Correlation:** For all three values of \( q_2^2 \), the number of Monte Carlo (MC) steps increases with the Dimension.
2. **Slope Dependence on \( q_2^2 \):** The rate of increase (slope of the linear fit) is positively correlated with the parameter \( q_2^2 \). A higher \( q_2^2 \) (0.950) results in a steeper slope (0.0065) compared to a lower \( q_2^2 \) (0.940, slope 0.0048).
3. **Magnitude Dependence on \( q_2^2 \):** At any given dimension, a higher \( q_2^2 \) value corresponds to a higher number of MC steps. The red series (\( q_2^2=0.950 \)) is consistently above the green (\( q_2^2=0.945 \)), which is above the blue (\( q_2^2=0.940 \)).
4. **High Variability:** The large error bars across all series indicate significant variability or uncertainty in the measured number of MC steps for each dimension. This variability appears to increase with both dimension and \( q_2^2 \).
5. **Log-Linear Relationship:** The use of a log-scale y-axis and the reporting of linear fits suggest the underlying relationship between Dimension and the *logarithm* of MC steps is approximately linear.
### Interpretation
This chart likely comes from a computational physics or statistics context, analyzing the performance or convergence of a Monte Carlo (MC) simulation algorithm. The "Dimension" probably refers to the dimensionality of the problem space (e.g., number of variables in a system).
* **What the data suggests:** The data demonstrates that the computational cost (measured in MC steps required) scales with the problem's dimensionality. This scaling is not uniform; it depends critically on a system parameter \( q_2^2 \). As \( q_2^2 \) increases towards 1, the algorithm becomes less efficient, requiring more steps to achieve its goal (e.g., convergence, sampling) and this inefficiency worsens more rapidly as the problem size (dimension) grows.
* **Relationship between elements:** The three linear fits model the scaling law for each parameter setting. The slopes (0.0048, 0.0058, 0.0065) quantify the "cost of dimensionality" for each \( q_2^2 \). The error bars highlight the stochastic nature of MC methods, showing that the reported step count for a given condition is an average with considerable spread.
* **Notable implications:** The clear stratification by \( q_2^2 \) indicates it is a key control parameter affecting algorithmic performance. The increasing error bars with dimension and \( q_2^2 \) suggest that simulations become not only more expensive but also less predictable in their runtime as the problem becomes harder (higher dimension) and the parameter \( q_2^2 \) increases. This information is crucial for resource planning and for understanding the limits of the simulated method.
</details>
<details>
<summary>x34.png Details</summary>

### Visual Description
## Scatter Plot with Linear Fits: Number of Monte Carlo (MC) Steps vs. Dimension
### Overview
This is a log-log scatter plot with error bars and linear regression fits. It illustrates the relationship between the dimension of a problem (x-axis) and the number of Monte Carlo (MC) simulation steps required for convergence (y-axis). The data is presented for three different values of a parameter denoted as \( q^2 \). The chart demonstrates a positive correlation between dimension and computational cost (MC steps), with the rate of increase depending on the \( q^2 \) value.
### Components/Axes
* **Chart Type:** Scatter plot with error bars and linear fit lines.
* **X-Axis:**
* **Label:** "Dimension (log scale)"
* **Scale:** Logarithmic.
* **Major Tick Markers:** \( 10^2 \), \( 1.2 \times 10^2 \), \( 1.4 \times 10^2 \), \( 1.6 \times 10^2 \), \( 1.8 \times 10^2 \), \( 2 \times 10^2 \), \( 2.2 \times 10^2 \), \( 2.4 \times 10^2 \).
* **Y-Axis:**
* **Label:** "Number of MC steps (log scale)"
* **Scale:** Logarithmic.
* **Major Tick Markers:** \( 10^2 \), \( 10^3 \).
* **Legend (Position: Top-Left Corner):**
* **Linear Fit Series:**
* Blue dashed line: "Linear fit: slope=0.7867"
* Green dashed line: "Linear fit: slope=0.9348"
* Red dashed line: "Linear fit: slope=1.0252"
* **Data Series (with error bars):**
* Blue circle marker: \( q_2^* = 0.940 \)
* Green square marker: \( q_2^* = 0.945 \)
* Red triangle marker: \( q_2^* = 0.950 \)
### Detailed Analysis
The chart plots three distinct data series, each corresponding to a different threshold \( q_2^* \). Each data point includes vertical error bars indicating the uncertainty or variance in the "Number of MC steps" measurement.
**Data Series & Trends:**
1. **Blue Series (\( q_2^* = 0.940 \)):**
* **Trend:** The data points follow a clear upward linear trend on the log-log plot. The associated blue dashed linear fit line has a slope of **0.7867**.
* **Approximate Data Points (Dimension, MC steps):**
* (100, ~90)
* (120, ~90)
* (140, ~120)
* (160, ~130)
* (180, ~130)
* (200, ~180)
* (220, ~190)
* (240, ~170)
* **Error Bars:** The error bars are relatively consistent in size across dimensions, spanning approximately ±20-40 MC steps.
2. **Green Series (\( q_2^* = 0.945 \)):**
* **Trend:** The data points follow a steeper upward linear trend than the blue series. The associated green dashed linear fit line has a slope of **0.9348**.
* **Approximate Data Points (Dimension, MC steps):**
* (100, ~120)
* (120, ~120)
* (140, ~160)
* (160, ~180)
* (180, ~180)
* (200, ~250)
* (220, ~280)
* (240, ~260)
* **Error Bars:** Error bars are larger than for the blue series, spanning approximately ±30-60 MC steps.
3. **Red Series (\( q_2^* = 0.950 \)):**
* **Trend:** The data points follow the steepest upward linear trend of the three series. The associated red dashed linear fit line has a slope of **1.0252**.
* **Approximate Data Points (Dimension, MC steps):**
* (100, ~200)
* (120, ~200)
* (140, ~280)
* (160, ~320)
* (180, ~320)
* (200, ~450)
* (220, ~500)
* (240, ~400)
* **Error Bars:** Error bars are the largest among the series, spanning approximately ±50-150 MC steps, and appear to increase with dimension.
### Key Observations
1. **Positive Correlation:** For all three \( q_2^* \) values, the number of MC steps increases with the dimension of the problem.
2. **Effect of \( q_2^* \):** A higher threshold (0.950 vs. 0.940) results in both a higher absolute number of MC steps and a steeper rate of increase (higher slope) with dimension.
3. **Power-Law Relationship:** The linear fits on the log-log plot indicate a power-law relationship: \( \text{MC steps} \propto \text{Dimension}^{\text{slope}} \). The exponent (slope) increases with \( q_2^* \).
4. **Increasing Variance:** The size of the error bars, particularly for the red series (\( q_2^* = 0.950 \)), tends to increase with dimension, suggesting greater variability in the simulation cost for higher-dimensional problems at this threshold.
5. **Data Consistency:** The data points generally align well with their respective linear fit lines, confirming the modeled trend. The last red data point at dimension 240 appears slightly below the trend line.
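The power-law reading above can be checked numerically: on a log-log scale, a least-squares line through \( (\log \text{dimension}, \log \text{steps}) \) has the power-law exponent as its slope. A minimal sketch, with illustrative values in the spirit of the plot (not the paper's actual data):

```python
import numpy as np

# Hypothetical measurements: problem dimensions and MC step counts
# (illustrative numbers loosely matching the red series above).
dims = np.array([100, 120, 140, 160, 180, 200, 220, 240], dtype=float)
steps = np.array([200, 200, 280, 320, 320, 450, 500, 400], dtype=float)

# A power law steps = c * dims**a becomes linear after taking logs:
# log(steps) = log(c) + a * log(dims), so an ordinary least-squares
# fit in log-log space recovers the exponent a as the slope.
slope, intercept = np.polyfit(np.log(dims), np.log(steps), deg=1)
c = np.exp(intercept)
print(f"fitted exponent a = {slope:.4f}, prefactor c = {c:.2f}")
```

With these illustrative numbers the fitted exponent lands close to 1, consistent with the near-linear scaling reported for the steepest series.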
### Interpretation
This chart quantifies the **curse of dimensionality** for the Monte Carlo sampling process: the computational cost (measured in simulation steps) grows polynomially with the problem's dimension. The threshold \( q_2^* \) acts as a **difficulty multiplier**.
* **What the data suggests:** The simulation becomes more expensive as the dimension grows and as the threshold \( q_2^* \) increases. The relationship is predictable and follows a power law.
* **How elements relate:** The legend maps colors and markers to their respective \( q_2^* \) values and fit lines. The slopes in the legend (0.7867, 0.9348, 1.0252) are the exponents of the power-law relationship, directly quantifying how sensitively the cost scales with dimension for each \( q_2^* \).
* **Notable implications:** The slope for \( q_2^* = 0.950 \) is slightly greater than 1, indicating a **super-linear** scaling of cost with dimension. This has significant implications for the feasibility of running such simulations in very high dimension at this threshold, as the cost escalates rapidly. The growing error bars at higher \( q_2^* \) and dimension also imply that performance becomes less predictable under these more demanding conditions.
</details>
Figure 12: Semilog (Left) and log-log (Right) plots of the number of Hamiltonian Monte Carlo steps needed to achieve an overlap $q_{2}^{*}>q_{2}^{\rm uni}$, which certifies that the universal solution is outperformed. The dataset was generated from a teacher with polynomial activation $\sigma_{3}={\rm He}_{2}/\sqrt{2}+{\rm He}_{3}/6$ and parameters $\Delta=0.1$ for the linear readout, $\gamma=0.5$ and $\alpha=1.0>\alpha_{\rm sp}$ ($=0.26,0.30,0.02$ for homogeneous, Rademacher and Gaussian readouts respectively). Student weights are sampled using HMC (initialised uninformatively) with $4000$ iterations for homogeneous readouts (Top row, for which $q_{2}^{\rm uni}=0.883$), or $2000$ iterations for Rademacher (Centre row, with $q_{2}^{\rm uni}=0.868$) and Gaussian readouts (Bottom row, for which $q_{2}^{\rm uni}=0.903$). Each iteration uses an adaptive step size (initialised at $0.01$) and $10$ leapfrog steps. $q_{2}^{\rm sp}=0.941,0.948,0.963$ in the three cases. The readouts are kept fixed during training. Points are obtained by averaging over 10 teacher/data instances, with error bars representing the standard deviation.
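The sampling loop behind these experiments can be sketched generically. The following is a minimal Hamiltonian Monte Carlo step on a toy standard-Gaussian target, not the paper's actual posterior over student weights; it only mirrors the stated settings (initial step size $0.01$, $10$ leapfrog steps) and omits the adaptive step-size tuning.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_neg_log_prob(w):
    # Toy target: standard Gaussian, so -log p(w) = |w|^2 / 2.
    # (Placeholder for the actual student-weight posterior.)
    return w

def hmc_step(w, eps=0.01, n_leapfrog=10):
    """One HMC iteration: momentum refresh, leapfrog integration of the
    Hamiltonian dynamics, then a Metropolis accept/reject correction."""
    p = rng.standard_normal(w.shape)
    w_new, p_new = w.copy(), p.copy()
    # Leapfrog: half-step on momentum, alternating full steps, final half-step.
    p_new -= 0.5 * eps * grad_neg_log_prob(w_new)
    for _ in range(n_leapfrog - 1):
        w_new += eps * p_new
        p_new -= eps * grad_neg_log_prob(w_new)
    w_new += eps * p_new
    p_new -= 0.5 * eps * grad_neg_log_prob(w_new)
    # Metropolis correction keeps the target distribution invariant.
    h_old = 0.5 * (w @ w) + 0.5 * (p @ p)
    h_new = 0.5 * (w_new @ w_new) + 0.5 * (p_new @ p_new)
    if rng.random() < np.exp(h_old - h_new):
        return w_new
    return w

# Run a short chain from an uninformative (zero) initialisation.
w = np.zeros(10)
for _ in range(100):
    w = hmc_step(w)
```

In the experiments the number of such iterations until the overlap $q_2^*$ exceeds $q_2^{\rm uni}$ is the quantity plotted against dimension.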
| Readouts | Thresholds $q_{2}^{*}$ | $\chi^{2}$, exponential fit | $\chi^{2}$, power-law fit |
| --- | --- | --- | --- |
| Homogeneous | $\{0.903,0.906,0.909\}$ | $\bm{2.22}$, $\bm{1.47}$, $\bm{1.14}$ | $8.01$, $7.25$, $6.35$ |
| Rademacher | $\{0.897,0.904,0.911\}$ | $\bm{1.88}$, $\bm{2.12}$, $\bm{1.70}$ | $8.10$, $7.70$, $8.57$ |
| Gaussian | $\{0.940,0.945,0.950\}$ | $0.66$, $\bm{0.44}$, $\bm{0.26}$ | $\bm{0.62}$, $0.53$, $0.39$ |
Table 3: $\chi^{2}$ test for exponential and power-law fits of the time needed by Hamiltonian Monte Carlo to reach the thresholds $q_{2}^{*}$, for various priors on the readouts. For a given row, we report three values of the $\chi^{2}$ test per hypothesis, corresponding to the thresholds $q_{2}^{*}$ on the left, in the order given. Fits are displayed in Figure 12. Smaller values of $\chi^{2}$ (in bold, for given threshold and readouts) indicate better compatibility with the hypothesis.
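The model comparison in the table can be illustrated as follows: fit both hypotheses (exponential, i.e. linear in semilog space; power law, i.e. linear in log-log space) and score each against the measured error bars with a reduced $\chi^2$. This is a sketch with synthetic data; the paper's exact fitting and normalisation conventions are not reproduced here.

```python
import numpy as np

# Synthetic measurements: dimensions, step counts, and error bars
# (sigma stands in for the error bars of Figure 12).
dims = np.array([100, 120, 140, 160, 180, 200, 220, 240], dtype=float)
steps = np.array([120, 150, 210, 260, 340, 470, 600, 800], dtype=float)
sigma = 0.1 * steps

# Exponential hypothesis: steps = A * exp(b * dims), fitted linearly in semilog space.
b, logA = np.polyfit(dims, np.log(steps), deg=1)
pred_exp = np.exp(logA + b * dims)

# Power-law hypothesis: steps = C * dims**a, fitted linearly in log-log space.
a, logC = np.polyfit(np.log(dims), np.log(steps), deg=1)
pred_pow = np.exp(logC) * dims**a

def chi2(pred, obs, sig, n_params=2):
    # Reduced chi-squared: residuals weighted by the error bars,
    # divided by the number of degrees of freedom.
    return np.sum(((obs - pred) / sig) ** 2) / (len(obs) - n_params)

chi2_exp = chi2(pred_exp, steps, sigma)
chi2_pow = chi2(pred_pow, steps, sigma)
print(f"chi2 exponential = {chi2_exp:.2f}, chi2 power-law = {chi2_pow:.2f}")
```

Since the synthetic data here grows exponentially, the exponential hypothesis yields the smaller $\chi^2$, mirroring the homogeneous and Rademacher rows of the table; data with genuine power-law growth would flip the comparison, as in most of the Gaussian row.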