Statistical mechanics of extensive-width Bayesian neural networks near interpolation

Jean Barbier * 1 Francesco Camilli * 1 Minh-Toan Nguyen * 1 Mauro Pastore * 1 Rudy Skerk * 2

footnotetext: * Equal contribution. 1 The Abdus Salam International Centre for Theoretical Physics (ICTP), Strada Costiera 11, 34151 Trieste, Italy. 2 International School for Advanced Studies (SISSA), Via Bonomea 265, 34136 Trieste, Italy.

Abstract

For three decades statistical mechanics has provided a framework to analyse neural networks. However, the theoretically tractable models, e.g., perceptrons, random features models and kernel machines, or multi-index models and committee machines with few neurons, remained simple compared to those used in applications. In this paper we help reduce the gap between practical networks and their theoretical understanding through a statistical physics analysis of the supervised learning of a two-layer fully connected network with generic weight distribution and activation function, whose hidden layer is large but remains proportional to the input dimension. This makes it more realistic than infinitely wide networks, where no feature learning occurs, but also more expressive than narrow ones or networks with fixed inner weights. We focus on Bayes-optimal learning in the teacher-student scenario, i.e., with a dataset generated by another network with the same architecture. We operate around interpolation, where the number of trainable parameters and of data are comparable and feature learning emerges. Our analysis uncovers a rich phenomenology with various learning transitions as the number of data increases. In particular, the more strongly the features (i.e., hidden neurons of the target) contribute to the observed responses, the less data is needed to learn them. Moreover, when the data is scarce, the model only learns non-linear combinations of the teacher weights, rather than “specialising” by aligning its weights with the teacher’s. Specialisation occurs only when enough data becomes available, but it can be hard to find for practical training algorithms, possibly due to statistical-to-computational gaps.

# 1 Introduction
Understanding the expressive power and generalisation capabilities of neural networks is not only a stimulating intellectual activity, producing surprising results that seem to defy established common sense in statistics and optimisation (Bartlett et al., 2021), but has important practical implications in cost-benefit planning whenever a model is deployed. E.g., from a fruitful research line that spanned three decades, we now know that deep fully connected Bayesian neural networks with $O(1)$ readout weights and $L_{2}$ regularisation behave as kernel machines (the so-called Neural Network Gaussian processes, NNGPs) in the heavily overparametrised, infinite-width regime (Neal, 1996; Williams, 1996; Lee et al., 2018; Matthews et al., 2018; Hanin, 2023), and so suffer from these models’ limitations. Indeed, kernel machines infer the decision rule by first embedding the data in a fixed a priori feature space, the renowned kernel trick, then operating linear regression/classification over the features. In this respect, they do not learn features (in the sense of statistics relevant for the decision rule) from the data, so they need larger and larger feature spaces and training sets to fit their higher order statistics (Yoon & Oh, 1998; Dietrich et al., 1999; Gerace et al., 2021; Bordelon et al., 2020; Canatar et al., 2021; Xiao et al., 2023).
Many efforts have been devoted to studying Bayesian neural networks beyond this regime. In the so-called proportional regime, when the width is large and proportional to the training set size, recent studies showed how a limited amount of feature learning makes the network equivalent to optimally regularised kernels (Li & Sompolinsky, 2021; Pacelli et al., 2023; Camilli et al., 2023; Cui et al., 2023; Baglioni et al., 2024; Camilli et al., 2025). This could be a consequence of the fully connected architecture, as, e.g., convolutional neural networks learn more informative features (Naveh & Ringel, 2021; Seroussi et al., 2023; Aiudi et al., 2025; Bassetti et al., 2024). Another scenario is the mean-field scaling, i.e., when the readout weights are small: in this case too a Bayesian network can learn features in the proportional regime (Rubin et al., 2024a; van Meegen & Sompolinsky, 2024).
Here instead we analyse a fully connected two-layer Bayesian network trained end-to-end near the interpolation threshold, where the sample size $n$ scales like the number of trainable parameters: for input dimension $d$ and width $k$ , both large and proportional, $n=\Theta(d^{2})=\Theta(kd)$ , a regime where non-trivial feature learning can happen. We consider i.i.d. Gaussian input vectors with labels generated by a teacher network with matching architecture, in order to study the Bayes-optimal learning of this neural network target function. Our results thus provide a benchmark for the performance of any model trained on the same dataset.
2 Setting and main results
2.1 Teacher-student setting
We consider supervised learning with a shallow neural network in the classical teacher-student setup (Gardner & Derrida, 1989). The data-generating model, i.e., the teacher (or target function), is thus a two-layer neural network itself, with readout weights ${\mathbf{v}}^{0}∈\mathbb{R}^{k}$ and internal weights ${\mathbf{W}}^{0}∈\mathbb{R}^{k× d}$ , drawn entrywise i.i.d. from $P_{v}^{0}$ and $P^{0}_{W}$ , respectively; we assume $P^{0}_{W}$ to be centred while $P^{0}_{v}$ has mean $\bar{v}$ , and both priors have unit second moment. We denote the whole set of parameters of the target as ${\bm{\theta}}^{0}=({\mathbf{v}}^{0},{\mathbf{W}}^{0})$ . The inputs are i.i.d. standard Gaussian vectors ${\mathbf{x}}_{\mu}∈\mathbb{R}^{d}$ for $\mu≤ n$ . The responses/labels $y_{\mu}$ are drawn from a kernel $P^{0}_{\rm out}$ :
$$
y_{\mu}\sim P^{0}_{\rm out}(\,\cdot\mid\lambda^{0}_{\mu}),\qquad\lambda^{0}_{\mu}:=\frac{1}{\sqrt{k}}{\mathbf{v}}^{0\intercal}\sigma\Big(\frac{1}{\sqrt{d}}{\mathbf{W}}^{0}{\mathbf{x}}_{\mu}\Big). \tag{1}
$$
The kernel can be stochastic or model a deterministic rule if $P^{0}_{\rm out}(y\mid\lambda)=\delta(y-\mathsf{f}^{0}(\lambda))$ for some outer non-linearity $\mathsf{f}^{0}$ . The activation function $\sigma$ is applied entrywise to vectors and is required to admit an expansion in Hermite polynomials with Hermite coefficients $(\mu_{\ell})_{\ell≥ 0}$ , see App. A: $\sigma(x)=\sum_{\ell≥ 0}\frac{\mu_{\ell}}{\ell!}{\rm He}_{\ell}(x)$ . We assume it has vanishing 0th Hermite coefficient, i.e., that it is centred $\mathbb{E}_{z\sim\mathcal{N}(0,1)}\sigma(z)=0$ ; in App. D.5 we relax this assumption. The input/output pairs $\mathcal{D}=\{({\mathbf{x}}_{\mu},y_{\mu})\}_{\mu≤ n}$ form the training set for a student network with matching architecture.
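For concreteness, the data-generating process of Eq. (1) with the Gaussian output channel used later can be sketched as follows. This is a minimal illustration only: the standard Gaussian priors, the tanh activation (which is centred) and the noise level $\Delta$ are choices of this sketch, not requirements of the theory.

```python
import numpy as np

def sample_teacher_dataset(n, d, k, sigma=np.tanh, Delta=0.1, seed=0):
    """Draw a training set from the teacher of Eq. (1) with the Gaussian
    output channel y = lambda + sqrt(Delta) * noise.  Gaussian priors and
    tanh are illustrative choices among those covered by the theory."""
    rng = np.random.default_rng(seed)
    W0 = rng.standard_normal((k, d))            # inner weights W^0
    v0 = rng.standard_normal(k)                 # readouts v^0
    X = rng.standard_normal((n, d))             # i.i.d. standard Gaussian inputs
    lam = sigma(X @ W0.T / np.sqrt(d)) @ v0 / np.sqrt(k)   # lambda^0_mu
    y = lam + np.sqrt(Delta) * rng.standard_normal(n)      # noisy responses
    return X, y, W0, v0

X, y, W0, v0 = sample_teacher_dataset(n=200, d=50, k=25)
```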
Notice that the readouts ${\mathbf{v}}^{0}$ are only $k$ unknowns in the target, compared to the $kd=\Theta(k^{2})$ inner weights ${\mathbf{W}}^{0}$ . Therefore, they can be equivalently considered quenched, i.e., either given and thus fixed in the student network defined below, or unknown and thus learnable, without changing the leading order of the information-theoretic quantities we aim for. E.g., in terms of mutual information per parameter, $\frac{1}{kd+k}I(({\mathbf{W}}^{0},{\mathbf{v}}^{0});\mathcal{D})=\frac{1}{kd}I({\mathbf{W}}^{0};\mathcal{D}\mid{\mathbf{v}}^{0})+o_{d}(1)$ . Without loss of generality, we thus consider ${\mathbf{v}}^{0}$ quenched and denote it ${\mathbf{v}}$ from now on. This equivalence holds at leading order and at equilibrium only, not at the dynamical level, whose study is left for future work.
The Bayesian student learns via the posterior distribution of the weights ${\mathbf{W}}$ given the training data (and ${\mathbf{v}}$ ), defined by
$$
dP({\mathbf{W}}\mid\mathcal{D}):=\mathcal{Z}(\mathcal{D})^{-1}dP_{W}({\mathbf{W}})\prod_{\mu≤ n}P_{\rm out}\big(y_{\mu}\mid\lambda_{\mu}({\mathbf{W}})\big)
$$
with post-activation $\lambda_{\mu}({\mathbf{W}}):=\frac{1}{\sqrt{k}}{\mathbf{v}}^{\intercal}\sigma(\frac{1}{\sqrt{d}}{\mathbf{W}}{\mathbf{x}}_{\mu})$ , the posterior normalisation constant $\mathcal{Z}(\mathcal{D})$ called the partition function, and $P_{W}$ the prior assumed by the student. From now on, we focus on the Bayes-optimal case $P_{W}=P_{W}^{0}$ and $P_{\rm out}=P_{\rm out}^{0}$ , but the approach can be extended to account for a mismatch.
We aim at evaluating the expected generalisation error of the student. Let $({\mathbf{x}}_{\rm test},y_{\rm test}\sim P_{\rm out}(\,·\mid\lambda^{0}_{\rm test}))$ be a fresh sample (not present in $\mathcal{D}$ ) drawn using the teacher, where $\lambda_{\rm test}^{0}$ is defined as in (1) with ${\mathbf{x}}_{\mu}$ replaced by ${\mathbf{x}}_{\rm test}$ (and similarly for $\lambda_{\rm test}({\mathbf{W}})$ ). Given any prediction function $\mathsf{f}$ , the Bayes estimator for the test response reads $\hat{y}^{\mathsf{f}}({\mathbf{x}}_{\rm test},{\mathcal{D}}):=\langle\mathsf{f}(\lambda_{\rm test}({\mathbf{W}}))\rangle$ , where the expectation $\langle\,·\,\rangle:=\mathbb{E}[\,·\mid\mathcal{D}]$ is w.r.t. the posterior $dP({\mathbf{W}}\mid\mathcal{D})$ . Then, for a performance measure $\mathcal{C}:\mathbb{R}×\mathbb{R}\to\mathbb{R}_{≥ 0}$ the Bayes generalisation error is
$$
\varepsilon^{\mathcal{C},\mathsf{f}}:=\mathbb{E}_{{\bm{\theta}}^{0},{\mathcal{D}},{\mathbf{x}}_{\rm test},y_{\rm test}}\,\mathcal{C}\big(y_{\rm test},\big\langle\mathsf{f}(\lambda_{\rm test}({\mathbf{W}}))\big\rangle\big). \tag{2}
$$
An important case is the square loss $\mathcal{C}(y,\hat{y})=(y-\hat{y})^{2}$ with the choice $\mathsf{f}(\lambda)=\int dy\,y\,P_{\rm out}(y\mid\lambda)=:\mathbb{E}[y\mid\lambda]$ . The Bayes-optimal mean-square generalisation error follows:
$$
\varepsilon^{\rm opt}:=\mathbb{E}_{{\bm{\theta}}^{0},{\mathcal{D}},{\mathbf{x}}_{\rm test},y_{\rm test}}\big(y_{\rm test}-\big\langle\mathbb{E}[y\mid\lambda_{\rm test}({\mathbf{W}})]\big\rangle\big)^{2}. \tag{3}
$$
Our main example will be the case of linear readout with Gaussian label noise: $P_{\rm out}(y\mid\lambda)=\exp(-\frac{1}{2\Delta}(y-\lambda)^{2})/\sqrt{2\pi\Delta}$ . In this case, the generalisation error $\varepsilon^{\rm opt}$ takes a simpler form for numerical evaluation than (3), thanks to the concentration of “overlaps” entering it, see App. C.
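As an illustration of the estimator entering (3) for this Gaussian channel, where $\mathbb{E}[y\mid\lambda]=\lambda$, here is a minimal Monte Carlo sketch. It assumes that (approximate) posterior samples of ${\mathbf{W}}$ are available, e.g. from an MCMC sampler; producing them is the hard part and is not addressed here.

```python
import numpy as np

def bayes_mse(W_samples, v, x_test, y_test, sigma=np.tanh):
    """Monte Carlo version of Eq. (3) for the Gaussian channel, where
    E[y|lam] = lam: the Bayes predictor is the posterior average of the
    test post-activation.  `W_samples` are assumed posterior samples."""
    k, d = W_samples[0].shape
    lam = lambda W: v @ sigma(W @ x_test / np.sqrt(d)) / np.sqrt(k)
    y_hat = np.mean([lam(W) for W in W_samples])   # <E[y | lam_test(W)]>
    return (y_test - y_hat) ** 2

# sanity check: a posterior perfectly concentrated on the teacher, with a
# noiseless test label, gives zero error
rng = np.random.default_rng(0)
k, d = 10, 30
W0, v = rng.standard_normal((k, d)), rng.standard_normal(k)
x_test = rng.standard_normal(d)
y_test = v @ np.tanh(W0 @ x_test / np.sqrt(d)) / np.sqrt(k)
err = bayes_mse([W0, W0], v, x_test, y_test)
```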
We study the challenging extensive-width regime with quadratically many samples, i.e., a large size limit
$$
d,k,n\to+\infty\quad\text{with}\quad k/d\to\gamma,\quad n/d^{2}\to\alpha. \tag{4}
$$
We denote this joint $d,k,n$ limit with these rates by “ ${\lim}$ ”.
In order to access $\varepsilon^{\mathcal{C},\mathsf{f}},\varepsilon^{\rm opt}$ and other relevant quantities, one can tackle the computation of the average log-partition function, or free entropy in statistical physics language:
$$
f_{n}:=\frac{1}{n}\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\ln\mathcal{Z}(\mathcal{D}). \tag{5}
$$
The mutual information between teacher weights and the data is related to the free entropy $f_{n}$ , see App. F. E.g., in the case of linear readout with Gaussian label noise we have $\lim\frac{1}{kd}I({\mathbf{W}}^{0};\mathcal{D}\mid{\mathbf{v}})=-\frac{\alpha}{\gamma}\lim f_{n}-\frac{\alpha}{2\gamma}\ln(2\pi e\Delta)$ . Considering the mutual information per parameter allows us to interpret $\alpha$ as a sort of signal-to-noise ratio: the mutual information defined in this way increases with it.
Notations: Bold is for vectors and matrices; $d$ is the input dimension, $k$ the width of the hidden layer, $n$ the size of the training set $\mathcal{D}$ , with asymptotic ratios given by (4); ${\mathbf{A}}^{\circ\ell}$ is the Hadamard power of a matrix; for a vector ${\mathbf{v}}$ , $({\mathbf{v}})$ is the diagonal matrix ${\rm diag}({\mathbf{v}})$ ; $(\mu_{\ell})$ are the Hermite coefficients of the activation function $\sigma(x)=\sum_{\ell≥ 0}\frac{\mu_{\ell}}{\ell!}{\rm He}_{\ell}(x)$ ; the norm $\|\,·\,\|$ for vectors and matrices is the Frobenius norm.
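The Hermite coefficients $\mu_{\ell}=\mathbb{E}_{z\sim\mathcal{N}(0,1)}[\sigma(z){\rm He}_{\ell}(z)]$ appearing above can be computed numerically by Gauss–Hermite quadrature; a small sketch (the activation tanh and the truncation order are illustrative choices):

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss
from numpy.polynomial.hermite_e import hermeval

def hermite_coeffs(sigma, L, n_quad=80):
    """mu_ell = E_{z~N(0,1)}[sigma(z) He_ell(z)] for ell = 0..L, with He_ell
    the probabilists' Hermite polynomials, via Gauss-Hermite quadrature."""
    x, w = hermgauss(n_quad)           # nodes/weights for weight exp(-x^2)
    z = np.sqrt(2.0) * x               # change of variables to N(0,1)
    mus = []
    for ell in range(L + 1):
        e = np.zeros(ell + 1); e[ell] = 1.0
        He = hermeval(z, e)            # He_ell evaluated at the nodes
        mus.append(np.sum(w * sigma(z) * He) / np.sqrt(np.pi))
    return np.array(mus)

# tanh is odd, hence centred: all even coefficients vanish
mu = hermite_coeffs(np.tanh, L=4)
```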
2.2 Main results
The aforementioned setting is related to the recent paper Maillard et al. (2024a), with two major differences: that work considers Gaussian weights and a quadratic activation. These hypotheses allow numerous simplifications in the analysis, exploited in a series of works Du & Lee (2018); Soltanolkotabi et al. (2019); Venturi et al. (2019); Sarao Mannelli et al. (2020); Gamarnik et al. (2024); Martin et al. (2024); Arjevani et al. (2025). Thanks to this, Maillard et al. (2024a) maps the learning task onto a generalised linear model (GLM) where the goal is to infer a Wishart matrix from linear observations, which is analysable using known results on the GLM Barbier et al. (2019) and on matrix denoising Barbier & Macris (2022); Maillard et al. (2022); Pourkamali et al. (2024); Semerjian (2024).
Our main contribution is a statistical mechanics framework for characterising the prediction performance of shallow Bayesian neural networks, able to handle arbitrary activation functions and different distributions of i.i.d. weights, both ingredients playing an important role for the phenomenology.
The theory we derive draws a rich picture with various learning transitions when tuning the sample rate $\alpha≈ n/d^{2}$ . For low $\alpha$ , feature learning occurs because the student tunes its weights to match non-linear combinations of the teacher’s, rather than aligning to those weights themselves. This phase is universal in the (centred, unit-variance) law of the i.i.d. teacher inner weights: our numerics, obtained both with binary and Gaussian inner weights, match the theory well, which does not depend on this prior here. When increasing $\alpha$ , strong feature learning emerges through specialisation phase transitions, where the student aligns some of its weights with the actual teacher’s ones. In particular, when the readouts ${\mathbf{v}}$ in the target function have a non-trivial distribution, a whole sequence of specialisation transitions occurs as $\alpha$ grows, for the following intuitive reason. Different features in the data are related to the weights of the teacher neurons, $({\mathbf{W}}^{0}_{j}∈\mathbb{R}^{d})_{j≤ k}$ . The strength with which the responses $(y_{\mu})$ depend on the feature ${\mathbf{W}}_{j}^{0}$ is tuned by the corresponding readout through $|v_{j}|$ , which plays the role of a feature-dependent “signal-to-noise ratio”. Therefore, features/hidden neurons $j∈[k]$ corresponding to the largest readout amplitude $\max\{|v_{j}|\}$ are learnt first by the student when increasing $\alpha$ (in the sense that the teacher-student overlap ${\mathbf{W}}^{\intercal}_{j}{\mathbf{W}}^{0}_{j}/d>o_{d}(1)$ ), then features with the second largest amplitude are, and so on. If the readouts are continuous, an infinite sequence of specialisation transitions emerges in the limit (4). On the contrary, if the readouts are homogeneous (i.e., take a unique value), then a single transition occurs where almost all neurons of the student specialise jointly (possibly up to a vanishing fraction).
We predict specialisation transitions to occur for binary inner weights and generic activation, or for Gaussian ones and more-than-quadratic activation. We provide a theoretical description of these learning transitions and identify the order parameters (sufficient statistics) needed to deduce the generalisation error through scalar equations.
The picture that emerges is connected to recent findings in the context of extensive-rank matrix denoising Barbier et al. (2025). In that model, a recovery transition was also identified, separating a universal phase (i.e., independent of the signal prior) from a factorisation phase akin to specialisation in the present context. We believe that this picture and the one found in the present paper are not just similar, but a manifestation of the same fundamental mechanism inherent to the extensive rank of the matrices involved. Indeed, matrix denoising and neural networks share features with both matrix models Kazakov (2000); Brézin et al. (2016); Anninos & Mühlmann (2020) and planted mean-field spin glasses Nishimori (2001); Zdeborová & Krzakala (2016). This mixed nature requires blending techniques from both fields to tackle them. Consequently, the approach developed in Sec. 4 based on the replica method Mezard et al. (1986) is non-standard, as it crucially relies on the Harish Chandra–Itzykson–Zuber (HCIZ), or “spherical”, integral used in matrix models Itzykson & Zuber (1980); Matytsin (1994); Guionnet & Zeitouni (2002). Mixing spherical integration and the replica method has been previously attempted in Schmidt (2018); Barbier & Macris (2022) for matrix denoising, both papers yielding promising but quantitatively inaccurate or non-computable results. Another attempt to exploit a mean-field technique for matrix denoising (in that case a high-temperature expansion) is Maillard et al. (2022), which suffers from similar limitations. The more quantitative answer of Barbier et al. (2025) was made possible precisely thanks to the understanding that the problem behaves more as a matrix model or as a planted mean-field spin glass depending on the phase in which it lives. The two phases could then be treated separately and joined using an appropriate criterion to locate the transition.
It would be desirable to derive a unified theory able to describe the whole phase diagram within a single formalism. This is what the present paper provides, through a principled combination of spherical integration and the replica method, yielding predictive formulas that are easy to evaluate. It is important to notice that the presence of the HCIZ integral, a high-dimensional matrix integral, in the replica formula of Result 2.1 suggests that effective one-body problems alone are not enough to capture the physics of the problem, as is usually the case in standard mean-field inference and spin glass models. Indeed, the appearance of effective one-body problems describing complex statistical models is usually related to the asymptotic decoupling of the finite marginals of the variables at hand into products of single-variable marginals. Therefore, we do not expect a standard cavity (or leave-one-out) approach based on single-variable extraction to be exact, whereas the replica and cavity approaches are usually shown to be equivalent in mean-field models Mezard et al. (1986). This may explain why the approximate message-passing algorithms proposed in Parker et al. (2014); Krzakala et al. (2013); Kabashima et al. (2016) are, as stated by the authors, not properly converging nor able to match their corresponding theoretical predictions based on the cavity method. Algorithms for extensive-rank systems should therefore combine ingredients from matrix denoising and standard message-passing, reflecting their hybrid mean-field/matrix-model nature.
In order to face this, we adapt the GAMP-RIE (generalised approximate message-passing with rotational invariant estimator) introduced in Maillard et al. (2024a) for the special case of quadratic activation, to accommodate a generic activation function $\sigma$ . By construction, the resulting algorithm described in App. H cannot find the specialisation solution, i.e., a solution where at least $\Theta(k)$ neurons align with the teacher’s. Nevertheless, it matches the performance associated with the so-called universal solution/branch of our theory for all $\alpha$ , which describes a solution with overlap ${\mathbf{W}}^{\intercal}_{j}{\mathbf{W}}^{0}_{j}/d>o_{d}(1)$ for at most $o(k)$ neurons. As a side investigation, we show empirically that the specialisation solution is potentially hard to reach with popular algorithms for some target functions: the algorithms we tested either fail to find it and instead get stuck in a sub-optimal glassy phase (Metropolis-Hastings sampling for the case of binary inner weights), or may find it but in a training time increasing exponentially with $d$ (ADAM Kingma & Ba (2017) and Hamiltonian Monte Carlo (HMC) for the case of Gaussian weights). It would thus be interesting to settle whether GAMP-RIE has the best prediction performance achievable by a polynomial-time learner when $n=\Theta(d^{2})$ for such targets. For specific choices of the distribution of the readout weights, the evidence of hardness is not conclusive and requires further investigation.
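For the binary-weights case, the Metropolis-Hastings sampling mentioned above can be sketched generically as follows. This is a textbook single-flip sampler targeting the Gaussian-channel posterior, not the authors' exact implementation (which would, at the very least, update the likelihood incrementally rather than recompute it at every step).

```python
import numpy as np

def metropolis_binary(X, y, v, Delta, n_steps, seed=0, sigma=np.tanh):
    """Single-flip Metropolis-Hastings over binary inner weights W in
    {-1,+1}^{k x d}, targeting the Gaussian-channel posterior (generic
    sketch; O(n k d) per step as written)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    k = v.shape[0]
    W = rng.choice([-1.0, 1.0], size=(k, d))    # uniform binary initialisation

    def energy(W):
        # minus log-likelihood, up to W-independent constants
        lam = sigma(X @ W.T / np.sqrt(d)) @ v / np.sqrt(k)
        return np.sum((y - lam) ** 2) / (2 * Delta)

    E = energy(W)
    for _ in range(n_steps):
        i, j = rng.integers(k), rng.integers(d)
        W[i, j] *= -1.0                          # propose one sign flip
        E_new = energy(W)
        if rng.random() < np.exp(min(0.0, E - E_new)):
            E = E_new                            # accept
        else:
            W[i, j] *= -1.0                      # reject: undo the flip
    return W, E

# tiny planted instance, dimensions chosen only to keep the demo fast
rng = np.random.default_rng(0)
n, d, k, Delta = 60, 15, 8, 0.5
W_star = rng.choice([-1.0, 1.0], size=(k, d))
v = np.ones(k)
X = rng.standard_normal((n, d))
y = np.tanh(X @ W_star.T / np.sqrt(d)) @ v / np.sqrt(k) \
    + np.sqrt(Delta) * rng.standard_normal(n)
W_mc, E_mc = metropolis_binary(X, y, v, Delta, n_steps=500)
```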
Replica free entropy
Our first result is a tractable approximation for the free entropy. To state it, let us introduce two functions $\mathcal{Q}_{W}(\mathsf{v}),\hat{\mathcal{Q}}_{W}(\mathsf{v})∈[0,1]$ for $\mathsf{v}∈{\rm Supp}(P_{v})$ , which are non-decreasing in $|\mathsf{v}|$ . Let (see (43) in appendix for a more explicit expression of $g$ )
$$
g(x):=\sum_{\ell\geq 3}\frac{\mu_{\ell}^{2}}{\ell!}x^{\ell},\qquad q_{K}(x,\mathcal{Q}_{W}):=\mu_{1}^{2}+\frac{\mu_{2}^{2}}{2}x+\mathbb{E}_{v\sim P_{v}}[v^{2}g(\mathcal{Q}_{W}(v))],\qquad r_{K}:=\mu_{1}^{2}+\frac{\mu_{2}^{2}}{2}(1+\gamma\bar{v}^{2})+g(1),
$$
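These scalar functions are straightforward to evaluate from the Hermite coefficients; a minimal helper (illustrative, not the paper's released code; the coefficient values and readout samples below are placeholders):

```python
import numpy as np
from math import factorial

def make_kernels(mu, gamma, v_bar, v_samples):
    """g, q_K, r_K from the display above, with the Hermite series of the
    activation truncated at order L = len(mu) - 1."""
    L = len(mu) - 1
    def g(x):
        return sum(mu[l] ** 2 / factorial(l) * x ** l for l in range(3, L + 1))
    def q_K(x, Q_W):
        # Q_W : readout value v -> overlap Q_W(v) in [0, 1]
        Ev = np.mean([v ** 2 * g(Q_W(v)) for v in v_samples])
        return mu[1] ** 2 + mu[2] ** 2 * x / 2 + Ev
    r_K = mu[1] ** 2 + mu[2] ** 2 * (1 + gamma * v_bar ** 2) / 2 + g(1.0)
    return g, q_K, r_K

# with centred unit-variance readouts (v_bar = 0) and full overlaps,
# x = 1 and Q_W = 1, the two kernels coincide: q_K = r_K
mu = [0.0, 0.7, 0.3, 0.2, 0.1]            # illustrative Hermite coefficients
g, q_K, r_K = make_kernels(mu, gamma=0.5, v_bar=0.0, v_samples=[1.0, -1.0])
```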
and the auxiliary potentials
$$
\psi_{P_{W}}(x):=\mathbb{E}_{w^{0},\xi}\ln\mathbb{E}_{w}\exp\big(-\tfrac{x}{2}w^{2}+xw^{0}w+\sqrt{x}\,\xi w\big),
$$
where $w^{0},w\sim P_{W}$ and $\xi,u_{0},u\sim{\mathcal{N}}(0,1)$ , all independent. Moreover, $\mu_{{\mathbf{Y}}(x)}$ is the limiting (in $d→∞$ ) spectral density of the data ${\mathbf{Y}}(x)=\sqrt{x/(kd)}\,{\mathbf{S}}^{0}+{\mathbf{Z}}$ in the denoising problem of the matrix ${\mathbf{S}}^{0}:={\mathbf{W}}^{0\intercal}({\mathbf{v}}){\mathbf{W}}^{0}∈\mathbb{R}^{d× d}$ , with ${\mathbf{Z}}$ a standard GOE matrix (a symmetric matrix whose upper triangular part has i.i.d. entries from $\mathcal{N}(0,(1+\delta_{ij})/d)$ ). Denote the minimum mean-square error associated with this denoising problem as ${\rm mmse}_{S}(x)=\lim_{d→∞}d^{-2}\mathbb{E}\|{\mathbf{S}}^{0}-\mathbb{E}[{\mathbf{S}}^{0}\mid{\mathbf{Y}}(x)]\|^{2}$ (whose explicit definition is given in App. D.3) and its functional inverse by ${\rm mmse}_{S}^{-1}$ (which exists by monotonicity).
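Sampling an instance of this auxiliary denoising problem is simple; a sketch (dimensions and priors chosen only for illustration):

```python
import numpy as np

def denoising_observation(W0, v, x, seed=0):
    """Observation Y(x) = sqrt(x/(k d)) S0 + Z of the matrix denoising
    problem in the text, with S0 = W0^T diag(v) W0 and Z a GOE matrix
    whose entry variances are (1 + delta_ij)/d."""
    rng = np.random.default_rng(seed)
    k, d = W0.shape
    S0 = W0.T @ (v[:, None] * W0)          # W0^T diag(v) W0, a d x d matrix
    G = rng.standard_normal((d, d)) / np.sqrt(d)
    Z = (G + G.T) / np.sqrt(2.0)           # GOE: off-diag var 1/d, diag var 2/d
    return np.sqrt(x / (k * d)) * S0 + Z, S0

rng = np.random.default_rng(0)
W0 = rng.standard_normal((20, 40))
v = rng.choice([-1.0, 1.0], size=20)
Y, S0 = denoising_observation(W0, v, x=2.0)
```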
**Result 2.1 (Replica symmetric free entropy)**
*Let the functional $\tau(\mathcal{Q}_{W}):={\rm mmse}_{S}^{-1}(1-\mathbb{E}_{v\sim P_{v}}[v^{2}\mathcal{Q}_{W}(v)^{2}])$ . Given $(\alpha,\gamma)$ , the replica symmetric (RS) free entropy approximating ${\lim}\,f_{n}$ in the scaling limit (4) is ${\rm extr}\,f_{\rm RS}^{\alpha,\gamma}$ with RS potential $f^{\alpha,\gamma}_{\rm RS}=f^{\alpha,\gamma}_{\rm RS}(q_{2},\hat{q}_{2},\mathcal{Q}_{W},\hat{\mathcal{Q}}_{W})$ given by*

$$
f^{\alpha,\gamma}_{\rm RS}:=\psi_{P_{\rm out}}(q_{K}(q_{2},\mathcal{Q}_{W});r_{K})+\frac{1}{4\alpha}(1+\gamma\bar{v}^{2}-q_{2})\hat{q}_{2}+\frac{\gamma}{\alpha}\mathbb{E}_{v\sim P_{v}}\Big[\psi_{P_{W}}(\hat{\mathcal{Q}}_{W}(v))-\frac{1}{2}\mathcal{Q}_{W}(v)\hat{\mathcal{Q}}_{W}(v)\Big]+\frac{1}{\alpha}\big[\iota(\tau(\mathcal{Q}_{W}))-\iota(\hat{q}_{2}+\tau(\mathcal{Q}_{W}))\big]. \tag{6}
$$

*The extremisation operation in ${\rm extr}\,f^{\alpha,\gamma}_{\rm RS}$ selects a solution $(q_{2}^{*},\hat{q}_{2}^{*},\mathcal{Q}_{W}^{*},\hat{\mathcal{Q}}_{W}^{*})$ of the saddle point equations, obtained from $∇ f^{\alpha,\gamma}_{\rm RS}=\mathbf{0}$ , which maximises the RS potential.*
The extremisation of $f_{\rm RS}^{\alpha,\gamma}$ yields the system (76) in the appendix, solved numerically in a standard way (see provided code).
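The numerical scheme typically used for such saddle-point systems is a damped fixed-point iteration; a generic sketch (the actual update functions of system (76) live in the paper's code, so a toy scalar equation stands in for them here):

```python
import numpy as np

def damped_fixed_point(F, x0, damping=0.5, tol=1e-10, max_iter=10_000):
    """Generic damped fixed-point iteration x <- (1-eta) x + eta F(x),
    the standard workhorse for solving replica saddle-point equations."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = (1 - damping) * x + damping * np.asarray(F(x))
        if np.max(np.abs(x_new - x)) < tol:
            return x_new
        x = x_new
    return x

# toy usage: solve the self-consistent equation q = tanh(2 q), which has a
# non-trivial stable fixed point
q_star = damped_fixed_point(lambda q: np.tanh(2 * q), x0=np.array([0.5]))
```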
The order parameters $q_{2}^{*}$ and $\mathcal{Q}_{W}^{*}$ have a precise physical meaning that will become clear from the discussion in Sec. 4. In particular, $q_{2}^{*}$ measures the alignment of the student’s combination of weights ${\mathbf{W}}^{\intercal}({\mathbf{v}}){\mathbf{W}}/\sqrt{k}$ with the corresponding teacher’s ${\mathbf{W}}^{0\intercal}({\mathbf{v}}){\mathbf{W}}^{0}/\sqrt{k}$ , which is non-trivial with $n=\Theta(d^{2})$ data even when the student is not able to reconstruct ${\mathbf{W}}^{0}$ itself (i.e., to specialise). On the other hand, $\mathcal{Q}_{W}^{*}(\mathsf{v})$ measures the overlap between weights $\{{\mathbf{W}}_{i}^{0/·}\mid v_{i}=\mathsf{v}\}$ (a different treatment for weights connected to different $\mathsf{v}$ ’s is needed because, as discussed earlier, the student learns first, with less data, the weights connected to larger readouts). A non-trivial $\mathcal{Q}_{W}^{*}(\mathsf{v})≠ 0$ signals that the student learns something about ${\mathbf{W}}^{0}$ . Thus, the specialisation transitions are naturally defined, based on the extremiser of $f_{\rm RS}^{\alpha,\gamma}$ in the result above, as $\alpha_{\rm sp,\mathsf{v}}(\gamma):=\sup\{\alpha\mid\mathcal{Q}^{*}_{W}(\mathsf{v})=0\}$ . For non-homogeneous readouts, we call the specialisation transition $\alpha_{\rm sp}(\gamma):=\min_{\mathsf{v}}\alpha_{\rm sp,\mathsf{v}}(\gamma)$ . In this article, we report cases where the inner weights are discrete or Gaussian distributed. For activations different from a pure quadratic, $\sigma(x)≠ x^{2}$ , we predict the transition to occur in both cases (see Fig. 1 and 2). Then, $\alpha<\alpha_{\rm sp}$ corresponds to the universal phase, where the free entropy is independent of the choice of the prior over the inner weights. Instead, $\alpha>\alpha_{\rm sp}$ is the specialisation phase, where the prior $P_{W}$ matters and the student aligns a finite fraction of its weights $({\mathbf{W}}_{j})_{j≤ k}$ with those of the teacher, which lowers the generalisation error.
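In simulations, the per-neuron overlaps defining specialisation must be computed up to a matching of student and teacher neurons (they are only identifiable up to permutations among equal readouts). A minimal sketch, using a greedy matching as a simple stand-in for an optimal assignment:

```python
import numpy as np

def specialisation_overlaps(W, W0):
    """Per-neuron teacher-student overlaps W_i . W0_j / d, with student
    neurons greedily matched to teacher neurons by largest |overlap|."""
    k, d = W0.shape
    O = W @ W0.T / d                       # overlap matrix O_ij = W_i . W0_j / d
    A = np.abs(O).copy()
    overlaps = np.zeros(k)
    for _ in range(k):
        i, j = np.unravel_index(np.argmax(A), A.shape)
        overlaps[j] = O[i, j]              # signed overlap of the matched pair
        A[i, :] = -1.0                     # remove student neuron i from pool
        A[:, j] = -1.0                     # remove teacher neuron j from pool
    return overlaps

# sanity check: a student identical to the teacher is fully specialised,
# with all matched overlaps close to 1
rng = np.random.default_rng(0)
W0 = rng.standard_normal((8, 200))
q = specialisation_overlaps(W0, W0)
```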
Let us comment on why the special case $\sigma(x)=x^{2}$ with $P_{W}=\mathcal{N}(0,1)$ could be treated exactly with known techniques (spherical integration) in Maillard et al. (2024a); Xu et al. (2025). With $\sigma(x)=x^{2}$ the responses $(y_{\mu})$ depend on ${\mathbf{W}}^{0\intercal}({\mathbf{v}}){\mathbf{W}}^{0}$ only. If ${\mathbf{v}}$ has finite fractions of equal entries, a large invariance group prevents learning ${\mathbf{W}}^{0}$ and thus specialisation. Take as example ${\mathbf{v}}=(1,...,1,-1,...,-1)$ with the first half filled with ones. Then, the responses are indistinguishable from those obtained using a modified matrix ${\mathbf{W}}^{0\intercal}{\mathbf{U}}^{\intercal}({\mathbf{v}}){\mathbf{U}}{\mathbf{W}}^{0}$ , where ${\mathbf{U}}=(({\mathbf{U}}_{1},\mathbf{0}_{d/2})^{\intercal},(\mathbf{0}_{d/2},{\mathbf{U}}_{2})^{\intercal})$ is block diagonal with $d/2× d/2$ orthogonal ${\mathbf{U}}_{1},{\mathbf{U}}_{2}$ and zeros on the off-diagonal blocks. The Gaussian prior $P_{W}$ is rotationally invariant and thus does not break any invariance, so ${\mathbf{U}}_{1},{\mathbf{U}}_{2}$ are arbitrary. The resulting invariance group has a $\Theta(d^{2})$ entropy (the logarithm of its volume), which is comparable to the leading order of the free entropy. Therefore, it cannot be broken using infinitesimal perturbations (or “side information”) and, consequently, prevents specialisation. This reasoning can be extended to $P_{v}$ with continuous support, as long as we can discretise it with a finite (possibly large) number of bins, take the limit (4) first, and then take the continuum limit of the binning afterwards. However, the picture changes if the prior breaks rotational invariance; e.g., with Rademacher $P_{W}$ , only signed permutation invariances survive, a symmetry with negligible entropy $o(d^{2})$ which, consequently, does not change the limiting thermodynamic (information-theoretic) quantities. The large rotational invariance group is the reason why $\sigma(x)=x^{2}$ with $P_{W}=\mathcal{N}(0,1)$ can be treated using the HCIZ integral alone. Even when $P_{W}=\mathcal{N}(0,1)$ , the presence of any other term in the series expansion of $\sigma$ breaks invariances with large entropy: specialisation can then occur, thus requiring our theory.

We mention that our theory seems inexact for $\sigma(x)=x^{2}$ with $P_{W}=\mathcal{N}(0,1)$ if applied naively, as it predicts ${\mathcal{Q}}_{W}(\mathsf{v})>0$ and therefore does not recover the rigorous result of Xu et al. (2025) (yet, it predicts a free entropy less than $1\%$ away from the truth). When solving the extremisation of (6) in this case, we noticed that the difference between the RS free entropy of the correct universal solution, $\mathcal{Q}_{W}(\mathsf{v})=0$ , and the maximiser, predicting $\mathcal{Q}_{W}(\mathsf{v})>0$ , does not exceed $≈ 1\%$ : the RS potential is very flat as a function of $\mathcal{Q}_{W}$ . We thus cannot discard that the true maximiser of the potential is at $\mathcal{Q}_{W}(\mathsf{v})=0$ , and that we observe otherwise due to numerical errors. Indeed, evaluating the spherical integrals $\iota(\,·\,)$ in $f^{\alpha,\gamma}_{\rm RS}$ is challenging, in particular when $\gamma$ is small. Actually, for $\gamma\gtrsim 1$ we do get that $\mathcal{Q}_{W}(\mathsf{v})=0$ is always the maximiser for $\sigma(x)=x^{2}$ with $P_{W}=\mathcal{N}(0,1)$ . Nevertheless, the solution of Maillard et al. (2024a); Xu et al. (2025) is recovered from our equations by enforcing a vanishing overlap $\mathcal{Q}_{W}(\mathsf{v})=0$ , i.e., via its universal branch.
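The invariance argument for $\sigma(x)=x^{2}$ is easy to verify numerically: any block-diagonal orthogonal ${\mathbf{U}}$ (one orthogonal block per group of equal readouts) satisfies ${\mathbf{U}}^{\intercal}({\mathbf{v}}){\mathbf{U}}=({\mathbf{v}})$, so the matrix determining the responses is unchanged. A sketch (dimensions chosen only for illustration):

```python
import numpy as np

# v = (1,...,1,-1,...,-1): rotating the teacher by a block-diagonal
# orthogonal U (one block per sign of v) leaves S0 = W0^T diag(v) W0,
# hence the responses for sigma(x) = x^2, exactly invariant.
rng = np.random.default_rng(1)
k = d = 40
v = np.concatenate([np.ones(k // 2), -np.ones(k // 2)])
W0 = rng.standard_normal((k, d))

U1, _ = np.linalg.qr(rng.standard_normal((k // 2, k // 2)))
U2, _ = np.linalg.qr(rng.standard_normal((k // 2, k // 2)))
U = np.block([[U1, np.zeros((k // 2, k // 2))],
              [np.zeros((k // 2, k // 2)), U2]])

S0 = W0.T @ np.diag(v) @ W0
S0_rot = (U @ W0).T @ np.diag(v) @ (U @ W0)
gap = np.max(np.abs(S0 - S0_rot))      # numerically zero: U is undetectable
```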
Bayes generalisation error
Another main result is an approximate formula for the generalisation error. Let $({\mathbf{W}}^{a})_{a≥ 1}$ be i.i.d. samples from the posterior $dP(\,·\mid\mathcal{D})$ and ${\mathbf{W}}^{0}$ the teacher’s weights. Assuming that the joint law of $(\lambda_{\rm test}({\mathbf{W}}^{a},{\mathbf{x}}_{\rm test}))_{a≥ 0}=:(\lambda^{a})_{a≥ 0}$ for a common test input ${\mathbf{x}}_{\rm test}∉\mathcal{D}$ is a centred Gaussian, our framework predicts its covariance. Our approximation for the Bayes error follows.
**Result 2.2 (Bayes generalisation error)**
*Let $q_{K}^{*}=q_{K}(q_{2}^{*},\mathcal{Q}_{W}^{*})$ , where $(q_{2}^{*},\hat{q}_{2}^{*},\mathcal{Q}_{W}^{*},\hat{\mathcal{Q}}_{W}^{*})$ is an extremiser of $f_{\rm RS}^{\alpha,\gamma}$ as in Result 2.1. Assuming joint Gaussianity of the post-activations $(\lambda^{a})_{a≥ 0}$ , in the scaling limit (4) their mean is zero and their covariance is approximated by $\mathbb{E}\lambda^{a}\lambda^{b}=q_{K}^{*}+(r_{K}-q_{K}^{*})\delta_{ab}=:(\mathbf{\Gamma})_{ab}$ , see App. C. Assume $\mathcal{C}$ has the series expansion $\mathcal{C}(y,\hat{y})=\sum_{i≥ 0}c_{i}(y)\hat{y}^{i}$ . The Bayes error $\smash{\lim\,\varepsilon^{\mathcal{C},\mathsf{f}}}$ is approximated by*

$$
\mathbb{E}_{(\lambda^{a})\sim\mathcal{N}(\mathbf{0},\mathbf{\Gamma})}\mathbb{E}_{y_{\rm test}\sim P_{\rm out}(\,·\mid\lambda^{0})}\sum_{i≥ 0}c_{i}(y_{\rm test})\prod_{a=1}^{i}\mathsf{f}(\lambda^{a}).
$$

*Letting $\mathbb{E}[\,·\mid\lambda]=\int dy\,(\,·\,)\,P_{\rm out}(y\mid\lambda)$ , the Bayes-optimal mean-square generalisation error $\smash{\lim\,\varepsilon^{\rm opt}}$ is approximated by*

$$
\mathbb{E}_{\lambda^{0},\lambda^{1}}\big(\mathbb{E}[y^{2}\mid\lambda^{0}]-\mathbb{E}[y\mid\lambda^{0}]\,\mathbb{E}[y\mid\lambda^{1}]\big). \tag{7}
$$
This result assumed $\mu_{0}=0$ ; see App. D.5 if this is not the case. Results 2.1 and 2.2 provide an effective theory for the generalisation capabilities of Bayesian shallow networks with generic activation. We call these “results” because, despite their excellent match with numerics, we do not expect these formulas to be exact: their derivation is based on an unconventional mix of spin glass techniques and spherical integrals, and requires approximations in order to deal with the fact that the degrees of freedom to integrate are large matrices of extensive rank. This is in contrast with simpler (vector) models (perceptrons, multi-index models, etc.), where replica formulas are routinely proved correct, see e.g. Barbier & Macris (2019); Barbier et al. (2019); Aubin et al. (2018).
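For the Gaussian channel, where $\mathbb{E}[y\mid\lambda]=\lambda$ and $\mathbb{E}[y^{2}\mid\lambda]=\lambda^{2}+\Delta$, formula (7) reduces to the closed form $r_{K}-q_{K}^{*}+\Delta$. A quick Monte Carlo check of this reduction (the values of $q_{K}$, $r_{K}$, $\Delta$ below are placeholders, not solutions of the saddle-point equations):

```python
import numpy as np

# Eq. (7) for P_out(y|lam) = N(y; lam, Delta):
#   E[(lam0^2 + Delta) - lam0 * lam1] = r_K - q_K + Delta,
# since E[(lam0)^2] = Gamma_00 = r_K and E[lam0 lam1] = Gamma_01 = q_K.
rng = np.random.default_rng(0)
q_K, r_K, Delta = 0.3, 1.0, 0.1                 # placeholder values
Gamma = np.array([[r_K, q_K], [q_K, r_K]])      # Cov(lam^0, lam^1)
lam = rng.multivariate_normal(np.zeros(2), Gamma, size=2_000_000)
eps_mc = np.mean(lam[:, 0] ** 2 + Delta - lam[:, 0] * lam[:, 1])
eps_exact = r_K - q_K + Delta
```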
Figure 1: Theoretical prediction (solid curves) of the Bayes-optimal mean-square generalisation error for Gaussian inner weights with ReLU(x) activation (blue curves) and Tanh(2x) activation (red curves), $d=150$, $\gamma=0.5$, with linear readout with Gaussian label noise of variance $\Delta=0.1$ and different $P_{v}$ laws. The dashed lines are the theoretical predictions associated with the universal solution, obtained by plugging ${\mathcal{Q}}_{W}(\mathsf{v})=0\ ∀\ \mathsf{v}$ in (6) and extremising w.r.t. $(q_{2},\hat{q}_{2})$ (the curve coincides with the optimal one before the transition $\alpha_{\rm sp}(\gamma)$). The numerical points are obtained with Hamiltonian Monte Carlo (HMC) with informative initialisation on the target (empty circles), uninformative (random) initialisation (empty crosses), and ADAM (thin crosses). Triangles are the error of GAMP-RIE (Maillard et al., 2024a) extended to generic activation, obtained by plugging estimator (109) into (3) in the appendix. Each point is averaged over 10 instances of the teacher and training set. Error bars are the standard deviation over instances. The generalisation error for a given training set is evaluated as $\frac{1}{2}\mathbb{E}_{{\mathbf{x}}_{\rm test}\sim\mathcal{N}(0,I_{d})}(\lambda_{\rm test}({\mathbf{W}})-\lambda_{\rm test}^{0})^{2}$, using a single sample ${\mathbf{W}}$ from the posterior for HMC. For ADAM, with batch size fixed to $n/5$ and initial learning rate $0.05$, the error corresponds to the lowest one reached during training, i.e., we use early stopping based on the minimum test loss over all gradient updates. Its generalisation error is then evaluated at this point and divided by two (for comparison with the theory). The average over ${\mathbf{x}}_{\rm test}$ is computed empirically from $10^{5}$ i.i.d. test samples. We exploit that, for typical posterior samples, the Gibbs error $\varepsilon^{\rm Gibbs}$ defined in (39) in App. C is linked to the Bayes-optimal error by $(\varepsilon^{\rm Gibbs}-\Delta)/2=\varepsilon^{\rm opt}-\Delta$, see (40) in appendix. To use this formula, we assume the concentration of the Gibbs error w.r.t. the posterior distribution, so that it can be evaluated from a single sample per instance. Left: Homogeneous readouts $P_{v}=\delta_{1}$. Centre: 4-point readouts $P_{v}=\frac{1}{4}(\delta_{-3/\sqrt{5}}+\delta_{-1/\sqrt{5}}+\delta_{1/\sqrt{5}}+\delta_{3/\sqrt{5}})$. Right: Gaussian readouts $P_{v}=\mathcal{N}(0,1)$.
# 3 Theoretical predictions and numerical experiments
Let us compare our theoretical predictions with simulations. In Figs. 1 and 2 we report the theoretical curves from Result 2.2, focusing on the optimal mean-square generalisation error for networks with different $\sigma$, with linear readout with Gaussian noise variance $\Delta$. The Gibbs error divided by $2$ is used to compute the optimal error; see Remark C.2 in App. C for a justification. In what follows, the error attained by ADAM is also divided by two, purely for the purpose of comparison.
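The conversion from Gibbs to Bayes-optimal error used here follows from the relation $(\varepsilon^{\rm Gibbs}-\Delta)/2=\varepsilon^{\rm opt}-\Delta$ stated in App. C. A one-line sketch (the helper name is illustrative, not from the paper):

```python
def bayes_from_gibbs(eps_gibbs: float, delta: float) -> float:
    # Invert (eps_gibbs - delta)/2 = eps_opt - delta for eps_opt.
    return delta + (eps_gibbs - delta) / 2.0

# E.g., a measured Gibbs error of 0.3 under label noise delta = 0.1
# corresponds to a Bayes-optimal error of 0.2.
eps_opt = bayes_from_gibbs(0.3, 0.1)
```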
Figure 1 focuses on networks with Gaussian inner weights, various readout laws, for $\sigma(x)={\rm ReLU}(x)$ and ${\rm Tanh}(2x)$ . Informative (i.e., on the teacher) and uninformative (random) initialisations are used when sampling the posterior by HMC. We also run ADAM, always selecting its best performance over all epochs, and implemented an extension of the GAMP-RIE of Maillard et al. (2024a) for generic activation (see App. H). It can be shown analytically that GAMP-RIE’s generalisation error asymptotically (in $d$ ) matches the prediction of the universal branch of our theory (i.e., associated with $\mathcal{Q}_{W}(\mathsf{v})=0\ ∀\ \mathsf{v}$ ).
For ReLU activation and homogeneous readouts (left panel), informed HMC follows the specialisation branch (the solution of the saddle point equations with $\mathcal{Q}_{W}(\mathsf{v})≠ 0$ for at least one $\mathsf{v}$), while with uninformative initialisation it sticks to the universal branch, suggesting algorithmic hardness. We shall come back to this matter below. We note that the error attained by ADAM (divided by 2) is close to the performance associated with the universal branch, which suggests that ADAM is an effective Gibbs estimator for this $\sigma$. For Tanh and homogeneous readouts, both the uninformative and informative points lie on the specialisation branch, while ADAM attains an error greater than twice the posterior sample’s generalisation error.
For non-homogeneous readouts (centre and right panels) the points associated with the informative initialisation lie consistently on the specialisation branch, for both ${\rm ReLU}$ and Tanh, while the uninformatively initialised samples perform slightly worse for Tanh. Non-homogeneous readouts improve ADAM’s performance: for Gaussian readouts and high sampling ratio, its halved generalisation error is consistently below the error associated with the universal branch of the theory.
Figure 2 concerns networks with Rademacher weights and homogeneous readout. The numerical points are of two kinds: the dots, obtained from Metropolis–Hastings sampling of the weight posterior, and the circles, obtained from the GAMP-RIE (App. H). We report analogous simulations for ${\rm ReLU}$ and ${\rm ELU}$ activations in Figure 7, App. H. The remarkable agreement between theoretical curves and experimental points in both phases supports the assumptions used in Sec. 4.
Figure 2: Theoretical prediction (solid curves) of the Bayes-optimal mean-square generalisation error for binary inner weights and polynomial activations: $\sigma_{1}={\rm He}_{2}/\sqrt{2}$, $\sigma_{2}={\rm He}_{3}/\sqrt{6}$, $\sigma_{3}={\rm He}_{2}/\sqrt{2}+{\rm He}_{3}/6$, with $\gamma=0.5$, $d=150$, linear readout with Gaussian label noise of variance $\Delta=1.25$, and homogeneous readouts ${\mathbf{v}}=\mathbf{1}$. Dots are optimal errors computed via Gibbs errors (see Fig. 1) by running a Metropolis-Hastings MCMC initialised near the teacher. Circles are the error of GAMP-RIE (Maillard et al., 2024a) extended to generic activation, see App. H. Points are averaged over 16 data instances. Error bars for MCMC are the standard deviation over instances (omitted for GAMP-RIE, but of the same order). Dashed and dotted lines denote, respectively, the universal (i.e., the $\mathcal{Q}_{W}(\mathsf{v})=0\ ∀\ \mathsf{v}$ solution of the saddle point equations) and the specialisation branches where they are metastable (i.e., a local maximiser of the RS potential but not the global one).
Figure 3 illustrates the learning mechanism for models with Gaussian weights and non-homogeneous readouts, revealing a sequence of phase transitions as $\alpha$ increases. The top panel shows the overlap function $\mathcal{Q}_{W}(\mathsf{v})$ in the case of Gaussian readouts for four different sampling ratios $\alpha$. In the bottom panel the readout takes four different values with equal probability; the figure shows the evolution of the two relevant overlaps associated with the symmetric readout values $± 3/\sqrt{5}$ and $± 1/\sqrt{5}$. As $\alpha$ increases, the student weights first align with the teacher weights associated with the largest readout amplitude, marking the first phase transition. As these alignments strengthen with further increases of $\alpha$, a second transition occurs when the weights corresponding to the next largest readout amplitude are learnt, and so on. In this way, continuous readouts produce an infinite sequence of learning transitions, as supported by the top panel of Figure 3.
Even when it dominates the posterior measure, we observe in simulations that the specialisation solution can be algorithmically hard to reach. With a discrete distribution of readouts (such as $P_{v}=\delta_{1}$ or Rademacher), simulations for binary inner weights exhibit it only when sampling with informative initialisation (i.e., the MCMC runs to sample ${\mathbf{W}}$ are initialised in the vicinity of ${\mathbf{W}}^{0}$). Moreover, even in cases where algorithms (such as ADAM or HMC for Gaussian inner weights) are able to find the specialisation solution, they manage to do so only after a training time increasing exponentially with $d$, and for relatively small values of the label noise $\Delta$; see the discussion in App. I. For the continuous distribution of readouts $P_{v}={\mathcal{N}}(0,1)$, our numerical results are inconclusive regarding hardness and call for numerical investigation at a larger scale.
The universal phase is superseded at $\alpha_{\rm sp}$ by a specialisation phase, where the student’s inner weights start aligning with the teacher’s. This transition occurs for both binary and Gaussian priors over the inner weights, and it differs in nature from the perfect recovery threshold identified in Maillard et al. (2024a), which is the point where the student with Gaussian weights learns perfectly ${\mathbf{W}}^{0\intercal}({\mathbf{v}}){\mathbf{W}}^{0}$ (but not ${\mathbf{W}}^{0}$) and thus attains perfect generalisation in the case of purely quadratic activation and noiseless labels. For large $\alpha$, the student effectively realises that the higher-order terms of the activation’s Hermite decomposition are not label noise, but are informative about the decision rule. The two identified phases are akin to those recently described in Barbier et al. (2025) for matrix denoising. The model we consider is also a matrix model in ${\mathbf{W}}$, with the amount of data scaling as the number of matrix elements. When data are scarce, the student cannot break the numerous symmetries of the problem, resulting in an “effective rotational invariance” at the source of the prior universality, with posterior samples having a vanishing overlap with ${\mathbf{W}}^{0}$. On the other hand, when data are sufficiently abundant, $\alpha>\alpha_{\rm sp}$, the student’s samples “synchronise” with the teacher.
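The Hermite decomposition mentioned above can be sketched numerically: an activation expands as $\sigma(z)=\sum_{k\geq 0}\mu_{k}{\rm He}_{k}(z)/k!$ with $\mu_{k}=\mathbb{E}_{z\sim\mathcal{N}(0,1)}[\sigma(z){\rm He}_{k}(z)]$, where ${\rm He}_{k}$ are the probabilists' Hermite polynomials. A minimal sketch via Gauss-Hermite quadrature (the function name and node count are illustrative choices, not from the paper):

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

def hermite_coeffs(sigma, kmax=5, n_nodes=200):
    """mu_k = E_{z~N(0,1)}[sigma(z) He_k(z)] by Gauss-Hermite quadrature."""
    z, w = hermegauss(n_nodes)     # nodes/weights for the weight exp(-z^2/2)
    w = w / np.sqrt(2 * np.pi)     # renormalise to the standard Gaussian measure
    return np.array([np.sum(w * sigma(z) * hermeval(z, [0] * k + [1]))
                     for k in range(kmax + 1)])

# Tanh(2x), one of the activations of Fig. 1: it is odd, so even coefficients
# vanish; mu_1 is the linear component captured by the universal branch, while
# mu_3, mu_5, ... are the higher-order terms that become informative once the
# student specialises.
mu = hermite_coeffs(lambda z: np.tanh(2 * z))
```

By symmetry the even coefficients cancel to numerical precision, while the odd ones carry all the signal about the decision rule.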
Figure 3: Top: Theoretical prediction (solid curves) of the overlap function $\mathcal{Q}_{W}(\mathsf{v})$ for different sampling ratios $\alpha$ for Gaussian inner weights, ReLU(x) activation, $d=150,\gamma=0.5$ , linear readout with $\Delta=0.1$ and $P_{v}=\mathcal{N}(0,1)$ . The shaded curves were obtained from HMC initialised informatively. Using a single sample ${\mathbf{W}}^{a}$ from the posterior, $\mathcal{Q}_{W}(\mathsf{v})$ has been evaluated numerically by dividing the interval $[-2,2]$ into 50 bins and by computing the value of the overlap associated with each bin. Each point has been averaged over 50 instances of the training set, and shaded regions around them correspond to one standard deviation. Bottom: Theoretical prediction (solid curves) of the overlaps as function of the sampling ratio $\alpha$ for Gaussian inner weights, Tanh(2x) activation, $d=150,\gamma=0.5$ , linear readout with $\Delta=0.1$ and $P_{v}=\frac{1}{4}(\delta_{-3/\sqrt{5}}+\delta_{-1/\sqrt{5}}+\delta_{1/\sqrt{5}}+\delta_{3/\sqrt{5}})$ . The shaded curves were obtained from informed HMC. Each point has been averaged over 10 instances of the training set, with one standard deviation depicted.
The phenomenology observed depends on the chosen activation function. In particular, by expanding $\sigma$ in the Hermite basis we realise that the way its first three terms enter information-theoretic quantities is completely described by the order 0, 1 and 2 tensors later defined in (12), which combine the inner and readout weights. In the regime of quadratically many data, the order 0 and 1 tensors are recovered exactly by the student because of the overwhelming abundance of data compared to their dimension. The challenge is thus to learn the second-order tensor. In contrast, we claim that learning any higher-order tensor can only happen when the student aligns its weights with ${\mathbf{W}}^{0}$ : before this “synchronisation”, they play the role of an effective noise. This is the mechanism behind the specialisation transition. For odd activations ( ${\rm Tanh}$ in Figure 1, $\sigma_{3}$ in Figure 2), where $\mu_{2}=0$ , the aforementioned order-2 tensor no longer contributes to the learning. Indeed, we observe numerically that the generalisation error sticks to a constant value for $\alpha<\alpha_{\rm sp}$ , whereas at the phase transition it suddenly drops. This is because the learning of the order-2 tensor is skipped entirely, and the only chance to perform better is to learn all the other higher-order tensors through specialisation.
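To illustrate the role of the Hermite coefficients $\mu_{\ell}$ discussed above, the following sketch (our own illustrative script, not part of the paper's code) estimates them by Gauss–Hermite quadrature and confirms that $\mu_{2}$ vanishes for the odd activation Tanh but not for ReLU:

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

def hermite_coeff(sigma, ell, n_nodes=200):
    """mu_ell = E[sigma(z) He_ell(z)] for z ~ N(0,1), probabilists' Hermite polynomials."""
    z, w = hermegauss(n_nodes)        # nodes/weights for the weight function e^{-z^2/2}
    He = hermeval(z, [0] * ell + [1])  # He_ell evaluated at the quadrature nodes
    return (w * sigma(z) * He).sum() / np.sqrt(2 * np.pi)

relu = lambda z: np.maximum(z, 0.0)
mu2_relu = hermite_coeff(relu, 2)      # analytically equals 1/sqrt(2*pi)
mu2_tanh = hermite_coeff(np.tanh, 2)   # vanishes: odd activation kills even coefficients
```

For ReLU the quadrature recovers $\mu_{2}=1/\sqrt{2\pi}\approx 0.399$, while for Tanh the even-order integrand is odd overall and the coefficient is zero, consistent with the absence of the order-2 tensor for odd activations.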
By extrapolating universality results to generic activations, we are able to use the GAMP-RIE of Maillard et al. (2024a), publicly available at Maillard et al. (2024b), to obtain a polynomial-time predictor for test data. Its generalisation error follows our universal theoretical curve even in the $\alpha$ regime where MCMC sampling experiences a computationally hard phase with worse performance (for binary weights), and in particular after $\alpha_{\rm sp}$ (see Fig. 2, circles). Extending this algorithm, initially proposed for quadratic activation, to a generic one is possible thanks to the identification of an effective GLM onto which the learning problem can be mapped (while the mapping is exact when $\sigma(x)=x^{2}$ as exploited by Maillard et al. (2024a)), see App. H. The key observation is that our effective GLM representation holds not only from a theoretical perspective when describing the universal phase, but also algorithmically.
Finally, we emphasise that our theory is consistent with Cui et al. (2023), which considers the simpler strongly over-parametrised regime $n=\Theta(d)$ rather than the interpolation one $n=\Theta(d^{2})$ : our generalisation curves at $\alpha→ 0$ match theirs at $\alpha_{1}:=n/d→∞$ , which is when the student learns perfectly the combinations ${\mathbf{v}}^{0\intercal}{\mathbf{W}}^{0}/\sqrt{k}$ (but nothing more).
4 Accessing the free entropy and generalisation error: replica method and spherical integration combined
The goal is to compute the asymptotic free entropy by the replica method Mezard et al. (1986), a powerful heuristic from spin glasses also used in machine learning Engel & Van den Broeck (2001), combined with the HCIZ integral. Our derivation is based on a Gaussian ansatz on the replicated post-activations of the hidden layer, which generalises Conjecture 3.1 of Cui et al. (2023), now proved in Camilli et al. (2025), where it is specialised to the case of linearly many data ( $n=\Theta(d)$ ). To obtain this generalisation, we will write the kernel arising from the covariance of the aforementioned post-activations as an infinite series of scalar order parameters derived from the expansion of the activation function in the Hermite basis, following an approach recently devised in Aguirre-López et al. (2025) in the context of the random features model (see also Hu et al. (2024) and Ghorbani et al. (2021)). Another key ingredient of our analysis will be a generalisation of an ansatz used in the replica method by Sakata & Kabashima (2013) for dictionary learning.
4.1 Replicated system and order parameters
The starting point in the replica method to tackle the data average is the replica trick:
$$
\textstyle{\lim}\,\frac{1}{n}\mathbb{E}\ln{\mathcal{Z}}(\mathcal{D})={\lim}\lim\limits_{s→ 0^{+}}\frac{1}{ns}\ln\mathbb{E}\mathcal{Z}^{s}=\lim\limits_{s→ 0^{+}}{\lim}\,\frac{1}{ns}\ln\mathbb{E}\mathcal{Z}^{s}
$$
assuming the limits commute. Recall ${\mathbf{W}}^{0}$ are the teacher weights. Consider first $s∈\mathbb{N}^{+}$ . Let the “replicas” of the post-activation be $\{\lambda^{a}({\mathbf{W}}^{a}):=\frac{1}{\sqrt{k}}{\mathbf{v}}^{\intercal}\sigma(\frac{1}{\sqrt{d}}{\mathbf{W}}^{a}{\mathbf{x}})\}_{a=0,...,s}$ . We then directly obtain
$$
\textstyle\mathbb{E}\mathcal{Z}^{s}=\mathbb{E}_{{\mathbf{v}}}\int\prod_{a}^{0,s}dP_{W}({\mathbf{W}}^{a})\big[\mathbb{E}_{\mathbf{x}}\int dy\prod_{a}^{0,s}P_{\rm out}(y\mid\lambda^{a}({\mathbf{W}}^{a}))\big]^{n}.
$$
The key is to identify the law of the replicas $\{\lambda^{a}\}_{a=0,...,s}$ , which are dependent random variables due to the common random Gaussian input ${\mathbf{x}}$ , conditionally on $({\mathbf{W}}^{a})$ . Our key hypothesis is that $\{\lambda^{a}\}$ is jointly Gaussian, an ansatz we cannot prove but that we validate a posteriori thanks to the excellent match between our theory and the empirical generalisation curves, see Sec. 2.2. Similar Gaussian assumptions have been the crux of a whole line of recent works on the analysis of neural networks, and are now known under the name of “Gaussian equivalence” (Goldt et al., 2020; Hastie et al., 2022; Mei & Montanari, 2022; Goldt et al., 2022; Hu & Lu, 2023). This can also sometimes be heuristically justified based on Breuer–Major Theorems (Nourdin et al., 2011; Pacelli et al., 2023).
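This Gaussian ansatz can be probed in a self-contained way (a minimal sketch with illustrative sizes, not the paper's experiment): fix one weight configuration, sample many Gaussian inputs, and check that the post-activation $\lambda$ has near-zero excess kurtosis, as a Gaussian variable would.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 200, 150                          # illustrative sizes only
W = rng.standard_normal((k, d))          # one fixed inner-weight configuration
v = rng.standard_normal(k)               # readout weights

# lambda = v . tanh(W x / sqrt(d)) / sqrt(k) over fresh Gaussian inputs x, in chunks
lam = np.concatenate([
    np.tanh(rng.standard_normal((10_000, d)) @ W.T / np.sqrt(d)) @ v / np.sqrt(k)
    for _ in range(10)
])

# a Gaussian variable has excess kurtosis 0; deviations here are O(1/k)
z = (lam - lam.mean()) / lam.std()
excess_kurtosis = (z ** 4).mean() - 3.0
```

The excess kurtosis comes out at the level of a few times $10^{-2}$, of order $1/k$, in line with the Gaussian equivalence picture (this checks the fourth moment only, not the full ansatz).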
Given two replica indices $a,b∈\{0,...,s\}$ we define the neuron-neuron overlap matrix $\Omega^{ab}_{ij}:={\mathbf{W}}_{i}^{a\intercal}{\mathbf{W}}^{b}_{j}/d$ with $i,j∈[k]$ . Recalling the Hermite expansion of $\sigma$ , by using Mehler’s formula, see App. A, the post-activations covariance $K^{ab}:=\mathbb{E}\lambda^{a}\lambda^{b}$ reads
$$
\textstyle K^{ab}=\sum_{\ell\geq 1}\frac{\mu^{2}_{\ell}}{\ell!}Q_{\ell}^{ab}\ \ \text{with}\ \ Q_{\ell}^{ab}:=\frac{1}{k}\sum_{i,j\leq k}v_{i}v_{j}(\Omega^{ab}_{ij})^{\ell}. \tag{8}
$$
This covariance ${\mathbf{K}}$ is complicated but, as we argue below, simplifications occur as $d→∞$ . In particular, the first two overlaps $Q_{1}^{ab},Q_{2}^{ab}$ are special. We claim that the higher-order overlaps $(Q_{\ell}^{ab})_{\ell≥ 3}$ can be expressed as functions of simpler order parameters.
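The series (8) can also be checked directly by Monte Carlo. The sketch below (illustrative sizes; rows of $\mathbf{W}$ are normalised so that $\Omega_{ii}=1$ and Mehler's formula applies exactly, and the Hermite series is truncated at $\ell=11$) compares the empirical variance of $\lambda$ with $\sum_{\ell}\frac{\mu_{\ell}^{2}}{\ell!}Q_{\ell}$ for the Tanh activation:

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermegauss, hermeval

rng = np.random.default_rng(1)
d, k = 300, 150
W = rng.standard_normal((k, d))
W *= np.sqrt(d) / np.linalg.norm(W, axis=1, keepdims=True)  # unit-variance rows: Omega_ii = 1
v = rng.standard_normal(k)
Omega = W @ W.T / d

# Hermite coefficients of tanh by Gauss-Hermite quadrature (mu_0 = mu_2 = ... = 0, odd activation)
z, wq = hermegauss(200)
mu = np.array([(wq * np.tanh(z) * hermeval(z, [0] * l + [1])).sum() / np.sqrt(2 * np.pi)
               for l in range(12)])

# theory, eq. (8): K = sum_l mu_l^2/l! Q_l with Q_l = v^T Omega^{(Hadamard l)} v / k
K_theory = sum(mu[l] ** 2 / factorial(l) * (v @ (Omega ** l) @ v) / k for l in range(1, 12))

# Monte Carlo estimate of Var_x(lambda) for the same weights
lam = np.concatenate([
    np.tanh(rng.standard_normal((20_000, d)) @ W.T / np.sqrt(d)) @ v / np.sqrt(k)
    for _ in range(5)
])
K_mc = lam.var()
```

The two numbers agree to within Monte Carlo accuracy (a few percent here), supporting the Gaussian/Mehler description of the post-activation covariance.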
Figure 4: Hamiltonian Monte Carlo dynamics of the overlaps $Q_{\ell}=Q_{\ell}^{01}$ between student and teacher weights for $\ell∈[5]$ , with activation function ReLU(x), $d=200$ , $\gamma=0.5$ , linear readout with $\Delta=0.1$ and two choices of sampling ratios and readouts: $\alpha=1.0$ with $P_{v}=\delta_{1}$ (Left) and $\alpha=3.0$ with $P_{v}=\mathcal{N}(0,1)$ (Right). The teacher weights ${\mathbf{W}}^{0}$ are Gaussian. The dynamics is initialised informatively, i.e., on ${\mathbf{W}}^{0}$ . The overlap $Q_{1}$ always fluctuates around 1. Left: The overlaps $Q_{\ell}$ for $\ell≥ 3$ converge to 0 at equilibrium, while $Q_{2}$ is well estimated by the theory (orange dashed line). Right: At the higher sampling ratio $\alpha$ , the $Q_{\ell}$ for $\ell≥ 3$ are also non-zero and agree with their theoretical predictions (dashed lines). Insets show the mean-square generalisation error and its theoretical prediction.
4.2 Simplifying the order parameters
In this section we show how to drastically reduce the number of order parameters to track. Assume at the moment that the readout prior $P_{v}$ has discrete support $\mathsf{V}=\{\mathsf{v}\}$ ; this can be relaxed by binning a continuous support, as mentioned in Sec. 2.2. The overlaps in (8) can be written as
$$
\textstyle Q_{\ell}^{ab}=\frac{1}{k}\sum_{\mathsf{v},\mathsf{v}^{\prime}\in\mathsf{V}}\mathsf{v}\,\mathsf{v}^{\prime}\sum_{\{i,j\leq k\mid v_{i}=\mathsf{v},v_{j}=\mathsf{v}^{\prime}\}}(\Omega_{ij}^{ab})^{\ell}. \tag{9}
$$
In the following, for $\ell≥ 3$ we discard the terms $\mathsf{v}≠\mathsf{v}^{\prime}$ in the above sum, assuming they are suppressed w.r.t. the diagonal ones. In other words, a neuron ${\mathbf{W}}^{a}_{i}$ of a student (replica) with readout value $v_{i}=\mathsf{v}$ is assumed to possibly align only with neurons of the teacher (or, by Bayes-optimality, of another replica) with the same readout. Moreover, in the resulting sum over the neuron indices $\{i,j\mid v_{i}=v_{j}=\mathsf{v}\}$ , we assume that, for each $i$ , a single index $j=\pi_{i}$ , with $\pi$ a permutation, contributes at leading order. The model is symmetric under permutations of hidden neurons. We thus take $\pi$ to be the identity without loss of generality.
We now assume that for Hadamard powers $\ell≥ 3$ , the off-diagonal of the overlap $({\bm{\Omega}}^{ab})^{\circ\ell}$ , obtained from typical weight matrices sampled from the posterior, is sufficiently small to consider it diagonal in any quadratic form. Moreover, by exchangeability among neurons with the same readout value, we further assume that all diagonal elements $\{\Omega_{ii}^{ab}\mid i∈\mathcal{I}_{\mathsf{v}}\}$ concentrate onto the constant $\mathcal{Q}_{W}^{ab}(\mathsf{v})$ , where $\mathcal{I}_{\mathsf{v}}:=\{i≤ k\mid v_{i}=\mathsf{v}\}$ :
$$
\textstyle(\Omega_{ij}^{ab})^{\ell}=(\frac{1}{d}{\mathbf{W}}_{i}^{a\intercal}{\mathbf{W}}^{b}_{j})^{\ell}\approx\delta_{ij}\mathcal{Q}_{W}^{ab}(\mathsf{v})^{\ell} \tag{10}
$$
if $\ell≥ 3$ and $i\ \text{or}\ j∈\mathcal{I}_{\mathsf{v}}$ . Approximate equality here is up to a matrix with $o_{d}(1)$ norm. The same happens, e.g., for a standard Wishart matrix: its eigenvectors and those of its second Hadamard power are delocalised, while for higher Hadamard powers $\ell≥ 3$ its eigenvectors are strongly localised; this is why $Q_{2}^{ab}$ will require a separate treatment. With these simplifications we can write
$$
\textstyle Q_{\ell}^{ab}=\mathbb{E}_{v\sim P_{v}}[v^{2}\mathcal{Q}_{W}^{ab}(v)^{\ell}]+o_{d}(1)\ \text{for}\ \ell\geq 3. \tag{11}
$$
This is verified numerically a posteriori as follows. Identity (11) is true (without the $o_{d}(1)$ ) for the predicted theoretical values of the order parameters, by construction of our theory. Fig. 3 shows the good agreement between the theoretical and experimental overlap profiles $\mathcal{Q}^{01}_{W}(\mathsf{v})$ for all $\mathsf{v}∈\mathsf{V}$ (which is statistically the same as $\mathcal{Q}^{ab}_{W}(\mathsf{v})$ for any $a≠ b$ by the so-called Nishimori identity following from Bayes-optimality, see App. B), while Fig. 4 verifies the agreement at the level of $(Q_{\ell}^{ab})$ . Consequently, (11) also holds for the experimental overlaps.
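The mechanism underlying (10), namely that off-diagonal overlap entries matter at $\ell=2$ but are negligible for $\ell\geq 3$, can be sketched numerically on a self-overlap matrix (illustrative sizes, $P_{v}=\delta_{1}$ for simplicity; not the paper's experiment):

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 400, 200                      # gamma = k/d = 0.5
W = rng.standard_normal((k, d))
Omega = W @ W.T / d                  # self-overlap: diagonal ~ 1, off-diagonal ~ O(1/sqrt(d))
v = np.ones(k)                       # P_v = delta_1

def offdiag_Q(ell):
    """Off-diagonal contribution to Q_ell = v^T Omega^{(Hadamard ell)} v / k."""
    P = Omega ** ell
    return (v @ P @ v - v ** 2 @ np.diag(P)) / k

q2_off, q3_off, q4_off = offdiag_Q(2), offdiag_Q(3), offdiag_Q(4)
```

The off-diagonal part of $Q_{2}$ concentrates on $\gamma=k/d=0.5$ and cannot be dropped, while for $\ell=3,4$ it is already smaller than $10^{-2}$ at these sizes: the Hadamard powers $\ell\geq 3$ are effectively diagonal, as assumed in (10), and $Q_{2}$ indeed requires a separate treatment.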
It is convenient to define the symmetric tensors ${\mathbf{S}}_{\ell}^{a}$ with entries
$$
\textstyle S^{a}_{\ell;\alpha_{1}\ldots\alpha_{\ell}}:=\frac{1}{\sqrt{k}}\sum_{i\leq k}v_{i}W^{a}_{i\alpha_{1}}\cdots W^{a}_{i\alpha_{\ell}}. \tag{12}
$$
Indeed, the generic $\ell$ -th term of the series (8) can be written as the overlap $Q^{ab}_{\ell}=\langle{\mathbf{S}}^{a}_{\ell},{\mathbf{S}}^{b}_{\ell}\rangle/d^{\ell}$ of these tensors (where $\langle\,,\,\rangle$ is the inner product), e.g., $Q_{2}^{ab}={\rm Tr}\,{\mathbf{S}}_{2}^{a}{\mathbf{S}}_{2}^{b}/d^{2}$ . Given that the number of data $n=\Theta(d^{2})$ while the $({\mathbf{S}}_{1}^{a})$ are only $d$ -dimensional, they are reconstructed perfectly (the same argument was used to argue that the readouts ${\mathbf{v}}$ can be quenched). We thus assume right away that at equilibrium the overlaps $Q_{1}^{ab}=1$ (or saturate to their maximum value; if tracked, the corresponding saddle point equations turn out to be trivial and indeed fix this). In other words, in the quadratic data regime, the $\mu_{1}$ contribution in the Hermite decomposition of $\sigma$ for the target is perfectly learnable, while higher-order ones play a non-trivial role. In contrast, Cui et al. (2023) study the regime $n=\Theta(d)$ where $\mu_{1}$ is the only learnable term.
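The identification $Q_{\ell}^{ab}=\langle{\mathbf{S}}^{a}_{\ell},{\mathbf{S}}^{b}_{\ell}\rangle/d^{\ell}$ is purely algebraic; for $\ell=2$ it can be confirmed in a few lines (a minimal sketch with small, arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 50, 25
Wa, Wb = rng.standard_normal((k, d)), rng.standard_normal((k, d))
v = rng.standard_normal(k)
Omega = Wa @ Wb.T / d                      # overlap matrix between the two replicas

# S_2^a = W^{aT} diag(v) W^a / sqrt(k), the order-2 tensor of (12)
S2a = Wa.T @ np.diag(v) @ Wa / np.sqrt(k)
S2b = Wb.T @ np.diag(v) @ Wb / np.sqrt(k)

Q2_from_S = np.trace(S2a @ S2b) / d ** 2                      # <S_2^a, S_2^b>/d^2
Q2_direct = (v[:, None] * v[None, :] * Omega ** 2).sum() / k  # (1/k) sum_ij v_i v_j Omega_ij^2
```

Both expressions coincide to machine precision, since $\mathrm{Tr}\,{\mathbf{S}}_{2}^{a}{\mathbf{S}}_{2}^{b}=\frac{1}{k}\sum_{ij}v_{i}v_{j}({\mathbf{W}}_{i}^{a}\cdot{\mathbf{W}}_{j}^{b})^{2}$.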
Then, the average replicated partition function reads $\mathbb{E}\mathcal{Z}^{s}=\int d{\mathbf{Q}}_{2}d\bm{\mathcal{Q}}_{W}\exp(F_{S}\!+\!nF_{E})$ where $F_{E},F_{S}$ depend on ${\mathbf{Q}}_{2}=(Q_{2}^{ab})$ and $\bm{\mathcal{Q}}_{W}:=\{\mathcal{Q}_{W}^{ab}\mid a≤ b\}$ , with $\mathcal{Q}_{W}^{ab}:=\{\mathcal{Q}_{W}^{ab}(\mathsf{v})\mid\mathsf{v}∈\mathsf{V}\}$ .
The “energetic potential” is defined as
$$
\textstyle e^{nF_{E}}:=\big(\int dyd{\bm{\lambda}}\frac{\exp(-\frac{1}{2}{\bm{\lambda}}^{\intercal}{\mathbf{K}}^{-1}{\bm{\lambda}})}{((2\pi)^{s+1}\det{\mathbf{K}})^{1/2}}\prod_{a}^{0,s}P_{\rm out}(y\mid\lambda^{a})\big)^{n}. \tag{13}
$$
It takes this form due to our Gaussian assumption on the replicated post-activations and is thus easily computed, see App. D.1.
The “entropic potential” $F_{S}$ , which accounts for the degeneracy of the order parameters, is obtained by averaging delta functions fixing their definitions over the “microscopic degrees of freedom” $({\mathbf{W}}^{a})$ . It can be written compactly using the following conditional law over the tensors $({\mathbf{S}}_{2}^{a})$ :
$$
\textstyle P(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W}):=V_{W}^{kd}(\bm{\mathcal{Q}}_{W})^{-1}\int\prod_{a}^{0,s}dP_{W}({\mathbf{W}}^{a})\prod_{a\leq b}^{0,s}\prod_{\mathsf{v}\in\mathsf{V}}\prod_{i\in\mathcal{I}_{\mathsf{v}}}\delta(d\,\mathcal{Q}_{W}^{ab}(\mathsf{v})-{\mathbf{W}}^{a\intercal}_{i}{\mathbf{W}}_{i}^{b})\prod_{a}^{0,s}\delta({\mathbf{S}}^{a}_{2}-{\mathbf{W}}^{a\intercal}({\mathbf{v}}){\mathbf{W}}^{a}/\sqrt{k}), \tag{14}
$$
with the normalisation
$$
\textstyle V_{W}^{kd}:=\int\prod_{a}dP_{W}({\mathbf{W}}^{a})\prod_{a\leq b,\mathsf{v},i\in\mathcal{I}_{\mathsf{v}}}\delta(d\,\mathcal{Q}_{W}^{ab}(\mathsf{v})-{\mathbf{W}}^{a\intercal}_{i}{\mathbf{W}}_{i}^{b}).
$$
The entropy, which is the challenging term to compute, then reads
$$
\textstyle e^{F_{S}}:=V_{W}^{kd}(\bm{\mathcal{Q}}_{W})\int dP(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W})\prod_{a\leq b}^{0,s}\delta(d^{2}Q_{2}^{ab}-{\rm Tr}\,{\mathbf{S}}_{2}^{a}{\mathbf{S}}_{2}^{b}).
$$
4.3 Tackling the entropy: measure simplification by moment matching
The delta functions above fixing $Q_{2}^{ab}$ induce quartic constraints between the weight degrees of freedom $(W_{i\alpha}^{a})$ , instead of quadratic ones as in standard settings. A direct computation thus seems out of reach. However, we will exploit the fact that the constraints are quadratic in the matrices $({\mathbf{S}}_{2}^{a})$ . Consequently, shifting our focus towards $({\mathbf{S}}_{2}^{a})$ as the basic degrees of freedom to integrate, rather than $(W_{i\alpha}^{a})$ , will allow us to move forward by simplifying their measure (14). Note that while the $(W_{i\alpha}^{a})$ are i.i.d. under their prior measure, the $({\mathbf{S}}_{2}^{a})$ have coupled entries, even for a fixed replica index $a$ . This can be taken into account as follows.
Define $P_{S}$ as the probability density of a generalised Wishart random matrix, i.e., of $\tilde{\mathbf{W}}^{\intercal}({\mathbf{v}})\tilde{\mathbf{W}}/\sqrt{k}$ where $\tilde{\mathbf{W}}\in\mathbb{R}^{k\times d}$ is made of i.i.d. standard Gaussian entries. The simplification we consider consists in replacing (14) by the effective measure
$$
\textstyle\tilde{P}(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W}):=\frac{1}{\tilde{V}_{W}^{kd}}\prod_{a}^{0,s}P_{S}({\mathbf{S}}_{2}^{a})\prod_{a<b}^{0,s}e^{\frac{1}{2}\tau(\mathcal{Q}_{W}^{ab}){\rm Tr}\,{\mathbf{S}}^{a}_{2}{\mathbf{S}}^{b}_{2}} \tag{15}
$$
where $\tilde{V}_{W}^{kd}=\tilde{V}_{W}^{kd}(\bm{\mathcal{Q}}_{W})$ is the proper normalisation constant, and
$$
\textstyle\tau(\mathcal{Q}_{W}^{ab}):=\text{mmse}_{S}^{-1}(1-\mathbb{E}_{v\sim P_{v}}[v^{2}\mathcal{Q}^{ab}_{W}(v)^{2}]). \tag{16}
$$
The rationale behind this choice goes as follows. The matrices $({\mathbf{S}}_{2}^{a})$ are, under the measure (14), $(i)$ generalised Wishart matrices, constructed from $(ii)$ non-Gaussian factors $({\mathbf{W}}^{a})$ , which $(iii)$ are coupled between different replicas, thus inducing a coupling among the replicas $({\mathbf{S}}^{a})$ . The proposed simplified measure captures all three aspects while remaining tractable, as we now explain. The first assumption is that in the measure (14) the details of the (centred, unit variance) prior $P_{W}$ enter only through $\bm{\mathcal{Q}}_{W}$ at leading order. Due to the conditioning, we can thus relax it to a Gaussian (with the same first two moments) by universality, as is often the case in random matrix theory. $P_{W}$ will instead explicitly enter the entropy of $\bm{\mathcal{Q}}_{W}$ related to $V_{W}^{kd}$ . Point $(ii)$ is thus taken care of by the conditioning. Then, the generalised Wishart prior $P_{S}$ encodes $(i)$ and, finally, the exponential tilt in $\tilde{P}$ induces the replica couplings of point $(iii)$ . It remains to capture the dependence of measure (14) on $\bm{\mathcal{Q}}_{W}$ . This is done by realising that
$$
\textstyle\int dP(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W})\frac{1}{d^{2}}{\rm Tr}\,{\mathbf{S}}_{2}^{a}{\mathbf{S}}_{2}^{b}=\mathbb{E}_{v\sim P_{v}}[v^{2}\mathcal{Q}_{W}^{ab}(v)^{2}]+\gamma\bar{v}^{2}.
$$
This is shown in App. D.2. The Lagrange multiplier $\tau(\mathcal{Q}_{W}^{ab})$ to plug in $\tilde{P}$ , enforcing this moment matching condition between the true and simplified measures as $s→ 0^{+}$ , is (16), see App. D.3. For completeness, we provide in App. E alternatives to the simplification (15), whose analyses are left for future work.
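For the self-overlap case ($a=b$, where $\mathcal{Q}_{W}^{aa}(\mathsf{v})=1$) the moment-matching identity above can be verified by direct sampling of generalised Wishart matrices (an illustrative sketch with hypothetical sizes; $\bar{v}$ denotes the mean readout):

```python
import numpy as np

rng = np.random.default_rng(5)
d, k = 200, 100
gamma = k / d
v = rng.standard_normal(k)                 # readouts, fixed across samples
Ev2, vbar = (v ** 2).mean(), v.mean()

# average of Tr(S_2^2)/d^2 over generalised Wishart samples S_2 = W^T diag(v) W / sqrt(k)
vals = []
for _ in range(200):
    W = rng.standard_normal((k, d))
    S = W.T @ (v[:, None] * W) / np.sqrt(k)
    vals.append(np.trace(S @ S) / d ** 2)
first_moment = np.mean(vals)

prediction = Ev2 + gamma * vbar ** 2       # E_v[v^2 Q^2] + gamma vbar^2 with Q = 1
```

At these sizes the two values agree up to $O(1/d)$ finite-size corrections, consistent with the moment-matching condition the Lagrange multiplier $\tau$ enforces.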
4.4 Final steps and spherical integration
Combining all our findings, the average replicated partition function is simplified as
$$
\textstyle\mathbb{E}\mathcal{Z}^{s}=\int d{\mathbf{Q}}_{2}d\bm{\mathcal{Q}}_{W}e^{nF_{E}+kd\ln V_{W}(\bm{\mathcal{Q}}_{W})-kd\ln\tilde{V}_{W}(\bm{\mathcal{Q}}_{W})}.
$$
The equality should be interpreted as holding at leading exponential order $\exp(\Theta(n))$ , assuming the validity of our previous measure simplification. All remaining steps but the last are standard:
$(i)$ Express the delta functions fixing $\bm{\mathcal{Q}}_{W}$ and ${\mathbf{Q}}_{2}$ in exponential form using their Fourier representation; this introduces additional Fourier conjugate order parameters $\hat{\mathbf{Q}}_{2},\hat{\bm{\mathcal{Q}}}_{W}$ of same dimensions.
$(ii)$ Once this is done, the terms coupling different replicas of $({\mathbf{W}}^{a})$ or of $({\mathbf{S}}^{a})$ are all quadratic. Using the Hubbard–Stratonovich transformation (i.e., $\mathbb{E}_{{\mathbf{Z}}}\exp(\frac{d}{2}{\rm Tr}\,{\mathbf{M}}{\mathbf{Z}})=\exp(\frac{d}{4}{\rm Tr}\,{\mathbf{M}}^{2})$ for a $d\times d$ symmetric matrix ${\mathbf{M}}$ with ${\mathbf{Z}}$ a standard GOE matrix) therefore allows us to linearise all replica-replica coupling terms, at the price of introducing new Gaussian fields interacting with all replicas.
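The Gaussian identity used in step $(ii)$ can be checked by direct sampling (a small-$d$ Monte Carlo sketch with an arbitrary test matrix; the GOE convention assumed here is $Z_{ij}\sim\mathcal{N}(0,1/d)$ off-diagonal and $Z_{ii}\sim\mathcal{N}(0,2/d)$):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 4
B = rng.standard_normal((d, d))
M = 0.1 * (B + B.T)                    # small symmetric test matrix (arbitrary)

# GOE samples: Z = (G + G^T)/sqrt(2d) gives Z_ij ~ N(0,1/d), Z_ii ~ N(0,2/d)
n = 200_000
G = rng.standard_normal((n, d, d))
Z = (G + np.swapaxes(G, 1, 2)) / np.sqrt(2 * d)

lhs = np.exp(0.5 * d * np.einsum('ij,nji->n', M, Z)).mean()  # E_Z exp((d/2) Tr MZ)
rhs = np.exp(0.25 * d * np.trace(M @ M))                     # exp((d/4) Tr M^2)
```

The Monte Carlo average matches the closed form to well below a percent at this sample size, since $\mathrm{Tr}\,{\mathbf{M}}{\mathbf{Z}}$ is a Gaussian linear combination whose moment generating function is exact.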
$(iii)$ After these manipulations, we identify at leading exponential order an effective action $\mathcal{S}$ depending on the order parameters only, which allows a saddle point integration w.r.t. them as $n→∞$ :
$$
\textstyle\lim\frac{1}{ns}\ln\mathbb{E}\mathcal{Z}^{s}\!=\!\lim\frac{1}{ns}\ln\int d{\mathbf{Q}}_{2}d\hat{\mathbf{Q}}_{2}d\bm{\mathcal{Q}}_{W}d\hat{\bm{\mathcal{Q}}}_{W}e^{n\mathcal{S}}\!=\!\frac{1}{s}{\rm extr}\,\mathcal{S}.
$$
$(iv)$ Next, the replica limit $s→ 0^{+}$ of the previously obtained expression has to be considered. To do so, we make a replica symmetric assumption, i.e., we consider that at the saddle point, all order parameters entering the action $\mathcal{S}$ , and thus $K^{ab}$ too, take a simple form of the type $R^{ab}=r\delta_{ab}+q(1-\delta_{ab})$ . Replica symmetry is rigorously known to be correct in general settings of Bayes-optimal learning and is thus justified here, see Barbier & Panchenko (2022); Barbier & Macris (2019).
$(v)$ After all these steps, the resulting expression still includes two high-dimensional integrals related to the matrices $({\mathbf{S}}_{2}^{a})$ . They can be recognised as the free entropies associated with the Bayes-optimal denoising of a generalised Wishart matrix, as described just above Result 2.1, for two different signal-to-noise ratios. The last step consists in dealing with these integrals using the HCIZ integral, whose form is tractable in this case, see Maillard et al. (2022); Pourkamali et al. (2024). These free entropies yield the last two terms $\iota(\,·\,)$ in $f_{\rm RS}^{\alpha,\gamma}$ , (6).
The complete derivation is in App. D and gives Result 2.1. From the physical meaning of the order parameters, this analysis also yields the post-activations covariance ${\mathbf{K}}$ and thus Result 2.2.
As a final remark, we emphasise a key difference between our approach and earlier works on extensive-rank systems. If, instead of taking the generalised Wishart $P_{S}$ as the base measure over the matrices $({\mathbf{S}}_{2}^{a})$ in the simplified $\tilde{P}$ with moment matching, one takes a factorised Gaussian measure, thus entirely forgetting the dependencies among ${\mathbf{S}}_{2}^{a}$ entries, this mimics the Sakata-Kabashima replica method Sakata & Kabashima (2013). Our ansatz thus captures important correlations that were previously neglected in Sakata & Kabashima (2013); Krzakala et al. (2013); Kabashima et al. (2016); Barbier et al. (2025) in the context of extensive-rank matrix inference. For completeness, we show in App. E that our ansatz indeed greatly improves the prediction compared to these earlier approaches.
5 Conclusion and perspectives
We have provided an effective, quantitatively accurate description of the optimal generalisation capability of a fully-trained two-layer neural network of extensive width with generic activation, when the sample size scales with the number of parameters. This setting has long resisted the mean-field approaches used, e.g., for committee machines Barkai et al. (1992); Engel et al. (1992); Schwarze & Hertz (1992; 1993); Mato & Parga (1992); Monasson & Zecchina (1995); Aubin et al. (2018); Baldassi et al. (2019).
A natural extension is to consider non-Bayes-optimal models, e.g., trained through empirical risk minimisation to learn a mismatched target function. The formalism we provide here can be extended to these cases by keeping track of additional order parameters. The extension to deeper architectures is also possible, in the vein of Cui et al. (2023); Pacelli et al. (2023), who analysed the over-parametrised proportional regime. Accounting for structured inputs is another direction: data with a covariance (Monasson, 1992; Loureiro et al., 2021a), mixture models (Del Giudice, P. et al., 1989; Loureiro et al., 2021b), hidden manifolds (Goldt et al., 2020), object manifolds and simplexes (Chung et al., 2018; Rotondo et al., 2020), etc.
Phase transitions in supervised learning have been known in the statistical mechanics literature at least since Györgyi (1990), when the theory was limited to linear models. It would be interesting to connect the picture drawn here with grokking, a sudden drop in generalisation error occurring during the training of neural networks close to interpolation, see Power et al. (2022); Rubin et al. (2024b).
A more systematic analysis of the computational hardness of the problem (as carried out for multi-index models in Troiani et al. (2025)) is an important step towards a full characterisation of the class of target functions that are fundamentally hard to learn.
A key novelty of our approach is to blend matrix models and spin glass techniques in a unified formalism. A limitation is the restricted class of solvable matrix models (see Kazakov (2000); Anninos & Mühlmann (2020) for a list). Indeed, as explained in App. E, possible improvements to our approach require additional, finer order parameters than those appearing in Results 2.1, 2.2 (at least for inhomogeneous readouts ${\mathbf{v}}$ ). Taking them into account yields, when computing their entropy, matrix models which to the best of our knowledge are not currently solvable. We believe that obtaining asymptotically exact formulas for the log-partition function and generalisation error in the current setting and its relatives will require a major breakthrough in the field of multi-matrix models. This is an exciting direction to pursue at the crossroads of matrix models and high-dimensional inference and learning of extensive-rank matrices.
Software and data
Experiments with ADAM/HMC were performed through standard implementations in PyTorch/TensorFlow/NumPyro; the Metropolis-Hastings and GAMP-RIE routines were coded from scratch (the latter was inspired by Maillard et al. (2024a)). GitHub repository to reproduce the results: https://github.com/Minh-Toan/extensive-width-NN
Acknowledgements
J.B., F.C., M.-T.N. and M.P. were funded by the European Union (ERC, CHORAL, project number 101039794). Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them. M.P. thanks Vittorio Erba and Pietro Rotondo for interesting discussions and suggestions.
References
- Aguirre-López et al. (2025) Aguirre-López, F., Franz, S., and Pastore, M. Random features and polynomial rules. SciPost Phys., 18:039, 2025. 10.21468/SciPostPhys.18.1.039. URL https://scipost.org/10.21468/SciPostPhys.18.1.039.
- Aiudi et al. (2025) Aiudi, R., Pacelli, R., Baglioni, P., Vezzani, A., Burioni, R., and Rotondo, P. Local kernel renormalization as a mechanism for feature learning in overparametrized convolutional neural networks. Nature Communications, 16(1):568, Jan 2025. ISSN 2041-1723. 10.1038/s41467-024-55229-3. URL https://doi.org/10.1038/s41467-024-55229-3.
- Anninos & Mühlmann (2020) Anninos, D. and Mühlmann, B. Notes on matrix models (matrix musings). Journal of Statistical Mechanics: Theory and Experiment, 2020(8):083109, aug 2020. 10.1088/1742-5468/aba499. URL https://dx.doi.org/10.1088/1742-5468/aba499.
- Arjevani et al. (2025) Arjevani, Y., Bruna, J., Kileel, J., Polak, E., and Trager, M. Geometry and optimization of shallow polynomial networks, 2025. URL https://arxiv.org/abs/2501.06074.
- Aubin et al. (2018) Aubin, B., Maillard, A., Barbier, J., Krzakala, F., Macris, N., and Zdeborová, L. The committee machine: Computational to statistical gaps in learning a two-layers neural network. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/84f0f20482cde7e5eacaf7364a643d33-Paper.pdf.
- Baglioni et al. (2024) Baglioni, P., Pacelli, R., Aiudi, R., Di Renzo, F., Vezzani, A., Burioni, R., and Rotondo, P. Predictive power of a Bayesian effective action for fully connected one hidden layer neural networks in the proportional limit. Phys. Rev. Lett., 133:027301, Jul 2024. 10.1103/PhysRevLett.133.027301. URL https://link.aps.org/doi/10.1103/PhysRevLett.133.027301.
- Baldassi et al. (2019) Baldassi, C., Malatesta, E. M., and Zecchina, R. Properties of the geometry of solutions and capacity of multilayer neural networks with rectified linear unit activations. Phys. Rev. Lett., 123:170602, Oct 2019. 10.1103/PhysRevLett.123.170602. URL https://link.aps.org/doi/10.1103/PhysRevLett.123.170602.
- Barbier (2020) Barbier, J. Overlap matrix concentration in optimal Bayesian inference. Information and Inference: A Journal of the IMA, 10(2):597–623, 05 2020. ISSN 2049-8772. 10.1093/imaiai/iaaa008. URL https://doi.org/10.1093/imaiai/iaaa008.
- Barbier & Macris (2019) Barbier, J. and Macris, N. The adaptive interpolation method: a simple scheme to prove replica formulas in Bayesian inference. Probability Theory and Related Fields, 174(3):1133–1185, Aug 2019. ISSN 1432-2064. 10.1007/s00440-018-0879-0. URL https://doi.org/10.1007/s00440-018-0879-0.
- Barbier & Macris (2022) Barbier, J. and Macris, N. Statistical limits of dictionary learning: Random matrix theory and the spectral replica method. Phys. Rev. E, 106:024136, Aug 2022. 10.1103/PhysRevE.106.024136. URL https://link.aps.org/doi/10.1103/PhysRevE.106.024136.
- Barbier & Panchenko (2022) Barbier, J. and Panchenko, D. Strong replica symmetry in high-dimensional optimal Bayesian inference. Communications in Mathematical Physics, 393(3):1199–1239, Aug 2022. ISSN 1432-0916. 10.1007/s00220-022-04387-w. URL https://doi.org/10.1007/s00220-022-04387-w.
- Barbier et al. (2019) Barbier, J., Krzakala, F., Macris, N., Miolane, L., and Zdeborová, L. Optimal errors and phase transitions in high-dimensional generalized linear models. Proceedings of the National Academy of Sciences, 116(12):5451–5460, 2019. 10.1073/pnas.1802705116. URL https://www.pnas.org/doi/abs/10.1073/pnas.1802705116.
- Barbier et al. (2025) Barbier, J., Camilli, F., Ko, J., and Okajima, K. Phase diagram of extensive-rank symmetric matrix denoising beyond rotational invariance. Physical Review X, 2025.
- Barkai et al. (1992) Barkai, E., Hansel, D., and Sompolinsky, H. Broken symmetries in multilayered perceptrons. Phys. Rev. A, 45:4146–4161, Mar 1992. 10.1103/PhysRevA.45.4146. URL https://link.aps.org/doi/10.1103/PhysRevA.45.4146.
- Bartlett et al. (2021) Bartlett, P. L., Montanari, A., and Rakhlin, A. Deep learning: a statistical viewpoint. Acta Numerica, 30:87–201, 2021. 10.1017/S0962492921000027. URL https://doi.org/10.1017/S0962492921000027.
- Bassetti et al. (2024) Bassetti, F., Gherardi, M., Ingrosso, A., Pastore, M., and Rotondo, P. Feature learning in finite-width Bayesian deep linear networks with multiple outputs and convolutional layers, 2024. URL https://arxiv.org/abs/2406.03260.
- Bordelon et al. (2020) Bordelon, B., Canatar, A., and Pehlevan, C. Spectrum dependent learning curves in kernel regression and wide neural networks. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 1024–1034. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/bordelon20a.html.
- Brézin et al. (2016) Brézin, E., Hikami, S., et al. Random matrix theory with an external source. Springer, 2016.
- Camilli et al. (2023) Camilli, F., Tieplova, D., and Barbier, J. Fundamental limits of overparametrized shallow neural networks for supervised learning, 2023. URL https://arxiv.org/abs/2307.05635.
- Camilli et al. (2025) Camilli, F., Tieplova, D., Bergamin, E., and Barbier, J. Information-theoretic reduction of deep neural networks to linear models in the overparametrized proportional regime. The 38th Annual Conference on Learning Theory (to appear), 2025.
- Canatar et al. (2021) Canatar, A., Bordelon, B., and Pehlevan, C. Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks. Nature Communications, 12(1):2914, 05 2021. ISSN 2041-1723. 10.1038/s41467-021-23103-1. URL https://doi.org/10.1038/s41467-021-23103-1.
- Chung et al. (2018) Chung, S., Lee, D. D., and Sompolinsky, H. Classification and geometry of general perceptual manifolds. Phys. Rev. X, 8:031003, Jul 2018. 10.1103/PhysRevX.8.031003. URL https://link.aps.org/doi/10.1103/PhysRevX.8.031003.
- Cui et al. (2023) Cui, H., Krzakala, F., and Zdeborova, L. Bayes-optimal learning of deep random networks of extensive-width. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 6468–6521. PMLR, 07 2023. URL https://proceedings.mlr.press/v202/cui23b.html.
- Del Giudice, P. et al. (1989) Del Giudice, P., Franz, S., and Virasoro, M. A. Perceptron beyond the limit of capacity. J. Phys. France, 50(2):121–134, 1989. 10.1051/jphys:01989005002012100. URL https://doi.org/10.1051/jphys:01989005002012100.
- Dietrich et al. (1999) Dietrich, R., Opper, M., and Sompolinsky, H. Statistical mechanics of support vector networks. Phys. Rev. Lett., 82:2975–2978, 04 1999. 10.1103/PhysRevLett.82.2975. URL https://link.aps.org/doi/10.1103/PhysRevLett.82.2975.
- Du & Lee (2018) Du, S. and Lee, J. On the power of over-parametrization in neural networks with quadratic activation. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1329–1338. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/du18a.html.
- Engel & Van den Broeck (2001) Engel, A. and Van den Broeck, C. Statistical mechanics of learning. Cambridge University Press, 2001. ISBN 9780521773072.
- Engel et al. (1992) Engel, A., Köhler, H. M., Tschepke, F., Vollmayr, H., and Zippelius, A. Storage capacity and learning algorithms for two-layer neural networks. Phys. Rev. A, 45:7590–7609, May 1992. 10.1103/PhysRevA.45.7590. URL https://link.aps.org/doi/10.1103/PhysRevA.45.7590.
- Gamarnik et al. (2024) Gamarnik, D., Kızıldağ, E. C., and Zadik, I. Stationary points of a shallow neural network with quadratic activations and the global optimality of the gradient descent algorithm. Mathematics of Operations Research, 50(1):209–251, 2024. 10.1287/moor.2021.0082. URL https://doi.org/10.1287/moor.2021.0082.
- Gardner & Derrida (1989) Gardner, E. and Derrida, B. Three unfinished works on the optimal storage capacity of networks. Journal of Physics A: Mathematical and General, 22(12):1983, jun 1989. 10.1088/0305-4470/22/12/004. URL https://dx.doi.org/10.1088/0305-4470/22/12/004.
- Gerace et al. (2021) Gerace, F., Loureiro, B., Krzakala, F., Mézard, M., and Zdeborová, L. Generalisation error in learning with random features and the hidden manifold model. Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124013, Dec 2021. ISSN 1742-5468. 10.1088/1742-5468/ac3ae6. URL http://dx.doi.org/10.1088/1742-5468/ac3ae6.
- Ghorbani et al. (2021) Ghorbani, B., Mei, S., Misiakiewicz, T., and Montanari, A. Linearized two-layers neural networks in high dimension. The Annals of Statistics, 49(2):1029 – 1054, 2021. 10.1214/20-AOS1990. URL https://doi.org/10.1214/20-AOS1990.
- Goldt et al. (2020) Goldt, S., Mézard, M., Krzakala, F., and Zdeborová, L. Modeling the influence of data structure on learning in neural networks: The hidden manifold model. Phys. Rev. X, 10:041044, Dec 2020. 10.1103/PhysRevX.10.041044. URL https://link.aps.org/doi/10.1103/PhysRevX.10.041044.
- Goldt et al. (2022) Goldt, S., Loureiro, B., Reeves, G., Krzakala, F., Mezard, M., and Zdeborová, L. The Gaussian equivalence of generative models for learning with shallow neural networks. In Bruna, J., Hesthaven, J., and Zdeborová, L. (eds.), Proceedings of the 2nd Mathematical and Scientific Machine Learning Conference, volume 145 of Proceedings of Machine Learning Research, pp. 426–471. PMLR, 08 2022. URL https://proceedings.mlr.press/v145/goldt22a.html.
- Guionnet & Zeitouni (2002) Guionnet, A. and Zeitouni, O. Large deviations asymptotics for spherical integrals. Journal of Functional Analysis, 188(2):461–515, 2002. ISSN 0022-1236. 10.1006/jfan.2001.3833. URL https://www.sciencedirect.com/science/article/pii/S0022123601938339.
- Guo et al. (2005) Guo, D., Shamai, S., and Verdú, S. Mutual information and minimum mean-square error in gaussian channels. IEEE Transactions on Information Theory, 51(4):1261–1282, 2005. 10.1109/TIT.2005.844072. URL https://doi.org/10.1109/TIT.2005.844072.
- Györgyi (1990) Györgyi, G. First-order transition to perfect generalization in a neural network with binary synapses. Phys. Rev. A, 41:7097–7100, Jun 1990. 10.1103/PhysRevA.41.7097. URL https://link.aps.org/doi/10.1103/PhysRevA.41.7097.
- Hanin (2023) Hanin, B. Random neural networks in the infinite width limit as Gaussian processes. The Annals of Applied Probability, 33(6A):4798 – 4819, 2023. 10.1214/23-AAP1933. URL https://doi.org/10.1214/23-AAP1933.
- Hastie et al. (2022) Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. Surprises in high-dimensional ridgeless least squares interpolation. The Annals of Statistics, 50(2):949 – 986, 2022. 10.1214/21-AOS2133. URL https://doi.org/10.1214/21-AOS2133.
- Hu & Lu (2023) Hu, H. and Lu, Y. M. Universality laws for high-dimensional learning with random features. IEEE Transactions on Information Theory, 69(3):1932–1964, 2023. 10.1109/TIT.2022.3217698. URL https://doi.org/10.1109/TIT.2022.3217698.
- Hu et al. (2024) Hu, H., Lu, Y. M., and Misiakiewicz, T. Asymptotics of random feature regression beyond the linear scaling regime, 2024. URL https://arxiv.org/abs/2403.08160.
- Itzykson & Zuber (1980) Itzykson, C. and Zuber, J. The planar approximation. II. Journal of Mathematical Physics, 21(3):411–421, 03 1980. ISSN 0022-2488. 10.1063/1.524438. URL https://doi.org/10.1063/1.524438.
- Kabashima et al. (2016) Kabashima, Y., Krzakala, F., Mézard, M., Sakata, A., and Zdeborová, L. Phase transitions and sample complexity in Bayes-optimal matrix factorization. IEEE Transactions on Information Theory, 62(7):4228–4265, 2016. 10.1109/TIT.2016.2556702. URL https://doi.org/10.1109/TIT.2016.2556702.
- Kazakov (2000) Kazakov, V. A. Solvable matrix models, 2000. URL https://arxiv.org/abs/hep-th/0003064.
- Kingma & Ba (2017) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization, 2017. URL https://arxiv.org/abs/1412.6980.
- Krzakala et al. (2013) Krzakala, F., Mézard, M., and Zdeborová, L. Phase diagram and approximate message passing for blind calibration and dictionary learning. In 2013 IEEE International Symposium on Information Theory, pp. 659–663, 2013. 10.1109/ISIT.2013.6620308. URL https://doi.org/10.1109/ISIT.2013.6620308.
- Lee et al. (2018) Lee, J., Sohl-dickstein, J., Pennington, J., Novak, R., Schoenholz, S., and Bahri, Y. Deep neural networks as Gaussian processes. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1EA-M-0Z.
- Li & Sompolinsky (2021) Li, Q. and Sompolinsky, H. Statistical mechanics of deep linear neural networks: The backpropagating kernel renormalization. Phys. Rev. X, 11:031059, Sep 2021. 10.1103/PhysRevX.11.031059. URL https://link.aps.org/doi/10.1103/PhysRevX.11.031059.
- Loureiro et al. (2021a) Loureiro, B., Gerbelot, C., Cui, H., Goldt, S., Krzakala, F., Mezard, M., and Zdeborová, L. Learning curves of generic features maps for realistic datasets with a teacher-student model. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 18137–18151. Curran Associates, Inc., 2021a. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/9704a4fc48ae88598dcbdcdf57f3fdef-Paper.pdf.
- Loureiro et al. (2021b) Loureiro, B., Sicuro, G., Gerbelot, C., Pacco, A., Krzakala, F., and Zdeborová, L. Learning Gaussian mixtures with generalized linear models: Precise asymptotics in high-dimensions. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 10144–10157. Curran Associates, Inc., 2021b. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/543e83748234f7cbab21aa0ade66565f-Paper.pdf.
- Maillard et al. (2022) Maillard, A., Krzakala, F., Mézard, M., and Zdeborová, L. Perturbative construction of mean-field equations in extensive-rank matrix factorization and denoising. Journal of Statistical Mechanics: Theory and Experiment, 2022(8):083301, Aug 2022. 10.1088/1742-5468/ac7e4c. URL https://dx.doi.org/10.1088/1742-5468/ac7e4c.
- Maillard et al. (2024a) Maillard, A., Troiani, E., Martin, S., Krzakala, F., and Zdeborová, L. Bayes-optimal learning of an extensive-width neural network from quadratically many samples, 2024a. URL https://arxiv.org/abs/2408.03733.
- Maillard et al. (2024b) Maillard, A., Troiani, E., Martin, S., Krzakala, F., and Zdeborová, L. Github repository ExtensiveWidthQuadraticSamples. https://github.com/SPOC-group/ExtensiveWidthQuadraticSamples, 2024b.
- Martin et al. (2024) Martin, S., Bach, F., and Biroli, G. On the impact of overparameterization on the training of a shallow neural network in high dimensions. In Dasgupta, S., Mandt, S., and Li, Y. (eds.), Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, volume 238 of Proceedings of Machine Learning Research, pp. 3655–3663. PMLR, 02–04 May 2024. URL https://proceedings.mlr.press/v238/martin24a.html.
- Mato & Parga (1992) Mato, G. and Parga, N. Generalization properties of multilayered neural networks. Journal of Physics A: Mathematical and General, 25(19):5047, Oct 1992. 10.1088/0305-4470/25/19/017. URL https://dx.doi.org/10.1088/0305-4470/25/19/017.
- Matthews et al. (2018) Matthews, A. G. D. G., Hron, J., Rowland, M., Turner, R. E., and Ghahramani, Z. Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1-nGgWC-.
- Matytsin (1994) Matytsin, A. On the large- $N$ limit of the Itzykson-Zuber integral. Nuclear Physics B, 411(2):805–820, 1994. ISSN 0550-3213. 10.1016/0550-3213(94)90471-5. URL https://www.sciencedirect.com/science/article/pii/0550321394904715.
- Mei & Montanari (2022) Mei, S. and Montanari, A. The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75(4):667–766, 2022. 10.1002/cpa.22008. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/cpa.22008.
- Mezard et al. (1986) Mezard, M., Parisi, G., and Virasoro, M. Spin Glass Theory and Beyond. World Scientific, 1986. 10.1142/0271. URL https://www.worldscientific.com/doi/abs/10.1142/0271.
- Monasson (1992) Monasson, R. Properties of neural networks storing spatially correlated patterns. Journal of Physics A: Mathematical and General, 25(13):3701, Jul 1992. 10.1088/0305-4470/25/13/019. URL https://dx.doi.org/10.1088/0305-4470/25/13/019.
- Monasson & Zecchina (1995) Monasson, R. and Zecchina, R. Weight space structure and internal representations: A direct approach to learning and generalization in multilayer neural networks. Phys. Rev. Lett., 75:2432–2435, Sep 1995. 10.1103/PhysRevLett.75.2432. URL https://link.aps.org/doi/10.1103/PhysRevLett.75.2432.
- Naveh & Ringel (2021) Naveh, G. and Ringel, Z. A self consistent theory of Gaussian processes captures feature learning effects in finite CNNs. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 21352–21364. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/b24d21019de5e59da180f1661904f49a-Paper.pdf.
- Neal (1996) Neal, R. M. Priors for Infinite Networks, pp. 29–53. Springer New York, New York, NY, 1996. ISBN 978-1-4612-0745-0. 10.1007/978-1-4612-0745-0_2. URL https://doi.org/10.1007/978-1-4612-0745-0_2.
- Nishimori (2001) Nishimori, H. Statistical Physics of Spin Glasses and Information Processing: An Introduction. Oxford University Press, 07 2001. ISBN 9780198509417. 10.1093/acprof:oso/9780198509417.001.0001.
- Nourdin et al. (2011) Nourdin, I., Peccati, G., and Podolskij, M. Quantitative Breuer–Major theorems. Stochastic Processes and their Applications, 121(4):793–812, 2011. ISSN 0304-4149. 10.1016/j.spa.2010.12.006. URL https://www.sciencedirect.com/science/article/pii/S0304414910002917.
- Pacelli et al. (2023) Pacelli, R., Ariosto, S., Pastore, M., Ginelli, F., Gherardi, M., and Rotondo, P. A statistical mechanics framework for Bayesian deep neural networks beyond the infinite-width limit. Nature Machine Intelligence, 5(12):1497–1507, 12 2023. ISSN 2522-5839. 10.1038/s42256-023-00767-6. URL https://doi.org/10.1038/s42256-023-00767-6.
- Parker et al. (2014) Parker, J. T., Schniter, P., and Cevher, V. Bilinear generalized approximate message passing—Part I: Derivation. IEEE Transactions on Signal Processing, 62(22):5839–5853, 2014. 10.1109/TSP.2014.2357776. URL https://doi.org/10.1109/TSP.2014.2357776.
- Potters & Bouchaud (2020) Potters, M. and Bouchaud, J.-P. A first course in random matrix theory: for physicists, engineers and data scientists. Cambridge University Press, 2020.
- Pourkamali et al. (2024) Pourkamali, F., Barbier, J., and Macris, N. Matrix inference in growing rank regimes. IEEE Transactions on Information Theory, 70(11):8133–8163, 2024. 10.1109/TIT.2024.3422263. URL https://doi.org/10.1109/TIT.2024.3422263.
- Power et al. (2022) Power, A., Burda, Y., Edwards, H., Babuschkin, I., and Misra, V. Grokking: Generalization beyond overfitting on small algorithmic datasets, 2022. URL https://arxiv.org/abs/2201.02177.
- Rotondo et al. (2020) Rotondo, P., Pastore, M., and Gherardi, M. Beyond the storage capacity: Data-driven satisfiability transition. Phys. Rev. Lett., 125:120601, Sep 2020. 10.1103/PhysRevLett.125.120601. URL https://link.aps.org/doi/10.1103/PhysRevLett.125.120601.
- Rubin et al. (2024a) Rubin, N., Ringel, Z., Seroussi, I., and Helias, M. A unified approach to feature learning in Bayesian neural networks. In High-dimensional Learning Dynamics 2024: The Emergence of Structure and Reasoning, 2024a. URL https://openreview.net/forum?id=ZmOSJ2MV2R.
- Rubin et al. (2024b) Rubin, N., Seroussi, I., and Ringel, Z. Grokking as a first order phase transition in two layer networks. In The Twelfth International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=3ROGsTX3IR.
- Sakata & Kabashima (2013) Sakata, A. and Kabashima, Y. Statistical mechanics of dictionary learning. Europhysics Letters, 103(2):28008, Aug 2013. 10.1209/0295-5075/103/28008. URL https://dx.doi.org/10.1209/0295-5075/103/28008.
- Sarao Mannelli et al. (2020) Sarao Mannelli, S., Vanden-Eijnden, E., and Zdeborová, L. Optimization and generalization of shallow neural networks with quadratic activation functions. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 13445–13455. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/9b8b50fb590c590ffbf1295ce92258dc-Paper.pdf.
- Schmidt (2018) Schmidt, H. C. Statistical physics of sparse and dense models in optimization and inference. PhD thesis, 2018. URL http://www.theses.fr/2018SACLS366.
- Schwarze & Hertz (1992) Schwarze, H. and Hertz, J. Generalization in a large committee machine. Europhysics Letters, 20(4):375, Oct 1992. 10.1209/0295-5075/20/4/015. URL https://dx.doi.org/10.1209/0295-5075/20/4/015.
- Schwarze & Hertz (1993) Schwarze, H. and Hertz, J. Generalization in fully connected committee machines. Europhysics Letters, 21(7):785, Mar 1993. 10.1209/0295-5075/21/7/012. URL https://dx.doi.org/10.1209/0295-5075/21/7/012.
- Semerjian (2024) Semerjian, G. Matrix denoising: Bayes-optimal estimators via low-degree polynomials. Journal of Statistical Physics, 191(10):139, Oct 2024. ISSN 1572-9613. 10.1007/s10955-024-03359-9. URL https://doi.org/10.1007/s10955-024-03359-9.
- Seroussi et al. (2023) Seroussi, I., Naveh, G., and Ringel, Z. Separation of scales and a thermodynamic description of feature learning in some CNNs. Nature Communications, 14(1):908, Feb 2023. ISSN 2041-1723. 10.1038/s41467-023-36361-y. URL https://doi.org/10.1038/s41467-023-36361-y.
- Soltanolkotabi et al. (2019) Soltanolkotabi, M., Javanmard, A., and Lee, J. D. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory, 65(2):742–769, 2019. 10.1109/TIT.2018.2854560. URL https://doi.org/10.1109/TIT.2018.2854560.
- Troiani et al. (2025) Troiani, E., Dandi, Y., Defilippis, L., Zdeborova, L., Loureiro, B., and Krzakala, F. Fundamental computational limits of weak learnability in high-dimensional multi-index models. In The 28th International Conference on Artificial Intelligence and Statistics, 2025. URL https://openreview.net/forum?id=Mwzui5H0VN.
- van Meegen & Sompolinsky (2024) van Meegen, A. and Sompolinsky, H. Coding schemes in neural networks learning classification tasks, 2024. URL https://arxiv.org/abs/2406.16689.
- Venturi et al. (2019) Venturi, L., Bandeira, A. S., and Bruna, J. Spurious valleys in one-hidden-layer neural network optimization landscapes. Journal of Machine Learning Research, 20(133):1–34, 2019. URL http://jmlr.org/papers/v20/18-674.html.
- Williams (1996) Williams, C. Computing with infinite networks. In Mozer, M., Jordan, M., and Petsche, T. (eds.), Advances in Neural Information Processing Systems, volume 9. MIT Press, 1996. URL https://proceedings.neurips.cc/paper/1996/file/ae5e3ce40e0404a45ecacaaf05e5f735-Paper.pdf.
- Xiao et al. (2023) Xiao, L., Hu, H., Misiakiewicz, T., Lu, Y. M., and Pennington, J. Precise learning curves and higher-order scaling limits for dot-product kernel regression. Journal of Statistical Mechanics: Theory and Experiment, 2023(11):114005, Nov 2023. 10.1088/1742-5468/ad01b7. URL https://dx.doi.org/10.1088/1742-5468/ad01b7.
- Xu et al. (2025) Xu, Y., Maillard, A., Zdeborová, L., and Krzakala, F. Fundamental limits of matrix sensing: Exact asymptotics, universality, and applications, 2025. URL https://arxiv.org/abs/2503.14121.
- Yoon & Oh (1998) Yoon, H. and Oh, J.-H. Learning of higher-order perceptrons with tunable complexities. Journal of Physics A: Mathematical and General, 31(38):7771–7784, 09 1998. 10.1088/0305-4470/31/38/012. URL https://doi.org/10.1088/0305-4470/31/38/012.
- Zdeborová & Krzakala (2016) Zdeborová, L. and Krzakala, F. Statistical physics of inference: thresholds and algorithms. Advances in Physics, 65(5):453–552, 2016. 10.1080/00018732.2016.1211393. URL https://doi.org/10.1080/00018732.2016.1211393.
Appendix A Hermite basis and Mehler’s formula
Recall the Hermite expansion of the activation:
$$
\sigma(x)=\sum_{\ell=0}^{\infty}\frac{\mu_{\ell}}{\ell!}{\rm He}_{\ell}(x). \tag{17}
$$
We are expressing it on the basis of the probabilist’s Hermite polynomials, generated through
$$
{\rm He}_{\ell}(z)=\frac{d^{\ell}}{dt^{\ell}}\exp\big{(}tz-t^{2}/2\big{)}\big{|}_{t=0}. \tag{18}
$$
The Hermite basis has the property of being orthogonal with respect to the standard Gaussian measure, which is the distribution of the input data:
$$
\int Dz\,{\rm He}_{k}(z){\rm He}_{\ell}(z)=\ell!\,\delta_{k\ell}, \tag{19}
$$
where $Dz:=dz\exp(-z^{2}/2)/\sqrt{2\pi}$ . By orthogonality, the coefficients of the expansions can be obtained as
$$
\mu_{\ell}=\int Dz{\rm He}_{\ell}(z)\sigma(z). \tag{20}
$$
Moreover,
$$
\mathbb{E}[\sigma(z)^{2}]=\int Dz\,\sigma(z)^{2}=\sum_{\ell=0}^{\infty}\frac{\mu_{\ell}^{2}}{\ell!}. \tag{21}
$$
These coefficients for some popular choices of $\sigma$ are reported in Table 1 for reference.
Table 1: First Hermite coefficients of some activation functions reported in the figures. $\theta$ is the Heaviside step function.
| $\sigma(z)$ | $\mu_{0}$ | $\mu_{1}$ | $\mu_{2}$ | $\mu_{3}$ | $\mu_{4}$ | $\cdots$ | $\mathbb{E}[\sigma(z)^{2}]$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ${\rm ReLU}(z)=z\theta(z)$ | $1/\sqrt{2\pi}$ | $1/2$ | $1/\sqrt{2\pi}$ | 0 | $-1/\sqrt{2\pi}$ | $\cdots$ | $1/2$ |
| ${\rm ELU}(z)=z\theta(z)+(e^{z}-1)\theta(-z)$ | 0.16052 | 0.76158 | 0.26158 | -0.13736 | -0.13736 | $\cdots$ | 0.64494 |
| ${\rm Tanh}(2z)$ | 0 | 0.72948 | 0 | -0.61398 | 0 | $\cdots$ | 0.63526 |
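The entries of Table 1 can be verified numerically. The sketch below (the helper `hermite_coeff` is ours, not code from the paper; the grid parameters are arbitrary choices) evaluates (20) on a fine grid using NumPy's probabilist's Hermite polynomials:

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval  # probabilist's He_l

def hermite_coeff(sigma, ell, zmax=10.0, n=40001):
    """mu_ell = int Dz He_ell(z) sigma(z), eq. (20), computed on a fine
    grid (robust even for activations with a kink, like ReLU)."""
    z = np.linspace(-zmax, zmax, n)
    Dz = np.exp(-z ** 2 / 2) / np.sqrt(2 * np.pi) * (z[1] - z[0])
    He = hermeval(z, [0.0] * ell + [1.0])  # coefficient vector selects He_ell
    return float(np.sum(Dz * He * sigma(z)))

relu = lambda z: np.maximum(z, 0.0)
mus = [hermite_coeff(relu, l) for l in range(5)]
# Table 1, ReLU row: 1/sqrt(2*pi), 1/2, 1/sqrt(2*pi), 0, -1/sqrt(2*pi)
print(np.round(mus, 5))
print(round(hermite_coeff(lambda z: np.tanh(2 * z), 1), 5))  # ~0.72948
```

The Gaussian weight makes the truncation at $|z|\le 10$ harmless, so the sums reproduce the closed-form ReLU coefficients to high accuracy.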
The Hermite basis can be generalised to an orthogonal basis with respect to the Gaussian measure with generic variance:
$$
{\rm He}_{\ell}^{[r]}(z)=\frac{d^{\ell}}{dt^{\ell}}\exp\big{(}tz-t^{2}r/2\big{)}\big{|}_{t=0}, \tag{22}
$$
so that, with $D_{r}z:=dz\exp(-z^{2}/2r)/\sqrt{2\pi r}$ , we have
$$
\int D_{r}z\,{\rm He}_{k}^{[r]}(z){\rm He}_{\ell}^{[r]}(z)=\ell!\,r^{\ell}\delta_{k\ell}. \tag{23}
$$
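As a quick consistency check of (23), one can use the rescaling identity ${\rm He}_{\ell}^{[r]}(z)=r^{\ell/2}{\rm He}_{\ell}(z/\sqrt{r})$ , which follows from (22) by substituting $t\to t/\sqrt{r}$ in the generating function. The following sketch (ours; the helper names and the value of $r$ are arbitrary) computes the Gram matrix numerically:

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermeval

def He_r(z, ell, r):
    """He_ell^{[r]}(z) = r^{ell/2} He_ell(z / sqrt(r)), from (22) by t -> t/sqrt(r)."""
    return r ** (ell / 2) * hermeval(z / np.sqrt(r), [0.0] * ell + [1.0])

def gram(r, lmax=4, zmax=12.0, n=40001):
    """Gram matrix G_{kl} = int D_r z He_k^{[r]}(z) He_l^{[r]}(z) on a fine grid."""
    z = np.linspace(-zmax, zmax, n)
    Dr = np.exp(-z ** 2 / (2 * r)) / np.sqrt(2 * np.pi * r) * (z[1] - z[0])
    H = np.array([He_r(z, l, r) for l in range(lmax + 1)])
    return H @ (Dr * H).T

r = 0.8
G = gram(r)
target = np.diag([factorial(l) * r ** l for l in range(5)])
print(np.max(np.abs(G - target)))  # essentially zero: (23) holds numerically
```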
From Mehler’s formula
$$
\frac{1}{2\pi\sqrt{r^{2}-q^{2}}}\exp\!\Big{[}-\frac{1}{2}(u,v)\begin{pmatrix}r&q\\ q&r\end{pmatrix}^{-1}\begin{pmatrix}u\\ v\end{pmatrix}\Big{]}=\frac{e^{-\frac{u^{2}}{2r}}}{\sqrt{2\pi r}}\frac{e^{-\frac{v^{2}}{2r}}}{\sqrt{2\pi r}}\sum_{\ell=0}^{+\infty}\frac{q^{\ell}}{\ell!r^{2\ell}}{\rm He}_{\ell}^{[r]}(u){\rm He}_{\ell}^{[r]}(v), \tag{24}
$$
and by orthogonality of the Hermite basis, (8) readily follows by noticing that the variables $(h_{i}^{a}=({\mathbf{W}}^{a}{\mathbf{x}})_{i}/\sqrt{d})_{i,a}$ at given $({\mathbf{W}}^{a})$ are Gaussian with covariances $\Omega^{ab}_{ij}={\mathbf{W}}_{i}^{a\intercal}{\mathbf{W}}^{b}_{j}/d$ , so that
$$
\mathbb{E}[\sigma(h_{i}^{a})\sigma(h_{j}^{b})]=\sum_{\ell=0}^{\infty}\frac{(\mu_{\ell}^{[r]})^{2}}{\ell!r^{2\ell}}(\Omega_{ij}^{ab})^{\ell},\qquad\mu_{\ell}^{[r]}=\int D_{r}z\,{\rm He}^{[r]}_{\ell}(z)\sigma(z). \tag{25}
$$
Moreover, since by Bayes-optimality $r=\Omega^{aa}_{ii}$ converges for large $d$ to the variance of the prior of ${\mathbf{W}}^{0}$ , whenever $\Omega^{aa}_{ii}→ 1$ we can specialise this formula to the simpler case $r=1$ reported in the main text.
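The series (25) with $r=1$ can be sanity-checked numerically. In the sketch below (ours; the activation $\tanh$ and the overlap value $0.6$ are arbitrary choices), the two-point function of the post-activations is computed by two-dimensional Gauss-Hermite quadrature and compared to the truncated series:

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermeval, hermegauss

sigma, omega = np.tanh, 0.6        # arbitrary smooth activation and overlap

z, w = hermegauss(80)
w = w / np.sqrt(2 * np.pi)         # quadrature rule for the Gaussian measure Dz

# Left side: E[sigma(h1) sigma(h2)] with Cov(h1, h2) = [[1, omega], [omega, 1]],
# writing h2 = omega * h1 + sqrt(1 - omega^2) * xi with xi independent of h1.
H1, Xi = np.meshgrid(z, z, indexing="ij")
lhs = np.einsum("i,j,ij->", w, w,
                sigma(H1) * sigma(omega * H1 + np.sqrt(1 - omega ** 2) * Xi))

# Right side: sum_l mu_l^2 / l! * omega^l, i.e. (25) with r = 1.
mus = [np.sum(w * hermeval(z, [0.0] * l + [1.0]) * sigma(z)) for l in range(25)]
rhs = sum(m ** 2 / factorial(l) * omega ** l for l, m in enumerate(mus))
print(lhs, rhs)  # the two agree to high precision
```

The fast decay of $\mu_{\ell}^{2}\omega^{\ell}/\ell!$ makes the truncation at $\ell=24$ harmless here.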
Appendix B Nishimori identities
The Nishimori identities are a very general set of symmetries arising in Bayes-optimal inference as a consequence of Bayes' rule. To introduce them, consider a test function $f$ of the teacher weights, collectively denoted by ${\bm{\theta}}^{0}$ , of $s-1$ replicas of the student's weights $({\bm{\theta}}^{a})_{2≤ a≤ s}$ drawn conditionally i.i.d. from the posterior, and possibly of the training set $\mathcal{D}$ : $f({\bm{\theta}}^{0},{\bm{\theta}}^{2},...,{\bm{\theta}}^{s};\mathcal{D})$ . Then
$$
\displaystyle\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\langle f({\bm{\theta}}%
^{0},{\bm{\theta}}^{2},\dots,{\bm{\theta}}^{s};\mathcal{D})\rangle=\mathbb{E}_%
{{\bm{\theta}}^{0},\mathcal{D}}\langle f({\bm{\theta}}^{1},{\bm{\theta}}^{2},%
\dots,{\bm{\theta}}^{s};\mathcal{D})\rangle, \tag{26}
$$
where we have replaced the teacher’s weights with another replica from the student. The proof is elementary, see e.g. Barbier et al. (2019).
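As an illustration (a toy sketch, not the network setting of the paper), the identity can be checked numerically in a scalar model where the posterior is explicit: $\theta^{0}\sim\mathcal{N}(0,1)$ observed through $y=\theta^{0}+\sqrt{\Delta}\,\xi$, whose posterior is $\mathcal{N}(y/(1+\Delta),\Delta/(1+\Delta))$. Taking $f(a,b)=ab$, both sides of (26) equal $1/(1+\Delta)$:

```python
import numpy as np

rng = np.random.default_rng(0)
Delta, n = 0.5, 400_000
theta0 = rng.standard_normal(n)                         # teacher
y = theta0 + np.sqrt(Delta) * rng.standard_normal(n)    # data

pm, pv = y / (1 + Delta), Delta / (1 + Delta)           # posterior mean / variance
th1 = pm + np.sqrt(pv) * rng.standard_normal(n)         # posterior replica 1
th2 = pm + np.sqrt(pv) * rng.standard_normal(n)         # posterior replica 2

lhs = np.mean(theta0 * th2)   # E<f(theta^0, theta^2)> with f(a, b) = a*b
rhs = np.mean(th1 * th2)      # E<f(theta^1, theta^2)>: teacher swapped for a replica
```

Both averages concentrate on $1/(1+\Delta)=2/3$ up to Monte Carlo fluctuations.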
The Nishimori identities have some consequences also on our replica symmetric ansatz for the free entropy. In particular, they constrain the values of the asymptotic mean of some order parameters. For instance
$$
\displaystyle m_{2}=\lim\frac{1}{d^{2}}\mathbb{E}_{\mathcal{D},{\bm{\theta}}^{%
0}}\langle{\rm Tr}[{\mathbf{S}}_{2}^{a}{\mathbf{S}}_{2}^{0}]\rangle=\lim\frac{%
1}{d^{2}}\mathbb{E}_{\mathcal{D}}\langle{\rm Tr}[{\mathbf{S}}_{2}^{a}{\mathbf{%
S}}_{2}^{b}]\rangle=q_{2},\quad\text{for }a\neq b. \tag{27}
$$
Combined with the concentration of such order parameters, which can be proven in great generality in Bayes-optimal learning Barbier (2020); Barbier & Panchenko (2022), it fixes the values for some of them. For instance, we have that with high probability
$$
\displaystyle\frac{1}{d^{2}}{\rm Tr}[({\mathbf{S}}_{2}^{a})^{2}]\to r_{2}=\lim%
\frac{1}{d^{2}}\mathbb{E}_{\mathcal{D}}\langle{\rm Tr}[({\mathbf{S}}_{2}^{a})^%
{2}]\rangle=\lim\frac{1}{d^{2}}\mathbb{E}_{{\bm{\theta}}^{0}}{\rm Tr}[({%
\mathbf{S}}_{2}^{0})^{2}]=\rho_{2}=1+\gamma\bar{v}^{2}. \tag{28}
$$
When the values of some order parameters are determined by the Nishimori identities (and their concentration), as for those fixed to $r_{2}=\rho_{2}$ , then the respective Fourier conjugates $\hat{r}_{2},\hat{\rho}_{2}$ vanish (meaning that the desired constraints were already asymptotically enforced without the need of additional delta functions). This is because the configurations that make the order parameters take those values exponentially (in $n$ ) dominate the posterior measure, so these constraints are automatically imposed by the measure.
Appendix C Alternative representation for the optimal mean-square generalisation error
We recall that ${\bm{\theta}}^{0}=({\mathbf{v}}^{0},{\mathbf{W}}^{0})$ and similarly for ${\bm{\theta}}^{1}={\bm{\theta}},{\bm{\theta}}^{2},...$ which are replicas, i.e., conditionally i.i.d. samples from $dP({\mathbf{W}},{\mathbf{v}}\mid\mathcal{D})$ (the reasoning below applies whether ${\mathbf{v}}$ is learnable or quenched, so in general we can consider a joint posterior over both). In this section we report the details on how to obtain Result 2.2 and how to write the generalisation error defined in (3) in a form more convenient for numerical estimation.
From its definition, the Bayes-optimal mean-square generalisation error can be recast as
$$
\displaystyle\varepsilon^{\rm opt}=\mathbb{E}_{{\bm{\theta}}^{0},{\mathbf{x}}_%
{\rm test}}\mathbb{E}[y^{2}_{\rm test}\mid\lambda^{0}]-2\mathbb{E}_{{\bm{%
\theta}}^{0},\mathcal{D},{\mathbf{x}}_{\rm test}}\mathbb{E}[y_{\rm test}\mid%
\lambda^{0}]\langle\mathbb{E}[y\mid\lambda]\rangle+\mathbb{E}_{{\bm{\theta}}^{%
0},\mathcal{D},{\mathbf{x}}_{\rm test}}\langle\mathbb{E}[y\mid\lambda]\rangle^%
{2}, \tag{29}
$$
where $\mathbb{E}[y\mid\lambda]=\int dy\,y\,P_{\rm out}(y\mid\lambda)$ , and $\lambda^{0}$ , $\lambda$ are the random variables (random due to the test input ${\mathbf{x}}_{\rm test}$ , drawn independently of the training data $\mathcal{D}$ , and their respective weights ${\bm{\theta}}^{0},{\bm{\theta}}$ )
$$
\displaystyle\lambda^{0}=\lambda({\bm{\theta}}^{0},{\mathbf{x}}_{\rm test})=%
\frac{{\mathbf{v}}^{0\intercal}}{\sqrt{k}}\sigma\Big{(}\frac{{\mathbf{W}}^{0}{%
\mathbf{x}}_{\rm test}}{\sqrt{d}}\Big{)},\qquad\lambda=\lambda^{1}=\lambda({%
\bm{\theta}},{\mathbf{x}}_{\rm test})=\frac{{\mathbf{v}}^{\intercal}}{\sqrt{k}%
}\sigma\Big{(}\frac{{\mathbf{W}}{\mathbf{x}}_{\rm test}}{\sqrt{d}}\Big{)}. \tag{30}
$$
Recall that the bracket $\langle\,\cdot\,\rangle$ is the average w.r.t. the posterior and acts on ${\bm{\theta}}^{1}={\bm{\theta}},{\bm{\theta}}^{2},...$ . Notice that the last term on the r.h.s. of (29) can be rewritten as
$$
\displaystyle\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D},{\mathbf{x}}_{\rm test}}\langle\mathbb{E}[y\mid\lambda]\rangle^{2}=\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D},{\mathbf{x}}_{\rm test}}\langle\mathbb{E}[y\mid\lambda^{1}]\mathbb{E}[y\mid\lambda^{2}]\rangle,
$$
with superscripts being replica indices, i.e., $\lambda^{a}:=\lambda({\bm{\theta}}^{a},{\mathbf{x}}_{\rm test})$ .
In order to show Result 2.2 for a generic $P_{\rm out}$ we assume the joint Gaussianity of the variables $(\lambda^{0},\lambda^{1},\lambda^{2},...)$ , with covariance given by $K^{ab}$ with $a,b\in\{0,1,2,...\}$ . Indeed, in the limit “ $\lim$ ”, our theory considers $(\lambda^{a})_{a\geq 0}$ as jointly Gaussian under the randomness of a common input, here ${\mathbf{x}}_{\rm test}$ , conditionally on the weights $({\bm{\theta}}^{a})$ . Their covariance depends on the weights $({\bm{\theta}}^{a})$ through the various overlap order parameters introduced in the main text. But in the large limit “ $\lim$ ” these overlaps are assumed to concentrate under the quenched posterior average $\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\langle\,\cdot\,\rangle$ towards non-random asymptotic values corresponding to the extremiser globally maximising the RS potential in Result 2.1, with the overlaps entering $K^{ab}$ through (42). This hypothesis is then confirmed by the excellent agreement between our theoretical predictions based on this assumption and the experimental results. This directly implies the equation for $\lim\,\varepsilon^{\mathcal{C},\mathsf{f}}$ in Result 2.2 from definition (2). For the special case of the optimal mean-square generalisation error it yields
$$
\displaystyle\lim\,\varepsilon^{\rm opt}=\mathbb{E}_{\lambda^{0}}\mathbb{E}[y^%
{2}_{\rm test}\mid\lambda^{0}]-2\mathbb{E}_{\lambda^{0},\lambda^{1}}\mathbb{E}%
[y_{\rm test}\mid\lambda^{0}]\mathbb{E}[y\mid\lambda^{1}]+\mathbb{E}_{\lambda^%
{1},\lambda^{2}}\mathbb{E}[y\mid\lambda^{1}]\mathbb{E}[y\mid\lambda^{2}] \tag{31}
$$
where, in the replica symmetric ansatz,
$$
\displaystyle\mathbb{E}[(\lambda^{0})^{2}]=K^{00},\quad\mathbb{E}[\lambda^{0}%
\lambda^{1}]=\mathbb{E}[\lambda^{0}\lambda^{2}]=K^{01},\quad\mathbb{E}[\lambda%
^{1}\lambda^{2}]=K^{12},\quad\mathbb{E}[(\lambda^{1})^{2}]=\mathbb{E}[(\lambda%
^{2})^{2}]=K^{11}. \tag{32}
$$
For the dependence of the elements of ${\mathbf{K}}$ on the overlaps under this ansatz we refer the reader to (45), (46). In the Bayes-optimal setting, using the Nishimori identities (see App. B), one can show that $K^{01}=K^{12}$ and $K^{00}=K^{11}$ . Because of these identifications, we additionally have
$$
\displaystyle\mathbb{E}_{\lambda^{0},\lambda^{1}}\mathbb{E}[y_{\rm test}\mid%
\lambda^{0}]\mathbb{E}[y\mid\lambda^{1}]=\mathbb{E}_{\lambda^{1},\lambda^{2}}%
\mathbb{E}[y\mid\lambda^{1}]\mathbb{E}[y\mid\lambda^{2}]. \tag{33}
$$
Plugging the above in (31) yields (7).
Let us now prove a formula for the optimal mean-square generalisation error written in terms of the overlaps that will be simpler to evaluate numerically, and which holds for the special case of a linear readout with Gaussian label noise $P_{\rm out}(y\mid\lambda)=\exp(-\frac{1}{2\Delta}(y-\lambda)^{2})/\sqrt{2\pi\Delta}$ . The following derivation is exact and does not require any Gaussianity assumption on the random variables $(\lambda^{a})$ . For the linear Gaussian channel the means verify $\mathbb{E}[y\mid\lambda]=\lambda$ and $\mathbb{E}[y^{2}\mid\lambda]=\lambda^{2}+\Delta$ . Plugged into (29), this yields
$$
\displaystyle\varepsilon^{\rm opt}-\Delta=\mathbb{E}_{{\bm{\theta}}^{0},{%
\mathbf{x}}_{\rm test}}\lambda^{2}_{\rm test}-2\mathbb{E}_{{\bm{\theta}}^{0},%
\mathcal{D},{\mathbf{x}}_{\rm test}}\lambda^{0}\langle\lambda\rangle+\mathbb{E%
}_{{\bm{\theta}}^{0},\mathcal{D},{\mathbf{x}}_{\rm test}}\langle\lambda^{1}%
\lambda^{2}\rangle, \tag{34}
$$
whence we clearly see that the generalisation error depends only on the covariance of $\lambda_{\rm test}({\bm{\theta}}^{0})=\lambda^{0}({\bm{\theta}}^{0}),\lambda^{%
1}({\bm{\theta}}^{1}),\lambda^{2}({\bm{\theta}}^{2})$ under the randomness of the shared input ${\mathbf{x}}_{\rm test}$ at fixed weights, regardless of the validity of the Gaussian equivalence principle we assume in the replica computation. This covariance was already computed in (8); we recall it here for the reader’s convenience
$$
\displaystyle K({\bm{\theta}}^{a},{\bm{\theta}}^{b}):=\mathbb{E}\lambda^{a}%
\lambda^{b}=\sum_{\ell=1}^{\infty}\frac{\mu_{\ell}^{2}}{\ell!}\frac{1}{k}\sum_%
{i,j=1}^{k}v_{i}^{a}(\Omega^{ab}_{ij})^{\ell}v^{b}_{j}=\sum_{\ell=1}^{\infty}%
\frac{\mu_{\ell}^{2}}{\ell!}Q_{\ell}^{ab}, \tag{35}
$$
where $\Omega^{ab}_{ij}:={\mathbf{W}}_{i}^{a\intercal}{\mathbf{W}}_{j}^{b}/d$ , and $Q_{\ell}^{ab}$ as introduced in (8) for $a,b=0,1,2$ . We stress that $K({\bm{\theta}}^{a},{\bm{\theta}}^{b})$ is not the limiting covariance $K^{ab}$ whose elements are in (45), (46), but rather the finite-size one. $K({\bm{\theta}}^{a},{\bm{\theta}}^{b})$ provides us with an efficient way to compute the generalisation error numerically, that is, through the formula
$$
\displaystyle\varepsilon^{\rm opt}-\Delta \displaystyle=\mathbb{E}_{{\bm{\theta}}^{0}}K({\bm{\theta}}^{0},{\bm{\theta}}^%
{0})-2\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\langle K({\bm{\theta}}^{0},{%
\bm{\theta}}^{1})\rangle+\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\langle K({%
\bm{\theta}}^{1},{\bm{\theta}}^{2})\rangle=\sum_{\ell=1}^{\infty}\frac{\mu_{%
\ell}^{2}}{\ell!}\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\langle Q_{\ell}^{0%
0}-2Q_{\ell}^{01}+Q^{12}_{\ell}\rangle. \tag{36}
$$
In the above, the posterior average $\langle\,\cdot\,\rangle$ is handled by Monte Carlo sampling (when the chain equilibrates). In addition, as in the main text, we assume that in the large system limit the (numerically confirmed) identity (11) holds. Putting all the ingredients together we get
$$
\displaystyle\varepsilon^{\rm opt}-\Delta=\mathbb{E}_{{\bm{\theta}}^{0},%
\mathcal{D}} \displaystyle\Big{\langle}\mu_{1}^{2}(Q_{1}^{00}-2Q^{01}_{1}+Q^{12}_{1})+\frac%
{\mu_{2}^{2}}{2}(Q_{2}^{00}-2Q^{01}_{2}+Q^{12}_{2}) \displaystyle+\mathbb{E}_{v\sim P_{v}}v^{2}\big{[}g(\mathcal{Q}_{W}^{00}(v))-2%
g(\mathcal{Q}_{W}^{01}(v))+g(\mathcal{Q}_{W}^{12}(v))\big{]}\Big{\rangle}. \tag{37}
$$
In the Bayes-optimal setting one can use again the Nishimori identities that imply $\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\langle Q^{12}_{1}\rangle=\mathbb{E}%
_{{\bm{\theta}}^{0},\mathcal{D}}\langle Q^{01}_{1}\rangle$ , and analogously $\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\langle Q^{12}_{2}\rangle=\mathbb{E}%
_{{\bm{\theta}}^{0},\mathcal{D}}\langle Q^{01}_{2}\rangle$ and $\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\langle g(\mathcal{Q}^{12}_{W}(v))%
\rangle=\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\langle g(\mathcal{Q}^{01}_{%
W}(v))\rangle$ . Inserting these identities in (37) one gets
$$
\displaystyle\varepsilon^{\rm opt}-\Delta \displaystyle=\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\Big{\langle}\mu_{1}^{%
2}(Q_{1}^{00}-Q^{01}_{1})+\frac{\mu_{2}^{2}}{2}(Q_{2}^{00}-Q^{01}_{2})+\mathbb%
{E}_{v\sim P_{v}}v^{2}\big{[}g(\mathcal{Q}_{W}^{00}(v))-g(\mathcal{Q}_{W}^{01}%
(v))\big{]}\Big{\rangle}. \tag{38}
$$
This formula makes no assumption (other than (11)), including on the law of the $\lambda$ ’s. That it depends only on their covariance is simply a consequence of the quadratic nature of the mean-square generalisation error.
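A minimal sketch of how (36) can be used in practice (assuming posterior weight samples are available, e.g. from MCMC, and that the squared Hermite coefficients $\mu_{\ell}^{2}$ of the activation have been precomputed; the series is truncated at a finite $\ell_{\max}$):

```python
import numpy as np
from math import factorial

def Q_ell(va, Wa, vb, Wb, ell):
    """Overlap Q_ell^{ab} of (35): (1/k) sum_ij v_i^a (Omega_ij^{ab})^ell v_j^b."""
    Omega = Wa @ Wb.T / Wa.shape[1]             # k x k matrix of W-overlaps
    return va @ (Omega**ell) @ vb / len(va)     # elementwise power of Omega

def eps_minus_Delta(theta0, theta1, theta2, mu2, l_max=8):
    """Truncated estimator (36); theta^1, theta^2 are two posterior samples and
    mu2[l] holds the squared Hermite coefficient mu_l^2 of the activation."""
    (v0, W0), (v1, W1), (v2, W2) = theta0, theta1, theta2
    return sum(mu2[l] / factorial(l)
               * (Q_ell(v0, W0, v0, W0, l)
                  - 2 * Q_ell(v0, W0, v1, W1, l)
                  + Q_ell(v1, W1, v2, W2, l))
               for l in range(1, l_max + 1))
```

A basic sanity check: if the posterior has collapsed onto the teacher ($\theta^{1}=\theta^{2}=\theta^{0}$) the estimator returns zero, as it should; in practice it is averaged over many pairs of MCMC samples.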
**Remark C.1**
*Note that the derivation up to (36) did not assume Bayes-optimality (while (38) does). Therefore, one can consider it in cases where the true posterior average $\langle\,·\,\rangle$ is replaced by one not verifying the Nishimori identities. This is the formula we use to compute the generalisation error of Monte Carlo-based estimators in the inset of Fig. 7. This is indeed needed to compute the generalisation in the glassy regime, where MCMC cannot equilibrate.*
**Remark C.2**
*Using the Nishimori identity of App. B and again the fact that, for the linear readout with Gaussian label noise, $\mathbb{E}[y\mid\lambda]=\lambda$ and $\mathbb{E}[y^{2}\mid\lambda]=\lambda^{2}+\Delta$ , it is easy to check that the so-called Gibbs error
$$
\varepsilon^{\rm Gibbs}:=\mathbb{E}_{\bm{\theta}^{0},{\mathcal{D}},{\mathbf{x}%
}_{\rm test},y_{\rm test}}\big{\langle}(y_{\rm test}-\mathbb{E}[y\mid\lambda_{%
\rm test}({\bm{\theta}})])^{2}\big{\rangle} \tag{39}
$$
is related for this channel to the Bayes-optimal mean-square generalisation error through the identity
$$
\varepsilon^{\rm Gibbs}-\Delta=2(\varepsilon^{\rm opt}-\Delta). \tag{40}
$$
We exploited this relationship together with the concentration of the Gibbs error w.r.t. the quenched posterior measure $\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\langle\,·\,\rangle$ when evaluating the numerical generalisation error of the Monte Carlo algorithms reported in the main text.*
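The relation (40) can be illustrated in the same scalar toy model as in App. B (a hedged sketch, not the network setting): there $\varepsilon^{\rm opt}-\Delta$ is the excess error of the posterior mean, $\Delta/(1+\Delta)$, while the Gibbs predictor uses a single posterior sample and doubles the excess error.

```python
import numpy as np

rng = np.random.default_rng(2)
Delta, n = 0.4, 400_000
theta0 = rng.standard_normal(n)
y_train = theta0 + np.sqrt(Delta) * rng.standard_normal(n)
y_test = theta0 + np.sqrt(Delta) * rng.standard_normal(n)

pm, pv = y_train / (1 + Delta), Delta / (1 + Delta)    # exact posterior
theta = pm + np.sqrt(pv) * rng.standard_normal(n)      # one Gibbs (posterior) sample

eps_opt = np.mean((y_test - pm) ** 2)       # Bayes predictor: posterior mean
eps_gibbs = np.mean((y_test - theta) ** 2)  # Gibbs predictor: posterior sample
```

Up to Monte Carlo fluctuations, $\varepsilon^{\rm Gibbs}-\Delta$ is twice $\varepsilon^{\rm opt}-\Delta$, as in (40).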
Appendix D Details of the replica calculation
D.1 Energetic potential
The replicated energetic term under our Gaussian assumption on the joint law of the post-activations replicas is reported here for the reader’s convenience:
$$
F_{E}=\ln\int dy\int d{\bm{\lambda}}\frac{e^{-\frac{1}{2}{\bm{\lambda}}^{%
\intercal}{\mathbf{K}}^{-1}{\bm{\lambda}}}}{\sqrt{(2\pi)^{s+1}\det{\mathbf{K}}%
}}\prod_{a=0}^{s}P_{\rm out}(y\mid\lambda^{a}). \tag{41}
$$
After applying our ansatz (10) and using that $Q_{1}^{ab}=1$ in the quadratic-data regime, the covariance matrix ${\mathbf{K}}$ in replica space defined in (8) reads
$$
\displaystyle K^{ab} \displaystyle=\mu_{1}^{2}+\frac{\mu_{2}^{2}}{2}Q^{ab}_{2}+\mathbb{E}_{v\sim P_%
{v}}v^{2}g(\mathcal{Q}_{W}^{ab}(v)), \tag{42}
$$
where the function
$$
g(x)=\sum_{\ell=3}^{\infty}\frac{\mu_{\ell}^{2}}{\ell!}x^{\ell}=\mathbb{E}_{(y%
,z)|x}[\sigma(y)\sigma(z)]-\mu_{0}^{2}-\mu_{1}^{2}x-\frac{\mu_{2}^{2}}{2}x^{2}%
,\qquad(y,z)\sim{\mathcal{N}}\left((0,0),\begin{pmatrix}1&x\\
x&1\end{pmatrix}\right). \tag{43}
$$
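The two expressions for $g$ in (43) can be cross-checked numerically (a sketch with the arbitrary choice $\sigma=\tanh$, for which the even coefficients $\mu_{0},\mu_{2}$ vanish; all integrals are Gauss–Hermite quadratures and the series is truncated):

```python
import numpy as np
from math import factorial

x, w = np.polynomial.hermite.hermgauss(80)
z = x * np.sqrt(2.0)                         # quadrature nodes for N(0, 1)

def he(ell, t):                              # probabilists' Hermite (r = 1)
    prev, cur = np.ones_like(t), t.copy()
    if ell == 0:
        return prev
    for l in range(1, ell):
        prev, cur = cur, t * cur - l * prev
    return cur

def mu(ell, sigma=np.tanh):
    return np.sum(w * he(ell, z) * sigma(z)) / np.sqrt(np.pi)

def g_series(xval, L=40, sigma=np.tanh):
    """Tail sum in (43), truncated at order L."""
    return sum(mu(l, sigma)**2 / factorial(l) * xval**l for l in range(3, L))

def g_expectation(xval, sigma=np.tanh):
    """E[sigma(y)sigma(z)] over the correlated N(0,1) pair, minus the low orders."""
    a, b = np.meshgrid(z, z)
    y1, y2 = a, xval * a + np.sqrt(1.0 - xval**2) * b
    E = np.sum(np.outer(w, w) / np.pi * sigma(y1) * sigma(y2))
    return E - mu(0)**2 - mu(1)**2 * xval - mu(2)**2 / 2 * xval**2
```

Both routes agree for $|x|<1$, where the Hermite series converges.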
The energetic term $F_{E}$ is already expressed as a low-dimensional integral, but within the replica symmetric (RS) ansatz it simplifies considerably. Let us denote $\bm{\mathcal{Q}}_{W}(\mathsf{v})=(\mathcal{Q}_{W}^{ab}(\mathsf{v}))_{a,b=0}^{s}$ . The RS ansatz amounts to assuming that the saddle point solutions are dominated by order parameters of the form (below $\bm{1}_{s}$ and ${\mathbb{I}}_{s}$ are the all-ones vector and the identity matrix of size $s$ )
$$
\bm{\mathcal{Q}}_{W}(\mathsf{v})=\begin{pmatrix}\rho_{W}&m_{W}\bm{1}_{s}^{%
\intercal}\\
m_{W}\bm{1}_{s}&(r_{W}-\mathcal{Q}_{W}){\mathbb{I}}_{s}+\mathcal{Q}_{W}\bm{1}_%
{s}\bm{1}_{s}^{\intercal}\end{pmatrix}\iff\hat{\bm{\mathcal{Q}}}_{W}(\mathsf{v%
})=\begin{pmatrix}\hat{\rho}_{W}&-\hat{m}_{W}\bm{1}_{s}^{\intercal}\\
-\hat{m}_{W}\bm{1}_{s}&(\hat{r}_{W}+\hat{\mathcal{Q}}_{W}){\mathbb{I}}_{s}-%
\hat{\mathcal{Q}}_{W}\bm{1}_{s}\bm{1}_{s}^{\intercal}\end{pmatrix},
$$
where all the above parameters $\rho_{W}=\rho_{W}(\mathsf{v}),\hat{\rho}_{W},m_{W},...$ depend on $\mathsf{v}$ , and similarly
$$
{\mathbf{Q}}_{2}=\begin{pmatrix}\rho_{2}&m_{2}\bm{1}_{s}^{\intercal}\\
m_{2}\bm{1}_{s}&(r_{2}-q_{2}){\mathbb{I}}_{s}+q_{2}\bm{1}_{s}\bm{1}_{s}^{%
\intercal}\end{pmatrix}\iff\hat{{\mathbf{Q}}}_{2}=\begin{pmatrix}\hat{\rho}_{2%
}&-\hat{m}_{2}\bm{1}_{s}^{\intercal}\\
-\hat{m}_{2}\bm{1}_{s}&(\hat{r}_{2}+\hat{q}_{2}){\mathbb{I}}_{s}-\hat{q}_{2}%
\bm{1}_{s}\bm{1}_{s}^{\intercal}\end{pmatrix},
$$
where we reported the ansatz also for the Fourier conjugates for future convenience, though they are not needed for the energetic potential. (We are going to use repeatedly the Fourier representation of the delta function, namely $\delta(x)=\frac{1}{2\pi}\int d\hat{x}\exp(i\hat{x}x)$ . Because the integrals we will end up with will always be at some point evaluated by saddle point, implying a deformation of the integration contour in the complex plane, tracking the imaginary unit $i$ in the delta functions is irrelevant. Similarly, the normalisation $1/2\pi$ only contributes sub-leading terms to the integrals at hand. Therefore, we allow ourselves to formally write $\delta(x)=\int d\hat{x}\exp(r\hat{x}x)$ for a convenient constant $r$ ; since the final integrals are evaluated by saddle point, the choice of $r$ is irrelevant.) The RS ansatz, which is equivalent to an assumption of concentration of the order parameters in the high-dimensional limit, is known to be exact when analysing Bayes-optimal inference and learning, as in the present paper, see Nishimori (2001); Barbier (2020); Barbier & Panchenko (2022). Under the RS ansatz ${\mathbf{K}}$ acquires a similar form:
$$
\displaystyle{\mathbf{K}}=\begin{pmatrix}\rho_{K}&m_{K}\bm{1}_{s}^{\intercal}%
\\
m_{K}\bm{1}_{s}&(r_{K}-q_{K}){\mathbb{I}}_{s}+q_{K}\bm{1}_{s}\bm{1}_{s}^{%
\intercal}\end{pmatrix} \tag{44}
$$
with
$$
\displaystyle m_{K}=\mu_{1}^{2}+\frac{\mu_{2}^{2}}{2}m_{2}+\mathbb{E}_{v\sim P%
_{v}}v^{2}g(m_{W}(v)),\quad \displaystyle q_{K}=\mu_{1}^{2}+\frac{\mu_{2}^{2}}{2}q_{2}+\mathbb{E}_{v\sim P%
_{v}}v^{2}g(\mathcal{Q}_{W}(v)), \displaystyle\rho_{K}=\mu_{1}^{2}+\frac{\mu_{2}^{2}}{2}\rho_{2}+\mathbb{E}_{v%
\sim P_{v}}v^{2}g(\rho_{W}(v)),\quad \displaystyle r_{K}=\mu_{1}^{2}+\frac{\mu_{2}^{2}}{2}r_{2}+\mathbb{E}_{v\sim P%
_{v}}v^{2}g(r_{W}(v)). \tag{45}
$$
In the RS ansatz it is thus possible to give a convenient low-dimensional representation of the multivariate Gaussian integral of $F_{E}$ in terms of white Gaussian random variables:
$$
\displaystyle\lambda^{a}=\xi\sqrt{q_{K}}+u^{a}\sqrt{r_{K}-q_{K}}\quad\text{for%
}a=1,\dots,s,\qquad\lambda^{0}=\xi\sqrt{\frac{m_{K}^{2}}{q_{K}}}+u^{0}\sqrt{%
\rho_{K}-\frac{m_{K}^{2}}{q_{K}}} \tag{47}
$$
where $\xi,(u^{a})_{a=0}^{s}$ are i.i.d. standard Gaussian variables. Then
$$
\displaystyle F_{E}=\ln\int dy\,\mathbb{E}_{\xi,u^{0}}P_{\rm out}\Big{(}y\mid%
\xi\sqrt{\frac{m_{K}^{2}}{q_{K}}}+u^{0}\sqrt{\rho_{K}-\frac{m_{K}^{2}}{q_{K}}}%
\Big{)}\prod_{a=1}^{s}\mathbb{E}_{u^{a}}P_{\rm out}(y\mid\xi\sqrt{q_{K}}+u^{a}%
\sqrt{r_{K}-q_{K}}). \tag{48}
$$
The last product over the replica index $a$ contains identical factors thanks to the RS ansatz. Therefore, by expanding in $s\to 0^{+}$ we get
$$
\displaystyle F_{E}=s\int dy\,\mathbb{E}_{\xi,u^{0}}P_{\rm out}\Big{(}y\mid\xi%
\sqrt{\frac{m_{K}^{2}}{q_{K}}}+u^{0}\sqrt{\rho_{K}-\frac{m_{K}^{2}}{q_{K}}}%
\Big{)}\ln\mathbb{E}_{u}P_{\rm out}(y\mid\xi\sqrt{q_{K}}+u\sqrt{r_{K}-q_{K}})+%
O(s^{2}). \tag{49}
$$
For the linear readout with Gaussian label noise $P_{\rm out}(y\mid\lambda)=\exp(-\frac{1}{2\Delta}(y-\lambda)^{2})/\sqrt{2\pi\Delta}$ the above gives
$$
\displaystyle F_{E}=-\frac{s}{2}\ln\big{[}2\pi(\Delta+r_{K}-q_{K})\big{]}-%
\frac{s}{2}\frac{\Delta+\rho_{K}-2m_{K}+q_{K}}{\Delta+r_{K}-q_{K}}+O(s^{2}). \tag{50}
$$
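The $O(s)$ coefficient in (50) can be cross-checked by Monte Carlo evaluation of (49) for this channel, where the inner expectation is available in closed form, $\mathbb{E}_{u}P_{\rm out}(y\mid\xi\sqrt{q_{K}}+u\sqrt{r_{K}-q_{K}})=\mathcal{N}(y;\xi\sqrt{q_{K}},\Delta+r_{K}-q_{K})$ (a sketch with arbitrary admissible parameter values):

```python
import numpy as np

rng = np.random.default_rng(4)
m_K, q_K, r_K, rho_K, Delta = 0.5, 0.6, 1.0, 1.1, 0.3   # arbitrary admissible values
n = 1_000_000

# sample (xi, u0), then y ~ P_out(. | lambda^0) with lambda^0 built as in (47)
xi, u0 = rng.standard_normal((2, n))
lam0 = xi * np.sqrt(m_K**2 / q_K) + u0 * np.sqrt(rho_K - m_K**2 / q_K)
y = lam0 + np.sqrt(Delta) * rng.standard_normal(n)

V = Delta + r_K - q_K
# F_E / s from (49): expectation of the log Gaussian density N(y; xi*sqrt(q_K), V)
mc = np.mean(-0.5 * np.log(2 * np.pi * V) - (y - xi * np.sqrt(q_K))**2 / (2 * V))
# closed form (50)
closed = -0.5 * np.log(2 * np.pi * V) - 0.5 * (Delta + rho_K - 2 * m_K + q_K) / V
```

The agreement follows since $\mathbb{E}[(y-\xi\sqrt{q_{K}})^{2}]=\Delta+\rho_{K}-2m_{K}+q_{K}$ once the mean mismatch $(m_{K}/\sqrt{q_{K}}-\sqrt{q_{K}})^{2}\xi^{2}$ is averaged over $\xi$.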
In the Bayes-optimal setting the Nishimori identities enforce
$$
\displaystyle r_{2}=\rho_{2}=\lim_{d\to\infty}\frac{1}{d^{2}}\mathbb{E}{\rm Tr%
}[({\mathbf{S}}_{2}^{0})^{2}]=1+\gamma\bar{v}^{2}\quad\text{and}\quad m_{2}=q_%
{2}, \displaystyle r_{W}(\mathsf{v})=\rho_{W}(\mathsf{v})=1\quad\text{and}\quad m_{%
W}(\mathsf{v})=\mathcal{Q}_{W}(\mathsf{v})\ \forall\ \mathsf{v}\in\mathsf{V}, \tag{51}
$$
which also implies that
$$
\displaystyle r_{K}=\rho_{K}=\mu_{1}^{2}+\frac{1}{2}r_{2}\mu_{2}^{2}+g(1),\quad m_{K}=q_{K}. \tag{52}
$$
Therefore the above simplifies to
$$
\displaystyle F_{E} \displaystyle=s\int dy\,\mathbb{E}_{\xi,u^{0}}P_{\rm out}(y\mid\xi\sqrt{q_{K}}%
+u^{0}\sqrt{r_{K}-q_{K}})\ln\mathbb{E}_{u}P_{\rm out}(y\mid\xi\sqrt{q_{K}}+u%
\sqrt{r_{K}-q_{K}})+O(s^{2}) \displaystyle=:s\,\psi_{P_{\rm{out}}}(q_{K}(q_{2},\mathcal{Q}_{W});r_{K})+O(s^%
{2}). \tag{54}
$$
Notice that the energetic contribution to the free entropy has the same form as in the generalised linear model Barbier et al. (2019). For our running example of linear readout with Gaussian noise the function $\psi_{P_{\rm out}}$ reduces to
$$
\displaystyle\psi_{P_{\rm{out}}}(q_{K}(q_{2},\mathcal{Q}_{W});r_{K})=-\frac{1}%
{2}\ln\big{[}2\pi e(\Delta+r_{K}-q_{K})\big{]}. \tag{56}
$$
In what follows we restrict ourselves to the replica symmetric ansatz in the Bayes-optimal setting. Therefore, identifications such as those in (51), (52) are assumed.
D.2 Second moment of $P(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W})$
For the reader’s convenience we report here the measure
$$
\displaystyle P(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W}) \displaystyle=V_{W}^{kd}(\bm{\mathcal{Q}}_{W})^{-1}\int\prod_{a}^{0,s}dP_{W}({%
\mathbf{W}}^{a})\delta({\mathbf{S}}^{a}_{2}-{\mathbf{W}}^{a\intercal}({\mathbf%
{v}}){\mathbf{W}}^{a}/\sqrt{k})\prod_{a\leq b}^{0,s}\prod_{\mathsf{v}\in%
\mathsf{V}}\prod_{i\in\mathcal{I}_{\mathsf{v}}}\delta({d}\,\mathcal{Q}_{W}^{ab%
}(\mathsf{v})-{\mathbf{W}}^{a\intercal}_{i}{\mathbf{W}}_{i}^{b}). \tag{57}
$$
Recall $\mathsf{V}$ is the support of $P_{v}$ (assumed discrete for the moment). Recall also that we have quenched the readout weights to the ground truth. Indeed, as discussed in the main text, considering them learnable or fixed to the truth does not change the leading order of the information-theoretic quantities.
Under this measure, one can compute the asymptotics of the second moment
$$
\displaystyle\int dP(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W})\frac{1}{d%
^{2}}{\rm Tr}\,{\mathbf{S}}_{2}^{a}{\mathbf{S}}_{2}^{b} \displaystyle=V_{W}^{kd}(\bm{\mathcal{Q}}_{W})^{-1}\int\prod_{a}^{0,s}dP_{W}({%
\mathbf{W}}^{a})\frac{1}{kd^{2}}{\rm Tr}[{\mathbf{W}}^{a\intercal}({\mathbf{v}%
}){\mathbf{W}}^{a}{\mathbf{W}}^{b\intercal}({\mathbf{v}}){\mathbf{W}}^{b}] \displaystyle\qquad\qquad\times\prod_{a\leq b}^{0,s}\prod_{\mathsf{v}\in%
\mathsf{V}}\prod_{i\in\mathcal{I}_{v}}\delta({d}\,\mathcal{Q}_{W}^{ab}(\mathsf%
{v})-{\mathbf{W}}^{a\intercal}_{i}{\mathbf{W}}_{i}^{b}). \tag{58}
$$
The measure is coupled only through the latter $\delta$ ’s. We can decouple the measure at the cost of introducing Fourier conjugates whose values will then be fixed by a saddle point computation. The second moment computed will not affect the saddle point, hence it is sufficient to determine the value of the Fourier conjugates through the computation of $V_{W}^{kd}(\bm{\mathcal{Q}}_{W})$ , which rewrites as
$$
\displaystyle V_{W}^{kd}(\bm{\mathcal{Q}}_{W}) \displaystyle=\int\prod_{a}^{0,s}dP_{W}({\mathbf{W}}^{a})\prod_{a\leq b}^{0,s}%
\prod_{\mathsf{v}\in\mathsf{V}}\prod_{i\in\mathcal{I}_{\mathsf{v}}}d\hat{B}^{%
ab}_{i}(\mathsf{v})\exp\big{[}-\hat{B}^{ab}_{i}(\mathsf{v})({d}\,\mathcal{Q}_{%
W}^{ab}(\mathsf{v})-{\mathbf{W}}^{a\intercal}_{i}{\mathbf{W}}_{i}^{b})\big{]} \displaystyle\approx\prod_{\mathsf{v}\in\mathsf{V}}\prod_{i\in\mathcal{I}_{%
\mathsf{v}}}\exp\Big{(}d\,{\rm extr}_{(\hat{B}^{ab}_{i}(\mathsf{v}))}\Big{[}-%
\sum_{a\leq b,0}^{s}\hat{B}^{ab}_{i}(\mathsf{v})\mathcal{Q}_{W}^{ab}(\mathsf{v%
})+\ln\int\prod_{a=0}^{s}dP_{W}(w_{a})e^{\sum_{a\leq b,0}^{s}\hat{B}_{i}^{ab}(%
\mathsf{v})w_{a}w_{b}}\Big{]}\Big{)}. \tag{59}
$$
In the last line we have used saddle point integration over $\hat{B}^{ab}_{i}(\mathsf{v})$ and the approximate equality is up to a multiplicative $\exp(o(n))$ constant. From the above, it is clear that the stationary $\hat{B}^{ab}_{i}(\mathsf{v})$ are such that
$$
\displaystyle\mathcal{Q}_{W}^{ab}(\mathsf{v})=\frac{\int\prod_{r=0}^{s}dP_{W}(%
w_{r})w_{a}w_{b}\prod_{r\leq t,0}^{s}e^{\hat{B}_{i}^{rt}(\mathsf{v})w_{r}w_{t}%
}}{\int\prod_{r=0}^{s}dP_{W}(w_{r})\prod_{r\leq t,0}^{s}e^{\hat{B}_{i}^{rt}(%
\mathsf{v})w_{r}w_{t}}}=:\langle w_{a}w_{b}\rangle_{\hat{\mathbf{B}}(\mathsf{v%
})}. \tag{60}
$$
Hence the stationary $\hat{B}_{i}^{ab}(\mathsf{v})=\hat{B}^{ab}(\mathsf{v})$ are homogeneous in $i$ . Using this notation, the trace moment of the ${\mathbf{S}}_{2}$ ’s at leading order becomes
$$
\displaystyle\int dP(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W}) \displaystyle\frac{1}{d^{2}}{\rm Tr}\,{\mathbf{S}}_{2}^{a}{\mathbf{S}}_{2}^{b}%
=\frac{1}{kd^{2}}\sum_{i,l=1}^{k}\sum_{j,p=1}^{d}\langle W_{ij}^{a}v_{i}W_{ip}%
^{a}W_{lj}^{b}v_{l}W_{lp}^{b}\rangle_{\{\hat{\mathbf{B}}(\mathsf{v})\}_{%
\mathsf{v}\in\mathsf{V}}} \displaystyle=\frac{1}{k}\sum_{\mathsf{v}\in\mathsf{V}}\mathsf{v}^{2}\sum_{i%
\in\mathcal{I}_{\mathsf{v}}}\Big{\langle}\Big{(}\frac{1}{d}\sum_{j=1}^{d}W_{ij%
}^{a}W_{ij}^{b}\Big{)}^{2}\Big{\rangle}_{\hat{\mathbf{B}}(\mathsf{v})}+\frac{1%
}{k}\sum_{j=1}^{d}\Big{\langle}\sum_{i=1}^{k}\frac{v_{i}(W_{ij}^{a})^{2}}{d}%
\sum_{l\neq i,1}^{k}\frac{v_{l}(W_{lj}^{b})^{2}}{d}\Big{\rangle}_{\hat{\mathbf%
{B}}(\mathsf{v})}. \tag{61}
$$
We have used the fact that $\smash{\langle\,\cdot\,\rangle_{\hat{\mathbf{B}}(\mathsf{v})}}$ is symmetric if the prior $P_{W}$ is, thus forcing us to match $j$ with $p$ whenever $i\neq l$ . Since by the Nishimori identities $\mathcal{Q}_{W}^{aa}(\mathsf{v})=1$ , it follows that $\hat{B}^{aa}(\mathsf{v})=0$ for any $a=0,1,...,s$ and $\mathsf{v}\in\mathsf{V}$ . Furthermore, the measure $\langle\,\cdot\,\rangle_{\hat{\mathbf{B}}(\mathsf{v})}$ is completely factorised over neuron and input indices. Hence every normalised sum can be assumed to concentrate to its expectation by the law of large numbers. Specifically, we can write that with high probability as $d,k\to\infty$ ,
$$
\displaystyle\frac{1}{d}\sum_{j=1}^{d}W_{ij}^{a}W_{ij}^{b}\xrightarrow{}%
\mathcal{Q}_{W}^{ab}(\mathsf{v})\ \forall\ i\in\mathcal{I}_{\mathsf{v}},\qquad%
\frac{1}{k}\sum_{\mathsf{v},\mathsf{v}^{\prime}\in\mathsf{V}}\mathsf{v}\mathsf%
{v}^{\prime}\sum_{j=1}^{d}\sum_{i\in\mathcal{I}_{\mathsf{v}}}\frac{(W_{ij}^{a}%
)^{2}}{d}\sum_{l\in\mathcal{I}_{\mathsf{v}^{\prime}},l\neq i}\frac{(W_{lj}^{b}%
)^{2}}{d}\approx\gamma\sum_{\mathsf{v},\mathsf{v}^{\prime}\in\mathsf{V}}\frac{%
|\mathcal{I}_{\mathsf{v}}||\mathcal{I}_{\mathsf{v}^{\prime}}|}{k^{2}}\mathsf{v%
}\mathsf{v}^{\prime}\to\gamma\bar{v}^{2}, \tag{62}
$$
where we used $|\mathcal{I}_{\mathsf{v}}|/k\to P_{v}(\mathsf{v})$ as $k$ diverges. Consequently, the second moment at leading order appears as claimed:
$$
\displaystyle\int dP(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W}) \displaystyle\frac{1}{d^{2}}{\rm Tr}\,{\mathbf{S}}_{2}^{a}{\mathbf{S}}_{2}^{b}%
=\sum_{\mathsf{v}\in\mathsf{V}}P_{v}(\mathsf{v})\mathsf{v}^{2}\mathcal{Q}_{W}^%
{ab}(\mathsf{v})^{2}+\gamma\bar{v}^{2}=\mathbb{E}_{v\sim P_{v}}v^{2}\mathcal{Q%
}_{W}^{ab}(v)^{2}+\gamma\bar{v}^{2}. \tag{63}
$$
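The leading-order formula (63) lends itself to a direct Monte Carlo check (a sketch with a hand-picked binary prior for $v$ and row-wise overlaps $\mathcal{Q}_{W}^{ab}(\mathsf{v})$; the asymptotic expectations are replaced by empirical counterparts):

```python
import numpy as np

rng = np.random.default_rng(5)
d, k = 600, 300
gamma = k / d
v = rng.choice([1.0, 2.0], size=k)       # binary readout prior, chosen by hand
Q = np.where(v == 1.0, 0.3, 0.7)         # target overlaps Q_W^{ab}(v) per neuron

# rows of (W^a, W^b) i.i.d. Gaussian with per-row correlation Q(v_i)
Wa = rng.standard_normal((k, d))
Wb = Q[:, None] * Wa + np.sqrt(1.0 - Q**2)[:, None] * rng.standard_normal((k, d))

Sa = Wa.T @ (v[:, None] * Wa) / np.sqrt(k)   # S_2^a = W^aT diag(v) W^a / sqrt(k)
Sb = Wb.T @ (v[:, None] * Wb) / np.sqrt(k)

lhs = np.sum(Sa * Sb) / d**2                 # (1/d^2) Tr S_2^a S_2^b (symmetric S)
rhs = np.mean(v**2 * Q**2) + gamma * np.mean(v)**2
```

At these moderate sizes the two sides already agree to a few percent, the gap vanishing as $d,k$ grow.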
Notice that the effective law $\tilde{P}(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W})$ in (15) is the least restrictive choice among the Wishart-type distributions with a trace moment fixed precisely to the one above. In more specific terms, it is the solution of the following maximum entropy problem:
$$
\displaystyle\inf_{P,\tau}\Big{\{}D_{\rm KL}(P\,\|\,P_{S}^{\otimes s+1})+\sum_%
{a\leq b,0}^{s}\tau^{ab}\Big{(}\mathbb{E}_{P}\frac{1}{d^{2}}{\rm Tr}\,{\mathbf%
{S}}_{2}^{a}{\mathbf{S}}_{2}^{b}-\gamma\bar{v}^{2}-\mathbb{E}_{v\sim P_{v}}v^{%
2}\mathcal{Q}_{W}^{ab}(v)^{2}\Big{)}\Big{\}}, \tag{64}
$$
where $P_{S}$ is a generalised Wishart distribution (as defined above (15)), and $P$ is in the space of joint probability distributions over $s+1$ symmetric matrices of dimension $d\times d$ . The rationale behind the choice of $P_{S}$ as a base measure is that, in the absence of any other information, a statistician can always use a generalised Wishart measure for the ${\mathbf{S}}_{2}$ ’s if they assume universality in the law of the inner weights. This ansatz would yield the theory of Maillard et al. (2024a), which still describes a non-trivial performance, achieved by the adaptation of GAMP-RIE of Appendix H.
Note that if $a=b$ then, by (51), the second moment above matches precisely $r_{2}=1+\gamma\bar{v}^{2}$ . This directly entails $\tau^{aa}=0$ , as the generalised Wishart prior $P_{S}$ already imposes this constraint.
D.3 Entropic potential
We now use the results from the previous section to compute the entropic contribution $F_{S}$ to the free entropy:
$$
\displaystyle e^{F_{S}}:=V_{W}^{kd}(\bm{\mathcal{Q}}_{W})\int dP(({\mathbf{S}}%
_{2}^{a})\mid\bm{\mathcal{Q}}_{W})\prod_{a\leq b}^{0,s}\delta(d^{2}Q_{2}^{ab}-%
{{\rm Tr}\,{\mathbf{S}}_{2}^{a}{\mathbf{S}}_{2}^{b}}). \tag{65}
$$
The factor $V_{W}^{kd}(\bm{\mathcal{Q}}_{W})$ was already treated in the previous section. However, here it will contribute as a tilt of the overall entropic contribution, and the Fourier conjugates $\hat{\mathcal{Q}}_{W}^{ab}(\mathsf{v})$ will appear in the final variational principle.
Let us now proceed with the relaxation of the measure $P(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W})$ by replacing it with $\tilde{P}(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W})$ given by (15):
$$
\displaystyle e^{F_{S}}=V_{W}^{kd}(\bm{\mathcal{Q}}_{W})\int d\hat{{\mathbf{Q}%
}}_{2}\exp\Big{(}-\frac{d^{2}}{2}\sum_{a\leq b,0}^{s}\hat{Q}^{ab}_{2}Q^{ab}_{2%
}\Big{)}\frac{1}{\tilde{V}^{kd}_{W}(\bm{\mathcal{Q}}_{W})}\int\prod_{a=0}^{s}%
dP_{S}({\mathbf{S}}_{2}^{a})\exp\Big{(}\sum_{a\leq b,0}^{s}\frac{\tau_{ab}+%
\hat{Q}_{2}^{ab}}{2}{\rm Tr}\,{\mathbf{S}}_{2}^{a}{\mathbf{S}}_{2}^{b}\Big{)} \tag{66}
$$
where we have introduced another set of Fourier conjugates $\hat{\mathbf{Q}}_{2}$ for ${\mathbf{Q}}_{2}$ . As usual, the Nishimori identities impose $Q_{2}^{aa}=r_{2}=1+\gamma\bar{v}^{2}$ without the need of any Fourier conjugate. Hence, similarly to $\tau^{aa}$ , $\hat{Q}_{2}^{aa}=0$ too. Furthermore, in the hypothesis of replica symmetry, we set $\tau^{ab}=\tau$ and $\hat{Q}_{2}^{ab}=\hat{q}_{2}$ for all $0\leq a<b\leq s$ .
Then, when the number of replicas $s$ tends to $0^{+}$ , we can recognise the free entropy of a matrix denoising problem. More specifically, using the Hubbard–Stratonovich transformation (i.e., $\mathbb{E}_{{\mathbf{Z}}}\exp(\frac{d}{2}{\rm Tr}\,{\mathbf{M}}{\mathbf{Z}})=%
\exp(\frac{d}{4}{\rm Tr}\,{\mathbf{M}}^{2})$ for a $d\times d$ symmetric matrix ${\mathbf{M}}$ with ${\mathbf{Z}}$ a standard GOE matrix) we get
$$
\displaystyle J_{n}(\tau,\hat{q}_{2}) \displaystyle:=\lim_{s\to 0^{+}}\frac{1}{ns}\ln\int\prod_{a=0}^{s}dP_{S}({%
\mathbf{S}}_{2}^{a})\exp\Big{(}\frac{\tau+\hat{q}_{2}}{2}\sum_{a<b,0}^{s}{\rm
Tr%
}\,{\mathbf{S}}_{2}^{a}{\mathbf{S}}_{2}^{b}\Big{)} \displaystyle=\frac{1}{n}\mathbb{E}\ln\int dP_{S}({\mathbf{S}}_{2})\exp\frac{1%
}{2}{\rm Tr}\Big{(}\sqrt{\tau+\hat{q}_{2}}{\mathbf{Y}}{\mathbf{S}}_{2}-(\tau+%
\hat{q}_{2})\frac{{\mathbf{S}}_{2}^{2}}{2}\Big{)}, \tag{67}
$$
where ${\mathbf{Y}}={\mathbf{Y}}(\tau+\hat{q}_{2})=\sqrt{\tau+\hat{q}_{2}}{\mathbf{S}%
}_{2}^{0}+{\bm{\xi}}$ with ${\bm{\xi}}/\sqrt{d}$ a standard GOE matrix, and the outer expectation is w.r.t. ${\mathbf{Y}}$ (or ${\mathbf{S}}^{0},{\bm{\xi}}$ ). Thanks to the fact that the base measure $P_{S}$ is rotationally invariant, the above can be solved exactly in the limit $n\to\infty,\,n/d^{2}\to\alpha$ (see e.g. Pourkamali et al. (2024)):
$$
\displaystyle J(\tau,\hat{q}_{2})=\lim J_{n}(\tau,\hat{q}_{2})=\frac{1}{\alpha%
}\Big{(}\frac{(\tau+\hat{q}_{2})r_{2}}{4}-\iota(\tau+\hat{q}_{2})\Big{)},\quad%
\text{with}\quad\iota(\eta):=\frac{1}{8}+\frac{1}{2}\Sigma(\mu_{{\mathbf{Y}}(%
\eta)}). \tag{68}
$$
Here $\iota(\eta)=\lim I({\mathbf{Y}}(\eta);{\mathbf{S}}^{0}_{2})/d^{2}$ is the limiting mutual information between data ${\mathbf{Y}}(\eta)$ and signal ${\mathbf{S}}^{0}_{2}$ for the channel ${\mathbf{Y}}(\eta)=\sqrt{\eta}{\mathbf{S}}^{0}_{2}+{\bm{\xi}}$ , the measure $\mu_{{\mathbf{Y}}(\eta)}$ is the asymptotic spectral law of the rescaled observation matrix ${\mathbf{Y}}(\eta)/\sqrt{d}$ , and $\Sigma(\mu):=\int\ln|x-y|d\mu(x)d\mu(y)$ . Using free probability, the law $\mu_{{\mathbf{Y}}(\eta)}$ can be obtained as the free convolution of a generalised Marchenko-Pastur distribution (the asymptotic spectral law of ${\mathbf{S}}^{0}_{2}$ , which is a generalised Wishart random matrix) and the semicircular distribution (the asymptotic spectral law of ${\bm{\xi}}$ ), see Potters & Bouchaud (2020). We provide the code to obtain this distribution numerically in the attached repository. The function ${\rm mmse}_{S}(\eta)$ is obtained through a derivative of $\iota$ , using the so-called I-MMSE relation Guo et al. (2005); Pourkamali et al. (2024):
$$
4\frac{d}{d\eta}\iota(\eta)={\rm mmse}_{S}(\eta)=\frac{1}{\eta}\Big{(}1-\frac{4\pi^{2}}{3}\int\mu^{3}_{{\mathbf{Y}}(\eta)}(y)dy\Big{)}. \tag{69}
$$
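Two of the ingredients above lend themselves to quick numerical sanity checks. First, the free-convolution picture: the spectrum of ${\mathbf{Y}}(\eta)/\sqrt{d}$ has a variance equal to the sum of the variances of the two spectral laws, since free cumulants are additive under free convolution. Second, the I-MMSE relation itself, which for the scalar Gaussian channel $y=\sqrt{\eta}s+\xi$ reads $dI/d\eta=\frac{1}{2}{\rm mmse}(\eta)$ (the factor differs from (69) because of the $d^{2}/4$ normalisation of $\iota$). A minimal sketch; the Wishart normalisation below is an illustrative stand-in, not the paper's exact ${\mathbf{S}}_{2}^{0}$, which carries the readout weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def goe(d):
    """GOE matrix scaled so its spectrum tends to a unit-variance semicircle."""
    a = rng.standard_normal((d, d))
    return (a + a.T) / np.sqrt(2 * d)

def centred_wishart(d, k):
    """Centred, rescaled Wishart matrix with unit spectral variance
    (a stand-in for S_2^0 / sqrt(d) with trivial readout weights)."""
    x = rng.standard_normal((d, k))
    return (x @ x.T - k * np.eye(d)) / np.sqrt(k * d)

# --- Check 1: the variance of mu_Y(eta) is additive under free convolution.
eta, d, k = 1.0, 400, 200
lam = np.linalg.eigvalsh(np.sqrt(eta) * centred_wishart(d, k) + goe(d))
var_empirical = np.mean(lam**2)     # second moment of the empirical spectrum
var_predicted = eta * 1.0 + 1.0     # variances (free cumulants of order 2) add

# --- Check 2: scalar I-MMSE relation dI/deta = mmse(eta) / 2.
I = lambda e: 0.5 * np.log1p(e)     # mutual information for s, xi ~ N(0,1)
mmse = lambda e: 1.0 / (1.0 + e)    # scalar Gaussian-channel mmse
h = 1e-6
dI = (I(1.5 + h) - I(1.5 - h)) / (2 * h)

print(var_empirical, var_predicted)  # close to each other
print(dI, mmse(1.5) / 2)             # both ≈ 0.2
```

The first check only tests second moments; the full density comparison requires the free-convolution solver provided in the repository.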
The normalisation $\frac{1}{ns}\ln\tilde{V}_{W}^{kd}(\bm{\mathcal{Q}}_{W})$ in the limit $n→∞$, $s→ 0^{+}$ can simply be computed as $J(\tau,0)$.
For the other normalisation, following the same steps as in the previous section, we can simplify $V^{kd}_{W}(\bm{\mathcal{Q}}_{W})$ as follows:
$$
\frac{1}{ns}\ln V_{W}^{kd}(\bm{\mathcal{Q}}_{W})\approx\frac{\gamma}{\alpha s}\sum_{\mathsf{v}\in\mathsf{V}}\frac{1}{k}\sum_{i\in\mathcal{I}_{\mathsf{v}}}{\rm extr}\Big{[}-\sum_{a\leq b,0}^{s}\hat{\mathcal{Q}}^{ab}_{W,i}(\mathsf{v})\mathcal{Q}^{ab}_{W}(\mathsf{v})+\ln\int\prod_{a=0}^{s}dP_{W}(w_{a})e^{\sum_{a\leq b,0}^{s}\hat{\mathcal{Q}}^{ab}_{W,i}(\mathsf{v})w_{a}w_{b}}\Big{]}, \tag{70}
$$
as $n$ grows, where extremisation is w.r.t. the hatted variables only. As in the previous section, $\hat{\mathcal{Q}}^{ab}_{W,i}(\mathsf{v})$ is homogeneous over $i∈\mathcal{I}_{\mathsf{v}}$ for a given $\mathsf{v}$. Furthermore, thanks to the Nishimori identities, at the saddle point $\hat{\mathcal{Q}}_{W}^{aa}(\mathsf{v})=0$ and ${\mathcal{Q}}_{W}^{aa}(\mathsf{v})=1+\gamma\bar{v}^{2}$. This, together with standard steps and the RS ansatz, allows us to write the $d→∞$, $s→ 0^{+}$ limit of the above as
$$
\lim_{s\to 0^{+}}\lim\frac{1}{ns}\ln V_{W}^{kd}(\bm{\mathcal{Q}}_{W})=\frac{\gamma}{\alpha}\mathbb{E}_{v\sim P_{v}}{\rm extr}\Big{[}-\frac{\hat{\mathcal{Q}}_{W}(v)\mathcal{Q}_{W}(v)}{2}+\psi_{P_{W}}(\hat{\mathcal{Q}}_{W}(v))\Big{]} \tag{71}
$$
with $\psi_{P_{W}}(\,·\,)$ as in the main text. Gathering all these results directly yields
$$
\lim_{s\to 0^{+}}\lim\frac{F_{S}}{ns}={\rm extr}\Big{\{}\frac{\hat{q}_{2}(r_{2}-q_{2})}{4\alpha}-\frac{1}{\alpha}\big{[}\iota(\tau+\hat{q}_{2})-\iota(\tau)\big{]}+\frac{\gamma}{\alpha}\mathbb{E}_{v\sim P_{v}}\Big{[}\psi_{P_{W}}(\hat{\mathcal{Q}}_{W}(v))-\frac{\hat{\mathcal{Q}}_{W}(v)\mathcal{Q}_{W}(v)}{2}\Big{]}\Big{\}}. \tag{72}
$$
Extremisation is w.r.t. $\hat{q}_{2},\hat{\mathcal{Q}}_{W}$, while $\tau$ is to be understood as a function of $\mathcal{Q}_{W}=\{{\mathcal{Q}}_{W}(\mathsf{v})\mid\mathsf{v}∈\mathsf{V}\}$ through the moment matching condition:
$$
4\alpha\,\partial_{\tau}J(\tau,0)=r_{2}-4\iota^{\prime}(\tau)=\mathbb{E}_{v\sim P_{v}}v^{2}\mathcal{Q}_{W}(v)^{2}+\gamma\bar{v}^{2}, \tag{73}
$$
which is the $s→ 0^{+}$ limit of the moment matching condition between $P(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W})$ and $\tilde{P}(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W})$ . Simplifying using the value of $r_{2}=1+\gamma\bar{v}^{2}$ according to the Nishimori identities, and using the I-MMSE relation between $\iota(\tau)$ and ${\rm mmse}_{S}(\tau)$ , we get
$$
{\rm mmse}_{S}(\tau)=1-\mathbb{E}_{v\sim P_{v}}v^{2}\mathcal{Q}_{W}(v)^{2}\quad\iff\quad\tau={\rm mmse}_{S}^{-1}\big{(}1-\mathbb{E}_{v\sim P_{v}}v^{2}\mathcal{Q}_{W}(v)^{2}\big{)}. \tag{74}
$$
Since ${\rm mmse}_{S}$ is a monotonically decreasing (hence invertible) function of its argument, the above always has a solution, and it is unique for a given collection $\mathcal{Q}_{W}$.
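In practice (74) must be inverted numerically; since ${\rm mmse}_{S}$ is strictly decreasing, a simple bisection with a growing bracket suffices. A sketch, using the scalar-channel ${\rm mmse}(\eta)=1/(1+\eta)$ as a hypothetical stand-in for ${\rm mmse}_{S}$:

```python
def invert_decreasing(f, target, lo=0.0, hi=1.0, tol=1e-10):
    """Solve f(eta) = target for a continuous, strictly decreasing f,
    growing the bracket [lo, hi] until it contains the solution."""
    while f(hi) > target:       # f decreasing: push hi right until f(hi) <= target
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > target:     # solution lies to the right of mid
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Stand-in for mmse_S: the scalar Gaussian-channel mmse, also decreasing.
mmse = lambda eta: 1.0 / (1.0 + eta)
tau = invert_decreasing(mmse, target=0.25)
print(tau)  # mmse(3) = 0.25, so tau ≈ 3
```

The same routine applies verbatim to the matrix ${\rm mmse}_{S}$ of (69) once it is tabulated numerically.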
D.4 RS free entropy and saddle point equations
Putting the energetic and entropic contributions together we obtain the variational replica symmetric free entropy potential:
$$
f^{\alpha,\gamma}_{\rm RS}:=\psi_{P_{\text{out}}}(q_{K}(q_{2},\mathcal{Q}_{W});r_{K})+\frac{1}{4\alpha}(1+\gamma\bar{v}^{2}-q_{2})\hat{q}_{2}+\frac{\gamma}{\alpha}\mathbb{E}_{v\sim P_{v}}\big{[}\psi_{P_{W}}(\hat{\mathcal{Q}}_{W}(v))-\frac{1}{2}\mathcal{Q}_{W}(v)\hat{\mathcal{Q}}_{W}(v)\big{]}+\frac{1}{\alpha}\big{[}\iota(\tau(\mathcal{Q}_{W}))-\iota(\hat{q}_{2}+\tau(\mathcal{Q}_{W}))\big{]}, \tag{75}
$$
which is then extremised w.r.t. $\{\hat{\mathcal{Q}}_{W}(\mathsf{v}),\mathcal{Q}_{W}(\mathsf{v})\mid\mathsf{v}∈\mathsf{V}\},\hat{q}_{2},q_{2}$, while $\tau$ is a function of ${\mathcal{Q}}_{W}$ through the moment matching condition (74). The saddle point equations are then
$$
\left[\begin{array}{@{}l@{\quad}l@{}}&{\mathcal{Q}}_{W}(\mathsf{v})=\mathbb{E}_{w^{0},\xi}[w^{0}{\langle w\rangle}_{\hat{\mathcal{Q}}_{W}(\mathsf{v})}],\\
&\hat{\mathcal{Q}}_{W}(\mathsf{v})=\frac{1}{2\gamma}(q_{2}-\gamma\bar{v}^{2}-\mathbb{E}_{v\sim P_{v}}v^{2}\mathcal{Q}_{W}(v)^{2})\partial_{{\mathcal{Q}}_{W}(\mathsf{v})}\tau(\mathcal{Q}_{W})+2\frac{\alpha}{\gamma}\partial_{{\mathcal{Q}}_{W}(\mathsf{v})}\psi_{P_{\text{out}}}(q_{K}(q_{2},\mathcal{Q}_{W});r_{K}),\\
&q_{2}=r_{2}-\frac{1}{\hat{q}_{2}+\tau(\mathcal{Q}_{W})}\big{(}1-\frac{4\pi^{2}}{3}\int\mu^{3}_{{\mathbf{Y}}(\hat{q}_{2}+\tau(\mathcal{Q}_{W}))}(y)dy\big{)},\\
&\hat{q}_{2}=4\alpha\,\partial_{q_{2}}\psi_{P_{\text{out}}}(q_{K}(q_{2},\mathcal{Q}_{W});r_{K}),\end{array}\right. \tag{76}
$$
where, letting i.i.d. $w^{0},\xi\sim\mathcal{N}(0,1)$, we define the measure
$$
\langle\,\cdot\,\rangle_{x}=\langle\,\cdot\,\rangle_{x}(w^{0},\xi):=\frac{\int dP_{W}(w)(\,\cdot\,)e^{(\sqrt{x}\xi+xw^{0})w-\frac{1}{2}xw^{2}}}{\int dP_{W}(w)e^{(\sqrt{x}\xi+xw^{0})w-\frac{1}{2}xw^{2}}}. \tag{77}
$$
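As a concrete instance of this measure: for a Rademacher prior $P_{W}=\frac{1}{2}(\delta_{-1}+\delta_{+1})$ (a hypothetical choice, used here only for illustration), the factor $e^{-\frac{1}{2}xw^{2}}$ is constant on the support and the posterior mean reduces to the classical $\langle w\rangle_{x}=\tanh(\sqrt{x}\xi+xw^{0})$. A short check by direct summation over a discrete prior:

```python
import numpy as np

def posterior_mean(x, w0, xi, w_support, w_probs):
    """<w>_x of (77) for a discrete prior P_W supported on w_support."""
    w = np.asarray(w_support, dtype=float)
    p = np.asarray(w_probs, dtype=float)
    log_weight = (np.sqrt(x) * xi + x * w0) * w - 0.5 * x * w**2
    weight = p * np.exp(log_weight - log_weight.max())  # stabilised
    return np.sum(weight * w) / np.sum(weight)

# Rademacher prior: <w>_x must equal tanh(sqrt(x)*xi + x*w0).
x, w0, xi = 0.7, 1.0, -0.3
m = posterior_mean(x, w0, xi, [-1.0, 1.0], [0.5, 0.5])
print(m, np.tanh(np.sqrt(x) * xi + x * w0))  # identical values
```

Averaging $w^{0}\langle w\rangle_{x}$ over $w^{0},\xi$ then gives the right-hand side of the first saddle point equation in (76).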
All the above formulae are easily specialised for the linear readout with Gaussian label noise using (56). We report here the saddle point equations in this case (recalling that $g$ is defined in (43)):
$$
\left[\begin{array}{@{}l@{\quad}l@{}}&{\mathcal{Q}}_{W}(\mathsf{v})=\mathbb{E}_{w^{0},\xi}[w^{0}{\langle w\rangle}_{\hat{\mathcal{Q}}_{W}(\mathsf{v})}],\\
&\hat{\mathcal{Q}}_{W}(\mathsf{v})=\frac{1}{2\gamma}(q_{2}-\gamma\bar{v}^{2}-\mathbb{E}_{v\sim P_{v}}v^{2}\mathcal{Q}_{W}(v)^{2})\partial_{{\mathcal{Q}}_{W}(\mathsf{v})}\tau(\mathcal{Q}_{W})+\frac{\alpha}{\gamma}\frac{\mathsf{v}^{2}\,g^{\prime}(\mathcal{Q}_{W}(\mathsf{v}))}{\Delta+\frac{1}{2}\mu_{2}^{2}(r_{2}-q_{2})+g(1)-\mathbb{E}_{v\sim P_{v}}{v}^{2}g(\mathcal{Q}_{W}(v))},\\
&q_{2}=r_{2}-\frac{1}{\hat{q}_{2}+\tau(\mathcal{Q}_{W})}\big{(}1-\frac{4\pi^{2}}{3}\int\mu^{3}_{{\mathbf{Y}}(\hat{q}_{2}+\tau(\mathcal{Q}_{W}))}(y)dy\big{)},\\
&\hat{q}_{2}=\frac{\alpha\mu_{2}^{2}}{\Delta+\frac{1}{2}\mu_{2}^{2}(r_{2}-q_{2})+g(1)-\mathbb{E}_{v\sim P_{v}}v^{2}g(\mathcal{Q}_{W}(v))}.\end{array}\right. \tag{78}
$$
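Numerically, coupled systems of saddle point equations like the two above are typically solved by damped fixed-point iteration: each equation becomes an update rule, and the new iterate is mixed with the previous one for stability. A generic sketch on a hypothetical two-variable toy system (the real updates would call $\psi_{P_{\text{out}}}$, $\tau(\mathcal{Q}_{W})$ and the spectral integral, none of which are reproduced here):

```python
import numpy as np

def fixed_point(update, x0, damping=0.5, tol=1e-10, max_iter=10_000):
    """Iterate x <- (1 - damping) * update(x) + damping * x to convergence."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = (1 - damping) * np.asarray(update(x)) + damping * x
        if np.max(np.abs(x_new - x)) < tol:
            return x_new
        x = x_new
    raise RuntimeError("fixed-point iteration did not converge")

# Toy stand-in for a pair of coupled overlap / conjugate-overlap equations.
alpha = 2.0
def toy_update(x):
    q, q_hat = x
    return np.array([q_hat / (1.0 + q_hat),   # "q" update
                     alpha * q])              # "q_hat" update

q, q_hat = fixed_point(toy_update, x0=[0.5, 0.5])
print(q, q_hat)  # converges to the non-trivial fixed point (0.5, 1.0)
```

Damping is essential near phase transitions, where the undamped map can oscillate between branches.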
If one assumes that the overlaps appearing in (38) are self-averaging around the values that solve the saddle point equations (and maximise the RS potential), that is $Q^{00}_{1},Q_{1}^{01}→ 1$ (as assumed in this scaling), $Q_{2}^{00}→ r_{2}$, $Q_{2}^{01}→ q_{2}^{*}$, and ${\mathcal{Q}}_{W}^{00}(\mathsf{v})→ 1$, ${\mathcal{Q}}_{W}^{01}(\mathsf{v})→{\mathcal{Q}}_{W}^{*}(\mathsf{v})$, then the limiting Bayes-optimal mean-square generalisation error for the linear readout with Gaussian noise reads
$$
\varepsilon^{\rm opt}-\Delta=r_{K}-q_{K}^{*}=\frac{\mu_{2}^{2}}{2}(r_{2}-q_{2}^{*})+g(1)-\mathbb{E}_{v\sim P_{v}}v^{2}g(\mathcal{Q}^{*}_{W}(v)). \tag{79}
$$
This is the formula used to evaluate the theoretical Bayes-optimal mean-square generalisation error throughout the paper.
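The error formula above is cheap to evaluate once the saddle-point overlaps are known. A sketch with a toy stand-in for $g$ (the paper's $g$ is defined in (43) and is not reproduced here) and illustrative overlap values, purely to exercise the bookkeeping:

```python
import numpy as np

def gen_error(q2, Qw, g, mu2, r2, Delta, v_support, v_probs):
    """Limiting Bayes-optimal MSE for the linear readout with Gaussian
    noise: Delta + mu2^2/2 (r2 - q2) + g(1) - E_v[v^2 g(Q_W(v))]."""
    v2 = np.asarray(v_probs) * np.asarray(v_support) ** 2
    return Delta + 0.5 * mu2**2 * (r2 - q2) + g(1.0) - np.sum(v2 * g(np.asarray(Qw)))

# Readout prior of Figure 5; g(x) = x^2/2 is NOT the paper's g, just a toy.
v_support = np.array([-3, -1, 1, 3]) / np.sqrt(5)
v_probs = np.full(4, 0.25)
eps = gen_error(q2=0.9, Qw=np.array([0.8, 0.3, 0.3, 0.8]), g=lambda x: x**2 / 2,
                mu2=1.0, r2=1.0, Delta=0.1, v_support=v_support, v_probs=v_probs)
print(eps)  # 0.1 + 0.05 + 0.5 - 0.2925 = 0.3575
```

With the paper's actual $g$ and the extremising overlaps, this reproduces the theoretical curves of Figure 5.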
D.5 Non-centred activations
Consider a non-centred activation function, i.e., $\mu_{0}≠ 0$ in (17). This affects the law of the post-activations, which will still be Gaussian, but centred at
$$
\mathbb{E}_{\mathbf{x}}\lambda^{a}=\frac{\mu_{0}}{\sqrt{k}}\sum_{i=1}^{k}v_{i}=:\mu_{0}\Lambda, \tag{80}
$$
and with the covariance given by (8) (we are assuming that $Q_{W}^{aa}=1$ and that the readout weights are quenched; if instead $Q_{W}^{aa}=r$, the formula can be generalised as explained in App. A). In the above, we have introduced the new mean parameter $\Lambda$. Notice that, if the ${\mathbf{v}}$'s have a $\bar{v}=O(1)$ mean, then $\Lambda$ scales as $\sqrt{k}$ due to our choice of normalisation.
One can carry out the replica computation for a fixed $\Lambda$ . This new parameter, being quenched, does not affect the entropic term. It will only appear in the energetic term as a shift to the means, yielding
$$
F_{E}=F_{E}({\mathbf{K}},\Lambda)=\ln\int dy\int d{\bm{\lambda}}\frac{e^{-\frac{1}{2}{\bm{\lambda}}^{\intercal}{\mathbf{K}}^{-1}{\bm{\lambda}}}}{\sqrt{(2\pi)^{s+1}\det{\mathbf{K}}}}\prod_{a=0}^{s}P_{\rm{out}}(y\mid\lambda^{a}+\mu_{0}\Lambda). \tag{81}
$$
Within the replica symmetric ansatz, the above turns into
$$
e^{F_{E}}=\int dy\,\mathbb{E}_{\xi,u^{0}}P_{\rm out}\Big{(}y\mid\mu_{0}\Lambda+\xi\sqrt{\frac{m_{K}^{2}}{q_{K}}}+u^{0}\sqrt{\rho_{K}-\frac{m_{K}^{2}}{q_{K}}}\Big{)}\prod_{a=1}^{s}\mathbb{E}_{u^{a}}P_{\rm out}(y\mid\mu_{0}\Lambda+\xi\sqrt{q_{K}}+u^{a}\sqrt{r_{K}-q_{K}}).
$$
Therefore, the simplification of the potential $F_{E}$ proceeds as in the centred activation case, yielding at leading order in the number $s$ of replicas
$$
\frac{F_{E}(r_{K},q_{K},\Lambda)}{s}=\int dy\,\mathbb{E}_{\xi,u^{0}}P_{\rm out}\Big{(}y\mid\mu_{0}\Lambda+\xi\sqrt{q_{K}}+u^{0}\sqrt{r_{K}-q_{K}}\Big{)}\ln\mathbb{E}_{u}P_{\rm out}(y\mid\mu_{0}\Lambda+\xi\sqrt{q_{K}}+u\sqrt{r_{K}-q_{K}})+O(s)
$$
in the Bayes-optimal setting. When $P_{\rm out}(y\mid\lambda)=f(y-\lambda)$, one can verify that the contributions due to the means, containing $\mu_{0}$, cancel each other. This is the case in our running example, where $P_{\rm out}$ is the Gaussian channel:
$$
\frac{F_{E}(r_{K},q_{K},\Lambda)}{s}=-\frac{1}{2}\ln\big{[}2\pi(\Delta+r_{K}-q_{K})\big{]}-\frac{1}{2}-\frac{\mu_{0}^{2}}{2}\frac{(\Lambda-\Lambda)^{2}}{\Delta+r_{K}-q_{K}}+O(s)=-\frac{1}{2}\ln\big{[}2\pi(\Delta+r_{K}-q_{K})\big{]}-\frac{1}{2}+O(s). \tag{82}
$$
Appendix E Alternative simplifications of $P(({\mathbf{S}}^{a}_{2})\mid\bm{{\mathcal{Q}}}_{W})$ through moment matching
A crucial step that allowed us to obtain a closed-form expression for the model's free entropy is the relaxation $\tilde{P}(({\mathbf{S}}^{a}_{2})\mid\bm{{\mathcal{Q}}}_{W})$ (15) of the true measure $P(({\mathbf{S}}^{a}_{2})\mid\bm{{\mathcal{Q}}}_{W})$ (14) entering the replicated partition function, as explained in Sec. 4. The specific form we chose (a tilted Wishart distribution with matching second moment) has the advantage of capturing crucial features of the true measure, such as the fact that the matrices ${\mathbf{S}}^{a}_{2}$ are generalised Wishart matrices with coupled replicas, while keeping the problem solvable with techniques from the random matrix theory of rotationally invariant ensembles. In this appendix, we report some alternative routes one can take to simplify, or potentially improve, the theory.
E.1 A factorised simplified distribution
In the specialisation phase, one can assume that the only crucial feature to keep track of when relaxing $P(({\mathbf{S}}^{a}_{2})\mid\bm{{\mathcal{Q}}}_{W})$ (14) is the coupling between different replicas, which becomes more and more relevant as $\alpha$ increases. In this case, inspired by Sakata & Kabashima (2013); Kabashima et al. (2016), in order to relax (14) we can propose the Gaussian ansatz
$$
d\bar{P}(({\mathbf{S}}_{2}^{a})\mid\bm{{\mathcal{Q}}}_{W})=\prod_{a=0}^{s}d{\mathbf{S}}^{a}_{2}\prod_{\alpha=1}^{d}\delta(S^{a}_{2;\alpha\alpha}-\sqrt{k}\bar{v})\times\prod_{\alpha_{1}<\alpha_{2}}^{d}\frac{e^{-\frac{1}{2}\sum_{a,b=0}^{s}S^{a}_{2;\alpha_{1}\alpha_{2}}\bar{\tau}^{ab}(\bm{{\mathcal{Q}}}_{W})S^{b}_{2;\alpha_{1}\alpha_{2}}}}{\sqrt{(2\pi)^{s+1}\det(\bar{\bm{\tau}}(\bm{{\mathcal{Q}}}_{W})^{-1})}}, \tag{83}
$$
where $\bar{v}$ is the mean of the readout prior $P_{v}$, and $\bar{\bm{\tau}}(\bm{{\mathcal{Q}}}_{W}):=(\bar{\tau}^{ab}(\bm{{\mathcal{Q}}}_{W}))_{a,b}$ is fixed by
$$
[\bar{\bm{\tau}}(\bm{{\mathcal{Q}}}_{W})^{-1}]_{ab}=\mathbb{E}_{v\sim P_{v}}v^{2}{\mathcal{Q}}_{W}^{ab}(v)^{2}.
$$
In words, first, the diagonal elements of ${\mathbf{S}}_{2}^{a}$ are $d$ random variables whose $O(1)$ fluctuations cannot affect the free entropy in the asymptotic regime we are considering, being too few compared to $n=\Theta(d^{2})$ . Hence, we assume they concentrate to their mean. Concerning the $d(d-1)/2$ off-diagonal elements of the matrices $({\mathbf{S}}_{2}^{a})_{a}$ , they are zero-mean variables whose distribution at given $\bm{{\mathcal{Q}}}_{W}$ is assumed to be factorised over the input indices. The definition of $\bar{\bm{\tau}}(\bm{{\mathcal{Q}}}_{W})$ ensures matching with the true second moment (63).
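Under the RS ansatz, the moment-matching condition above reduces to inverting a single $(s+1)\times(s+1)$ matrix with equal diagonal and equal off-diagonal entries. A minimal numpy sketch for the discrete readout prior used in Figure 5, with hypothetical RS overlaps $\mathcal{Q}^{aa}_{W}(v)=1$ and $\mathcal{Q}^{ab}_{W}(v)=q_{W}(v)$ for $a\neq b$ (illustrative values, not saddle-point solutions):

```python
import numpy as np

# Discrete readout prior P_v (the one used in Figure 5).
v_support = np.array([-3, -1, 1, 3]) / np.sqrt(5)
v_probs = np.full(4, 0.25)

def tau_bar(q_w, s):
    """Inverse of M_ab = E_v[v^2 Q_W^ab(v)^2] under the RS ansatz
    Q_W^aa(v) = 1, Q_W^ab(v) = q_w(v) for a != b (hypothetical values)."""
    n = s + 1
    diag = np.sum(v_probs * v_support**2)            # E[v^2] (= 1 for this prior)
    off = np.sum(v_probs * v_support**2 * q_w**2)    # E[v^2 q_w(v)^2]
    M = off * np.ones((n, n)) + (diag - off) * np.eye(n)
    return np.linalg.inv(M)

q_w = np.array([0.9, 0.2, 0.2, 0.9])  # one overlap value per atom of P_v
tb = tau_bar(q_w, s=3)
# The inverse of an RS matrix is again RS: equal diagonal, equal off-diagonal.
print(tb[0, 0], tb[1, 1], tb[0, 1], tb[0, 2])
```

The RS structure of $\bar{\bm{\tau}}$ is what makes the $s→ 0^{+}$ determinant expansion below tractable.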
(83) is considerably simpler than (15): following this ansatz, the entropic contribution to the free entropy gives
$$
e^{\bar{F}_{S}}:=\int\prod_{a\leq b,0}^{s}d\hat{Q}_{2}^{ab}\,e^{kd\ln V_{W}(\bm{\mathcal{Q}}_{W})+\frac{d^{2}}{4}{\rm Tr}\hat{\mathbf{Q}}^{\intercal}_{2}{\mathbf{Q}}_{2}}\Big{[}\int\prod_{a=0}^{s}dS^{a}_{2}\,\frac{e^{-\frac{1}{2}\sum_{a,b=0}^{s}S^{a}_{2}[\bar{\tau}^{ab}(\bm{{\mathcal{Q}}}_{W})+\hat{Q}_{2}^{ab}]S^{b}_{2}}}{\sqrt{(2\pi)^{s+1}\det(\bar{\bm{\tau}}(\bm{{\mathcal{Q}}}_{W})^{-1})}}\Big{]}^{d(d-1)/2}\times\int\prod_{a=0}^{s}\prod_{\alpha=1}^{d}dS^{a}_{2;\alpha\alpha}\delta(S^{a}_{2;\alpha\alpha}-\sqrt{k}\bar{v})\,e^{-\frac{1}{4}\sum_{a,b=0}^{s}\hat{Q}_{2}^{ab}\sum_{\alpha=1}^{d}S_{2;\alpha\alpha}^{a}S_{2;\alpha\alpha}^{b}}, \tag{84}
$$
instead of (66). Integration over the diagonal elements $(S_{2;\alpha\alpha}^{a})_{\alpha}$ can be done straightforwardly, yielding
$$
e^{\bar{F}_{S}}=\int\prod_{a\leq b,0}^{s}d\hat{Q}_{2}^{ab}\,e^{kd\ln V_{W}(\bm{\mathcal{Q}}_{W})+\frac{d^{2}}{4}{\rm Tr}\hat{\mathbf{Q}}_{2}^{\intercal}({\mathbf{Q}}_{2}-\gamma\mathbf{1}\mathbf{1}^{\intercal}\bar{v}^{2})}\Big{[}\int\prod_{a=0}^{s}dS^{a}_{2}\,\frac{e^{-\frac{1}{2}\sum_{a,b=0}^{s}S^{a}_{2}[\bar{\tau}^{ab}(\bm{{\mathcal{Q}}}_{W})+\hat{Q}_{2}^{ab}]S^{b}_{2}}}{\sqrt{(2\pi)^{s+1}\det(\bar{\bm{\tau}}(\bm{{\mathcal{Q}}}_{W})^{-1})}}\Big{]}^{d(d-1)/2}. \tag{85}
$$
The remaining Gaussian integral over the off-diagonal elements of ${\mathbf{S}}_{2}$ can be performed exactly, leading to
$$
e^{\bar{F}_{S}}=\int\prod_{a\leq b,0}^{s}d\hat{Q}_{2}^{ab}\,e^{kd\ln V_{W}(\bm{\mathcal{Q}}_{W})+\frac{d^{2}}{4}{\rm Tr}\hat{\mathbf{Q}}_{2}^{\intercal}({\mathbf{Q}}_{2}-\gamma\mathbf{1}\mathbf{1}^{\intercal}\bar{v}^{2})-\frac{d(d-1)}{4}\ln\det[{\mathbb{I}}_{s+1}+\hat{\mathbf{Q}}_{2}\bar{\bm{\tau}}(\bm{{\mathcal{Q}}}_{W})^{-1}]}. \tag{86}
$$
In order to proceed and perform the $s→ 0^{+}$ limit, we use the RS ansatz for the overlap matrices, combined with the Nishimori identities, as explained above. The only difference w.r.t. the approach detailed in Appendix D is the determinant in the exponent of the integrand of (86), which reads
$$
\ln\det[{\mathbb{I}}_{s+1}+\hat{\mathbf{Q}}_{2}\bar{\bm{\tau}}(\bm{{\mathcal{Q}}}_{W})^{-1}]=s\ln[1+\hat{q}_{2}(1-\mathbb{E}_{v\sim P_{v}}v^{2}\mathcal{Q}_{W}(v)^{2})]-s\hat{q}_{2}+O(s^{2}). \tag{87}
$$
After taking the replica and high-dimensional limits, the resulting free entropy is
$$
f_{\rm sp}^{\alpha,\gamma}=\psi_{P_{\text{out}}}(q_{K}(q_{2},{\mathcal{Q}}_{W});r_{K})+\frac{(1+\gamma\bar{v}^{2}-q_{2})\hat{q}_{2}}{4\alpha}+\frac{\gamma}{\alpha}\mathbb{E}_{v\sim P_{v}}\big{[}\psi_{P_{W}}(\hat{\mathcal{Q}}_{W}(v))-\frac{1}{2}\mathcal{Q}_{W}(v)\hat{{\mathcal{Q}}}_{W}(v)\big{]}-\frac{1}{4\alpha}\ln\big{[}1+\hat{q}_{2}(1-\mathbb{E}_{v\sim P_{v}}v^{2}\mathcal{Q}_{W}(v)^{2})\big{]}, \tag{88}
$$
to be extremised w.r.t. $q_{2},\hat{q}_{2},\{{\mathcal{Q}}_{W}(\mathsf{v}),\hat{{\mathcal{Q}}}_{W}(\mathsf{v})\}$. The main advantage of this expression over (75) is its simplicity: the moment-matching condition fixing $\bar{\bm{\tau}}(\bm{{\mathcal{Q}}}_{W})$ is straightforward (and has been solved explicitly in the final formula), and the result does not depend on the non-trivial (and numerically costly) function $\iota(\eta)$, the mutual information of the associated matrix denoising problem, which has effectively been replaced by the much simpler problem of denoising independent Gaussian variables under Gaussian noise. Moreover, one can show, in the same fashion as in Appendix G, that the generalisation error predicted from this expression has the same large-$\alpha$ behaviour as the one obtained from (75). However, not surprisingly, since it is derived from an ansatz ignoring the Wishart-like nature of the matrices ${\mathbf{S}}_{2}^{a}$, this expression does not reproduce the expected behaviour of the model in the universal phase, i.e. for $\alpha<\alpha_{\rm sp}(\gamma)$.
Figure 5: Different theoretical curves and numerical results for ReLU(x) activation, $P_{v}=\frac{1}{4}(\delta_{-3/\sqrt{5}}+\delta_{-1/\sqrt{5}}+\delta_{1/\sqrt{5}}+\delta_{3/\sqrt{5}})$, $d=150$, $\gamma=0.5$, with linear readout with Gaussian noise of variance $\Delta=0.1$. Top left: Optimal mean-square generalisation error predicted by the theory reported in the main text (solid blue) versus the branch obtained from the simplified ansatz (83) (solid red); the green solid line shows the universal branch corresponding to $\mathcal{Q}_{W}\equiv 0$, and empty circles are HMC results with informative initialisation. Top right: Theoretical free entropy curves (colors and linestyles as top left). Bottom: Predictions for the overlaps $\mathcal{Q}_{W}(\mathsf{v})$ and $q_{2}$ from the theory devised in the main text (left) and in Appendix E.1 (right).
To fix this issue, one can compare the predictions of the theory derived from this ansatz with the ones obtained by plugging ${\mathcal{Q}}_{W}(\mathsf{v})=0$ for all $\mathsf{v}$ (denoted ${\mathcal{Q}}_{W}\equiv 0$) into the theory devised in the main text (6),
$$
f_{\rm uni}^{\alpha,\gamma}:=\psi_{P_{\text{out}}}(q_{K}(q_{2},\mathcal{Q}_{W}\equiv 0);r_{K})+\frac{1}{4\alpha}(1+\gamma\bar{v}^{2}-q_{2})\hat{q}_{2}-\frac{1}{\alpha}\iota(\hat{q}_{2}), \tag{89}
$$
to be extremised now only w.r.t. the scalar parameters $q_{2}$, $\hat{q}_{2}$ (one can easily verify that, for ${\mathcal{Q}}_{W}\equiv 0$, $\tau({\mathcal{Q}}_{W})=0$ and the extremisation w.r.t. $\hat{{\mathcal{Q}}}_{W}$ in (6) gives $\hat{{\mathcal{Q}}}_{W}\equiv 0$). Notice that $f_{\rm uni}^{\alpha,\gamma}$ does not depend on the prior over the inner weights, which is why we call it “universal”. For consistency, the two free entropies $f_{\rm sp}^{\alpha,\gamma}$ and $f_{\rm uni}^{\alpha,\gamma}$ should be compared through a discrete variational principle: the free entropy of the model is predicted to be
$$
\bar{f}^{\alpha,\gamma}_{\rm RS}:=\max\{{\rm extr}f_{\rm uni}^{\alpha,\gamma},{\rm extr}f_{\rm sp}^{\alpha,\gamma}\}, \tag{90}
$$
instead of the unified variational form (6). Quite generally, ${\rm extr}f_{\rm uni}^{\alpha,\gamma}>{\rm extr}f_{\rm sp}^{\alpha,\gamma}$ for low values of $\alpha$ , so that the behaviour of the model in the universal phase is correctly predicted. The curves cross at a critical value
$$
\bar{\alpha}_{\rm sp}(\gamma)=\sup\{\alpha\mid{\rm extr}f_{\rm uni}^{\alpha,\gamma}>{\rm extr}f_{\rm sp}^{\alpha,\gamma}\}, \tag{91}
$$
instead of the value $\alpha_{\rm sp}(\gamma)$ reported in the main text. This approach was profitably adopted in Barbier et al. (2025) in the context of matrix denoising (it is also the approach we used in an earlier version of this paper, superseded by the present one and accessible on arXiv), a problem sharing some of the challenges presented in this paper. In this respect, it provides a heuristic solution that quantitatively predicts the behaviour of the model in most of its phase diagram. Moreover, for any activation $\sigma$ with vanishing second Hermite coefficient $\mu_{2}=0$ (e.g., all odd activations) the ansatz (83) yields the same theory as the one devised in the main text: in this case $q_{K}(q_{2},{\mathcal{Q}}_{W})$, entering the energetic part of the free entropy, does not depend on $q_{2}$, so the extremisation selects $q_{2}=\hat{q}_{2}=0$ and the remaining parts of (88) match those of (6). Finally, (83) is consistent with the observation that specialisation never arises in the case of quadratic activation and Gaussian prior over the inner weights: in this case, one can check that the universal branch ${\rm extr}f_{\rm uni}^{\alpha,\gamma}$ is always higher than ${\rm extr}f_{\rm sp}^{\alpha,\gamma}$, and thus never selected by (90). For a convincing check of the validity of this approach, and a comparison with the theory devised in the main text and numerical results, see Fig. 5, top left panel.
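As a purely illustrative sketch of the discrete variational principle (90)-(91) (with toy, hypothetical branch shapes, not the actual free entropies of the model): the selected free entropy is the pointwise maximum of the two extremised branches, and the threshold $\bar{\alpha}_{\rm sp}$ can be located by bisection on their difference.

```python
# Toy illustration of (90)-(91). The branch functions below are placeholders
# chosen only so that f_uni dominates at small alpha and f_sp at large alpha.

def f_uni(alpha):
    """Universal branch (toy shape, linear in alpha)."""
    return -0.4 * alpha

def f_sp(alpha):
    """Specialisation branch (toy shape)."""
    return -1.0 - 0.1 * alpha

def f_bar(alpha):
    """Discrete variational principle, eq. (90)."""
    return max(f_uni(alpha), f_sp(alpha))

def crossing(lo=0.0, hi=100.0, tol=1e-10):
    """Bisection for the largest alpha with f_uni > f_sp, eq. (91)."""
    assert f_uni(lo) > f_sp(lo) and f_uni(hi) <= f_sp(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f_uni(mid) > f_sp(mid):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

alpha_sp_bar = crossing()
print(alpha_sp_bar)   # ~ 10/3 for these toy branches
```

In practice the two `extr` values would come from solving the respective fixed point equations at each $\alpha$; only the selection rule is sketched here.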
However, despite its merits listed above, this Appendix’s approach presents some issues, both from the theoretical and practical points of view:
1. the final free entropy of the model is obtained by comparing curves derived from completely different ansätze for the distribution $P(({\mathbf{S}}^{a}_{2})\mid\bm{{\mathcal{Q}}}_{W})$ (Gaussian with coupled replicas, leading to $f_{\rm sp}$, vs. pure generalised Wishart with independent replicas, leading to $f_{\rm uni}$), rather than within a unified theory as in the main text;
2. the predicted critical value $\bar{\alpha}_{\rm sp}(\gamma)$ seems to be systematically larger than the one observed in experiments (see Fig. 5, top right panel, and compare the crossing point of the “sp” and “uni” free entropies with the actual transition where the numerical points depart from the universal branch in the top left panel);
3. predictions for the functional overlap ${\mathcal{Q}}_{W}^{*}$ from this approach are in much worse agreement with experimental data than those from the theory presented in the main text (see Fig. 5, bottom panel, and compare with Fig. 3 in the main text);
4. in the cases we tested, the predictions for the generalisation error from the theory devised in the main text are in much better agreement with numerical simulations than those from this Appendix (see Fig. 6 for a comparison).
Therefore, the more elaborate theory presented in the main text is not only more meaningful from the theoretical viewpoint, but also in overall better agreement with simulations.
Figure 6: Generalisation error for ReLU activation and Rademacher readout prior $P_{v}$: theory reported in the main text (solid blue) versus the branch obtained from the simplified ansatz (83) (solid red); the green solid line shows the universal branch ($\mathcal{Q}_{W}\equiv 0$), and empty circles are HMC results with informative initialisation.
E.2 Possible refined analyses with structured ${\mathbf{S}}_{2}$ matrices
In the main text, we kept track of the inhomogeneous profile of the readouts induced by the non-trivial distribution $P_{v}$, which is ultimately responsible for the sequence of specialisation phase transitions occurring at increasing $\alpha$. We did so through a functional order parameter ${\mathcal{Q}}_{W}(\mathsf{v})$ measuring how much the student’s hidden weights associated with the readout elements equal to $\mathsf{v}$ have aligned with the teacher’s. However, when writing $\tilde{P}(({\mathbf{S}}_{2}^{a})\mid\bm{{\mathcal{Q}}}_{W})$ we treated the tensor ${\mathbf{S}}_{2}^{a}$ as a whole, without considering the possibility that its “components”
$$
\displaystyle S_{2;\alpha_{1}\alpha_{2}}^{a}(\mathsf{v}):=\frac{\mathsf{v}}{\sqrt{|\mathcal{I}_{\mathsf{v}}|}}\sum_{i\in\mathcal{I}_{\mathsf{v}}}W^{a}_{i\alpha_{1}}W^{a}_{i\alpha_{2}} \tag{92}
$$
could follow different laws for different $\mathsf{v}\in\mathsf{V}$. To do so, let us define
$$
\displaystyle Q_{2}^{ab}=\frac{1}{k}\sum_{\mathsf{v},\mathsf{v}^{\prime}}\mathsf{v}\,\mathsf{v}^{\prime}\sum_{i\in\mathcal{I}_{\mathsf{v}},j\in\mathcal{I}_{\mathsf{v}^{\prime}}}(\Omega_{ij}^{ab})^{2}=\sum_{\mathsf{v},\mathsf{v}^{\prime}}\frac{\sqrt{|\mathcal{I}_{\mathsf{v}}||\mathcal{I}_{\mathsf{v}^{\prime}}|}}{k}{\mathcal{Q}}_{2}^{ab}(\mathsf{v},\mathsf{v}^{\prime}),\quad\text{where}\quad{\mathcal{Q}}_{2}^{ab}(\mathsf{v},\mathsf{v}^{\prime}):=\frac{1}{d^{2}}{\rm Tr}\,{\mathbf{S}}_{2}^{a}(\mathsf{v}){\mathbf{S}}_{2}^{b}(\mathsf{v}^{\prime})^{\intercal}. \tag{93}
$$
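The identity (93) can be checked numerically on a small random instance; the following sketch (arbitrary sizes, Gaussian weights, readouts drawn from the four-atom prior of Fig. 5) computes the global overlap both directly from $\Omega^{ab}$ and via the $\mathsf{v}$-resolved overlaps.

```python
import numpy as np

# Numerical check of identity (93): the global overlap Q2 equals the weighted
# sum of the readout-resolved overlaps Q2(v, v'). Sizes are small and arbitrary.
rng = np.random.default_rng(0)
d = 40
vals = np.array([-3.0, -1.0, 1.0, 3.0]) / np.sqrt(5)   # support of P_v
v = np.repeat(vals, 5)                                  # readouts, |I_v| = 5 each
k = v.size
Wa = rng.standard_normal((k, d))                        # replica a
Wb = rng.standard_normal((k, d))                        # replica b
Omega = Wa @ Wb.T / d                                   # overlap matrix

def S2(W, val):
    """Component S2(v) of eq. (92): a d x d partial Wishart over I_v."""
    idx = v == val
    return val / np.sqrt(idx.sum()) * W[idx].T @ W[idx]

# Direct expression: (1/k) sum_{v,v'} v v' sum_{i in I_v, j in I_v'} Omega_ij^2
lhs = sum(va * vb * (Omega[np.ix_(v == va, v == vb)] ** 2).sum()
          for va in vals for vb in vals) / k

# Via the v-resolved overlaps Q2(v, v') = Tr S2^a(v) S2^b(v')^T / d^2
rhs = sum(np.sqrt((v == va).sum() * (v == vb).sum()) / k
          * np.trace(S2(Wa, va) @ S2(Wb, vb).T) / d ** 2
          for va in vals for vb in vals)

print(lhs, rhs)   # the two expressions coincide up to rounding
```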
The generalisation of (63) then reads
$$
\displaystyle\int dP(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W})\,\frac{1}{d^{2}}{\rm Tr}\,{\mathbf{S}}_{2}^{a}(\mathsf{v}){\mathbf{S}}_{2}^{b}(\mathsf{v}^{\prime})^{\intercal}=\delta_{\mathsf{v}\mathsf{v}^{\prime}}\mathsf{v}^{2}\mathcal{Q}_{W}^{ab}(\mathsf{v})^{2}+\gamma\,\mathsf{v}\mathsf{v}^{\prime}\sqrt{P_{v}(\mathsf{v})P_{v}(\mathsf{v}^{\prime})} \tag{94}
$$
w.r.t. the true distribution $P(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W})$ reported in (14). Despite the already good match between the theory in the main text and the numerics, taking into account this additional level of structure through a refined simplified measure could potentially lead to further improvements. The simplified measure able to match these moment-matching conditions while taking into account the Wishart form (92) of the matrices $({\mathbf{S}}_{2}^{a}(\mathsf{v}))$ is
$$
\displaystyle d\bar{P}(({\mathbf{S}}_{2}^{a})\mid\bm{{\mathcal{Q}}}_{W})\propto\prod_{\mathsf{v}\in\mathsf{V}}\prod_{a}dP_{S}^{\mathsf{v}}({\mathbf{S}}_{2}^{a}(\mathsf{v}))\times\prod_{\mathsf{v}\in\mathsf{V}}\prod_{a<b}e^{\frac{1}{2}\bar{\tau}^{ab}_{\mathsf{v}}(\bm{{\mathcal{Q}}}_{W}){\rm Tr}{\mathbf{S}}_{2}^{a}(\mathsf{v}){\mathbf{S}}_{2}^{b}(\mathsf{v})}, \tag{95}
$$
where $P_{S}^{\mathsf{v}}$ is the law of a random matrix $\mathsf{v}\bar{{\mathbf{W}}}\bar{{\mathbf{W}}}^{\intercal}|\mathcal{I}_{\mathsf{v}}|^{-1/2}$ with $\bar{\mathbf{W}}\in\mathbb{R}^{d\times|\mathcal{I}_{\mathsf{v}}|}$ having i.i.d. standard Gaussian entries. For properly chosen $(\bar{\tau}_{\mathsf{v}}^{ab})$, (94) is verified for this simplified measure.
However, the order parameters $({\mathcal{Q}}_{2}^{ab}(\mathsf{v},\mathsf{v}^{\prime}))$ are difficult to deal with if kept in full generality, as they imply not only coupled replicas $({\mathbf{S}}_{2}^{a}(\mathsf{v}))_{a}$ for a given $\mathsf{v}$ (a kind of coupling that is easily linearised with a single Hubbard-Stratonovich transformation, within the replica symmetric treatment justified in Bayes-optimal learning), but also a coupling between different values of the variable $\mathsf{v}$. Linearising the latter would yield a more complicated matrix model than the integral reported in (D.3), because the resulting coupling field would break rotational invariance, and therefore the model does not have a form known to be solvable; see Kazakov (2000).
A first idea to simplify $P(({\mathbf{S}}^{a}_{2})\mid\bm{{\mathcal{Q}}}_{W})$ (14) while taking into account the additional structure induced by (93), (94), and maintaining a solvable model, is to consider a generalisation of the relaxation (83). This entails entirely dropping the dependencies among matrix entries induced by their Wishart-like form (92), for each ${\mathbf{S}}_{2}^{a}(\mathsf{v})$. In this case, the moment constraints (94) can be exactly enforced by choosing the simplified measure
$$
\displaystyle d\bar{P}(({\mathbf{S}}_{2}^{a})\mid\bm{{\mathcal{Q}}}_{W})=\prod_{\mathsf{v}\in\mathsf{V}}\prod_{a=0}^{s}d{\mathbf{S}}^{a}_{2}(\mathsf{v})\prod_{\alpha=1}^{d}\delta(S^{a}_{2;\alpha\alpha}(\mathsf{v})-\mathsf{v}\sqrt{|\mathcal{I}_{\mathsf{v}}|})\times\prod_{\mathsf{v}\in\mathsf{V}}\prod_{\alpha_{1}<\alpha_{2}}^{d}\frac{e^{-\frac{1}{2}\sum_{a,b=0}^{s}S^{a}_{2;\alpha_{1}\alpha_{2}}(\mathsf{v})\bar{\tau}_{\mathsf{v}}^{ab}(\bm{{\mathcal{Q}}}_{W})S^{b}_{2;\alpha_{1}\alpha_{2}}(\mathsf{v})}}{\sqrt{(2\pi)^{s+1}\det(\bar{\bm{\tau}}_{\mathsf{v}}(\bm{{\mathcal{Q}}}_{W})^{-1})}}. \tag{96}
$$
The parameters $(\bar{\tau}^{ab}_{\mathsf{v}}(\bm{{\mathcal{Q}}}_{W}))$ are then properly chosen to enforce (94) for all $0\leq a\leq b\leq s$ and $\mathsf{v},\mathsf{v}^{\prime}\in\mathsf{V}$. Using this measure, the resulting entropic term, taking into account the degeneracy of the order parameters $({\mathcal{Q}}_{2}^{ab}(\mathsf{v},\mathsf{v}^{\prime}))$ and $({\mathcal{Q}}_{W}^{ab}(\mathsf{v}))$, remains tractable through Gaussian integrals (the energetic term is obviously unchanged once we express $(Q_{2}^{ab})$ entering it using these new order parameters through the identity (93), keeping in mind that nothing changes for higher-order overlaps compared to the theory in the main text). We leave for future work the analysis of this Gaussian relaxation and of other possible simplifications of (95) leading to solvable models.
Appendix F Linking free entropy and mutual information
It is possible to relate the mutual information (MI) of the inference task to the free entropy $f_{n}=\mathbb{E}\ln\mathcal{Z}$ introduced in the main text. Indeed, we can write the MI as
$$
\frac{I({\mathbf{W}}^{0};\mathcal{D})}{kd}=\frac{\mathcal{H}(\mathcal{D})}{kd}-\frac{\mathcal{H}(\mathcal{D}\mid{\mathbf{W}}^{0})}{kd}, \tag{97}
$$
where $\mathcal{H}(Y\mid X)$ is the conditional Shannon entropy of $Y$ given $X$ . It is straightforward to show that the free entropy is
$$
-\frac{\alpha}{\gamma}f_{n}=\frac{\mathcal{H}(\{y_{\mu}\}_{\mu\leq n}\mid\{{\mathbf{x}}_{\mu}\}_{\mu\leq n})}{kd}=\frac{\mathcal{H}(\mathcal{D})}{kd}-\frac{\mathcal{H}(\{{\mathbf{x}}_{\mu}\}_{\mu\leq n})}{kd}, \tag{98}
$$
by the chain rule for the entropy. On the other hand, $\mathcal{H}(\mathcal{D}\mid{\mathbf{W}}^{0})=\mathcal{H}(\{y_{\mu}\}\mid{\mathbf{W}}^{0},\{{\mathbf{x}}_{\mu}\})+\mathcal{H}(\{{\mathbf{x}}_{\mu}\})$, i.e.,
$$
\frac{\mathcal{H}(\mathcal{D}\mid{\mathbf{W}}^{0})}{kd}\approx-\frac{\alpha}{\gamma}\mathbb{E}_{\lambda}\int dyP_{\text{out}}(y\mid\lambda)\ln P_{\text{out}}(y\mid\lambda)+\frac{\mathcal{H}(\{{\mathbf{x}}_{\mu}\}_{\mu\leq n})}{kd}, \tag{99}
$$
where $\lambda\sim{\mathcal{N}}(0,r_{K})$, with $r_{K}$ given by (53) (assuming here that $\mu_{0}=0$; see App. D.5 if the activation $\sigma$ is non-centred), and the equality holds asymptotically in the thermodynamic limit. This allows us to express the MI as
$$
\frac{I({\mathbf{W}}^{0};\mathcal{D})}{kd}=-\frac{\alpha}{\gamma}f_{n}+\frac{\alpha}{\gamma}\mathbb{E}_{\lambda}\int dyP_{\text{out}}(y|\lambda)\ln P_{\text{out}}(y|\lambda). \tag{100}
$$
Specialising the equation to the Gaussian channel, one obtains
$$
\frac{I({\mathbf{W}}^{0};\mathcal{D})}{kd}=-\frac{\alpha}{\gamma}f_{n}-\frac{\alpha}{2\gamma}\ln(2\pi e\Delta). \tag{101}
$$
Note that the choice of normalising by $kd$ is not accidental. Indeed, the number of parameters is $kd+k\approx kd$. Hence, with this choice, one can interpret the parameter $\alpha$ as an effective signal-to-noise ratio.
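The channel term in (100) can be evaluated in closed form for the Gaussian channel. As a quick quadrature sanity check (a sketch, not part of the derivation): $\mathbb{E}_{\lambda}\int dy\,P_{\text{out}}(y|\lambda)\ln P_{\text{out}}(y|\lambda)=-\frac{1}{2}\ln(2\pi e\Delta)$, independently of $\lambda$, which is how (101) follows from (100).

```python
import numpy as np

# Quadrature check that the Gaussian-channel term in (100) equals
# -0.5*ln(2*pi*e*Delta) for any lambda, yielding (101).
Delta = 0.1

def avg_log_out(lam, n=80):
    z, w = np.polynomial.hermite_e.hermegauss(n)   # probabilists' nodes/weights
    y = lam + np.sqrt(Delta) * z                   # y distributed as P_out(.|lam)
    logp = -0.5 * (y - lam) ** 2 / Delta - 0.5 * np.log(2 * np.pi * Delta)
    return (w * logp).sum() / np.sqrt(2 * np.pi)   # E[ln P_out(y|lam)]

exact = -0.5 * np.log(2 * np.pi * np.e * Delta)
print(avg_log_out(0.0), avg_log_out(3.7), exact)   # all three coincide
```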
**Remark F.1**
*The arguments of Barbier et al. (2025) showing the existence of an upper bound on the mutual information per variable in the case of discrete variables, and the associated inevitable breaking of prior universality beyond a certain threshold in matrix denoising, apply to the present model too. They imply, as in the aforementioned paper, that the mutual information per variable cannot exceed $\ln 2$ for Rademacher inner weights. Our theory is consistent with this fact; it is a direct consequence of the analysis in App. G (see in particular (108)) specialised to a binary prior over ${\mathbf{W}}$.*
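A scalar caricature of this bound (an illustration only, not the matrix model of the text): for a single Rademacher weight $w$ observed through a Gaussian channel $y=\sqrt{s}\,w+\xi$, the mutual information is increasing in the SNR $s$ and saturates at $\mathcal{H}(w)=\ln 2$.

```python
import numpy as np

# I(w; y) for w = +/-1 uniform and y = sqrt(s) w + xi, xi ~ N(0,1). The
# standard formula is I(s) = ln 2 - E_z ln(1 + exp(-2(s + sqrt(s) z))),
# z ~ N(0,1); the Gaussian average is taken on a fine trapezoidal grid.
def mi_rademacher(s):
    z = np.linspace(-12.0, 12.0, 100001)
    h = z[1] - z[0]
    pdf = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    t = np.logaddexp(0.0, -2.0 * (s + np.sqrt(s) * z))   # stable ln(1 + e^x)
    g = t * pdf
    return np.log(2.0) - 0.5 * h * (g[:-1] + g[1:]).sum()

for s in (0.5, 2.0, 10.0, 50.0):
    print(s, mi_rademacher(s))   # increasing in s, approaching ln 2 ~ 0.6931
```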
Appendix G Large sample rate limit of $f_{\rm RS}^{\alpha,\gamma}$
In this section we show that when the prior over the weights ${\mathbf{W}}$ is discrete the MI can never exceed the entropy of the prior itself.
To do this, we first need to control the function $\rm mmse$ when its argument is large. By a saddle point argument, it is not difficult to show that the leading term of ${\rm mmse}_{S}(\tau)$ when $\tau\to\infty$ is of the form $C(\gamma)/\tau$ for a proper constant $C$ depending at most on the rectangularity ratio $\gamma$.
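As a point of comparison for this $C(\gamma)/\tau$ behaviour, one can verify the analogous statement in the scalar Gaussian channel (a toy analogue, not the matrix quantity ${\rm mmse}_S$ itself): for $w\sim\mathcal{N}(0,1)$ observed as $y=\sqrt{\tau}\,w+z$, the MMSE is exactly $1/(1+\tau)$, i.e. a $C/\tau$ leading term with $C=1$.

```python
import numpy as np

# Scalar Gaussian analogue of the large-tau behaviour: the posterior mean is
# sqrt(tau)*y/(1+tau), and the MMSE, evaluated here by 2D Gauss-Hermite
# quadrature over (w, z), equals 1/(1+tau), so tau * mmse(tau) -> 1.
def mmse_gauss(tau, n=60):
    nodes, wts = np.polynomial.hermite_e.hermegauss(n)
    W, Z = np.meshgrid(nodes, nodes, indexing="ij")
    Y = np.sqrt(tau) * W + Z
    post_mean = np.sqrt(tau) * Y / (1.0 + tau)     # exact Gaussian posterior mean
    weights = np.outer(wts, wts) / (2.0 * np.pi)   # normalised 2D quadrature
    return (weights * (W - post_mean) ** 2).sum()

for tau in (10.0, 100.0, 1000.0):
    print(tau, tau * mmse_gauss(tau))   # tends to 1 as tau grows
```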
We now notice that the equation for $\hat{\mathcal{Q}}_{W}(v)$ in (76) can be rewritten as
$$
\displaystyle\hat{\mathcal{Q}}_{W}(v)=\frac{1}{2\gamma}[{\rm mmse}_{S}(\tau)-{\rm mmse}_{S}(\tau+\hat{q}_{2})]\partial_{{\mathcal{Q}}_{W}(v)}\tau+2\frac{\alpha}{\gamma}\partial_{{\mathcal{Q}}_{W}(v)}\psi_{P_{\text{out}}}(q_{K}(q_{2},\mathcal{Q}_{W});r_{K}). \tag{102}
$$
For $\alpha\to\infty$ we make the self-consistent ansatz $\mathcal{Q}_{W}(v)=1-o_{\alpha}(1)$. As a consequence, $1/\tau$ has to vanish as $o_{\alpha}(1)$ too, by the moment matching condition (74). Using the very same equation, we can also evaluate $\partial_{\mathcal{Q}_{W}(v)}\tau$ as follows:
$$
\displaystyle\partial_{\mathcal{Q}_{W}(v)}\tau=\frac{-2v^{2}\mathcal{Q}_{W}(v)}{{\rm mmse}_{S}^{\prime}(\tau)}\sim\tau^{2} \tag{103}
$$
as $\alpha→∞$ , where we have used ${\rm mmse}_{S}(\tau)\sim C(\gamma)/\tau$ to estimate the derivative. We use the same approximation for the two $\rm mmse$ ’s appearing in the fixed point equation for $\hat{\mathcal{Q}}_{W}(v)$ :
$$
\displaystyle\hat{\mathcal{Q}}_{W}(v)\sim\frac{\hat{q}_{2}}{2\gamma(\tau(\tau+\hat{q}_{2}))}\tau^{2}+2\frac{\alpha}{\gamma}\partial_{{\mathcal{Q}}_{W}(v)}\psi_{P_{\text{out}}}(q_{K}(q_{2},\mathcal{Q}_{W});r_{K}). \tag{104}
$$
From the last equation in (76) we see that $\hat{q}_{2}$ cannot diverge faster than $O(\alpha)$. Thanks to the above approximation and to the first equation of (76), this entails that $\mathcal{Q}_{W}(v)$ approaches $1$ exponentially fast in $\alpha$, which in turn implies that $\tau$ diverges exponentially in $\alpha$. As a consequence,
$$
\displaystyle\frac{\tau^{2}}{\tau(\tau+\hat{q}_{2})}\sim 1. \tag{105}
$$
Furthermore, one also has
$$
\displaystyle\frac{1}{\alpha}[\iota(\tau)-\iota(\tau+\hat{q}_{2})]=-\frac{1}{4\alpha}\int_{\tau}^{\tau+\hat{q}_{2}}{\rm mmse}_{S}(t)\,dt\approx-\frac{C(\gamma)}{4\alpha}\log\Big(1+\frac{\hat{q}_{2}}{\tau}\Big)\xrightarrow[]{\alpha\to\infty}0, \tag{106}
$$
as $\frac{\hat{q}_{2}}{\tau}$ vanishes with exponential speed in $\alpha$ .
Concerning the function $\psi_{P_{W}}$: given that it is related to a Bayes-optimal scalar Gaussian channel, and that its SNRs $\hat{\mathcal{Q}}_{W}(v)$ are all diverging, one can compute the integral by saddle point, which is inevitably attained at the ground truth:
$$
\displaystyle\psi_{P_{W}}(\hat{\mathcal{Q}}_{W}(v))-\frac{\hat{\mathcal{Q}}_{W}(v)\mathcal{Q}_{W}(v)}{2}\approx\mathbb{E}_{w^{0}}\ln\int dP_{W}(w)\mathbbm{1}(w=w^{0})+\mathbb{E}\Big{[}(\sqrt{\hat{\mathcal{Q}}_{W}(v)}\xi+\hat{\mathcal{Q}}_{W}(v)w^{0})w^{0}-\frac{\hat{\mathcal{Q}}_{W}(v)}{2}(w^{0})^{2}\Big{]}-\frac{\hat{\mathcal{Q}}_{W}(v)(1-o_{\alpha}(1))}{2}=-\mathcal{H}(W)+o_{\alpha}(1). \tag{107}
$$
Considering that $\psi_{P_{\text{out}}}(q_{K}(q_{2},\mathcal{Q}_{W});r_{K})\xrightarrow[]{\alpha\to\infty}\psi_{P_{\text{out}}}(r_{K};r_{K})$, and using (100), it is then straightforward to check that our RS version of the MI saturates to the entropy of the prior $P_{W}$ when $\alpha\to\infty$:
$$
\displaystyle-\frac{\alpha}{\gamma}\text{extr}f_{\rm RS}^{\alpha,\gamma}+\frac{\alpha}{\gamma}\mathbb{E}_{\lambda}\int dyP_{\text{out}}(y|\lambda)\ln P_{\text{out}}(y|\lambda)\xrightarrow[]{\alpha\to\infty}\mathcal{H}(W). \tag{108}
$$
Appendix H Extension of GAMP-RIE to arbitrary activation
Algorithm 1 GAMP-RIE for training shallow neural networks with arbitrary activation
Input: Fresh data point ${\mathbf{x}}_{\text{test}}$ with unknown associated response $y_{\text{test}}$ , dataset $\mathcal{D}=\{({\mathbf{x}}_{\mu},y_{\mu})\}_{\mu=1}^{n}$ .
Output: Estimator $\hat{y}_{\text{test}}$ of $y_{\text{test}}$ .
Estimate $y^{(0)}:=\mu_{0}{\mathbf{v}}^{\intercal}\bm{1}/\sqrt{k}$ as
$$
\hat{y}^{(0)}=\frac{1}{n}\sum_{\mu}y_{\mu};
$$
Estimate $\langle{\mathbf{W}}^{\intercal}{\mathbf{v}}\rangle/\sqrt{k}$ using (117).
Estimate the $\mu_{1}$ term in the Hermite expansion (111) as
$$
\displaystyle\hat{y}_{\mu}^{(1)}=\mu_{1}\frac{\langle{\mathbf{v}}^{\intercal}{\mathbf{W}}\rangle{\mathbf{x}}_{\mu}}{\sqrt{kd}};
$$
Compute
$$
\displaystyle\tilde{y}_{\mu}=\frac{y_{\mu}-\hat{y}_{\mu}^{(0)}-\hat{y}_{\mu}^{(1)}}{\mu_{2}/2};\qquad\tilde{\Delta}=\frac{\Delta+g(1)}{\mu_{2}^{2}/4};
$$
Input $\{({\mathbf{x}}_{\mu},\tilde{y}_{\mu})\}_{\mu=1}^{n}$ and $\tilde{\Delta}$ into Algorithm 1 in Maillard et al. (2024a) to estimate $\langle{\mathbf{W}}^{\intercal}({\mathbf{v}}){\mathbf{W}}\rangle$;
Output
$$
\displaystyle\hat{y}_{\text{test}}=\hat{y}^{(0)}+\mu_{1}\frac{\langle{\mathbf{v}}^{\intercal}{\mathbf{W}}\rangle{\mathbf{x}}_{\text{test}}}{\sqrt{kd}}+\frac{\mu_{2}}{2}\frac{1}{d\sqrt{k}}{\rm Tr}[({\mathbf{x}}_{\text{test}}{\mathbf{x}}_{\text{test}}^{\intercal}-{\mathbb{I}})\langle{\mathbf{W}}^{\intercal}({\mathbf{v}}){\mathbf{W}}\rangle].
$$
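A minimal sketch of the preprocessing steps of Algorithm 1 on synthetic ReLU data follows. Two simplifications are made for illustration only: the linear-term estimate (the step using (117)) is replaced by its oracle value computed from the teacher, and the GAMP-RIE call of Maillard et al. (2024a) is omitted, so only the label recentring and rescaling arithmetic is shown.

```python
import numpy as np

# Preprocessing sketch for Algorithm 1 (oracle linear term; no GAMP-RIE call).
rng = np.random.default_rng(1)
d, k, n, Delta = 50, 25, 2000, 0.1
mu0, mu1, mu2 = 1 / np.sqrt(2 * np.pi), 0.5, 1 / np.sqrt(2 * np.pi)  # ReLU Hermite coefs
v = np.ones(k)                                   # fixed readouts, as in Fig. 7
W0 = rng.standard_normal((k, d))                 # teacher inner weights
X = rng.standard_normal((n, d))                  # inputs
y = v @ np.maximum(W0 @ X.T / np.sqrt(d), 0.0) / np.sqrt(k) \
    + np.sqrt(Delta) * rng.standard_normal(n)    # noisy labels

y0_hat = y.mean()                                # estimates mu0 v^T 1 / sqrt(k)
y1_hat = mu1 * (v @ W0 @ X.T) / np.sqrt(k * d)   # oracle first Hermite term
y_tilde = (y - y0_hat - y1_hat) / (mu2 / 2.0)    # rescaled labels
# With an estimate g1 of the discarded Hermite-tail variance g(1), the rescaled
# noise level fed to the GAMP-RIE would be Delta_tilde = (Delta + g1)/(mu2**2/4).
print(y0_hat, mu0 * np.sqrt(k))                  # empirical vs population value
```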
For simplicity, let us consider $P_{\rm out}(y\mid\lambda)=\exp(-\frac{1}{2\Delta}(y-\lambda)^{2})/\sqrt{2\pi\Delta}$ , which entails:
$$
\displaystyle y_{\mu}\mid({\bm{\theta}}^{0},{\mathbf{x}}_{\mu})\overset{\rm{d}}{=}\frac{{\mathbf{v}}^{\intercal}}{\sqrt{k}}\sigma\Big{(}\frac{{\mathbf{W}}^{0}{\mathbf{x}}_{\mu}}{\sqrt{d}}\Big{)}+\sqrt{\Delta}\,z_{\mu},\quad\mu=1,\dots,n, \tag{110}
$$
where $z_{\mu}$ are i.i.d. standard Gaussian random variables and $\overset{\rm d}{{}={}}$ means equality in law. Expanding $\sigma$ in the Hermite polynomial basis we have
$$
\displaystyle y_{\mu}\mid({\bm{\theta}}^{0},{\mathbf{x}}_{\mu})\overset{\rm{d}}{=}\mu_{0}\frac{{\mathbf{v}}^{\intercal}\bm{1}_{k}}{\sqrt{k}}+\mu_{1}\frac{{\mathbf{v}}^{\intercal}{\mathbf{W}}^{0}{\mathbf{x}}_{\mu}}{\sqrt{kd}}+\frac{\mu_{2}}{2}\frac{{\mathbf{v}}^{\intercal}}{\sqrt{k}}{\rm He}_{2}\Big{(}\frac{{\mathbf{W}}^{0}{\mathbf{x}}_{\mu}}{\sqrt{d}}\Big{)}+\dots+\sqrt{\Delta}z_{\mu} \tag{111}
$$
where $...$ represents the terms beyond second order. Without loss of generality, for this choice of output channel we can set $\mu_{0}=0$ as discussed in App. D.5. For low enough $\alpha$ it is reasonable to assume that the higher order terms in $...$ cannot be learnt given quadratically many samples and, as a result, play the role of an effective noise, which we assume independent of the first three terms. We shall see that this reasoning actually applies to the extension of the GAMP-RIE we derive, which plays the role of a “smart” spectral algorithm, regardless of the value of $\alpha$. Therefore, these terms accumulate into an asymptotically Gaussian noise thanks to the central limit theorem (it is a projection of a centred function applied entry-wise to a vector with i.i.d. entries), with variance $g(1)$ (see (43)). We thus obtain the effective model
$$
\displaystyle y_{\mu}\mid({\bm{\theta}}^{0},{\mathbf{x}}_{\mu})\overset{\rm{d}}{=}\mu_{1}\frac{{\mathbf{v}}^{\intercal}{\mathbf{W}}^{0}{\mathbf{x}}_{\mu}}{\sqrt{kd}}+\frac{\mu_{2}}{2}\frac{{\mathbf{v}}^{\intercal}}{\sqrt{k}}{\rm He}_{2}\Big{(}\frac{{\mathbf{W}}^{0}{\mathbf{x}}_{\mu}}{\sqrt{d}}\Big{)}+\sqrt{\Delta+g(1)}\,z_{\mu}. \tag{112}
$$
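The Hermite coefficients $\mu_k=\mathbb{E}[\sigma(Z)\,{\rm He}_k(Z)]$, $Z\sim\mathcal{N}(0,1)$, entering the expansion (111) are easy to evaluate numerically. A quadrature sketch for ReLU (probabilists' convention, ${\rm He}_0=1$, ${\rm He}_1(z)=z$, ${\rm He}_2(z)=z^2-1$), which recovers $\mu_0=\mu_2=1/\sqrt{2\pi}$ and $\mu_1=1/2$:

```python
import numpy as np

# Hermite coefficients mu_k = E[sigma(Z) He_k(Z)] for sigma = ReLU, evaluated
# by trapezoidal integration on a fine grid (tails beyond |z| = 10 negligible).
z = np.linspace(-10.0, 10.0, 200001)
h = z[1] - z[0]
pdf = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
sigma = np.maximum(z, 0.0)                       # ReLU

def gauss_mean(f):
    """Trapezoidal approximation of E[f(Z)], Z ~ N(0,1)."""
    g = f * pdf
    return 0.5 * h * (g[:-1] + g[1:]).sum()

mu0 = gauss_mean(sigma)                  # = 1/sqrt(2*pi) ~ 0.39894
mu1 = gauss_mean(sigma * z)              # = 1/2
mu2 = gauss_mean(sigma * (z ** 2 - 1))   # = 1/sqrt(2*pi) ~ 0.39894
print(mu0, mu1, mu2)
```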
The first term in this expression can be learnt with vanishing error given quadratically many samples (Remark H.1), thus can be ignored. This further simplifies the model to
$$
\displaystyle\bar{y}_{\mu}:=y_{\mu}-\mu_{1}\frac{{\mathbf{v}}^{\intercal}{\mathbf{W}}^{0}{\mathbf{x}}_{\mu}}{\sqrt{kd}}\overset{\rm d}{=}\frac{\mu_{2}}{2}\frac{{\mathbf{v}}^{\intercal}}{\sqrt{k}}{\rm He}_{2}\Big{(}\frac{{\mathbf{W}}^{0}{\mathbf{x}}_{\mu}}{\sqrt{d}}\Big{)}+\sqrt{\Delta+g(1)}\,z_{\mu}, \tag{113}
$$
where $\bar{y}_{\mu}$ is $y_{\mu}$ with the (asymptotically) perfectly learnt linear term removed, and the last equality in distribution is again conditional on $({\bm{\theta}}^{0},{\mathbf{x}}_{\mu})$ . From the formula
$$
\displaystyle\frac{{\mathbf{v}}^{\intercal}}{\sqrt{k}}{\rm He}_{2}\Big{(}\frac{{\mathbf{W}}^{0}{\mathbf{x}}_{\mu}}{\sqrt{d}}\Big{)}={\rm Tr}\,\frac{{\mathbf{W}}^{0\intercal}({\mathbf{v}}){\mathbf{W}}^{0}}{d\sqrt{k}}{\mathbf{x}}_{\mu}{\mathbf{x}}_{\mu}^{\intercal}-\frac{{\mathbf{v}}^{\intercal}\bm{1}_{k}}{\sqrt{k}}\approx\frac{1}{\sqrt{k}d}{\rm Tr}[({\mathbf{x}}_{\mu}{\mathbf{x}}_{\mu}^{\intercal}-{\mathbb{I}}_{d}){\mathbf{W}}^{0\intercal}({\mathbf{v}}){\mathbf{W}}^{0}], \tag{114}
$$
where $\approx$ exploits the concentration ${\rm Tr}\,{\mathbf{W}}^{0\intercal}({\mathbf{v}}){\mathbf{W}}^{0}/(d\sqrt{k})\to{\mathbf{v}}^{\intercal}\bm{1}_{k}/\sqrt{k}$, and the Gaussian equivalence property that ${\mathbf{M}}_{\mu}:=({\mathbf{x}}_{\mu}{\mathbf{x}}_{\mu}^{\intercal}-{\mathbb{I}}_{d})/\sqrt{d}$ behaves like a GOE sensing matrix, i.e., a symmetric matrix whose upper triangular part has i.i.d. entries from $\mathcal{N}(0,(1+\delta_{ij})/d)$ (Maillard et al., 2024a). The model can thus be seen as a GLM with signal $\bar{\mathbf{S}}^{0}_{2}:={\mathbf{W}}^{0\intercal}({\mathbf{v}}){\mathbf{W}}^{0}/\sqrt{kd}$:
$$
\displaystyle y^{\rm GLM}_{\mu}=\frac{\mu_{2}}{2}{\rm Tr}[{\mathbf{M}}_{\mu}\bar{\mathbf{S}}^{0}_{2}]+\sqrt{\Delta+g(1)}\,z_{\mu}. \tag{115}
$$
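The first equality in (114) is pure algebra and can be checked directly on a random instance, reading $({\mathbf{v}})$ as ${\rm diag}({\mathbf{v}})$; the sketch below also reports the concentration gap incurred by the subsequent $\approx$ step.

```python
import numpy as np

# Check of the exact equality in (114): v^T He_2(W0 x / sqrt(d)) / sqrt(k)
# equals Tr[W0^T diag(v) W0 x x^T]/(d sqrt(k)) - v^T 1_k / sqrt(k).
rng = np.random.default_rng(2)
d, k = 30, 15
v = rng.choice([-1.0, 1.0], size=k)          # e.g. Rademacher readouts
W0 = rng.standard_normal((k, d))
x = rng.standard_normal(d)

pre = W0 @ x / np.sqrt(d)                    # teacher preactivations
lhs = v @ (pre ** 2 - 1.0) / np.sqrt(k)      # He_2(u) = u^2 - 1 entrywise

A = W0.T @ np.diag(v) @ W0                   # W0^T (v) W0
lhs_check = (x @ A @ x) / (d * np.sqrt(k)) - v.sum() / np.sqrt(k)

# The ~ step then replaces v^T 1_k / sqrt(k) by Tr A/(d sqrt(k)), which
# concentrates onto the same value at large d.
approx = np.trace((np.outer(x, x) - np.eye(d)) @ A) / (d * np.sqrt(k))
print(lhs - lhs_check, lhs - approx)   # first gap numerically zero; second small
```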
Starting from this equation, the arguments of App. D and Maillard et al. (2024a), based on known results on the GLM (Barbier et al., 2019) and matrix denoising (Barbier & Macris, 2022; Maillard et al., 2022; Pourkamali et al., 2024), allow us to obtain the free entropy of this matrix sensing problem. The result is consistent with the $\mathcal{Q}_{W}\equiv 0$ solution of the saddle point equations obtained from the replica method in App. D, which, as anticipated, corresponds to the case where the Hermite components of the signal beyond the second one are not learnt.
Note that, as supported by the numerics, the model actually admits specialisation when $\alpha$ is big enough; hence the above equivalence cannot hold on the whole phase diagram at the information-theoretic level. In fact, if specialisation occurs one cannot treat the $...$ terms in (111) as noise uncorrelated with the first ones, as the model aligns with the actual teacher’s weights, so that it learns all the successive terms at once.
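The Gaussian equivalence invoked above can also be probed empirically at the level of the first two moments of the entries of ${\mathbf{M}}_{\mu}$ (a minimal Monte Carlo check of moments only, not a proof of the equivalence):

```python
import numpy as np

# Entries of M = (x x^T - I)/sqrt(d) for x ~ N(0, I_d): an off-diagonal entry
# has mean 0 and variance 1/d, a diagonal entry mean 0 and variance 2/d,
# matching the GOE convention N(0, (1 + delta_ij)/d) quoted in the text.
rng = np.random.default_rng(3)
d, trials = 40, 200000
x1, x2 = rng.standard_normal((2, trials))
off = x1 * x2 / np.sqrt(d)           # M_{01} across independent samples of x
diag = (x1 ** 2 - 1.0) / np.sqrt(d)  # M_{00} across independent samples of x
print(off.mean(), d * off.var())     # ~ 0 and ~ 1
print(diag.mean(), d * diag.var())   # ~ 0 and ~ 2
```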
Figure 7: Theoretical prediction (solid curves) of the Bayes-optimal mean-square generalisation error for binary inner weights and ReLU and ELU activations, with $\gamma=0.5$ , $d=150$ , Gaussian label noise with $\Delta=0.1$ , and fixed readouts ${\mathbf{v}}=\mathbf{1}$ . Dashed lines are obtained from the solution of the fixed-point equations (76) with all $\mathcal{Q}_{W}(\mathsf{v})=0$ . Circles are the test error of the GAMP-RIE (Maillard et al., 2024a) extended to generic activations. The MCMC points initialised uninformatively (inset) are obtained using (36), to account for the lack of equilibration due to glassiness, which prevents using (38). Even in the possibly glassy region, the GAMP-RIE attains the universal branch performance. Data for GAMP-RIE and MCMC are averaged over 16 data instances, with error bars representing one standard deviation over instances.
We now assume that this mapping holds at the algorithmic level, namely, that we can process the data algorithmically as if they were coming from the identified GLM, and thus try to infer the signal $\bar{\mathbf{S}}_{2}^{0}={\mathbf{W}}^{0\intercal}\mathrm{diag}({\mathbf{v}}){\mathbf{W}}^{0}/\sqrt{kd}$ and construct a predictor from it. Based on this idea, we propose Algorithm 1, which can indeed reach the performance predicted by the $\mathcal{Q}_{W}\equiv 0$ solution of our replica theory.
**Remark H.1**
*In the linear data regime, where $n/d$ converges to a fixed constant $\alpha_{1}$ , only the first term in (111) can be learnt while the rest behaves like noise. By the same argument as above, the model is equivalent to
$$
y_{\mu}=\mu_{1}\frac{{\mathbf{v}}^{\intercal}{\mathbf{W}}^{0}{\mathbf{x}}_{\mu}}{\sqrt{kd}}+\sqrt{\Delta+\nu-\mu_{0}^{2}-\mu_{1}^{2}}\,z_{\mu}, \tag{116}
$$
where $\nu=\mathbb{E}_{z\sim{\mathcal{N}}(0,1)}\sigma^{2}(z)$ . This is again a GLM with signal ${\mathbf{S}}_{1}^{0}={\mathbf{W}}^{0\intercal}{\mathbf{v}}/\sqrt{k}$ and Gaussian sensing vectors ${\mathbf{x}}_{\mu}$ . Define $q_{1}$ as the limit of ${\mathbf{S}}_{1}^{a\intercal}{\mathbf{S}}_{1}^{b}/d$ , where ${\mathbf{S}}_{1}^{a},{\mathbf{S}}_{1}^{b}$ are drawn independently from the posterior. As $k\to\infty$ , the signal converges in law to a standard Gaussian vector. Using known results on GLMs with Gaussian signal (Barbier et al., 2019), we obtain the following equations characterising $q_{1}$ :
$$
q_{1}=\frac{\hat{q}_{1}}{\hat{q}_{1}+1},\qquad\hat{q}_{1}=\frac{\alpha_{1}}{1+\Delta_{1}-q_{1}},\quad\text{where}\quad\Delta_{1}=\frac{\Delta+\nu-\mu_{0}^{2}-\mu_{1}^{2}}{\mu_{1}^{2}}.
$$
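A minimal numerical sketch of solving the fixed-point equations for $q_{1}$ above by damped iteration (the effective noise level $\Delta_{1}$ is passed in directly; computing it requires $\Delta$, $\nu$, $\mu_{0}$, $\mu_{1}$ for the chosen activation):

```python
import numpy as np

def solve_q1(alpha1, Delta1, damping=0.5, tol=1e-12, max_iter=10_000):
    """Damped fixed-point iteration of q1 = q1_hat/(q1_hat+1),
    q1_hat = alpha1/(1+Delta1-q1), starting from the uninformed point q1=0."""
    q1 = 0.0
    for _ in range(max_iter):
        q1_hat = alpha1 / (1.0 + Delta1 - q1)
        q1_new = q1_hat / (q1_hat + 1.0)
        if abs(q1_new - q1) < tol:
            return q1_new
        q1 = damping * q1_new + (1.0 - damping) * q1
    return q1

# The overlap grows towards 1 as the sample ratio alpha1 increases.
print(solve_q1(alpha1=1.0, Delta1=0.5))
print(solve_q1(alpha1=50.0, Delta1=0.5))
```

For $\alpha_{1}=1$, $\Delta_{1}=0.5$ the quadratic $q_{1}^{2}-2.5\,q_{1}+1=0$ gives $q_{1}=0.5$ exactly, a quick sanity check on the iteration.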
As $\alpha_{1}=n/d$ goes to infinity (entering the quadratic data regime), the overlap $q_{1}$ converges to $1$ and the first term in (111) is learnt with vanishing error. Moreover, since ${\mathbf{S}}_{1}^{0}$ is asymptotically Gaussian, the linear problem (116) is equivalent to denoising the Gaussian vector $({\mathbf{v}}^{\intercal}{\mathbf{W}}^{0}{\mathbf{x}}_{\mu}/\sqrt{kd})_{\mu=1}^{n}$ , whose covariance is known as a function of ${\mathbf{X}}=({\mathbf{x}}_{1},\dots,{\mathbf{x}}_{n})\in\mathbb{R}^{d\times n}$ . This leads to the following simple MMSE estimator for ${\mathbf{S}}_{1}^{0}$ :
$$
\langle{\mathbf{S}}_{1}^{0}\rangle=\frac{1}{\sqrt{d\Delta_{1}}}\left(\mathbf{I}+\frac{1}{d\Delta_{1}}{\mathbf{X}}{\mathbf{X}}^{\intercal}\right)^{-1}{\mathbf{X}}{\mathbf{y}}, \tag{117}
$$
where ${\mathbf{y}}=(y_{1},...,y_{n})$ . Note that the derivation of this estimator does not assume the Gaussianity of ${\mathbf{x}}_{\mu}$ .*
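A minimal sketch of estimator (117) on synthetic data, assuming the rescaling in which the responses have unit additive noise and $\Delta_{1}$ is known; the signal is sampled directly from its asymptotic Gaussian law:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 400, 800
Delta1 = 0.5  # effective noise (Delta + nu - mu0^2 - mu1^2)/mu1^2, assumed known

# Asymptotically (k -> infinity) the signal S1 = W0^T v / sqrt(k) is standard Gaussian
S1 = rng.standard_normal(d)
X = rng.standard_normal((d, n))
# Responses rescaled so that the additive noise has unit variance
y = X.T @ S1 / np.sqrt(d * Delta1) + rng.standard_normal(n)

# MMSE estimator (117): a ridge-regularised projection of y onto the data
S1_hat = np.linalg.solve(np.eye(d) + X @ X.T / (d * Delta1),
                         X @ y) / np.sqrt(d * Delta1)

# This concentrates on the overlap q1 predicted by the fixed-point equations
overlap = S1_hat @ S1 / d
print(overlap)
```

With $\alpha_{1}=2$ and $\Delta_{1}=0.5$, the fixed-point equations predict $q_{1}\approx 0.72$, and the empirical overlap above should concentrate near that value.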
**Remark H.2**
*The same argument can be easily generalised for general $P_{\text{out}}$ , leading to the following equivalent GLM in the universal ${\mathcal{Q}}_{W}^{*}\equiv 0$ phase of the quadratic data regime:
$$
y_{\mu}^{\rm GLM}\sim\tilde{P}_{\text{out}}(\cdot\mid{\rm Tr}[{\mathbf{M}}_{\mu}\bar{\mathbf{S}}^{0}_{2}]),\quad\text{where}\quad\tilde{P}_{\text{out}}(y\mid x):=\mathbb{E}_{z\sim\mathcal{N}(0,1)}P_{\text{out}}\Big(y\mid\frac{\mu_{2}}{2}x+z\sqrt{g(1)}\Big), \tag{118}
$$
and ${\mathbf{M}}_{\mu}$ are independent GOE sensing matrices.*
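A small sketch of sampling from this equivalent GLM, assuming for concreteness a Gaussian $P_{\text{out}}$ with variance $\Delta$ (so the smoothed channel is again Gaussian) and placeholder values for $\mu_{2}$ and $g(1)$; the signal matrix here is a stand-in for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 60, 30
mu2, g1, Delta = 1.0, 0.5, 0.1  # placeholder channel constants (problem dependent)

def goe(d, rng):
    """Symmetric GOE-like matrix with O(1/d) entry variance."""
    A = rng.standard_normal((d, d))
    return (A + A.T) / np.sqrt(2 * d)

S2 = goe(d, rng)                     # stand-in for the signal \bar S_2^0
M = [goe(d, rng) for _ in range(n)]  # independent GOE sensing matrices

# For Gaussian P_out the two noise sources combine:
# y ~ N(mu2/2 * Tr[M S2], g(1) + Delta)
x = np.array([np.trace(Mm @ S2) for Mm in M])
y = mu2 / 2 * x + np.sqrt(g1 + Delta) * rng.standard_normal(n)
```

Any algorithm solving this matrix-sensing GLM (such as Algorithm 1) can then be run on the pairs $({\mathbf{M}}_{\mu}, y_{\mu})$.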
**Remark H.3**
*One can show that the system of equations $({\rm S})$ in (LABEL:NSB_equations_gaussian_ch) with all $\mathcal{Q}_{W}(\mathsf{v})$ set to $0$ (and consequently $\tau=0$ ) can be mapped onto the fixed point of the state evolution equations (92), (94) of the GAMP-RIE in Maillard et al. (2024a), up to changes of variables. This confirms that when such a system has a unique solution, which is the case in all our tests, the GAMP-RIE asymptotically matches our universal solution. Assuming the validity of the aforementioned effective GLM, a potential improvement for discrete weights could come from a generalisation of GAMP which, in the denoising step, would correctly exploit the discrete prior over the inner weights rather than using the RIE (which is prior independent). However, the results of Barbier et al. (2025) suggest that optimally denoising matrices with discrete entries is hard, and that the RIE is the best efficient procedure to do so. Consequently, we tend to believe that improving on the GAMP-RIE in the case of discrete weights is out of reach without strong side information about the teacher, or without resorting to non-polynomial-time algorithms (see Appendix I).*
# Appendix I Algorithmic complexity of finding the specialisation solution
<details>
<summary>x15.png Details</summary>

### Visual Description
Semilog plot of gradient updates (log scale, roughly 10³ to 10⁴) versus dimension (roughly 50 to 250), with error bars and linear fits for three thresholds: ε* = 0.008 (fit slope 0.0146), ε* = 0.01 (slope 0.0138), ε* = 0.012 (slope 0.0136). All three series grow with dimension at similar rates.
</details>
<details>
<summary>x16.png Details</summary>

### Visual Description
Log-log plot of gradient updates (roughly 10³ to 10⁴) versus dimension (roughly 40 to 200), with error bars and linear fits for three thresholds: ε* = 0.008 (fit slope 1.4451), ε* = 0.01 (slope 1.4692), ε* = 0.012 (slope 1.5340). All three series grow with dimension.
</details>
<details>
<summary>x17.png Details</summary>

### Visual Description
Semilog scatter plot of gradient updates (log scale, roughly 10² to 10⁴) versus dimension (roughly 40 to 240), with error bars and linear fits for three thresholds: ε* = 0.008 (fit slope 0.0127), ε* = 0.01 (slope 0.0128), ε* = 0.012 (slope 0.0135). All three series grow with dimension at similar rates.
</details>
<details>
<summary>x18.png Details</summary>

### Visual Description
Log-log plot of gradient updates (roughly 10² to 10⁴) versus dimension (roughly 40 to 200), with error bars and linear fits for three thresholds: ε* = 0.008 (fit slope 1.2884), ε* = 0.01 (slope 1.3823), ε* = 0.012 (slope 1.5535). All three series grow with dimension.
</details>
<details>
<summary>x19.png Details</summary>

### Visual Description
Semilog scatter plot of gradient updates (log scale, roughly 10¹ to 10³) versus dimension (roughly 25 to 275), with error bars and linear fits for three thresholds: ε* = 0.008 (fit slope 0.0090), ε* = 0.01 (slope 0.0090), ε* = 0.012 (slope 0.0088). All three series grow with dimension at nearly identical rates.
</details>
<details>
<summary>x20.png Details</summary>

### Visual Description
Log-log plot of gradient updates (roughly 10² to 10³) versus dimension (roughly 40 to 200), with error bars and linear fits for three thresholds: ε* = 0.008 (fit slope 1.0114), ε* = 0.01 (slope 1.0306), ε* = 0.012 (slope 1.0967). All three series grow with dimension, with slopes close to 1 on the log-log scale.
</details>
Figure 8: Semilog (Left) and log-log (Right) plots of the number of gradient updates needed to achieve a test loss below the threshold $\varepsilon^{*}<\varepsilon^{\rm uni}$. Student network trained with ADAM with optimised batch size for each point. The dataset was generated from a teacher network with ReLU activation and parameters $\Delta=10^{-4}$ for the Gaussian noise variance of the linear readout, $\gamma=0.5$ and $\alpha=5.0$, for which $\varepsilon^{\rm opt}-\Delta=1.115\times 10^{-5}$. Points are obtained by averaging over 10 teacher/data instances, with error bars representing the standard deviation. Each row corresponds to a different distribution of the readouts, kept fixed during training. Top: homogeneous readouts, for which the error of the universal branch is $\varepsilon^{\rm uni}-\Delta=1.217\times 10^{-2}$. Centre: Rademacher readouts, for which $\varepsilon^{\rm uni}-\Delta=1.218\times 10^{-2}$. Bottom: Gaussian readouts, for which $\varepsilon^{\rm uni}-\Delta=1.210\times 10^{-2}$. The quality of the fits can be read from Table 2.
| Readouts | | Exp. fit, $\varepsilon^{*}_{1}$ | Exp. fit, $\varepsilon^{*}_{2}$ | Exp. fit, $\varepsilon^{*}_{3}$ | Power law, $\varepsilon^{*}_{1}$ | Power law, $\varepsilon^{*}_{2}$ | Power law, $\varepsilon^{*}_{3}$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Homogeneous | | $\bm{5.57}$ | $\bm{9.00}$ | $\bm{21.1}$ | $32.3$ | $26.5$ | $61.1$ |
| Rademacher | | $\bm{4.51}$ | $\bm{6.84}$ | $\bm{12.7}$ | $12.0$ | $17.4$ | $16.0$ |
| Uniform $[-\sqrt{3},\sqrt{3}]$ | | $\bm{5.08}$ | $\bm{1.44}$ | ${4.21}$ | $8.26$ | $8.57$ | $\bm{3.82}$ |
| Gaussian | | $2.66$ | $\bm{0.76}$ | $3.02$ | $\bm{0.55}$ | $2.31$ | $\bm{1.36}$ |
Table 2: $\chi^{2}$ test for exponential and power-law fits for the time needed by ADAM to reach the thresholds $\varepsilon^{*}$ , for various priors on the readouts. Fits are displayed in Figure 8. Smaller values of $\chi^{2}$ (in bold, for given threshold and readouts) indicate a better compatibility with the hypothesis.
<details>
<summary>x21.png Details</summary>

Gradient updates vs. dimension (linear scale; dimension 40–200, updates up to ≈7000): blue data points with error bars rise from ≈400 to ≈6400 updates, flanked by a red dashed exponential fit and a green dashed power-law fit. The data track the exponential fit more closely at large dimension, and the error bars grow with dimension.
</details>
<details>
<summary>x22.png Details</summary>

Gradient updates vs. dimension (linear scale; dimension ≈25–225, updates ≈50–700): blue data points with error bars rise from ≈80 to ≈550 updates; a red dashed exponential fit and a green dashed power-law fit bracket the data, with large error bars at higher dimensions.
</details>
Figure 9: Same as in Fig. 8, but in linear scale for better visualisation, for homogeneous readouts (Left) and Gaussian readouts (Right), with threshold $\varepsilon^{*}=0.008$ .
<details>
<summary>x23.png Details</summary>

Test loss vs. gradient updates (0–6000) for $d=60$–$180$ (shades of red): each curve drops rapidly from ≈0.035–0.055 within the first ≈1000 updates, then fluctuates around ≈0.015–0.03; horizontal reference lines mark $\varepsilon^{\rm uni}$, $\varepsilon^{\rm opt}$ and $2\varepsilon^{\rm uni}$.
</details>
<details>
<summary>x24.png Details</summary>

Test loss vs. gradient updates (0–2000) for $d=60$–$180$ (shades of red): curves decrease rapidly and plateau around ≈0.015–0.02 after ≈1000 updates; reference levels $\varepsilon^{\rm opt}$, $\varepsilon^{\rm uni}$ and $2\varepsilon^{\rm uni}$ are indicated.
</details>
<details>
<summary>x25.png Details</summary>

Test loss vs. gradient updates (0–600) for $d=60$–$180$ (shades of red): curves decrease smoothly towards ≈0.008–0.01, ending below the $2\varepsilon^{\rm uni}$ level and approaching $\varepsilon^{\rm opt}$.
</details>
Figure 10: Trajectories of the generalisation error of neural networks trained with ADAM at fixed batch size $B=\lfloor n/4\rfloor$, learning rate 0.05, for ReLU activation with parameters $\Delta=10^{-4}$ for the linear readout, $\gamma=0.5$ and $\alpha=5.0>\alpha_{\rm sp}$ ($=0.22,0.12,0.02$ for homogeneous, Rademacher and Gaussian readouts respectively). The error $\varepsilon^{\rm uni}$ is the mean-square generalisation error associated with the universal solution with overlap $\mathcal{Q}_{W}\equiv 0$. Left: Homogeneous readouts. Centre: Rademacher readouts. Right: Gaussian readouts. Readouts are kept fixed (and equal to the teacher’s) in all cases during training. Points on the solid lines are obtained by averaging over 5 teacher/data instances, and shaded regions around them correspond to one standard deviation.
We now provide empirical evidence concerning the computational complexity of attaining specialisation, namely of having at least one $\mathcal{Q}_{W}(\mathsf{v})>0$, or equivalently of beating the “universal” performance ($\mathcal{Q}_{W}(\mathsf{v})=0$ for all $\mathsf{v}\in\mathsf{V}$) in terms of generalisation error. We tested two algorithms that can find it in affordable computational time: ADAM with batch size optimised for every dimension tested (the learning rate is tuned automatically), and Hamiltonian Monte Carlo (HMC), both used to infer a two-layer teacher network with Gaussian inner weights.
ADAM
We focus on ReLU activation, with $\gamma=0.5$, a Gaussian output channel with low label noise ($\Delta=10^{-4}$) and $\alpha=5.0>\alpha_{\rm sp}$ ($=0.22,0.12,0.02$ for homogeneous, Rademacher and Gaussian readouts respectively; we are thus deep in the specialisation phase in all the cases we report), so that the specialisation solution exhibits a very low generalisation error. We test the learnt model at each gradient update, measuring the generalisation error with a moving average over 10 steps to smooth the curves. Defining $\varepsilon^{\rm uni}$ as the generalisation error associated with the overlap $\mathcal{Q}_{W}\equiv 0$, and fixing a threshold $\varepsilon^{\rm opt}<\varepsilon^{*}<\varepsilon^{\rm uni}$, we define $t^{*}(d)$ as the time (in gradient updates) needed for the algorithm to cross the threshold for the first time. We optimise over different batch sizes $B_{p}=\lfloor n/2^{p}\rfloor$, $p=2,3,\dots,\lfloor\log_{2}(n)\rfloor-1$, as follows: for each batch size, the student network is trained until the moving average of the test loss drops below $\varepsilon^{*}$, thus outperforming the universal solution; we have checked that in such a scenario the student ultimately gets close to the performance of the specialisation solution. The batch size that requires the fewest gradient updates is selected. We used the ADAM routine implemented in PyTorch.
We test different distributions for the readout weights (kept fixed to ${\mathbf{v}}$ during training of the inner weights). We report all the values of $t^{*}(d)$ in Fig. 8 for various dimensions $d$ at fixed $(\alpha,\gamma)$ , providing an exponential fit $t^{*}(d)=\exp(ad+b)$ (left panel) and a power-law fit $t^{*}(d)=ad^{b}$ (right panel). We report the $\chi^{2}$ test for the fits in Table 2. We observe that for homogeneous and Rademacher readouts, the exponential fit is more compatible with the experiments, while for Gaussian readouts the comparison is inconclusive.
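A minimal sketch of the fit comparison, assuming, as in Fig. 8, that the exponential fit $t^{*}(d)=\exp(ad+b)$ is a linear regression on the semilog data and the power-law fit $t^{*}(d)=ad^{b}$ a linear regression on the log-log data. The arrays `d`, `t` and `sigma` below are placeholders, not the measured values.

```python
import numpy as np


def fit_exponential(d, t):
    """Fit t*(d) = exp(a d + b) by linear regression of log t on d (semilog)."""
    a, b = np.polyfit(d, np.log(t), 1)
    return np.exp(a * d + b)


def fit_power_law(d, t):
    """Fit t*(d) = a d^b by linear regression of log t on log d (log-log)."""
    b, log_a = np.polyfit(np.log(d), np.log(t), 1)
    return np.exp(log_a) * d ** b


def chi2(t, t_fit, sigma):
    """Chi-square statistic of a fit, given per-point standard deviations."""
    return float(np.sum(((t - t_fit) / sigma) ** 2))


# Placeholder data: exactly exponential growth, so the exponential fit
# should win the chi-square comparison.
d = np.array([40.0, 60.0, 100.0, 200.0])
t = np.exp(0.02 * d + 1.0)
sigma = 0.1 * t  # stand-in for the error bars over 10 instances

chi2_exp = chi2(t, fit_exponential(d, t), sigma)
chi2_pow = chi2(t, fit_power_law(d, t), sigma)
```

The smaller of `chi2_exp` and `chi2_pow` plays the role of the bold entries in Table 2.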
In Fig. 10 we report the test loss of ADAM as a function of the gradient updates used for training, for various dimensions and choices of the readout distribution (as before, the readouts are not learnt but fixed to the teacher’s). Here we fix the batch size for simplicity. For both homogeneous (${\mathbf{v}}=\bm{1}$) and Rademacher readouts (left and centre panels), the model experiences performance plateaux whose length increases with the system size, in accordance with the exponential complexity observed above. The plateaux occur at test-loss values comparable with twice the Bayes error predicted by the universal branch of our theory (recall the relationship between Gibbs and Bayes errors reported in App. C). The curves are smoother for Gaussian readouts.
Hamiltonian Monte Carlo
<details>
<summary>x26.png Details</summary>

HMC acceptance rate vs. HMC step (0–4000) for $d=120$–$240$: each curve rises over an initial burn-in of ≈500 steps and then plateaus, the plateau level decreasing from ≈0.94–0.95 towards ≈0.88–0.89 as $d$ grows; dashed reference lines labelled “theory” (≈0.95) and “universal” (≈0.90) are shown.
</details>
<details>
<summary>x27.png Details</summary>

### Visual Description
Line chart: acceptance rate (0.75-0.95) vs. HMC step (0-2000) for d = 120, 140, 160, 180, 200, 220, 240, with dashed benchmarks "theory" (≈ 0.95) and "universal" (≈ 0.86). All curves rise rapidly within the first ≈ 200 steps and then plateau; the plateau decreases monotonically with d, from ≈ 0.94-0.95 at d = 120 down to ≈ 0.88-0.89 at d = 240, so the small-d curves approach the theory line while the large-d curves sit closer to the universal one.
</details>
<details>
<summary>x28.png Details</summary>

### Visual Description
Line chart: acceptance rate (0.80-0.96) vs. HMC step (0-2000) for d = 120 to 240, with dashed benchmarks "theory" (≈ 0.955) and "universal" (≈ 0.90). All curves start near 0.80, rise sharply within the first ≈ 200 steps, and converge to a common plateau around 0.94-0.95, above the universal benchmark and just below the theory one; at the plateau the dependence on d is negligible.
</details>
Figure 11: Trajectories of the overlap $q_{2}$ in HMC runs initialised uninformatively for the polynomial activation $\sigma_{3}={\rm He}_{2}/\sqrt{2}+{\rm He}_{3}/6$ with parameters $\Delta=0.1$ for the linear readout, $\gamma=0.5$ and $\alpha=1.0$. Left: Homogeneous readouts. Centre: Rademacher readouts. Right: Gaussian readouts. Points on the solid lines are obtained by averaging over 10 teacher/data instances, and the shaded regions around them correspond to one standard deviation. Notice that the $y$-axes are limited for better visualisation. For the left and centre plots, any threshold (horizontal line in the plot) between the prediction of the $\mathcal{Q}_{W}\equiv 0$ branch of our theory (black dashed line) and its prediction for the Bayes-optimal $q_{2}$ (red dashed line) crosses the curves at times $t^{*}(d)$ that are more compatible with an exponential fit (see Fig. 12 and Table 3, where these fits are reported and $\chi^{2}$-tested). For homogeneous and Rademacher readouts, the value of the overlap at which the dynamics slows down (predicted by the $\mathcal{Q}_{W}\equiv 0$ branch) is in quantitative agreement with the theoretical prediction (lower dashed line). The theory is instead off by $\approx 1\%$ for the values of $q_{2}$ at which the runs ultimately converge.
The experiment is performed for the polynomial activation $\sigma_{3}={\rm He}_{2}/\sqrt{2}+{\rm He}_{3}/6$ with parameters $\Delta=0.1$ for the Gaussian noise in the linear readout, $\gamma=0.5$ and $\alpha=1.0>\alpha_{\rm sp}$ ($=0.26,0.30,0.02$ for homogeneous, Rademacher and Gaussian readouts, respectively). Our HMC consists of $4000$ iterations for homogeneous readouts, or $2000$ iterations for Rademacher and Gaussian readouts. Each iteration uses an adaptive step size (initialised at $0.01$) and $10$ leapfrog steps. Instead of measuring the Gibbs error, whose relationship with $\varepsilon^{\rm opt}$ holds only at equilibrium (see the last remark in App. C), we measure the teacher-student $q_{2}$-overlap, which is meaningful at any HMC step and is informative about the learning. For a fixed threshold $q_{2}^{*}$ and dimension $d$, we measure $t^{*}(d)$ as the number of HMC iterations needed for the $q_{2}$-overlap between the HMC sample (obtained from uninformative initialisation) and the teacher weights ${\mathbf{W}}^{0}$ to cross the threshold. This criterion is again enough to assess whether the student outperforms the universal solution.
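The threshold-crossing measurement just described is simple to state precisely. Below is a minimal sketch with a synthetic overlap trajectory standing in for a real HMC run; the function name `first_crossing_time` and the trajectory shape are illustrative, not taken from the paper's code.

```python
import numpy as np

def first_crossing_time(q2_traj, q2_threshold):
    """Return the first HMC step at which the overlap trajectory
    reaches the threshold, or None if it never does."""
    above = np.flatnonzero(np.asarray(q2_traj) >= q2_threshold)
    return int(above[0]) if above.size else None

# Synthetic q2 trajectory rising from ~0 and saturating below 1,
# mimicking the qualitative shape of the curves in Fig. 11.
steps = np.arange(2000)
q2_traj = 0.95 * (1.0 - np.exp(-steps / 300.0))

t_star = first_crossing_time(q2_traj, q2_threshold=0.90)  # t*(d) for one run
```

In the experiment this is repeated over several dimensions $d$ and teacher/data instances, and the resulting $t^{*}(d)$ values are then fitted against $d$.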
As before, we test homogeneous, Rademacher and Gaussian readouts, reaching the same conclusions: while for homogeneous and Rademacher readouts an exponential scaling of the time is more compatible with the observations, the experiments remain inconclusive for Gaussian readouts (see Fig. 12). We report in Fig. 11 the values of the overlap $q_{2}$ measured along the HMC runs for different dimensions. Note that, as the HMC steps accumulate, all $q_{2}$ curves saturate to a value that is off by $\approx 1\%$ w.r.t. that predicted by our theory for the selected values of $\alpha,\gamma$ and $\Delta$. Whether this is a finite-size effect, or an effect not taken into account by the current theory, is an interesting question requiring further investigation; see App. E.2 for possible directions.
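The exponential-versus-polynomial question can be posed as a fit comparison: regress $\log t^{*}$ linearly against $d$ (exponential hypothesis) and against $\log d$ (power-law hypothesis), and compare the goodness of fit. A sketch on synthetic data follows; the crossing times are generated, not the paper's measurements, and the slope and noise values are made up for illustration.

```python
import numpy as np

def r_squared(x, y):
    """Coefficient of determination of the OLS line y ~ a*x + b."""
    a, b = np.polyfit(x, y, 1)
    resid = y - (a * x + b)
    return 1.0 - resid.var() / y.var()

d = np.array([80, 100, 120, 140, 160, 180, 200, 220, 240], dtype=float)
# Synthetic crossing times growing exponentially with d (illustration only).
rng = np.random.default_rng(1)
t_star = 5.0 * np.exp(0.017 * d + 0.05 * rng.standard_normal(d.size))

r2_exp = r_squared(d, np.log(t_star))          # exponential hypothesis: log t* vs d
r2_pow = r_squared(np.log(d), np.log(t_star))  # power-law hypothesis: log t* vs log d
# For exponentially generated data the lin-log fit should win: r2_exp > r2_pow.
```

This mirrors the lin-log versus log-log fits reported in the appendix figures; on real data the comparison is backed by the $\chi^{2}$ tests of Table 3 rather than $r^{2}$ alone.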
<details>
<summary>x29.png Details</summary>

### Visual Description
Number of MC steps (logarithmic y-axis, ≈ 10 to 1000) vs. dimension d ∈ [80, 240], three data series with error bars. The legend reports linear fits with slopes 0.0167, 0.0175, 0.0174 and r² = 0.903, 0.906, 0.909; the near-parallel straight lines on this lin-log scale are consistent with exponential growth of the number of steps with d.
</details>
<details>
<summary>x30.png Details</summary>

### Visual Description
Number of MC steps vs. dimension, both axes logarithmic, for q² = 0.903, 0.906, 0.909, with error bars. The legend reports linear fits on this log-log scale with slopes 2.4082, 2.5207, 2.5297, i.e., a power-law hypothesis t* ∝ d^2.4-2.5 for the same data.
</details>
<details>
<summary>x31.png Details</summary>

### Visual Description
Number of MC steps (logarithmic y-axis, ≈ 10 to 1000) vs. dimension d ∈ [80, 240], three data series with error bars. The legend reports linear fits with slopes 0.0136, 0.0140, 0.0138 and r² = 0.897, 0.904, 0.911 on this lin-log scale, consistent with exponential growth of the number of steps with d.
</details>
<details>
<summary>x32.png Details</summary>

### Visual Description
Scatter plot: number of MC steps vs. dimension, both axes logarithmic, for q² = 0.897, 0.904, 0.911, with error bars. The legend reports linear fits on this log-log scale with slopes 1.9791, 2.0467, 2.0093, i.e., a power-law hypothesis t* ∝ d^≈2 for the same data.
</details>
<details>
<summary>x33.png Details</summary>

### Visual Description
Number of MC steps (logarithmic y-axis) vs. dimension d ∈ [100, 240], three data series with error bars. The legend reports linear fits with slopes 0.0048, 0.0058, 0.0065 and R² = 0.940, 0.945, 0.950 on this lin-log scale, with the steeper slopes corresponding to the higher labelled values.
</details>
<details>
<summary>x34.png Details</summary>

Log-log plot: number of MC steps (y-axis) vs. dimension (x-axis, $10^{2}$ to $2.4\times 10^{2}$), both axes logarithmic. Three data series with error bars: blue circles ($q_{2}^{*}=0.940$, fitted slope $0.7867$), green squares ($q_{2}^{*}=0.945$, fitted slope $0.9348$) and red triangles ($q_{2}^{*}=0.950$, fitted slope $1.0252$). The approximately linear behaviour in log-log coordinates is compatible with power-law growth of the number of MC steps with dimension, the fitted slope being the exponent; higher thresholds $q_{2}^{*}$ yield steeper slopes.
</details>
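Both panels of Figure 12 amount to straight-line fits in transformed coordinates: a straight line in semilog coordinates corresponds to exponential growth, while a straight line in log-log coordinates corresponds to a power law whose exponent is the fitted slope. A minimal sketch of how such slopes can be extracted with `numpy.polyfit`, on illustrative synthetic data (not the paper's measurements):

```python
import numpy as np

# Illustrative data: MC steps growing as a power law in dimension,
# with a hypothetical exponent of 1 (roughly the red fit's regime)
dims = np.array([100, 120, 140, 160, 180, 200, 220, 240], dtype=float)
steps = 2.0 * dims ** 1.0

# Semilog fit: log(steps) linear in dims  ->  exponential hypothesis
a_exp, b_exp = np.polyfit(dims, np.log(steps), 1)

# Log-log fit: log(steps) linear in log(dims)  ->  power-law hypothesis;
# the slope a_pow is the power-law exponent
a_pow, b_pow = np.polyfit(np.log(dims), np.log(steps), 1)
```

On real, noisy data the two fits generally disagree, and a goodness-of-fit statistic is needed to decide which hypothesis is more compatible with the measurements.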
Figure 12: Semilog (Left) and log-log (Right) plots of the number of Hamiltonian Monte Carlo steps needed to achieve an overlap $q_{2}^{*}>q_{2}^{\rm uni}$ , certifying that the universal solution is outperformed. The dataset was generated from a teacher with polynomial activation $\sigma_{3}={\rm He}_{2}/\sqrt{2}+{\rm He}_{3}/6$ and parameters $\Delta=0.1$ for the linear readout, $\gamma=0.5$ and $\alpha=1.0>\alpha_{\rm sp}$ ( $=0.26,0.30,0.02$ for homogeneous, Rademacher and Gaussian readouts respectively). Student weights are sampled using HMC (initialised uninformatively) with $4000$ iterations for homogeneous readouts (Top row, for which $q_{2}^{\rm uni}=0.883$ ), or $2000$ iterations for Rademacher (Centre row, with $q_{2}^{\rm uni}=0.868$ ) and Gaussian readouts (Bottom row, for which $q_{2}^{\rm uni}=0.903$ ). Each iteration is adaptive (with initial step size of $0.01$ ) and uses $10$ leapfrog steps. $q_{2}^{\rm sp}=0.941,0.948,0.963$ in the three cases. The readouts are kept fixed during training. Points are obtained by averaging over 10 teacher/data instances, with error bars representing the standard deviation.
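The sampler described in the caption can be sketched as follows. This is a minimal HMC step with a fixed step size; the adaptive step-size scheme and the network posterior itself are omitted, and `U` stands for a generic potential (negative log-density), so none of the names below are the paper's code:

```python
import numpy as np

def leapfrog(q, p, grad_U, step_size, n_steps):
    """Leapfrog integration of Hamiltonian dynamics for n_steps steps."""
    p = p - 0.5 * step_size * grad_U(q)      # initial half-step on momentum
    for _ in range(n_steps - 1):
        q = q + step_size * p                # full step on position
        p = p - step_size * grad_U(q)        # full step on momentum
    q = q + step_size * p                    # last position step
    p = p - 0.5 * step_size * grad_U(q)      # final half-step on momentum
    return q, -p                             # negate momentum for reversibility

def hmc_step(q, U, grad_U, step_size=0.01, n_leapfrog=10, rng=None):
    """One HMC iteration: sample momentum, integrate, Metropolis-accept."""
    rng = rng or np.random.default_rng()
    p0 = rng.standard_normal(q.shape)
    q_new, p_new = leapfrog(q, p0, grad_U, step_size, n_leapfrog)
    # Accept with probability exp(-dH), dH the total-energy change
    dH = (U(q_new) + 0.5 * p_new @ p_new) - (U(q) + 0.5 * p0 @ p0)
    if np.log(rng.uniform()) < -dH:
        return q_new, True
    return q, False
```

In the experiments of Figure 12 such iterations are repeated until the overlap $q_{2}^{*}$ exceeds $q_{2}^{\rm uni}$, and the number of iterations needed is what is plotted against the dimension.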
| Readouts | Thresholds $q_{2}^{*}$ | Exponential fit $\chi^{2}$ | Power-law fit $\chi^{2}$ |
| --- | --- | --- | --- |
| Homogeneous | $\{0.903, 0.906, 0.909\}$ | $\bm{2.22}$ , $\bm{1.47}$ , $\bm{1.14}$ | $8.01$ , $7.25$ , $6.35$ |
| Rademacher | $\{0.897, 0.904, 0.911\}$ | $\bm{1.88}$ , $\bm{2.12}$ , $\bm{1.70}$ | $8.10$ , $7.70$ , $8.57$ |
| Gaussian | $\{0.940, 0.945, 0.950\}$ | $0.66$ , $\bm{0.44}$ , $\bm{0.26}$ | $\bm{0.62}$ , $0.53$ , $0.39$ |
Table 3: $\chi^{2}$ test for exponential and power-law fits of the time needed by Hamiltonian Monte Carlo to reach the thresholds $q_{2}^{*}$ , for various priors on the readouts. For a given row, we report three values of the $\chi^{2}$ statistic per hypothesis, corresponding to the thresholds $q_{2}^{*}$ on the left, in the order given. Fits are displayed in Figure 12. Smaller values of $\chi^{2}$ (in bold, for given threshold and readouts) indicate better compatibility with the hypothesis.
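The comparison in Table 3 weighs the residuals of each fitted curve against the measured error bars. A hedged sketch of this comparison on synthetic data; the exact weighting convention (here, the per-point standard deviation over the 10 instances) is an assumption, not taken from the paper:

```python
import numpy as np

def chi2(y_obs, y_fit, sigma):
    """Chi-squared statistic: squared residuals weighted by the error bars."""
    return float(np.sum(((y_obs - y_fit) / sigma) ** 2))

# Illustrative measurements (hypothetical MC-step counts and error bars)
dims = np.array([100, 140, 180, 220], dtype=float)
y = np.array([300.0, 420.0, 540.0, 660.0])
sig = np.full_like(y, 40.0)

# Exponential hypothesis: log(y) linear in dims
a, b = np.polyfit(dims, np.log(y), 1)
chi2_exp = chi2(y, np.exp(a * dims + b), sig)

# Power-law hypothesis: log(y) linear in log(dims)
c, d = np.polyfit(np.log(dims), np.log(y), 1)
chi2_pow = chi2(y, np.exp(c * np.log(dims) + d), sig)

# The smaller chi^2 flags the hypothesis more compatible with the data,
# as with the bold entries of Table 3
```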