# 1 Introduction
Statistical mechanics of extensive-width Bayesian neural networks near interpolation
Jean Barbier* 1, Francesco Camilli* 1, Minh-Toan Nguyen* 1, Mauro Pastore* 1, Rudy Skerk* 2

\*Equal contribution. 1 The Abdus Salam International Centre for Theoretical Physics (ICTP), Strada Costiera 11, 34151 Trieste, Italy. 2 International School for Advanced Studies (SISSA), Via Bonomea 265, 34136 Trieste, Italy.
Abstract
For three decades statistical mechanics has provided a framework to analyse neural networks. However, the theoretically tractable models, e.g., perceptrons, random features models and kernel machines, or multi-index models and committee machines with few neurons, have remained simple compared to those used in applications. In this paper we help reduce the gap between practical networks and their theoretical understanding through a statistical physics analysis of the supervised learning of a two-layer fully connected network with generic weight distribution and activation function, whose hidden layer is large but remains proportional to the input dimension. This makes it more realistic than infinitely wide networks, where no feature learning occurs, but also more expressive than narrow ones or those with fixed inner weights. We focus on Bayes-optimal learning in the teacher-student scenario, i.e., with a dataset generated by another network with the same architecture. We operate around interpolation, where the numbers of trainable parameters and of data are comparable and feature learning emerges. Our analysis uncovers a rich phenomenology with various learning transitions as the number of data increases. In particular, the more strongly the features (i.e., hidden neurons of the target) contribute to the observed responses, the less data is needed to learn them. Moreover, when the data is scarce, the model only learns non-linear combinations of the teacher weights, rather than “specialising” by aligning its weights with the teacher’s. Specialisation occurs only when enough data becomes available, but it can be hard to find for practical training algorithms, possibly due to statistical-to-computational gaps.
Understanding the expressive power and generalisation capabilities of neural networks is not only a stimulating intellectual activity, producing surprising results that seem to defy established common sense in statistics and optimisation (Bartlett et al., 2021), but has important practical implications in cost-benefit planning whenever a model is deployed. E.g., from a fruitful research line that spanned three decades, we now know that deep fully connected Bayesian neural networks with $O(1)$ readout weights and $L_{2}$ regularisation behave as kernel machines (the so-called Neural Network Gaussian processes, NNGPs) in the heavily overparametrised, infinite-width regime (Neal, 1996; Williams, 1996; Lee et al., 2018; Matthews et al., 2018; Hanin, 2023), and so suffer from these models’ limitations. Indeed, kernel machines infer the decision rule by first embedding the data in a fixed a priori feature space, the renowned kernel trick, then operating linear regression/classification over the features. In this respect, they do not learn features (in the sense of statistics relevant for the decision rule) from the data, so they need larger and larger feature spaces and training sets to fit their higher order statistics (Yoon & Oh, 1998; Dietrich et al., 1999; Gerace et al., 2021; Bordelon et al., 2020; Canatar et al., 2021; Xiao et al., 2023).
Many efforts have been devoted to studying Bayesian neural networks beyond this regime. In the so-called proportional regime, when the width is large and proportional to the training set size, recent studies showed how a limited amount of feature learning makes the network equivalent to optimally regularised kernels (Li & Sompolinsky, 2021; Pacelli et al., 2023; Camilli et al., 2023; Cui et al., 2023; Baglioni et al., 2024; Camilli et al., 2025). This could be a consequence of the fully connected architecture, as, e.g., convolutional neural networks learn more informative features (Naveh & Ringel, 2021; Seroussi et al., 2023; Aiudi et al., 2025; Bassetti et al., 2024). Another scenario is the mean-field scaling, i.e., when the readout weights are small: in this case too a Bayesian network can learn features in the proportional regime (Rubin et al., 2024a; van Meegen & Sompolinsky, 2024).
Here instead we analyse a fully connected two-layer Bayesian network trained end-to-end near the interpolation threshold, where the sample size $n$ scales like the number of trainable parameters: for input dimension $d$ and width $k$ , both large and proportional, $n=\Theta(d^{2})=\Theta(kd)$ , a regime where non-trivial feature learning can happen. We consider i.i.d. Gaussian input vectors with labels generated by a teacher network with matching architecture, in order to study the Bayes-optimal learning of this neural network target function. Our results thus provide a benchmark for the performance of any model trained on the same dataset.
2 Setting and main results
2.1 Teacher-student setting
We consider supervised learning with a shallow neural network in the classical teacher-student setup (Gardner & Derrida, 1989). The data-generating model, i.e., the teacher (or target function), is thus a two-layer neural network itself, with readout weights ${\mathbf{v}}^{0}∈\mathbb{R}^{k}$ and internal weights ${\mathbf{W}}^{0}∈\mathbb{R}^{k× d}$ , drawn entrywise i.i.d. from $P_{v}^{0}$ and $P^{0}_{W}$ , respectively; we assume $P^{0}_{W}$ to be centred while $P^{0}_{v}$ has mean $\bar{v}$ , and both priors have unit second moment. We denote the whole set of parameters of the target as ${\bm{\theta}}^{0}=({\mathbf{v}}^{0},{\mathbf{W}}^{0})$ . The inputs are i.i.d. standard Gaussian vectors ${\mathbf{x}}_{\mu}∈\mathbb{R}^{d}$ for $\mu≤ n$ . The responses/labels $y_{\mu}$ are drawn from a kernel $P^{0}_{\rm out}$ :
$$
\textstyle y_{\mu}\sim P^{0}_{\rm out}(\,\cdot\mid\lambda^{0}_{\mu}),\qquad\lambda^{0}_{\mu}:=\frac{1}{\sqrt{k}}{\mathbf{v}}^{0\intercal}\sigma\big(\frac{1}{\sqrt{d}}{\mathbf{W}}^{0}{\mathbf{x}}_{\mu}\big). \tag{1}
$$
The kernel can be stochastic or model a deterministic rule if $P^{0}_{\rm out}(y\mid\lambda)=\delta(y-\mathsf{f}^{0}(\lambda))$ for some outer non-linearity $\mathsf{f}^{0}$ . The activation function $\sigma$ is applied entrywise to vectors and is required to admit an expansion in Hermite polynomials with Hermite coefficients $(\mu_{\ell})_{\ell≥ 0}$ , see App. A: $\sigma(x)=\sum_{\ell≥ 0}\frac{\mu_{\ell}}{\ell!}{\rm He}_{\ell}(x)$ . We assume it has vanishing 0th Hermite coefficient, i.e., that it is centred $\mathbb{E}_{z\sim\mathcal{N}(0,1)}\sigma(z)=0$ ; in App. D.5 we relax this assumption. The input/output pairs $\mathcal{D}=\{({\mathbf{x}}_{\mu},y_{\mu})\}_{\mu≤ n}$ form the training set for a student network with matching architecture.
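For concreteness, the data-generating process (1) can be sketched in a few lines of numpy. This is a minimal sketch, not the paper's experimental code: the $\tanh$ activation (which is centred, as required), the Rademacher readouts and the optional Gaussian label channel are illustrative choices.

```python
import numpy as np

def sample_teacher_dataset(n, d, k, sigma=np.tanh, noise=0.0, seed=0):
    """Draw a dataset from the teacher model (1).

    Illustrative choices: Gaussian inner weights W0 (centred, unit
    variance), Rademacher readouts v0 (unit second moment), and an
    optional Gaussian label channel of variance `noise`."""
    rng = np.random.default_rng(seed)
    W0 = rng.standard_normal((k, d))         # inner weights, i.i.d. from P_W^0
    v0 = rng.choice([-1.0, 1.0], size=k)     # readout weights, i.i.d. from P_v^0
    X = rng.standard_normal((n, d))          # i.i.d. standard Gaussian inputs
    pre = sigma(X @ W0.T / np.sqrt(d))       # sigma applied entrywise
    lam0 = pre @ v0 / np.sqrt(k)             # post-activations lambda^0_mu
    y = lam0 + np.sqrt(noise) * rng.standard_normal(n) if noise > 0 else lam0
    return X, y, (v0, W0)
```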
Notice that the readouts ${\mathbf{v}}^{0}$ represent only $k$ unknowns in the target, compared to the $kd=\Theta(k^{2})$ inner weights ${\mathbf{W}}^{0}$ . Therefore, they can be equivalently considered quenched, i.e., either given and thus fixed in the student network defined below, or unknown and thus learnable, without changing the leading order of the information-theoretic quantities we aim for. E.g., in terms of mutual information per parameter, $\frac{1}{kd+k}I(({\mathbf{W}}^{0},{\mathbf{v}}^{0});\mathcal{D})=\frac{1}{kd}I({\mathbf{W}}^{0};\mathcal{D}\mid{\mathbf{v}}^{0})+o_{d}(1)$ . Without loss of generality, we thus consider ${\mathbf{v}}^{0}$ quenched and denote it ${\mathbf{v}}$ from now on. This equivalence holds at leading order and at equilibrium only, but not at the dynamical level, whose study is left for future work.
The Bayesian student learns via the posterior distribution of the weights ${\mathbf{W}}$ given the training data (and ${\mathbf{v}}$ ), defined by
$$
\textstyle dP({\mathbf{W}}\mid\mathcal{D}):=\mathcal{Z}(\mathcal{D})^{-1}dP_{W}({\mathbf{W}})\prod_{\mu\leq n}P_{\rm out}\big(y_{\mu}\mid\lambda_{\mu}({\mathbf{W}})\big)
$$
with post-activation $\lambda_{\mu}({\mathbf{W}}):=\frac{1}{\sqrt{k}}{\mathbf{v}}^{\intercal}\sigma(\frac{1}{\sqrt{d}}{\mathbf{W}}{\mathbf{x}}_{\mu})$ , the posterior normalisation constant $\mathcal{Z}(\mathcal{D})$ called the partition function, and $P_{W}$ the prior assumed by the student. From now on, we focus on the Bayes-optimal case $P_{W}=P_{W}^{0}$ and $P_{\rm out}=P_{\rm out}^{0}$ , but the approach can be extended to account for a mismatch.
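Sampling from this posterior is what the experiments later in the paper do with Metropolis-Hastings in the case of binary inner weights. A hedged sketch of one such update (the single-entry-flip proposal, the $\tanh$ activation and the Gaussian channel are our illustrative choices; for the uniform binary prior the prior ratio cancels in the acceptance probability):

```python
import numpy as np

def metropolis_step(W, X, y, v, Delta, rng):
    """One single-entry-flip Metropolis update of the posterior over
    binary weights W in {-1,+1}^{k x d}, for the Gaussian channel
    P_out(y | lam) prop. to exp(-(y - lam)^2 / (2 Delta))."""
    k, d = W.shape

    def neg_log_lik(W_):
        lam = np.tanh(X @ W_.T / np.sqrt(d)) @ v / np.sqrt(k)
        return np.sum((y - lam) ** 2) / (2 * Delta)

    i, j = rng.integers(k), rng.integers(d)
    W_new = W.copy()
    W_new[i, j] *= -1                    # propose flipping one weight
    dE = neg_log_lik(W_new) - neg_log_lik(W)
    if np.log(rng.uniform()) < -dE:      # accept with prob. min(1, e^{-dE})
        return W_new
    return W
```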
We aim at evaluating the expected generalisation error of the student. Let $({\mathbf{x}}_{\rm test},y_{\rm test}\sim P_{\rm out}(\,\cdot\mid\lambda^{0}_{\rm test}))$ be a fresh sample (not present in $\mathcal{D}$ ) drawn using the teacher, where $\lambda_{\rm test}^{0}$ is defined as in (1) with ${\mathbf{x}}_{\mu}$ replaced by ${\mathbf{x}}_{\rm test}$ (and similarly for $\lambda_{\rm test}({\mathbf{W}})$ ). Given any prediction function $\mathsf{f}$ , the Bayes estimator for the test response reads $\hat{y}^{\mathsf{f}}({\mathbf{x}}_{\rm test},{\mathcal{D}}):=\langle\mathsf{f}(\lambda_{\rm test}({\mathbf{W}}))\rangle$ , where the expectation $\langle\,\cdot\,\rangle:=\mathbb{E}[\,\cdot\mid\mathcal{D}]$ is w.r.t. the posterior $dP({\mathbf{W}}\mid\mathcal{D})$ . Then, for a performance measure $\mathcal{C}:\mathbb{R}\times\mathbb{R}\to\mathbb{R}_{\geq 0}$ the Bayes generalisation error is
$$
\displaystyle\varepsilon^{\mathcal{C},\mathsf{f}}:=\mathbb{E}_{{\bm{\theta}}^{0},{\mathcal{D}},{\mathbf{x}}_{\rm test},y_{\rm test}}\,\mathcal{C}\big(y_{\rm test},\big\langle\mathsf{f}(\lambda_{\rm test}({\mathbf{W}}))\big\rangle\big). \tag{2}
$$
An important case is the square loss $\mathcal{C}(y,\hat{y})=(y-\hat{y})^{2}$ with the choice $\mathsf{f}(\lambda)=\int dy\,y\,P_{\rm out}(y\mid\lambda)=:\mathbb{E}[y\mid\lambda]$ . The Bayes-optimal mean-square generalisation error follows:
$$
\displaystyle\varepsilon^{\rm opt}:=\mathbb{E}_{{\bm{\theta}}^{0},{\mathcal{D}},{\mathbf{x}}_{\rm test},y_{\rm test}}\big(y_{\rm test}-\big\langle\mathbb{E}[y\mid\lambda_{\rm test}({\mathbf{W}})]\big\rangle\big)^{2}. \tag{3}
$$
Our main example will be the case of linear readout with Gaussian label noise: $P_{\rm out}(y\mid\lambda)=\exp(-\frac{1}{2\Delta}(y-\lambda)^{2})/\sqrt{2\pi\Delta}$ . In this case, the generalisation error $\varepsilon^{\rm opt}$ takes a simpler form for numerical evaluation than (3), thanks to the concentration of “overlaps” entering it, see App. C.
We study the challenging extensive-width regime with quadratically many samples, i.e., a large size limit
$$
\displaystyle d,k,n\to+\infty\quad\text{with}\quad k/d\to\gamma,\quad n/d^{2}\to\alpha. \tag{4}
$$
We denote this joint $d,k,n$ limit with these rates by “ ${\lim}$ ”.
In order to access $\varepsilon^{\mathcal{C},\mathsf{f}},\varepsilon^{\rm opt}$ and other relevant quantities, one can tackle the computation of the average log-partition function, or free entropy in statistical physics language:
$$
\textstyle f_{n}:=\frac{1}{n}\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\ln\mathcal{Z}(\mathcal{D}). \tag{5}
$$
The mutual information between the teacher weights and the data is related to the free entropy $f_{n}$ , see App. F. E.g., in the case of linear readout with Gaussian label noise we have $\lim\frac{1}{kd}I({\mathbf{W}}^{0};\mathcal{D}\mid{\mathbf{v}})=-\frac{\alpha}{\gamma}\lim f_{n}-\frac{\alpha}{2\gamma}\ln(2\pi e\Delta)$ . Considering the mutual information per parameter allows us to interpret $\alpha$ as a sort of signal-to-noise ratio: the mutual information defined in this way increases with it.
Notations: Bold is for vectors and matrices; $d$ is the input dimension, $k$ the width of the hidden layer, $n$ the size of the training set $\mathcal{D}$ , with asymptotic ratios given by (4); ${\mathbf{A}}^{\circ\ell}$ is the Hadamard power of a matrix; for a vector ${\mathbf{v}}$ , $({\mathbf{v}})$ is the diagonal matrix ${\rm diag}({\mathbf{v}})$ ; $(\mu_{\ell})$ are the Hermite coefficients of the activation function $\sigma(x)=\sum_{\ell≥ 0}\frac{\mu_{\ell}}{\ell!}{\rm He}_{\ell}(x)$ ; the norm $\|\,·\,\|$ for vectors and matrices is the Frobenius norm.
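The Hermite coefficients $(\mu_{\ell})$ of a given activation can be computed numerically; a minimal sketch via Gauss-Hermite quadrature in the probabilists' convention (`hermite_coefficients` is our own helper, not part of the paper's code):

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

def hermite_coefficients(sigma, L, n_quad=80):
    """mu_ell = E_{z~N(0,1)}[sigma(z) He_ell(z)] for ell = 0..L-1,
    evaluated by Gauss-Hermite quadrature (probabilists' He_ell)."""
    x, w = hermegauss(n_quad)       # quadrature for weight exp(-x^2/2)
    w = w / np.sqrt(2 * np.pi)      # normalise to the standard Gaussian
    mu = np.empty(L)
    for ell in range(L):
        c = np.zeros(ell + 1)
        c[ell] = 1.0                # coefficient vector selecting He_ell
        mu[ell] = np.sum(w * sigma(x) * hermeval(x, c))
    return mu
```

For example, an odd activation such as $\tanh$ has $\mu_{0}=\mu_{2}=0$, consistent with the centring assumption above, while the centred quadratic $\sigma(x)=x^{2}-1$ has $\mu_{2}=2$.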
2.2 Main results
The aforementioned setting is related to the recent paper Maillard et al. (2024a), with two major differences: that work considers Gaussian-distributed weights and a quadratic activation. These hypotheses allow numerous simplifications in the analysis, exploited in a series of works Du & Lee (2018); Soltanolkotabi et al. (2019); Venturi et al. (2019); Sarao Mannelli et al. (2020); Gamarnik et al. (2024); Martin et al. (2024); Arjevani et al. (2025). Thanks to this, Maillard et al. (2024a) maps the learning task onto a generalised linear model (GLM) where the goal is to infer a Wishart matrix from linear observations, which is analysable using known results on the GLM Barbier et al. (2019) and matrix denoising Barbier & Macris (2022); Maillard et al. (2022); Pourkamali et al. (2024); Semerjian (2024).
Our main contribution is a statistical mechanics framework for characterising the prediction performance of shallow Bayesian neural networks, able to handle arbitrary activation functions and different distributions of i.i.d. weights, both ingredients playing an important role for the phenomenology.
The theory we derive draws a rich picture with various learning transitions when tuning the sample rate $\alpha\approx n/d^{2}$ . For low $\alpha$ , feature learning occurs because the student tunes its weights to match non-linear combinations of the teacher’s, rather than aligning to those weights themselves. This phase is universal in the (centred, unit-variance) law of the i.i.d. teacher inner weights: our numerics, obtained both with binary and Gaussian inner weights, match well the theory, which does not depend on this prior here. When increasing $\alpha$ , strong feature learning emerges through specialisation phase transitions, where the student aligns some of its weights with the actual teacher’s ones. In particular, when the readouts ${\mathbf{v}}$ in the target function have a non-trivial distribution, a whole sequence of specialisation transitions occurs as $\alpha$ grows, for the following intuitive reason. Different features in the data are related to the weights of the teacher neurons, $({\mathbf{W}}^{0}_{j}\in\mathbb{R}^{d})_{j\leq k}$ . The strength with which the responses $(y_{\mu})$ depend on the feature ${\mathbf{W}}_{j}^{0}$ is tuned by the corresponding readout through $|v_{j}|$ , which plays the role of a feature-dependent “signal-to-noise ratio”. Therefore, the features/hidden neurons $j\in[k]$ corresponding to the largest readout amplitude $\max\{|v_{j}|\}$ are learnt first by the student when increasing $\alpha$ (in the sense that the teacher-student overlap ${\mathbf{W}}^{\intercal}_{j}{\mathbf{W}}^{0}_{j}/d>o_{d}(1)$ ), then the features with the second largest amplitude, and so on. If the readouts are continuous, an infinite sequence of specialisation transitions emerges in the limit (4). On the contrary, if the readouts are homogeneous (i.e., take a unique value), then a single transition occurs where almost all neurons of the student specialise jointly (possibly up to a vanishing fraction).
We predict specialisation transitions to occur for binary inner weights and generic activation, or for Gaussian ones and more-than-quadratic activation. We provide a theoretical description of these learning transitions and identify the order parameters (sufficient statistics) needed to deduce the generalisation error through scalar equations.
The picture that emerges is connected to recent findings in the context of extensive-rank matrix denoising Barbier et al. (2025). In that model, a recovery transition was also identified, separating a universal phase (i.e., independent of the signal prior) from a factorisation phase akin to specialisation in the present context. We believe that this picture and the one found in the present paper are not just similar, but a manifestation of the same fundamental mechanism inherent to the extensive rank of the matrices involved. Indeed, matrix denoising and neural networks share features with both matrix models Kazakov (2000); Brézin et al. (2016); Anninos & Mühlmann (2020) and planted mean-field spin glasses Nishimori (2001); Zdeborová & Krzakala (2016). This mixed nature requires blending techniques from both fields to tackle them. Consequently, the approach developed in Sec. 4 based on the replica method Mezard et al. (1986) is non-standard, as it crucially relies on the Harish Chandra–Itzykson–Zuber (HCIZ), or “spherical”, integral used in matrix models Itzykson & Zuber (1980); Matytsin (1994); Guionnet & Zeitouni (2002). Mixing spherical integration and the replica method has been previously attempted in Schmidt (2018); Barbier & Macris (2022) for matrix denoising, both papers yielding promising but quantitatively inaccurate or non-computable results. Another attempt to exploit a mean-field technique for matrix denoising (in that case a high-temperature expansion) is Maillard et al. (2022), which suffers from similar limitations. The more quantitative answer from Barbier et al. (2025) was made possible precisely thanks to the understanding that the problem behaves more as a matrix model or as a planted mean-field spin glass depending on the phase in which it lives. The two phases could then be treated separately and joined using an appropriate criterion to locate the transition.
It would be desirable to derive a unified theory able to describe the whole phase diagram based on a single formalism. This is what the present paper provides through a principled combination of spherical integration and the replica method, yielding predictive formulas that are easy to evaluate. It is important to notice that the presence of the HCIZ integral, which is a high-dimensional matrix integral, in the replica formula presented in Result 2.1 suggests that effective one-body problems are not enough on their own to capture the physics of the problem, as is usually the case in standard mean-field inference and spin glass models. Indeed, the appearance of effective one-body problems to describe complex statistical models is usually related to the asymptotic decoupling of the finite marginals of the variables in the problem at hand into products of the single-variable marginals. Therefore, we do not expect a standard cavity (or leave-one-out) approach based on single-variable extraction to be exact, while it is usually shown that the replica and cavity approaches are equivalent in mean-field models Mezard et al. (1986). This may explain why the approximate message-passing algorithms proposed in Parker et al. (2014); Krzakala et al. (2013); Kabashima et al. (2016) are, as stated by the authors, not properly converging nor able to match their corresponding theoretical predictions based on the cavity method. Algorithms for extensive-rank systems should therefore combine ingredients from matrix denoising and standard message-passing, reflecting their hybrid mean-field/matrix model nature.
In order to face this, we adapt the GAMP-RIE (generalised approximate message-passing with rotational invariant estimator), introduced in Maillard et al. (2024a) for the special case of quadratic activation, to accommodate a generic activation function $\sigma$ . By construction, the resulting algorithm, described in App. H, cannot find the specialisation solution, i.e., a solution where at least $\Theta(k)$ neurons align with the teacher’s. Nevertheless, it matches the performance associated with the so-called universal solution/branch of our theory for all $\alpha$ , which describes a solution with overlap ${\mathbf{W}}^{\intercal}_{j}{\mathbf{W}}^{0}_{j}/d>o_{d}(1)$ for at most $o(k)$ neurons. As a side investigation, we show empirically that the specialisation solution is potentially hard to reach with popular algorithms for some target functions: the algorithms we tested either fail to find it and instead get stuck in a sub-optimal glassy phase (Metropolis-Hastings sampling for the case of binary inner weights), or may find it but in a training time increasing exponentially with $d$ (ADAM Kingma & Ba (2017) and Hamiltonian Monte Carlo (HMC) for the case of Gaussian weights). It would thus be interesting to settle whether GAMP-RIE has the best prediction performance achievable by a polynomial-time learner when $n=\Theta(d^{2})$ for such targets. For specific choices of the distribution of the readout weights, the evidence of hardness is not conclusive and requires further investigation.
Replica free entropy
Our first result is a tractable approximation for the free entropy. To state it, let us introduce two functions $\mathcal{Q}_{W}(\mathsf{v}),\hat{\mathcal{Q}}_{W}(\mathsf{v})∈[0,1]$ for $\mathsf{v}∈{\rm Supp}(P_{v})$ , which are non-decreasing in $|\mathsf{v}|$ . Let (see (43) in appendix for a more explicit expression of $g$ )
$$
\textstyle g(x):=\sum_{\ell\geq 3}\frac{\mu_{\ell}^{2}}{\ell!}x^{\ell},\qquad q_{K}(x,\mathcal{Q}_{W}):=\mu_{1}^{2}+\frac{\mu_{2}^{2}}{2}x+\mathbb{E}_{v\sim P_{v}}[v^{2}g(\mathcal{Q}_{W}(v))],\qquad r_{K}:=\mu_{1}^{2}+\frac{\mu_{2}^{2}}{2}(1+\gamma\bar{v}^{2})+g(1),
$$
and the auxiliary potentials
$$
\textstyle\psi_{P_{W}}(x):=\mathbb{E}_{w^{0},\xi}\ln\mathbb{E}_{w}\exp\big(-\frac{1}{2}xw^{2}+xw^{0}w+\sqrt{x}\,\xi w\big),
$$
where $w^{0},w\sim P_{W}$ and $\xi,u_{0},u\sim{\mathcal{N}}(0,1)$ , all independent. Moreover, $\mu_{{\mathbf{Y}}(x)}$ is the limiting (in $d\to\infty$ ) spectral density of the data ${\mathbf{Y}}(x)=\sqrt{x/(kd)}\,{\mathbf{S}}^{0}+{\mathbf{Z}}$ in the denoising problem of the matrix ${\mathbf{S}}^{0}:={\mathbf{W}}^{0\intercal}({\mathbf{v}}){\mathbf{W}}^{0}\in\mathbb{R}^{d\times d}$ , with ${\mathbf{Z}}$ a standard GOE matrix (a symmetric matrix whose upper triangular part has i.i.d. entries from $\mathcal{N}(0,(1+\delta_{ij})/d)$ ). Denote the minimum mean-square error associated with this denoising problem as ${\rm mmse}_{S}(x)=\lim_{d\to\infty}d^{-2}\mathbb{E}\|{\mathbf{S}}^{0}-\mathbb{E}[{\mathbf{S}}^{0}\mid{\mathbf{Y}}(x)]\|^{2}$ (whose explicit definition is given in App. D.3) and its functional inverse by ${\rm mmse}_{S}^{-1}$ (which exists by monotonicity).
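To make these definitions concrete, here is a sketch of $g$, $q_{K}$, $r_{K}$ and of the observation ${\mathbf{Y}}(x)$. These are our own helpers, under simplifying assumptions: the expectation over $P_{v}$ is taken on a discrete support (`v_vals` with masses `v_probs`), and the series defining $g$ is truncated at the number of available Hermite coefficients.

```python
import numpy as np
from math import factorial

def g(t, mu):
    """g(t) = sum_{ell >= 3} mu_ell^2 t^ell / ell!, truncated at len(mu)."""
    return sum(mu[l] ** 2 * t ** l / factorial(l) for l in range(3, len(mu)))

def q_K(x, QW_vals, mu, v_vals, v_probs):
    """q_K(x, Q_W), with E_{v ~ P_v} over a discrete support."""
    Ev = sum(p * v ** 2 * g(Q, mu) for v, p, Q in zip(v_vals, v_probs, QW_vals))
    return mu[1] ** 2 + mu[2] ** 2 * x / 2 + Ev

def r_K(mu, gamma, vbar):
    """r_K = mu_1^2 + mu_2^2 (1 + gamma vbar^2)/2 + g(1)."""
    return mu[1] ** 2 + mu[2] ** 2 * (1 + gamma * vbar ** 2) / 2 + g(1.0, mu)

def noisy_observation(W0, v0, x, rng):
    """Y(x) = sqrt(x/(kd)) S0 + Z, with S0 = W0^T diag(v0) W0 and Z a
    GOE matrix (off-diagonal variance 1/d, diagonal variance 2/d)."""
    k, d = W0.shape
    S0 = W0.T @ np.diag(v0) @ W0
    G = rng.standard_normal((d, d)) / np.sqrt(d)
    Z = (G + G.T) / np.sqrt(2)       # symmetrisation gives the GOE scaling
    return np.sqrt(x / (k * d)) * S0 + Z
```

A quick sanity check: for centred readouts ($\bar{v}=0$) and full overlap $\mathcal{Q}_{W}\equiv 1$ one has $q_{K}(1,\mathcal{Q}_{W})=r_{K}$, as the formulas above impose.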
**Result 2.1 (Replica symmetric free entropy)**
*Let the functional $\tau(\mathcal{Q}_{W}):={\rm mmse}_{S}^{-1}(1-\mathbb{E}_{v\sim P_{v}}[v^{2}\mathcal{Q}_{W}(v)^{2}])$ . Given $(\alpha,\gamma)$ , the replica symmetric (RS) free entropy approximating ${\lim}\,f_{n}$ in the scaling limit (4) is ${\rm extr}\,f_{\rm RS}^{\alpha,\gamma}$ with RS potential $f^{\alpha,\gamma}_{\rm RS}=f^{\alpha,\gamma}_{\rm RS}(q_{2},\hat{q}_{2},\mathcal{Q}_{W},\hat{\mathcal{Q}}_{W})$ given by
$$
\begin{aligned}
\textstyle f^{\alpha,\gamma}_{\rm RS}:={}&\textstyle\psi_{P_{\text{out}}}(q_{K}(q_{2},\mathcal{Q}_{W});r_{K})+\frac{1}{4\alpha}(1+\gamma\bar{v}^{2}-q_{2})\hat{q}_{2}\\
&\textstyle+\frac{\gamma}{\alpha}\mathbb{E}_{v\sim P_{v}}\big[\psi_{P_{W}}(\hat{\mathcal{Q}}_{W}(v))-\frac{1}{2}\mathcal{Q}_{W}(v)\hat{\mathcal{Q}}_{W}(v)\big]\\
&\textstyle+\frac{1}{\alpha}\big[\iota(\tau(\mathcal{Q}_{W}))-\iota(\hat{q}_{2}+\tau(\mathcal{Q}_{W}))\big].
\end{aligned}\tag{6}
$$
The extremisation operation in ${\rm extr}\,f^{\alpha,\gamma}_{\rm RS}$ selects a solution $(q_{2}^{*},\hat{q}_{2}^{*},\mathcal{Q}_{W}^{*},\hat{\mathcal{Q}}_{W}^{*})$ of the saddle point equations, obtained from $∇ f^{\alpha,\gamma}_{\rm RS}=\mathbf{0}$ , which maximises the RS potential.*
The extremisation of $f_{\rm RS}^{\alpha,\gamma}$ yields the system (76) in the appendix, solved numerically in a standard way (see provided code).
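The standard way to solve such saddle point systems numerically is damped fixed-point iteration; a generic sketch (our own helper, not the provided code; `update` stands for the right-hand side of the system, here exercised on a scalar toy map):

```python
import numpy as np

def damped_fixed_point(update, x0, damping=0.5, tol=1e-10, max_iter=10_000):
    """Iterate x <- (1 - damping) * x + damping * update(x) until the
    sup-norm change drops below tol; damping stabilises oscillating maps."""
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    for _ in range(max_iter):
        x_new = (1 - damping) * x + damping * np.atleast_1d(
            np.asarray(update(x), dtype=float))
        if np.max(np.abs(x_new - x)) < tol:
            return x_new
        x = x_new
    return x
```

Applied to a toy map such as `np.cos`, the iteration converges to its unique fixed point; in the actual system the vector `x` collects $(q_{2},\hat{q}_{2},\mathcal{Q}_{W},\hat{\mathcal{Q}}_{W})$ on a discretised support.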
The order parameters $q_{2}^{*}$ and $\mathcal{Q}_{W}^{*}$ have a precise physical meaning that will become clear from the discussion in Sec. 4. In particular, $q_{2}^{*}$ measures the alignment of the student’s combination of weights ${\mathbf{W}}^{\intercal}({\mathbf{v}}){\mathbf{W}}/\sqrt{k}$ with the corresponding teacher’s ${\mathbf{W}}^{0\intercal}({\mathbf{v}}){\mathbf{W}}^{0}/\sqrt{k}$ , which is non-trivial with $n=\Theta(d^{2})$ data even when the student is not able to reconstruct ${\mathbf{W}}^{0}$ itself (i.e., to specialise). On the other hand, $\mathcal{Q}_{W}^{*}(\mathsf{v})$ measures the overlap between weights $\{{\mathbf{W}}_{i}^{0/\cdot}\mid v_{i}=\mathsf{v}\}$ (a different treatment for weights connected to different $\mathsf{v}$ ’s is needed because, as discussed earlier, the student first learns, with less data, the weights connected to larger readouts). A non-trivial $\mathcal{Q}_{W}^{*}(\mathsf{v})\neq 0$ signals that the student learns something about ${\mathbf{W}}^{0}$ . Thus, the specialisation transitions are naturally defined, based on the extremiser of $f_{\rm RS}^{\alpha,\gamma}$ in the result above, as $\alpha_{\rm sp,\mathsf{v}}(\gamma):=\sup\,\{\alpha\mid\mathcal{Q}^{*}_{W}(\mathsf{v})=0\}$ . For non-homogeneous readouts, we call the specialisation transition $\alpha_{\rm sp}(\gamma):=\min_{\mathsf{v}}\alpha_{\rm sp,\mathsf{v}}(\gamma)$ . In this article, we report cases where the inner weights are discrete or Gaussian distributed. For activations different from a pure quadratic, $\sigma(x)\neq x^{2}$ , we predict the transition to occur in both cases (see Fig. 1 and 2). Then, $\alpha<\alpha_{\rm sp}$ corresponds to the universal phase, where the free entropy is independent of the choice of the prior over the inner weights. Instead, $\alpha>\alpha_{\rm sp}$ is the specialisation phase, where the prior $P_{W}$ matters and the student aligns a finite fraction of its weights $({\mathbf{W}}_{j})_{j\leq k}$ with those of the teacher, which lowers the generalisation error.
Let us comment on why the special case $\sigma(x)=x^{2}$ with $P_{W}=\mathcal{N}(0,1)$ could be treated exactly with known techniques (spherical integration) in Maillard et al. (2024a); Xu et al. (2025). With $\sigma(x)=x^{2}$ the responses $(y_{\mu})$ depend on ${\mathbf{W}}^{0\intercal}({\mathbf{v}}){\mathbf{W}}^{0}$ only. If ${\mathbf{v}}$ has finite fractions of equal entries, a large invariance group prevents learning ${\mathbf{W}}^{0}$ and thus specialisation. Take as example ${\mathbf{v}}=(1,\ldots,1,-1,\ldots,-1)$ with the first half filled with ones. Then, the responses are indistinguishable from those obtained using a modified matrix ${\mathbf{W}}^{0\intercal}{\mathbf{U}}^{\intercal}({\mathbf{v}}){\mathbf{U}}{\mathbf{W}}^{0}$ where ${\mathbf{U}}=(({\mathbf{U}}_{1},\mathbf{0}_{d/2})^{\intercal},(\mathbf{0}_{d/2},{\mathbf{U}}_{2})^{\intercal})$ is block diagonal with $d/2\times d/2$ orthogonal ${\mathbf{U}}_{1},{\mathbf{U}}_{2}$ and zeros on the off-diagonal blocks. The Gaussian prior $P_{W}$ is rotationally invariant and thus does not break any invariance, so ${\mathbf{U}}_{1},{\mathbf{U}}_{2}$ are arbitrary. The resulting invariance group has a $\Theta(d^{2})$ entropy (the logarithm of its volume), which is comparable to the leading order of the free entropy. Therefore, it cannot be broken using infinitesimal perturbations (or “side information”) and, consequently, prevents specialisation. This reasoning can be extended to $P_{v}$ with a continuous support, as long as we can discretise it with a finite (possibly large) number of bins, take the limit (4) first, and then take the continuum limit of the binning afterwards. However, the picture changes if the prior breaks rotational invariance; e.g., with Rademacher $P_{W}$ , only signed permutation invariances survive, a symmetry with negligible entropy $o(d^{2})$ which, consequently, does not change the limiting thermodynamic (information-theoretic) quantities. The large rotational invariance group is the reason why $\sigma(x)=x^{2}$ with $P_{W}=\mathcal{N}(0,1)$ can be treated using the HCIZ integral alone. Even when $P_{W}=\mathcal{N}(0,1)$ , the presence of any other term in the series expansion of $\sigma$ breaks invariances with large entropy: specialisation can then occur, thus requiring our theory. We mention that our theory seems inexact for $\sigma(x)=x^{2}$ with $P_{W}=\mathcal{N}(0,1)$ if applied naively, as it predicts ${\mathcal{Q}}_{W}(\mathsf{v})>0$ and therefore does not recover the rigorous result of Xu et al. (2025) (yet, it predicts a free entropy less than $1\%$ away from the truth). Indeed, when solving the extremisation of (6) in this case, we noticed that the difference between the RS free entropy of the correct universal solution, $\mathcal{Q}_{W}(\mathsf{v})=0$ , and that of the maximiser, predicting $\mathcal{Q}_{W}(\mathsf{v})>0$ , does not exceed $\approx 1\%$ : the RS potential is very flat as a function of $\mathcal{Q}_{W}$ . We thus cannot discard that the true maximiser of the potential is at $\mathcal{Q}_{W}(\mathsf{v})=0$ , and that we observe otherwise due to numerical errors. Evaluating the spherical integrals $\iota(\,\cdot\,)$ in $f^{\alpha,\gamma}_{\rm RS}$ is indeed challenging, in particular when $\gamma$ is small; for $\gamma\gtrsim 1$ we do find that $\mathcal{Q}_{W}(\mathsf{v})=0$ is always the maximiser for $\sigma(x)=x^{2}$ with $P_{W}=\mathcal{N}(0,1)$ . Nevertheless, the solution of Maillard et al. (2024a); Xu et al. (2025) is recovered from our equations by enforcing a vanishing overlap $\mathcal{Q}_{W}(\mathsf{v})=0$ , i.e., via its universal branch.
Bayes generalisation error
Another main result is an approximate formula for the generalisation error. Let $({\mathbf{W}}^{a})_{a\geq 1}$ be i.i.d. samples from the posterior $dP(\,\cdot\mid\mathcal{D})$ and ${\mathbf{W}}^{0}$ the teacher’s weights. Assuming that the joint law of $(\lambda_{\rm test}({\mathbf{W}}^{a},{\mathbf{x}}_{\rm test}))_{a\geq 0}=:(\lambda^{a})_{a\geq 0}$ for a common test input ${\mathbf{x}}_{\rm test}\notin\mathcal{D}$ is a centred Gaussian, our framework predicts its covariance. Our approximation for the Bayes error follows.
**Result 2.2 (Bayes generalisation error)**
*Let $q_{K}^{*}=q_{K}(q_{2}^{*},\mathcal{Q}_{W}^{*})$ where $(q_{2}^{*},\hat{q}_{2}^{*},\mathcal{Q}_{W}^{*},\hat{\mathcal{Q}}_{W}^{*})$ is an extremiser of $f_{\rm RS}^{\alpha,\gamma}$ as in Result 2.1. Assuming joint Gaussianity of the post-activations $(\lambda^{a})_{a\geq 0}$ , in the scaling limit (4) their mean is zero and their covariance is approximated by $\mathbb{E}\lambda^{a}\lambda^{b}=q_{K}^{*}+(r_{K}-q_{K}^{*})\delta_{ab}=:(\mathbf{\Gamma})_{ab}$ , see App. C. Assume $\mathcal{C}$ has the series expansion $\mathcal{C}(y,\hat{y})=\sum_{i\geq 0}c_{i}(y)\hat{y}^{i}$ . The Bayes error $\smash{\lim\,\varepsilon^{\mathcal{C},\mathsf{f}}}$ is approximated by
$$
\textstyle\mathbb{E}_{(\lambda^{a})\sim\mathcal{N}(\mathbf{0},\mathbf{\Gamma})}\,\mathbb{E}_{y_{\rm test}\sim P_{\rm out}(\,\cdot\mid\lambda^{0})}\sum_{i\geq 0}c_{i}(y_{\rm test})\prod_{a=1}^{i}\mathsf{f}(\lambda^{a}).
$$
Letting $\mathbb{E}[\,\cdot\mid\lambda]=\int dy\,(\,\cdot\,)\,P_{\rm out}(y\mid\lambda)$ , the Bayes-optimal mean-square generalisation error $\smash{\lim\,\varepsilon^{\rm opt}}$ is approximated by
$$
\textstyle\mathbb{E}_{\lambda^{0},\lambda^{1}}\big(\mathbb{E}[y^{2}\mid\lambda^{0}]-\mathbb{E}[y\mid\lambda^{0}]\,\mathbb{E}[y\mid\lambda^{1}]\big). \tag{7}
$$*
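For the running example of linear readout with Gaussian label noise, $\mathbb{E}[y\mid\lambda]=\lambda$ and $\mathbb{E}[y^{2}\mid\lambda]=\lambda^{2}+\Delta$, so (7) reduces to $r_{K}+\Delta-q_{K}^{*}$. A Monte Carlo sketch of (7) under this channel (`bayes_mse` is our own helper; the closed form provides a sanity check):

```python
import numpy as np

def bayes_mse(qK_star, rK, Delta, n_samples=400_000, seed=0):
    """Monte Carlo evaluation of (7) for the linear readout with Gaussian
    label noise of variance Delta. The pair (lambda^0, lambda^1) is drawn
    from the centred Gaussian with covariance Gamma of Result 2.2; the
    closed form rK + Delta - qK_star follows from the channel's moments."""
    rng = np.random.default_rng(seed)
    cov = np.array([[rK, qK_star], [qK_star, rK]])     # Gamma restricted to a, b in {0, 1}
    lam = rng.multivariate_normal(np.zeros(2), cov, size=n_samples)
    # integrand of (7): E[y^2 | lam0] - E[y | lam0] E[y | lam1]
    return np.mean(lam[:, 0] ** 2 + Delta - lam[:, 0] * lam[:, 1])
```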
This result assumes that $\mu_{0}=0$ ; see App. D.5 if this is not the case. Results 2.1 and 2.2 provide an effective theory for the generalisation capabilities of Bayesian shallow networks with generic activation. We call these “results” because, despite their excellent match with numerics, we do not expect these formulas to be exact: their derivation is based on an unconventional mix of spin glass techniques and spherical integrals, and it requires approximations in order to deal with the fact that the degrees of freedom to integrate are large matrices of extensive rank. This is in contrast with simpler (vector) models (perceptrons, multi-index models, etc.), where replica formulas are routinely proved correct, see e.g. Barbier & Macris (2019); Barbier et al. (2019); Aubin et al. (2018).
Figure 1: Theoretical prediction (solid curves) of the Bayes-optimal mean-square generalisation error for Gaussian inner weights with ReLU(x) activation (blue curves) and Tanh(2x) activation (red curves), $d=150,\gamma=0.5$, with linear readout with Gaussian label noise of variance $\Delta=0.1$ and different $P_{v}$ laws. The dashed lines are the theoretical predictions associated with the universal solution, obtained by plugging $\mathcal{Q}_{W}(\mathsf{v})=0\ \forall\ \mathsf{v}$ in (6) and extremising w.r.t. $(q_{2},\hat{q}_{2})$ (the curve coincides with the optimal one before the transition $\alpha_{\rm sp}(\gamma)$). The numerical points are obtained with Hamiltonian Monte Carlo (HMC) with informative initialisation on the target (empty circles), with uninformative (random) initialisation (empty crosses), and with ADAM (thin crosses). Triangles are the error of the GAMP-RIE (Maillard et al., 2024a) extended to generic activation, obtained by plugging estimator (109) into (3) in appendix. Each point has been averaged over 10 instances of the teacher and training set. Error bars are the standard deviation over instances. The generalisation error for a given training set is evaluated as $\frac{1}{2}\mathbb{E}_{{\mathbf{x}}_{\rm test}\sim\mathcal{N}(0,I_{d})}(\lambda_{\rm test}({\mathbf{W}})-\lambda_{\rm test}^{0})^{2}$, using a single sample ${\mathbf{W}}$ from the posterior for HMC. For ADAM, with batch size fixed to $n/5$ and initial learning rate $0.05$, the error corresponds to the lowest one reached during training, i.e., we use early stopping based on the minimum test loss over all gradient updates. Its generalisation error is then evaluated at this point and divided by two (for comparison with the theory). The average over ${\mathbf{x}}_{\rm test}$ is computed empirically from $10^{5}$ i.i.d. test samples. We exploit that, for typical posterior samples, the Gibbs error $\varepsilon^{\rm Gibbs}$ defined in (39) in App. C is linked to the Bayes-optimal error by $(\varepsilon^{\rm Gibbs}-\Delta)/2=\varepsilon^{\rm opt}-\Delta$, see (40) in appendix. To use this formula, we assume the concentration of the Gibbs error w.r.t. the posterior distribution, in order to evaluate it from a single sample per instance. Left: homogeneous readouts $P_{v}=\delta_{1}$. Centre: 4-point readouts $P_{v}=\frac{1}{4}(\delta_{-3/\sqrt{5}}+\delta_{-1/\sqrt{5}}+\delta_{1/\sqrt{5}}+\delta_{3/\sqrt{5}})$. Right: Gaussian readouts $P_{v}=\mathcal{N}(0,1)$.
# 3 Theoretical predictions and numerical experiments
Let us compare our theoretical predictions with simulations. In Figs. 1 and 2, we report the theoretical curves from Result 2.2, focusing on the optimal mean-square generalisation error for networks with different $\sigma$, with a linear readout with Gaussian noise of variance $\Delta$. The Gibbs error divided by $2$ is used to compute the optimal error, see Remark C.2 in App. C for a justification. In what follows, the error attained by ADAM is also divided by two, only for the purpose of comparison.
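Concretely, the conversion follows the relation $(\varepsilon^{\rm Gibbs}-\Delta)/2=\varepsilon^{\rm opt}-\Delta$ from (40) in appendix; a minimal helper, under the assumption (used throughout the figures) that the Gibbs error concentrates so a single posterior sample suffices:

```python
def bayes_error_from_gibbs(eps_gibbs: float, Delta: float) -> float:
    """Bayes-optimal error from a measured Gibbs error via
    (eps_gibbs - Delta)/2 = eps_opt - Delta, i.e. eps_opt = (eps_gibbs + Delta)/2.
    Assumes the Gibbs error concentrates w.r.t. the posterior."""
    return (eps_gibbs + Delta) / 2.0

# e.g. label-noise variance Delta = 0.1 and a measured Gibbs error of 0.3:
print(bayes_error_from_gibbs(0.3, 0.1))  # 0.2
```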
Figure 1 focuses on networks with Gaussian inner weights, various readout laws, for $\sigma(x)={\rm ReLU}(x)$ and ${\rm Tanh}(2x)$ . Informative (i.e., on the teacher) and uninformative (random) initialisations are used when sampling the posterior by HMC. We also run ADAM, always selecting its best performance over all epochs, and implemented an extension of the GAMP-RIE of Maillard et al. (2024a) for generic activation (see App. H). It can be shown analytically that GAMP-RIE’s generalisation error asymptotically (in $d$ ) matches the prediction of the universal branch of our theory (i.e., associated with $\mathcal{Q}_{W}(\mathsf{v})=0\ ∀\ \mathsf{v}$ ).
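The empirical evaluation behind these points amounts to the half mean-square error between the teacher's post-activation and that of a trained or sampled student, $\lambda({\mathbf{W}})=\frac{1}{\sqrt{k}}{\mathbf{v}}^{\intercal}\sigma({\mathbf{W}}{\mathbf{x}}/\sqrt{d})$, on fresh Gaussian inputs. A minimal numpy sketch with hypothetical sizes and a hypothetical "posterior sample" (a noisy copy of the teacher, standing in for an HMC sample):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n_test = 50, 25, 20_000  # hypothetical sizes (gamma = k/d = 0.5)

relu = lambda z: np.maximum(z, 0.0)

def post_activation(W, v, X, sigma=relu):
    # lambda(W) = v^T sigma(W x / sqrt(d)) / sqrt(k), vectorised over test inputs X
    return sigma(X @ W.T / np.sqrt(W.shape[1])) @ v / np.sqrt(len(v))

W0 = rng.standard_normal((k, d))             # teacher inner weights
v = np.ones(k)                               # homogeneous readout P_v = delta_1
W = W0 + 0.1 * rng.standard_normal((k, d))   # hypothetical posterior sample

# half mean-square generalisation error, averaged over i.i.d. Gaussian test inputs
X_test = rng.standard_normal((n_test, d))
err = 0.5 * np.mean((post_activation(W, v, X_test) - post_activation(W0, v, X_test)) ** 2)
print(err)
```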
For ReLU activation and homogeneous readouts (left panel), informed HMC follows the specialisation branch (the solution of the saddle point equations with $\mathcal{Q}_{W}(\mathsf{v})\neq 0$ for at least one $\mathsf{v}$), while with uninformative initialisation it sticks to the universal branch, thus suggesting algorithmic hardness. We shall come back to this matter below. We note that the error attained by ADAM (divided by 2) is close to the performance associated with the universal branch, which suggests that ADAM is an effective Gibbs estimator for this $\sigma$. For Tanh and homogeneous readouts, both the uninformative and informative points lie on the specialisation branch, while ADAM attains an error greater than twice the posterior sample’s generalisation error.
For non-homogeneous readouts (centre and right panels), the points associated with the informative initialisation lie consistently on the specialisation branch for both ${\rm ReLU}$ and Tanh, while the uninformatively initialised samples have a slightly worse performance for Tanh. Non-homogeneous readouts improve ADAM's performance: for Gaussian readouts and high sampling ratio, its half-generalisation error is consistently below the error associated with the universal branch of the theory.
Figure 2 concerns networks with Rademacher weights and homogeneous readout. The numerical points are of two kinds: the dots, obtained from Metropolis–Hastings sampling of the weight posterior, and the circles, obtained from the GAMP-RIE (App. H). We report analogous simulations for ${\rm ReLU}$ and ${\rm ELU}$ activations in Figure 7, App. H. The remarkable agreement between theoretical curves and experimental points in both phases supports the assumptions used in Sec. 4.
Figure 2: Theoretical prediction (solid curves) of the Bayes-optimal mean-square generalisation error for binary inner weights and polynomial activations: $\sigma_{1}={\rm He}_{2}/\sqrt{2}$, $\sigma_{2}={\rm He}_{3}/\sqrt{6}$, $\sigma_{3}={\rm He}_{2}/\sqrt{2}+{\rm He}_{3}/6$, with $\gamma=0.5$, $d=150$, linear readout with Gaussian label noise of variance $\Delta=1.25$, and homogeneous readouts ${\mathbf{v}}=\mathbf{1}$. Dots are optimal errors computed via Gibbs errors (see Fig. 1) by running a Metropolis–Hastings MCMC initialised near the teacher. Circles are the error of the GAMP-RIE (Maillard et al., 2024a) extended to generic activation, see App. H. Points are averaged over 16 data instances. Error bars for MCMC are the standard deviation over instances (omitted for GAMP-RIE, but of the same order). Dashed and dotted lines denote, respectively, the universal branch (i.e., the $\mathcal{Q}_{W}(\mathsf{v})=0\ \forall\ \mathsf{v}$ solution of the saddle point equations) and the specialisation branch where they are metastable (i.e., a local maximiser of the RS potential but not the global one).
Figure 3 illustrates the learning mechanism for models with Gaussian weights and non-homogeneous readouts, revealing a sequence of phase transitions as $\alpha$ increases. The top panel shows the overlap function $\mathcal{Q}_{W}(\mathsf{v})$ in the case of Gaussian readouts for four different sample rates $\alpha$. In the bottom panel the readout takes four different values with equal probabilities; the figure shows the evolution of the two relevant overlaps associated with the symmetric readout values $\pm 3/\sqrt{5}$ and $\pm 1/\sqrt{5}$. As $\alpha$ increases, the student weights start aligning with the teacher weights associated with the highest readout amplitude, marking the first phase transition. As these alignments strengthen with further increase of $\alpha$, a second transition occurs when the weights corresponding to the next largest readout amplitude are learnt, and so on. In this way, continuous readouts produce an infinite sequence of learning transitions, as supported by the upper part of Figure 3.
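The binned estimator of $\mathcal{Q}_{W}(\mathsf{v})$ used in the top panel of Figure 3 (50 bins over $[-2,2]$, see its caption) can be sketched as follows; the matching of student to teacher neurons is assumed to be the identity (as after informative initialisation), and sizes are hypothetical:

```python
import numpy as np

def overlap_profile(W, W0, v0, n_bins=50, lo=-2.0, hi=2.0):
    """Binned estimate of Q_W(v): average per-neuron overlap W_i . W0_i / d over
    teacher neurons whose readout v0_i falls in each bin of [lo, hi].
    Assumes student neuron i is matched to teacher neuron i."""
    d = W.shape[1]
    omega = np.einsum('ij,ij->i', W, W0) / d              # per-neuron overlaps
    edges = np.linspace(lo, hi, n_bins + 1)
    idx = np.clip(np.digitize(v0, edges) - 1, 0, n_bins - 1)
    q = np.full(n_bins, np.nan)                            # NaN marks empty bins
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            q[b] = omega[mask].mean()
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, q

# toy check: a perfectly specialised student (W = W0, Gaussian weights)
# should give Q_W(v) = ||W0_i||^2 / d, close to 1 in every occupied bin
rng = np.random.default_rng(2)
d, k = 100, 50
W0 = rng.standard_normal((k, d))
v0 = rng.standard_normal(k)       # Gaussian readouts P_v = N(0,1)
centers, q = overlap_profile(W0, W0, v0)
```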
Even when dominating the posterior measure, we observe in simulations that the specialisation solution can be algorithmically hard to reach. With a discrete distribution of readouts (such as $P_{v}=\delta_{1}$ or Rademacher), simulations for binary inner weights exhibit it only when sampling with informative initialisation (i.e., the MCMC runs to sample ${\mathbf{W}}$ are initialised in the vicinity of ${\mathbf{W}}^{0}$ ). Moreover, even in cases where algorithms (such as ADAM or HMC for Gaussian inner weights) are able to find the specialisation solution, they manage to do so only after a training time increasing exponentially with $d$ , and for relatively small values of the label noise $\Delta$ , see discussion in App. I. For the case of the continuous distribution of readouts $P_{v}={\mathcal{N}}(0,1)$ , our numerical results are inconclusive on hardness, and deserve numerical investigation at a larger scale.
The universal phase is superseded at $\alpha_{\rm sp}$ by a specialisation phase, where the student’s inner weights start aligning with the teacher's. This transition occurs for both binary and Gaussian priors over the inner weights, and it is different in nature from the perfect recovery threshold identified in Maillard et al. (2024a), which is the point where the student with Gaussian weights learns perfectly ${\mathbf{W}}^{0\intercal}({\mathbf{v}}){\mathbf{W}}^{0}$ (but not ${\mathbf{W}}^{0}$) and thus attains perfect generalisation in the case of purely quadratic activation and noiseless labels. For large $\alpha$, the student somehow realises that the higher order terms of the activation’s Hermite decomposition are not label noise, but are informative on the decision rule. The two identified phases are akin to those recently described in Barbier et al. (2025) for matrix denoising. The model we consider is also a matrix model in ${\mathbf{W}}$, with the amount of data scaling as the number of matrix elements. When data are scarce, the student cannot break the numerous symmetries of the problem, resulting in an “effective rotational invariance” at the source of the prior universality, with posterior samples having a vanishing overlap with ${\mathbf{W}}^{0}$. On the other hand, when data are sufficiently abundant, $\alpha>\alpha_{\rm sp}$, there is a “synchronisation” of the student’s samples with the teacher.
Figure 3: Top: Theoretical prediction (solid curves) of the overlap function $\mathcal{Q}_{W}(\mathsf{v})$ for different sampling ratios $\alpha$ for Gaussian inner weights, ReLU(x) activation, $d=150,\gamma=0.5$, linear readout with $\Delta=0.1$ and $P_{v}=\mathcal{N}(0,1)$. The shaded curves were obtained from HMC initialised informatively. Using a single sample ${\mathbf{W}}^{a}$ from the posterior, $\mathcal{Q}_{W}(\mathsf{v})$ has been evaluated numerically by dividing the interval $[-2,2]$ into 50 bins and computing the value of the overlap associated with each bin. Each point has been averaged over 50 instances of the training set, and shaded regions around them correspond to one standard deviation. Bottom: Theoretical prediction (solid curves) of the overlaps as a function of the sampling ratio $\alpha$ for Gaussian inner weights, Tanh(2x) activation, $d=150,\gamma=0.5$, linear readout with $\Delta=0.1$ and $P_{v}=\frac{1}{4}(\delta_{-3/\sqrt{5}}+\delta_{-1/\sqrt{5}}+\delta_{1/\sqrt{5}}+\delta_{3/\sqrt{5}})$. The shaded curves were obtained from informed HMC. Each point has been averaged over 10 instances of the training set, with one standard deviation depicted.
The phenomenology observed depends on the chosen activation function. In particular, by expanding $\sigma$ in the Hermite basis we realise that the way its first three terms enter information-theoretic quantities is completely described by tensors of order 0, 1 and 2, later defined in (12), which give rise to combinations of the inner and readout weights. In the regime of quadratically many data, the order-0 and order-1 tensors are recovered exactly by the student because of the overwhelming abundance of data compared to their dimension. The challenge is thus to learn the order-2 tensor. On the contrary, we claim that learning any higher-order tensor can only happen when the student aligns its weights with ${\mathbf{W}}^{0}$: before this “synchronisation”, they play the role of an effective noise. This is the mechanism behind the specialisation transition. For odd activations (${\rm Tanh}$ in Figure 1, $\sigma_{3}$ in Figure 2), where $\mu_{2}=0$, the aforementioned order-2 tensor no longer contributes to the learning. Indeed, we observe numerically that the generalisation error sticks to a constant value for $\alpha<\alpha_{\rm sp}$, whereas at the phase transition it suddenly drops. This is because the learning of the order-2 tensor is skipped entirely, and the only chance to perform better is to learn all the other higher-order tensors through specialisation.
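The Hermite coefficients $\mu_{l}=\mathbb{E}_{Z\sim\mathcal{N}(0,1)}[\sigma(Z)\,{\rm He}_{l}(Z)]$ driving this discussion are easy to compute numerically; a minimal sketch via Gauss-Hermite quadrature (the node count and the two activations are illustrative), showing e.g. that even-order coefficients such as $\mu_{0}$ and $\mu_{2}$ vanish for an odd activation like ${\rm Tanh}(2x)$ but not for ReLU:

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval  # probabilists' He_l
from numpy.polynomial.hermite import hermgauss   # Gauss-Hermite nodes/weights

def hermite_coeffs(sigma, l_max=4, n_nodes=120):
    """mu_l = E_{Z~N(0,1)}[sigma(Z) He_l(Z)] by Gauss-Hermite quadrature."""
    t, w = hermgauss(n_nodes)     # quadrature for integrals against exp(-t^2)
    z = np.sqrt(2.0) * t          # change of variables to Z ~ N(0,1)
    return np.array([np.sum(w * sigma(z) * hermeval(z, np.eye(l_max + 1)[l]))
                     / np.sqrt(np.pi) for l in range(l_max + 1)])

mu_tanh = hermite_coeffs(lambda x: np.tanh(2 * x))     # odd: mu_0 = mu_2 = 0
mu_relu = hermite_coeffs(lambda x: np.maximum(x, 0.0)) # mu_2 = 1/sqrt(2 pi) > 0
print(np.round(mu_tanh, 4))
print(np.round(mu_relu, 4))
```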
By extrapolating universality results to generic activations, we are able to use the GAMP-RIE of Maillard et al. (2024a), publicly available at Maillard et al. (2024b), to obtain a polynomial-time predictor for test data. Its generalisation error follows our universal theoretical curve even in the $\alpha$ regime where MCMC sampling experiences a computationally hard phase with worse performance (for binary weights), and in particular after $\alpha_{\rm sp}$ (see Fig. 2, circles). Extending this algorithm, initially proposed for quadratic activation, to a generic one is possible thanks to the identification of an effective GLM onto which the learning problem can be mapped (while the mapping is exact when $\sigma(x)=x^{2}$ as exploited by Maillard et al. (2024a)), see App. H. The key observation is that our effective GLM representation holds not only from a theoretical perspective when describing the universal phase, but also algorithmically.
Finally, we emphasise that our theory is consistent with Cui et al. (2023), which considers the simpler strongly over-parametrised regime $n=\Theta(d)$ rather than the interpolation one $n=\Theta(d^{2})$: our generalisation curves at $\alpha\to 0$ match theirs at $\alpha_{1}:=n/d\to\infty$, which is when the student learns perfectly the combinations ${\mathbf{v}}^{0\intercal}{\mathbf{W}}^{0}/\sqrt{k}$ (but nothing more).
# 4 Accessing the free entropy and generalisation error: replica method and spherical integration combined
The goal is to compute the asymptotic free entropy by the replica method (Mezard et al., 1986), a powerful heuristic from spin glasses also used in machine learning (Engel & Van den Broeck, 2001), combined with the HCIZ integral. Our derivation is based on a Gaussian ansatz on the replicated post-activations of the hidden layer, which generalises Conjecture 3.1 of Cui et al. (2023), now proved in Camilli et al. (2025), where it is specialised to the case of linearly many data ($n=\Theta(d)$). To obtain this generalisation, we write the kernel arising from the covariance of the aforementioned post-activations as an infinite series of scalar order parameters derived from the expansion of the activation function in the Hermite basis, following an approach recently devised in Aguirre-López et al. (2025) in the context of the random features model (see also Hu et al. (2024) and Ghorbani et al. (2021)). Another key ingredient of our analysis is a generalisation of an ansatz used in the replica method by Sakata & Kabashima (2013) for dictionary learning.
4.1 Replicated system and order parameters
The starting point in the replica method to tackle the data average is the replica trick:
$$
\lim\frac{1}{n}\mathbb{E}\ln{\mathcal{Z}}(\mathcal{D})=\lim\,\lim_{s→ 0^{+}}\frac{1}{ns}\ln\mathbb{E}\mathcal{Z}^{s}=\lim_{s→ 0^{+}}\lim\,\frac{1}{ns}\ln\mathbb{E}\mathcal{Z}^{s}
$$
assuming the limits commute. Recall ${\mathbf{W}}^{0}$ are the teacher weights. Consider first $s∈\mathbb{N}^{+}$ . Define the “replicas” of the post-activation as $\{\lambda^{a}({\mathbf{W}}^{a}):=\frac{1}{\sqrt{k}}{\mathbf{v}}^{\intercal}\sigma(\frac{1}{\sqrt{d}}{\mathbf{W}}^{a}{\mathbf{x}})\}_{a=0,\ldots,s}$ . We then directly obtain
$$
\mathbb{E}\mathcal{Z}^{s}=\mathbb{E}_{{\mathbf{v}}}\int\prod_{a=0}^{s}dP_{W}({\mathbf{W}}^{a})\Big[\mathbb{E}_{\mathbf{x}}\int dy\prod_{a=0}^{s}P_{\rm out}(y\mid\lambda^{a}({\mathbf{W}}^{a}))\Big]^{n}.
$$
The key is to identify the law of the replicas $\{\lambda^{a}\}_{a=0,...,s}$ , which are dependent random variables due to the common random Gaussian input ${\mathbf{x}}$ , conditionally on $({\mathbf{W}}^{a})$ . Our key hypothesis is that $\{\lambda^{a}\}$ is jointly Gaussian, an ansatz we cannot prove but that we validate a posteriori thanks to the excellent match between our theory and the empirical generalisation curves, see Sec. 2.2. Similar Gaussian assumptions have been the crux of a whole line of recent works on the analysis of neural networks, and are now known under the name of “Gaussian equivalence” (Goldt et al., 2020; Hastie et al., 2022; Mei & Montanari, 2022; Goldt et al., 2022; Hu & Lu, 2023). This can also sometimes be heuristically justified based on Breuer–Major Theorems (Nourdin et al., 2011; Pacelli et al., 2023).
Given two replica indices $a,b∈\{0,...,s\}$ we define the neuron-neuron overlap matrix $\Omega^{ab}_{ij}:={\mathbf{W}}_{i}^{a\intercal}{\mathbf{W}}^{b}_{j}/d$ with $i,j∈[k]$ . Recalling the Hermite expansion of $\sigma$ , by using Mehler’s formula, see App. A, the post-activations covariance $K^{ab}:=\mathbb{E}\lambda^{a}\lambda^{b}$ reads
$$
K^{ab}=\sum_{\ell\geq 1}\frac{\mu^{2}_{\ell}}{\ell!}Q_{\ell}^{ab}\ \ \text{with}\ \ Q_{\ell}^{ab}:=\frac{1}{k}\sum_{i,j\leq k}v_{i}v_{j}(\Omega^{ab}_{ij})^{\ell}. \tag{8}
$$
This covariance ${\mathbf{K}}$ is complicated but, as we argue below, simplifications occur as $d→∞$ . In particular, the first two overlaps $Q_{1}^{ab},Q_{2}^{ab}$ are special. We claim that the higher-order overlaps $(Q_{\ell}^{ab})_{\ell≥ 3}$ can be simplified as functions of simpler order parameters.
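As a sanity check of the Mehler expansion behind (8), the series $\sum_{\ell}\mu_{\ell}^{2}\rho^{\ell}/\ell!$ can be compared numerically against the known closed form of $\mathbb{E}[\sigma(U)\sigma(V)]$ for correlated standard Gaussians when $\sigma=\mathrm{ReLU}$ (the arc-cosine kernel of degree one). The sketch below is illustrative and not from the paper's code; the closed-form Hermite coefficients of ReLU used here are standard facts.

```python
from math import factorial, sqrt, pi, acos

# He_l(0) in the probabilist's convention: 0 for odd l, (-1)^m (2m-1)!! for l = 2m.
def He0(l):
    if l % 2:
        return 0.0
    m = l // 2
    return (-1) ** m * factorial(2 * m) / (2 ** m * factorial(m))

# Hermite coefficients mu_l = E[ReLU(Z) He_l(Z)], known in closed form:
# mu_0 = 1/sqrt(2 pi), mu_1 = 1/2, mu_l = He_{l-2}(0)/sqrt(2 pi) for l >= 2.
L = 14
mu = [1 / sqrt(2 * pi), 0.5] + [He0(l - 2) / sqrt(2 * pi) for l in range(2, L)]

# Mehler's formula: E[sigma(U) sigma(V)] = sum_l mu_l^2 rho^l / l! for standard
# Gaussians (U, V) with correlation rho. Eq. (8) keeps l >= 1; here we include
# the l = 0 mean term so as to compare with the raw second moment.
rho = 0.5
series = sum(mu[l] ** 2 * rho ** l / factorial(l) for l in range(L))

# Closed form for ReLU (arc-cosine kernel of degree 1):
exact = (sqrt(1 - rho ** 2) + rho * (pi - acos(rho))) / (2 * pi)
print(series, exact)  # both ~0.30450
```

The rapid decay of $\mu_{\ell}^{2}\rho^{\ell}/\ell!$ illustrates why only the first few overlaps matter quantitatively at moderate correlation.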
Figure 4: Hamiltonian Monte Carlo dynamics of the overlaps $Q_{\ell}=Q_{\ell}^{01}$ between student and teacher weights for $\ell∈[5]$ , with activation function ReLU(x), $d=200$ , $\gamma=0.5$ , linear readout with $\Delta=0.1$ , and two choices of sample rate and readout prior: $\alpha=1.0$ with $P_{v}=\delta_{1}$ (Left) and $\alpha=3.0$ with $P_{v}=\mathcal{N}(0,1)$ (Right). The teacher weights ${\mathbf{W}}^{0}$ are Gaussian. The dynamics is initialised informatively, i.e., on ${\mathbf{W}}^{0}$ . The overlap $Q_{1}$ always fluctuates around 1. Left: the overlaps $Q_{\ell}$ for $\ell≥ 3$ converge to 0 at equilibrium, while $Q_{2}$ is well estimated by the theory (orange dashed line). Right: at the higher sample rate $\alpha$ , the $Q_{\ell}$ for $\ell≥ 3$ are also non-zero and agree with their theoretical predictions (dashed lines). Insets show the mean-square generalisation error and the theoretical prediction.
4.2 Simplifying the order parameters
In this section we show how to drastically reduce the number of order parameters to track. Assume for the moment that the readout prior $P_{v}$ has discrete support $\mathsf{V}=\{\mathsf{v}\}$ ; this can be relaxed by binning a continuous support, as mentioned in Sec. 2.2. The overlaps in (8) can be written as
$$
Q_{\ell}^{ab}=\frac{1}{k}\sum_{\mathsf{v},\mathsf{v}^{\prime}\in\mathsf{V}}\mathsf{v}\,\mathsf{v}^{\prime}\sum_{\{i,j\leq k\mid v_{i}=\mathsf{v},\,v_{j}=\mathsf{v}^{\prime}\}}(\Omega_{ij}^{ab})^{\ell}. \tag{9}
$$
In the following, for $\ell≥ 3$ we discard the terms $\mathsf{v}≠\mathsf{v}^{\prime}$ in the above sum, assuming they are suppressed w.r.t. the diagonal ones. In other words, a neuron ${\mathbf{W}}^{a}_{i}$ of a student (replica) with readout value $v_{i}=\mathsf{v}$ is assumed to possibly align only with neurons of the teacher (or, by Bayes-optimality, of another replica) with the same readout. Moreover, in the resulting sum over the neuron indices $\{i,j\mid v_{i}=v_{j}=\mathsf{v}\}$ , we assume that, for each $i$ , a single index $j=\pi_{i}$ , with $\pi$ a permutation, contributes at leading order. The model is symmetric under permutations of hidden neurons, so we take $\pi$ to be the identity without loss of generality.
We now assume that for Hadamard powers $\ell≥ 3$ , the off-diagonal of the overlap $({\bm{\Omega}}^{ab})^{\circ\ell}$ , obtained from typical weight matrices sampled from the posterior, is sufficiently small to consider it diagonal in any quadratic form. Moreover, by exchangeability among neurons with the same readout value, we further assume that all diagonal elements $\{\Omega_{ii}^{ab}\mid i∈\mathcal{I}_{\mathsf{v}}\}$ concentrate onto the constant $\mathcal{Q}_{W}^{ab}(\mathsf{v})$ , where $\mathcal{I}_{\mathsf{v}}:=\{i≤ k\mid v_{i}=\mathsf{v}\}$ :
$$
(\Omega_{ij}^{ab})^{\ell}=\Big(\frac{1}{d}{\mathbf{W}}_{i}^{a\intercal}{\mathbf{W}}^{b}_{j}\Big)^{\ell}\approx\delta_{ij}\,\mathcal{Q}_{W}^{ab}(\mathsf{v})^{\ell} \tag{10}
$$
if $\ell≥ 3$ , $i\ \text{or}\ j∈\mathcal{I}_{\mathsf{v}}$ . Approximate equality here is up to a matrix with $o_{d}(1)$ norm. The same happens, e.g., for a standard Wishart matrix: its eigenvectors and the ones of its square Hadamard power are delocalised, while for higher Hadamard powers $\ell≥ 3$ its eigenvectors are strongly localised; this is why $Q_{2}^{ab}$ will require a separate treatment. With these simplifications we can write
$$
Q_{\ell}^{ab}=\mathbb{E}_{v\sim P_{v}}[v^{2}\mathcal{Q}_{W}^{ab}(v)^{\ell}]+o_{d}(1)\ \text{for}\ \ell\geq 3. \tag{11}
$$
This is verified numerically a posteriori as follows. Identity (11) is true (without the $o_{d}(1)$ ) for the predicted theoretical values of the order parameters by construction of our theory. Fig. 3 verifies the good agreement between the theoretical and experimental overlap profiles $\mathcal{Q}^{01}_{W}(\mathsf{v})$ for all $\mathsf{v}∈\mathsf{V}$ (which is statistically the same as $\smash{\mathcal{Q}^{ab}_{W}(\mathsf{v})}$ for any $a≠ b$ by the so-called Nishimori identity following from Bayes-optimality, see App. B), while Fig. 4 verifies the agreement at the level of the $(Q_{\ell}^{ab})$ . Consequently, (11) is also true for the experimental overlaps.
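The diagonal-dominance assumption for Hadamard powers $\ell≥ 3$ can also be illustrated directly on a synthetic instance. The NumPy sketch below (illustrative, not from the paper's code) takes a Wishart-like self-overlap ${\bm{\Omega}}={\mathbf{W}}{\mathbf{W}}^{\intercal}/d$ with Gaussian ${\mathbf{W}}$ and measures the operator norm of the off-diagonal part of ${\bm{\Omega}}^{\circ\ell}$ : for $\ell=1,2$ it stays of order one as $d$ grows, while for $\ell=3$ it shrinks.

```python
import numpy as np

# Operator norm of the off-diagonal part of the Hadamard powers of a
# Wishart-like overlap matrix Omega = W W^T / d, for growing d at fixed
# aspect ratio gamma = k/d. Sizes here are illustrative.
rng = np.random.default_rng(0)
gamma = 0.5
norms = {}
for d in (100, 400):
    k = int(gamma * d)
    W = rng.standard_normal((k, d))
    Omega = W @ W.T / d
    off = Omega - np.diag(np.diag(Omega))          # strip the diagonal
    for l in (1, 2, 3):
        norms[(d, l)] = np.linalg.norm(off ** l, ord=2)   # largest singular value
        print(d, l, round(norms[(d, l)], 3))
```

For $\ell=2$ the surviving order-one contribution comes from the non-zero mean $1/d$ of the squared off-diagonal entries (a rank-one component of norm $\approx\gamma$ ), consistent with the separate treatment that $Q_{2}^{ab}$ requires in the text.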
It is convenient to define the symmetric tensors ${\mathbf{S}}_{\ell}^{a}$ with entries
$$
S^{a}_{\ell;\alpha_{1}\ldots\alpha_{\ell}}:=\frac{1}{\sqrt{k}}\sum_{i\leq k}v_{i}W^{a}_{i\alpha_{1}}\cdots W^{a}_{i\alpha_{\ell}}. \tag{12}
$$
Indeed, the generic $\ell$ -th term of the series (8) can be written as the overlap $Q^{ab}_{\ell}=\langle{\mathbf{S}}^{a}_{\ell},{\mathbf{S}}^{b}_{\ell}\rangle/d^{\ell}$ of these tensors (where $\langle\,,\,\rangle$ is the inner product), e.g., $Q_{2}^{ab}={\rm Tr}\,{\mathbf{S}}_{2}^{a}{\mathbf{S}}_{2}^{b}/d^{2}$ . Given that the number of data is $n=\Theta(d^{2})$ while the $({\mathbf{S}}_{1}^{a})$ are only $d$ -dimensional, the latter are reconstructed perfectly (the same argument was used to argue that the readouts ${\mathbf{v}}$ can be considered quenched). We thus assume right away that at equilibrium the overlaps $Q_{1}^{ab}=1$ (or saturate to their maximum value; if tracked, the corresponding saddle point equations turn out to be trivial and do fix this). In other words, in the quadratic data regime the $\mu_{1}$ contribution in the Hermite decomposition of $\sigma$ for the target is perfectly learnable, while higher-order ones play a non-trivial role. In contrast, Cui et al. (2023) study the regime $n=\Theta(d)$ , where $\mu_{1}$ is the only learnable term.
Then, the average replicated partition function reads $\mathbb{E}\mathcal{Z}^{s}=\int d{\mathbf{Q}}_{2}\,d\bm{\mathcal{Q}}_{W}\exp(F_{S}+nF_{E})$ , where $F_{E},F_{S}$ depend on ${\mathbf{Q}}_{2}=(Q_{2}^{ab})$ and $\bm{\mathcal{Q}}_{W}:=\{\mathcal{Q}_{W}^{ab}\mid a≤ b\}$ , with $\mathcal{Q}_{W}^{ab}:=\{\mathcal{Q}_{W}^{ab}(\mathsf{v})\mid\mathsf{v}∈\mathsf{V}\}$ .
The “energetic potential” is defined as
$$
e^{nF_{E}}:=\Big(\int dy\,d{\bm{\lambda}}\,\frac{\exp(-\frac{1}{2}{\bm{\lambda}}^{\intercal}{\mathbf{K}}^{-1}{\bm{\lambda}})}{((2\pi)^{s+1}\det{\mathbf{K}})^{1/2}}\prod_{a=0}^{s}P_{\rm out}(y\mid\lambda^{a})\Big)^{n}. \tag{13}
$$
It takes this form due to our Gaussian assumption on the replicated post-activations and is thus easily computed, see App. D.1.
The “entropic potential” $F_{S}$ taking into account the degeneracy of the order parameters is obtained by averaging delta functions fixing their definitions w.r.t. the “microscopic degrees of freedom” $({\mathbf{W}}^{a})$ . It can be written compactly using the following conditional law over the tensors $({\mathbf{S}}_{2}^{a})$ :
$$
P(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W}):=V_{W}^{kd}(\bm{\mathcal{Q}}_{W})^{-1}\int\prod_{a=0}^{s}dP_{W}({\mathbf{W}}^{a})\prod_{a\leq b}^{0,s}\prod_{\mathsf{v}\in\mathsf{V}}\prod_{i\in\mathcal{I}_{\mathsf{v}}}\delta(d\,\mathcal{Q}_{W}^{ab}(\mathsf{v})-{\mathbf{W}}^{a\intercal}_{i}{\mathbf{W}}_{i}^{b})\prod_{a=0}^{s}\delta({\mathbf{S}}^{a}_{2}-{\mathbf{W}}^{a\intercal}({\mathbf{v}}){\mathbf{W}}^{a}/\sqrt{k}), \tag{14}
$$
with the normalisation
$$
V_{W}^{kd}:=\int\prod_{a}dP_{W}({\mathbf{W}}^{a})\prod_{a\leq b,\,\mathsf{v},\,i\in\mathcal{I}_{\mathsf{v}}}\delta(d\,\mathcal{Q}_{W}^{ab}(\mathsf{v})-{\mathbf{W}}^{a\intercal}_{i}{\mathbf{W}}_{i}^{b}).
$$
The entropy, which is the challenging term to compute, then reads
$$
e^{F_{S}}:=V_{W}^{kd}(\bm{\mathcal{Q}}_{W})\int dP(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W})\prod_{a\leq b}^{0,s}\delta(d^{2}Q_{2}^{ab}-{\rm Tr}\,{\mathbf{S}}_{2}^{a}{\mathbf{S}}_{2}^{b}).
$$
4.3 Tackling the entropy: measure simplification by moment matching
The delta functions above fixing $Q_{2}^{ab}$ induce quartic constraints between the weights degrees of freedom $(W_{i\alpha}^{a})$ instead of quadratic as in standard settings. A direct computation thus seems out of reach. However, we will exploit the fact that the constraints are quadratic in the matrices $({\mathbf{S}}_{2}^{a})$ . Consequently, shifting our focus towards $({\mathbf{S}}_{2}^{a})$ as the basic degrees of freedom to integrate rather than $(W_{i\alpha}^{a})$ will allow us to move forward by simplifying their measure (14). Note that while $(W_{i\alpha}^{a})$ are i.i.d. under their prior measure, $({\mathbf{S}}_{2}^{a})$ have coupled entries, even for a fixed replica index $a$ . This can be taken into account as follows.
Define $P_{S}$ as the probability density of a generalised Wishart random matrix, i.e., of $\tilde{\mathbf{W}}^{\intercal}({\mathbf{v}})\tilde{\mathbf{W}}/\sqrt{k}$ where $\tilde{\mathbf{W}}∈\mathbb{R}^{k× d}$ is made of i.i.d. standard Gaussian entries. The simplification we consider consists in replacing (14) by the effective measure
$$
\tilde{P}(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W}):=\frac{1}{\tilde{V}_{W}^{kd}}\prod_{a=0}^{s}P_{S}({\mathbf{S}}_{2}^{a})\prod_{a<b}^{0,s}e^{\frac{1}{2}\tau(\mathcal{Q}_{W}^{ab}){\rm Tr}\,{\mathbf{S}}^{a}_{2}{\mathbf{S}}^{b}_{2}} \tag{15}
$$
where $\tilde{V}_{W}^{kd}=\tilde{V}_{W}^{kd}(\bm{\mathcal{Q}}_{W})$ is the proper normalisation constant, and
$$
\tau(\mathcal{Q}_{W}^{ab}):=\text{mmse}_{S}^{-1}\big(1-\mathbb{E}_{v\sim P_{v}}[v^{2}\mathcal{Q}^{ab}_{W}(v)^{2}]\big). \tag{16}
$$
The rationale behind this choice goes as follows. The matrices $({\mathbf{S}}_{2}^{a})$ are, under the measure (14), $(i)$ generalised Wishart matrices, constructed from $(ii)$ non-Gaussian factors $({\mathbf{W}}^{a})$ , which $(iii)$ are coupled between different replicas, thus inducing a coupling among the replicas $({\mathbf{S}}^{a})$ . The proposed simplified measure captures all three aspects while remaining tractable, as we explain now. The first assumption is that in the measure (14) the details of the (centred, unit variance) prior $P_{W}$ enter only through $\bm{\mathcal{Q}}_{W}$ at leading order. Due to the conditioning, we can thus relax it to a Gaussian (with the same first two moments) by universality, as is often the case in random matrix theory. $P_{W}$ will instead explicitly enter in the entropy of $\bm{\mathcal{Q}}_{W}$ related to $V_{W}^{kd}$ . Point $(ii)$ is thus taken care of by the conditioning. Then, the generalised Wishart prior $P_{S}$ encodes $(i)$ and, finally, the exponential tilt in $\tilde{P}$ induces the replica couplings of point $(iii)$ . It remains to capture the dependence of the measure (14) on $\bm{\mathcal{Q}}_{W}$ . This is done by realising that
$$
\int dP(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W})\frac{1}{d^{2}}{\rm Tr}\,{\mathbf{S}}_{2}^{a}{\mathbf{S}}_{2}^{b}=\mathbb{E}_{v\sim P_{v}}[v^{2}\mathcal{Q}_{W}^{ab}(v)^{2}]+\gamma\bar{v}^{2}.
$$
This is shown in App. D.2. The Lagrange multiplier $\tau(\mathcal{Q}_{W}^{ab})$ to plug in $\tilde{P}$ , enforcing this moment matching condition between the true and simplified measures as $s→ 0^{+}$ , is (16), see App. D.3. For completeness, we provide in App. E alternatives to the simplification (15), whose analysis is left for future work.
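The moment matching target above can be checked quickly on a sampled instance. The sketch below is illustrative (not the paper's code): it takes the diagonal case $a=b$ with $\mathcal{Q}_{W}=1$ and the homogeneous readout $P_{v}=\delta_{1}$ , for which the right-hand side reduces to $\mathbb{E}[v^{2}]+\gamma\bar{v}^{2}=1+\gamma$ .

```python
import numpy as np

# Sample S_2 = W^T diag(v) W / sqrt(k) from the generalised Wishart
# construction and compare Tr S_2^2 / d^2 with E[v^2] + gamma * vbar^2.
rng = np.random.default_rng(2)
d, gamma = 500, 0.5
k = int(gamma * d)
v = np.ones(k)                                # P_v = delta_1: E[v^2] = vbar^2 = 1
W = rng.standard_normal((k, d))
S2 = W.T @ (v[:, None] * W) / np.sqrt(k)
moment = np.trace(S2 @ S2) / d ** 2
print(round(moment, 2))                       # ~ 1 + gamma = 1.5
```

The empirical value concentrates around $1+\gamma$ up to $O(1/d)$ corrections, matching the displayed moment.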
4.4 Final steps and spherical integration
Combining all our findings, the average replicated partition function is simplified as
$$
\mathbb{E}\mathcal{Z}^{s}=\int d{\mathbf{Q}}_{2}\,d\bm{\mathcal{Q}}_{W}\,e^{nF_{E}+kd\ln V_{W}(\bm{\mathcal{Q}}_{W})-kd\ln\tilde{V}_{W}(\bm{\mathcal{Q}}_{W})}.
$$
The equality should be interpreted as holding at leading exponential order $\exp(\Theta(n))$ , assuming the validity of our previous measure simplification. All remaining steps but the last are standard:
$(i)$ Express the delta functions fixing $\bm{\mathcal{Q}}_{W}$ and ${\mathbf{Q}}_{2}$ in exponential form using their Fourier representation; this introduces additional Fourier conjugate order parameters $\hat{\mathbf{Q}}_{2},\hat{\bm{\mathcal{Q}}}_{W}$ of same dimensions.
$(ii)$ Once this is done, the terms coupling different replicas of $({\mathbf{W}}^{a})$ or of $({\mathbf{S}}^{a})$ are all quadratic. Using the Hubbard–Stratonovich transformation (i.e., $\mathbb{E}_{{\mathbf{Z}}}\exp(\frac{d}{2}{\rm Tr}\,{\mathbf{M}}{\mathbf{Z}})=\exp(\frac{d}{4}{\rm Tr}\,{\mathbf{M}}^{2})$ for a $d× d$ symmetric matrix ${\mathbf{M}}$ , with ${\mathbf{Z}}$ a standard GOE matrix) therefore allows us to linearise all replica-replica coupling terms, at the price of introducing new Gaussian fields interacting with all replicas.
$(iii)$ After these manipulations, we identify at leading exponential order an effective action $\mathcal{S}$ depending on the order parameters only, which allows a saddle point integration w.r.t. them as $n→∞$ :
$$
\lim\frac{1}{ns}\ln\mathbb{E}\mathcal{Z}^{s}=\lim\frac{1}{ns}\ln\int d{\mathbf{Q}}_{2}\,d\hat{\mathbf{Q}}_{2}\,d\bm{\mathcal{Q}}_{W}\,d\hat{\bm{\mathcal{Q}}}_{W}\,e^{n\mathcal{S}}=\frac{1}{s}{\rm extr}\,\mathcal{S}.
$$
$(iv)$ Next, the replica limit $s→ 0^{+}$ of the previously obtained expression has to be considered. To do so, we make a replica symmetric assumption, i.e., we consider that at the saddle point, all order parameters entering the action $\mathcal{S}$ , and thus $K^{ab}$ too, take a simple form of the type $R^{ab}=r\delta_{ab}+q(1-\delta_{ab})$ . Replica symmetry is rigorously known to be correct in general settings of Bayes-optimal learning and is thus justified here, see Barbier & Panchenko (2022); Barbier & Macris (2019).
$(v)$ After all these steps, the resulting expression still includes two high-dimensional integrals related to the ${\mathbf{S}}_{2}$ matrices. They can be recognised as the free entropies associated with the Bayes-optimal denoising of a generalised Wishart matrix, as described just above Result 2.1, for two different signal-to-noise ratios. The last step consists in dealing with these integrals using the HCIZ integral, whose form is tractable in this case, see Maillard et al. (2022); Pourkamali et al. (2024). These free entropies yield the last two terms $\iota(\,·\,)$ in $f_{\rm RS}^{\alpha,\gamma}$ , (6).
The complete derivation is in App. D and gives Result 2.1. From the physical meaning of the order parameters, this analysis also yields the post-activations covariance ${\mathbf{K}}$ and thus Result 2.2.
As a final remark, we emphasise a key difference between our approach and earlier works on extensive-rank systems. If, instead of taking the generalised Wishart $P_{S}$ as the base measure over the matrices $({\mathbf{S}}_{2}^{a})$ in the simplified $\tilde{P}$ with moment matching, one takes a factorised Gaussian measure, thus entirely forgetting the dependencies among ${\mathbf{S}}_{2}^{a}$ entries, this mimics the Sakata-Kabashima replica method Sakata & Kabashima (2013). Our ansatz thus captures important correlations that were previously neglected in Sakata & Kabashima (2013); Krzakala et al. (2013); Kabashima et al. (2016); Barbier et al. (2025) in the context of extensive-rank matrix inference. For completeness, we show in App. E that our ansatz indeed greatly improves the prediction compared to these earlier approaches.
5 Conclusion and perspectives
We have provided an effective, quantitatively accurate description of the optimal generalisation capability of a fully-trained two-layer neural network of extensive width with generic activation, when the sample size scales with the number of parameters. This setting has long resisted the mean-field approaches used, e.g., for committee machines Barkai et al. (1992); Engel et al. (1992); Schwarze & Hertz (1992; 1993); Mato & Parga (1992); Monasson & Zecchina (1995); Aubin et al. (2018); Baldassi et al. (2019).
A natural extension is to consider non-Bayes-optimal models, e.g., trained through empirical risk minimisation to learn a mismatched target function. The formalism we provide here can be extended to these cases by keeping track of additional order parameters. The extension to deeper architectures is also possible, in the vein of Cui et al. (2023); Pacelli et al. (2023) who analysed the over-parametrised proportional regime. Accounting for structured inputs is another direction: data with a covariance (Monasson, 1992; Loureiro et al., 2021a), mixture models (Del Giudice, P. et al., 1989; Loureiro et al., 2021b), hidden manifolds (Goldt et al., 2020), object manifolds and simplexes (Chung et al., 2018; Rotondo et al., 2020), etc.
Phase transitions in supervised learning are known in the statistical mechanics literature at least since Györgyi (1990), when the theory was limited to linear models. It would be interesting to connect the picture we have drawn here with Grokking, a sudden drop in generalisation error occurring during the training of neural nets close to interpolation, see Power et al. (2022); Rubin et al. (2024b).
A more systematic analysis on the computational hardness of the problem (as carried out for multi-index models in Troiani et al. (2025)) is an important step towards a full characterisation of the class of target functions that are fundamentally hard to learn.
A key novelty of our approach is to blend matrix models and spin glass techniques in a unified formalism. A limitation is then linked to the restricted class of solvable matrix models (see Kazakov (2000); Anninos & Mühlmann (2020) for a list). Indeed, as explained in App. E, possible improvements to our approach need additional finer order parameters than those appearing in Results 2.1, 2.2 (at least for inhomogeneous readouts ${\mathbf{v}}$ ). Taking them into account yields matrix models when computing their entropy which, to the best of our knowledge, are not currently solvable. We believe that obtaining asymptotically exact formulas for the log-partition function and generalisation error in the current setting and its relatives will require some major breakthrough in the field of multi-matrix models. This is an exciting direction to pursue at the crossroads of matrix models and the high-dimensional inference and learning of extensive-rank matrices.
Software and data
Experiments with ADAM/HMC were performed through standard implementations in PyTorch/TensorFlow/NumPyro; the Metropolis-Hastings and GAMP-RIE routines were coded from scratch (the latter was inspired by Maillard et al. (2024a)). GitHub repository to reproduce the results: https://github.com/Minh-Toan/extensive-width-NN
Acknowledgements
J.B., F.C., M.-T.N. and M.P. were funded by the European Union (ERC, CHORAL, project number 101039794). Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them. M.P. thanks Vittorio Erba and Pietro Rotondo for interesting discussions and suggestions.
References
- Aguirre-López et al. (2025) Aguirre-López, F., Franz, S., and Pastore, M. Random features and polynomial rules. SciPost Phys., 18:039, 2025. 10.21468/SciPostPhys.18.1.039. URL https://scipost.org/10.21468/SciPostPhys.18.1.039.
- Aiudi et al. (2025) Aiudi, R., Pacelli, R., Baglioni, P., Vezzani, A., Burioni, R., and Rotondo, P. Local kernel renormalization as a mechanism for feature learning in overparametrized convolutional neural networks. Nature Communications, 16(1):568, Jan 2025. ISSN 2041-1723. 10.1038/s41467-024-55229-3. URL https://doi.org/10.1038/s41467-024-55229-3.
- Anninos & Mühlmann (2020) Anninos, D. and Mühlmann, B. Notes on matrix models (matrix musings). Journal of Statistical Mechanics: Theory and Experiment, 2020(8):083109, aug 2020. 10.1088/1742-5468/aba499. URL https://dx.doi.org/10.1088/1742-5468/aba499.
- Arjevani et al. (2025) Arjevani, Y., Bruna, J., Kileel, J., Polak, E., and Trager, M. Geometry and optimization of shallow polynomial networks, 2025. URL https://arxiv.org/abs/2501.06074.
- Aubin et al. (2018) Aubin, B., Maillard, A., Barbier, J., Krzakala, F., Macris, N., and Zdeborová, L. The committee machine: Computational to statistical gaps in learning a two-layers neural network. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/84f0f20482cde7e5eacaf7364a643d33-Paper.pdf.
- Baglioni et al. (2024) Baglioni, P., Pacelli, R., Aiudi, R., Di Renzo, F., Vezzani, A., Burioni, R., and Rotondo, P. Predictive power of a Bayesian effective action for fully connected one hidden layer neural networks in the proportional limit. Phys. Rev. Lett., 133:027301, Jul 2024. 10.1103/PhysRevLett.133.027301. URL https://link.aps.org/doi/10.1103/PhysRevLett.133.027301.
- Baldassi et al. (2019) Baldassi, C., Malatesta, E. M., and Zecchina, R. Properties of the geometry of solutions and capacity of multilayer neural networks with rectified linear unit activations. Phys. Rev. Lett., 123:170602, Oct 2019. 10.1103/PhysRevLett.123.170602. URL https://link.aps.org/doi/10.1103/PhysRevLett.123.170602.
- Barbier (2020) Barbier, J. Overlap matrix concentration in optimal Bayesian inference. Information and Inference: A Journal of the IMA, 10(2):597–623, 05 2020. ISSN 2049-8772. 10.1093/imaiai/iaaa008. URL https://doi.org/10.1093/imaiai/iaaa008.
- Barbier & Macris (2019) Barbier, J. and Macris, N. The adaptive interpolation method: a simple scheme to prove replica formulas in Bayesian inference. Probability Theory and Related Fields, 174(3):1133–1185, Aug 2019. ISSN 1432-2064. 10.1007/s00440-018-0879-0. URL https://doi.org/10.1007/s00440-018-0879-0.
- Barbier & Macris (2022) Barbier, J. and Macris, N. Statistical limits of dictionary learning: Random matrix theory and the spectral replica method. Phys. Rev. E, 106:024136, Aug 2022. 10.1103/PhysRevE.106.024136. URL https://link.aps.org/doi/10.1103/PhysRevE.106.024136.
- Barbier & Panchenko (2022) Barbier, J. and Panchenko, D. Strong replica symmetry in high-dimensional optimal Bayesian inference. Communications in Mathematical Physics, 393(3):1199–1239, Aug 2022. ISSN 1432-0916. 10.1007/s00220-022-04387-w. URL https://doi.org/10.1007/s00220-022-04387-w.
- Barbier et al. (2019) Barbier, J., Krzakala, F., Macris, N., Miolane, L., and Zdeborová, L. Optimal errors and phase transitions in high-dimensional generalized linear models. Proceedings of the National Academy of Sciences, 116(12):5451–5460, 2019. 10.1073/pnas.1802705116. URL https://www.pnas.org/doi/abs/10.1073/pnas.1802705116.
- Barbier et al. (2025) Barbier, J., Camilli, F., Ko, J., and Okajima, K. Phase diagram of extensive-rank symmetric matrix denoising beyond rotational invariance. Physical Review X, 2025.
- Barkai et al. (1992) Barkai, E., Hansel, D., and Sompolinsky, H. Broken symmetries in multilayered perceptrons. Phys. Rev. A, 45:4146–4161, Mar 1992. 10.1103/PhysRevA.45.4146. URL https://link.aps.org/doi/10.1103/PhysRevA.45.4146.
- Bartlett et al. (2021) Bartlett, P. L., Montanari, A., and Rakhlin, A. Deep learning: a statistical viewpoint. Acta Numerica, 30:87–201, 2021. 10.1017/S0962492921000027. URL https://doi.org/10.1017/S0962492921000027.
- Bassetti et al. (2024) Bassetti, F., Gherardi, M., Ingrosso, A., Pastore, M., and Rotondo, P. Feature learning in finite-width Bayesian deep linear networks with multiple outputs and convolutional layers, 2024. URL https://arxiv.org/abs/2406.03260.
- Bordelon et al. (2020) Bordelon, B., Canatar, A., and Pehlevan, C. Spectrum dependent learning curves in kernel regression and wide neural networks. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 1024–1034. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/bordelon20a.html.
- Brézin et al. (2016) Brézin, E., Hikami, S., et al. Random matrix theory with an external source. Springer, 2016.
- Camilli et al. (2023) Camilli, F., Tieplova, D., and Barbier, J. Fundamental limits of overparametrized shallow neural networks for supervised learning, 2023. URL https://arxiv.org/abs/2307.05635.
- Camilli et al. (2025) Camilli, F., Tieplova, D., Bergamin, E., and Barbier, J. Information-theoretic reduction of deep neural networks to linear models in the overparametrized proportional regime. The 38th Annual Conference on Learning Theory (to appear), 2025.
- Canatar et al. (2021) Canatar, A., Bordelon, B., and Pehlevan, C. Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks. Nature Communications, 12(1):2914, 05 2021. ISSN 2041-1723. 10.1038/s41467-021-23103-1. URL https://doi.org/10.1038/s41467-021-23103-1.
- Chung et al. (2018) Chung, S., Lee, D. D., and Sompolinsky, H. Classification and geometry of general perceptual manifolds. Phys. Rev. X, 8:031003, Jul 2018. 10.1103/PhysRevX.8.031003. URL https://link.aps.org/doi/10.1103/PhysRevX.8.031003.
- Cui et al. (2023) Cui, H., Krzakala, F., and Zdeborova, L. Bayes-optimal learning of deep random networks of extensive-width. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 6468–6521. PMLR, 07 2023. URL https://proceedings.mlr.press/v202/cui23b.html.
- Del Giudice et al. (1989) Del Giudice, P., Franz, S., and Virasoro, M. A. Perceptron beyond the limit of capacity. J. Phys. France, 50(2):121–134, 1989. 10.1051/jphys:01989005002012100. URL https://doi.org/10.1051/jphys:01989005002012100.
- Dietrich et al. (1999) Dietrich, R., Opper, M., and Sompolinsky, H. Statistical mechanics of support vector networks. Phys. Rev. Lett., 82:2975–2978, 04 1999. 10.1103/PhysRevLett.82.2975. URL https://link.aps.org/doi/10.1103/PhysRevLett.82.2975.
- Du & Lee (2018) Du, S. and Lee, J. On the power of over-parametrization in neural networks with quadratic activation. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1329–1338. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/du18a.html.
- Engel & Van den Broeck (2001) Engel, A. and Van den Broeck, C. Statistical mechanics of learning. Cambridge University Press, 2001. ISBN 9780521773072.
- Engel et al. (1992) Engel, A., Köhler, H. M., Tschepke, F., Vollmayr, H., and Zippelius, A. Storage capacity and learning algorithms for two-layer neural networks. Phys. Rev. A, 45:7590–7609, May 1992. 10.1103/PhysRevA.45.7590. URL https://link.aps.org/doi/10.1103/PhysRevA.45.7590.
- Gamarnik et al. (2024) Gamarnik, D., Kızıldağ, E. C., and Zadik, I. Stationary points of a shallow neural network with quadratic activations and the global optimality of the gradient descent algorithm. Mathematics of Operations Research, 50(1):209–251, 2024. 10.1287/moor.2021.0082. URL https://doi.org/10.1287/moor.2021.0082.
- Gardner & Derrida (1989) Gardner, E. and Derrida, B. Three unfinished works on the optimal storage capacity of networks. Journal of Physics A: Mathematical and General, 22(12):1983, jun 1989. 10.1088/0305-4470/22/12/004. URL https://dx.doi.org/10.1088/0305-4470/22/12/004.
- Gerace et al. (2021) Gerace, F., Loureiro, B., Krzakala, F., Mézard, M., and Zdeborová, L. Generalisation error in learning with random features and the hidden manifold model. Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124013, Dec 2021. ISSN 1742-5468. 10.1088/1742-5468/ac3ae6. URL http://dx.doi.org/10.1088/1742-5468/ac3ae6.
- Ghorbani et al. (2021) Ghorbani, B., Mei, S., Misiakiewicz, T., and Montanari, A. Linearized two-layers neural networks in high dimension. The Annals of Statistics, 49(2):1029 – 1054, 2021. 10.1214/20-AOS1990. URL https://doi.org/10.1214/20-AOS1990.
- Goldt et al. (2020) Goldt, S., Mézard, M., Krzakala, F., and Zdeborová, L. Modeling the influence of data structure on learning in neural networks: The hidden manifold model. Phys. Rev. X, 10:041044, Dec 2020. 10.1103/PhysRevX.10.041044. URL https://link.aps.org/doi/10.1103/PhysRevX.10.041044.
- Goldt et al. (2022) Goldt, S., Loureiro, B., Reeves, G., Krzakala, F., Mezard, M., and Zdeborová, L. The Gaussian equivalence of generative models for learning with shallow neural networks. In Bruna, J., Hesthaven, J., and Zdeborová, L. (eds.), Proceedings of the 2nd Mathematical and Scientific Machine Learning Conference, volume 145 of Proceedings of Machine Learning Research, pp. 426–471. PMLR, 08 2022. URL https://proceedings.mlr.press/v145/goldt22a.html.
- Guionnet & Zeitouni (2002) Guionnet, A. and Zeitouni, O. Large deviations asymptotics for spherical integrals. Journal of Functional Analysis, 188(2):461–515, 2002. ISSN 0022-1236. 10.1006/jfan.2001.3833. URL https://www.sciencedirect.com/science/article/pii/S0022123601938339.
- Guo et al. (2005) Guo, D., Shamai, S., and Verdú, S. Mutual information and minimum mean-square error in gaussian channels. IEEE Transactions on Information Theory, 51(4):1261–1282, 2005. 10.1109/TIT.2005.844072. URL https://doi.org/10.1109/TIT.2005.844072.
- Györgyi (1990) Györgyi, G. First-order transition to perfect generalization in a neural network with binary synapses. Phys. Rev. A, 41:7097–7100, Jun 1990. 10.1103/PhysRevA.41.7097. URL https://link.aps.org/doi/10.1103/PhysRevA.41.7097.
- Hanin (2023) Hanin, B. Random neural networks in the infinite width limit as Gaussian processes. The Annals of Applied Probability, 33(6A):4798 – 4819, 2023. 10.1214/23-AAP1933. URL https://doi.org/10.1214/23-AAP1933.
- Hastie et al. (2022) Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. Surprises in high-dimensional ridgeless least squares interpolation. The Annals of Statistics, 50(2):949 – 986, 2022. 10.1214/21-AOS2133. URL https://doi.org/10.1214/21-AOS2133.
- Hu & Lu (2023) Hu, H. and Lu, Y. M. Universality laws for high-dimensional learning with random features. IEEE Transactions on Information Theory, 69(3):1932–1964, 2023. 10.1109/TIT.2022.3217698. URL https://doi.org/10.1109/TIT.2022.3217698.
- Hu et al. (2024) Hu, H., Lu, Y. M., and Misiakiewicz, T. Asymptotics of random feature regression beyond the linear scaling regime, 2024. URL https://arxiv.org/abs/2403.08160.
- Itzykson & Zuber (1980) Itzykson, C. and Zuber, J. The planar approximation. II. Journal of Mathematical Physics, 21(3):411–421, 03 1980. ISSN 0022-2488. 10.1063/1.524438. URL https://doi.org/10.1063/1.524438.
- Kabashima et al. (2016) Kabashima, Y., Krzakala, F., Mézard, M., Sakata, A., and Zdeborová, L. Phase transitions and sample complexity in Bayes-optimal matrix factorization. IEEE Transactions on Information Theory, 62(7):4228–4265, 2016. 10.1109/TIT.2016.2556702. URL https://doi.org/10.1109/TIT.2016.2556702.
- Kazakov (2000) Kazakov, V. A. Solvable matrix models, 2000. URL https://arxiv.org/abs/hep-th/0003064.
- Kingma & Ba (2017) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization, 2017. URL https://arxiv.org/abs/1412.6980.
- Krzakala et al. (2013) Krzakala, F., Mézard, M., and Zdeborová, L. Phase diagram and approximate message passing for blind calibration and dictionary learning. In 2013 IEEE International Symposium on Information Theory, pp. 659–663, 2013. 10.1109/ISIT.2013.6620308. URL https://doi.org/10.1109/ISIT.2013.6620308.
- Lee et al. (2018) Lee, J., Sohl-dickstein, J., Pennington, J., Novak, R., Schoenholz, S., and Bahri, Y. Deep neural networks as Gaussian processes. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1EA-M-0Z.
- Li & Sompolinsky (2021) Li, Q. and Sompolinsky, H. Statistical mechanics of deep linear neural networks: The backpropagating kernel renormalization. Phys. Rev. X, 11:031059, Sep 2021. 10.1103/PhysRevX.11.031059. URL https://link.aps.org/doi/10.1103/PhysRevX.11.031059.
- Loureiro et al. (2021a) Loureiro, B., Gerbelot, C., Cui, H., Goldt, S., Krzakala, F., Mezard, M., and Zdeborová, L. Learning curves of generic features maps for realistic datasets with a teacher-student model. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 18137–18151. Curran Associates, Inc., 2021a. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/9704a4fc48ae88598dcbdcdf57f3fdef-Paper.pdf.
- Loureiro et al. (2021b) Loureiro, B., Sicuro, G., Gerbelot, C., Pacco, A., Krzakala, F., and Zdeborová, L. Learning Gaussian mixtures with generalized linear models: Precise asymptotics in high-dimensions. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 10144–10157. Curran Associates, Inc., 2021b. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/543e83748234f7cbab21aa0ade66565f-Paper.pdf.
- Maillard et al. (2022) Maillard, A., Krzakala, F., Mézard, M., and Zdeborová, L. Perturbative construction of mean-field equations in extensive-rank matrix factorization and denoising. Journal of Statistical Mechanics: Theory and Experiment, 2022(8):083301, Aug 2022. 10.1088/1742-5468/ac7e4c. URL https://dx.doi.org/10.1088/1742-5468/ac7e4c.
- Maillard et al. (2024a) Maillard, A., Troiani, E., Martin, S., Krzakala, F., and Zdeborová, L. Bayes-optimal learning of an extensive-width neural network from quadratically many samples, 2024a. URL https://arxiv.org/abs/2408.03733.
- Maillard et al. (2024b) Maillard, A., Troiani, E., Martin, S., Krzakala, F., and Zdeborová, L. Github repository ExtensiveWidthQuadraticSamples. https://github.com/SPOC-group/ExtensiveWidthQuadraticSamples, 2024b.
- Martin et al. (2024) Martin, S., Bach, F., and Biroli, G. On the impact of overparameterization on the training of a shallow neural network in high dimensions. In Dasgupta, S., Mandt, S., and Li, Y. (eds.), Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, volume 238 of Proceedings of Machine Learning Research, pp. 3655–3663. PMLR, 02–04 May 2024. URL https://proceedings.mlr.press/v238/martin24a.html.
- Mato & Parga (1992) Mato, G. and Parga, N. Generalization properties of multilayered neural networks. Journal of Physics A: Mathematical and General, 25(19):5047, Oct 1992. 10.1088/0305-4470/25/19/017. URL https://dx.doi.org/10.1088/0305-4470/25/19/017.
- Matthews et al. (2018) Matthews, A. G. D. G., Hron, J., Rowland, M., Turner, R. E., and Ghahramani, Z. Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1-nGgWC-.
- Matytsin (1994) Matytsin, A. On the large- $N$ limit of the Itzykson-Zuber integral. Nuclear Physics B, 411(2):805–820, 1994. ISSN 0550-3213. 10.1016/0550-3213(94)90471-5. URL https://www.sciencedirect.com/science/article/pii/0550321394904715.
- Mei & Montanari (2022) Mei, S. and Montanari, A. The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75(4):667–766, 2022. 10.1002/cpa.22008. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/cpa.22008.
- Mezard et al. (1986) Mezard, M., Parisi, G., and Virasoro, M. Spin Glass Theory and Beyond. World Scientific, 1986. 10.1142/0271. URL https://www.worldscientific.com/doi/abs/10.1142/0271.
- Monasson (1992) Monasson, R. Properties of neural networks storing spatially correlated patterns. Journal of Physics A: Mathematical and General, 25(13):3701, Jul 1992. 10.1088/0305-4470/25/13/019. URL https://dx.doi.org/10.1088/0305-4470/25/13/019.
- Monasson & Zecchina (1995) Monasson, R. and Zecchina, R. Weight space structure and internal representations: A direct approach to learning and generalization in multilayer neural networks. Phys. Rev. Lett., 75:2432–2435, Sep 1995. 10.1103/PhysRevLett.75.2432. URL https://link.aps.org/doi/10.1103/PhysRevLett.75.2432.
- Naveh & Ringel (2021) Naveh, G. and Ringel, Z. A self consistent theory of Gaussian processes captures feature learning effects in finite CNNs. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 21352–21364. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/b24d21019de5e59da180f1661904f49a-Paper.pdf.
- Neal (1996) Neal, R. M. Priors for Infinite Networks, pp. 29–53. Springer New York, New York, NY, 1996. ISBN 978-1-4612-0745-0. 10.1007/978-1-4612-0745-0_2. URL https://doi.org/10.1007/978-1-4612-0745-0_2.
- Nishimori (2001) Nishimori, H. Statistical Physics of Spin Glasses and Information Processing: An Introduction. Oxford University Press, 07 2001. ISBN 9780198509417. 10.1093/acprof:oso/9780198509417.001.0001.
- Nourdin et al. (2011) Nourdin, I., Peccati, G., and Podolskij, M. Quantitative Breuer–Major theorems. Stochastic Processes and their Applications, 121(4):793–812, 2011. ISSN 0304-4149. 10.1016/j.spa.2010.12.006. URL https://www.sciencedirect.com/science/article/pii/S0304414910002917.
- Pacelli et al. (2023) Pacelli, R., Ariosto, S., Pastore, M., Ginelli, F., Gherardi, M., and Rotondo, P. A statistical mechanics framework for Bayesian deep neural networks beyond the infinite-width limit. Nature Machine Intelligence, 5(12):1497–1507, 12 2023. ISSN 2522-5839. 10.1038/s42256-023-00767-6. URL https://doi.org/10.1038/s42256-023-00767-6.
- Parker et al. (2014) Parker, J. T., Schniter, P., and Cevher, V. Bilinear generalized approximate message passing—Part I: Derivation. IEEE Transactions on Signal Processing, 62(22):5839–5853, 2014. 10.1109/TSP.2014.2357776. URL https://doi.org/10.1109/TSP.2014.2357776.
- Potters & Bouchaud (2020) Potters, M. and Bouchaud, J.-P. A first course in random matrix theory: for physicists, engineers and data scientists. Cambridge University Press, 2020.
- Pourkamali et al. (2024) Pourkamali, F., Barbier, J., and Macris, N. Matrix inference in growing rank regimes. IEEE Transactions on Information Theory, 70(11):8133–8163, 2024. 10.1109/TIT.2024.3422263. URL https://doi.org/10.1109/TIT.2024.3422263.
- Power et al. (2022) Power, A., Burda, Y., Edwards, H., Babuschkin, I., and Misra, V. Grokking: Generalization beyond overfitting on small algorithmic datasets, 2022. URL https://arxiv.org/abs/2201.02177.
- Rotondo et al. (2020) Rotondo, P., Pastore, M., and Gherardi, M. Beyond the storage capacity: Data-driven satisfiability transition. Phys. Rev. Lett., 125:120601, Sep 2020. 10.1103/PhysRevLett.125.120601. URL https://link.aps.org/doi/10.1103/PhysRevLett.125.120601.
- Rubin et al. (2024a) Rubin, N., Ringel, Z., Seroussi, I., and Helias, M. A unified approach to feature learning in Bayesian neural networks. In High-dimensional Learning Dynamics 2024: The Emergence of Structure and Reasoning, 2024a. URL https://openreview.net/forum?id=ZmOSJ2MV2R.
- Rubin et al. (2024b) Rubin, N., Seroussi, I., and Ringel, Z. Grokking as a first order phase transition in two layer networks. In The Twelfth International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=3ROGsTX3IR.
- Sakata & Kabashima (2013) Sakata, A. and Kabashima, Y. Statistical mechanics of dictionary learning. Europhysics Letters, 103(2):28008, Aug 2013. 10.1209/0295-5075/103/28008. URL https://dx.doi.org/10.1209/0295-5075/103/28008.
- Sarao Mannelli et al. (2020) Sarao Mannelli, S., Vanden-Eijnden, E., and Zdeborová, L. Optimization and generalization of shallow neural networks with quadratic activation functions. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 13445–13455. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/9b8b50fb590c590ffbf1295ce92258dc-Paper.pdf.
- Schmidt (2018) Schmidt, H. C. Statistical physics of sparse and dense models in optimization and inference. PhD thesis, 2018. URL http://www.theses.fr/2018SACLS366.
- Schwarze & Hertz (1992) Schwarze, H. and Hertz, J. Generalization in a large committee machine. Europhysics Letters, 20(4):375, Oct 1992. 10.1209/0295-5075/20/4/015. URL https://dx.doi.org/10.1209/0295-5075/20/4/015.
- Schwarze & Hertz (1993) Schwarze, H. and Hertz, J. Generalization in fully connected committee machines. Europhysics Letters, 21(7):785, Mar 1993. 10.1209/0295-5075/21/7/012. URL https://dx.doi.org/10.1209/0295-5075/21/7/012.
- Semerjian (2024) Semerjian, G. Matrix denoising: Bayes-optimal estimators via low-degree polynomials. Journal of Statistical Physics, 191(10):139, Oct 2024. ISSN 1572-9613. 10.1007/s10955-024-03359-9. URL https://doi.org/10.1007/s10955-024-03359-9.
- Seroussi et al. (2023) Seroussi, I., Naveh, G., and Ringel, Z. Separation of scales and a thermodynamic description of feature learning in some CNNs. Nature Communications, 14(1):908, Feb 2023. ISSN 2041-1723. 10.1038/s41467-023-36361-y. URL https://doi.org/10.1038/s41467-023-36361-y.
- Soltanolkotabi et al. (2019) Soltanolkotabi, M., Javanmard, A., and Lee, J. D. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory, 65(2):742–769, 2019. 10.1109/TIT.2018.2854560. URL https://doi.org/10.1109/TIT.2018.2854560.
- Troiani et al. (2025) Troiani, E., Dandi, Y., Defilippis, L., Zdeborova, L., Loureiro, B., and Krzakala, F. Fundamental computational limits of weak learnability in high-dimensional multi-index models. In The 28th International Conference on Artificial Intelligence and Statistics, 2025. URL https://openreview.net/forum?id=Mwzui5H0VN.
- van Meegen & Sompolinsky (2024) van Meegen, A. and Sompolinsky, H. Coding schemes in neural networks learning classification tasks, 2024. URL https://arxiv.org/abs/2406.16689.
- Venturi et al. (2019) Venturi, L., Bandeira, A. S., and Bruna, J. Spurious valleys in one-hidden-layer neural network optimization landscapes. Journal of Machine Learning Research, 20(133):1–34, 2019. URL http://jmlr.org/papers/v20/18-674.html.
- Williams (1996) Williams, C. Computing with infinite networks. In Mozer, M., Jordan, M., and Petsche, T. (eds.), Advances in Neural Information Processing Systems, volume 9. MIT Press, 1996. URL https://proceedings.neurips.cc/paper/1996/file/ae5e3ce40e0404a45ecacaaf05e5f735-Paper.pdf.
- Xiao et al. (2023) Xiao, L., Hu, H., Misiakiewicz, T., Lu, Y. M., and Pennington, J. Precise learning curves and higher-order scaling limits for dot-product kernel regression. Journal of Statistical Mechanics: Theory and Experiment, 2023(11):114005, Nov 2023. 10.1088/1742-5468/ad01b7. URL https://dx.doi.org/10.1088/1742-5468/ad01b7.
- Xu et al. (2025) Xu, Y., Maillard, A., Zdeborová, L., and Krzakala, F. Fundamental limits of matrix sensing: Exact asymptotics, universality, and applications, 2025. URL https://arxiv.org/abs/2503.14121.
- Yoon & Oh (1998) Yoon, H. and Oh, J.-H. Learning of higher-order perceptrons with tunable complexities. Journal of Physics A: Mathematical and General, 31(38):7771–7784, 09 1998. 10.1088/0305-4470/31/38/012. URL https://doi.org/10.1088/0305-4470/31/38/012.
- Zdeborová & Krzakala (2016) Zdeborová, L. and Krzakala, F. Statistical physics of inference: thresholds and algorithms. Advances in Physics, 65(5):453–552, 2016. 10.1080/00018732.2016.1211393. URL https://doi.org/10.1080/00018732.2016.1211393.
Appendix A Hermite basis and Mehler’s formula
Recall the Hermite expansion of the activation:
$$
\sigma(x)=\sum_{\ell=0}^{\infty}\frac{\mu_{\ell}}{\ell!}{\rm He}_{\ell}(x). \tag{17}
$$
We express it in the basis of the probabilist’s Hermite polynomials, generated through
$$
{\rm He}_{\ell}(z)=\frac{d^{\ell}}{dt^{\ell}}\exp\big(tz-t^{2}/2\big)\big|_{t=0}. \tag{18}
$$
The Hermite basis has the property of being orthogonal with respect to the standard Gaussian measure, which is the distribution of the input data:
$$
\int Dz\,{\rm He}_{k}(z){\rm He}_{\ell}(z)=\ell!\,\delta_{k\ell}, \tag{19}
$$
where $Dz:=dz\exp(-z^{2}/2)/\sqrt{2\pi}$. By orthogonality, the coefficients of the expansion can be obtained as
$$
\mu_{\ell}=\int Dz{\rm He}_{\ell}(z)\sigma(z). \tag{20}
$$
Moreover,
$$
\mathbb{E}[\sigma(z)^{2}]=\int Dz\,\sigma(z)^{2}=\sum_{\ell=0}^{\infty}\frac{\mu_{\ell}^{2}}{\ell!}. \tag{21}
$$
These coefficients for some popular choices of $\sigma$ are reported in Table 1 for reference.
Table 1: First Hermite coefficients of some activation functions reported in the figures. $\theta$ is the Heaviside step function.
| $\sigma(z)$ | $\mu_{0}$ | $\mu_{1}$ | $\mu_{2}$ | $\mu_{3}$ | $\mu_{4}$ | $\cdots$ | $\mathbb{E}[\sigma(z)^{2}]$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ${\rm ReLU}(z)=z\theta(z)$ | $1/\sqrt{2\pi}$ | $1/2$ | $1/\sqrt{2\pi}$ | $0$ | $-1/\sqrt{2\pi}$ | $\cdots$ | $1/2$ |
| ${\rm ELU}(z)=z\theta(z)+(e^{z}-1)\theta(-z)$ | 0.16052 | 0.76158 | 0.26158 | -0.13736 | -0.13736 | $\cdots$ | 0.64494 |
| ${\rm Tanh}(2z)$ | 0 | 0.72948 | 0 | -0.61398 | 0 | $\cdots$ | 0.63526 |
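The coefficients in Table 1 can be reproduced numerically. The following is a minimal Python/NumPy sketch (ours, not part of the paper's code): it evaluates (20) by quadrature on a fine symmetric grid, using NumPy's `hermite_e` module for the probabilist's polynomials ${\rm He}_{\ell}$; the helper name `hermite_coeff` and the grid parameters are our choices.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval  # probabilist's Hermite polynomials He_l

def hermite_coeff(sigma, l, zmax=10.0, n=200_001):
    """mu_l = int Dz He_l(z) sigma(z), eq. (20), by quadrature on a fine grid."""
    z = np.linspace(-zmax, zmax, n)
    He_l = hermeval(z, [0.0] * l + [1.0])               # coefficient vector selects He_l
    f = He_l * sigma(z) * np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)
    return np.sum(f) * (z[1] - z[0])

relu = lambda z: np.maximum(z, 0.0)
tanh2 = lambda z: np.tanh(2.0 * z)
```

For ${\rm ReLU}$ this recovers $\mu_{0}=\mu_{2}=1/\sqrt{2\pi}$, $\mu_{1}=1/2$, $\mu_{3}=0$, $\mu_{4}=-1/\sqrt{2\pi}$, and for ${\rm Tanh}(2z)$ the values $\mu_{1}\approx 0.72948$, $\mu_{3}\approx-0.61398$ of Table 1 (note the consistency check $\mu_{1}=\mathbb{E}[\sigma'(z)]=2(1-\mathbb{E}[\tanh^{2}(2z)])$ by Stein's lemma).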
The Hermite basis can be generalised to an orthogonal basis with respect to the Gaussian measure with generic variance:
$$
{\rm He}_{\ell}^{[r]}(z)=\frac{d^{\ell}}{dt^{\ell}}\exp\big(tz-t^{2}r/2\big)\big|_{t=0}, \tag{22}
$$
so that, with $D_{r}z:=dz\exp(-z^{2}/2r)/\sqrt{2\pi r}$ , we have
$$
\int D_{r}z\,{\rm He}_{k}^{[r]}(z){\rm He}_{\ell}^{[r]}(z)=\ell!\,r^{\ell}\delta_{k\ell}. \tag{23}
$$
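The orthogonality relation (23) is easy to verify numerically using the identity ${\rm He}_{\ell}^{[r]}(z)=r^{\ell/2}{\rm He}_{\ell}(z/\sqrt{r})$, which follows from the generating function (22) by rescaling $t\to t/\sqrt{r}$. A quadrature sketch of ours (the grid parameters and the test value of $r$ below are arbitrary choices):

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermeval

def He_r(l, z, r):
    """He_l^{[r]}(z) = r^{l/2} He_l(z/sqrt(r)), from the generating function (22)."""
    return r ** (l / 2) * hermeval(z / np.sqrt(r), [0.0] * l + [1.0])

def inner(k, l, r, zmax=30.0, n=400_001):
    """int D_r z He_k^{[r]}(z) He_l^{[r]}(z), the left-hand side of (23)."""
    z = np.linspace(-zmax, zmax, n)
    w = np.exp(-z**2 / (2 * r)) / np.sqrt(2 * np.pi * r)   # N(0, r) density
    f = He_r(k, z, r) * He_r(l, z, r) * w
    return np.sum(f) * (z[1] - z[0])
```

For any $r>0$, `inner(k, l, r)` reproduces $\ell!\,r^{\ell}\delta_{k\ell}$ up to quadrature error.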
From Mehler’s formula
$$
\frac{1}{2\pi\sqrt{r^{2}-q^{2}}}\exp\!\Big[-\frac{1}{2}(u,v)\begin{pmatrix}r&q\\ q&r\end{pmatrix}^{-1}\begin{pmatrix}u\\ v\end{pmatrix}\Big]=\frac{e^{-\frac{u^{2}}{2r}}}{\sqrt{2\pi r}}\frac{e^{-\frac{v^{2}}{2r}}}{\sqrt{2\pi r}}\sum_{\ell=0}^{+\infty}\frac{q^{\ell}}{\ell!r^{2\ell}}{\rm He}_{\ell}^{[r]}(u){\rm He}_{\ell}^{[r]}(v), \tag{24}
$$
and by orthogonality of the Hermite basis, (8) readily follows by noticing that the variables $(h_{i}^{a}=({\mathbf{W}}^{a}{\mathbf{x}})_{i}/\sqrt{d})_{i,a}$ at given $({\mathbf{W}}^{a})$ are Gaussian with covariances $\Omega^{ab}_{ij}={\mathbf{W}}_{i}^{a\intercal}{\mathbf{W}}^{b}_{j}/d$, so that
$$
\mathbb{E}[\sigma(h_{i}^{a})\sigma(h_{j}^{b})]=\sum_{\ell=0}^{\infty}\frac{(\mu_{\ell}^{[r]})^{2}}{\ell!r^{2\ell}}(\Omega_{ij}^{ab})^{\ell},\qquad\mu_{\ell}^{[r]}=\int D_{r}z\,{\rm He}^{[r]}_{\ell}(z)\sigma(z). \tag{25}
$$
Moreover, since by Bayes-optimality $r=\Omega^{aa}_{ii}$ converges for large $d$ to the variance of the prior of ${\mathbf{W}}^{0}$, whenever $\Omega^{aa}_{ii}\to 1$ we can specialise this formula to the simpler case $r=1$ reported in the main text.
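The $r=1$ case of (25) can be checked numerically: for $(u,v)$ jointly standard Gaussian with correlation $q$, one should have $\mathbb{E}[\sigma(u)\sigma(v)]=\sum_{\ell}\mu_{\ell}^{2}q^{\ell}/\ell!$. Below is a self-contained sketch of ours for $\sigma={\rm Tanh}(2z)$, comparing a truncated series against a direct two-dimensional quadrature (the truncation level $L$ and grid sizes are our choices; the truncation error is bounded by $|q|^{L+1}\,\mathbb{E}[\sigma^{2}]$):

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermeval

sigma = lambda z: np.tanh(2.0 * z)   # Tanh(2z) from Table 1
L = 20                               # series truncation level

# Hermite coefficients mu_l, eq. (20), by quadrature.
z = np.linspace(-12.0, 12.0, 2401)
h = z[1] - z[0]
phi = np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)
mu = np.array([np.sum(hermeval(z, [0.0] * l + [1.0]) * sigma(z) * phi) * h
               for l in range(L + 1)])

def cov_series(q):
    """Right-hand side of (25) with r = 1: sum_l mu_l^2/l! q^l (truncated at L)."""
    return sum(mu[l]**2 / factorial(l) * q**l for l in range(L + 1))

def cov_direct(q):
    """E[sigma(u) sigma(v)], (u, v) standard bivariate Gaussian with correlation q,
    via the representation v = q*u + sqrt(1-q^2)*w with u, w i.i.d. N(0,1)."""
    u, w = z[:, None], z[None, :]
    f = sigma(u) * sigma(q * u + np.sqrt(1 - q**2) * w) * phi[:, None] * phi[None, :]
    return np.sum(f) * h**2
```

The two evaluations agree to high precision for $|q|$ bounded away from 1, for any correlation sign.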
Appendix B Nishimori identities
The Nishimori identities are a very general set of symmetries arising in Bayes-optimal inference as a consequence of Bayes’ rule. To introduce them, consider a test function $f$ of the teacher weights, collectively denoted ${\bm{\theta}}^{0}$, of $s-1$ replicas $({\bm{\theta}}^{a})_{2\le a\le s}$ of the student’s weights drawn conditionally i.i.d. from the posterior, and possibly also of the training set $\mathcal{D}$: $f({\bm{\theta}}^{0},{\bm{\theta}}^{2},\dots,{\bm{\theta}}^{s};\mathcal{D})$. Then
$$
\displaystyle\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\langle f({\bm{\theta}}^{0},{\bm{\theta}}^{2},\dots,{\bm{\theta}}^{s};\mathcal{D})\rangle=\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\langle f({\bm{\theta}}^{1},{\bm{\theta}}^{2},\dots,{\bm{\theta}}^{s};\mathcal{D})\rangle, \tag{26}
$$
where we have replaced the teacher’s weights with another replica from the student. The proof is elementary, see e.g. Barbier et al. (2019).
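A toy illustration (ours, much simpler than the network of the paper): in the scalar Gaussian channel $y=\theta^{0}+\sqrt{\Delta}\,\xi$ with $\theta^{0}\sim\mathcal{N}(0,1)$, the posterior is Gaussian, $\mathcal{N}(y/(1+\Delta),\Delta/(1+\Delta))$, and (26) with $f=\theta^{0}\theta^{2}$ versus $f=\theta^{1}\theta^{2}$ reduces to the overlap identity $\mathbb{E}[\theta^{0}\langle\theta\rangle]=\mathbb{E}[\langle\theta\rangle^{2}]$ (replicas being conditionally i.i.d.). A Monte Carlo sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
Delta, n = 0.5, 1_000_000
theta0 = rng.standard_normal(n)                    # teacher sample, prior N(0,1)
y = theta0 + np.sqrt(Delta) * rng.standard_normal(n)

post_mean = y / (1 + Delta)                        # <theta> given y (Gaussian posterior)
m = np.mean(theta0 * post_mean)                    # teacher-replica overlap E<theta^0 theta^1>
q = np.mean(post_mean ** 2)                        # replica-replica overlap E<theta^1 theta^2>

print(m, q)  # both close to 1/(1+Delta)
```

Both overlaps concentrate on $1/(1+\Delta)$, as the identity predicts.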
The Nishimori identities also have consequences for our replica symmetric ansatz for the free entropy. In particular, they constrain the asymptotic means of some order parameters. For instance,
$$
\displaystyle m_{2}=\lim\frac{1}{d^{2}}\mathbb{E}_{\mathcal{D},{\bm{\theta}}^{0}}\langle{\rm Tr}[{\mathbf{S}}_{2}^{a}{\mathbf{S}}_{2}^{0}]\rangle=\lim\frac{1}{d^{2}}\mathbb{E}_{\mathcal{D}}\langle{\rm Tr}[{\mathbf{S}}_{2}^{a}{\mathbf{S}}_{2}^{b}]\rangle=q_{2},\quad\text{for }a\neq b. \tag{27}
$$
Combined with the concentration of such order parameters, which can be proven in great generality in Bayes-optimal learning (Barbier, 2020; Barbier & Panchenko, 2022), it fixes the values of some of them. For instance, we have that with high probability
$$
\displaystyle\frac{1}{d^{2}}{\rm Tr}[({\mathbf{S}}_{2}^{a})^{2}]\to r_{2}=\lim\frac{1}{d^{2}}\mathbb{E}_{\mathcal{D}}\langle{\rm Tr}[({\mathbf{S}}_{2}^{a})^{2}]\rangle=\lim\frac{1}{d^{2}}\mathbb{E}_{{\bm{\theta}}^{0}}{\rm Tr}[({\mathbf{S}}_{2}^{0})^{2}]=\rho_{2}=1+\gamma\bar{v}^{2}. \tag{28}
$$
When the values of some order parameters are determined by the Nishimori identities (and their concentration), as for those fixed to $r_{2}=\rho_{2}$, the respective Fourier conjugates $\hat{r}_{2},\hat{\rho}_{2}$ vanish (meaning that the desired constraints were already asymptotically enforced without the need for additional delta functions). This is because the configurations on which the order parameters take those values dominate the posterior measure exponentially (in $n$), so these constraints are automatically imposed by the measure.
Appendix C Alternative representation for the optimal mean-square generalisation error
We recall that ${\bm{\theta}}^{0}=({\mathbf{v}}^{0},{\mathbf{W}}^{0})$ and similarly for ${\bm{\theta}}^{1}={\bm{\theta}},{\bm{\theta}}^{2},\dots$, which are replicas, i.e., conditionally i.i.d. samples from $dP({\mathbf{W}},{\mathbf{v}}\mid\mathcal{D})$ (the reasoning below applies whether ${\mathbf{v}}$ is learnable or quenched, so in general we can consider a joint posterior over both). In this section we detail how to obtain Result 2.2 and how to write the generalisation error defined in (3) in a form more convenient for numerical evaluation.
From its definition, the Bayes-optimal mean-square generalisation error can be recast as
$$
\displaystyle\varepsilon^{\rm opt}=\mathbb{E}_{{\bm{\theta}}^{0},{\mathbf{x}}_{\rm test}}\mathbb{E}[y^{2}_{\rm test}\mid\lambda^{0}]-2\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D},{\mathbf{x}}_{\rm test}}\mathbb{E}[y_{\rm test}\mid\lambda^{0}]\langle\mathbb{E}[y\mid\lambda]\rangle+\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D},{\mathbf{x}}_{\rm test}}\langle\mathbb{E}[y\mid\lambda]\rangle^{2}, \tag{29}
$$
where $\mathbb{E}[y\mid\lambda]=\int dy\,y\,P_{\rm out}(y\mid\lambda)$, and $\lambda^{0}$, $\lambda$ are the random variables (random due to the test input ${\mathbf{x}}_{\rm test}$, drawn independently of the training data $\mathcal{D}$, and their respective weights ${\bm{\theta}}^{0},{\bm{\theta}}$)
$$
\displaystyle\lambda^{0}=\lambda({\bm{\theta}}^{0},{\mathbf{x}}_{\rm test})=\frac{{\mathbf{v}}^{0\intercal}}{\sqrt{k}}\sigma\Big(\frac{{\mathbf{W}}^{0}{\mathbf{x}}_{\rm test}}{\sqrt{d}}\Big),\qquad\lambda=\lambda^{1}=\lambda({\bm{\theta}},{\mathbf{x}}_{\rm test})=\frac{{\mathbf{v}}^{\intercal}}{\sqrt{k}}\sigma\Big(\frac{{\mathbf{W}}{\mathbf{x}}_{\rm test}}{\sqrt{d}}\Big). \tag{30}
$$
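At fixed weights, the covariance of such outputs over a shared Gaussian input can be computed through the Hermite series of App. A (the finite-size formula (35) below). As a numerical aside of ours (not part of the paper's pipeline; the choice $\sigma=\tanh$, sizes, seed and truncation level are arbitrary), the following sketch compares that series with a direct Monte Carlo average over ${\mathbf{x}}_{\rm test}$, with weight rows normalised so that $\Omega^{aa}_{ii}=1$ exactly:

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermeval

rng = np.random.default_rng(1)
d, k, N, L = 100, 3, 50_000, 7
sigma = np.tanh  # odd activation: mu_0 = 0, so the series can start at l = 1

# Hermite coefficients mu_l of sigma, eq. (20), by quadrature.
z = np.linspace(-10.0, 10.0, 200_001)
h = z[1] - z[0]
phi = np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)
mu = [np.sum(hermeval(z, [0.0] * l + [1.0]) * sigma(z) * phi) * h for l in range(L + 1)]

def sample_theta():
    W = rng.standard_normal((k, d))
    W *= np.sqrt(d) / np.linalg.norm(W, axis=1, keepdims=True)  # rows of norm sqrt(d)
    return W, rng.standard_normal(k)

(W1, v1), (W2, v2) = sample_theta(), sample_theta()

# Hermite-series prediction for the output covariance (elementwise powers of Omega).
Omega = W1 @ W2.T / d
K_series = sum(mu[l]**2 / factorial(l) * (v1 @ Omega**l @ v2) / k
               for l in range(1, L + 1))

# Direct Monte Carlo average over a shared Gaussian input x.
x = rng.standard_normal((N, d))
lam1 = sigma(x @ W1.T / np.sqrt(d)) @ v1 / np.sqrt(k)
lam2 = sigma(x @ W2.T / np.sqrt(d)) @ v2 / np.sqrt(k)
K_mc = float(np.mean(lam1 * lam2))
print(K_series, K_mc)
```

With independent weight samples the cross-overlaps $\Omega_{ij}$ are $O(1/\sqrt{d})$, so the series converges after a few terms and matches the Monte Carlo estimate up to sampling noise.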
Recall that the bracket $\langle\,\cdot\,\rangle$ is the average w.r.t. the posterior and acts on ${\bm{\theta}}^{1}={\bm{\theta}},{\bm{\theta}}^{2},\dots$. Notice that the last term on the r.h.s. of (29) can be rewritten as
$$
\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D},{\mathbf{x}}_{\rm test}}\langle\mathbb{E}[y\mid\lambda]\rangle^{2}=\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D},{\mathbf{x}}_{\rm test}}\langle\mathbb{E}[y\mid\lambda^{1}]\mathbb{E}[y\mid\lambda^{2}]\rangle,
$$
with superscripts being replica indices, i.e., $\lambda^{a}:=\lambda({\bm{\theta}}^{a},{\mathbf{x}}_{\rm test})$ .
In order to show Result 2.2 for a generic $P_{\rm out}$ we assume joint Gaussianity of the variables $(\lambda^{0},\lambda^{1},\lambda^{2},\dots)$, with covariance $K^{ab}$, $a,b\in\{0,1,2,\dots\}$. Indeed, in the limit “$\lim$”, our theory treats $(\lambda^{a})_{a\ge 0}$ as jointly Gaussian under the randomness of a common input, here ${\mathbf{x}}_{\rm test}$, conditionally on the weights $({\bm{\theta}}^{a})$. Their covariance depends on the weights $({\bm{\theta}}^{a})$ through the various overlap order parameters introduced in the main text. But in this limit the overlaps are assumed to concentrate, under the quenched posterior average $\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\langle\,\cdot\,\rangle$, towards the non-random asymptotic values given by the extremiser globally maximising the RS potential in Result 2.1, the overlaps entering $K^{ab}$ through (42). This hypothesis is then confirmed by the excellent agreement between our theoretical predictions based on this assumption and the experimental results. It directly implies the equation for $\lim\,\varepsilon^{\mathcal{C},\mathsf{f}}$ in Result 2.2 from definition (2). For the special case of the optimal mean-square generalisation error it yields
$$
\displaystyle\lim\,\varepsilon^{\rm opt}=\mathbb{E}_{\lambda^{0}}\mathbb{E}[y^{2}_{\rm test}\mid\lambda^{0}]-2\mathbb{E}_{\lambda^{0},\lambda^{1}}\mathbb{E}[y_{\rm test}\mid\lambda^{0}]\mathbb{E}[y\mid\lambda^{1}]+\mathbb{E}_{\lambda^{1},\lambda^{2}}\mathbb{E}[y\mid\lambda^{1}]\mathbb{E}[y\mid\lambda^{2}] \tag{31}
$$
where, in the replica symmetric ansatz,
$$
\displaystyle\mathbb{E}[(\lambda^{0})^{2}]=K^{00},\quad\mathbb{E}[\lambda^{0}\lambda^{1}]=\mathbb{E}[\lambda^{0}\lambda^{2}]=K^{01},\quad\mathbb{E}[\lambda^{1}\lambda^{2}]=K^{12},\quad\mathbb{E}[(\lambda^{1})^{2}]=\mathbb{E}[(\lambda^{2})^{2}]=K^{11}. \tag{32}
$$
For the dependence of the elements of ${\mathbf{K}}$ on the overlaps under this ansatz we refer the reader to (45), (46). In the Bayes-optimal setting, using the Nishimori identities (see App. B), one can show that $K^{01}=K^{12}$ and $K^{00}=K^{11}$. Because of these identifications, we additionally have
$$
\displaystyle\mathbb{E}_{\lambda^{0},\lambda^{1}}\mathbb{E}[y_{\rm test}\mid\lambda^{0}]\mathbb{E}[y\mid\lambda^{1}]=\mathbb{E}_{\lambda^{1},\lambda^{2}}\mathbb{E}[y\mid\lambda^{1}]\mathbb{E}[y\mid\lambda^{2}]. \tag{33}
$$
Plugging the above in (31) yields (7).
Let us now prove a formula for the optimal mean-square generalisation error written in terms of the overlaps that is simpler to evaluate numerically, valid for the special case of a linear readout with Gaussian label noise, $P_{\rm out}(y\mid\lambda)=\exp(-\frac{1}{2\Delta}(y-\lambda)^{2})/\sqrt{2\pi\Delta}$. The following derivation is exact and does not require any Gaussianity assumption on the random variables $(\lambda^{a})$. For this channel the conditional means verify $\mathbb{E}[y\mid\lambda]=\lambda$ and $\mathbb{E}[y^{2}\mid\lambda]=\lambda^{2}+\Delta$. Plugging these into (29) yields
$$
\displaystyle\varepsilon^{\rm opt}-\Delta=\mathbb{E}_{{\bm{\theta}}^{0},{\mathbf{x}}_{\rm test}}\lambda^{2}_{\rm test}-2\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D},{\mathbf{x}}_{\rm test}}\lambda^{0}\langle\lambda\rangle+\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D},{\mathbf{x}}_{\rm test}}\langle\lambda^{1}\lambda^{2}\rangle, \tag{34}
$$
whence we clearly see that the generalisation error depends only on the covariance of $\lambda_{\rm test}({\bm{\theta}}^{0})=\lambda^{0}({\bm{\theta}}^{0}),\lambda^{%
1}({\bm{\theta}}^{1}),\lambda^{2}({\bm{\theta}}^{2})$ under the randomness of the shared input ${\mathbf{x}}_{\rm test}$ at fixed weights, regardless of the validity of the Gaussian equivalence principle we assume in the replica computation. This covariance was already computed in (8); we recall it here for the reader’s convenience
$$
\displaystyle K({\bm{\theta}}^{a},{\bm{\theta}}^{b}):=\mathbb{E}\lambda^{a}%
\lambda^{b}=\sum_{\ell=1}^{\infty}\frac{\mu_{\ell}^{2}}{\ell!}\frac{1}{k}\sum_%
{i,j=1}^{k}v_{i}^{a}(\Omega^{ab}_{ij})^{\ell}v^{b}_{j}=\sum_{\ell=1}^{\infty}%
\frac{\mu_{\ell}^{2}}{\ell!}Q_{\ell}^{ab}, \tag{35}
$$
where $\Omega^{ab}_{ij}:={\mathbf{W}}_{i}^{a\intercal}{\mathbf{W}}_{j}^{b}/d$ , and $Q_{\ell}^{ab}$ as introduced in (8) for $a,b=0,1,2$ . We stress that $K({\bm{\theta}}^{a},{\bm{\theta}}^{b})$ is not the limiting covariance $K^{ab}$ whose elements are in (45), (46), but rather the finite-size one. $K({\bm{\theta}}^{a},{\bm{\theta}}^{b})$ provides us with an efficient way to compute the generalisation error numerically, that is, through the formula
$$
\displaystyle\varepsilon^{\rm opt}-\Delta \displaystyle=\mathbb{E}_{{\bm{\theta}}^{0}}K({\bm{\theta}}^{0},{\bm{\theta}}^%
{0})-2\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\langle K({\bm{\theta}}^{0},{%
\bm{\theta}}^{1})\rangle+\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\langle K({%
\bm{\theta}}^{1},{\bm{\theta}}^{2})\rangle=\sum_{\ell=1}^{\infty}\frac{\mu_{%
\ell}^{2}}{\ell!}\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\langle Q_{\ell}^{0%
0}-2Q_{\ell}^{01}+Q^{12}_{\ell}\rangle. \tag{36}
$$
In the above, the posterior measure $\langle\,·\,\rangle$ is taken care of by Monte Carlo sampling (when it equilibrates). In addition, as in the main text, we assume that in the large system limit the (numerically confirmed) identity (11) holds. Putting all ingredients together we get
$$
\displaystyle\varepsilon^{\rm opt}-\Delta=\mathbb{E}_{{\bm{\theta}}^{0},%
\mathcal{D}} \displaystyle\Big{\langle}\mu_{1}^{2}(Q_{1}^{00}-2Q^{01}_{1}+Q^{12}_{1})+\frac%
{\mu_{2}^{2}}{2}(Q_{2}^{00}-2Q^{01}_{2}+Q^{12}_{2}) \displaystyle+\mathbb{E}_{v\sim P_{v}}v^{2}\big{[}g(\mathcal{Q}_{W}^{00}(v))-2%
g(\mathcal{Q}_{W}^{01}(v))+g(\mathcal{Q}_{W}^{12}(v))\big{]}\Big{\rangle}. \tag{37}
$$
In the Bayes-optimal setting one can use again the Nishimori identities that imply $\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\langle Q^{12}_{1}\rangle=\mathbb{E}%
_{{\bm{\theta}}^{0},\mathcal{D}}\langle Q^{01}_{1}\rangle$ , and analogously $\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\langle Q^{12}_{2}\rangle=\mathbb{E}%
_{{\bm{\theta}}^{0},\mathcal{D}}\langle Q^{01}_{2}\rangle$ and $\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\langle g(\mathcal{Q}^{12}_{W}(v))%
\rangle=\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\langle g(\mathcal{Q}^{01}_{%
W}(v))\rangle$ . Inserting these identities in (37) one gets
$$
\displaystyle\varepsilon^{\rm opt}-\Delta \displaystyle=\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\Big{\langle}\mu_{1}^{%
2}(Q_{1}^{00}-Q^{01}_{1})+\frac{\mu_{2}^{2}}{2}(Q_{2}^{00}-Q^{01}_{2})+\mathbb%
{E}_{v\sim P_{v}}v^{2}\big{[}g(\mathcal{Q}_{W}^{00}(v))-g(\mathcal{Q}_{W}^{01}%
(v))\big{]}\Big{\rangle}. \tag{38}
$$
This formula makes no assumptions beyond (11), in particular none on the law of the $\lambda$ ’s. That it depends only on their covariance is simply a consequence of the quadratic nature of the mean-square generalisation error.
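As an illustration of how (36) is evaluated in practice from Monte Carlo estimates of the posterior-averaged overlaps (a minimal sketch; the truncation level and variable names are ours, not the paper's):

```python
from math import factorial

def gen_error_minus_delta(mu, Q00, Q01, Q12):
    """Evaluate (36): eps_opt - Delta = sum_l mu_l^2/l! * <Q_l^00 - 2 Q_l^01 + Q_l^12>.
    mu[l] is the l-th Hermite coefficient of the activation; the overlap lists
    hold the posterior-averaged overlaps for degrees l = 1, 2, ... (truncated)."""
    return sum(mu[l]**2 / factorial(l) * (Q00[l-1] - 2*Q01[l-1] + Q12[l-1])
               for l in range(1, len(Q00) + 1))

# A perfectly recovered teacher (all overlaps equal) gives zero excess error.
print(gen_error_minus_delta([0.0, 1.0, 0.5], [1.0, 0.9], [1.0, 0.9], [1.0, 0.9]))  # → 0.0
```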
**Remark C.1**
*Note that the derivation up to (36) did not assume Bayes-optimality (while (38) does). Therefore, one can consider it in cases where the true posterior average $\langle\,·\,\rangle$ is replaced by one not verifying the Nishimori identities. This is the formula we use to compute the generalisation error of Monte Carlo-based estimators in the inset of Fig. 7. This is indeed needed to compute the generalisation error in the glassy regime, where MCMC cannot equilibrate.*
**Remark C.2**
*Using the Nishimori identity of App. B and, again, the fact that for the linear readout with Gaussian label noise $\mathbb{E}[y\mid\lambda]=\lambda$ and $\mathbb{E}[y^{2}\mid\lambda]=\lambda^{2}+\Delta$ , it is easy to check that the so-called Gibbs error
$$
\varepsilon^{\rm Gibbs}:=\mathbb{E}_{\bm{\theta}^{0},{\mathcal{D}},{\mathbf{x}%
}_{\rm test},y_{\rm test}}\big{\langle}(y_{\rm test}-\mathbb{E}[y\mid\lambda_{%
\rm test}({\bm{\theta}})])^{2}\big{\rangle} \tag{39}
$$
is related for this channel to the Bayes-optimal mean-square generalisation error through the identity
$$
\varepsilon^{\rm Gibbs}-\Delta=2(\varepsilon^{\rm opt}-\Delta). \tag{40}
$$
We exploited this relationship together with the concentration of the Gibbs error w.r.t. the quenched posterior measure $\mathbb{E}_{{\bm{\theta}}^{0},\mathcal{D}}\langle\,·\,\rangle$ when evaluating the numerical generalisation error of the Monte Carlo algorithms reported in the main text.*
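The factor-two relation (40) can be illustrated in a minimal scalar toy model with an exactly Gaussian posterior; the single-observation setup below is our own illustration, not the network model of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau, Delta = 500_000, 0.5, 0.3
lam0 = rng.normal(size=n)                        # teacher "post-activations"
obs = lam0 + np.sqrt(tau) * rng.normal(size=n)   # noisy observation -> Gaussian posterior
post_mean, post_var = obs / (1 + tau), tau / (1 + tau)
lam1 = post_mean + np.sqrt(post_var) * rng.normal(size=n)   # one Gibbs (posterior) sample
y_test = lam0 + np.sqrt(Delta) * rng.normal(size=n)         # labels through the Gaussian channel

eps_opt = np.mean((y_test - post_mean)**2)   # Bayes-optimal (posterior mean) predictor
eps_gibbs = np.mean((y_test - lam1)**2)      # Gibbs predictor
# Empirically eps_gibbs - Delta ≈ 2 (eps_opt - Delta), as in (40)
```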
Appendix D Details of the replica calculation
D.1 Energetic potential
The replicated energetic term under our Gaussian assumption on the joint law of the post-activations replicas is reported here for the reader’s convenience:
$$
F_{E}=\ln\int dy\int d{\bm{\lambda}}\frac{e^{-\frac{1}{2}{\bm{\lambda}}^{%
\intercal}{\mathbf{K}}^{-1}{\bm{\lambda}}}}{\sqrt{(2\pi)^{s+1}\det{\mathbf{K}}%
}}\prod_{a=0}^{s}P_{\rm out}(y\mid\lambda^{a}). \tag{41}
$$
After applying our ansatz (10) and using that $Q_{1}^{ab}=1$ in the quadratic-data regime, the covariance matrix ${\mathbf{K}}$ in replica space defined in (8) reads
$$
\displaystyle K^{ab} \displaystyle=\mu_{1}^{2}+\frac{\mu_{2}^{2}}{2}Q^{ab}_{2}+\mathbb{E}_{v\sim P_%
{v}}v^{2}g(\mathcal{Q}_{W}^{ab}(v)), \tag{42}
$$
where the function
$$
g(x)=\sum_{\ell=3}^{\infty}\frac{\mu_{\ell}^{2}}{\ell!}x^{\ell}=\mathbb{E}_{(y%
,z)|x}[\sigma(y)\sigma(z)]-\mu_{0}^{2}-\mu_{1}^{2}x-\frac{\mu_{2}^{2}}{2}x^{2}%
,\qquad(y,z)\sim{\mathcal{N}}\left((0,0),\begin{pmatrix}1&x\\
x&1\end{pmatrix}\right). \tag{43}
$$
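The identity (43) defining $g$ can be checked numerically: compute the Hermite coefficients $\mu_{\ell}$ of an activation by Gauss–Hermite quadrature and compare the truncated series with the direct two-point Gaussian expectation (the choice $\sigma=\tanh$ and the truncation at $\ell=30$ are illustrative):

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermegauss

sigma = np.tanh                      # example activation (our choice)
nodes, weights = hermegauss(80)
w = weights / np.sqrt(2 * np.pi)     # quadrature weights for E_{z~N(0,1)}

# Probabilists' Hermite polynomials He_l at the quadrature nodes (recurrence).
L = 30
He = np.zeros((L + 1, len(nodes)))
He[0], He[1] = 1.0, nodes
for l in range(2, L + 1):
    He[l] = nodes * He[l - 1] - (l - 1) * He[l - 2]
mu = np.array([np.sum(w * sigma(nodes) * He[l]) for l in range(L + 1)])

x = 0.6
g_series = sum(mu[l]**2 / factorial(l) * x**l for l in range(3, L + 1))

# Direct evaluation: E[sigma(y)sigma(z)] with (y,z) standard Gaussians of correlation x.
Y = nodes[:, None]
Z = x * nodes[:, None] + np.sqrt(1 - x**2) * nodes[None, :]
Eyz = np.sum(w[:, None] * w[None, :] * sigma(Y) * sigma(Z))
g_direct = Eyz - mu[0]**2 - mu[1]**2 * x - mu[2]**2 / 2 * x**2
# g_series and g_direct agree to high accuracy
```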
The energetic term $F_{E}$ is already expressed as a low-dimensional integral, but within the replica symmetric (RS) ansatz it simplifies considerably. Let us denote $\bm{\mathcal{Q}}_{W}(\mathsf{v})=(\mathcal{Q}_{W}^{ab}(\mathsf{v}))_{a,b=0}^{s}$ . The RS ansatz amounts to assuming that the saddle point is dominated by order parameters of the form (below $\bm{1}_{s}$ and ${\mathbb{I}}_{s}$ are the all-ones vector and identity matrix of size $s$ )
$$
\bm{\mathcal{Q}}_{W}(\mathsf{v})=\begin{pmatrix}\rho_{W}&m_{W}\bm{1}_{s}^{%
\intercal}\\
m_{W}\bm{1}_{s}&(r_{W}-\mathcal{Q}_{W}){\mathbb{I}}_{s}+\mathcal{Q}_{W}\bm{1}_%
{s}\bm{1}_{s}^{\intercal}\end{pmatrix}\iff\hat{\bm{\mathcal{Q}}}_{W}(\mathsf{v%
})=\begin{pmatrix}\hat{\rho}_{W}&-\hat{m}_{W}\bm{1}_{s}^{\intercal}\\
-\hat{m}_{W}\bm{1}_{s}&(\hat{r}_{W}+\hat{\mathcal{Q}}_{W}){\mathbb{I}}_{s}-%
\hat{\mathcal{Q}}_{W}\bm{1}_{s}\bm{1}_{s}^{\intercal}\end{pmatrix},
$$
where all the above parameters $\rho_{W}=\rho_{W}(\mathsf{v}),\hat{\rho}_{W},m_{W},...$ depend on $\mathsf{v}$ , and similarly
$$
{\mathbf{Q}}_{2}=\begin{pmatrix}\rho_{2}&m_{2}\bm{1}_{s}^{\intercal}\\
m_{2}\bm{1}_{s}&(r_{2}-q_{2}){\mathbb{I}}_{s}+q_{2}\bm{1}_{s}\bm{1}_{s}^{%
\intercal}\end{pmatrix}\iff\hat{{\mathbf{Q}}}_{2}=\begin{pmatrix}\hat{\rho}_{2%
}&-\hat{m}_{2}\bm{1}_{s}^{\intercal}\\
-\hat{m}_{2}\bm{1}_{s}&(\hat{r}_{2}+\hat{q}_{2}){\mathbb{I}}_{s}-\hat{q}_{2}%
\bm{1}_{s}\bm{1}_{s}^{\intercal}\end{pmatrix},
$$
where we reported the ansatz also for the Fourier conjugates, for future convenience, though they are not needed for the energetic potential. (We are going to use repeatedly the Fourier representation of the delta function, namely $\delta(x)=\frac{1}{2\pi}\int d\hat{x}\exp(i\hat{x}x)$ . Because the integrals we will end up with are always, at some point, evaluated by saddle point, implying a deformation of the integration contour in the complex plane, tracking the imaginary unit $i$ in the delta functions is irrelevant. Similarly, the normalisation $1/2\pi$ only contributes sub-leading terms in the integrals at hand. We therefore allow ourselves to formally write $\delta(x)=\int d\hat{x}\exp(r\hat{x}x)$ for a convenient constant $r$ ; again, as we evaluate the final integrals by saddle point, the choice of $r$ ends up being irrelevant.) The RS ansatz, which is equivalent to an assumption of concentration of the order parameters in the high-dimensional limit, is known to be exact when analysing Bayes-optimal inference and learning, as in the present paper, see Nishimori (2001); Barbier (2020); Barbier & Panchenko (2022). Under the RS ansatz ${\mathbf{K}}$ acquires a similar form:
$$
\displaystyle{\mathbf{K}}=\begin{pmatrix}\rho_{K}&m_{K}\bm{1}_{s}^{\intercal}%
\\
m_{K}\bm{1}_{s}&(r_{K}-q_{K}){\mathbb{I}}_{s}+q_{K}\bm{1}_{s}\bm{1}_{s}^{%
\intercal}\end{pmatrix} \tag{44}
$$
with
$$
\displaystyle m_{K}=\mu_{1}^{2}+\frac{\mu_{2}^{2}}{2}m_{2}+\mathbb{E}_{v\sim P%
_{v}}v^{2}g(m_{W}(v)),\quad \displaystyle q_{K}=\mu_{1}^{2}+\frac{\mu_{2}^{2}}{2}q_{2}+\mathbb{E}_{v\sim P%
_{v}}v^{2}g(\mathcal{Q}_{W}(v)), \displaystyle\rho_{K}=\mu_{1}^{2}+\frac{\mu_{2}^{2}}{2}\rho_{2}+\mathbb{E}_{v%
\sim P_{v}}v^{2}g(\rho_{W}(v)),\quad \displaystyle r_{K}=\mu_{1}^{2}+\frac{\mu_{2}^{2}}{2}r_{2}+\mathbb{E}_{v\sim P%
_{v}}v^{2}g(r_{W}(v)). \tag{45}
$$
In the RS ansatz it is thus possible to give a convenient low-dimensional representation of the multivariate Gaussian integral of $F_{E}$ in terms of white Gaussian random variables:
$$
\displaystyle\lambda^{a}=\xi\sqrt{q_{K}}+u^{a}\sqrt{r_{K}-q_{K}}\quad\text{for%
}a=1,\dots,s,\qquad\lambda^{0}=\xi\sqrt{\frac{m_{K}^{2}}{q_{K}}}+u^{0}\sqrt{%
\rho_{K}-\frac{m_{K}^{2}}{q_{K}}} \tag{47}
$$
where $\xi,(u^{a})_{a=0}^{s}$ are i.i.d. standard Gaussian variables. Then
$$
\displaystyle F_{E}=\ln\int dy\,\mathbb{E}_{\xi,u^{0}}P_{\rm out}\Big{(}y\mid%
\xi\sqrt{\frac{m_{K}^{2}}{q_{K}}}+u^{0}\sqrt{\rho_{K}-\frac{m_{K}^{2}}{q_{K}}}%
\Big{)}\prod_{a=1}^{s}\mathbb{E}_{u^{a}}P_{\rm out}(y\mid\xi\sqrt{q_{K}}+u^{a}%
\sqrt{r_{K}-q_{K}}). \tag{48}
$$
The last product over the replica index $a$ contains identical factors thanks to the RS ansatz. Therefore, by expanding in $s→ 0^{+}$ we get
$$
\displaystyle F_{E}=s\int dy\,\mathbb{E}_{\xi,u^{0}}P_{\rm out}\Big{(}y\mid\xi%
\sqrt{\frac{m_{K}^{2}}{q_{K}}}+u^{0}\sqrt{\rho_{K}-\frac{m_{K}^{2}}{q_{K}}}%
\Big{)}\ln\mathbb{E}_{u}P_{\rm out}(y\mid\xi\sqrt{q_{K}}+u\sqrt{r_{K}-q_{K}})+%
O(s^{2}). \tag{49}
$$
For the linear readout with Gaussian label noise $P_{\rm out}(y\mid\lambda)=\exp(-\frac{1}{2\Delta}(y-\lambda)^{2})/\sqrt{2\pi\Delta}$ the above gives
$$
\displaystyle F_{E}=-\frac{s}{2}\ln\big{[}2\pi(\Delta+r_{K}-q_{K})\big{]}-%
\frac{s}{2}\frac{\Delta+\rho_{K}-2m_{K}+q_{K}}{\Delta+r_{K}-q_{K}}+O(s^{2}). \tag{50}
$$
In the Bayes-optimal setting the Nishimori identities enforce
$$
\displaystyle r_{2}=\rho_{2}=\lim_{d\to\infty}\frac{1}{d^{2}}\mathbb{E}{\rm Tr%
}[({\mathbf{S}}_{2}^{0})^{2}]=1+\gamma\bar{v}^{2}\quad\text{and}\quad m_{2}=q_%
{2}, \displaystyle r_{W}(\mathsf{v})=\rho_{W}(\mathsf{v})=1\quad\text{and}\quad m_{%
W}(\mathsf{v})=\mathcal{Q}_{W}(\mathsf{v})\ \forall\ \mathsf{v}\in\mathsf{V}, \tag{51}
$$
which implies also that
$$
\displaystyle r_{K}=\rho_{K}=\mu_{1}^{2}+\frac{1}{2}r_{2}\mu_{2}^{2}+g(1),\quad m_{K}=q_{K}. \tag{52}
$$
Therefore the above simplifies to
$$
\displaystyle F_{E} \displaystyle=s\int dy\,\mathbb{E}_{\xi,u^{0}}P_{\rm out}(y\mid\xi\sqrt{q_{K}}%
+u^{0}\sqrt{r_{K}-q_{K}})\ln\mathbb{E}_{u}P_{\rm out}(y\mid\xi\sqrt{q_{K}}+u%
\sqrt{r_{K}-q_{K}})+O(s^{2}) \displaystyle=:s\,\psi_{P_{\rm{out}}}(q_{K}(q_{2},\mathcal{Q}_{W});r_{K})+O(s^%
{2}). \tag{54}
$$
Notice that the energetic contribution to the free entropy has the same form as in the generalised linear model Barbier et al. (2019). For our running example of linear readout with Gaussian noise the function $\psi_{P_{\rm out}}$ reduces to
$$
\displaystyle\psi_{P_{\rm{out}}}(q_{K}(q_{2},\mathcal{Q}_{W});r_{K})=-\frac{1}%
{2}\ln\big{[}2\pi e(\Delta+r_{K}-q_{K})\big{]}. \tag{56}
$$
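Formula (56) is, up to sign, the differential entropy of a Gaussian of variance $\Delta+r_{K}-q_{K}$; a direct numerical integration confirms it (parameter values below are arbitrary):

```python
import numpy as np

Delta, rK, qK = 0.1, 1.3, 0.9
s2 = Delta + rK - qK                      # effective output variance of the channel
y = np.linspace(-12 * np.sqrt(s2), 12 * np.sqrt(s2), 200_001)
dy = y[1] - y[0]
p = np.exp(-y**2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
psi_numeric = np.sum(p * np.log(p)) * dy  # ∫ p ln p dy (negative Gaussian entropy)
psi_formula = -0.5 * np.log(2 * np.pi * np.e * s2)
# psi_numeric ≈ psi_formula
```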
In what follows we shall restrict ourselves to the replica symmetric ansatz in the Bayes-optimal setting. Therefore, identifications such as those in (51), (52) are assumed.
D.2 Second moment of $P(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W})$
For the reader’s convenience we report here the measure
$$
\displaystyle P(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W}) \displaystyle=V_{W}^{kd}(\bm{\mathcal{Q}}_{W})^{-1}\int\prod_{a}^{0,s}dP_{W}({%
\mathbf{W}}^{a})\delta({\mathbf{S}}^{a}_{2}-{\mathbf{W}}^{a\intercal}({\mathbf%
{v}}){\mathbf{W}}^{a}/\sqrt{k})\prod_{a\leq b}^{0,s}\prod_{\mathsf{v}\in%
\mathsf{V}}\prod_{i\in\mathcal{I}_{\mathsf{v}}}\delta({d}\,\mathcal{Q}_{W}^{ab%
}(\mathsf{v})-{\mathbf{W}}^{a\intercal}_{i}{\mathbf{W}}_{i}^{b}). \tag{57}
$$
Recall $\mathsf{V}$ is the support of $P_{v}$ (assumed discrete for the moment). Recall also that we have quenched the readout weights to the ground truth. Indeed, as discussed in the main, considering them learnable or fixed to the truth does not change the leading order of the information-theoretic quantities.
Under this measure, one can compute the asymptotics of the second moment
$$
\displaystyle\int dP(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W})\frac{1}{d%
^{2}}{\rm Tr}\,{\mathbf{S}}_{2}^{a}{\mathbf{S}}_{2}^{b} \displaystyle=V_{W}^{kd}(\bm{\mathcal{Q}}_{W})^{-1}\int\prod_{a}^{0,s}dP_{W}({%
\mathbf{W}}^{a})\frac{1}{kd^{2}}{\rm Tr}[{\mathbf{W}}^{a\intercal}({\mathbf{v}%
}){\mathbf{W}}^{a}{\mathbf{W}}^{b\intercal}({\mathbf{v}}){\mathbf{W}}^{b}] \displaystyle\qquad\qquad\times\prod_{a\leq b}^{0,s}\prod_{\mathsf{v}\in%
\mathsf{V}}\prod_{i\in\mathcal{I}_{v}}\delta({d}\,\mathcal{Q}_{W}^{ab}(\mathsf%
{v})-{\mathbf{W}}^{a\intercal}_{i}{\mathbf{W}}_{i}^{b}). \tag{58}
$$
The measure is coupled only through the latter $\delta$ ’s. We can decouple the measure at the cost of introducing Fourier conjugates whose values will then be fixed by a saddle point computation. The second moment computed will not affect the saddle point, hence it is sufficient to determine the value of the Fourier conjugates through the computation of $V_{W}^{kd}(\bm{\mathcal{Q}}_{W})$ , which rewrites as
$$
\displaystyle V_{W}^{kd}(\bm{\mathcal{Q}}_{W}) \displaystyle=\int\prod_{a}^{0,s}dP_{W}({\mathbf{W}}^{a})\prod_{a\leq b}^{0,s}%
\prod_{\mathsf{v}\in\mathsf{V}}\prod_{i\in\mathcal{I}_{\mathsf{v}}}d\hat{B}^{%
ab}_{i}(\mathsf{v})\exp\big{[}-\hat{B}^{ab}_{i}(\mathsf{v})({d}\,\mathcal{Q}_{%
W}^{ab}(\mathsf{v})-{\mathbf{W}}^{a\intercal}_{i}{\mathbf{W}}_{i}^{b})\big{]} \displaystyle\approx\prod_{\mathsf{v}\in\mathsf{V}}\prod_{i\in\mathcal{I}_{%
\mathsf{v}}}\exp\Big{(}d\,{\rm extr}_{(\hat{B}^{ab}_{i}(\mathsf{v}))}\Big{[}-%
\sum_{a\leq b,0}^{s}\hat{B}^{ab}_{i}(\mathsf{v})\mathcal{Q}_{W}^{ab}(\mathsf{v%
})+\ln\int\prod_{a=0}^{s}dP_{W}(w_{a})e^{\sum_{a\leq b,0}^{s}\hat{B}_{i}^{ab}(%
\mathsf{v})w_{a}w_{b}}\Big{]}\Big{)}. \tag{59}
$$
In the last line we have used saddle point integration over $\hat{B}^{ab}_{i}(\mathsf{v})$ and the approximate equality is up to a multiplicative $\exp(o(n))$ constant. From the above, it is clear that the stationary $\hat{B}^{ab}_{i}(\mathsf{v})$ are such that
$$
\displaystyle\mathcal{Q}_{W}^{ab}(\mathsf{v})=\frac{\int\prod_{r=0}^{s}dP_{W}(%
w_{r})w_{a}w_{b}\prod_{r\leq t,0}^{s}e^{\hat{B}_{i}^{rt}(\mathsf{v})w_{r}w_{t}%
}}{\int\prod_{r=0}^{s}dP_{W}(w_{r})\prod_{r\leq t,0}^{s}e^{\hat{B}_{i}^{rt}(%
\mathsf{v})w_{r}w_{t}}}=:\langle w_{a}w_{b}\rangle_{\hat{\mathbf{B}}(\mathsf{v%
})}. \tag{60}
$$
Hence the $\hat{B}_{i}^{ab}(\mathsf{v})=\hat{B}^{ab}(\mathsf{v})$ are homogeneous in the neuron index. Using these notations, the asymptotic trace moment of the ${\mathbf{S}}_{2}$ ’s at leading order becomes
$$
\displaystyle\int dP(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W}) \displaystyle\frac{1}{d^{2}}{\rm Tr}\,{\mathbf{S}}_{2}^{a}{\mathbf{S}}_{2}^{b}%
=\frac{1}{kd^{2}}\sum_{i,l=1}^{k}\sum_{j,p=1}^{d}\langle W_{ij}^{a}v_{i}W_{ip}%
^{a}W_{lj}^{b}v_{l}W_{lp}^{b}\rangle_{\{\hat{\mathbf{B}}(\mathsf{v})\}_{%
\mathsf{v}\in\mathsf{V}}} \displaystyle=\frac{1}{k}\sum_{\mathsf{v}\in\mathsf{V}}\mathsf{v}^{2}\sum_{i%
\in\mathcal{I}_{\mathsf{v}}}\Big{\langle}\Big{(}\frac{1}{d}\sum_{j=1}^{d}W_{ij%
}^{a}W_{ij}^{b}\Big{)}^{2}\Big{\rangle}_{\hat{\mathbf{B}}(\mathsf{v})}+\frac{1%
}{k}\sum_{j=1}^{d}\Big{\langle}\sum_{i=1}^{k}\frac{v_{i}(W_{ij}^{a})^{2}}{d}%
\sum_{l\neq i,1}^{k}\frac{v_{l}(W_{lj}^{b})^{2}}{d}\Big{\rangle}_{\hat{\mathbf%
{B}}(\mathsf{v})}. \tag{61}
$$
We have used the fact that $\smash{\langle\,·\,\rangle_{\hat{\mathbf{B}}(\mathsf{v})}}$ is symmetric if the prior $P_{W}$ is, thus forcing us to match $j$ with $p$ if $i≠ l$ . Since by the Nishimori identities $\mathcal{Q}_{W}^{aa}(\mathsf{v})=1$ , we have $\hat{B}^{aa}(\mathsf{v})=0$ for any $a=0,1,...,s$ and $\mathsf{v}∈\mathsf{V}$ . Furthermore, the measure $\langle\,·\,\rangle_{\hat{\mathbf{B}}(\mathsf{v})}$ is completely factorised over neuron and input indices. Hence every normalised sum can be assumed to concentrate to its expectation by the law of large numbers. Specifically, we can write that, with high probability as $d,k→∞$ ,
$$
\displaystyle\frac{1}{d}\sum_{j=1}^{d}W_{ij}^{a}W_{ij}^{b}\xrightarrow{}%
\mathcal{Q}_{W}^{ab}(\mathsf{v})\ \forall\ i\in\mathcal{I}_{\mathsf{v}},\qquad%
\frac{1}{k}\sum_{\mathsf{v},\mathsf{v}^{\prime}\in\mathsf{V}}\mathsf{v}\mathsf%
{v}^{\prime}\sum_{j=1}^{d}\sum_{i\in\mathcal{I}_{\mathsf{v}}}\frac{(W_{ij}^{a}%
)^{2}}{d}\sum_{l\in\mathcal{I}_{\mathsf{v}^{\prime}},l\neq i}\frac{(W_{lj}^{b}%
)^{2}}{d}\approx\gamma\sum_{\mathsf{v},\mathsf{v}^{\prime}\in\mathsf{V}}\frac{%
|\mathcal{I}_{\mathsf{v}}||\mathcal{I}_{\mathsf{v}^{\prime}}|}{k^{2}}\mathsf{v%
}\mathsf{v}^{\prime}\to\gamma\bar{v}^{2}, \tag{62}
$$
where we used $|\mathcal{I}_{\mathsf{v}}|/k→ P_{v}(\mathsf{v})$ as $k$ diverges. Consequently, the second moment at leading order appears as claimed:
$$
\displaystyle\int dP(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W}) \displaystyle\frac{1}{d^{2}}{\rm Tr}\,{\mathbf{S}}_{2}^{a}{\mathbf{S}}_{2}^{b}%
=\sum_{\mathsf{v}\in\mathsf{V}}P_{v}(\mathsf{v})\mathsf{v}^{2}\mathcal{Q}_{W}^%
{ab}(\mathsf{v})^{2}+\gamma\bar{v}^{2}=\mathbb{E}_{v\sim P_{v}}v^{2}\mathcal{Q%
}_{W}^{ab}(v)^{2}+\gamma\bar{v}^{2}. \tag{63}
$$
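The second-moment formula (63) can be checked by direct simulation: draw two weight matrices with a prescribed row-wise overlap $Q$ and compare the normalised trace moment with the prediction (the sizes, and the choice $v∈\{± 1\}$ for which $\bar{v}=0$ and $\mathbb{E}v^{2}=1$, are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, Q = 400, 200, 0.6                  # gamma = k/d = 0.5
v = rng.choice([-1.0, 1.0], size=k)      # E v = 0, E v^2 = 1
Wa = rng.normal(size=(k, d))
Wb = Q * Wa + np.sqrt(1 - Q**2) * rng.normal(size=(k, d))   # row-wise overlap Q

Sa = Wa.T @ (v[:, None] * Wa) / np.sqrt(k)
Sb = Wb.T @ (v[:, None] * Wb) / np.sqrt(k)
moment = np.trace(Sa @ Sb) / d**2
# Prediction (63): E[v^2] Q^2 + gamma * vbar^2 = Q^2 = 0.36 here (vbar = 0)
```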
Notice that the effective law $\tilde{P}(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W})$ in (15) is the least restrictive choice among the Wishart-type distributions with a trace moment fixed precisely to the one above. In more specific terms, it is the solution of the following maximum entropy problem:
$$
\displaystyle\inf_{P,\tau}\Big{\{}D_{\rm KL}(P\,\|\,P_{S}^{\otimes s+1})+\sum_%
{a\leq b,0}^{s}\tau^{ab}\Big{(}\mathbb{E}_{P}\frac{1}{d^{2}}{\rm Tr}\,{\mathbf%
{S}}_{2}^{a}{\mathbf{S}}_{2}^{b}-\gamma\bar{v}^{2}-\mathbb{E}_{v\sim P_{v}}v^{%
2}\mathcal{Q}_{W}^{ab}(v)^{2}\Big{)}\Big{\}}, \tag{64}
$$
where $P_{S}$ is a generalised Wishart distribution (as defined above (15)), and $P$ is in the space of joint probability distributions over $s+1$ symmetric matrices of dimension $d× d$ . The rationale behind the choice of $P_{S}$ as a base measure is that, in the absence of any other information, a statistician can always use a generalised Wishart measure for the ${\mathbf{S}}_{2}$ ’s if they assume universality in the law of the inner weights. This ansatz would yield the theory of Maillard et al. (2024a), which still describes a non-trivial performance, achieved by the adaptation of the GAMP-RIE of Appendix H.
Note that if $a=b$ then, by (51), the second moment above matches precisely $r_{2}=1+\gamma\bar{v}^{2}$ . This entails directly $\tau^{aa}=0$ , as the generalised Wishart prior $P_{S}$ already imposes this constraint.
D.3 Entropic potential
We now use the results from the previous section to compute the entropic contribution $F_{S}$ to the free entropy:
$$
\displaystyle e^{F_{S}}:=V_{W}^{kd}(\bm{\mathcal{Q}}_{W})\int dP(({\mathbf{S}}%
_{2}^{a})\mid\bm{\mathcal{Q}}_{W})\prod_{a\leq b}^{0,s}\delta(d^{2}Q_{2}^{ab}-%
{{\rm Tr}\,{\mathbf{S}}_{2}^{a}{\mathbf{S}}_{2}^{b}}). \tag{65}
$$
The factor $V_{W}^{kd}(\bm{\mathcal{Q}}_{W})$ was already treated in the previous section. However, here it will contribute as a tilt of the overall entropic contribution, and the Fourier conjugates $\hat{\mathcal{Q}}_{W}^{ab}(\mathsf{v})$ will appear in the final variational principle.
Let us now proceed with the relaxation of the measure $P(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W})$ by replacing it with $\tilde{P}(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W})$ given by (15):
$$
\displaystyle e^{F_{S}}=V_{W}^{kd}(\bm{\mathcal{Q}}_{W})\int d\hat{{\mathbf{Q}%
}}_{2}\exp\Big{(}-\frac{d^{2}}{2}\sum_{a\leq b,0}^{s}\hat{Q}^{ab}_{2}Q^{ab}_{2%
}\Big{)}\frac{1}{\tilde{V}^{kd}_{W}(\bm{\mathcal{Q}}_{W})}\int\prod_{a=0}^{s}%
dP_{S}({\mathbf{S}}_{2}^{a})\exp\Big{(}\sum_{a\leq b,0}^{s}\frac{\tau_{ab}+%
\hat{Q}_{2}^{ab}}{2}{\rm Tr}\,{\mathbf{S}}_{2}^{a}{\mathbf{S}}_{2}^{b}\Big{)} \tag{66}
$$
where we have introduced another set of Fourier conjugates $\hat{\mathbf{Q}}_{2}$ for ${\mathbf{Q}}_{2}$ . As usual, the Nishimori identities impose $Q_{2}^{aa}=r_{2}=1+\gamma\bar{v}^{2}$ without the need for any Fourier conjugate. Hence, similarly to $\tau^{aa}$ , $\hat{Q}_{2}^{aa}=0$ too. Furthermore, under the hypothesis of replica symmetry, we set $\tau^{ab}=\tau$ and $\hat{Q}_{2}^{ab}=\hat{q}_{2}$ for all $0≤ a<b≤ s$ .
Then, when the number of replicas $s$ tends to $0^{+}$ , we can recognise the free entropy of a matrix denoising problem. More specifically, using the Hubbard–Stratonovich transformation (i.e., $\mathbb{E}_{{\mathbf{Z}}}\exp(\frac{d}{2}{\rm Tr}\,{\mathbf{M}}{\mathbf{Z}})=%
\exp(\frac{d}{4}{\rm Tr}\,{\mathbf{M}}^{2})$ for a $d× d$ symmetric matrix ${\mathbf{M}}$ with ${\mathbf{Z}}$ a standard GOE matrix) we get
$$
\displaystyle J_{n}(\tau,\hat{q}_{2}) \displaystyle:=\lim_{s\to 0^{+}}\frac{1}{ns}\ln\int\prod_{a=0}^{s}dP_{S}({%
\mathbf{S}}_{2}^{a})\exp\Big{(}\frac{\tau+\hat{q}_{2}}{2}\sum_{a<b,0}^{s}{\rm
Tr%
}\,{\mathbf{S}}_{2}^{a}{\mathbf{S}}_{2}^{b}\Big{)} \displaystyle=\frac{1}{n}\mathbb{E}\ln\int dP_{S}({\mathbf{S}}_{2})\exp\frac{1%
}{2}{\rm Tr}\Big{(}\sqrt{\tau+\hat{q}_{2}}{\mathbf{Y}}{\mathbf{S}}_{2}-(\tau+%
\hat{q}_{2})\frac{{\mathbf{S}}_{2}^{2}}{2}\Big{)}, \tag{67}
$$
where ${\mathbf{Y}}={\mathbf{Y}}(\tau+\hat{q}_{2})=\sqrt{\tau+\hat{q}_{2}}{\mathbf{S}%
}_{2}^{0}+{\bm{\xi}}$ with ${\bm{\xi}}/\sqrt{d}$ a standard GOE matrix, and the outer expectation is w.r.t. ${\mathbf{Y}}$ (or ${\mathbf{S}}^{0},{\bm{\xi}}$ ). Thanks to the fact that the base measure $P_{S}$ is rotationally invariant, the above can be solved exactly in the limit $n→∞,\,n/d^{2}→\alpha$ (see e.g. Pourkamali et al. (2024)):
$$
\displaystyle J(\tau,\hat{q}_{2})=\lim J_{n}(\tau,\hat{q}_{2})=\frac{1}{\alpha%
}\Big{(}\frac{(\tau+\hat{q}_{2})r_{2}}{4}-\iota(\tau+\hat{q}_{2})\Big{)},\quad%
\text{with}\quad\iota(\eta):=\frac{1}{8}+\frac{1}{2}\Sigma(\mu_{{\mathbf{Y}}(%
\eta)}). \tag{68}
$$
Here $\iota(\eta)=\lim I({\mathbf{Y}}(\eta);{\mathbf{S}}^{0}_{2})/d^{2}$ is the limiting mutual information between data ${\mathbf{Y}}(\eta)$ and signal ${\mathbf{S}}^{0}_{2}$ for the channel ${\mathbf{Y}}(\eta)=\sqrt{\eta}{\mathbf{S}}^{0}_{2}+{\bm{\xi}}$ , the measure $\mu_{{\mathbf{Y}}(\eta)}$ is the asymptotic spectral law of the rescaled observation matrix ${\mathbf{Y}}(\eta)/\sqrt{d}$ , and $\Sigma(\mu):=\int\ln|x-y|\,d\mu(x)d\mu(y)$ . Using free probability, the law $\mu_{{\mathbf{Y}}(\eta)}$ can be obtained as the free convolution of a generalised Marchenko-Pastur distribution (the asymptotic spectral law of ${\mathbf{S}}^{0}_{2}$ , which is a generalised Wishart random matrix) and the semicircular distribution (the asymptotic spectral law of ${\bm{\xi}}$ ), see Potters & Bouchaud (2020). We provide the code to obtain this distribution numerically in the attached repository. The function ${\rm mmse}_{S}(\eta)$ is obtained through a derivative of $\iota$ , using the so-called I-MMSE relation Guo et al. (2005); Pourkamali et al. (2024):
$$
\displaystyle 4\frac{d}{d\eta}\iota(\eta)={\rm mmse}_{S}(\eta)=\frac{1}{\eta}%
\Big{(}1-\frac{4\pi^{2}}{3}\int\mu^{3}_{{\mathbf{Y}}(\eta)}(y)dy\Big{)}. \tag{69}
$$
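A degenerate consistency check of (69), assuming a zero signal: then ${\mathbf{Y}}(\eta)={\bm{\xi}}$ for every $\eta$, $\mu_{{\mathbf{Y}}}$ is the semicircle law with $\int\mu^{3}=3/(4\pi^{2})$, and (69) correctly returns a vanishing mmse. A short numerical confirmation with an empirical GOE spectrum (a sketch, with arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2000
A = rng.normal(size=(d, d))
xi = (A + A.T) / np.sqrt(2 * d)        # standard GOE, spectrum -> semicircle on [-2, 2]
evals = np.linalg.eigvalsh(xi)

hist, edges = np.histogram(evals, bins=60, density=True)
integral = np.sum(hist**3) * (edges[1] - edges[0])   # ∫ mu^3 dy
# (4 pi^2 / 3) * integral ≈ 1, so the bracket in (69) vanishes for pure noise
```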
The normalisation $\frac{1}{ns}\ln\tilde{V}_{W}^{kd}(\bm{\mathcal{Q}}_{W})$ in the limit $n→∞,s→ 0^{+}$ can be simply computed as $J(\tau,0)$ .
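The Hubbard–Stratonovich identity invoked above (67) is a Gaussian moment-generating-function computation; with the GOE convention that entry $(i,j)$ of ${\mathbf{Z}}$ has variance $(1+\delta_{ij})/d$ (our reading of “standard GOE” here), it can be checked entry by entry:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
M = rng.normal(size=(d, d)); M = (M + M.T) / 2   # symmetric test matrix

# log E_Z exp((d/2) Tr M Z) for symmetric Z with Z_ij ~ N(0, (1+delta_ij)/d):
# Tr M Z = sum_i M_ii Z_ii + 2 sum_{i<j} M_ij Z_ij, and each independent entry
# contributes via the Gaussian MGF E exp(a X) = exp(a^2 Var(X)/2).
log_lhs = 0.0
for i in range(d):
    log_lhs += 0.5 * (d / 2 * M[i, i])**2 * (2 / d)
    for j in range(i + 1, d):
        log_lhs += 0.5 * (d / 2 * 2 * M[i, j])**2 * (1 / d)

log_rhs = d / 4 * np.trace(M @ M)
# log_lhs == log_rhs up to rounding
```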
For the other normalisation, following the same steps as in the previous section, we can simplify $V^{kd}_{W}(\bm{\mathcal{Q}}_{W})$ as follows:
$$
\displaystyle\frac{1}{ns}\ln V_{W}^{kd}(\bm{\mathcal{Q}}_{W})\approx\frac{%
\gamma}{\alpha s}\sum_{\mathsf{v}\in\mathsf{V}}\frac{1}{k}\sum_{i\in\mathcal{I%
}_{\mathsf{v}}}{\rm extr}\Big{[}-\sum_{a\leq b,0}^{s}\hat{\mathcal{Q}}^{ab}_{W%
,i}(\mathsf{v})\mathcal{Q}^{ab}_{W}(\mathsf{v})+\ln\int\prod_{a=0}^{s}dP_{W}(w%
_{a})e^{\sum_{a\leq b,0}^{s}\hat{\mathcal{Q}}^{ab}_{W,i}(\mathsf{v})w_{a}w_{b}%
}\Big{]}, \tag{70}
$$
as $n$ grows, where extremisation is w.r.t. the hatted variables only. As in the previous section, $\hat{\mathcal{Q}}^{ab}_{W,i}(\mathsf{v})$ is homogeneous over $i∈\mathcal{I}_{\mathsf{v}}$ for a given $\mathsf{v}$ . Furthermore, thanks to the Nishimori identities we have that at the saddle point $\hat{\mathcal{Q}}_{W}^{aa}(\mathsf{v})=0$ and ${\mathcal{Q}}_{W}^{aa}(\mathsf{v})=1$ . This, together with standard steps and the RS ansatz, allows us to write the $d→∞,s→ 0^{+}$ limit of the above as
$$
\displaystyle\lim_{s\to 0^{+}}\lim\frac{1}{ns}\ln V_{W}^{kd}(\bm{\mathcal{Q}}_%
{W})=\frac{\gamma}{\alpha}\mathbb{E}_{v\sim P_{v}}{\rm extr}\Big{[}-\frac{\hat%
{\mathcal{Q}}_{W}(v)\mathcal{Q}_{W}(v)}{2}+\psi_{P_{W}}(\hat{\mathcal{Q}}_{W}(%
v))\Big{]} \tag{71}
$$
with $\psi_{P_{W}}(\,·\,)$ as in the main. Gathering all these results yields directly
$$
\displaystyle\lim_{s\to 0^{+}}\lim\frac{F_{S}}{ns}={\rm extr}\Big{\{} \displaystyle\frac{\hat{q}_{2}(r_{2}-q_{2})}{4\alpha}-\frac{1}{\alpha}\big{[}%
\iota(\tau+\hat{q}_{2})-\iota(\tau)\big{]}+\frac{\gamma}{\alpha}\mathbb{E}_{v%
\sim P_{v}}\Big{[}\psi_{P_{W}}(\hat{\mathcal{Q}}_{W}(v))-\frac{\hat{\mathcal{Q%
}}_{W}(v)\mathcal{Q}_{W}(v)}{2}\Big{]}\Big{\}}. \tag{72}
$$
Extremisation is w.r.t. $\hat{q}_{2},\hat{\mathcal{Q}}_{W}$ , while $\tau$ is to be understood as a function of $\mathcal{Q}_{W}=\{{\mathcal{Q}}_{W}(\mathsf{v})\mid\mathsf{v}∈\mathsf{V}\}$ through the moment matching condition:
$$
\displaystyle 4\alpha\,\partial_{\tau}J(\tau,0)=r_{2}-4\iota^{\prime}(\tau)=%
\mathbb{E}_{v\sim P_{v}}v^{2}\mathcal{Q}_{W}(v)^{2}+\gamma\bar{v}^{2}, \tag{73}
$$
which is the $s→ 0^{+}$ limit of the moment matching condition between $P(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W})$ and $\tilde{P}(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W})$ . Simplifying using the value of $r_{2}=1+\gamma\bar{v}^{2}$ according to the Nishimori identities, and using the I-MMSE relation between $\iota(\tau)$ and ${\rm mmse}_{S}(\tau)$ , we get
$$
\displaystyle{\rm mmse}_{S}(\tau)=1-\mathbb{E}_{v\sim P_{v}}v^{2}\mathcal{Q}_{%
W}(v)^{2}\quad\iff\quad\tau={\rm mmse}_{S}^{-1}\big{(}1-\mathbb{E}_{v\sim P_{v%
}}v^{2}\mathcal{Q}_{W}(v)^{2}\big{)}. \tag{74}
$$
Since ${\rm mmse}_{S}$ is a monotonically decreasing function of its argument (and thus invertible), the above always has a solution, and it is unique for a given collection $\mathcal{Q}_{W}$ .
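Numerically, the inversion in (74) is a one-dimensional root search on a monotonically decreasing function, for which plain bisection suffices. Below, the scalar Gaussian-channel mmse $1/(1+\eta)$ is a stand-in for ${\rm mmse}_{S}$, purely to illustrate the inversion step:

```python
def invert_decreasing(f, target, lo=1e-8, hi=1e8):
    """Find eta with f(eta) = target, for f monotonically decreasing on [lo, hi]."""
    for _ in range(200):             # 200 bisection steps are far more than enough
        mid = 0.5 * (lo + hi)
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

mmse_standin = lambda eta: 1.0 / (1.0 + eta)   # stand-in, not the matrix mmse_S
tau = invert_decreasing(mmse_standin, 0.2)     # exact answer: eta = 4
```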
D.4 RS free entropy and saddle point equations
Putting the energetic and entropic contributions together we obtain the variational replica symmetric free entropy potential:
$$
\displaystyle f^{\alpha,\gamma}_{\rm RS} \displaystyle:=\psi_{P_{\text{out}}}(q_{K}(q_{2},\mathcal{Q}_{W});r_{K})+\frac%
{1}{4\alpha}(1+\gamma\bar{v}^{2}-q_{2})\hat{q}_{2}+\frac{\gamma}{\alpha}%
\mathbb{E}_{v\sim P_{v}}\big{[}\psi_{P_{W}}(\hat{\mathcal{Q}}_{W}(v))-\frac{1}%
{2}\mathcal{Q}_{W}(v)\hat{\mathcal{Q}}_{W}(v)\big{]} \displaystyle\qquad+\frac{1}{\alpha}\big{[}\iota(\tau(\mathcal{Q}_{W}))-\iota(%
\hat{q}_{2}+\tau(\mathcal{Q}_{W}))\big{]}, \tag{75}
$$
which is then extremised w.r.t. $\{\hat{\mathcal{Q}}_{W}(\mathsf{v}),\mathcal{Q}_{W}(\mathsf{v})\mid\mathsf{v}%
∈\mathsf{V}\},\hat{q}_{2},q_{2}$ while $\tau$ is a function of ${\mathcal{Q}}_{W}$ through the moment matching condition (74). The saddle point equations are then
$$
\left[\begin{array}[]{@{}l@{\quad}l@{}}&{\mathcal{Q}}_{W}(\mathsf{v})=\mathbb{%
E}_{w^{0},\xi}[w^{0}{\langle w\rangle}_{\hat{\mathcal{Q}}_{W}(\mathsf{v})}],\\
&\hat{\mathcal{Q}}_{W}(\mathsf{v})=\frac{1}{2\gamma}(q_{2}-\gamma\bar{v}^{2}-%
\mathbb{E}_{v\sim P_{v}}v^{2}\mathcal{Q}_{W}(v)^{2})\partial_{{\mathcal{Q}}_{W%
}(\mathsf{v})}\tau(\mathcal{Q}_{W})+2\frac{\alpha}{\gamma}\partial_{{\mathcal{%
Q}}_{W}(\mathsf{v})}\psi_{P_{\text{out}}}(q_{K}(q_{2},\mathcal{Q}_{W});r_{K}),%
\\
&q_{2}=r_{2}-\frac{1}{\hat{q}_{2}+\tau(\mathcal{Q}_{W})}(1-\frac{4\pi^{2}}{3}%
\int\mu^{3}_{{\mathbf{Y}}(\hat{q}_{2}+\tau(\mathcal{Q}_{W}))}(y)dy),\\
&\hat{q}_{2}=4\alpha\,\partial_{q_{2}}\psi_{P_{\text{out}}}(q_{K}(q_{2},%
\mathcal{Q}_{W});r_{K}),\end{array}\right. \tag{76}
$$
where, letting i.i.d. $w^{0},\xi\sim\mathcal{N}(0,1)$ , we define the measure
$$
\displaystyle\langle\,\cdot\,\rangle_{x}=\langle\,\cdot\,\rangle_{x}(w^{0},\xi%
):=\frac{\int dP_{W}(w)(\,\cdot\,)e^{(\sqrt{x}\xi+xw^{0})w-\frac{1}{2}xw^{2}}}%
{\int dP_{W}(w)e^{(\sqrt{x}\xi+xw^{0})w-\frac{1}{2}xw^{2}}}. \tag{77}
$$
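As a concrete check of (77) and of the first saddle point equation in (76): for a standard Gaussian prior $P_{W}=\mathcal{N}(0,1)$ the measure (77) is Gaussian with mean $\langle w\rangle_{x}=(\sqrt{x}\xi+xw^{0})/(1+x)$, so $\mathbb{E}_{w^{0},\xi}[w^{0}\langle w\rangle_{x}]=x/(1+x)$ in closed form. A brute-force numerical evaluation reproduces this (a sketch; grid and quadrature sizes are arbitrary):

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

x = 0.7
wgrid = np.linspace(-20, 20, 8001)
log_prior = -0.5 * wgrid**2             # standard Gaussian prior (unnormalised)

def bracket_w(w0, xi):
    """<w>_x of (77), by direct integration over the prior."""
    logp = log_prior + (np.sqrt(x) * xi + x * w0) * wgrid - 0.5 * x * wgrid**2
    p = np.exp(logp - logp.max())
    return np.sum(p * wgrid) / np.sum(p)

nodes, wts = hermegauss(40)
wts = wts / np.sqrt(2 * np.pi)          # quadrature for E over w0, xi ~ N(0,1)
QW = sum(wa * wb * a * bracket_w(a, b)
         for a, wa in zip(nodes, wts) for b, wb in zip(nodes, wts))
# QW ≈ x / (1 + x), the closed form for the Gaussian prior
```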
All the above formulae are easily specialised for the linear readout with Gaussian label noise using (56). We report here the saddle point equations in this case (recalling that $g$ is defined in (43)):
$$
\left[\begin{array}[]{@{}l@{\quad}l@{}}&{\mathcal{Q}}_{W}(\mathsf{v})=\mathbb{E}_{w^{0},\xi}[w^{0}{\langle w\rangle}_{\hat{\mathcal{Q}}_{W}(\mathsf{v})}],\\
&\hat{\mathcal{Q}}_{W}(\mathsf{v})=\frac{1}{2\gamma}(q_{2}-\gamma\bar{v}^{2}-\mathbb{E}_{v\sim P_{v}}v^{2}\mathcal{Q}_{W}(v)^{2})\partial_{{\mathcal{Q}}_{W}(\mathsf{v})}\tau(\mathcal{Q}_{W})+\frac{\alpha}{\gamma}\frac{\mathsf{v}^{2}\,g^{\prime}(\mathcal{Q}_{W}(\mathsf{v}))}{\Delta+\frac{1}{2}\mu_{2}^{2}(r_{2}-q_{2})+g(1)-\mathbb{E}_{v\sim P_{v}}{v}^{2}g(\mathcal{Q}_{W}(v))},\\
&q_{2}=r_{2}-\frac{1}{\hat{q}_{2}+\tau(\mathcal{Q}_{W})}(1-\frac{4\pi^{2}}{3}\int\mu^{3}_{{\mathbf{Y}}(\hat{q}_{2}+\tau(\mathcal{Q}_{W}))}(y)dy),\\
&\hat{q}_{2}=\frac{\alpha\mu_{2}^{2}}{\Delta+\frac{1}{2}\mu_{2}^{2}(r_{2}-q_{2})+g(1)-\mathbb{E}_{v\sim P_{v}}v^{2}g(\mathcal{Q}_{W}(v))}.\end{array}\right. \tag{78}
$$
If one assumes that the overlaps appearing in (38) are self-averaging around the values that solve the saddle point equations (and maximise the RS potential), that is $Q^{00}_{1},Q_{1}^{01}→ 1$ (as assumed in this scaling), $Q_{2}^{00}→ r_{2},Q_{2}^{01}→ q_{2}^{*}$ , and ${\mathcal{Q}}_{W}^{00}(\mathsf{v})→ 1,{\mathcal{Q}}_{W}^{01}(\mathsf{v})→{\mathcal{Q}}_{W}^{*}(\mathsf{v})$ , then the limiting Bayes-optimal mean-square generalisation error for the linear readout with Gaussian noise reads
$$
\displaystyle\varepsilon^{\rm opt}-\Delta=r_{K}-q_{K}^{*}=\frac{\mu_{2}^{2}}{2}(r_{2}-q_{2}^{*})+g(1)-\mathbb{E}_{v\sim P_{v}}v^{2}g(\mathcal{Q}^{*}_{W}(v)). \tag{79}
$$
This is the formula used to evaluate the theoretical Bayes-optimal mean-square generalisation error throughout the paper.
D.5 Non-centred activations
Consider a non-centred activation function, i.e., $\mu_{0}≠ 0$ in (17). This reflects on the law of the post-activations, which will still be Gaussian, centred at
$$
\displaystyle\mathbb{E}_{\mathbf{x}}\lambda^{a}=\frac{\mu_{0}}{\sqrt{k}}\sum_{%
i=1}^{k}v_{i}=:\mu_{0}\Lambda, \tag{80}
$$
and with the covariance given by (8) (we are assuming $Q_{W}^{aa}=1$ , and that the readout weights are quenched; if instead $Q_{W}^{aa}=r$ , the formula can be generalised as explained in App. A). In the above, we have introduced the new mean parameter $\Lambda$ . Notice that, if the ${\mathbf{v}}$ ’s have a $\bar{v}=O(1)$ mean, then $\Lambda$ scales as $\sqrt{k}$ due to our choice of normalisation.
One can carry out the replica computation for a fixed $\Lambda$ . This new parameter, being quenched, does not affect the entropic term. It will only appear in the energetic term as a shift to the means, yielding
$$
F_{E}=F_{E}({\mathbf{K}},\Lambda)=\ln\int dy\int d{\bm{\lambda}}\frac{e^{-%
\frac{1}{2}{\bm{\lambda}}^{\intercal}{\mathbf{K}}^{-1}{\bm{\lambda}}}}{\sqrt{(%
2\pi)^{s+1}\det{\mathbf{K}}}}\prod_{a=0}^{s}P_{\rm{out}}(y\mid\lambda^{a}+\mu_%
{0}\Lambda). \tag{81}
$$
Within the replica symmetric ansatz, the above turns into
$$
\displaystyle e^{F_{E}}=\int dy\,\mathbb{E}_{\xi,u^{0}}P_{\rm out}\Big{(}y\mid\mu_{0}\Lambda+\xi\sqrt{\frac{m_{K}^{2}}{q_{K}}}+u^{0}\sqrt{\rho_{K}-\frac{m_{K}^{2}}{q_{K}}}\Big{)}\prod_{a=1}^{s}\mathbb{E}_{u^{a}}P_{\rm out}(y\mid\mu_{0}\Lambda+\xi\sqrt{q_{K}}+u^{a}\sqrt{r_{K}-q_{K}}).
$$
Therefore, the simplification of the potential $F_{E}$ proceeds as in the centred activation case, yielding at leading order in the number $s$ of replicas
$$
\displaystyle\frac{F_{E}(r_{K},q_{K},\Lambda)}{s}=\int dy\,\mathbb{E}_{\xi,u^{0}}P_{\rm out}\Big{(}y\mid\mu_{0}\Lambda+\xi\sqrt{q_{K}}+u^{0}\sqrt{r_{K}-q_{K}}\Big{)}\ln\mathbb{E}_{u}P_{\rm out}(y\mid\mu_{0}\Lambda+\xi\sqrt{q_{K}}+u\sqrt{r_{K}-q_{K}})+O(s)
$$
in the Bayes-optimal setting. When $P_{\rm out}(y\mid\lambda)=f(y-\lambda)$ , one can verify that the contributions due to the means, containing $\mu_{0}$ , cancel each other. This is the case in our running example where $P_{\rm out}$ is the Gaussian channel:
$$
\frac{F_{E}(r_{K},q_{K},\Lambda)}{s}=-\frac{1}{2}\ln\big{[}2\pi(\Delta+r_{K}-q%
_{K})\big{]}-\frac{1}{2}-\frac{\mu_{0}^{2}}{2}\frac{(\Lambda-\Lambda)^{2}}{%
\Delta+r_{K}-q_{K}}+O(s)=-\frac{1}{2}\ln\big{[}2\pi(\Delta+r_{K}-q_{K})\big{]}%
-\frac{1}{2}+O(s). \tag{82}
$$
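The $\Lambda$ -independence in (82) is easy to check numerically for the Gaussian channel. In the following minimal Monte Carlo sketch the values of $q_{K}$ , $r_{K}$ , $\Delta$ , $\mu_{0}$ are illustrative; the inner expectation $\mathbb{E}_{u}P_{\rm out}$ is carried out analytically, being a Gaussian convolution of variance $\Delta+r_{K}-q_{K}$ :

```python
import numpy as np

rng = np.random.default_rng(0)

def energetic_term(Lambda, mu0=0.7, q=0.6, r=1.0, Delta=0.1, n=200_000):
    """Monte Carlo estimate of the Bayes-optimal energetic term
    int dy E_{xi,u0} P_out(...) ln E_u P_out(...) for the Gaussian channel;
    E_u P_out(y | m + u sqrt(r - q)) is Gaussian with mean m, variance Delta + r - q."""
    sigma2 = Delta + r - q
    xi = rng.standard_normal(n)
    m = mu0 * Lambda + xi * np.sqrt(q)
    y = m + np.sqrt(sigma2) * rng.standard_normal(n)  # u0 and channel noise combined
    log_inner = -0.5 * np.log(2 * np.pi * sigma2) - (y - m) ** 2 / (2 * sigma2)
    return log_inner.mean()

f0, f5 = energetic_term(0.0), energetic_term(5.0)      # two values of Lambda
exact = -0.5 * np.log(2 * np.pi * (0.1 + 1.0 - 0.6)) - 0.5
```

Both estimates agree with the closed form $-\frac{1}{2}\ln[2\pi(\Delta+r_{K}-q_{K})]-\frac{1}{2}$ , regardless of the value of $\Lambda$ .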
Appendix E Alternative simplifications of $P(({\mathbf{S}}^{a}_{2})\mid\bm{{\mathcal{Q}}}_{W})$ through moment matching
A crucial step that allowed us to obtain a closed-form expression for the model’s free entropy is the relaxation $\tilde{P}(({\mathbf{S}}^{a}_{2})\mid\bm{{\mathcal{Q}}}_{W})$ (15) of the true measure $P(({\mathbf{S}}^{a}_{2})\mid\bm{{\mathcal{Q}}}_{W})$ (14) entering the replicated partition function, as explained in Sec. 4. The specific form we chose (tilted Wishart distribution with a matching second moment) has the advantage of capturing crucial features of the true measure, such as the fact that the matrices ${\mathbf{S}}^{a}_{2}$ are generalised Wishart matrices with coupled replicas, while keeping the problem solvable with techniques derived from random matrix theory of rotationally invariant ensembles. In this appendix, we report some alternative routes one can take to simplify, or potentially improve, the theory.
E.1 A factorised simplified distribution
In the specialisation phase, one can assume that the only crucial feature to keep track of when relaxing $P(({\mathbf{S}}^{a}_{2})\mid\bm{{\mathcal{Q}}}_{W})$ (14) is the coupling between different replicas, which becomes more and more relevant as $\alpha$ increases. In this case, inspired by Sakata & Kabashima (2013); Kabashima et al. (2016), in order to relax (14) we can propose the Gaussian ansatz
$$
\displaystyle d\bar{P}(({\mathbf{S}}_{2}^{a})\mid\bm{{\mathcal{Q}}}_{W})=\prod%
_{a=0}^{s}d{\mathbf{S}}^{a}_{2}\prod_{\alpha=1}^{d}\delta(S^{a}_{2;\alpha%
\alpha}-\sqrt{k}\bar{v})\times\prod_{\alpha_{1}<\alpha_{2}}^{d}\frac{e^{-\frac%
{1}{2}\sum_{a,b=0}^{s}S^{a}_{2;\alpha_{1}\alpha_{2}}\bar{\tau}^{ab}(\bm{{%
\mathcal{Q}}}_{W})S^{b}_{2;\alpha_{1}\alpha_{2}}}}{\sqrt{(2\pi)^{s+1}\det(\bar%
{\bm{\tau}}(\bm{{\mathcal{Q}}}_{W})^{-1})}}, \tag{83}
$$
where $\bar{v}$ is the mean of the readout prior $P_{v}$ , and $\bar{\bm{\tau}}(\bm{{\mathcal{Q}}}_{W}):=(\bar{\tau}^{ab}(\bm{{\mathcal{Q}}}_{%
W}))_{a,b}$ is fixed by
$$
[\bar{\bm{\tau}}(\bm{{\mathcal{Q}}}_{W})^{-1}]_{ab}=\mathbb{E}_{v\sim P_{v}}v^%
{2}{\mathcal{Q}}_{W}^{ab}(v)^{2}.
$$
In words: the diagonal elements of ${\mathbf{S}}_{2}^{a}$ are $d$ random variables whose $O(1)$ fluctuations cannot affect the free entropy in the asymptotic regime we are considering, as they are too few compared to $n=\Theta(d^{2})$ ; hence, we assume they concentrate onto their mean. The $d(d-1)/2$ off-diagonal elements of the matrices $({\mathbf{S}}_{2}^{a})_{a}$ , instead, are zero-mean variables whose distribution at given $\bm{{\mathcal{Q}}}_{W}$ is assumed to factorise over the input indices. The definition of $\bar{\bm{\tau}}(\bm{{\mathcal{Q}}}_{W})$ ensures matching with the true second moment (63).
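Under the replica symmetric ansatz ( ${\mathcal{Q}}_{W}^{aa}(\mathsf{v})=1$ , ${\mathcal{Q}}_{W}^{ab}(\mathsf{v})={\mathcal{Q}}_{W}(\mathsf{v})$ for $a≠ b$ ), the moment-matching condition fixing $\bar{\bm{\tau}}(\bm{{\mathcal{Q}}}_{W})$ is a simple average over $P_{v}$ . A minimal sketch for a discrete readout prior, with an illustrative flat overlap profile:

```python
import numpy as np

def tau_bar_inv(QW, v_vals, v_probs, s=3):
    """(s+1)x(s+1) matrix with entries [tau_bar^{-1}]_{ab} = E_v v^2 Q_W^{ab}(v)^2,
    under the RS ansatz Q_W^{aa}(v) = 1 and Q_W^{ab}(v) = Q_W(v) for a != b."""
    diag = sum(p * v ** 2 for v, p in zip(v_vals, v_probs))
    off = sum(p * v ** 2 * QW(v) ** 2 for v, p in zip(v_vals, v_probs))
    M = np.full((s + 1, s + 1), off)
    np.fill_diagonal(M, diag)
    return M

# Four-point prior of Fig. 5 (normalised so that E v^2 = 1), flat overlap profile.
v_vals = [-3 / 5 ** 0.5, -1 / 5 ** 0.5, 1 / 5 ** 0.5, 3 / 5 ** 0.5]
v_probs = [0.25] * 4
Minv = tau_bar_inv(lambda v: 0.5, v_vals, v_probs)
tau_bar = np.linalg.inv(Minv)   # coupling matrix entering the Gaussian ansatz (83)
```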
(83) is considerably simpler than (15): following this ansatz, the entropic contribution to the free entropy gives
$$
\displaystyle e^{\bar{F}_{S}}:=\int\prod_{a\leq b,0}^{s}d\hat{Q}_{2}^{ab}\,e^{%
kd\ln V_{W}(\bm{\mathcal{Q}}_{W})+\frac{d^{2}}{4}{\rm Tr}\hat{\mathbf{Q}}^{%
\intercal}_{2}{\mathbf{Q}}_{2}}\Big{[}\int\prod_{a=0}^{s}dS^{a}_{2}\,\frac{e^{%
-\frac{1}{2}\sum_{a,b=0}^{s}S^{a}_{2}[\bar{\tau}^{ab}(\bm{{\mathcal{Q}}}_{W})+%
\hat{Q}_{2}^{ab}]S^{b}_{2}}}{\sqrt{(2\pi)^{s+1}\det(\bar{\bm{\tau}}(\bm{{%
\mathcal{Q}}_{W}})^{-1})}}\Big{]}^{d(d-1)/2} \displaystyle\qquad\qquad\qquad\times\int\prod_{a=0}^{s}\prod_{\alpha=1}^{d}dS%
^{a}_{2;\alpha\alpha}\delta(S^{a}_{2;\alpha\alpha}-\sqrt{k}\bar{v})\,e^{-\frac%
{1}{4}\sum_{a,b=0}^{s}\hat{Q}_{2}^{ab}\sum_{\alpha=1}^{d}S_{2;\alpha\alpha}^{a%
}S_{2;\alpha\alpha}^{b}}, \tag{84}
$$
instead of (66). Integration over the diagonal elements $(S_{2;\alpha\alpha}^{a})_{\alpha}$ can be done straightforwardly, yielding
$$
\displaystyle e^{\bar{F}_{S}} \displaystyle=\int\prod_{a\leq b,0}^{s}d\hat{Q}_{2}^{ab}\,e^{kd\ln V_{W}(\bm{%
\mathcal{Q}}_{W})+\frac{d^{2}}{4}{\rm Tr}\hat{\mathbf{Q}}_{2}^{\intercal}({%
\mathbf{Q}}_{2}-\gamma\mathbf{1}\mathbf{1}^{\intercal}\bar{v}^{2})}\Big{[}\int%
\prod_{a=0}^{s}dS^{a}_{2}\,\frac{e^{-\frac{1}{2}\sum_{a,b=0}^{s}S^{a}_{2}[\bar%
{\tau}^{ab}(\bm{{\mathcal{Q}}}_{W})+\hat{Q}_{2}^{ab}]S^{b}_{2}}}{\sqrt{(2\pi)^%
{s+1}\det(\bar{\bm{\tau}}(\bm{{\mathcal{Q}}}_{W})^{-1})}}\Big{]}^{d(d-1)/2}. \tag{85}
$$
The remaining Gaussian integral over the off-diagonal elements of ${\mathbf{S}}_{2}$ can be performed exactly, leading to
$$
\displaystyle e^{\bar{F}_{S}} \displaystyle=\int\prod_{a\leq b,0}^{s}d\hat{Q}_{2}^{ab}\,e^{kd\ln V_{W}(\bm{%
\mathcal{Q}}_{W})+\frac{d^{2}}{4}{\rm Tr}\hat{\mathbf{Q}}_{2}^{\intercal}({%
\mathbf{Q}}_{2}-\gamma\mathbf{1}\mathbf{1}^{\intercal}\bar{v}^{2})-\frac{d(d-1%
)}{4}\ln\det[{\mathbb{I}}_{s+1}+\hat{\mathbf{Q}}_{2}\bar{\bm{\tau}}(\bm{{%
\mathcal{Q}}}_{W})^{-1}]}. \tag{86}
$$
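The Gaussian integration over each off-diagonal entry in (85) produces precisely the determinant in (86): each factor equals $\det[{\mathbb{I}}_{s+1}+\hat{\mathbf{Q}}_{2}\bar{\bm{\tau}}(\bm{{\mathcal{Q}}}_{W})^{-1}]^{-1/2}$ . A quick numerical sanity check of this normalisation identity for $s+1=2$ , with illustrative matrices:

```python
import numpy as np

# Check: int dS exp(-1/2 S^T (tau + Qhat) S) / sqrt((2 pi)^2 det(tau^{-1}))
#        = det(I + Qhat tau^{-1})^{-1/2}   for s + 1 = 2.
tau = np.array([[1.0, 0.3], [0.3, 1.0]])    # plays tau_bar (positive definite)
Qhat = np.array([[0.2, 0.1], [0.1, 0.2]])   # plays Qhat_2 (symmetric)

h = 0.02
x = np.arange(-8.0, 8.0, h)
X, Y = np.meshgrid(x, x)
S = np.stack([X, Y], axis=-1)
quad = np.einsum('...i,ij,...j->...', S, tau + Qhat, S)
integral = np.exp(-0.5 * quad).sum() * h ** 2          # 2d grid quadrature
lhs = integral / np.sqrt((2 * np.pi) ** 2 * np.linalg.det(np.linalg.inv(tau)))
rhs = np.linalg.det(np.eye(2) + Qhat @ np.linalg.inv(tau)) ** -0.5
```

Raising this factor to the power $d(d-1)/2$ gives the exponent $-\frac{d(d-1)}{4}\ln\det[\,\cdot\,]$ appearing in (86).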
In order to proceed and perform the $s→ 0^{+}$ limit, we use the RS ansatz for the overlap matrices, combined with the Nishimori identities, as explained above. The only difference w.r.t. the approach detailed in Appendix D is the determinant in the exponent of the integrand of (86), which reads
$$
\displaystyle\ln\det[{\mathbb{I}}_{s+1}+\hat{\mathbf{Q}}_{2}\bar{\bm{\tau}}(%
\bm{{\mathcal{Q}}}_{W})^{-1}]=s\ln[1+\hat{q}_{2}(1-\mathbb{E}_{v\sim P_{v}}v^{%
2}\mathcal{Q}_{W}(v)^{2})]-s\hat{q}_{2}+O(s^{2}). \tag{87}
$$
After taking the replica and high-dimensional limits, the resulting free entropy is
$$
\displaystyle f_{\rm sp}^{\alpha,\gamma}={} \displaystyle\psi_{P_{\text{out}}}(q_{K}(q_{2},{\mathcal{Q}}_{W});r_{K})+\frac%
{(1+\gamma\bar{v}^{2}-q_{2})\hat{q}_{2}}{4\alpha}+\frac{\gamma}{\alpha}\mathbb%
{E}_{v\sim P_{v}}\big{[}\psi_{P_{W}}(\hat{\mathcal{Q}}_{W}(v))-\frac{1}{2}%
\mathcal{Q}_{W}(v)\hat{{\mathcal{Q}}}_{W}(v)\big{]} \displaystyle\qquad-\frac{1}{4\alpha}\ln\big{[}1+\hat{q}_{2}(1-\mathbb{E}_{v%
\sim P_{v}}v^{2}\mathcal{Q}_{W}(v)^{2})\big{]}, \tag{88}
$$
to be extremised w.r.t. $q_{2},\hat{q}_{2},\{{\mathcal{Q}}_{W}(\mathsf{v}),\hat{{\mathcal{Q}}}_{W}(\mathsf{v})\}$ . The main advantage of this expression over (75) is its simplicity: the moment-matching condition fixing $\bar{\bm{\tau}}(\bm{{\mathcal{Q}}}_{W})$ is straightforward (and has been solved explicitly in the final formula), and the result does not depend on the non-trivial (and difficult to numerically evaluate) function $\iota(\eta)$ , i.e., the mutual information of the associated matrix denoising problem, which is effectively replaced here by the much simpler denoising of independent Gaussian variables under Gaussian noise. Moreover, one can show, in the same fashion as done in Appendix G, that the generalisation error predicted from this expression has the same large- $\alpha$ behaviour as the one obtained from (75). However, and not surprisingly, being derived from an ansatz that ignores the Wishart-like nature of the matrices ${\mathbf{S}}_{2}^{a}$ , this expression does not reproduce the expected behaviour of the model in the universal phase, i.e., for $\alpha<\alpha_{\rm sp}(\gamma)$ .
Figure 5: Different theoretical curves and numerical results for ReLU activation, $P_{v}=\frac{1}{4}(\delta_{-3/\sqrt{5}}+\delta_{-1/\sqrt{5}}+\delta_{1/\sqrt{5}}+\delta_{3/\sqrt{5}})$ , $d=150$ , $\gamma=0.5$ , with linear readout with Gaussian noise of variance $\Delta=0.1$ . Top left: Optimal mean-square generalisation error predicted by the theory reported in the main text (solid blue) versus the branch obtained from the simplified ansatz (83) (solid red); the green solid line shows the universal branch corresponding to $\mathcal{Q}_{W}\equiv 0$ , and empty circles are HMC results with informative initialisation. Top right: Theoretical free entropy curves (colors and linestyles as top left). Bottom: Predictions for the overlaps $\mathcal{Q}_{W}(\mathsf{v})$ and $q_{2}$ from the theory devised in the main text (left) and in Appendix E.1 (right).
To fix this issue, one can compare the predictions of the theory derived from this ansatz with the ones obtained by plugging ${\mathcal{Q}}_{W}(\mathsf{v})=0\ ∀\ \mathsf{v}$ (denoted ${\mathcal{Q}}_{W}\equiv 0$ ) in the theory devised in the main text (6),
$$
f_{\rm uni}^{\alpha,\gamma}:=\psi_{P_{\text{out}}}(q_{K}(q_{2},\mathcal{Q}_{W}%
\equiv 0);r_{K})+\frac{1}{4\alpha}(1+\gamma\bar{v}^{2}-q_{2})\hat{q}_{2}-\frac%
{1}{\alpha}\iota(\hat{q}_{2}), \tag{89}
$$
to be extremised now only w.r.t. the scalar parameters $q_{2}$ , $\hat{q}_{2}$ (one can easily verify that, for ${\mathcal{Q}}_{W}\equiv 0$ , $\tau({\mathcal{Q}}_{W})=0$ and the extremisation w.r.t. $\hat{{\mathcal{Q}}}_{W}$ in (6) gives $\hat{{\mathcal{Q}}}_{W}\equiv 0$ ). Notice that $f_{\rm uni}^{\alpha,\gamma}$ does not depend on the prior over the inner weights, which is why we call it “universal”. For consistency, the two free entropies $f_{\rm sp}^{\alpha,\gamma}$ , $f_{\rm uni}^{\alpha,\gamma}$ should be compared through a discrete variational principle, that is, the free entropy of the model is predicted to be
$$
\bar{f}^{\alpha,\gamma}_{\rm RS}:=\max\{{\rm extr}f_{\rm uni}^{\alpha,\gamma},%
{\rm extr}f_{\rm sp}^{\alpha,\gamma}\}, \tag{90}
$$
instead of the unified variational form (6). Quite generally, ${\rm extr}f_{\rm uni}^{\alpha,\gamma}>{\rm extr}f_{\rm sp}^{\alpha,\gamma}$ for low values of $\alpha$ , so that the behaviour of the model in the universal phase is correctly predicted. The curves cross at a critical value
$$
\bar{\alpha}_{\rm sp}(\gamma)=\sup\{\alpha\mid{\rm extr}f_{\rm uni}^{\alpha,%
\gamma}>{\rm extr}f_{\rm sp}^{\alpha,\gamma}\}, \tag{91}
$$
instead of the value $\alpha_{\rm sp}(\gamma)$ reported in the main. This approach has been profitably adopted in Barbier et al. (2025) in the context of matrix denoising (it is also the approach we used in an earlier version of this paper, superseded by the present one and accessible on arXiv), a problem sharing some of the challenges presented in this paper. In this respect, it provides a heuristic solution that quantitatively predicts the behaviour of the model in most of its phase diagram. Moreover, for any activation $\sigma$ with a second Hermite coefficient $\mu_{2}=0$ (e.g., all odd activations), the ansatz (83) yields the same theory as the one devised in the main text, as in this case $q_{K}(q_{2},{\mathcal{Q}}_{W})$ entering the energetic part of the free entropy does not depend on $q_{2}$ , so that the extremisation selects $q_{2}=\hat{q}_{2}=0$ and the remaining parts of (88) match the ones of (6). Finally, (83) is consistent with the observation that specialisation never arises in the case of quadratic activation and Gaussian prior over the inner weights: in this case, one can check that the universal branch ${\rm extr}f_{\rm uni}^{\alpha,\gamma}$ is always higher than ${\rm extr}f_{\rm sp}^{\alpha,\gamma}$ , and thus never selected by (90). For a convincing check of the validity of this approach, and a comparison with the theory devised in the main text and numerical results, see Fig. 5, top left panel.
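In practice, the discrete variational principle (90) and the critical value (91) are straightforward to implement once the two extremised branches have been tabulated on a grid of $\alpha$ . A minimal sketch, where the two toy arrays merely stand in for ${\rm extr}f_{\rm uni}^{\alpha,\gamma}$ and ${\rm extr}f_{\rm sp}^{\alpha,\gamma}$ :

```python
import numpy as np

def alpha_sp_bar(alphas, f_uni, f_sp):
    """Crossing point (91): the largest alpha where f_uni > f_sp, refined by
    linear interpolation of the sign change between the bracketing grid points."""
    diff = np.asarray(f_uni) - np.asarray(f_sp)
    above = np.nonzero(diff > 0)[0]
    if len(above) == 0 or above[-1] == len(diff) - 1:
        return None  # no crossing inside the grid
    i = above[-1]
    t = diff[i] / (diff[i] - diff[i + 1])
    return alphas[i] + t * (alphas[i + 1] - alphas[i])

# Toy branches: f_uni dominates at small alpha, f_sp takes over at large alpha.
alphas = np.linspace(0.1, 6.0, 60)
f_uni = -0.6 + 0.05 * alphas
f_sp = -0.7 + 0.08 * alphas
f_RS = np.maximum(f_uni, f_sp)          # predicted free entropy, as in (90)
alpha_c = alpha_sp_bar(alphas, f_uni, f_sp)
```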
However, despite its merits listed above, this Appendix’s approach presents some issues, both from the theoretical and practical points of view:
1. the final free entropy of the model is obtained by comparing curves derived from completely different ansätze for the distribution $P(({\mathbf{S}}^{a}_{2})\mid\bm{{\mathcal{Q}}}_{W})$ (Gaussian with coupled replicas, leading to $f_{\rm sp}$ , vs. pure generalised Wishart with independent replicas, leading to $f_{\rm uni}$ ), rather than within a unified theory as in the main text;
2. the predicted critical value $\bar{\alpha}_{\rm sp}(\gamma)$ seems to be systematically larger than the one observed in experiments (see Fig. 5, top right panel, and compare the crossing point of the “sp” and “uni” free entropies with the actual transition where the numerical points depart from the universal branch in the top left panel);
3. predictions for the functional overlap ${\mathcal{Q}}_{W}^{*}$ from this approach are in much worse agreement with experimental data than the ones from the theory presented in the main text (see Fig. 5, bottom panel, and compare with Fig. 3 in the main text);
4. in the cases we tested, the predictions for the generalisation error from the theory devised in the main text are in much better agreement with numerical simulations than those from this Appendix (see Fig. 6 for a comparison).
Therefore, the more elaborate theory presented in the main is not only more meaningful from the theoretical viewpoint, but also in overall better agreement with simulations.
Figure 6: Generalisation error for ReLU activation and Rademacher readout prior $P_{v}$ : theory reported in the main text (solid blue) versus the branch obtained from the simplified ansatz (83) (solid red); the green solid line shows the universal branch ( $\mathcal{Q}_{W}\equiv 0$ ), and empty circles are HMC results with informative initialisation.
E.2 Possible refined analyses with structured ${\mathbf{S}}_{2}$ matrices
In the main text, we kept track of the inhomogeneous profile of the readouts induced by the non-trivial distribution $P_{v}$ , which is ultimately responsible for the sequence of specialisation phase transitions occurring as $\alpha$ increases, thanks to a functional order parameter ${\mathcal{Q}}_{W}(\mathsf{v})$ measuring how much the student’s hidden weights corresponding to all the readout elements equal to $\mathsf{v}$ have aligned with the teacher’s. However, when writing $\tilde{P}(({\mathbf{S}}_{2}^{a})\mid\bm{{\mathcal{Q}}}_{W})$ we treated the tensor ${\mathbf{S}}_{2}^{a}$ as a whole, without considering the possibility that its “components”
$$
\displaystyle S_{2;\alpha_{1}\alpha_{2}}^{a}(\mathsf{v}):=\frac{\mathsf{v}}{%
\sqrt{|\mathcal{I}_{\mathsf{v}}|}}\sum_{i\in\mathcal{I}_{\mathsf{v}}}W^{a}_{i%
\alpha_{1}}W^{a}_{i\alpha_{2}} \tag{92}
$$
could follow different laws for different $\mathsf{v}∈\mathsf{V}$ . To account for this possibility, let us define
$$
\displaystyle Q_{2}^{ab}=\frac{1}{k}\sum_{\mathsf{v},\mathsf{v}^{\prime}}%
\mathsf{v}\,\mathsf{v}^{\prime}\sum_{i\in\mathcal{I}_{\mathsf{v}},j\in\mathcal%
{I}_{\mathsf{v^{\prime}}}}(\Omega_{ij}^{ab})^{2}=\sum_{\mathsf{v},\mathsf{v}^{%
\prime}}\frac{\sqrt{|\mathcal{I}_{\mathsf{v}}||\mathcal{I}_{\mathsf{v^{\prime}%
}}|}}{k}{\mathcal{Q}}_{2}^{ab}(\mathsf{v},\mathsf{v}^{\prime}),\quad\text{%
where}\quad{\mathcal{Q}}_{2}^{ab}(\mathsf{v},\mathsf{v}^{\prime}):=\frac{1}{d^%
{2}}{\rm Tr}\,{\mathbf{S}}_{2}^{a}(\mathsf{v}){\mathbf{S}}_{2}^{b}(\mathsf{v}^%
{\prime})^{\intercal}. \tag{93}
$$
The generalisation of (63) then reads
$$
\displaystyle\int dP(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W}) \displaystyle\frac{1}{d^{2}}{\rm Tr}\,{\mathbf{S}}_{2}^{a}(\mathsf{v}){\mathbf%
{S}}_{2}^{b}(\mathsf{v}^{\prime})^{\intercal}=\delta_{\mathsf{v}\mathsf{v}^{%
\prime}}\mathsf{v}^{2}\mathcal{Q}_{W}^{ab}(\mathsf{v})^{2}+\gamma\,\mathsf{v}%
\mathsf{v}^{\prime}\sqrt{P_{v}(\mathsf{v})P_{v}(\mathsf{v}^{\prime})} \tag{94}
$$
w.r.t. the true distribution $P(({\mathbf{S}}_{2}^{a})\mid\bm{\mathcal{Q}}_{W})$ reported in (14). Despite the already good match between the theory in the main and the numerics, taking into account this additional level of structure through a refined simplified measure could potentially lead to further improvements. A simplified measure matching these moment conditions while taking into account the Wishart form (92) of the matrices $({\mathbf{S}}_{2}^{a}(\mathsf{v}))$ is
$$
\displaystyle d\bar{P}(({\mathbf{S}}_{2}^{a})\mid\bm{{\mathcal{Q}}}_{W})%
\propto\prod_{\mathsf{v}\in\mathsf{V}}\prod_{a}dP_{S}^{\mathsf{v}}({\mathbf{S}%
}_{2}^{a}(\mathsf{v}))\times\prod_{\mathsf{v}\in\mathsf{V}}\prod_{a<b}e^{\frac%
{1}{2}\bar{\tau}^{ab}_{\mathsf{v}}(\bm{{\mathcal{Q}}}_{W}){\rm Tr}{\mathbf{S}}%
_{2}^{a}(\mathsf{v}){\mathbf{S}}_{2}^{b}(\mathsf{v})}, \tag{95}
$$
where $P_{S}^{\mathsf{v}}$ is the law of a random matrix $\mathsf{v}\bar{{\mathbf{W}}}\bar{{\mathbf{W}}}^{\intercal}|\mathcal{I}_{\mathsf{v}}|^{-1/2}$ with $\bar{\mathbf{W}}∈\mathbb{R}^{d×|\mathcal{I}_{\mathsf{v}}|}$ having i.i.d. standard Gaussian entries. For properly chosen $(\bar{\tau}_{\mathsf{v}}^{ab})$ , (94) is verified for this simplified measure.
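The second-moment condition (94) can be checked directly against samples of the law $P_{S}^{\mathsf{v}}$ . A minimal single-replica Monte Carlo sketch ( $a=b$ , $\mathsf{v}=\mathsf{v}^{\prime}$ ), with illustrative values of $d$ , $|\mathcal{I}_{\mathsf{v}}|$ and $\mathsf{v}$ ; the exact finite- $d$ Wishart moment $\mathbb{E}\,{\rm Tr}[(\bar{\mathbf{W}}\bar{\mathbf{W}}^{\intercal})^{2}]=dm(d+m+1)$ serves as reference:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 200, 50                    # m plays |I_v|, so gamma * P_v(v) = m / d = 0.25
v = 3 / 5 ** 0.5

est, trials = 0.0, 20
for _ in range(trials):
    W = rng.standard_normal((d, m))
    S = v * (W @ W.T) / np.sqrt(m)        # one sample of S_2(v) under P_S^v
    est += np.trace(S @ S) / d ** 2       # (1/d^2) Tr S S^T (S is symmetric)
est /= trials

# E Tr[(W W^T)^2] = d m (d + m + 1) for iid standard Gaussian W, hence
exact = v ** 2 * (d + m + 1) / d          # -> v^2 (1 + gamma P_v(v)) as d grows
```

The limiting value $\mathsf{v}^{2}(1+\gamma P_{v}(\mathsf{v}))$ matches the diagonal case of (94) with $\mathcal{Q}_{W}^{aa}(\mathsf{v})=1$ .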
However, the order parameters $({\mathcal{Q}}_{2}^{ab}(\mathsf{v},\mathsf{v}^{\prime}))$ are difficult to deal with if keeping a general form, as they not only imply coupled replicas $({\mathbf{S}}_{2}^{a}(\mathsf{v}))_{a}$ for a given $\mathsf{v}$ (a kind of coupling that is easily linearised with a single Hubbard-Stratonovich transformation, within the replica symmetric treatment justified in Bayes-optimal learning), but also a coupling for different values of the variable $\mathsf{v}$ . Linearising it would yield a more complicated matrix model than the integral reported in (D.3), because the resulting coupling field would break rotational invariance, so that the model no longer has a form known to be solvable; see Kazakov (2000).
A first idea to simplify $P(({\mathbf{S}}^{a}_{2})\mid\bm{{\mathcal{Q}}}_{W})$ (14) while taking into account the additional structure induced by (93), (94) and maintaining a solvable model, is to consider a generalisation of the relaxation (83). This entails dropping entirely the dependencies among matrix entries, induced by their Wishart-like form (92), for each ${\mathbf{S}}_{2}^{a}(\mathsf{v})$ . In this case, the moment constraints (94) can be exactly enforced by choosing the simplified measure
$$
\displaystyle d\bar{P}(({\mathbf{S}}_{2}^{a})\mid\bm{{\mathcal{Q}}}_{W})=\prod%
_{\mathsf{v}\in\mathsf{V}}\prod_{a=0}^{s}d{\mathbf{S}}^{a}_{2}(\mathsf{v})%
\prod_{\alpha=1}^{d}\delta(S^{a}_{2;\alpha\alpha}(\mathsf{v})-\mathsf{v}\sqrt{%
|\mathcal{I}_{\mathsf{v}}|})\times\prod_{\mathsf{v}\in\mathsf{V}}\prod_{\alpha%
_{1}<\alpha_{2}}^{d}\frac{e^{-\frac{1}{2}\sum_{a,b=0}^{s}S^{a}_{2;\alpha_{1}%
\alpha_{2}}(\mathsf{v})\bar{\tau}_{\mathsf{v}}^{ab}(\bm{{\mathcal{Q}}}_{W})S^{%
b}_{2;\alpha_{1}\alpha_{2}}(\mathsf{v})}}{\sqrt{(2\pi)^{s+1}\det(\bar{\bm{\tau%
}}_{\mathsf{v}}(\bm{{\mathcal{Q}}}_{W})^{-1})}}. \tag{96}
$$
The parameters $(\bar{\tau}^{ab}_{\mathsf{v}}(\bm{{\mathcal{Q}}}_{W}))$ are then properly chosen to enforce (94) for all $0\leq a\leq b\leq s$ and $\mathsf{v},\mathsf{v}^{\prime}\in\mathsf{V}$ . Using this measure, the resulting entropic term, taking into account the degeneracy of the order parameters $({\mathcal{Q}}_{2}^{ab}(\mathsf{v},\mathsf{v}^{\prime}))$ and $({\mathcal{Q}}_{W}^{ab}(\mathsf{v}))$ , remains tractable through Gaussian integrals (the energetic term is obviously unchanged once we express $(Q_{2}^{ab})$ entering it using these new order parameters through the identity (93), keeping in mind that nothing changes for higher order overlaps compared to the theory in the main text). We leave for future work the analysis of this Gaussian relaxation and of other possible simplifications of (95) leading to solvable models.
Appendix F Linking free entropy and mutual information
It is possible to relate the mutual information (MI) of the inference task to the free entropy $f_{n}=\mathbb{E}\ln\mathcal{Z}$ introduced in the main. Indeed, we can write the MI as
$$
\frac{I({\mathbf{W}}^{0};\mathcal{D})}{kd}=\frac{\mathcal{H}(\mathcal{D})}{kd}%
-\frac{\mathcal{H}(\mathcal{D}\mid{\mathbf{W}}^{0})}{kd}, \tag{97}
$$
where $\mathcal{H}(Y\mid X)$ is the conditional Shannon entropy of $Y$ given $X$ . It is straightforward to show that the free entropy is
$$
-\frac{\alpha}{\gamma}f_{n}=\frac{\mathcal{H}(\{y_{\mu}\}_{\mu\leq n}\mid\{{%
\mathbf{x}}_{\mu}\}_{\mu\leq n})}{kd}=\frac{\mathcal{H}(\mathcal{D})}{kd}-%
\frac{\mathcal{H}(\{{\mathbf{x}}_{\mu}\}_{\mu\leq n})}{kd}, \tag{98}
$$
by the chain rule for the entropy. On the other hand $\mathcal{H}(\mathcal{D}\mid{\mathbf{W}}^{0})=\mathcal{H}(\{y_{\mu}\}\mid{%
\mathbf{W}}^{0},\{{\mathbf{x}}_{\mu}\})+\mathcal{H}(\{{\mathbf{x}}_{\mu}\})$ , i.e.,
$$
\frac{\mathcal{H}(\mathcal{D}\mid{\mathbf{W}}^{0})}{kd}\approx-\frac{\alpha}{%
\gamma}\mathbb{E}_{\lambda}\int dyP_{\text{out}}(y\mid\lambda)\ln P_{\text{out%
}}(y\mid\lambda)+\frac{\mathcal{H}(\{{\mathbf{x}}_{\mu}\}_{\mu\leq n})}{kd}, \tag{99}
$$
where $\lambda\sim{\mathcal{N}}(0,r_{K})$ , with $r_{K}$ given by (53) (assuming here that $\mu_{0}=0$ , see App. D.5 if the activation $\sigma$ is non-centred), and the equality holds asymptotically in the considered limit of large $n,k,d$ . This allows us to express the MI as
$$
\frac{I({\mathbf{W}}^{0};\mathcal{D})}{kd}=-\frac{\alpha}{\gamma}f_{n}+\frac{%
\alpha}{\gamma}\mathbb{E}_{\lambda}\int dyP_{\text{out}}(y|\lambda)\ln P_{%
\text{out}}(y|\lambda). \tag{100}
$$
Specialising the equation to the Gaussian channel, one obtains
$$
\frac{I({\mathbf{W}}^{0};\mathcal{D})}{kd}=-\frac{\alpha}{\gamma}f_{n}-\frac{%
\alpha}{2\gamma}\ln(2\pi e\Delta). \tag{101}
$$
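This specialisation is a one-line computation: for the Gaussian channel $P_{\text{out}}(y\mid\lambda)=e^{-(y-\lambda)^{2}/(2\Delta)}/\sqrt{2\pi\Delta}$ , the inner integral in (100) is minus the differential entropy of a Gaussian,
$$
\int dyP_{\text{out}}(y\mid\lambda)\ln P_{\text{out}}(y\mid\lambda)=-\frac{1}{2}\ln(2\pi\Delta)-\frac{1}{2\Delta}\mathbb{E}_{y\sim P_{\text{out}}(\cdot\mid\lambda)}(y-\lambda)^{2}=-\frac{1}{2}\ln(2\pi e\Delta),
$$
independently of $\lambda$ , which plugged into (100) yields (101).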
Note that the choice of normalising by $kd$ is not accidental: the number of parameters is $kd+k\approx kd$ , hence with this choice the parameter $\alpha$ can be interpreted as an effective signal-to-noise ratio.
**Remark F.1**
*The arguments of Barbier et al. (2025) to show the existence of an upper bound on the mutual information per variable in the case of discrete variables and the associated inevitable breaking of prior universality beyond a certain threshold in matrix denoising apply to the present model too. It implies, as in the aforementioned paper, that the mutual information per variable cannot go beyond $\ln 2$ for Rademacher inner weights. Our theory is consistent with this fact; this is a direct consequence of the analysis in App. G (see in particular (108)) specialised to binary prior over ${\mathbf{W}}$ .*
Appendix G Large sample rate limit of $f_{\rm RS}^{\alpha,\gamma}$
In this section we show that when the prior over the weights ${\mathbf{W}}$ is discrete the MI can never exceed the entropy of the prior itself.
To do this, we first need to control the function $\rm mmse$ when its argument is large. By a saddle point argument, it is not difficult to show that the leading term of ${\rm mmse}_{S}(\tau)$ when $\tau\to\infty$ is of the type $C(\gamma)/\tau$ for a suitable constant $C$ depending at most on the rectangularity ratio $\gamma$ .
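As intuition for this decay rate, one can keep in mind the scalar analogue (a simplification, not the matrix quantity itself): for $S\sim\mathcal{N}(0,1)$ observed through $Y=\sqrt{\tau}S+Z$ with $Z\sim\mathcal{N}(0,1)$ independent, the posterior is Gaussian with variance
$$
{\rm mmse}(\tau)=\frac{1}{1+\tau}=\frac{1}{\tau}+O(\tau^{-2})\quad\text{as}\ \tau\to\infty,
$$
i.e., a $C/\tau$ decay with $C=1$ ; in the matrix case the constant instead depends on the rectangularity ratio $\gamma$ .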
We now notice that the equation for $\hat{\mathcal{Q}}_{W}(v)$ in (76) can be rewritten as
$$
\displaystyle\hat{\mathcal{Q}}_{W}(v)=\frac{1}{2\gamma}[{\rm mmse}_{S}(\tau)-{%
\rm mmse}_{S}(\tau+\hat{q}_{2})]\partial_{{\mathcal{Q}}_{W}(v)}\tau+2\frac{%
\alpha}{\gamma}\partial_{{\mathcal{Q}}_{W}(v)}\psi_{P_{\text{out}}}(q_{K}(q_{2%
},\mathcal{Q}_{W});r_{K}). \tag{102}
$$
For $\alpha→∞$ we make the self-consistent ansatz $\mathcal{Q}_{W}(v)=1-o_{\alpha}(1)$ . As a consequence $1/\tau$ has to vanish by the moment matching condition (74) as $o_{\alpha}(1)$ too. Using the very same equation, we are also able to evaluate $∂_{\mathcal{Q}_{W}(v)}\tau$ as follows:
$$
\displaystyle\partial_{\mathcal{Q}_{W}(v)}\tau=\frac{-2v^{2}\mathcal{Q}_{W}(v)%
}{{\rm mmse^{\prime}}(\tau)}\sim\tau^{2} \tag{103}
$$
as $\alpha→∞$ , where we have used ${\rm mmse}_{S}(\tau)\sim C(\gamma)/\tau$ to estimate the derivative. We use the same approximation for the two $\rm mmse$ ’s appearing in the fixed point equation for $\hat{\mathcal{Q}}_{W}(v)$ :
$$
\displaystyle\hat{\mathcal{Q}}_{W}(v)\sim\frac{\hat{q}_{2}}{2\gamma(\tau(\tau+%
\hat{q}_{2}))}\tau^{2}+2\frac{\alpha}{\gamma}\partial_{{\mathcal{Q}}_{W}(v)}%
\psi_{P_{\text{out}}}(q_{K}(q_{2},\mathcal{Q}_{W});r_{K}). \tag{104}
$$
From the last equation in (76) we see that $\hat{q}_{2}$ cannot diverge faster than $O(\alpha)$ . Thanks to the above approximation and the first equation of (76), this entails that $\mathcal{Q}_{W}(v)$ approaches $1$ exponentially fast in $\alpha$ , which in turn implies that $\tau$ diverges exponentially in $\alpha$ . As a consequence
$$
\displaystyle\frac{\tau^{2}}{\tau(\tau+\hat{q}_{2})}\sim 1. \tag{105}
$$
Furthermore, one also has
$$
\displaystyle\frac{1}{\alpha}[\iota(\tau)-\iota(\tau+\hat{q}_{2})]=-\frac{1}{4%
\alpha}\int_{\tau}^{\tau+\hat{q}_{2}}{\rm mmse}_{S}(t)\,dt\approx-\frac{C(%
\gamma)}{4\alpha}\log(1+\frac{\hat{q}_{2}}{\tau})\xrightarrow[]{\alpha\to%
\infty}0, \tag{106}
$$
as $\frac{\hat{q}_{2}}{\tau}$ vanishes with exponential speed in $\alpha$ .
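The vanishing claimed in (106) can also be checked numerically under the scalings argued above; the sketch below takes $C(\gamma)=1$ , $\hat{q}_{2}=\alpha$ and $\tau=e^{\alpha}$ purely for illustration, and compares a midpoint quadrature of the integral with the closed form $\frac{C}{4\alpha}\ln(1+\hat{q}_{2}/\tau)$ :

```python
import math

def integral_term(alpha, C=1.0, n_steps=10_000):
    """Midpoint quadrature of (1/(4 alpha)) * int_tau^{tau+q2} C/t dt,
    with the illustrative scalings q2 = alpha, tau = exp(alpha)."""
    tau, q2 = math.exp(alpha), alpha
    h = q2 / n_steps
    acc = sum(C / (tau + (i + 0.5) * h) for i in range(n_steps)) * h
    return acc / (4.0 * alpha)

def closed_form(alpha, C=1.0):
    tau, q2 = math.exp(alpha), alpha
    return (C / (4.0 * alpha)) * math.log1p(q2 / tau)

for a in (1.0, 5.0, 20.0):
    assert abs(integral_term(a) - closed_form(a)) < 1e-10  # log formula holds
assert closed_form(40.0) < closed_form(20.0) < 1e-8        # vanishes with alpha
```

The exponential growth of $\tau$ makes $\hat{q}_{2}/\tau$ , and hence the whole term, vanish extremely fast.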
Concerning the function $\psi_{P_{W}}$ , given that it is related to a Bayes-optimal scalar Gaussian channel and that its SNRs $\hat{\mathcal{Q}}_{W}(v)$ are all diverging, one can compute the integral by saddle point, which is inevitably attained at the ground truth:
$$
\displaystyle\psi_{P_{W}}(\hat{\mathcal{Q}}_{W}(v))-\frac{\hat{\mathcal{Q}}_{W}(v)\mathcal{Q}_{W}(v)}{2}\approx\mathbb{E}_{w^{0}}\ln\int dP_{W}(w)\mathbbm{1}(w=w^{0})+\mathbb{E}\Big{[}\big(\sqrt{\hat{\mathcal{Q}}_{W}(v)}\,\xi+\hat{\mathcal{Q}}_{W}(v)w^{0}\big)w^{0}-\frac{\hat{\mathcal{Q}}_{W}(v)}{2}(w^{0})^{2}\Big{]}-\frac{\hat{\mathcal{Q}}_{W}(v)(1-o_{\alpha}(1))}{2}=-\mathcal{H}(W)+o_{\alpha}(1). \tag{107}
$$
Considering that $\psi_{P_{\text{out}}}(q_{K}(q_{2},\mathcal{Q}_{W});r_{K})\xrightarrow[]{\alpha\to\infty}\psi_{P_{\text{out}}}(r_{K};r_{K})$ , and using (100), it is then straightforward to check that our RS version of the MI saturates to the entropy of the prior $P_{W}$ when $\alpha\to\infty$ :
$$
\displaystyle-\frac{\alpha}{\gamma}\text{extr}f_{\rm RS}^{\alpha,\gamma}+\frac%
{\alpha}{\gamma}\mathbb{E}_{\lambda}\int dyP_{\text{out}}(y|\lambda)\ln P_{%
\text{out}}(y|\lambda)\xrightarrow[]{\alpha\to\infty}\mathcal{H}(W). \tag{108}
$$
Appendix H Extension of GAMP-RIE to arbitrary activation
Algorithm 1 GAMP-RIE for training shallow neural networks with arbitrary activation
Input: Fresh data point ${\mathbf{x}}_{\text{test}}$ with unknown associated response $y_{\text{test}}$ , dataset $\mathcal{D}=\{({\mathbf{x}}_{\mu},y_{\mu})\}_{\mu=1}^{n}$ .
Output: Estimator $\hat{y}_{\text{test}}$ of $y_{\text{test}}$ .
Estimate $y^{(0)}:=\mu_{0}{\mathbf{v}}^{\intercal}\bm{1}/\sqrt{k}$ as
$$
\hat{y}^{(0)}=\frac{1}{n}\sum_{\mu}y_{\mu};
$$
Estimate $\langle{\mathbf{W}}^{\intercal}{\mathbf{v}}\rangle/\sqrt{k}$ using (117).
Estimate the $\mu_{1}$ term in the Hermite expansion (111) as
$$
\displaystyle\hat{y}_{\mu}^{(1)}=\mu_{1}\frac{\langle{\mathbf{v}}^{\intercal}{\mathbf{W}}\rangle{\mathbf{x}}_{\mu}}{\sqrt{kd}};
$$
Compute
$$
\displaystyle\tilde{y}_{\mu}=\frac{y_{\mu}-\hat{y}^{(0)}-\hat{y}_{\mu}^{(1)}}{\mu_{2}/2};\qquad\tilde{\Delta}=\frac{\Delta+g(1)}{\mu_{2}^{2}/4};
$$
Input $\{({\mathbf{x}}_{\mu},\tilde{y}_{\mu})\}_{\mu=1}^{n}$ and $\tilde{\Delta}$ into Algorithm 1 in Maillard et al. (2024a) to estimate $\langle{\mathbf{W}}^{\intercal}({\mathbf{v}}){\mathbf{W}}\rangle$ ;
Output
$$
\displaystyle\hat{y}_{\text{test}}=\hat{y}^{(0)}+\mu_{1}\frac{\langle{\mathbf{v}}^{\intercal}{\mathbf{W}}\rangle{\mathbf{x}}_{\text{test}}}{\sqrt{kd}}+\frac{\mu_{2}}{2}\frac{1}{d\sqrt{k}}{\rm Tr}[({\mathbf{x}}_{\text{test}}{\mathbf{x}}_{\text{test}}^{\intercal}-{\mathbb{I}})\langle{\mathbf{W}}^{\intercal}({\mathbf{v}}){\mathbf{W}}\rangle].
$$
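The pipeline of Algorithm 1 can be sketched in code. The fragment below is our illustrative rendering (function and argument names such as `gamp_rie_predict` and `denoise_S2` are ours, not the paper's): the matrix sensing step is delegated to a placeholder standing in for the RIE-based solver of Maillard et al. (2024a), and the $\sqrt{k}$ normalisation factors are absorbed into the estimates.

```python
import numpy as np

def gamp_rie_predict(X, y, x_test, mu1, mu2, Delta, g1, S1_hat, denoise_S2):
    """Sketch of the pipeline of Algorithm 1. X: (n, d) inputs, y: (n,)
    responses. S1_hat estimates the rescaled projection <W^T v>/sqrt(k),
    e.g. via the ridge-type formula (117). denoise_S2 stands in for the
    RIE-based matrix sensing solver of Maillard et al. (2024a); here it is
    a placeholder, not an implementation of it."""
    n, d = X.shape
    # Step 1: estimate the mu_0 (mean) contribution by the empirical mean.
    y0_hat = y.mean()
    # Steps 2-3: subtract the mean and linear (mu_1) Hermite contributions.
    y1_hat = mu1 * (X @ S1_hat) / np.sqrt(d)
    # Rescale so the quadratic term has unit prefactor; the beyond-quadratic
    # terms act as an effective noise of variance g(1).
    y_tilde = (y - y0_hat - y1_hat) / (mu2 / 2)
    Delta_tilde = (Delta + g1) / (mu2**2 / 4)
    # Step 4: matrix sensing step, delegated to the (placeholder) denoiser,
    # returning an estimate of the rescaled second-layer overlap matrix.
    S2_hat = denoise_S2(X, y_tilde, Delta_tilde)  # (d, d) symmetric
    # Output: plug both estimates into the truncated Hermite expansion.
    M_test = np.outer(x_test, x_test) - np.eye(d)
    return (y0_hat
            + mu1 * (x_test @ S1_hat) / np.sqrt(d)
            + (mu2 / 2) * np.trace(M_test @ S2_hat) / d)
```

Plugging the ridge-type estimate (117) into `S1_hat` and the RIE-based solver into `denoise_S2` recovers the full algorithm.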
For simplicity, let us consider $P_{\rm out}(y\mid\lambda)=\exp(-\frac{1}{2\Delta}(y-\lambda)^{2})/\sqrt{2\pi\Delta}$ , which entails:
$$
\displaystyle y_{\mu}\mid({\bm{\theta}}^{0},{\mathbf{x}}_{\mu})\overset{\rm{d}%
}{=}\frac{{\mathbf{v}}^{\intercal}}{\sqrt{k}}\sigma\Big{(}\frac{{\mathbf{W}}^{%
0}{\mathbf{x}}_{\mu}}{\sqrt{d}}\Big{)}+\sqrt{\Delta}\,z_{\mu},\quad\mu=1,\dots,n, \tag{110}
$$
where $z_{\mu}$ are i.i.d. standard Gaussian random variables and $\overset{\rm d}{{}={}}$ means equality in law. Expanding $\sigma$ in the Hermite polynomial basis we have
$$
\displaystyle y_{\mu}\mid({\bm{\theta}}^{0},{\mathbf{x}}_{\mu})\overset{\rm{d}%
}{=}\mu_{0}\frac{{\mathbf{v}}^{\intercal}\bm{1}_{k}}{\sqrt{k}}+\mu_{1}\frac{{%
\mathbf{v}}^{\intercal}{\mathbf{W}}^{0}{\mathbf{x}}_{\mu}}{\sqrt{kd}}+\frac{%
\mu_{2}}{2}\frac{{\mathbf{v}}^{\intercal}}{\sqrt{k}}{\rm He}_{2}\Big{(}\frac{{%
\mathbf{W}}^{0}{\mathbf{x}}_{\mu}}{\sqrt{d}}\Big{)}+\dots+\sqrt{\Delta}z_{\mu} \tag{111}
$$
where $\dots$ represents the terms beyond second order. Without loss of generality, for this choice of output channel we can set $\mu_{0}=0$ , as discussed in App. D.5. For low enough $\alpha$ it is reasonable to assume that the higher order terms in $\dots$ cannot be learnt given quadratically many samples and, as a result, play the role of an effective noise, which we assume independent of the first three terms. We shall see that this reasoning actually applies to the extension of the GAMP-RIE we derive, which plays the role of a “smart” spectral algorithm, regardless of the value of $\alpha$ . Therefore, these terms accumulate into an asymptotically Gaussian noise thanks to the central limit theorem (being a projection of a centred function applied entry-wise to a vector with i.i.d. entries), with variance $g(1)$ (see (43)). We thus obtain the effective model
$$
\displaystyle y_{\mu}\mid({\bm{\theta}}^{0},{\mathbf{x}}_{\mu})\overset{\rm{d}%
}{=}\mu_{1}\frac{{\mathbf{v}}^{\intercal}{\mathbf{W}}^{0}{\mathbf{x}}_{\mu}}{%
\sqrt{kd}}+\frac{\mu_{2}}{2}\frac{{\mathbf{v}}^{\intercal}}{\sqrt{k}}{\rm He}_%
{2}\Big{(}\frac{{\mathbf{W}}^{0}{\mathbf{x}}_{\mu}}{\sqrt{d}}\Big{)}+\sqrt{%
\Delta+g(1)}\,z_{\mu}. \tag{112}
$$
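The coefficients entering the Hermite expansion (111) are $\mu_{0}=\mathbb{E}\,\sigma(z)$ , $\mu_{1}=\mathbb{E}[z\sigma(z)]$ and $\mu_{2}=\mathbb{E}[{\rm He}_{2}(z)\sigma(z)]$ with $z\sim\mathcal{N}(0,1)$ . They are easily evaluated numerically; a minimal sketch using plain trapezoidal quadrature against the Gaussian weight, for $\sigma={\rm ReLU}$ (whose exact values are $\mu_{0}=\mu_{2}=1/\sqrt{2\pi}$ and $\mu_{1}=1/2$ ):

```python
import math

def gauss_expect(f, lo=-10.0, hi=10.0, n=20001):
    """Trapezoidal estimate of E_{z ~ N(0,1)} f(z) on a uniform grid."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        z = lo + i * h
        w = 0.5 if i in (0, n - 1) else 1.0
        total += w * f(z) * math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2.0 * math.pi)

relu = lambda z: max(z, 0.0)
mu0 = gauss_expect(relu)                              # E[sigma(z)]
mu1 = gauss_expect(lambda z: z * relu(z))             # E[z sigma(z)]
mu2 = gauss_expect(lambda z: (z * z - 1) * relu(z))   # E[He_2(z) sigma(z)]
nu = gauss_expect(lambda z: relu(z) ** 2)             # E[sigma(z)^2]
```

For ReLU this gives $\mu_{0}\approx 0.399$ , $\mu_{1}=0.5$ , $\mu_{2}\approx 0.399$ and $\nu=0.5$ . Consistently with the noise term $\Delta+\nu-\mu_{0}^{2}-\mu_{1}^{2}$ of (116), the variance of the discarded beyond-quadratic part would read $\nu-\mu_{0}^{2}-\mu_{1}^{2}-\mu_{2}^{2}/2$ ; we state this identification with $g(1)$ as our reading of (43), not as a claim of the paper.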
The first term in this expression can be learnt with vanishing error given quadratically many samples (Remark H.1), thus can be ignored. This further simplifies the model to
$$
\displaystyle\bar{y}_{\mu}:=y_{\mu}-\mu_{1}\frac{{\mathbf{v}}^{\intercal}{%
\mathbf{W}}^{0}{\mathbf{x}}_{\mu}}{\sqrt{kd}}\overset{\rm d}{{}={}}\frac{\mu_{%
2}}{2}\frac{{\mathbf{v}}^{\intercal}}{\sqrt{k}}{\rm He}_{2}\Big{(}\frac{{%
\mathbf{W}}^{0}{\mathbf{x}}_{\mu}}{\sqrt{d}}\Big{)}+\sqrt{\Delta+g(1)}\,z_{\mu}, \tag{113}
$$
where $\bar{y}_{\mu}$ is $y_{\mu}$ with the (asymptotically) perfectly learnt linear term removed, and the last equality in distribution is again conditional on $({\bm{\theta}}^{0},{\mathbf{x}}_{\mu})$ . From the formula
$$
\displaystyle\frac{{\mathbf{v}}^{\intercal}}{\sqrt{k}}{\rm He}_{2}\Big{(}\frac%
{{\mathbf{W}}^{0}{\mathbf{x}}_{\mu}}{\sqrt{d}}\Big{)}={\rm Tr}\frac{{\mathbf{W%
}}^{0\intercal}({\mathbf{v}}){\mathbf{W}}^{0}}{d\sqrt{k}}{\mathbf{x}}_{\mu}{%
\mathbf{x}}_{\mu}^{\intercal}-\frac{{\mathbf{v}}^{\intercal}\bm{1}_{k}}{\sqrt{%
k}}\approx\frac{1}{\sqrt{k}d}{\rm Tr}[({\mathbf{x}}_{\mu}{\mathbf{x}}_{\mu}^{%
\intercal}-{\mathbb{I}}_{d}){\mathbf{W}}^{0\intercal}({\mathbf{v}}){\mathbf{W}%
}^{0}], \tag{114}
$$
where $\approx$ exploits the concentration ${\rm Tr}\,{\mathbf{W}}^{0\intercal}({\mathbf{v}}){\mathbf{W}}^{0}/(d\sqrt{k})\to{\mathbf{v}}^{\intercal}\bm{1}_{k}/\sqrt{k}$ , and the Gaussian equivalence property that ${\mathbf{M}}_{\mu}:=({\mathbf{x}}_{\mu}{\mathbf{x}}_{\mu}^{\intercal}-{\mathbb{I}}_{d})/\sqrt{d}$ behaves like a GOE sensing matrix, i.e., a symmetric matrix whose upper triangular part has i.i.d. entries from $\mathcal{N}(0,(1+\delta_{ij})/d)$ Maillard et al. (2024a). The model can then be seen as a GLM with signal $\bar{\mathbf{S}}^{0}_{2}:={\mathbf{W}}^{0\intercal}({\mathbf{v}}){\mathbf{W}}^{0}/\sqrt{kd}$ :
$$
\displaystyle y^{\rm GLM}_{\mu}=\frac{\mu_{2}}{2}{\rm Tr}[{\mathbf{M}}_{\mu}%
\bar{\mathbf{S}}^{0}_{2}]+\sqrt{\Delta+g(1)}\,z_{\mu}. \tag{115}
$$
Starting from this equation, the arguments of App. D and Maillard et al. (2024a), based on known results on GLMs Barbier et al. (2019) and matrix denoising Barbier & Macris (2022); Maillard et al. (2022); Pourkamali et al. (2024), allow us to obtain the free entropy of this matrix sensing problem. The result is consistent with the $\mathcal{Q}_{W}\equiv 0$ solution of the saddle point equations obtained from the replica method in App. D, which, as anticipated, corresponds to the case where the Hermite polynomial combinations of the signal beyond the second one are not learnt.
Note that, as supported by the numerics, the model actually admits specialisation when $\alpha$ is large enough, hence the above equivalence cannot hold on the whole phase diagram at the information-theoretic level. Indeed, if specialisation occurs one cannot treat the $\dots$ terms in (111) as noise uncorrelated with the first ones: the model is aligning with the actual teacher's weights, so that it learns all the successive terms at once.
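The Gaussian equivalence invoked above can be sanity-checked at the level of second moments: the entries of ${\mathbf{M}}_{\mu}=({\mathbf{x}}_{\mu}{\mathbf{x}}_{\mu}^{\intercal}-{\mathbb{I}}_{d})/\sqrt{d}$ should have variance $(1+\delta_{ij})/d$ , as for a GOE matrix. A minimal Monte Carlo sketch (an illustrative check only; it does not probe the full equivalence used in the analysis):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_samples = 50, 20_000
xs = rng.standard_normal((n_samples, d))  # i.i.d. standard Gaussian inputs

# Entry (i, j) of M_mu is (x_i x_j - delta_ij) / sqrt(d);
# track one off-diagonal and one diagonal entry across samples.
off_diag = xs[:, 0] * xs[:, 1] / np.sqrt(d)   # target variance 1/d
diag = (xs[:, 0] ** 2 - 1.0) / np.sqrt(d)     # target variance 2/d

assert abs(off_diag.var() * d - 1.0) < 0.1
assert abs(diag.var() * d - 2.0) < 0.2
```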
Figure 7: Theoretical prediction (solid curves) of the Bayes-optimal mean-square generalisation error for binary inner weights and ReLU and ELU activations, with $\gamma=0.5$ , $d=150$ , Gaussian label noise with $\Delta=0.1$ , and fixed readouts ${\mathbf{v}}=\mathbf{1}$ . Dashed lines are obtained from the solution of the fixed point equations (76) with all $\mathcal{Q}_{W}(\mathsf{v})=0$ . Circles are the test error of the GAMP-RIE (Maillard et al., 2024a) extended to generic activation. The MCMC points initialised uninformatively (inset) are obtained using (36), to account for the lack of equilibration due to glassiness, which prevents using (38). Even in the possibly glassy region, the GAMP-RIE attains the universal branch performance. Data for GAMP-RIE and MCMC are averaged over 16 data instances, with error bars representing one standard deviation over instances.
We now assume that this mapping holds at the algorithmic level, namely, that we can process the data algorithmically as if they were coming from the identified GLM, and thus try to infer the signal $\bar{\mathbf{S}}_{2}^{0}={\mathbf{W}}^{0\intercal}({\mathbf{v}}){\mathbf{W}}^{0}/\sqrt{kd}$ and construct a predictor from it. Based on this idea, we propose Algorithm 1, which can indeed reach the performance predicted by the $\mathcal{Q}_{W}\equiv 0$ solution of our replica theory.
**Remark H.1**
*In the linear data regime, where $n/d$ converges to a fixed constant $\alpha_{1}$ , only the first term in (111) can be learnt while the rest behaves like noise. By the same argument as above, the model is equivalent to
$$
\displaystyle y_{\mu}=\mu_{1}\frac{{\mathbf{v}}^{\intercal}{\mathbf{W}}^{0}{%
\mathbf{x}}_{\mu}}{\sqrt{kd}}+\sqrt{\Delta+\nu-\mu_{0}^{2}-\mu_{1}^{2}}\,z_{%
\mu}, \tag{116}
$$
where $\nu=\mathbb{E}_{z\sim{\mathcal{N}}(0,1)}\sigma^{2}(z)$ . This is again a GLM with signal ${\mathbf{S}}_{1}^{0}={\mathbf{W}}^{0\intercal}{\mathbf{v}}/\sqrt{k}$ and Gaussian sensing vectors ${\mathbf{x}}_{\mu}$ . Define $q_{1}$ as the limit of ${\mathbf{S}}_{1}^{a\intercal}{\mathbf{S}}_{1}^{b}/d$ where ${\mathbf{S}}_{1}^{a},{\mathbf{S}}_{1}^{b}$ are drawn independently from the posterior. As $k\to\infty$ , the signal converges in law to a standard Gaussian vector. Using known results on GLMs with Gaussian signal Barbier et al. (2019), we obtain the following equations characterising $q_{1}$ :
$$
\displaystyle q_{1}=\frac{\hat{q}_{1}}{\hat{q}_{1}+1},\qquad\hat{q}_{1}=\frac{\alpha_{1}}{1+\Delta_{1}-q_{1}},\quad\text{where}\quad\Delta_{1}=\frac{\Delta+\nu-\mu_{0}^{2}-\mu_{1}^{2}}{\mu_{1}^{2}}.
$$
In the quadratic data regime, as $\alpha_{1}=n/d$ goes to infinity, the overlap $q_{1}$ converges to $1$ and the first term in (111) is learnt with vanishing error. Moreover, since ${\mathbf{S}}_{1}^{0}$ is asymptotically Gaussian, the linear problem (116) is equivalent to denoising the Gaussian vector $({\mathbf{v}}^{\intercal}{\mathbf{W}}^{0}{\mathbf{x}}_{\mu}/\sqrt{kd})_{\mu=1}^{n}$ whose covariance is known as a function of ${\mathbf{X}}=({\mathbf{x}}_{1},...,{\mathbf{x}}_{n})\in\mathbb{R}^{d\times n}$ . This leads to the following simple MMSE estimator for ${\mathbf{S}}_{1}^{0}$ :
$$
\displaystyle\langle{\mathbf{S}}_{1}^{0}\rangle=\frac{1}{\sqrt{d\Delta_{1}}}%
\left(\mathbf{I}+\frac{1}{d\Delta_{1}}{\mathbf{X}}{\mathbf{X}}^{\intercal}%
\right)^{-1}{\mathbf{X}}{\mathbf{y}} \tag{117}
$$
where ${\mathbf{y}}=(y_{1},...,y_{n})$ . Note that the derivation of this estimator does not assume the Gaussianity of ${\mathbf{x}}_{\mu}$ .*
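The estimator (117) amounts to a single linear solve. Below is a sketch on synthetic data from the effective linear model (116); we take $\mu_{1}=1$ so the responses need no rescaling (an assumption of this toy setup), and measure the overlap of the estimate with the Gaussian ground truth:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, Delta1 = 100, 1000, 0.5
S1_true = rng.standard_normal(d)              # asymptotically Gaussian signal
X = rng.standard_normal((d, n))               # columns are the inputs x_mu
y = X.T @ S1_true / np.sqrt(d) + np.sqrt(Delta1) * rng.standard_normal(n)

# Estimator (117): (1/sqrt(d*Delta1)) (I + X X^T/(d*Delta1))^{-1} X y
A = np.eye(d) + X @ X.T / (d * Delta1)
S1_hat = np.linalg.solve(A, X @ y) / np.sqrt(d * Delta1)

# Overlap (cosine similarity) with the ground truth; it is insensitive to
# the overall normalisation convention chosen for the responses.
overlap = S1_hat @ S1_true / (np.linalg.norm(S1_hat) * np.linalg.norm(S1_true))
assert overlap > 0.9
```

At this sample rate ( $\alpha_{1}=10$ ) the fixed point equations above predict $q_{1}\approx 0.95$ , in line with the high overlap observed.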
**Remark H.2**
*The same argument can be easily generalised for general $P_{\text{out}}$ , leading to the following equivalent GLM in the universal ${\mathcal{Q}}_{W}^{*}\equiv 0$ phase of the quadratic data regime:
$$
\displaystyle y_{\mu}^{\rm GLM}\sim\tilde{P}_{\text{out}}(\cdot\mid{\rm Tr}[{%
\mathbf{M}}_{\mu}\bar{\mathbf{S}}^{0}_{2}]),\quad\text{where}\quad\tilde{P}_{%
\text{out}}(y|x):=\mathbb{E}_{z\sim\mathcal{N}(0,1)}P_{\text{out}}\Big{(}y\mid%
\frac{\mu_{2}}{2}x+z\sqrt{g(1)}\Big{)}, \tag{118}
$$
and ${\mathbf{M}}_{\mu}$ are independent GOE sensing matrices.*
**Remark H.3**
*One can show that the system of equations $({\rm S})$ in (LABEL:NSB_equations_gaussian_ch) with $\mathcal{Q}_{W}(\mathsf{v})$ all set to $0$ (and consequently $\tau=0$ ) can be mapped onto the fixed point of the state evolution equations (92), (94) of the GAMP-RIE in Maillard et al. (2024a) up to changes of variables. This confirms that when such a system has a unique solution, which is the case in all our tests, the GAMP-RIE asymptotically matches our universal solution. Assuming the validity of the aforementioned effective GLM, a potential improvement for discrete weights could come from a generalisation of GAMP which, in the denoising step, would correctly exploit the discrete prior over inner weights rather than using the RIE (which is prior independent). However, the results of Barbier et al. (2025) suggest that optimally denoising matrices with discrete entries is hard, and that the RIE is the best efficient procedure to do so. Consequently, we tend to believe that improving the GAMP-RIE in the case of discrete weights is out of reach without strong side information about the teacher, or without resorting to non-polynomial-time algorithms (see Appendix I).*
Appendix I Algorithmic complexity of finding the specialisation solution
<details>
<summary>x15.png Details</summary>

### Visual Description
## Line Graph: Gradient Updates vs. Dimension
### Overview
The image is a line graph depicting the relationship between "Gradient updates (log scale)" and "Dimension" for three distinct linear fits and corresponding data series. The y-axis uses a logarithmic scale, and the x-axis ranges from 50 to 250. Three linear fits (blue, green, red dashed lines) and three data series (blue circles, green squares, red triangles) are plotted, each associated with specific slope values and ε* parameters.
---
### Components/Axes
- **X-axis (Dimension)**: Labeled "Dimension," with values ranging from 50 to 250 in increments of 25.
- **Y-axis (Gradient updates)**: Labeled "Gradient updates (log scale)," with values spanning $10^3$ to $10^4$.
- **Legend**: Located in the top-left corner, containing:
- **Blue dashed line**: "Linear fit: slope=0.0146"
- **Green dashed line**: "Linear fit: slope=0.0138"
- **Red dashed line**: "Linear fit: slope=0.0136"
- **Data Series**:
- **Blue circles**: ε* = 0.008
- **Green squares**: ε* = 0.01
- **Red triangles**: ε* = 0.012
- **Error Bars**: Vertical lines attached to data points, indicating uncertainty.
---
### Detailed Analysis
1. **Linear Fits**:
- **Blue line (slope=0.0146)**: Steepest slope, indicating the highest rate of increase in gradient updates with dimension.
- **Green line (slope=0.0138)**: Intermediate slope.
- **Red line (slope=0.0136)**: Shallowest slope, suggesting the slowest growth rate.
2. **Data Series**:
- **ε* = 0.008 (blue circles)**: Data points align closely with the blue dashed line, with error bars increasing slightly as dimension grows.
- **ε* = 0.01 (green squares)**: Data points follow the green dashed line, with error bars showing moderate variability.
- **ε* = 0.012 (red triangles)**: Data points track the red dashed line, with larger error bars at higher dimensions.
3. **Trends**:
- All three lines exhibit a positive, linear relationship between dimension and gradient updates on a logarithmic scale.
- The blue line (highest slope) grows fastest, while the red line (lowest slope) grows slowest.
- Error bars increase in size for all data series as dimension increases, suggesting greater uncertainty at higher dimensions.
---
### Key Observations
- **Slope Consistency**: The linear fits (dashed lines) closely match their respective data series (solid markers), confirming the validity of the linear approximations.
- **Error Bar Patterns**: Larger error bars at higher dimensions (e.g., 200–250) may indicate measurement limitations or non-linear behavior at extreme values.
- **ε* Correlation**: Higher ε* values (0.012) correspond to slower growth rates (red line), while lower ε* (0.008) align with faster growth (blue line).
---
### Interpretation
The graph demonstrates that gradient updates scale linearly with dimension for three distinct ε* values, with the rate of scaling governed by the slope of the linear fit. The blue line (ε* = 0.008) exhibits the steepest growth, suggesting that smaller ε* values amplify the impact of dimension on gradient updates. Conversely, larger ε* values (e.g., 0.012) result in slower scaling, as seen in the red line. The increasing error bars at higher dimensions imply potential challenges in maintaining precision or stability in high-dimensional systems. This trend could reflect computational constraints or theoretical limits in gradient-based optimization methods.
</details>
<details>
<summary>x16.png Details</summary>

### Visual Description
## Line Chart: Gradient Updates vs. Dimension (Log-Log Scale)
### Overview
The chart illustrates the relationship between **dimension** (log scale) and **gradient updates** (log scale) across three distinct linear fits, each associated with a specific slope and ε* value. Three data series are plotted, with markers and error bars indicating variability in gradient updates for different ε* values.
---
### Components/Axes
- **X-axis**: "Dimension (log scale)" ranging from $4 \times 10^1$ to $2 \times 10^2$.
- **Y-axis**: "Gradient updates (log scale)" ranging from $10^3$ to $10^4$.
- **Legend**: Located in the **top-left corner**, with three entries:
- **Blue dashed line**: Slope = 1.4451, ε* = 0.008.
- **Green dash-dot line**: Slope = 1.4692, ε* = 0.01.
- **Red dotted line**: Slope = 1.5340, ε* = 0.012.
- **Data Points**:
- Blue circles (ε* = 0.008).
- Green squares (ε* = 0.01).
- Red triangles (ε* = 0.012).
- **Error Bars**: Vertical error bars on data points indicate uncertainty in gradient updates.
---
### Detailed Analysis
1. **Blue Line (Slope = 1.4451, ε* = 0.008)**:
- Data points (blue circles) align closely with the dashed line.
- At $4 \times 10^1$ dimension, gradient updates ≈ $10^3$.
- At $2 \times 10^2$ dimension, gradient updates ≈ $10^4$.
- Error bars suggest moderate variability (e.g., ±10% at $10^4$ updates).
2. **Green Line (Slope = 1.4692, ε* = 0.01)**:
- Data points (green squares) follow the dash-dot line.
- At $4 \times 10^1$ dimension, gradient updates ≈ $10^3$.
- At $2 \times 10^2$ dimension, gradient updates ≈ $10^4$.
- Error bars are slightly larger than the blue line, indicating higher uncertainty.
3. **Red Line (Slope = 1.5340, ε* = 0.012)**:
- Data points (red triangles) align with the dotted line.
- At $4 \times 10^1$ dimension, gradient updates ≈ $10^3$.
- At $2 \times 10^2$ dimension, gradient updates ≈ $10^4$.
- Error bars are the largest, suggesting greater variability at higher ε*.
---
### Key Observations
- **Trend**: All three lines exhibit a **positive power-law relationship** between dimension and gradient updates, with steeper slopes for higher ε* values.
- **Consistency**: Data points for each ε* value closely follow their respective linear fits, confirming the power-law scaling.
- **Outliers**: No significant outliers; all data points fall within error margins.
- **Slope Correlation**: Higher ε* values (0.012) correspond to steeper slopes (1.5340), suggesting a direct relationship between learning rate and update scaling.
---
### Interpretation
The chart demonstrates that **gradient updates scale polynomially with dimension**, with the exponent (slope) increasing as ε* increases. This implies:
- **Higher learning rates (ε*)** require more gradient updates to achieve convergence in higher-dimensional spaces.
- The **power-law relationship** highlights the computational cost of training in high-dimensional settings, where updates grow faster than linearly with dimension.
- The error bars indicate that variability in gradient updates increases with ε*, possibly due to instability in optimization at larger learning rates.
This analysis underscores the trade-off between learning rate and computational efficiency in high-dimensional optimization problems.
</details>
<details>
<summary>x17.png Details</summary>

### Visual Description
## Line Graph: Gradient Updates vs. Dimension
### Overview
The image is a line graph depicting the relationship between "Dimension" (x-axis) and "Gradient updates (log scale)" (y-axis). Three linear fits with distinct slopes are plotted, alongside data points with error bars corresponding to different ε* values. The graph uses a logarithmic scale for gradient updates, emphasizing exponential growth trends.
### Components/Axes
- **X-axis (Dimension)**: Ranges from 50 to 250 in increments of 25. Labeled "Dimension."
- **Y-axis (Gradient updates)**: Logarithmic scale from 10² to 10³. Labeled "Gradient updates (log scale)."
- **Legend**: Positioned in the top-left corner. Contains:
- Blue dashed line: "Linear fit: slope=0.0127"
- Green dashed line: "Linear fit: slope=0.0128"
- Red dashed line: "Linear fit: slope=0.0135"
- Blue circles: ε* = 0.008
- Green squares: ε* = 0.01
- Red triangles: ε* = 0.012
### Detailed Analysis
1. **Linear Fits**:
- All three lines (blue, green, red) exhibit nearly identical slopes (0.0127–0.0135), indicating a consistent linear relationship between dimension and gradient updates.
- Lines are tightly clustered, with minimal divergence across the dimension range.
2. **Data Points**:
- **ε* = 0.008 (blue circles)**: Follow the blue dashed line closely, with error bars decreasing slightly as dimension increases.
- **ε* = 0.01 (green squares)**: Align with the green dashed line, showing similar error patterns.
- **ε* = 0.012 (red triangles)**: Match the red dashed line, with error bars slightly larger at higher dimensions.
3. **Trends**:
- Gradient updates increase linearly with dimension for all ε* values.
- The logarithmic y-axis reveals exponential growth in absolute gradient updates (e.g., 10² to 10³ represents a 10x increase).
### Key Observations
- **Consistency Across ε***: All ε* values follow nearly identical linear trends, suggesting ε* has minimal impact on the slope of gradient updates.
- **Error Bars**: Variability in gradient updates increases slightly at higher dimensions, particularly for ε* = 0.012 (red triangles).
- **Slope Similarity**: The near-identical slopes (0.0127–0.0135) imply that the relationship between dimension and gradient updates is robust across different ε* conditions.
### Interpretation
The graph demonstrates that gradient updates scale linearly with dimension, regardless of ε* values. The logarithmic y-axis highlights the exponential resource requirements for higher-dimensional models. The tight alignment between data points and linear fits suggests a strong theoretical or empirical basis for the observed trend. The slight increase in error bars at higher dimensions may indicate practical limitations (e.g., computational noise) in extreme-scale scenarios. This trend is critical for optimizing training efficiency in high-dimensional machine learning systems.
</details>
<details>
<summary>x18.png Details</summary>

### Visual Description
Log-log plot of gradient updates (10² to 10³) versus dimension (4×10¹ to 2×10²) for thresholds ε* = 0.008, 0.01 and 0.012, with linear fits of slopes 1.2884, 1.3823 and 1.5535 respectively. The straight lines on log-log axes indicate a power-law relationship t* ∝ d^slope whose exponent increases with ε*; error bars grow with dimension and with ε*.
</details>
<details>
<summary>x19.png Details</summary>

### Visual Description
Semilog plot of gradient updates (10² to 10³, log scale) versus dimension (50 to 250) for the same three thresholds. The linear fits have slopes 0.0090 (ε* = 0.008, blue circles), 0.0090 (ε* = 0.01, green squares) and 0.0088 (ε* = 0.012, red triangles); error bars grow with dimension, most noticeably for ε* = 0.012.
</details>
<details>
<summary>x20.png Details</summary>

### Visual Description
Log-log plot of gradient updates (10² to 10³) versus dimension (4×10¹ to 2×10²), with linear fits of slopes 1.0114 (ε* = 0.008), 1.0306 (ε* = 0.01) and 1.0967 (ε* = 0.012). All slopes exceed 1, consistent with a power law t* ∝ d^slope whose exponent grows with ε*; error bars widen at larger dimensions.
</details>
Figure 8: Semilog (Left) and log-log (Right) plots of the number of gradient updates needed to achieve a test loss below the threshold $\varepsilon^{*}<\varepsilon^{\rm uni}$. Student network trained with ADAM with optimised batch size for each point. The dataset was generated from a teacher network with ReLU activation and parameters $\Delta=10^{-4}$ for the Gaussian noise variance of the linear readout, $\gamma=0.5$ and $\alpha=5.0$, for which $\varepsilon^{\rm opt}-\Delta=1.115\times 10^{-5}$. Points are obtained by averaging over 10 teacher/data instances, with error bars representing the standard deviation. Each row corresponds to a different distribution of the readouts, kept fixed during training. Top: homogeneous readouts, for which the error of the universal branch is $\varepsilon^{\rm uni}-\Delta=1.217\times 10^{-2}$. Centre: Rademacher readouts, for which $\varepsilon^{\rm uni}-\Delta=1.218\times 10^{-2}$. Bottom: Gaussian readouts, for which $\varepsilon^{\rm uni}-\Delta=1.210\times 10^{-2}$. The quality of the fits can be read from Table 2.
| Readouts | Exp. $\varepsilon^{*}{=}0.008$ | Exp. $\varepsilon^{*}{=}0.01$ | Exp. $\varepsilon^{*}{=}0.012$ | Power $\varepsilon^{*}{=}0.008$ | Power $\varepsilon^{*}{=}0.01$ | Power $\varepsilon^{*}{=}0.012$ |
| --- | --- | --- | --- | --- | --- | --- |
| Homogeneous | $\bm{5.57}$ | $\bm{9.00}$ | $\bm{21.1}$ | $32.3$ | $26.5$ | $61.1$ |
| Rademacher | $\bm{4.51}$ | $\bm{6.84}$ | $\bm{12.7}$ | $12.0$ | $17.4$ | $16.0$ |
| Uniform $[-\sqrt{3},\sqrt{3}]$ | $\bm{5.08}$ | $\bm{1.44}$ | $4.21$ | $8.26$ | $8.57$ | $\bm{3.82}$ |
| Gaussian | $2.66$ | $\bm{0.76}$ | $3.02$ | $\bm{0.55}$ | $2.31$ | $\bm{1.36}$ |
Table 2: $\chi^{2}$ test for the exponential and power-law fits of the time needed by ADAM to reach the thresholds $\varepsilon^{*}$, for various priors on the readouts. Fits are displayed in Figure 8. Smaller values of $\chi^{2}$ (in bold, for a given threshold and readout distribution) indicate better compatibility with the corresponding hypothesis.
<details>
<summary>x21.png Details</summary>

### Visual Description
Linear-scale plot of gradient updates (0 to 7000) versus dimension (40 to 200), comparing an exponential fit (red dashed) and a power-law fit (green dashed) against the data (blue markers with error bars). The data track the exponential fit closely, especially for dimensions above 100 where the power-law fit underestimates the updates; error bars grow substantially with dimension.
</details>
<details>
<summary>x22.png Details</summary>

### Visual Description
Linear-scale plot of gradient updates (0 to 700) versus dimension (50 to 225), again comparing exponential (red dashed) and power-law (green dashed) fits against the data. Here the data cluster near the power-law fit, with the exponential fit growing faster; error bars increase with dimension.
</details>
Figure 9: Same as in Fig. 8, but in linear scale for better visualisation, for homogeneous readouts (Left) and Gaussian readouts (Right), with threshold $\varepsilon^{*}=0.008$ .
<details>
<summary>x23.png Details</summary>

### Visual Description
Test loss versus gradient updates (0 to 6000) for dimensions d = 60 to 180 (light to dark red), with horizontal dashed reference lines at ε^uni and ε^opt and shaded one-standard-deviation bands. All curves drop sharply within the first ~1000 updates, then fluctuate while approaching the reference levels.
</details>
<details>
<summary>x24.png Details</summary>

### Visual Description
Test loss versus gradient updates (0 to 2000) for d = 60 to 180, with horizontal reference lines at 2ε^uni, ε^uni and ε^opt and shaded variability bands. The curves drop sharply within the first few hundred updates, and the bands narrow as training proceeds.
</details>
<details>
<summary>x25.png Details</summary>

### Visual Description
Test loss versus gradient updates (0 to 600) for d = 60 to 180, starting near 0.06, with horizontal reference lines at 2ε^uni and ε^opt and shaded variability bands. Smaller dimensions cross the reference levels earlier, while larger dimensions decay more slowly over this window.
</details>
Figure 10: Trajectories of the generalisation error of neural networks trained with ADAM at fixed batch size $B=\lfloor n/4\rfloor$ and learning rate 0.05, for ReLU activation with parameters $\Delta=10^{-4}$ for the Gaussian noise variance of the linear readout, $\gamma=0.5$ and $\alpha=5.0>\alpha_{\rm sp}$ ($=0.22,0.12,0.02$ for homogeneous, Rademacher and Gaussian readouts respectively). The error $\varepsilon^{\rm uni}$ is the mean-square generalisation error associated with the universal solution with overlap $\mathcal{Q}_{W}\equiv 0$. Left: Homogeneous readouts. Centre: Rademacher readouts. Right: Gaussian readouts. Readouts are kept fixed (and equal to the teacher’s) in all cases during training. Points on the solid lines are obtained by averaging over 5 teacher/data instances, and the shaded regions around them correspond to one standard deviation.
We now provide empirical evidence concerning the computational complexity of attaining specialisation, namely of reaching a state where at least one of the $\mathcal{Q}_{W}(\mathsf{v})>0$, or equivalently of beating the “universal” performance ($\mathcal{Q}_{W}(\mathsf{v})=0$ for all $\mathsf{v}\in\mathsf{V}$) in terms of generalisation error. We tested two algorithms that can find specialisation in affordable computational time: ADAM with batch size optimised for every dimension tested (the learning rate is tuned automatically), and Hamiltonian Monte Carlo (HMC), both used to infer a two-layer teacher network with Gaussian inner weights.
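For concreteness, the teacher–student data-generating process used throughout these experiments can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' code: the function name is ours, and the $1/\sqrt{d}$, $1/\sqrt{k}$ scalings are assumed conventions, with $\gamma=k/d$ and $\alpha=n/d$ as in the text (Gaussian readouts shown here as one of the studied cases):

```python
import numpy as np

def teacher_dataset(d, gamma=0.5, alpha=5.0, Delta=1e-4, seed=0):
    """Generate (X, y) from a two-layer teacher with Gaussian inner weights.

    Hidden width k = gamma * d, number of samples n = alpha * d,
    Gaussian label noise of variance Delta on the linear readout.
    Scalings (1/sqrt(d), 1/sqrt(k)) are assumed conventions.
    """
    rng = np.random.default_rng(seed)
    k, n = int(gamma * d), int(alpha * d)
    W = rng.standard_normal((k, d))   # teacher inner weights (to be inferred)
    v = rng.standard_normal(k)        # Gaussian readouts (kept fixed)
    X = rng.standard_normal((n, d))
    h = np.maximum(X @ W.T / np.sqrt(d), 0.0)   # ReLU hidden activations
    y = h @ v / np.sqrt(k) + np.sqrt(Delta) * rng.standard_normal(n)
    return X, y, W, v

X, y, W, v = teacher_dataset(d=100)
print(X.shape, y.shape)   # (500, 100) (500,)
```

The student then posits the same architecture and, in the ADAM experiments below, trains only the inner weights on $(X, y)$.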
ADAM
We focus on ReLU activation, with $\gamma=0.5$, Gaussian output channel with low label noise ($\Delta=10^{-4}$) and $\alpha=5.0>\alpha_{\rm sp}$ ($=0.22,0.12,0.02$ for homogeneous, Rademacher and Gaussian readouts respectively; we are thus deep in the specialisation phase in all the cases we report), so that the specialisation solution exhibits a very low generalisation error. We test the learnt model at each gradient update, measuring the generalisation error with a moving average over 10 steps to smooth the curves. Let $\varepsilon^{\rm uni}$ be the generalisation error associated with the overlap $\mathcal{Q}_{W}\equiv 0$; fixing a threshold $\varepsilon^{\rm opt}<\varepsilon^{*}<\varepsilon^{\rm uni}$, we define $t^{*}(d)$ as the number of gradient updates needed for the algorithm to cross the threshold for the first time. We optimise over batch sizes $B_{p}=\left\lfloor n/2^{p}\right\rfloor$ for $p=2,3,\dots,\lfloor\log_{2}(n)\rfloor-1$: for each batch size, the student network is trained until the moving average of the test loss drops below $\varepsilon^{*}$, thus outperforming the universal solution; we have checked that in such a scenario the student ultimately gets close to the performance of the specialisation solution. The batch size that requires the fewest gradient updates is selected. We used the ADAM routine implemented in PyTorch.
We test different distributions for the readout weights (kept fixed to ${\mathbf{v}}$ during training of the inner weights). We report all the values of $t^{*}(d)$ in Fig. 8 for various dimensions $d$ at fixed $(\alpha,\gamma)$ , providing an exponential fit $t^{*}(d)=\exp(ad+b)$ (left panel) and a power-law fit $t^{*}(d)=ad^{b}$ (right panel). We report the $\chi^{2}$ test for the fits in Table 2. We observe that for homogeneous and Rademacher readouts, the exponential fit is more compatible with the experiments, while for Gaussian readouts the comparison is inconclusive.
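On semilog and log-log axes these two hypotheses reduce to straight-line fits, which is how they can be implemented in practice. A minimal sketch, using synthetic values rather than the experimental $t^{*}(d)$ data of Fig. 8:

```python
import numpy as np

def fit_exponential(d, t):
    """Fit t*(d) = exp(a d + b): linear regression of log t on d."""
    a, b = np.polyfit(d, np.log(t), 1)
    return a, b

def fit_power_law(d, t):
    """Fit t*(d) = a d^b: linear regression of log t on log d."""
    b, log_a = np.polyfit(np.log(d), np.log(t), 1)
    return np.exp(log_a), b

def chi2(t_obs, t_pred, sigma):
    """Chi-squared statistic with per-point standard deviations sigma."""
    return float(np.sum(((t_obs - t_pred) / sigma) ** 2))

d = np.array([40.0, 60.0, 100.0, 140.0, 200.0])
t = np.exp(0.02 * d + 4.0)        # synthetic exponential growth in d
a, b = fit_exponential(d, t)
c, e = fit_power_law(d, t)
sigma = 0.1 * t                   # assumed 10% error bars
print(chi2(t, np.exp(a * d + b), sigma) < chi2(t, c * d**e, sigma))  # True
```

With exactly exponential data the exponential hypothesis yields a vanishing $\chi^{2}$ while the power law does not, mirroring the comparison reported in Table 2.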
In Fig. 10, we report the test loss of ADAM as a function of the number of gradient updates used for training, for various dimensions and choices of the readout distribution (as before, the readouts are not learnt but fixed to the teacher’s). Here, we fix the batch size for simplicity. For both homogeneous (${\mathbf{v}}=\bm{1}$) and Rademacher readouts (left and centre panels), the model experiences plateaux in performance whose length increases with the system size, in accordance with the exponential complexity observed above. The plateaux occur at test-loss values comparable to twice the Bayes error predicted by the universal branch of our theory (recall the relationship between Gibbs and Bayes errors reported in App. C). The curves are smoother in the case of Gaussian readouts.
Hamiltonian Monte Carlo
<details>
<summary>x26.png Details</summary>

### Visual Description
Overlap-like metric versus HMC step (0 to 4000) for d = 120 to 240, with dashed reference lines labelled “theory” (~0.95) and “universal” (~0.90). The curves rise steeply from ~0.80 over the first ~1000 steps and then plateau between the two reference levels, with larger d plateauing closer to the “theory” line.
</details>
<details>
<summary>x27.png Details</summary>

### Visual Description
Same kind of plot over 0 to 2000 HMC steps (metric range 0.75 to 0.95) for d = 120 to 240, with “theory” (~0.95, red dashed) and “universal” (~0.85, black dashed) reference lines. The curves rise rapidly and plateau, with larger d plateauing closer to the “theory” line.
</details>
*(Figure panel x28.png: $q_{2}$ trajectories vs. HMC step for $d=120,\dots,240$; higher $d$ reaches the plateau faster, with dashed "theory" and "universal" reference lines.)*
Figure 11: Trajectories of the overlap $q_{2}$ in HMC runs initialised uninformatively for the polynomial activation $\sigma_{3}={\rm He}_{2}/\sqrt{2}+{\rm He}_{3}/6$ with parameters $\Delta=0.1$ for the linear readout, $\gamma=0.5$ and $\alpha=1.0$ . Left: homogeneous readouts. Centre: Rademacher readouts. Right: Gaussian readouts. Points on the solid lines are obtained by averaging over 10 teacher/data instances, and the shaded regions around them correspond to one standard deviation. Notice that the $y$ -axes are limited for better visualisation. For the left and centre plots, any threshold (horizontal line in the plot) between the prediction of the $\mathcal{Q}_{W}\equiv 0$ branch of our theory (black dashed line) and its prediction for the Bayes-optimal $q_{2}$ (red dashed line) crosses the curves at points $t^{*}(d)$ that are more compatible with an exponential fit (see Fig. 12 and Table 3, where these fits are reported and $\chi^{2}$ -tested). For homogeneous and Rademacher readouts, the value of the overlap at which the dynamics slows down (predicted by the $\mathcal{Q}_{W}\equiv 0$ branch) is in quantitative agreement with the theoretical predictions (lower dashed line). The theory is instead off by $\approx 1\%$ for the values of $q_{2}$ at which the runs ultimately converge.
The experiment is performed for the polynomial activation $\sigma_{3}={\rm He}_{2}/\sqrt{2}+{\rm He}_{3}/6$ with parameters $\Delta=0.1$ for the Gaussian noise in the linear readout, $\gamma=0.5$ and $\alpha=1.0>\alpha_{\rm sp}$ ( $\alpha_{\rm sp}=0.26,0.30,0.02$ for homogeneous, Rademacher and Gaussian readouts respectively). Our HMC consists of $4000$ iterations for homogeneous readouts, or $2000$ iterations for Rademacher and Gaussian readouts. Each iteration is adaptive (with initial step size of $0.01$ ) and uses $10$ leapfrog steps. Instead of measuring the Gibbs error, whose relationship with $\varepsilon^{\rm opt}$ holds only at equilibrium (see the last remark in App. C), we measure the teacher-student $q_{2}$ -overlap, which is meaningful at any HMC step and is informative about the learning. For a fixed threshold $q_{2}^{*}$ and dimension $d$ , we measure $t^{*}(d)$ as the number of HMC iterations needed for the $q_{2}$ -overlap between the HMC sample (obtained from uninformative initialisation) and the teacher weights ${\mathbf{W}}^{0}$ to cross the threshold. This criterion is again enough to certify that the student outperforms the universal solution.
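The threshold-crossing criterion above can be sketched as follows. This is a minimal illustration with hypothetical helper names, taking as a stand-in for $q_{2}$ the normalised Frobenius overlap between the student and teacher Gram matrices; the precise order parameter is the one defined in the main text.

```python
import numpy as np

def q2_overlap(W, W0):
    """Stand-in for the teacher-student q2-overlap: normalised Frobenius
    overlap of the Gram matrices W W^T and W0 W0^T (illustrative only;
    the paper's order parameter is defined in the main text)."""
    G, G0 = W @ W.T, W0 @ W0.T
    return float(np.sum(G * G0) / (np.linalg.norm(G) * np.linalg.norm(G0)))

def first_crossing(q2_trajectory, q2_star):
    """t*(d): first HMC iteration at which the q2 trajectory crosses the
    threshold q2_star; None if the threshold is never reached."""
    for t, q in enumerate(q2_trajectory):
        if q >= q2_star:
            return t
    return None
```

At equal weights the overlap is exactly $1$, and `first_crossing` returns the iteration index used for the $t^{*}(d)$ fits.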
As before, we test homogeneous, Rademacher and Gaussian readouts, reaching the same conclusions: while for homogeneous and Rademacher readouts an exponential time is more compatible with the observations, the experiments remain inconclusive for Gaussian readouts (see Fig. 12). We report in Fig. 11 the values of the overlap $q_{2}$ measured along the HMC runs for different dimensions. Note that, as the HMC steps accumulate, all $q_{2}$ curves saturate to a value that is off by $\approx 1\%$ w.r.t. the one predicted by our theory for the selected values of $\alpha,\gamma$ and $\Delta$ . Whether this is a finite-size effect, or an effect not taken into account by the current theory, is an interesting question requiring further investigation; see App. E.2 for possible directions.
*(Figure panel x29.png: number of HMC steps vs. dimension, semilog scale; exponential-fit slopes $0.0167$, $0.0175$, $0.0174$ for thresholds $q_{2}^{*}=0.903$, $0.906$, $0.909$.)*
*(Figure panel x30.png: number of HMC steps vs. dimension, log-log scale; power-law-fit slopes $2.4082$, $2.5207$, $2.5297$ for thresholds $q_{2}^{*}=0.903$, $0.906$, $0.909$.)*
*(Figure panel x31.png: number of HMC steps vs. dimension, semilog scale; exponential-fit slopes $0.0136$, $0.0140$, $0.0138$ for thresholds $q_{2}^{*}=0.897$, $0.904$, $0.911$.)*
*(Figure panel x32.png: number of HMC steps vs. dimension, log-log scale; power-law-fit slopes $1.9791$, $2.0467$, $2.0093$ for thresholds $q_{2}^{*}=0.897$, $0.904$, $0.911$.)*
*(Figure panel x33.png: number of HMC steps vs. dimension, semilog scale; exponential-fit slopes $0.0048$, $0.0058$, $0.0065$ for thresholds $q_{2}^{*}=0.940$, $0.945$, $0.950$.)*
*(Figure panel x34.png: number of HMC steps vs. dimension, log-log scale; power-law-fit slopes $0.7867$, $0.9348$, $1.0252$ for thresholds $q_{2}^{*}=0.940$, $0.945$, $0.950$.)*
Figure 12: Semilog (Left) and log-log (Right) plots of the number of Hamiltonian Monte Carlo steps needed to achieve an overlap $q_{2}^{*}>q_{2}^{\rm uni}$ , which certifies that the universal solution is outperformed. The dataset was generated from a teacher with polynomial activation $\sigma_{3}={\rm He}_{2}/\sqrt{2}+{\rm He}_{3}/6$ and parameters $\Delta=0.1$ for the linear readout, $\gamma=0.5$ and $\alpha=1.0>\alpha_{\rm sp}$ ( $\alpha_{\rm sp}=0.26,0.30,0.02$ for homogeneous, Rademacher and Gaussian readouts respectively). Student weights are sampled using HMC (initialised uninformatively) with $4000$ iterations for homogeneous readouts (Top row, for which $q_{2}^{\rm uni}=0.883$ ), or $2000$ iterations for Rademacher (Centre row, with $q_{2}^{\rm uni}=0.868$ ) and Gaussian readouts (Bottom row, for which $q_{2}^{\rm uni}=0.903$ ). Each iteration is adaptive (with initial step size of $0.01$ ) and uses $10$ leapfrog steps. $q_{2}^{\rm sp}=0.941,0.948,0.963$ in the three cases. The readouts are kept fixed during training. Points are obtained by averaging over 10 teacher/data instances, with error bars representing the standard deviation.
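For concreteness, each HMC iteration above integrates the Hamiltonian dynamics with $10$ leapfrog steps. The following is a generic textbook sketch of the leapfrog integrator, not the exact implementation used in the experiments; `grad_log_post` is a hypothetical callable returning the gradient of the log-posterior in the student weights.

```python
import numpy as np

def leapfrog(theta, p, grad_log_post, step_size, n_steps=10):
    """One HMC proposal: n_steps leapfrog steps of the Hamiltonian dynamics
    (theta = weights, p = momenta). Generic sketch, not the paper's code."""
    theta, p = theta.copy(), p.copy()
    p += 0.5 * step_size * grad_log_post(theta)      # initial half momentum step
    for _ in range(n_steps - 1):
        theta += step_size * p                       # full position step
        p += step_size * grad_log_post(theta)        # full momentum step
    theta += step_size * p                           # last full position step
    p += 0.5 * step_size * grad_log_post(theta)      # final half momentum step
    return theta, p
```

The "adaptive" iterations mentioned in the caption would additionally tune `step_size` from its initial value of $0.01$, e.g. to target a given acceptance rate.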
| Readouts | Thresholds $q_{2}^{*}$ | Exponential fit $\chi^{2}$ | Power-law fit $\chi^{2}$ |
| --- | --- | --- | --- |
| Homogeneous | $\{0.903, 0.906, 0.909\}$ | $\bm{2.22}$, $\bm{1.47}$, $\bm{1.14}$ | $8.01$, $7.25$, $6.35$ |
| Rademacher | $\{0.897, 0.904, 0.911\}$ | $\bm{1.88}$, $\bm{2.12}$, $\bm{1.70}$ | $8.10$, $7.70$, $8.57$ |
| Gaussian | $\{0.940, 0.945, 0.950\}$ | $0.66$, $\bm{0.44}$, $\bm{0.26}$ | $\bm{0.62}$, $0.53$, $0.39$ |
Table 3: $\chi^{2}$ test for exponential and power-law fits of the time needed by Hamiltonian Monte Carlo to reach the thresholds $q_{2}^{*}$ , for various priors on the readouts. For each row, we report three values of the $\chi^{2}$ test per hypothesis, corresponding to the thresholds $q_{2}^{*}$ on the left, in the given order. The fits are displayed in Figure 12. Smaller values of $\chi^{2}$ (in bold, for a given threshold and readout prior) indicate better compatibility with the corresponding hypothesis.
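The exponential-versus-power-law comparison amounts to weighted linear fits of $\log t^{*}$ against $d$ (semilog) and against $\log d$ (log-log), ranked by their $\chi^{2}$. A minimal sketch with hypothetical function names, using the standard error propagation $\delta \log t^{*} \approx \delta t^{*}/t^{*}$:

```python
import numpy as np

def chi2_linear_fit(x, y, yerr):
    """Weighted least squares for y = a*x + b; returns (slope a, reduced chi^2)."""
    w = 1.0 / yerr**2
    A = np.vstack([x, np.ones_like(x)]).T
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    resid = y - A @ coef
    return coef[0], np.sum(w * resid**2) / (len(x) - 2)

def compare_fits(d, t_star, t_err):
    """chi^2 of the exponential (log t* linear in d) and power-law
    (log t* linear in log d) hypotheses for the crossing times t*(d)."""
    log_t = np.log(t_star)
    log_err = t_err / t_star                 # error propagation for the log
    _, chi2_exp = chi2_linear_fit(d, log_t, log_err)
    _, chi2_pow = chi2_linear_fit(np.log(d), log_t, log_err)
    return chi2_exp, chi2_pow
```

On data generated by a true exponential law, `compare_fits` returns a smaller $\chi^{2}$ for the exponential hypothesis, which is the pattern reported in Table 3 for homogeneous and Rademacher readouts.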