# Statistical physics analysis of graph neural networks: Approaching optimality in the contextual stochastic block model
**Authors**: Duranthon, Lenka Zdeborová
> Statistical physics of computation laboratory, École polytechnique fédérale de Lausanne, Switzerland
(November 21, 2025)
## Abstract
Graph neural networks (GNNs) are designed to process data associated with graphs. They are finding an increasing range of applications; however, as with other modern machine learning techniques, their theoretical understanding is limited. GNNs can encounter difficulties in gathering information from nodes that are far apart through iterated aggregation steps. This difficulty is partly caused by so-called oversmoothing, and overcoming it is one of the practically motivated challenges. We consider the situation where information is aggregated by multiple steps of convolution, leading to graph convolutional networks (GCNs). We analyze the generalization performance of a basic GCN, trained for node classification on data generated by the contextual stochastic block model. We predict its asymptotic performance by deriving the free energy of the problem, using the replica method, in the high-dimensional limit. Calling the number of convolutional steps the depth, we show the importance of going to large depth to approach Bayes-optimality. We detail how the architecture of the GCN has to scale with the depth to avoid oversmoothing. The resulting large-depth limit can be close to Bayes-optimality and leads to a continuous GCN. Technically, we tackle this continuous limit via an approach that resembles dynamical mean-field theory (DMFT) with constraints at the initial and final times. An expansion around large regularization allows us to solve the corresponding equations for the performance of the deep GCN. This promising tool may contribute to the analysis of further deep neural networks.
## I Introduction
### I.1 Summary of the narrative
Graph neural networks (GNNs) have emerged as the leading paradigm for learning from data associated with a graph or a network. Given the ubiquity of such data in sciences and technology, GNNs are gaining importance in their range of applications, including chemistry [1], biomedicine [2], neuroscience [3], simulating physical systems [4], particle physics [5] and solving combinatorial problems [6, 7]. As is common in modern machine learning, the theoretical understanding of learning with GNNs is lagging behind their empirical success. In the context of GNNs, one pressing question concerns their ability to aggregate information from far away parts of the graph: the performance of GNNs often deteriorates as depth increases [8]. This issue is often attributed to oversmoothing [9, 10], a situation where a multi-layer GNN averages out the relevant information. Consequently, mostly relatively shallow GNNs are used in practice, or other strategies are designed to avoid oversmoothing [11, 12].
Understanding the generalization properties of GNNs on unseen examples is a path towards yet more powerful models. Existing theoretical works addressed the generalization ability of GNNs mainly by deriving generalization bounds, with a minimal set of assumptions on the architecture and on the data, relying on VC dimension, Rademacher complexity or a PAC-Bayesian analysis; see for instance [13] and the references therein. Works along these lines that considered settings related to ours include [14], [15] and [16]. However, they only derive loose bounds for the test performance of the GNN and do not provide insight into the effect of the structure of the data. [14] provides sharper bounds; yet they do not take into account the data structure and depend on continuity constants that cannot be determined a priori. In order to provide more actionable outcomes, the interplay between the architecture of the GNN, the training algorithm and the data needs to be understood better, ideally including constant factors characterizing their dependencies on the variety of parameters.
Statistical physics traditionally plays a key role in understanding the behaviour of complex dynamical systems in the presence of disorder. In the context of neural networks, the dynamics refers to the training, and the disorder refers to the data used for learning. In the case of GNNs, the data is related to a graph. The statistical physics research strategy defines models that are simplified and allow analytical treatment. One models both the data generative process, and the learning procedure. A key ingredient is a properly defined thermodynamic limit in which quantities of interest self-average. One then aims to derive a closed set of equations for the quantities of interest, akin to obtaining exact expressions for free energies from which physical quantities can be derived. While numerous other research strategies are followed in other theoretical works on GNNs, see above, the statistical physics strategy is the main one accounting for constant factors in the generalization performance and as such provides invaluable insight about the properties of the studied systems. This line of research has been very fruitful in the context of fully connected feed-forward neural networks, see e.g. [17, 18, 19]. It is reasonable to expect that also in the context of GNNs this strategy will provide new actionable insights.
The analysis of generalization of GNNs in the framework of the statistical physics strategy was initiated recently in [20], where the authors studied the performance of a single-layer graph convolutional neural network (GCN) applied to data coming from the so-called contextual stochastic block model (CSBM). The CSBM, introduced in [21, 22], is particularly suited as a prototypical generative model for graph-structured data where each node belongs to one of several groups and is associated with a vector of attributes. The task is then the classification of the nodes into groups. Such data are used by practitioners as a benchmark for the performance of GNNs [15, 23, 24, 25]. On the theoretical side, the follow-up work [26] generalized the analysis of [20] to a broader class of loss functions but also pointed out the relatively large gap between the performance of a single-layer GCN and the Bayes-optimal performance.
In this paper, we show that the closed-form analysis of training a GCN on data coming from the CSBM can be extended to networks performing multiple layers of convolutions. With a properly tuned regularization and strength of the residual connection, this allows us to approach the Bayes-optimal performance very closely. Our analysis sheds light on the interplay between the different parameters (mainly the depth, the strength of the residual connection and the regularization) and on how to select the values of the parameters to mitigate oversmoothing. On a technical level, the analysis relies on the replica method, with the limit of large depth leading to a continuous formulation similar to neural ordinary differential equations [27] that can be treated analytically via an approach that resembles dynamical mean-field theory, with the position in the network playing the role of time. We anticipate that this type of infinite-depth analysis can be generalized to studies of other deep networks with residual connections such as residual networks or multi-layer attention networks.
### I.2 Further motivations and related work
#### I.2.1 Graph neural networks:
In this work we focus on graph neural networks (GNNs). GNNs are neural networks designed to work on data that can be represented as graphs, such as molecules, knowledge graphs extracted from encyclopedias, interactions among proteins or social networks. GNNs can predict properties at the level of nodes, edges or the whole graph. Given a graph $G$ over $N$ nodes, its adjacency matrix $A \in \mathbb{R}^{N \times N}$ and initial features $h_i^{(0)} \in \mathbb{R}^M$ on each node $i$, a GNN can be expressed as the mapping
$$
h_i^{(k+1)} = f_{\theta^{(k)}}\left( h_i^{(k)}, \mathrm{aggreg}\left( \{ h_j^{(k)}, \, j \sim i \} \right) \right) \tag{1}
$$
for $k=0,\dots,K-1$, with $K$ being the depth of the network, where $f_{\theta^{(k)}}$ is a learnable function of parameters $\theta^{(k)}$ and $\mathrm{aggreg}()$ is a function that aggregates the features of the neighboring nodes in a permutation-invariant way. A common choice is the sum function, akin to a convolution on the graph
$$
\mathrm{aggreg}(\{ h_j, \, j \sim i \}) = \sum_{j \sim i} h_j = (Ah)_i . \tag{2}
$$
Given this choice of aggregation, the GNN is called a graph convolutional network (GCN) [28]. For a GNN of depth $K$ the transformed features $h^{(K)} \in \mathbb{R}^{M'}$ can be used to predict the properties of the nodes, the edges or the graph by a learnt projection.
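As an illustration of eqs. (1)-(2), the following minimal NumPy sketch performs one sum-aggregation step on a toy path graph; the linear map `W` is a simple stand-in for $f_\theta$ (our choice for this example, not part of the general definition).

```python
import numpy as np

def gcn_step(A, H, W):
    """One sum-aggregation step of eqs. (1)-(2): each node receives the
    summed features of its neighbours, then applies a linear map W
    (a simple illustrative choice of f_theta)."""
    # (A @ H)[i] = sum of h_j over neighbours j ~ i, as in eq. (2)
    return (A @ H) @ W

# Toy example: a 3-node path graph with 2-dimensional features.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.eye(3, 2)   # initial features h^{(0)}
W = np.eye(2)      # identity map, for illustration only
H1 = gcn_step(A, H, W)
```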
We will consider a GCN with the following architecture, which we define more precisely in the detailed setting, part II. We consider one trainable layer $w \in \mathbb{R}^M$, since dealing with multiple layers of learnt weights is still a major issue [29], and since we want to focus on modeling the impact of numerous convolution steps on the generalization ability of the GCN.
$$
h^{(k+1)} = \left( \frac{1}{\sqrt{N}} \tilde{A} + c_k I_N \right) h^{(k)} , \qquad h^{(0)} = X , \qquad \hat{y} = \frac{1}{\sqrt{N}} h^{(K)} w \tag{3}
$$
where $\tilde{A}$ is a rescaling of the adjacency matrix, $I_N$ is the identity, $c_k \in \mathbb{R}$ for all $k$ are the residual connection strengths and $\hat{y} \in \mathbb{R}^N$ are the predicted labels of the nodes. We will call the number of layers $K$ the depth, but we reiterate that only the layer $w$ is learned.
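The forward pass of eq. (3) can be sketched in a few lines. Since the network is linear, the learnt projection $w$ may be applied first without changing the output; this is an illustrative sketch, not the code provided in the supplementary material.

```python
import numpy as np

def simple_gcn(A_tilde, X, w, c):
    """Forward pass of the linear GCN of eq. (3): project the features with
    the learnt layer w, then apply K convolution steps
    h <- A_tilde h / sqrt(N) + c_k h, where c lists the residual strengths."""
    N = A_tilde.shape[0]
    h = X @ w / np.sqrt(N)        # learnt projection, applied first by linearity
    for c_k in c:                 # K = len(c) convolution steps, no learnt weights
        h = A_tilde @ h / np.sqrt(N) + c_k * h
    return h                      # predicted labels, one per node
```

Because all the factors are polynomials in $\tilde{A}$, the order of the $K$ steps is irrelevant and the output is linear in $w$.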
#### I.2.2 Analyzable model of synthetic data:
Modeling the training data is a starting point to derive sharp predictions. A popular model of attributed graphs, which we consider in the present work and define in detail in sec. II.1, is the contextual stochastic block model (CSBM), introduced in [21, 22]. It consists of $N$ nodes with labels $y \in \{-1,+1\}^N$, of a binary stochastic block model (SBM) modeling the adjacency matrix $A \in \mathbb{R}^{N \times N}$, and of features (or attributes) $X \in \mathbb{R}^{N \times M}$ defined on the nodes and drawn according to a Gaussian mixture. The labels $y$ have to be recovered given $A$ and $X$. The inference is done in a semi-supervised way, in the sense that one also has access to a training subset of $y$.
A key aspect in statistical physics is the thermodynamic limit: how $N$ and $M$ should scale together. In statistical physics we always aim at a scaling in which quantities of interest concentrate around deterministic values, and the performance of the system ranges from as bad as random guessing to as good as perfect learning. As we will see, these two requirements are satisfied in the high-dimensional limit $N \to \infty$ and $M \to \infty$ with $\alpha = N/M$ of order one. This scaling limit also aligns well with the common graph datasets that are of interest in practice, for instance Cora [30] ($N \approx 3 \cdot 10^3$ and $M \approx 3 \cdot 10^3$), Coauthor CS [31] ($N \approx 2 \cdot 10^4$ and $M \approx 7 \cdot 10^3$), CiteSeer [32] ($N \approx 4 \cdot 10^3$ and $M \approx 3 \cdot 10^3$) and PubMed [33] ($N \approx 2 \cdot 10^4$ and $M \approx 5 \cdot 10^2$).
A series of works builds on the CSBM with a lower dimensionality of the features, that is $M = o(N)$. The authors of [34] consider a one-layer GNN trained on the CSBM by logistic regression and derive bounds for the test loss; however, they analyze its generalization ability on new graphs that are independent of the training graph and do not give exact predictions. The authors of [35] propose an architecture of GNN that is optimal on the CSBM with low-dimensional features, among classifiers that process local tree-like neighborhoods, and derive its generalization error. In [36] the authors analyze the structure and the separability of the convolved data $\tilde{A}^K X$, for different rescalings $\tilde{A}$ of the adjacency matrix, and provide a bound on the classification error. Compared to our work, these articles consider a low-dimensional setting ([35]) where the dimension of the features $M$ is constant, or a setting where $M$ is negligible compared to $N$ ([34] and [36]).
#### I.2.3 Tight prediction on GNNs in the high-dimensional limit:
Little has been done as to tightly predicting the performance of GNNs in the high-dimensional limit where both the size of the graph and the dimensionality of the features diverge proportionally. The only pioneering references in this direction we are aware of are [20] and [26], where the authors consider a simple single-layer GCN that performs only one step of convolution, $K=1$ , trained on the CSBM in a semi-supervised setting. In these works the authors express the performance of the trained network as a function of a finite set of order parameters following a system of self-consistent equations.
There are two important motivations to extend these works and to consider GCNs with a higher depth $K$. First, the GNNs that are used in practice almost always perform several steps of aggregation, and a more realistic model should take this into account. Second, [26] shows that the GCN it considers is far from the Bayes-optimal (BO) performance and the Bayes-optimal rate for all common losses. The BO performance is the best that any algorithm can achieve knowing the distribution of the data, and the BO rate is the rate of convergence toward perfect inference when the signal strength of the graph grows to infinity. Such a gap is intriguing in the sense that previous works [37, 38] show that a simple one-layer fully-connected neural network can reach or be very close to Bayes-optimality on simple synthetic datasets, including Gaussian mixtures. A plausible explanation is that on the CSBM considering only one step of aggregation, $K=1$, is not enough to retrieve all the information, and one has to aggregate information from further nodes. Consequently, even on this simple dataset, introducing depth and considering a GCN with several convolution layers, $K>1$, is crucial.
In the present work we study the effect of the depth $K$ of the convolution on the generalization ability of a simple GCN. A first part of our contribution consists in deriving the exact performance of a GCN performing several steps of convolution, trained on the CSBM, in the high-dimensional limit. We show that $K=2$ is the minimal number of steps to reach the BO learning rate. As to the performance at moderate signal strength, it appears that, if the architecture is well tuned, going to larger and larger $K$ increases the performance until it reaches a limit. This limit, if the adjacency matrix is symmetrized, can be close to Bayes-optimality. This is illustrated in fig. 1, which highlights the importance of numerous convolution layers.
Figure 1: Test accuracy of the graph neural network on data generated by the contextual stochastic block model vs the signal strength. We define the model and the network in section II. The test accuracy is maximized over all the hyperparameters of the network. The Bayes-optimal performance is from [39]. The line $K=1$ was studied in [20, 26]; we extend the analysis to $K>1$, $K=\infty$ and symmetrized graphs. All the curves are theoretical predictions derived in this work.
#### I.2.4 Oversmoothing and residual connections:
Going to larger depth $K$ is essential to obtain better performance. Yet, GNNs used in practice can be quite shallow, because of several difficulties encountered at increasing depth, such as vanishing gradients, which are not specific to graph neural networks, or oversmoothing [9, 10]. Oversmoothing refers to the fact that the GNN tends to act like a low-pass filter on the graph and to smooth the features $h_i$, which after too many steps may converge to the same vector for every node. A few steps of aggregation are beneficial, but too many degrade the performance, as [40] shows for a simple GNN, close to the one we study, on a particular model. In the present work we show that the model we consider can suffer from oversmoothing at increasing $K$ if its architecture is not well tuned, and we precisely quantify this effect.
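A toy experiment illustrates the phenomenon: repeated mean-aggregation on a small ring graph (a hypothetical example, not the CSBM setting analyzed in this paper) collapses all node features onto a common vector.

```python
import numpy as np

# Oversmoothing sketch: repeated mean-aggregation over a ring graph with
# self-loops drives every node's features toward the same vector.
N = 10
A = np.eye(N) + np.roll(np.eye(N), 1, axis=1) + np.roll(np.eye(N), -1, axis=1)
P = A / A.sum(axis=1, keepdims=True)   # row-normalized aggregation (a low-pass filter)
rng = np.random.default_rng(0)
H = rng.standard_normal((N, 4))        # initial node features
for _ in range(100):
    H = P @ H                          # 100 aggregation steps, no residual connections
spread = H.std(axis=0).max()           # across-node spread of each feature, near zero
```

The eigenvalues of `P` other than 1 have modulus below 1, so all non-uniform components of `H` decay geometrically with the number of steps.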
A way to mitigate vanishing gradient and oversmoothing is to allow the nodes to remember their initial features $h_i^(0)$ . This is done by adding residual (or skip) connections to the neural network, so the update function becomes
$$
h_i^{(k+1)} = c_k h_i^{(k)} + f_{\theta^{(k)}}\left( h_i^{(k)}, \mathrm{aggreg}\left( \{ h_j^{(k)}, \, j \sim i \} \right) \right) \tag{5}
$$
where the $c_k$ modulate the strength of the residual connections. The resulting architecture is known as a residual network or resnet [41] in the context of fully-connected and convolutional neural networks. As for GNNs, architectures with residual connections have been introduced in [42] and used in [11, 12] to reach large numbers of layers with competitive accuracy. [43] additionally shows that residual connections help gradient descent. In the setting we consider, we prove that residual connections are necessary to circumvent oversmoothing, to go to larger $K$ and to improve the performance.
#### I.2.5 Continuous neural networks:
Continuous neural networks can be seen as the natural limit of residual networks, when the depth $K$ and the residual connection strengths $c_k$ go to infinity proportionally, if $f_{\theta^{(k)}}$ is smooth enough with respect to $k$. In this limit, rescaling $h^{(k+1)}$ by $c_k$ and setting $x = k/K$ and $c_k = K/t$, the rescaled $h$ satisfies the differential equation
$$
\frac{dh_i}{dx}(x) = t\, f_{\theta(x)}\left( h_i(x), \mathrm{aggreg}\left( \{ h_j(x), \, j \sim i \} \right) \right) . \tag{6}
$$
This equation is called a neural ordinary differential equation [27]. The convergence of a residual network to a continuous limit has been studied for instance in [44]. Continuous neural networks are commonly used to model and learn the dynamics of time-evolving systems, usually by taking the update function $f_\theta$ independent of the time $t$. For example, [45] uses a continuous fully-connected neural network to model turbulence in a fluid. As such, they are a building block of scientific machine learning; see for instance [46] for several applications. As to the generalization ability of continuous neural networks, the only theoretical work we are aware of is [47], which derives loose bounds based on continuity arguments.
Continuous neural networks have been extended to continuous GNNs in [48, 49]. For the GCN that we consider, the residual connections are implemented by adding self-loops $c_k I_N$ to the graph. The continuous dynamics of $h$ is then
$$
\frac{dh}{dx}(x) = t \tilde{A} h(x) , \tag{7}
$$
with $t \in \mathbb{R}$; this is a diffusion on the graph. Other types of dynamics have been considered, such as anisotropic diffusion, where the diffusion factors are learnt, or oscillatory dynamics, which should also avoid oversmoothing; see for instance the review [50] for more details. No prior works predict their generalization ability. In this work we fill this gap by deriving the performance of the continuous limit of the simple GCN we consider.
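The connection between the discrete residual updates and the diffusion of eq. (7) can be checked numerically: with $c_k = K/t$, the rescaled product of $K$ update steps converges to a matrix exponential as $K$ grows. A sketch with a generic random matrix `M` standing in for $\tilde{A}/\sqrt{N}$:

```python
import numpy as np
from scipy.linalg import expm

# K steps of (c_k I + M) with c_k = K/t amount, after rescaling by c_k,
# to (I + (t/K) M)^K, which converges to expm(t M) as K -> infinity.
rng = np.random.default_rng(0)
N, t = 6, 0.7
M = rng.standard_normal((N, N)) / np.sqrt(N)   # stand-in for A_tilde / sqrt(N)

K = 1000
discrete = np.linalg.matrix_power(np.eye(N) + t * M / K, K)
continuous = expm(t * M)
err = np.abs(discrete - continuous).max()      # O(1/K) discretization error
```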
### I.3 Summary of the main results:
We first generalize the work of [20, 26] to predict the performance of a simple GCN with an arbitrary number $K$ of convolution steps. The network is trained in a semi-supervised way on data generated by the CSBM for node classification. In the high-dimensional limit and in the limit of dense graphs, the main properties of the trained network concentrate onto deterministic values that do not depend on the particular realization of the data. The network is described by a few order parameters (or summary statistics) that satisfy a set of self-consistent equations, which we solve analytically or numerically. We thus have access to the expected train and test errors and accuracies of the trained network.
From these predictions we draw several consequences. Our main guiding line is to search for the architecture and the hyperparameters of the GCN that maximize its performance, and to check whether the optimal GCN can reach the Bayes-optimal performance on the CSBM. The main parameters we consider are the depth $K$, the residual connection strengths $c_k$, the regularization $r$ and the loss function.
We consider the convergence rates towards perfect inference at large graph signal. We show that $K=2$ is the minimal depth to reach the Bayes-optimal rate, after which increasing $K$ or fine-tuning $c_k$ only leads to sub-leading improvements. In the case of asymmetric graphs the GCN is not able to deal with the asymmetry, for any $K$ or $c_k$, and one has to pre-process the graph by symmetrizing it.
At finite graph signal the behaviour of the GCN is more complex. We find that large regularization $r$ maximizes the test accuracy in the case we consider, while the loss has little effect. The residual connection strengths $c_k$ have to be tuned to a same optimal value $c$ that depends on the properties of the graph.
An important point is that going to larger $K$ seems to improve the test accuracy. Yet the residual connection $c$ has to vary accordingly. If $c$ stays constant with respect to $K$ then the GCN will perform PCA on the graph $A$ , oversmooth and discard the information from the features $X$ . Instead, if $c$ grows with $K$ , the residual connections alleviate oversmoothing and the performance of the GCN keeps increasing with $K$ , if the diffusion time $t=K/c$ is well tuned.
The limit $K \to \infty$, $c \propto K$ is thus of particular interest. It corresponds to a continuous GCN performing diffusion on the graph. Our analysis can be extended to this case by directly taking the limit in the self-consistent equations. One further has to expand them jointly around $r \to +\infty$; we keep the first order. In the end we predict the performance of the continuous GCN in an explicit and closed form. To our knowledge this is the first tight prediction of the generalization ability of a continuous neural network, and in particular of a continuous graph neural network. The large regularization limit $r \to +\infty$ is important: on the one hand, it appears to lead to the optimal performance of the neural network; on the other hand, it is instrumental in analyzing the continuous limit $K \to \infty$ and it allows us to solve analytically the self-consistent equations describing the neural network.
We show that the continuous GCN at optimal time $t$ performs better than any finite- $K$ GCN. The optimal $t$ depends on the properties of the graph, and can be negative for heterophilic graphs. This result is a step toward solving one of the major challenges identified by [8]; that is, creating benchmarks where depth is necessary and building efficient deep networks.
The continuous GCN at large $r$ is optimal. Moreover, if run on the symmetrized graph, it approaches Bayes-optimality on a broad range of configurations of the CSBM, as exemplified in fig. 1. We identify when the GCN fails to approach Bayes-optimality: this happens when most of the information is contained in the features and not in the graph, and has to be processed in an unsupervised manner.
We provide the code that allows one to evaluate our predictions in the supplementary material.
## II Detailed setting
### II.1 Contextual Stochastic Block Model for attributed graphs
We consider the problem of semi-supervised node classification on an attributed graph, where the nodes have labels and carry additional attributes, or features, and where the structure of the graph correlates with the labels. We consider a graph $G$ made of $N$ nodes; each node $i$ has a binary label $y_i=± 1$ that is a Rademacher random variable.
The structure of the graph should be correlated with $y$. We model the graph with a binary stochastic block model (SBM): the adjacency matrix $A \in \mathbb{R}^{N \times N}$ is drawn according to
$$
A_{ij} \sim \mathcal{B}\left( \frac{d}{N} + \frac{\lambda}{\sqrt{N}} \sqrt{ \frac{d}{N}\left( 1 - \frac{d}{N} \right) }\, y_i y_j \right) \tag{8}
$$
where $\lambda$ is the signal-to-noise ratio (snr) of the graph, $d$ is the average degree of the graph, $\mathcal{B}$ is a Bernoulli law and the elements $A_{ij}$ are independent for all $i$ and $j$. It can be interpreted in the following manner: an edge between $i$ and $j$ appears with a higher probability if $\lambda y_i y_j > 0$, i.e. for $\lambda > 0$ if the two nodes are in the same group. The scaling with $d$ and $N$ is chosen so that this model does not have a trivial limit at $N \to \infty$, both for $d = \Theta(1)$ and $d = \Theta(N)$. Notice that we take $A$ asymmetric.
In addition to the graph, each node $i$ carries attributes $X_i \in \mathbb{R}^M$, which we collect in the matrix $X \in \mathbb{R}^{N \times M}$. We define $\alpha = N/M$ as the aspect ratio between the number of nodes and the dimension of the features. We model the attributes by a Gaussian mixture: we draw $M$ hidden Gaussian variables $u_\nu \sim \mathcal{N}(0,1)$, forming the centroid $u \in \mathbb{R}^M$, and we set
$$
X = \sqrt{\frac{\mu}{N}}\, y u^T + W \tag{9}
$$
where $\mu$ is the snr of the features and $W$ is noise whose components $W_{i\nu}$ are independent standard Gaussians. We use the notation $\mathcal{N}(m,V)$ for a Gaussian distribution or density of mean $m$ and variance $V$. The whole model for $(y, A, X)$ is called the contextual stochastic block model (CSBM) and was introduced in [21, 22].
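Sampling from the CSBM of eqs. (8)-(9) is straightforward; a minimal sketch follows (the parameter values in the example are chosen so that eq. (8) yields valid probabilities).

```python
import numpy as np

def sample_csbm(N, M, d, lam, mu, rng):
    """Draw (y, A, X) from the CSBM of eqs. (8)-(9); a minimal sketch."""
    y = rng.choice([-1.0, 1.0], size=N)     # Rademacher labels
    u = rng.standard_normal(M)              # hidden centroid
    # Edge probabilities: d/N plus a +/- shift of order lambda/sqrt(N), eq. (8);
    # lam must be small enough that p stays in [0, 1].
    p = d / N + lam / np.sqrt(N) * np.sqrt(d / N * (1 - d / N)) * np.outer(y, y)
    A = (rng.random((N, N)) < p).astype(float)   # asymmetric, as in the text
    X = np.sqrt(mu / N) * np.outer(y, u) + rng.standard_normal((N, M))  # eq. (9)
    return y, A, X

rng = np.random.default_rng(0)
y, A, X = sample_csbm(N=200, M=100, d=50, lam=1.0, mu=1.0, rng=rng)
```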
We consider the task of inferring the labels $y$ given a subset of them. We define the training set $R$ as the set of nodes whose labels are revealed; $\rho = |R|/N$ is the training ratio. The test set $R'$ is selected from the complement of $R$; we define the testing ratio $\rho' = |R'|/N$. We assume that $R$ and $R'$ are independent of the other quantities. The inference problem is to recover $y$ and $u$ given $A$, $X$, $R$ and the parameters of the model.
[22, 51] prove that the effective snr of the CSBM is
$$
\mathrm{snr}_{\mathrm{CSBM}} = \lambda^2 + \mu^2/\alpha , \tag{10}
$$
in the sense that in the unsupervised regime $\rho = 0$, for $\mathrm{snr}_{\mathrm{CSBM}} < 1$ no information on the labels can be recovered, while for $\mathrm{snr}_{\mathrm{CSBM}} > 1$ partial information can be recovered. The information given by the graph is $\lambda^2$ while the information given by the features is $\mu^2/\alpha$. As soon as a finite fraction of nodes $\rho > 0$ is revealed, the phase transition between no recovery and weak recovery disappears.
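The effective snr of eq. (10) is a one-liner; for instance (an illustrative choice of parameters), $\lambda = 0.8$, $\mu = 1$, $\alpha = 2$ gives $0.64 + 0.5 = 1.14 > 1$, so at $\rho = 0$ weak recovery is possible even though neither source of information suffices alone.

```python
def snr_csbm(lam, mu, alpha):
    """Effective signal-to-noise ratio of the CSBM, eq. (10):
    lambda^2 from the graph plus mu^2/alpha from the features."""
    return lam**2 + mu**2 / alpha

# At rho = 0, weak recovery is possible only when this exceeds 1.
combined = snr_csbm(0.8, 1.0, 2.0)   # 0.64 + 0.5 = 1.14 > 1
```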
We work in the high-dimensional limit $N→∞$ and $M→∞$ while the aspect ratio $α=N/M$ is of order one. The average degree $d$ should be of order $N$ , but taking $d$ growing with $N$ should be sufficient for our results to hold, as shown by our experiments. The other parameters $λ$ , $μ$ , $ρ$ and $ρ^\prime$ are of order one.
### II.2 Analyzed architecture
In this work, we focus on the role of applying several data-aggregation steps. With the current theoretical tools, a tight analysis of the generic GNN described in eq. (1) is not possible: dealing with multiple layers of learnt weights is hard, and even for a fully-connected two-layer perceptron this is a current and major topic [29]. Instead, we consider a one-layer GNN with a learnt projection $w$. We focus on graph convolutional networks (GCNs) [28], where the aggregation is a convolution performed by applying powers of a rescaling $\tilde{A}$ of the adjacency matrix. Lastly, we remove the non-linearities. As we will see, the fact that the GCN is linear does not prevent it from approaching optimality in some regimes. The resulting GCN is referred to as a simple graph convolutional network; it has been shown to have good performance while being much easier to train [52, 53]. The network we consider transforms the graph and the features in the following manner:
$$
h(w)=\prod_{k=1}^K\left(\frac{1}{\sqrt{N}}\tilde{A}+c_kI_N\right)\frac{1}{\sqrt{N}}Xw \tag{11}
$$
where $w∈ℝ^M$ is the layer of trainable weights, $I_N$ is the identity, $c_k∈ℝ$ is the strength of the residual connections and $\tilde{A}∈ℝ^N× N$ is a rescaling of the adjacency matrix defined by
$$
\tilde{A}_{ij}=\left(\frac{d}{N}\left(1-\frac{d}{N}\right)\right)^{-1/2}\left(A_{ij}-\frac{d}{N}\right) \quad \text{for all } i,j. \tag{12}
$$
The prediction $\hat{y}_i$ of the label of $i$ by the GNN is then $\hat{y}_i=h(w)_i$ .
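The forward pass of eqs. (11) and (12) is a short computation; a minimal sketch, not the authors' code (predictions are read off from the sign of the output):

```python
import numpy as np

def rescale_adjacency(A, d):
    """Centered, normalized adjacency of eq. (12)."""
    N = A.shape[0]
    return (A - d / N) / np.sqrt(d / N * (1 - d / N))

def gcn_output(A_tilde, X, w, cs):
    """Linear GCN of eq. (11): K convolution steps with residual
    strengths cs = (c_1, ..., c_K); a sketch, not the authors' code."""
    N = A_tilde.shape[0]
    h = (X @ w) / np.sqrt(N)                     # h_0 = Xw / sqrt(N)
    for c in cs:
        h = (A_tilde @ h) / np.sqrt(N) + c * h   # one convolution step
    return h                                     # classify node i by sign(h_i)
```

Since the network is linear in $w$, the output scales homogeneously with the weights, which is what makes a closed-form ridge solution possible later on.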
$\tilde{A}$ is a rescaling of $A$ that is centered and normalized. In the limit of dense graphs, where $d$ is large, this will allow us to rely on a Gaussian equivalence property to analyze this GCN. The equivalence [54, 22, 20] states that in the high-dimensional limit, for $d$ growing with $N$ , $\tilde{A}$ can be approximated by the following spiked matrix $A^g$ without changing the macroscopic properties of the GCN:
$$
A^g=\frac{λ}{\sqrt{N}}yy^T+Ξ , \tag{13}
$$
where the components of the $N×N$ matrix $Ξ$ are independent standard Gaussian random variables. The main reason for considering dense graphs instead of sparse graphs with $d=Θ(1)$ is to ease the theoretical analysis. The dense model can be described by a few order parameters, while a sparse SBM would be harder to analyze: many quantities, such as the degrees of the nodes, do not self-average, and one would need to take all the nodes into account, either by predicting the performance on one realization of the graph or by running population dynamics. We believe that the sparse case would lead to qualitatively similar results, as shown for instance in [55] for a related model.
The above architecture corresponds to applying $K$ graph convolutions to the projected features $Xw$. At each convolution step $k$, a node $i$ updates its features by summing those of its neighbors and adding $c_k$ times its own features. In [20, 26] the same architecture was considered for $K=1$; we generalize these works by deriving the performance of the GCN for an arbitrary number $K$ of convolution steps. As we will show, this is crucial to approach the Bayes-optimal performance.
Compared to [20, 26], another important improvement towards the Bayes-optimality is obtained by symmetrizing the graph, and we will also study the performance of the GCN when it acts by applying the symmetrized rescaled adjacency matrix $\tilde{A}^s$ defined by:
$$
\tilde{A}^s=\frac{1}{\sqrt{2}}(\tilde{A}+\tilde{A}^T) , \qquad A^{g,s}=\frac{λ^s}{\sqrt{N}}yy^T+Ξ^s . \tag{14}
$$
$A^{g,s}$ is its Gaussian equivalent, with $λ^s=\sqrt{2}λ$; $Ξ^s$ is symmetric and its components $Ξ^s_{i≤j}$ are independent standard Gaussian random variables. In this article we derive and show the performance of the GNN acting both with $\tilde{A}$ and with $\tilde{A}^s$; in a first part we mainly consider and state the expressions for $\tilde{A}$ because they are simpler, and we turn to $\tilde{A}^s$ in a second part, when taking the continuous limit. To deal with both cases, asymmetric or symmetrized, we define $\tilde{A}^e∈\{\tilde{A},\tilde{A}^s\}$ and $λ^e∈\{λ,λ^s\}$.
The continuous limit of the above network (11) is defined by
$$
h(w)=e^{\frac{t}{\sqrt{N}}\tilde{A}^e}\frac{1}{\sqrt{N}}Xw \tag{15}
$$
where $t$ is the diffusion time. It is obtained at large $K$ when the update between two convolutions becomes small, as follows:
$$
\left(\frac{t}{K\sqrt{N}}\tilde{A}^e+I_N\right)^K\underset{K→∞}{\longrightarrow}e^{\frac{t}{\sqrt{N}}\tilde{A}^e} . \tag{16}
$$
$h$ is the solution at time $t$ of the time-continuous diffusion of the features on the graph $G$ with Laplacian $\tilde{A}^e$, defined by $∂_xX(x)=\frac{1}{\sqrt{N}}\tilde{A}^eX(x)$ and $X(0)=X$. The discrete GCN can be seen as the discretization of this differential equation in the forward Euler scheme. The mapping with eq. (11) is done by taking $c_k=K/t$ for all $k$ and by rescaling the output of the discrete GCN as $h(w)\prod_k c_k^{-1}$ so that it remains of order one when $K$ is large. For the discrete GCN we do not directly consider the update $h_{k+1}=(I_N+c_k^{-1}\tilde{A}/\sqrt{N})h_k$ because we want to study the effect of having no residual connections, i.e. $c_k=0$. The case where the diffusion coefficient depends on the position in the network is equivalent to a constant diffusion coefficient: because of commutativity, the solution at time $t$ of $∂_xX(x)=\frac{1}{\sqrt{N}}a(x)\tilde{A}^eX(x)$ for $a:ℝ→ℝ$ is $\exp\left(∫_0^t dx\, a(x)\frac{1}{\sqrt{N}}\tilde{A}^e\right)X(0)$.
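The convergence in eq. (16) can be checked numerically on a small random matrix; a sketch (the `expm_taylor` helper is our own illustrative implementation, not from the paper):

```python
import numpy as np

def expm_taylor(B, terms=40):
    """Matrix exponential by truncated Taylor series (adequate for small norms)."""
    out = np.eye(B.shape[0])
    term = np.eye(B.shape[0])
    for k in range(1, terms):
        term = term @ B / k
        out = out + term
    return out

# eq. (16): (t/(K sqrt(N)) A + I)^K -> exp(t A / sqrt(N)) as K grows
rng = np.random.default_rng(1)
N, t, K = 50, 1.0, 5000
A = rng.standard_normal((N, N))
euler = np.linalg.matrix_power(t / (K * np.sqrt(N)) * A + np.eye(N), K)
exact = expm_taylor(t / np.sqrt(N) * A)
gap = np.max(np.abs(euler - exact))   # O(1/K), shrinks as K grows
```

The gap decays as $1/K$, which is the usual first-order error of the forward Euler scheme.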
The discrete and the continuous GCNs are trained by empirical risk minimization. We define the regularized loss
$$
L_{A,X}(w)=\frac{1}{ρN}\sum_{i∈R}\ell(y_ih_i(w))+\frac{r}{ρN}\sum_ν γ(w_ν) \tag{17}
$$
where $γ$ is a strictly convex regularization function, $r$ is the regularization strength and $\ell$ is a convex loss function. The regularization ensures that the GCN does not overfit the training data and generalizes well to the test set. We will focus on $\ell_2$-regularization $γ(x)=x^2/2$, together with the square loss $\ell(x)=(1-x)^2/2$ (ridge regression) or the logistic loss $\ell(x)=\log(1+e^{-x})$ (logistic regression). Since $L$ is strictly convex it admits a unique minimizer $w^*$. The key quantities we want to estimate are the average train and test errors and accuracies of this model, which are
$$
E_{\mathrm{train/test}}=\mathbb{E}\,\frac{1}{|\hat{R}|}\sum_{i∈\hat{R}}\ell(y_ih(w^*)_i) , \qquad Acc_{\mathrm{train/test}}=\mathbb{E}\,\frac{1}{|\hat{R}|}\sum_{i∈\hat{R}}δ_{y_i=\mathrm{sign}(h(w^*)_i)} \tag{18}
$$
where $\hat{R}$ stands either for the train set $R$ or the test set $R^\prime$ and the expectation is taken over $y$, $u$, $A$, $X$, $R$ and $R^\prime$. $Acc_{\mathrm{train/test}}$ is the proportion of train/test nodes that are correctly classified. A central part of the present work is dedicated to the derivation of exact expressions for these errors and accuracies. We will then search for the architecture of the GCN that maximizes the test accuracy $Acc_{\mathrm{test}}$.
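For the square loss, the minimizer of eq. (17) is given in closed form by regularized least squares over the convolved features. A minimal sketch, assuming the setup above (not the authors' released code):

```python
import numpy as np

def train_ridge_gcn(A_tilde, X, y, train_mask, cs, r):
    """Ridge regression for the linear GCN of eq. (11): with square loss
    and l2 regularization, eq. (17) reduces to regularized least squares
    over the convolved features F, since h(w) = F w."""
    N, Mdim = X.shape
    F = X / np.sqrt(N)
    for c in cs:
        F = (A_tilde @ F) / np.sqrt(N) + c * F
    Ftr, ytr = F[train_mask], y[train_mask]
    # since y_i = +-1, l(y_i h_i) = (y_i - h_i)^2 / 2, hence the normal equations
    w = np.linalg.solve(Ftr.T @ Ftr + r * np.eye(Mdim), Ftr.T @ ytr)
    return w, F @ w                      # weights, predictions on all nodes
```

The common $1/(ρN)$ prefactor of the two terms in eq. (17) cancels in the minimization, so only the ratio controlled by $r$ matters.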
Notice that one could treat the residual connection strengths $c_k$ as supplementary parameters, jointly trained with $w$ to minimize the training loss. Our analysis can straightforwardly be extended to this case. Yet, as we will show, taking the $c_k$ trainable degrades the test performance; it is better to treat them as hyperparameters tuned to maximize the test accuracy.
Table 1: Summary of the parameters of the model.
| $N$ | number of nodes |
| --- | --- |
| $M$ | dimension of the attributes |
| $α=N/M$ | aspect ratio |
| $d$ | average degree of the graph |
| $λ$ | signal strength of the graph |
| $μ$ | signal strength of the features |
| $ρ=|R|/N$ | fraction of training nodes |
| $\ell$ , $γ$ | loss and regularization functions |
| $r$ | regularization strength |
| $K$ | number of aggregation steps |
| $c_k$ , $c$ , $t$ | residual connection strengths, diffusion time |
### II.3 Bayes-optimal performance
An interesting consequence of modeling the data as we propose is that one has access to the Bayes-optimal (BO) performance on this task. The BO performance is defined as the upper bound on the test accuracy that any algorithm can reach on this problem, knowing the model and its parameters $α,λ,μ$ and $ρ$. It is of particular interest since it allows us to check how far the GCNs are from optimality and how much improvement one can hope for.
The BO performance on this problem has been derived in [22] and [39]. It is expressed as a function of the fixed-point of an algorithm based on approximate message-passing (AMP). In the limit of large degrees $d=Θ(N)$ this algorithm can be tracked by a few scalar state-evolution (SE) equations that we reproduce in appendix C.
## III Asymptotic characterization of the GCN
In this section we provide an asymptotic characterization of the performance of the GCNs previously defined. It relies on a finite set of order parameters that satisfy a system of self-consistent, or fixed-point, equations, that we obtain thanks to the replica method in the high-dimensional limit at finite $K$ . In a second part, for the continuous GCN, we show how to take the limit $K→∞$ for the order parameters and for their self-consistent equations. The continuous GCN is still described by a finite set of order parameters, but these are now continuous functions and the self-consistent equations are integral equations.
Notice that for a quadratic loss function $\ell$ there is an analytical expression for the minimizer $w^*$ of the regularized loss $L_{A,X}$ of eq. (17), given by the regularized least-squares formula. Based on that, a computation of the performance of the GCN with random matrix theory (RMT) is possible. It would not be straightforward, in the sense that the convolved features, the weights $w^*$ and the labels $y$ are correlated, and such a computation would have to take these correlations into account. Instead, we prefer to use the replica method, which has already been successfully applied to analyze several one-layer (learnable) neural network architectures in articles such as [17, 38]. Compared to RMT, the replica method allows us to seamlessly handle the regularized pseudo-inverse of the least-squares problem and to deal with logistic regression, where no explicit expression for $w^*$ exists.
We compute the average train and test errors and accuracies eqs. (18) and (19) in the high-dimensional limit $N$ and $M$ large. We define the Hamiltonian
$$
H(w)=s\sum_{i∈R}\ell(y_ih(w)_i)+r\sum_ν γ(w_ν)+s^\prime\sum_{i∈R^\prime}\ell(y_ih(w)_i) \tag{20}
$$
where $s$ and $s^\prime$ are external fields to probe the observables. The loss of the test samples is in $H$ for the purpose of the analysis; we will take $s^\prime=0$ later and the GCN is minimizing the training loss (17). The free energy $f$ is defined as
$$
Z=∫dw\, e^{-βH(w)} , \qquad f=-\frac{1}{βN}\mathbb{E}\log Z . \tag{21}
$$
$β$ is an inverse temperature; we consider the limit $β→∞$ where the partition function $Z$ concentrates over $w^*$ at $s=1$ and $s^\prime=0$ . The train and test errors are then obtained according to
$$
E_{\mathrm{train}}=\frac{1}{ρ}\frac{∂f}{∂s} , \qquad E_{\mathrm{test}}=\frac{1}{ρ^\prime}\frac{∂f}{∂s^\prime} \tag{22}
$$
both evaluated at $(s,s^\prime)=(1,0)$. One can, in the same manner, compute the average accuracies by introducing the observables $\sum_{i∈\hat{R}}δ_{y_i=\mathrm{sign}(h(w)_i)}$ in $H$. To compute $f$ we introduce $n$ replicas:
$$
\mathbb{E}\log Z=\mathbb{E}\,\frac{∂Z^n}{∂n}\Big|_{n=0}=\left(\frac{∂}{∂n}\mathbb{E}Z^n\right)\Big|_{n=0} . \tag{23}
$$
To pursue the computation we need to specify the architecture of the GCN.
### III.1 Discrete GCN
#### III.1.1 Asymptotic characterization
In this section, we work at finite $K$. We consider only the asymmetric graph. We define the state of the GCN after the $k^{\mathrm{th}}$ convolution step as
$$
h_k=\left(\frac{1}{\sqrt{N}}\tilde{A}+c_kI_N\right)h_{k-1} , \qquad h_0=\frac{1}{\sqrt{N}}Xw . \tag{24}
$$
$h_K=h(w)∈ℝ^N$ is the output of the full GCN. We introduce the $h_k$ in the replicated partition function $Z^n$ and we integrate over the fluctuations of $A$ and $X$. This couples the variables across the different layers $k=0…K$, and one has to take into account the correlations between the different $h_k$, which results in order parameters of dimension $K$. One has to keep the indices $i∈R$ and $i∉R$ separate, according to whether the loss $\ell$ is active or not; consequently the free entropy of the problem is a linear combination of $ρ$ times a potential with $\ell$ and $(1-ρ)$ times a potential without $\ell$. The limit $N→∞$ is taken thanks to Laplace's method. The extremization is done within the replica-symmetric ansatz, which is justified by the convexity of $H$. The detailed computation is given in appendix A.
The outcome of the computation is that this problem is described by a set of twelve order parameters (or summary statistics). They are $Θ=\{m_w∈ℝ,Q_w∈ℝ,V_w∈ℝ,m∈ℝ^K,Q∈ℝ^{K×K},V∈ℝ^{K×K}\}$ and their conjugates $\hat{Θ}=\{\hat{m}_w∈ℝ,\hat{Q}_w∈ℝ,\hat{V}_w∈ℝ,\hat{m}∈ℝ^K,\hat{Q}∈ℝ^{K×K},\hat{V}∈ℝ^{K×K}\}$, where
$$
m_w=\frac{1}{N}u^Tw , \quad m_k=\frac{1}{N}y^Th_k , \quad Q_w=\frac{1}{N}w^Tw , \quad Q_{k,l}=\frac{1}{N}h^T_kh_l , \quad V_w=\frac{β}{N}\operatorname{Cov}_β(w,w) , \quad V_{k,l}=\frac{β}{N}\operatorname{Cov}_β(h_k,h_l) . \tag{25}
$$
$m_w$ and $m_k$ are the magnetizations (or overlaps) between the weights and the hidden variables and between the $k^{\mathrm{th}}$ layer and the labels; the $Q$s are the self-overlaps (or scalar products) between the different layers; and, writing $\operatorname{Cov}_β$ for the covariance under the density $e^{-βH}$, the $V$s are the covariances between different trainings on the same data, after rescaling by $β$.
The order parameters $Θ$ and $\hat{Θ}$ satisfy the property that they extremize the following free entropy $φ$ :
$$
\begin{aligned}
φ&=\frac{1}{2}\left(\hat{V}_wV_w+\hat{V}_wQ_w-V_w\hat{Q}_w\right)-\hat{m}_wm_w+\frac{1}{2}\operatorname{tr}\left(\hat{V}V+\hat{V}Q-V\hat{Q}\right)-\hat{m}^Tm \\
&\quad+\frac{1}{α}\mathbb{E}_{u,ς}\left(\log∫dw\,e^{ψ_w(w)}\right)+ρ\,\mathbb{E}_{y,ξ,ζ,χ}\left(\log∫\prod_{k=0}^Kdh_k\,e^{ψ_h(h;s)}\right)+(1-ρ)\,\mathbb{E}_{y,ξ,ζ,χ}\left(\log∫\prod_{k=0}^Kdh_k\,e^{ψ_h(h;s^\prime)}\right) ,
\end{aligned} \tag{28}
$$
the potentials being
$$
\begin{aligned}
ψ_w(w)&=-rγ(w)-\frac{1}{2}\hat{V}_ww^2+\left(\sqrt{\hat{Q}_w}\,ς+u\hat{m}_w\right)w \\
ψ_h(h;\bar{s})&=-\bar{s}\,\ell(yh_K)-\frac{1}{2}h_{<K}^T\hat{V}h_{<K}+\left(ξ^T\hat{Q}^{1/2}+y\hat{m}^T\right)h_{<K} \\
&\quad+\log N\!\left(h_0\,\middle|\,\sqrt{μ}\,ym_w+\sqrt{Q_w}\,ζ;V_w\right)+\log N\!\left(h_{>0}\,\middle|\,c\odot h_{<K}+λym+Q^{1/2}χ;V\right) ,
\end{aligned} \tag{29}
$$
for $w∈ℝ$ and $h∈ℝ^{K+1}$, where we introduced the Gaussian random variables $ς∼N(0,1)$, $ξ∼N(0,I_K)$, $ζ∼N(0,1)$ and $χ∼N(0,I_K)$, take $y$ Rademacher and $u∼N(0,1)$, where we set $h_{>0}=(h_1,…,h_K)^T$, $h_{<K}=(h_0,…,h_{K-1})^T$ and $c\odot h_{<K}=(c_1h_0,…,c_Kh_{K-1})^T$, and where $\bar{s}∈\{0,1\}$ controls whether the loss $\ell$ is active or not. We use the notation $N(·|m;V)$ for a Gaussian density of mean $m$ and variance $V$. We emphasize that $ψ_w$ and $ψ_h$ are effective potentials taking into account the randomness of the model and that they are defined over a finite number of variables, contrary to the initial loss function $H$.
The extremality condition $∇_Θ,\hat{Θ} φ=0$ can be stated in terms of a system of self-consistent equations that we give here. In the limit $β→∞$ one has to consider the extremizers of $ψ_w$ and $ψ_h$ defined as
$$
w^*=\operatorname*{argmax}_wψ_w(w)∈ℝ , \qquad h^*=\operatorname*{argmax}_hψ_h(h;\bar{s}=1)∈ℝ^{K+1} , \qquad h^{\prime*}=\operatorname*{argmax}_hψ_h(h;\bar{s}=0)∈ℝ^{K+1} . \tag{31}
$$
We also need to introduce $\operatorname{Cov}_{ψ_h}(h)$ and $\operatorname{Cov}_{ψ_h}(h^\prime)$, the covariances of $h$ under the densities $e^{ψ_h(h;\bar{s}=1)}$ and $e^{ψ_h(h;\bar{s}=0)}$. In the limit $β→∞$ they read
$$
\operatorname{Cov}_{ψ_h}(h)=\left(-∇∇ψ_h(h^*;\bar{s}=1)\right)^{-1} , \qquad \operatorname{Cov}_{ψ_h}(h^\prime)=\left(-∇∇ψ_h(h^{\prime*};\bar{s}=0)\right)^{-1} , \tag{34}
$$
$∇∇$ being the Hessian with respect to $h$ . Last, for compactness we introduce the operator $P$ that, for a function $g$ in $h$ , acts according to
$$
P(g(h))=ρ\, g(h^*)+(1-ρ)\,g(h^{\prime*}) . \tag{36}
$$
For instance $P(hh^T)=ρ\,h^*(h^*)^T+(1-ρ)\,h^{\prime*}(h^{\prime*})^T$ and $P(\operatorname{Cov}_{ψ_h}(h))=ρ\operatorname{Cov}_{ψ_h}(h)+(1-ρ)\operatorname{Cov}_{ψ_h}(h^\prime)$. The extremality condition then gives the following self-consistent, or fixed-point, equations on the order parameters:
$$
\begin{aligned}
m_w&=\frac{1}{α}\mathbb{E}_{u,ς}\,uw^* , & Q_w&=\frac{1}{α}\mathbb{E}_{u,ς}(w^*)^2 , & V_w&=\frac{1}{α}\frac{1}{\sqrt{\hat{Q}_w}}\mathbb{E}_{u,ς}\,ςw^* , \\
m&=\mathbb{E}_{y,ξ,ζ,χ}\,yP(h_{<K}) , & Q&=\mathbb{E}_{y,ξ,ζ,χ}P(h_{<K}h_{<K}^T) , & V&=\mathbb{E}_{y,ξ,ζ,χ}P(\operatorname{Cov}_{ψ_h}(h_{<K})) , \\
\hat{m}_w&=\frac{\sqrt{μ}}{V_w}\mathbb{E}_{y,ξ,ζ,χ}\,yP(h_0-\sqrt{μ}ym_w) , & \hat{Q}_w&=\frac{1}{V_w^2}\mathbb{E}_{y,ξ,ζ,χ}P\left((h_0-\sqrt{μ}ym_w-\sqrt{Q_w}ζ)^2\right) , & \hat{V}_w&=\frac{1}{V_w}-\frac{1}{V_w^2}\mathbb{E}_{y,ξ,ζ,χ}P(\operatorname{Cov}_{ψ_h}(h_0)) , \\
\hat{m}&=λV^{-1}\mathbb{E}_{y,ξ,ζ,χ}\,yP(h_{>0}-c\odot h_{<K}-λym) , \\
\hat{Q}&=V^{-1}\mathbb{E}_{y,ξ,ζ,χ}P\left((h_{>0}-c\odot h_{<K}-λym-Q^{1/2}χ)^{⊗2}\right)V^{-1} , \\
\hat{V}&=V^{-1}-V^{-1}\mathbb{E}_{y,ξ,ζ,χ}P(\operatorname{Cov}_{ψ_h}(h_{>0}-c\odot h_{<K}))V^{-1}
\end{aligned} \tag{37}
$$
Once this system of equations is solved, the expected errors and accuracies can be expressed as
$$
\begin{aligned}
E_{\mathrm{train}}&=\mathbb{E}_{y,ξ,ζ,χ}\,\ell(yh_K^*) , & Acc_{\mathrm{train}}&=\mathbb{E}_{y,ξ,ζ,χ}\,δ_{y=\mathrm{sign}(h_K^*)} , \\
E_{\mathrm{test}}&=\mathbb{E}_{y,ξ,ζ,χ}\,\ell(yh_K^{\prime*}) , & Acc_{\mathrm{test}}&=\mathbb{E}_{y,ξ,ζ,χ}\,δ_{y=\mathrm{sign}(h_K^{\prime*})} .
\end{aligned} \tag{49}
$$
Figure 2: Predicted test accuracy $Acc_test$ for different values of $K$ . Top: for $λ=1.5$ , $μ=3$ and logistic loss; bottom: for $λ=1$ , $μ=2$ and quadratic loss; $α=4$ and $ρ=0.1$ . We take $c_k=c$ for all $k$ . Inset: $Acc_test$ vs $c_1$ and $c_2$ at $K=2$ and at large $r$ . Dots: numerical simulation of the GCN for $N=10^4$ and $d=30$ , averaged over ten experiments.
#### III.1.2 Analytical solution
In general the system of self-consistent equations (37 - 48) has to be solved numerically. The equations are applied iteratively, starting from arbitrary $Θ$ and $\hat{Θ}$ , until convergence.
An analytical solution can be computed in some special cases. We consider ridge regression (i.e. quadratic $\ell$) and take $c=0$, i.e. no residual connections. Then $\operatorname{Cov}_{ψ_h}(h)$, $\operatorname{Cov}_{ψ_h}(h^\prime)$, $V$ and $\hat{V}$ are diagonal. We obtain that
$$
Acc_{\mathrm{test}}=\frac{1}{2}\left(1+\operatorname{erf}\left(\frac{λ\,q_{y,K-1}}{\sqrt{2}}\right)\right) , \qquad q_{y,k}=\frac{m_k}{\sqrt{Q_{k,k}}} . \tag{51}
$$
The test accuracy only depends on the angle (or overlap) $q_{y,k}$ between the labels $y$ and the last hidden state $h_{K-1}$ of the GCN. $q_{y,k}$ can easily be computed in the limit $r→∞$. In appendix A.3 we write out the equations (37 - 50) explicitly and give their solution in that limit. In particular we obtain, for any $k$,
$$
\begin{aligned}
m_k&=\frac{ρ}{αr}\left(μλ^{K+k}+\sum_{l=0}^kλ^{K-k+2l}\right) \\
Q_{k,k}&=\frac{ρ}{α^2r^2}\left(α\left(1+ρμλ^{2K}+ρ\sum_{l=1}^Kλ^{2l}\right)+\sum_{l=0}^k\left(1+ρ\sum_{l^\prime=1}^{K-1-l}λ^{2l^\prime}+\frac{α^2r^2}{ρ}m_l^2\right)\right) .
\end{aligned} \tag{52}
$$
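Eq. (51) is cheap to evaluate once the overlap is known; a direct transcription (not the authors' code), with the overlap $q=q_{y,K-1}$ assumed given:

```python
from math import erf, sqrt

def acc_test(lam, q):
    """Test accuracy of eq. (51), from the graph snr lambda and the
    overlap q = q_{y,K-1} between labels and last hidden state."""
    return 0.5 * (1.0 + erf(lam * q / sqrt(2.0)))
```

For instance, a perfect overlap $q=1$ at $λ=1.5$ gives an accuracy of about $0.93$; at zero overlap the accuracy is $1/2$, i.e. chance level.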
#### III.1.3 Consequences: going to large $K$ is necessary
We now derive consequences from the previous theoretical predictions. We numerically solve eqs. (37 - 48) for some plausible values of the parameters of the data model. We keep the signal from the graph, $λ^2$, and the signal from the features, $μ^2/α$, balanced; we take $ρ=0.1$ to stick to the common case where few training nodes are available. We focus on searching for the architecture that maximizes the test accuracy by varying the loss $\ell$, the regularization $r$, the residual connections $c_k$ and $K$. For simplicity we will mostly consider the case where $c_k=c$ for all $k$, for a given $c$. We compare our theoretical predictions to simulations of the GCN for $N=10^4$ in fig. 2; as expected, the predictions are within the statistical errors. Details on the numerics are provided in appendix D. We provide the code to run our predictions in the supplementary material.
The effect of $\ell$, $r$ and $c$ at $K=1$ was already studied in detail in [26]. That work reaches the conclusion that the optimal regularization is $r→∞$, that the choice of the loss $\ell$ has little effect and that there is an optimal $c=c^*$ of order one. According to fig. 2, these results extrapolate to $K>1$. We indeed observe that, for both the quadratic and the logistic loss, at $K∈\{1,2,3\}$, $r→∞$ seems optimal. The choice of the loss then has little effect, because at large $r$ the output $h(w)$ of the network is small and only the behaviour of $\ell$ around 0 matters. Notice that, though $h(w)$ is small and the error $E_{\mathrm{train/test}}$ is trivially equal to $\ell(0)$, the sign of $h(w)$ is mostly correct and the accuracy $Acc_{\mathrm{train/test}}$ is not trivial. Lastly, according to the inset of fig. 2 for $K=2$, taking $c_1=c_2$ is optimal, so our assumption $c_k=c$ for all $k$ is justified.
Taking the $c_k$ trainable would degrade the test performance. We show this in fig. 15 in appendix E, where optimizing the train error $E_{\mathrm{train}}$ over $c_k=c$ trivially leads to $c=+∞$. Indeed, in this case the graph is discarded and the convolved features are proportional to the features $X$, which, if $αρ$ is small enough, are separable and lead to zero train error. Consequently, $c_k$ should be treated as a hyperparameter, tuned to maximize $Acc_{\mathrm{test}}$, as we do in the rest of the article.
Figure 3: Test error $1-Acc_{\mathrm{test}}$ versus $λ^2$ for $(α,μ)=(4,1)$ (circles) and $(α,μ)=(2,2)$ (plus signs), at $K∈\{1,2,3\}$, with and without symmetrization of the graph, compared to the Bayes-optimal error (dashed) and to $\frac{1}{2}τ_{BO}$ (dotted). Increasing the depth $K$ and symmetrizing the graph bring the GCN closer to the Bayes-optimal performance.
Figure 3: Predicted misclassification error $1-\mathrm{Acc}_{\mathrm{test}}$ at large $λ$ for two strengths of the feature signal. $r=∞$ , $c=c^*$ is optimized by grid search, and $ρ=0.1$ . The dots are theoretical predictions given by numerically solving the self-consistent equations (37–48) simplified in the limit $r→∞$ . For the symmetrized graph the self-consistent equations are eqs. (87–94) in the next part.
Finite $K$ :
We focus on the effect of varying the number $K$ of aggregation steps. [26] shows that at $K=1$ there is a large gap between the Bayes-optimal test accuracy and the best test accuracy of the GCN. We find that, according to fig. 2, for $K∈\{1,2,3\}$ , increasing $K$ progressively reduces this gap. Going to higher depth thus allows one to approach the Bayes-optimality.
This also holds for the learning rate as the signal $λ$ of the graph increases. At $λ→∞$ the GCN is consistent and correctly predicts the labels of all the test nodes, that is $\mathrm{Acc}_{\mathrm{test}}\underset{λ→∞}{\longrightarrow}1$ . The learning rate $τ>0$ of the GCN is defined by
$$
\log(1-\mathrm{Acc}_{\mathrm{test}})\underset{λ→∞}{∼}-τλ^2 . \tag{54}
$$
As shown in [39], the rate $τ_{\rm BO}$ of the Bayes-optimal test accuracy is
$$
τ_{\rm BO}=1 . \tag{55}
$$
For $K=1$ , [26] proves that $τ≤τ_{\rm BO}/2$ and that $τ→τ_{\rm BO}/2$ when the signal from the features $μ^2/α$ diverges. We obtain that if $K>1$ then $τ=τ_{\rm BO}/2$ for any signal from the features. This is shown on fig. 3: for $K=1$ the slope of the residual error varies with $μ$ and $α$ and does not reach half of the Bayes-optimal slope, while for $K>1$ it does, and the features only contribute at a sub-leading order.
Analytically, taking the limit in eqs. (52) and (53), at $c=0$ and $r→∞$ we have that
$$
\underset{λ→∞}{\lim}\, q_{y,K-1}\left\{\begin{array}{ll}=1 & \text{if } K>1\\
<1 & \text{if } K=1\end{array}\right. \tag{56}
$$
Since $\log\left(1-\frac{1}{2}\left(1+\mathrm{erf}\left(\frac{λ q_{y,K-1}}{\sqrt{2}}\right)\right)\right)\underset{λ→∞}{∼}-λ^2q_{y,K-1}^2/2$ we recover the leading behaviour depicted on fig. 3. $c$ has little effect on the rate $τ$ ; it only seems to change the test accuracy by a sub-leading term.
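These rates can be checked numerically from the erf expression for the accuracy. The sketch below (a minimal check, not the full self-consistent solution) sets the overlap to its limiting value $q_{y,K-1}=1$ and estimates the slope of $\log(1-\mathrm{Acc})$ against $λ^2$ ; on the symmetrized graph the effective signal $λ^s=\sqrt{2}λ$ doubles the rate.

```python
import math

def log_error(lam, q=1.0):
    # log(1 - Acc) with Acc = (1 + erf(lam*q/sqrt(2)))/2, i.e. log(erfc(lam*q/sqrt(2))/2)
    return math.log(0.5 * math.erfc(lam * q / math.sqrt(2)))

# slope of log(1 - Acc) versus lam^2 estimates the rate tau; q -> 1 for K > 1
tau = -(log_error(12.0) - log_error(10.0)) / (12.0**2 - 10.0**2)
# on the symmetrized graph the signal is lam_s = sqrt(2)*lam, measured against lam^2
tau_sym = -(log_error(math.sqrt(2) * 12.0) - log_error(math.sqrt(2) * 10.0)) / (12.0**2 - 10.0**2)
print(tau, tau_sym)  # ~0.5 = tau_BO/2 and ~1.0 = tau_BO
```

The sub-leading $\log λ$ term of the erfc asymptotics explains the small residual deviation from $1/2$ and $1$ at finite $λ$ .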
Symmetrization:
We found that in order to reach the Bayes-optimal rate one has to further symmetrize the graph, according to eq. (14), and to perform the convolution steps by applying $\tilde{A}^s$ instead of $\tilde{A}$ . Then, as shown on fig. 3, the GCN reaches the BO rate for any $K>1$ , at any signal from the features.
The reason for this improvement is the following. The GCN we consider is not able to exploit the asymmetry of the graph and the supplementary information it carries. This is shown on fig. 14 in appendix E for different values of $K$ at finite signal, in agreement with [20]: at equal $λ$ , there is little difference in the performance of the simple GCN whether the graph is symmetric or not. As to the rates, as shown by the computation in appendix C, a symmetric graph with signal $λ$ would lead to a BO rate $τ_{\rm BO}^s=1/2$ , which is the rate the GCN achieves on the asymmetric graph. It is thus better to let the GCN process the symmetrized graph, which has a higher signal $λ^s=\sqrt{2}λ$ and which leads to $τ=1=τ_{\rm BO}$ .
Symmetrization is an important step toward optimality and we will detail the analysis of the GCN on the symmetrized graph in part III.2.
Large $K$ and scaling of $c$ :
Going to larger $K$ is beneficial and allows the network to approach the Bayes-optimality. Yet $K=3$ is not enough to reach it at finite $λ$ , and one can ask what happens at larger $K$ . An important point is that $c$ has to be well tuned: on fig. 2 we observe that $c^*$ , the optimal $c$ , increases with $K$ . To make this point more precise, on fig. 4 we show the predicted test accuracy at larger $K$ for different scalings of $c$ . We take $r=∞$ since it appears to be the optimal regularization. We consider no residual connections, $c=0$ ; constant residual connections, $c=1$ ; or growing residual connections, $c∝ K$ .
Figure 4: Predicted test accuracy $\mathrm{Acc}_{\mathrm{test}}$ vs $K$ for different scalings of $c$ , at $r=∞$ . Top: for $λ=1.5$ , $μ=3$ ; bottom: for $λ=0.7$ , $μ=1$ ; $α=4$ , $ρ=0.1$ . The predictions are given either by the explicit expressions, eqs. (51–53), for $c=0$ , or by solving the self-consistent equations (37–48) simplified in the limit $r→∞$ . The performance of the continuous limit is derived in the next section III.2, while the performance of PCA on the graph is given by eqs. (59–60).
A main observation on fig. 4 is that, for $K→∞$ , $c=0$ and $c=1$ converge to the same limit, while $c∝ K$ converges to a different limit of higher accuracy.
In the case where $c=0$ or $c=1$ the GCN oversmooths at large $K$ . The limit it converges to corresponds to the accuracy of principal component analysis (PCA) on the sole graph; that is, it corresponds to the accuracy of the estimator $\hat{y}_{\rm PCA}=\mathrm{sign}\left(\mathrm{Re}(y_1)\right)$ where $y_1$ is the leading eigenvector of $\tilde{A}$ . The overlap $q_{\rm PCA}$ between $y$ and $\hat{y}_{\rm PCA}$ and the accuracy are
$$
q_{\rm PCA}=\left\{\begin{array}{ll}\sqrt{1-λ^{-2}} & \text{if } λ≥ 1\\
0 & \text{if } λ≤ 1\end{array}\right. , \tag{59}
$$
$$
\mathrm{Acc}_{\mathrm{test,PCA}}=\frac{1}{2}\left(1+\mathrm{erf}\left(\frac{λ q_{\rm PCA}}{\sqrt{2}}\right)\right) . \tag{60}
$$
Consequently, if $c$ does not grow, the GCN oversmooths at large $K$ , in the sense that all the information from the features $X$ vanishes. Only the information from the graph remains, which can still be informative if $λ>1$ . The formulas (59–60) are obtained by taking the limit $K→∞$ in eqs. (51–53), for $c=0$ . For any constant $c$ they can also be recovered by considering the leading eigenvector $y_1$ of $\tilde{A}$ : at large $K$ , $(\tilde{A}/\sqrt{N}+cI)^K$ is dominated by $y_1$ and the output of the GCN is $h(w)∝ y_1$ for any $w$ . Consequently the GCN acts exactly like thresholded PCA on $\tilde{A}$ . The sharp transition at $λ=1$ corresponds to the BBP phase transition in the spectrum of $A^g$ and $\tilde{A}$ [56]. According to eqs. (51–53) the convergence of $q_{y,K-1}$ toward $q_{\rm PCA}$ is exponentially fast in $K$ if $λ>1$ ; it is much slower, like $1/\sqrt{K}$ , if $λ<1$ .
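Eqs. (59–60) are explicit and can be evaluated directly; a minimal sketch, where the values $λ=1.5$ and $λ=0.7$ are those of the two panels of fig. 4:

```python
import math

def q_pca(lam):
    # overlap of thresholded PCA on the graph, eq. (59); BBP transition at lam = 1
    return math.sqrt(1.0 - lam**-2) if lam > 1 else 0.0

def acc_pca(lam):
    # test accuracy of thresholded PCA on the graph, eq. (60)
    return 0.5 * (1.0 + math.erf(lam * q_pca(lam) / math.sqrt(2)))

print(acc_pca(1.5))  # ~0.87, the "PCA on the graph" level of fig. 4 (top)
print(acc_pca(0.7))  # 0.5: below the BBP transition the graph alone is uninformative
```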
The fact that the oversmoothed features can be informative differs from several previous works where they are fully non-informative, such as [9, 10, 40]. This is mainly due to the normalization $\tilde{A}$ of $A$ that we use and that these works do not: it allows one to remove the uniform eigenvector $(1,…,1)^T$ , which otherwise dominates $A$ and leads to non-informative features. [36] emphasizes this point and compares different ways of normalizing and correcting $A$ . That work concludes, as we do, that for a correct rescaling $\tilde{A}$ of $A$ , similar to ours, going to higher $K$ is always beneficial if $λ$ is high enough, and that the convergence to the limit is exponentially fast. Yet, at large $K$ it obtains bounds on the test accuracy that do not depend on the features: the network it considers still oversmooths in the precise sense we defined. This can be expected since it does not have residual connections, i.e. $c=0$ , which appear to be decisive.
In the case where $c∝ K$ the GCN does not oversmooth and it converges to a continuous limit, obtained as $(cI+\tilde{A}/\sqrt{N})^K∝(I+t\tilde{A}/(K\sqrt{N}))^K→ e^{t\tilde{A}/\sqrt{N}}$ . We study this limit in detail in the next part, where we predict the resulting accuracy for all constant ratios $t=K/c$ . In general the continuous limit has a better performance than the limit at constant $c$ , which relies only on the graph and performs PCA, because it can take into account the features, which bring additional information.
Fig. 4 suggests that $\mathrm{Acc}_{\mathrm{test}}$ increases monotonically with $K$ if $c∝ K$ and that the continuous limit is an upper bound on the performance at any $K$ . We will make this point more precise in the next part. Yet we can already see that, for this to be true, one has to correctly tune the ratio $c/K$ : for instance, if $λ$ is small, $\tilde{A}$ mostly contains noise and applying it to $X$ will mostly lower the accuracy. In short, if $c/K$ is optimized then $K→∞$ is better than any fixed $K$ . Consequently the continuous limit is the correct limit to maximize the test accuracy and it is of particular relevance.
### III.2 Continuous GCN
In this section we present the asymptotic characterization of the continuous GCN, both for the asymmetric graph and for its symmetrization. The continuous GCN is the limit of the discrete GCN when the number of convolution steps $K$ diverges while the residual connections $c$ become large. The order parameters that describe it, as well as the self-consistent equations they follow, can be obtained as the limit of those of the discrete GCN. We give a detailed derivation of how the limit is taken, since it is of independent interest.
The outcome is that the state $h$ of the GCN across the convolutions is described by a set of equations resembling dynamical mean-field theory. The order parameters of the problem are continuous functions and, by an expansion around large regularization $r→∞$ , the self-consistent equations can be expressed as integral equations, which specialize to differential equations in the asymmetric case. The resulting equations can be solved analytically; for asymmetric graphs, the covariance and its conjugate are propagators (or resolvents) of the two-dimensional Klein-Gordon equation. We show numerically that our approach is justified and agrees with simulations. Lastly, we show that going to the continuous limit while symmetrizing the graph corresponds to the optimum of the architecture and allows one to approach the Bayes-optimality.
#### III.2.1 Asymptotic characterization
To deal with both cases, asymmetric or symmetrized, we define $(δ_e,\tilde{A}^e,λ^e)∈\{(0,\tilde{A},λ),(1,\tilde{A}^s,λ^s)\}$ , where we remind that $\tilde{A}^s$ is the symmetrized $\tilde{A}$ with effective signal $λ^s=\sqrt{2}λ$ . In particular $δ_e=0$ for the asymmetric case and $δ_e=1$ for the symmetrized one.
The continuous GCN is defined by the output function
$$
h(w)=e^{\frac{t}{\sqrt{N}}\tilde{A}^e}\frac{1}{\sqrt{N}}Xw . \tag{61}
$$
We first derive the free entropy of the discretization of the GCN and then take the continuous limit. The discretization at finite $K$ is
$$
h(w)=h_K , \qquad h_{k+1}=\left(I_N+\frac{t}{K\sqrt{N}}\tilde{A}^e\right)h_k , \qquad h_0=\frac{1}{\sqrt{N}}Xw . \tag{62}
$$
In the case of the asymmetric graph this discretization can be mapped to the discrete GCN of the previous part III.1, as detailed in eq. (16) and the following paragraph; the free entropy and the order parameters of the two models are the same, up to a rescaling by $c$ .
The order parameters of the discretization of the GCN are $m_w∈ℝ, Q_w∈ℝ, V_w∈ℝ, m∈ℝ^K, Q_h∈ℝ^{K× K}, V_h∈ℝ^{K× K}$ , their conjugates $\hat{m}_w∈ℝ, \hat{Q}_w∈ℝ, \hat{V}_w∈ℝ, \hat{m}∈ℝ^K, \hat{Q}_h∈ℝ^{K× K}, \hat{V}_h∈ℝ^{K× K}$ , and the two additional order parameters $Q_{qh}∈ℝ^{K× K}$ and $V_{qh}∈ℝ^{K× K}$ that account for the supplementary correlations the symmetry of the graph induces; $Q_{qh}=V_{qh}=0$ in the asymmetric case.
The free entropy and its derivation are given in appendix B. The outcome is that $h$ is described by an effective low-dimensional potential $ψ_h$ over $ℝ^{K+1}$ , which reads
$$
ψ_h(h;\bar{s})=-\frac{1}{2}h^TGh+h^T\left(B_h+D_{qh}^TG_0^{-1}B\right) ; \tag{65}
$$
where
$$
G=G_h+D_{qh}^TG_0^{-1}D_{qh} , \quad
G_h=\left(\begin{smallmatrix}\hat{V}_h&0\\
0&\bar{s}\end{smallmatrix}\right) , \quad
G_0=\left(\begin{smallmatrix}K^2V_w&0\\
0&t^2V_h\end{smallmatrix}\right) , \quad
D_{qh}=D-t\left(\begin{smallmatrix}0&0\\
-iδ_eV_{qh}^T&0\end{smallmatrix}\right) \tag{66}
$$
are $(K+1)×(K+1)$ block matrices;
$$
D=K\left(\begin{smallmatrix}1&&&0\\
-1&⋱&&\\
&⋱&⋱&\\
0&&-1&1\end{smallmatrix}\right) \tag{70}
$$
is the $(K+1)×(K+1)$ discrete derivative;
$$
B=\left(\begin{smallmatrix}K\sqrt{Q_w}\,χ\\
it\left(\hat{Q}^{1/2}ζ\right)_q\end{smallmatrix}\right)+y\left(\begin{smallmatrix}K\sqrt{μ}\,m_w\\
λ^et\,m\end{smallmatrix}\right) , \quad
B_h=\left(\begin{smallmatrix}\left(\hat{Q}^{1/2}ζ\right)_h\\
0\end{smallmatrix}\right)+y\left(\begin{smallmatrix}\hat{m}\\
\bar{s}\end{smallmatrix}\right) , \quad
\left(\begin{smallmatrix}(\hat{Q}^{1/2}ζ)_q\\
(\hat{Q}^{1/2}ζ)_h\end{smallmatrix}\right)=\left(\begin{smallmatrix}-Q_h&-δ_eQ_{qh}^T\\
-δ_eQ_{qh}&\hat{Q}_h\end{smallmatrix}\right)^{1/2}\left(\begin{smallmatrix}ζ_q\\
ζ_h\end{smallmatrix}\right) \tag{71}
$$
are vectors of size $K+1$ , where $y=± 1$ is Rademacher and $ζ_q∼N(0,I_{K+1})$ , $ζ_h∼N(0,I_{K+1})$ and $χ∼N(0,1)$ are standard Gaussians. $\bar{s}$ determines whether the loss is active ($\bar{s}=1$) or not ($\bar{s}=0$). We assumed that $\ell$ is quadratic; later we will take the limit $r→∞$ , where $h$ is small and where $\ell$ can effectively be expanded around $0$ as a quadratic potential. Notice that in the case $δ_e=0$ we recover the potential $ψ_h$ of eq. (30) of the previous part.
This potential, eq. (65), corresponds to a one-dimensional interacting chain, involving the positions $h$ and their effective derivative $D_{qh}h$ , with constraints at the two ends: the loss on $h_K$ and the regularized weights on $h_0$ . Its extremizer $h^*$ is
$$
h^*=G^{-1}\left(B_h+D_{qh}^TG_0^{-1}B\right) . \tag{74}
$$
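Since $ψ_h$ is quadratic, its extremizer is obtained by a single linear solve: the gradient $-Gh+b$ vanishes at $h^*=G^{-1}b$ , where $b=B_h+D_{qh}^TG_0^{-1}B$ . A toy $2×2$ instance (the values of $G$ and $b$ are made up for illustration):

```python
# For psi(h) = -1/2 h^T G h + h^T b, the gradient is b - G h, so the
# extremizer is h* = G^{-1} b (cf. eq. (74), with b = B_h + D_qh^T G_0^{-1} B).
G = [[3.0, 1.0], [1.0, 2.0]]   # toy symmetric positive-definite G
b = [1.0, 4.0]                 # toy source vector
det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
h_star = [(G[1][1] * b[0] - G[0][1] * b[1]) / det,
          (G[0][0] * b[1] - G[1][0] * b[0]) / det]
grad = [b[i] - sum(G[i][j] * h_star[j] for j in range(2)) for i in range(2)]
print(h_star, grad)  # the gradient vanishes at h*
```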
The order parameters are determined by the following fixed-point equations, obtained by extremizing the free entropy. As before, $P$ acts by linearly combining quantities evaluated at $h^*$ , taken with $\bar{s}=1$ and $\bar{s}=0$ , with weights $ρ$ and $1-ρ$ .
$$
\begin{aligned}
&m_w=\frac{1}{α}\frac{\hat{m}_w}{r+\hat{V}_w} , \qquad Q_w=\frac{1}{α}\frac{\hat{Q}_w+\hat{m}_w^2}{(r+\hat{V}_w)^2} , \qquad V_w=\frac{1}{α}\frac{1}{r+\hat{V}_w} ,\\
&\left(\begin{smallmatrix}\hat{m}_w\\
\hat{m}\\
m\\
·\end{smallmatrix}\right)=\left(\begin{smallmatrix}K\sqrt{μ}&&0\\
&λ^etI_K&\\
0&&I_{K+1}\end{smallmatrix}\right)E_{y,ξ,ζ}\, y\, P\left(\begin{smallmatrix}G_0^{-1}(D_{qh}h-B)\\
h\end{smallmatrix}\right) ,\\
&\left(\begin{smallmatrix}\hat{Q}_w&&&·\\
&\hat{Q}_h&Q_{qh}&\\
&Q_{qh}^T&Q_h&\\
·&&&·\end{smallmatrix}\right)=\left(\begin{smallmatrix}K&&0\\
&tI_K&\\
0&&I_{K+1}\end{smallmatrix}\right)E_{y,ξ,ζ}\, P\left(\left(\begin{smallmatrix}G_0^{-1}(D_{qh}h-B)\\
h\end{smallmatrix}\right)^{⊗ 2}\right)\left(\begin{smallmatrix}K&&0\\
&tI_K&\\
0&&I_{K+1}\end{smallmatrix}\right) ,\\
&\left(\begin{smallmatrix}·&·\\
-iV_{qh}&·\end{smallmatrix}\right)=tP\left(G_0^{-1}D_{qh}G^{-1}\right) , \qquad \left(\begin{smallmatrix}V_h&·\\
·&·\end{smallmatrix}\right)=P\left(G^{-1}\right) ,\\
&\left(\begin{smallmatrix}\hat{V}_w&·\\
·&\hat{V}_h\end{smallmatrix}\right)=\left(\begin{smallmatrix}K^2&0\\
0&t^2I_K\end{smallmatrix}\right)P\left(G_0^{-1}-G_0^{-1}D_{qh}G^{-1}D_{qh}^TG_0^{-1}\right)
\end{aligned} \tag{75}
$$
where $·$ denotes unspecified elements that pad the vectors to size $2(K+1)$ and the matrices to sizes $2(K+1)× 2(K+1)$ and $(K+1)×(K+1)$ . On $w$ we assumed $\ell_2$ regularization and obtained the same equations as in part III.1.
Once a solution to this system is found, the train and test accuracies are expressed as
$$
\mathrm{Acc}_{\mathrm{train/test}}=E_{y,ζ,χ}\, δ_{y=\mathrm{sign}(h^*_K)} , \tag{85}
$$
taking $\bar{s}=1$ or $\bar{s}=0$ .
#### III.2.2 Expansion around large regularization $r$ and continuous limit
Solving the above self-consistent equations (75–84) is difficult as such. One can solve them numerically by repeated updates, but this does not allow going to large $K$ because of numerical instability. One has to invert $G$ , eq. (66), and to make sense of the continuous limit of matrix inverses. This is an issue in the sense that, for a generic $K× K$ matrix $(M)_{ij}$ whose elements vary smoothly with $i$ and $j$ in the limit of large $K$ , the elements of its inverse $M^{-1}$ are not necessarily continuous with respect to their indices and can vary with a large magnitude.
Our analysis from the previous part III.1 gives insight into how to achieve this. It appears that the limit of large regularization $r→∞$ is of particular relevance: in this limit the above system can be solved analytically thanks to an expansion around large $r$ . This expansion is natural in the sense that it leads to several simplifications and corresponds to expanding the matrix inverses in Neumann series. Keeping the first terms of the expansion, the limit $K→∞$ is then well defined. In this section we detail this expansion; we take the continuous limit and, keeping the first constant order, we solve (75–84).
In the limit of large regularization, $h$ and $w$ are of order $1/r$ ; the parameters $m_w$ , $m$ , $V_w$ and $V_h$ are of order $1/r$ and $Q_w$ and $Q_h$ are of order $1/r^2$ , while all their conjugates, as well as $Q_{qh}$ and $V_{qh}$ , are of order one. Consequently we have $G_0^{-1}∼ r\gg G_h∼ 1$ and we expand $G^{-1}$ around $G_0$ :
$$
G^{-1}=D_{qh}^{-1}G_0D_{qh}^{-1,T}∑_{a≥ 0}\left(-G_hD_{qh}^{-1}G_0D_{qh}^{-1,T}\right)^a . \tag{86}
$$
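The structure of this Neumann series can be illustrated on scalars, where $G_0$ plays the role of a quantity of order $1/r$ (the numbers below are hypothetical; each extra order of the truncation gains a factor $∼ 1/r$ ):

```python
r = 50.0
g0 = 1.0 / r   # scalar stand-in for G_0, of order 1/r at large regularization
gh = 0.7       # scalar stand-in for G_h, of order one
exact = 1.0 / (gh + r)  # scalar version of G^{-1} = (G_h + G_0^{-1})^{-1}
# truncations of the Neumann series g0 * sum_a (-gh*g0)^a at orders a <= n
errs = [abs(sum(g0 * (-gh * g0) ** a for a in range(n + 1)) - exact)
        for n in (0, 1, 2)]
print(errs)  # errors shrink by a factor ~ gh/r per order
```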
Constant order:
We detail how to solve the self-consistent equations (75–84) by taking the continuous limit $K→∞$ at the constant order in $1/r$ . As we will show later, truncating $G^{-1}$ at the constant order gives predictions that are close to the simulations at finite $r$ , even for $r≈ 1$ if $t$ is not too large. Considering higher orders is feasible but more challenging, and we will only provide insights on how to pursue the computation.
The truncated expansion gives, starting from the variances:
$$
\left(\begin{smallmatrix}·&·\\
-iV_{qh}&·\end{smallmatrix}\right)=tD_{qh}^{-1,T} , \quad
\left(\begin{smallmatrix}V_h&·\\
·&·\end{smallmatrix}\right)=D_{qh}^{-1}\left(\begin{smallmatrix}K^2V_w&0\\
0&t^2V_h\end{smallmatrix}\right)D_{qh}^{-1,T} , \quad
\left(\begin{smallmatrix}\hat{V}_w&·\\
·&\hat{V}_h\end{smallmatrix}\right)=\left(\begin{smallmatrix}K^2&0\\
0&t^2I_K\end{smallmatrix}\right)D_{qh}^{-1,T}\left(\begin{smallmatrix}\hat{V}_h&0\\
0&ρ\end{smallmatrix}\right)D_{qh}^{-1} . \tag{87}
$$
We kept the order $a=0$ for $V_{qh}$ and $V_h$ , and the orders $a≤ 1$ for $\hat{V}_w$ and $\hat{V}_h$ . We expand $h^*≈ D_{qh}^{-1}G_0D_{qh}^{-1,T}B_h+D_{qh}^{-1}B$ , keeping the order $a=0$ , and obtain the remaining self-consistent equations
$$
\left(\begin{smallmatrix}\hat{m}_w\\
\hat{m}\end{smallmatrix}\right)=\left(\begin{smallmatrix}K\sqrt{μ}&0\\
0&λ^etI_K\end{smallmatrix}\right)D_{qh}^{-1,T}\left(\begin{smallmatrix}\hat{m}\\
ρ\end{smallmatrix}\right) , \quad
\left(\begin{smallmatrix}m\\
·\end{smallmatrix}\right)=D_{qh}^{-1}G_0D_{qh}^{-1,T}\left(\begin{smallmatrix}\hat{m}\\
ρ\end{smallmatrix}\right)+D_{qh}^{-1}\left(\begin{smallmatrix}K\sqrt{μ}m_w\\
λ^etm\end{smallmatrix}\right) \tag{90}
$$
$$
\begin{aligned}
\left(\begin{smallmatrix}\hat{Q}_w&·\\
·&\hat{Q}_h\end{smallmatrix}\right)={}&\left(\begin{smallmatrix}K&0\\
0&tI_K\end{smallmatrix}\right)D_{qh}^{-1,T}\left(\left(\begin{smallmatrix}\hat{Q}_h&0\\
0&0\end{smallmatrix}\right)+ρ\left(\begin{smallmatrix}\hat{m}\\
1\end{smallmatrix}\right)^{⊗ 2}+(1-ρ)\left(\begin{smallmatrix}\hat{m}\\
0\end{smallmatrix}\right)^{⊗ 2}\right)D_{qh}^{-1}\left(\begin{smallmatrix}K&0\\
0&tI_K\end{smallmatrix}\right) ,\\
\left(\begin{smallmatrix}·&·\\
-iQ_{qh}&·\end{smallmatrix}\right)={}&tD_{qh}^{-1,T}\left[\left(tδ_e\left(\begin{smallmatrix}0&-iQ_{qh}\\
0&0\end{smallmatrix}\right)+\left(\begin{smallmatrix}\hat{m}\\
ρ\end{smallmatrix}\right)\left(\begin{smallmatrix}K\sqrt{μ}m_w\\
λ^etm\end{smallmatrix}\right)^T\right)D_{qh}^{-1,T}+\left(\left(\begin{smallmatrix}\hat{Q}_h&0\\
0&0\end{smallmatrix}\right)+ρ\left(\begin{smallmatrix}\hat{m}\\
1\end{smallmatrix}\right)^{⊗ 2}+(1-ρ)\left(\begin{smallmatrix}\hat{m}\\
0\end{smallmatrix}\right)^{⊗ 2}\right)D_{qh}^{-1}G_0D_{qh}^{-1,T}\right] ,\\
\left(\begin{smallmatrix}Q_h&·\\
·&·\end{smallmatrix}\right)={}&D_{qh}^{-1}G_0D_{qh}^{-1,T}\left(\left(\begin{smallmatrix}\hat{Q}_h&0\\
0&0\end{smallmatrix}\right)+ρ\left(\begin{smallmatrix}\hat{m}\\
1\end{smallmatrix}\right)^{⊗ 2}+(1-ρ)\left(\begin{smallmatrix}\hat{m}\\
0\end{smallmatrix}\right)^{⊗ 2}\right)D_{qh}^{-1}G_0D_{qh}^{-1,T}+D_{qh}^{-1}\left(\left(\begin{smallmatrix}K^2Q_w&0\\
0&t^2Q_h\end{smallmatrix}\right)+\left(\begin{smallmatrix}K\sqrt{μ}m_w\\
λ^etm\end{smallmatrix}\right)^{⊗ 2}\right)D_{qh}^{-1,T}\\
&+D_{qh}^{-1}G_0D_{qh}^{-1,T}\left(tδ_e\left(\begin{smallmatrix}0&-iQ_{qh}\\
0&0\end{smallmatrix}\right)+\left(\begin{smallmatrix}\hat{m}\\
ρ\end{smallmatrix}\right)\left(\begin{smallmatrix}K\sqrt{μ}m_w\\
λ^etm\end{smallmatrix}\right)^T\right)D_{qh}^{-1,T}+D_{qh}^{-1}\left(tδ_e\left(\begin{smallmatrix}0&0\\
-iQ_{qh}^T&0\end{smallmatrix}\right)+\left(\begin{smallmatrix}K\sqrt{μ}m_w\\
λ^etm\end{smallmatrix}\right)\left(\begin{smallmatrix}\hat{m}\\
ρ\end{smallmatrix}\right)^T\right)D_{qh}^{-1}G_0D_{qh}^{-1,T}
\end{aligned} \tag{92}
$$
We see that all these self-consistent equations (88–94) are vector or matrix equations of the form $x=λ^etD_{qh}^{-1}x$ or $X=t^2D_{qh}^{-1}XD_{qh}^{-1,T}$ , over $x$ or $X$ , plus inhomogeneous terms and boundary conditions at $0$ or $(0,0)$ . The equations are recursive in the sense that each equation only depends on the previous ones, so they can be solved one by one. It is thus enough to compute the resolvents of these two equations. Lastly, eq. (87) shows how to invert $D_{qh}$ and express $D_{qh}^{-1}$ . These different properties make the system of self-consistent equations easily solvable, provided one can compute $D_{qh}^{-1}$ and the two resolvents. This further highlights the relevance of the $r→∞$ limit.
We now take the continuous limit $K→∞$ and translate the above self-consistent equations into functional equations. Thanks to the expansion around large $r$ we have a well-defined limit that does not involve any matrix inverse. We set $x=k/K$ and $z=l/K$ as continuous indices ranging from $0$ to $1$ . We extend the vectors and the matrices by continuity to match the correct dimensions, and we apply the following rescaling to obtain quantities that are independent of $K$ in that limit:
$$
\hat{m}→ K\hat{m} , \quad \hat{Q}_h→ K^2\hat{Q}_h , \quad \hat{V}_h→ K^2\hat{V}_h , \quad Q_{qh}→ KQ_{qh} , \quad V_{qh}→ KV_{qh} . \tag{95}
$$
We first compute the effective derivative $D_{qh}=D-t\left(\begin{smallmatrix}0&0\\ -iδ_eV_{qh}^T&0\end{smallmatrix}\right)$ and its inverse. In the asymmetric case we have $D_{qh}=D$ , the usual derivative. In the symmetric case we have $D_{qh}=D-tV_{qh}^T$ , where $V_{qh}$ satisfies eq. (87), which reads
$$
∂_zV_{qh}(x,z)+δ(z)V_{qh}(x,z)=tδ(z-x)+t∫_0^1dx^\prime\, V_{qh}(x,x^\prime)V_{qh}(x^\prime,z) , \tag{97}
$$
where we multiplied both sides by $D_{qh}^T$ and wrote $V_{qh}(x,z)$ for $-iV_{qh}$ . The solution to this integro-differential equation is
$$
V_{qh}(x,z)=θ(z-x)\frac{I_1(2t(z-x))}{z-x} \tag{98}
$$
with $θ$ the step function and $I_ν$ the modified Bessel function of the first kind of order $ν$ . Consequently we obtain the effective inverse derivative
$$
D_{qh}^{-1}(x,z)=D_{qh}^{-1,T}(z,x)=\left\{\begin{array}{ll}θ(x-z)&\text{if } δ_e=0\\
\frac{1}{t}V_{qh}(z,x)&\text{if } δ_e=1\end{array}\right. . \tag{101}
$$
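In the asymmetric case $D_{qh}=D$ , and eq. (101) says that its inverse is the discrete counterpart of the step kernel $θ(x-z)$ : the lower-triangular matrix of ones scaled by $1/K$ , i.e. a cumulative sum. A small sketch checking this against the matrix $D$ of eq. (70):

```python
K = 5
N = K + 1
# D from eq. (70): K on the diagonal, -K on the first subdiagonal
D = [[float(K) if i == j else (-float(K) if i == j + 1 else 0.0) for j in range(N)]
     for i in range(N)]
# claimed inverse: (1/K) * lower-triangular ones, the discrete theta(x - z)
Dinv = [[1.0 / K if i >= j else 0.0 for j in range(N)] for i in range(N)]
prod = [[sum(D[i][k] * Dinv[k][j] for k in range(N)) for j in range(N)] for i in range(N)]
ok = all(abs(prod[i][j] - (1.0 if i == j else 0.0)) < 1e-12
         for i in range(N) for j in range(N))
print(ok)  # D * Dinv is the identity
```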
We then define the resolvents (or propagators) $φ$ and $Φ$ of the integral equations as
$$
D_{qh}φ(x)=λ^etφ(x)+δ(x) , \qquad D_{qh}Φ(x,z)D_{qh}^T=t^2Φ(x,z)+δ(x,z) . \tag{102}
$$
Notice that in the asymmetric case, $D_{qh}=∂_x$ , $D_{qh}^T=∂_z$ , and $Φ$ is the propagator of the two-dimensional Klein-Gordon equation up to a change of variables. The resolvents can be expressed as
$$
φ(x)=\left\{\begin{array}{ll}e^{λ^etx}&\text{if } δ_e=0\\
∑_{ν>0}^{∞}ν(λ^e)^{ν-1}\frac{I_ν(2tx)}{tx}&\text{if } δ_e=1\end{array}\right. , \qquad
Φ(x,z)=\left\{\begin{array}{ll}I_0(2t\sqrt{xz})&\text{if } δ_e=0\\
\frac{I_1(2t(x+z))}{t(x+z)}&\text{if } δ_e=1\end{array}\right. . \tag{106}
$$
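In the asymmetric case, eq. (102) reads $∂_x∂_zΦ(x,z)=t^2Φ(x,z)$ away from the source, which is the Klein-Gordon equation in light-cone coordinates. A finite-difference sketch checking that $Φ(x,z)=I_0(2t\sqrt{xz})$ satisfies it (with $I_0$ summed from its power series; the values of $t$ , $x$ , $z$ are arbitrary test points):

```python
import math

def I0(u):
    # modified Bessel function of the first kind, order 0, via its power series
    return sum((u / 2.0) ** (2 * k) / math.factorial(k) ** 2 for k in range(30))

t = 1.3
Phi = lambda x, z: I0(2.0 * t * math.sqrt(x * z))
x, z, e = 0.6, 0.9, 1e-4
# central finite difference for the mixed derivative d^2 Phi / (dx dz)
mixed = (Phi(x + e, z + e) - Phi(x + e, z - e)
         - Phi(x - e, z + e) + Phi(x - e, z - e)) / (4.0 * e * e)
print(mixed, t**2 * Phi(x, z))  # the two agree: d_x d_z Phi = t^2 Phi
```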
We obtain the solution of the self-consistent equations by convolving $φ$ or $Φ$ with the inhomogeneous terms. We flip $\hat{m}$ along its axis to match the vector equation with boundary condition at $x=0$ ; we do the same for $\hat{V}_h$ and $\hat{Q}_h$ along their two axes, and for $Q_{qh}$ along its first axis. This gives the following expressions for the order parameters:
$$
\begin{aligned}
&V_w=\frac{1}{rα} , \qquad V_h(x,z)=V_wΦ(x,z) , \qquad \hat{V}_h(1-x,1-z)=t^2ρΦ(x,z) , \qquad \hat{V}_w=t^{-2}\hat{V}_h(0,0) ,\\
&\hat{m}(1-x)=ρλ^etφ(x) , \qquad \hat{m}_w=\sqrt{μ}\frac{1}{λ^et}\hat{m}(0) , \qquad m_w=\frac{\hat{m}_w}{rα} ,\\
&m(x)=(1+μ)\frac{m_w}{\sqrt{μ}}φ(x)+\frac{t}{λ^e}∫_0^xdx^\prime∫_0^1dx^{\prime\prime}\, φ(x-x^\prime)V_h(x^\prime,x^{\prime\prime})\hat{m}(x^{\prime\prime}) ,\\
&\hat{Q}_w=t^{-2}\hat{Q}_h(0,0) , \qquad Q_w=\frac{\hat{Q}_w+\hat{m}_w^2}{r^2α}
\end{aligned} \tag{110}
$$
$$
\begin{aligned}
\hat{Q}_h(1-x,1-z) &= t^2\int_{0^-,0^-}^{x,z} dx'\,dz'\,Φ(x-x',z-z')\left[P(\hat{m}^{\otimes 2})(1-x',1-z')\right] \\
Q_{qh}(1-x,z) &= t\int_{0^-,0^-}^{x,z} dx'\,dz'\,Φ(x-x',z-z')\Bigg[P(\hat{m})(1-x')\left(λ^e t\,m(z')+\sqrt{μ}\,m_wδ(z')\right) \\
&\quad + \int_{0,0^-}^{1^+,1} dx''\,dz''\left(\hat{Q}_h(1-x',x'')+P(\hat{m}^{\otimes 2})(1-x',x'')\right)D_{qh}^{-1}(x'',z'')\,G_0(z'',z')\Bigg] \tag{120}
\end{aligned}
$$
$$
\begin{aligned}
Q_h(x,z) &= \int_{0^-,0^-}^{x,z} dx'\,dz'\,Φ(x-x',z-z')\Bigg[\hat{Q}_wδ(x',z')+\left(λ^e t\,m(x')+\sqrt{μ}\,m_wδ(x')\right)\left(λ^e t\,m(z')+\sqrt{μ}\,m_wδ(z')\right) \\
&\quad + \int_{0^-,0}^{1,1^+} dx''\,dx'''\,G_0(x',x'')\,D_{qh}^{-1,T}(x'',x''')\left(tδ_e Q_{qh}(x''',z')+P(\hat{m})(x''')\left(λ^e t\,m(z')+\sqrt{μ}\,m_wδ(z')\right)\right) \\
&\quad + \int_{0,0^-}^{1^+,1} dz'''\,dz''\left(tδ_e Q_{qh}(z''',x')+\left(λ^e t\,m(x')+\sqrt{μ}\,m_wδ(x')\right)P(\hat{m})(z''')\right)D_{qh}^{-1}(z''',z'')\,G_0(z'',z') \\
&\quad + \int_{0^-,0,0,0^-}^{1,1^+,1^+,1} dx''\,dx'''\,dz'''\,dz''\,G_0(x',x'')\,D_{qh}^{-1,T}(x'',x''')\left(\hat{Q}_h(x''',z''')+P(\hat{m}^{\otimes 2})(x''',z''')\right)D_{qh}^{-1}(z''',z'')\,G_0(z'',z')\Bigg] \,; \tag{122}
\end{aligned}
$$
where we set
$$
\begin{aligned}
P(\hat{m})(x) &= \hat{m}(x)+ρ\,δ(1-x) \,, \\
P(\hat{m}^{\otimes 2})(x,z) &= ρ\left(\hat{m}(x)+δ(1-x)\right)\left(\hat{m}(z)+δ(1-z)\right)+(1-ρ)\,\hat{m}(x)\hat{m}(z) \,, \\
G_0(x,z) &= t^2V_h(x,z)+V_w\,δ(x,z) \tag{123}
\end{aligned}
$$
and write $Q_{qh}(x,z)$ for $-iQ_{qh}$. The accuracies are, with $\bar{s}=1$ for the train set and $\bar{s}=0$ for the test set:
$$
Acc_{train/test} = \frac{1}{2}\left(1+\mathrm{erf}\left(\frac{m(1)+(\bar{s}-ρ)V_h(1,1)}{\sqrt{2}\sqrt{Q_h(1,1)-m(1)^2-ρ(1-ρ)V_h(1,1)^2}}\right)\right) \,. \tag{126}
$$
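For reference, eq. (126) is straightforward to evaluate once the boundary order parameters are known. A minimal sketch (our own names; $m(1)$, $Q_h(1,1)$ and $V_h(1,1)$ are passed as plain numbers):

```python
# Sketch of the accuracy formula (126). Our variable names; the order
# parameters m1 = m(1), Qh11 = Q_h(1,1), Vh11 = V_h(1,1) come from solving
# the self-consistent equations; rho is the train fraction.
import math

def accuracy(m1, Qh11, Vh11, rho, train=False):
    s_bar = 1.0 if train else 0.0
    num = m1 + (s_bar - rho) * Vh11
    var = Qh11 - m1 ** 2 - rho * (1.0 - rho) * Vh11 ** 2
    return 0.5 * (1.0 + math.erf(num / (math.sqrt(2.0) * math.sqrt(var))))
```

In the uninformative case $m(1)=V_h(1,1)=0$ the formula gives chance accuracy $1/2$, as expected.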
Notice that we fully solved the model, in this limit, by giving an explicit expression for the performance of the GCN. This is an uncommon result: in many works analyzing the performance of neural networks in a high-dimensional limit, the performance is expressed only as the solution of a system of self-consistent equations similar to ours (75 - 84). These systems have to be solved numerically, which can be unsatisfactory for the understanding of the studied models.
So far we dealt with infinite regularization $r$, keeping only the constant leading order. The predicted accuracy (126) does not depend on $r$. We briefly show in appendix B.4 how to pursue the computation to any order, via a perturbative expansion in powers of $1/r$.
Interpretation in terms of dynamical mean-field theory:
The order parameters $V_h$ , $V_qh$ and $\hat{V}_h$ come from the replica computation and were introduced as the covariances between $h$ and its conjugate $q$ . Their values are determined by extremizing the free entropy of the problem. In the above lines we derived that $V_h(x,z)∝Φ(x,z)$ is the forward propagator, from the weights to the loss, while $\hat{V}_h(x,z)∝Φ(1-x,1-z)$ is the backward propagator, from the loss to the weights.
In this section we state an equivalence between these order parameters and the correlation and response functions of the dynamical process followed by $h$ .
We introduce the tilting field $η(x)∈ℝ^N$ and the tilted Hamiltonian as
$$
\begin{aligned}
\frac{dh}{dx}(x) &= \frac{t}{\sqrt{N}}\tilde{A}^e h(x) + η(x) + δ(x)\frac{1}{\sqrt{N}}Xw \,, \\
h(x) &= \int_{0^-}^{x} dx'\,e^{(x-x')\frac{t}{\sqrt{N}}\tilde{A}^e}\left(η(x')+δ(x')\frac{1}{\sqrt{N}}Xw\right) \,, \\
H(η) &= \frac{1}{2}(y-h(1))^T R\,(y-h(1)) + \frac{r}{2}w^Tw \,, \tag{127}
\end{aligned}
$$
where $R∈ℝ^{N×N}$ is diagonal and accounts for the train and test nodes. We write $⟨·⟩_β$ for the expectation under the density $e^{-βH(η)}/Z$ (normalized only at $η=0$).
Then we have
$$
\begin{aligned}
V_h(x,z) &= \frac{β}{N}\left[⟨h(x)h(z)^T⟩_β-⟨h(x)⟩_β⟨h(z)^T⟩_β\right]\Big|_{η=0} \,, \\
V_{qh}(x,z) &= \frac{t}{N}\frac{∂}{∂η(z)}⟨h(x)⟩_β\Big|_{η=0} \,, \\
\hat{V}_h(x,z) &= \frac{t^2}{β^2N}\frac{∂^2}{∂η(x)∂η(z)}⟨1⟩_β\Big|_{η=0} \,; \tag{130}
\end{aligned}
$$
that is to say, $V_h$ is the correlation function, $V_{qh}≈ tD_{qh}^{-1,T}$ is the response function, and $\hat{V}_h$ is the correlation function of the responses of $h$. We prove these equalities at the constant order in $r$ using random matrix theory in appendix B.5.
#### III.2.3 Consequences
Convergences:
We compare our predictions to numerical simulations of the continuous GCN for $N=10^4$ and $N=7×10^3$ in fig. 5 and figs. 8, 10 and 11 in appendix E. The predicted test accuracies are well within the statistical errors. On these figures we can observe the convergence of $Acc_{test}$ with respect to $r$. The interchange of the two limits $r→∞$ and $K→∞$ that we performed to obtain (126) appears valid. Indeed, in the figures we simulate the continuous GCN with $e^{\frac{t}{\sqrt{N}}\tilde{A}}$ or $e^{\frac{t}{\sqrt{N}}\tilde{A}^s}$ and take $r→∞$ after the continuous limit $K→∞$; and we observe that the simulated accuracies converge well toward the predicted ones. Keeping only the constant order in $1/r$ gives a good approximation of the continuous GCN. Indeed, the convergence with respect to $1/r$ can be fast: at moderate $t\lessapprox 1$, $r\gtrapprox 1$ is enough to reach the continuous limit.
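A small-scale version of these simulations can be reproduced along the following lines. This is our own minimal sketch (sizes, seed, and the closed-form ridge readout are illustrative assumptions): it uses the Gaussian-equivalent adjacency of eq. (136) and the matrix exponential for the continuous GCN, and trains the readout $w$ with the quadratic loss and regularization $r$ on the train nodes.

```python
# Minimal sketch of a finite-size simulation of the continuous GCN on CSBM
# data. Sizes are illustrative (the paper uses N = 10^4); alpha = N/M = 4,
# rho = 0.1, lambda = 1.5, mu = 3, t = 1 as in fig. 5.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
N, M, lam, mu, t, r, rho = 400, 100, 1.5, 3.0, 1.0, 100.0, 0.1

y = rng.choice([-1.0, 1.0], size=N)                      # node labels
u = rng.standard_normal(M)                               # feature spike direction
X = np.sqrt(mu / N) * np.outer(y, u) + rng.standard_normal((N, M))
A = lam / np.sqrt(N) * np.outer(y, y) + rng.standard_normal((N, N))  # Gaussian equivalent

F = expm(t / np.sqrt(N) * A) @ X / np.sqrt(N)            # h(1) = F w for readout w
train = rng.random(N) < rho                              # revealed train labels
w = np.linalg.solve(F[train].T @ F[train] + r * np.eye(M), F[train].T @ y[train])
acc_test = np.mean(np.sign(F[~train] @ w) == y[~train])
```

At these parameters the prediction of eq. (126) is around $0.93$; a finite-size run fluctuates around that value.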
Figure 5: Predicted test accuracy $Acc_{test}$ of the continuous GCN on the asymmetric graph, at $r=∞$, $α=4$ and $ρ=0.1$. The performance of the continuous GCN is given by eq. (126). Dots: numerical simulation of the continuous GCN for $N=10^4$ and $d=30$, trained with the quadratic loss, averaged over ten experiments.
The convergence with respect to $K→∞$, taken after $r→∞$, is depicted in fig. 6 and fig. 9 in appendix E. Again the continuous limit enjoys good convergence properties, since $K\gtrapprox 16$ is enough if $t$ is not too large.
Figure 6: Predicted test accuracy $Acc_{test}$ of the continuous GCN and of its discrete counterpart with depth $K$ on the asymmetric graph, at $r=∞$, $α=1$ and $ρ=0.1$. The performance of the continuous GCN is given by eq. (126), while for the discrete GCN it is given by numerically solving the fixed-point equations (88 - 94).
To summarize, figs. 5, 6 and appendix E validate our method, which consists of deriving the self-consistent equations at finite $K$ with replicas, expanding them with respect to $1/r$, taking the continuous limit $K→∞$ and then solving the resulting integral equations.
Optimal diffusion time $t^*$ :
We observe in the previous figures that there is an optimal diffusion time $t^*$ that maximizes $Acc_{test}$. Though we are able to solve the self-consistent equations and obtain the explicit analytical expression (126), it is hard to analyze it directly in order to evaluate $t^*$. We have to consider further limiting cases or compute $t^*$ numerically. The derivation of the following equations is detailed in appendix B.6.
We first consider the case $t→ 0$ . Expanding (126) to the first order in $t$ we obtain
$$
Acc_{test} \underset{t→0}{=} \frac{1}{2}\left(1+\mathrm{erf}\left(\frac{1}{\sqrt{2}}\sqrt{\frac{ρ}{α}}\,\frac{μ+λ^e t(2+μ)}{\sqrt{1+ρμ}}\right)\right)+o(t) \,. \tag{133}
$$
This expression shows in particular that $t^*>0$: some diffusion on the graph is always beneficial compared to no diffusion, as long as $λt>0$, i.e. the diffusion is done forward if the graph is homophilic ($λ>0$) and backward if it is heterophilic ($λ<0$). We recover the result of [40] for the discrete case in a slightly different setting. This holds even if the features of the graph are not informative, $μ=0$. Notice the explicit invariance under the change $(λ,t)→(-λ,-t)$ in the potential (65) and in (133), which allows us to focus on $λ≥0$. The case $t=0$, i.e. no diffusion, corresponds to performing ridge regression on the Gaussian mixture $X$ alone. Such a model has been studied in [37]; we checked that we obtain the same expression as theirs at large regularization.
We now consider the case $t→+∞$ and $λ≥ 0$ . Taking the limit in (126) we obtain
$$
Acc_{test} \underset{t→∞}{\longrightarrow} \frac{1}{2}\left(1+\mathrm{erf}\left(\frac{λ^e q_{PCA}}{\sqrt{2}}\right)\right) \,, \tag{134}
$$
where $q_{PCA}$ is the same as for the discrete GCN, defined in eq. (59). This shows that the continuous GCN oversmooths at large diffusion times. Thus, if the features are informative, i.e. if $μ^2/α>0$, the optimal diffusion time should be finite, $t^*<+∞$. The continuous GCN behaves exactly as the discrete GCN at $K→∞$ with $c$ fixed. This is not surprising because of the mapping $c=K/t$: taking $t$ large is equivalent to taking $c$ small with respect to $K$, and $e^{\frac{t}{\sqrt{N}}\tilde{A}}$ is dominated by the same leading eigenvector $y_1$.
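The oversmoothing mechanism behind this limit can be illustrated numerically: at large $t$ the kernel $e^{\frac{t}{\sqrt{N}}\tilde{A}}$ becomes effectively rank-one along its leading eigenvector, so the diffused states forget the input. A small sketch of ours on the symmetrized Gaussian-equivalent graph (sizes and seed are illustrative):

```python
# Sketch: alignment of diffused random vectors with the leading eigenvector
# of the symmetrized Gaussian-equivalent adjacency, as a function of t.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
N, lam = 300, 1.5
y = rng.choice([-1.0, 1.0], size=N)
A = lam / np.sqrt(N) * np.outer(y, y) + rng.standard_normal((N, N))
A = (A + A.T) / np.sqrt(2)                 # symmetrized graph, real spectrum

v = np.linalg.eigh(A)[1][:, -1]            # leading eigenvector (correlated with y)

def alignment(t, k=5):
    """Mean |cosine| between e^{tA/sqrt(N)} g and the leading eigenvector, g random."""
    H = expm(t / np.sqrt(N) * A) @ rng.standard_normal((N, k))
    return np.mean(np.abs(v @ H) / np.linalg.norm(H, axis=0))
```

As $t$ grows the alignment approaches one: the diffused columns collapse onto the leading eigenvector, whatever the input was.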
These two limits show that at finite time $t$ the GCN avoids oversmoothing and interpolates between an estimator that is a function of the features only, at $t=0$, and an estimator that is a function of the graph only, at $t=∞$. $t$ has to be fine-tuned to reach the best trade-off $t^*$ and the optimal performance.
In the insets of fig. 7 and fig. 12 in appendix E we show how $t^*$ depends on $λ$. In particular, $t^*$ is finite for any $λ$: some diffusion is always beneficial but too much diffusion leads to oversmoothing. We have $t^*\underset{λ→0}{\longrightarrow}0$. This is expected since if $λ=0$ then $A$ is not informative and any diffusion $t>0$ would degrade the performance. The non-monotonicity of $t^*$ with respect to $λ$ is less expected and we do not have a clear interpretation for it. Lastly, $t^*$ decreases when the feature signal $μ^2/α$ increases: the more informative $X$ is, the less diffusion is needed.
Figure 7: Predicted test accuracy $Acc_{test}$ of the continuous GCN and of its discrete counterpart with depth $K$, at optimal times $t^*$ and $r=∞$. $α=4$, $μ=1$ and $ρ=0.1$. The performance of the continuous GCN $K=∞$ is given by eq. (126), while for its discretization at finite $K$ it is given by numerically solving eqs. (87 - 94). Inset: the maximizer $t^*$ at $K=∞$.
Optimality of the continuous limit:
A major result is that, at $t=t^*$, the continuous GCN is better than any fixed-$K$ GCN. Taking the continuous limit of the simple GCN is the way to reach its optimal performance. This was suggested by fig. 4 in the previous part; we show it more precisely in fig. 7 and fig. 12 in appendix E. We compare the continuous GCN to its discretization at different depths $K$ for several configurations $α,λ,μ$ and $ρ$ of the data model. The result is that at $t^*$ the test accuracy appears to always be an increasing function of $K$, and that its value at $K→∞$ and $t^*$ is an upper bound for all $K$ and $t$.
Additionally, if the GCN is run on the symmetrized graph it can approach the Bayes-optimality and almost close the gap that [26] describes, as shown by figs. 7, 12 and 13 (right). For all the considered $λ$ and $μ$ the GCN is within a few percent of accuracy of optimality.
However, we must qualify this statement: the GCN approaches the Bayes-optimality only in a certain range of the parameters of the CSBM, as exemplified by figs. 12 and 13 (left). In these figures, the GCN is far from the Bayes-optimality when $λ$ is small but $μ$ is large. In this regime we have $snr_{CSBM}>1$; even at $ρ=0$ information can be retrieved on the labels, and the problem is closer to an unsupervised classification of the features $X$ alone. On $X$ the GCN acts as a supervised classifier, and as long as $ρ≠1$ it cannot capture all the information. As previously highlighted by [39], the comparison with the Bayes-optimality is more relevant at $snr_{CSBM}<1$, where supervision is necessary. There, as shown by figs. 7, 12 and 13, the symmetrized continuous GCN is close to the Bayes-optimality. The GCN is also able to close the gap in the region where $λ$ is large because, as we saw, it can perform unsupervised PCA on $A$.
## IV Conclusion
In this article we derived the performance of a simple GCN trained for node classification in a semi-supervised way on data generated by the CSBM in the high-dimensional limit. We first studied a discrete network with a finite number $K$ of convolution steps. We showed the importance of going to large $K$ to approach the Bayes-optimality, while scaling the residual connections $c$ of the network accordingly to avoid oversmoothing. The resulting limit is a continuous GCN.
In a second part we were able to explicitly derive the performance of the continuous GCN. We highlighted the importance of the double limit $r,K→∞$, which allows one to reach the optimal architecture and which can be analyzed thanks to an expansion in powers of $1/r$. It is an interesting question for future work whether this approach could allow the study of fully-connected large-depth neural networks.
Though the continuous GCN can be close to the Bayes-optimality, it would have to handle the features better, especially when they are the main source of information.
## Acknowledgments
We acknowledge useful discussions with J. Zavatone-Veth, F. Zamponi and V. Erba. This work is supported by the Swiss National Science Foundation under grant SNSF SMArtNet (grant number 212049).
## Appendix A Asymptotic characterisation of the discrete GCN
In this part we compute the free energy of the discrete finite-$K$ GCN using replicas. We derive the fixed-point equations for the order parameters of the problem and the asymptotic characterization of the errors and accuracies as functions of the order parameters. We consider only the asymmetric graph $\tilde{A}$; the symmetrized case $\tilde{A}^s$ is analyzed in the following section B together with the continuous GCN.
The free energy of the problem is $-βNf=∂_n\,E_{u,Ξ,W,y}Z^n|_{n=0}$, where the partition function is
$$
Z = \int \prod_ν^M dw_ν\,e^{-βrγ(w_ν)}\,e^{-βs\sum_{i∈R}\ell(y_ih(w)_i)-βs'\sum_{i∈R'}\ell(y_ih(w)_i)} \,. \tag{135}
$$
To lighten the notation we take $ρ'=1-ρ$, i.e. the test set is the whole complement of the train set. This does not change the result since the performance does not depend on the size of the test set.
We recall that $\tilde{A}$ admits the following Gaussian equivalent:
$$
\tilde{A} \approx A^g = \frac{\lambda}{\sqrt{N}} y y^T + \Xi , \qquad \Xi_{ij} \sim \mathcal{N}(0,1) . \tag{136}
$$
$\tilde{A}$ can be approximated by $A^g$ with a vanishing change in the free energy $f$ .
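As a quick numerical illustration of this Gaussian-equivalent data model, the sketch below generates $A^g$ of eq. (136) together with CSBM-like features $X = \sqrt{\mu/N}\, y u^T + W$ and checks that the rank-one spike aligns the matrix with the labels. All sizes and signal strengths are placeholder values, not those of the experiments.

```python
import numpy as np

# Minimal sketch of the Gaussian-equivalent model of eq. (136),
# A^g = (lambda/sqrt(N)) y y^T + Xi, together with the CSBM features
# X = sqrt(mu/N) y u^T + W.  Parameter values are illustrative.
rng = np.random.default_rng(0)
N, M = 400, 200            # nodes and feature dimension
lam, mu = 1.5, 2.0         # graph and feature signal-to-noise ratios

y = rng.choice([-1.0, 1.0], size=N)        # Rademacher labels
u = rng.standard_normal(M)                 # latent feature direction

Xi = rng.standard_normal((N, N))           # i.i.d. noise (asymmetric case)
A_g = lam / np.sqrt(N) * np.outer(y, y) + Xi

W = rng.standard_normal((N, M))
X = np.sqrt(mu / N) * np.outer(y, u) + W   # node features

# The rank-one spike aligns A^g with the labels: y^T A^g y / N^{3/2} ~ lambda.
print(y @ A_g @ y / N**1.5)
```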
### A.1 Derivation of the free energy
We define the intermediate states of the GCN as
$$
h_k = \left( \frac{1}{\sqrt{N}} \tilde{A} + c_k I_N \right) h_{k-1} , \qquad h_0 = \frac{1}{\sqrt{N}} X w . \tag{137}
$$
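A minimal sketch of this forward pass, with random placeholder data and a constant residual connection:

```python
import numpy as np

# Sketch of the discrete GCN forward pass of eq. (137):
#   h_0 = X w / sqrt(N),  h_k = (A/sqrt(N) + c_k I) h_{k-1},  k = 1..K.
# Data and weights are random placeholders; c_k is constant here.
rng = np.random.default_rng(1)
N, M, K = 300, 150, 4
A = rng.standard_normal((N, N))            # stand-in for the graph matrix
X = rng.standard_normal((N, M))            # stand-in for the node features
w = rng.standard_normal(M)                 # trainable weights
c = np.full(K, 2.0)                        # residual connections c_k

h = X @ w / np.sqrt(N)                     # h_0
states = [h]
for k in range(K):
    h = A @ h / np.sqrt(N) + c[k] * h      # one convolution step
    states.append(h)

print(len(states), states[-1].shape)       # K+1 stored states, each in R^N
```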
We introduce them in $Z$ thanks to Dirac deltas. The expectation of the replicated partition function is
$$
\begin{aligned}
\mathbb{E} Z^n \propto{} & \mathbb{E}_{u,\Xi,W,y} \int \prod_a^n \prod_\nu^M dw_\nu^a\, e^{-\beta r \gamma(w_\nu^a)} \prod_a^n \prod_i^N \prod_{k=0}^K dh_{i,k}^a\, dq_{i,k}^a\, e^{-\beta s \sum_{a,\, i\in R} \ell(y_i h_{i,K}^a) - \beta s' \sum_{a,\, i\in R'} \ell(y_i h_{i,K}^a)} \\
& e^{\sum_{a,i} \sum_{k=1}^K i q_{i,k}^a \left( h_{i,k}^a - \frac{1}{\sqrt{N}} \sum_j \left( \frac{\lambda}{\sqrt{N}} y_i y_j + \Xi_{ij} \right) h_{j,k-1}^a - c_k h_{i,k-1}^a \right) + \sum_{a,i} i q_{i,0}^a \left( h_{i,0}^a - \frac{1}{\sqrt{N}} \sum_\nu \left( \sqrt{\frac{\mu}{N}} y_i u_\nu + W_{i\nu} \right) w_\nu^a \right)} \\
={} & \mathbb{E}_{u,y} \int \prod_{a,\nu} dw_\nu^a\, e^{-\beta r \gamma(w_\nu^a)} \prod_{a,i,k} dh_{i,k}^a\, e^{-\beta s \sum_{a,\, i\in R} \ell(y_i h_{i,K}^a) - \beta s' \sum_{a,\, i\in R'} \ell(y_i h_{i,K}^a)} \\
& \prod_i \mathcal{N}\left( h_{i,>0} \,\middle|\, c \odot h_{i,<K} + y_i \frac{\lambda}{N} \sum_j y_j h_{j,<K} ;\, \tilde{Q} \right) \prod_i \mathcal{N}\left( h_{i,0} \,\middle|\, y_i \frac{\sqrt{\mu}}{N} \sum_\nu u_\nu w_\nu ;\, \frac{1}{N} \sum_\nu w_\nu w_\nu^T \right) .
\end{aligned} \tag{138}
$$
$\mathcal{N}(\cdot\,|\,m;V)$ is the Gaussian density of mean $m$ and covariance $V$. We integrated over the random fluctuations $\Xi$ and $W$, and then over the $q$s. We collected the replicas in vectors of size $n$ and assembled them as
$$
\begin{aligned}
& h_{i,>0} = \begin{pmatrix} h_{i,1} \\ \vdots \\ h_{i,K} \end{pmatrix} \in \mathbb{R}^{nK} , \quad h_{i,<K} = \begin{pmatrix} h_{i,0} \\ \vdots \\ h_{i,K-1} \end{pmatrix} \in \mathbb{R}^{nK} , \quad c \odot h_{i,<K} = \begin{pmatrix} c_1 h_{i,0} \\ \vdots \\ c_K h_{i,K-1} \end{pmatrix} , \\
& \tilde{Q}_{k,l} = \frac{1}{N} \sum_j h_{j,k} h_{j,l}^T , \quad \tilde{Q} = \begin{pmatrix} \tilde{Q}_{0,0} & \dots & \tilde{Q}_{0,K-1} \\ \vdots & & \vdots \\ \tilde{Q}_{K-1,0} & \dots & \tilde{Q}_{K-1,K-1} \end{pmatrix} \in \mathbb{R}^{nK \times nK} .
\end{aligned} \tag{139}
$$
We introduce the order parameters
$$
\begin{aligned}
& m_w^a = \frac{1}{N} \sum_\nu u_\nu w_\nu^a , \quad Q_w^{ab} = \frac{1}{N} \sum_\nu w_\nu^a w_\nu^b , \\
& m_k^a = \frac{1}{N} \sum_j y_j h_{j,k}^a , \quad Q_k^{ab} = (\tilde{Q}_{k,k})_{a,b} = \frac{1}{N} \sum_j h_{j,k}^a h_{j,k}^b , \quad Q_{k,l}^{ab} = (\tilde{Q}_{k,l})_{a,b} = \frac{1}{N} \sum_j h_{j,k}^a h_{j,l}^b .
\end{aligned} \tag{141}
$$
$m_k$ is the magnetization (or overlap) between the $k$-th layer and the labels; $m_w$ is the magnetization between the weights and the latent vector $u$; and the $Q$s are the self-overlaps across the different layers. In the following we write $\tilde{Q}$ for the matrix with elements $(\tilde{Q})_{ak,bl} = Q_{k,l}^{ab}$. We introduce these quantities thanks to new Dirac deltas. This allows us to factorize the spatial $i$ and $\nu$ indices.
$$
\begin{aligned}
\mathbb{E} Z^n \propto{} & \int \prod_a \prod_{k=0}^{K-1} d\hat{m}_k^a\, dm_k^a\, e^{N \hat{m}_k^a m_k^a} \prod_a d\hat{m}_w^a\, dm_w^a\, e^{N \hat{m}_w^a m_w^a} \prod_{a\le b} \prod_{k=0}^{K-1} d\hat{Q}_k^{ab}\, dQ_k^{ab}\, e^{N \hat{Q}_k^{ab} Q_k^{ab}} \prod_{a,b} \prod_{k<l}^{K-1} d\hat{Q}_{k,l}^{ab}\, dQ_{k,l}^{ab}\, e^{N \hat{Q}_{k,l}^{ab} Q_{k,l}^{ab}} \\
& \prod_{a\le b} d\hat{Q}_w^{ab}\, dQ_w^{ab}\, e^{N \hat{Q}_w^{ab} Q_w^{ab}} \left[ \mathbb{E}_u \int \prod_a dw^a\, e^{\psi_w^{(n)}(w)} \right]^{\frac{N}{\alpha}} \left[ \mathbb{E}_y \int \prod_{a,k} dh_k^a\, e^{\psi_h^{(n)}(h;s)} \right]^{\rho N} \left[ \mathbb{E}_y \int \prod_{a,k} dh_k^a\, e^{\psi_h^{(n)}(h;s')} \right]^{(1-\rho) N}
\end{aligned} \tag{143}
$$
where we defined the two potentials
$$
\begin{aligned}
\psi_w^{(n)}(w) ={} & -\beta r \sum_a \gamma(w^a) - \sum_{a\le b} \hat{Q}_w^{ab} w^a w^b - \sum_a \hat{m}_w^a u w^a \\
\psi_h^{(n)}(h;\bar{s}) ={} & -\beta \bar{s} \sum_a \ell(y h_K^a) - \sum_{a\le b} \sum_{k=0}^{K-1} \hat{Q}_k^{ab} h_k^a h_k^b - \sum_{a,b} \sum_{k<l}^{K-1} \hat{Q}_{k,l}^{ab} h_k^a h_l^b - \sum_a \sum_{k=0}^{K-1} \hat{m}_k^a y h_k^a \\
& + \log \mathcal{N}\left( h_{>0} \,\middle|\, c \odot h_{<K} + \lambda y m_{<K} ;\, \tilde{Q} \right) + \log \mathcal{N}\left( h_0 \,\middle|\, \sqrt{\mu}\, y m_w ;\, Q_w \right) .
\end{aligned} \tag{144}
$$
We leverage the replica-symmetric ansatz, which is justified by the convexity of the Hamiltonian $H$. We assume that for all $a$ and $b$
$$
\begin{aligned}
& m_k^a = m_k , \quad \hat{m}_k^a = -\hat{m}_k , \quad m_w^a = m_w , \quad \hat{m}_w^a = -\hat{m}_w , \\
& Q_k^{ab} = Q_k J + V_k I , \quad \hat{Q}_k^{ab} = -\hat{Q}_k J + \tfrac{1}{2}(\hat{V}_k + \hat{Q}_k) I , \\
& Q_w^{ab} = Q_w J + V_w I , \quad \hat{Q}_w^{ab} = -\hat{Q}_w J + \tfrac{1}{2}(\hat{V}_w + \hat{Q}_w) I , \\
& Q_{k,l}^{ab} = Q_{k,l} J + V_{k,l} I , \quad \hat{Q}_{k,l}^{ab} = -\hat{Q}_{k,l} J + \hat{V}_{k,l} I .
\end{aligned} \tag{146}
$$
$I$ is the $n\times n$ identity and $J$ is the $n\times n$ matrix filled with ones. We introduce the $K\times K$ symmetric matrices $Q$ and $V$, filled with $(Q_k)_{0\le k\le K-1}$ and $(V_k)_{0\le k\le K-1}$ on the diagonal, and $(Q_{k,l})_{0\le k<l\le K-1}$ and $(V_{k,l})_{0\le k<l\le K-1}$ off the diagonal, such that $\tilde{Q}$ can be written in terms of Kronecker products as
$$
\tilde{Q} = Q \otimes J + V \otimes I . \tag{149}
$$
The entropic terms of $\psi_w^{(n)}$ and $\psi_h^{(n)}$ can be computed. Since we will take $n=0$ we discard subleading terms in $n$. We obtain
$$
\begin{aligned}
& \sum_a \hat{m}_w^a m_w^a = n \hat{m}_w m_w , \quad \sum_{a\le b} \hat{Q}_w^{ab} Q_w^{ab} = \frac{n}{2} (\hat{V}_w V_w + \hat{V}_w Q_w - V_w \hat{Q}_w) , \\
& \sum_a \hat{m}_k^a m_k^a = n \hat{m}_k m_k , \quad \sum_{a\le b} \hat{Q}_k^{ab} Q_k^{ab} = \frac{n}{2} (\hat{V}_k V_k + \hat{V}_k Q_k - V_k \hat{Q}_k) , \quad \sum_{a,b} \hat{Q}_{k,l}^{ab} Q_{k,l}^{ab} = n (\hat{V}_{k,l} V_{k,l} + \hat{V}_{k,l} Q_{k,l} - V_{k,l} \hat{Q}_{k,l}) .
\end{aligned} \tag{150}
$$
The Gaussian densities can be made explicit, again keeping the leading order in $n$ and using the Sherman-Morrison formula for a rank-one update of a matrix:
$$
\begin{aligned}
& Q_w^{-1} = \frac{1}{V_w} I - \frac{Q_w}{V_w^2} J , \quad \log\det Q_w = n \frac{Q_w}{V_w} + n \log V_w , \\
& \tilde{Q}^{-1} = V^{-1} \otimes I - (V^{-1} Q V^{-1}) \otimes J , \quad \log\det \tilde{Q} = n \operatorname{tr}(V^{-1} Q) + n \log\det V .
\end{aligned} \tag{152}
$$
Then we can factorize the replica by introducing random Gaussian variables:
$$
\int \prod_a dw^a\, e^{\psi_w^{(n)}(w)} = \int \prod_a dw^a\, e^{\sum_a \log P_W(w^a) + \frac{1}{2} \hat{Q}_w w^T J w - \frac{1}{2} \hat{V}_w w^T w + u \hat{m}_w^T w} = \mathbb{E}_\varsigma \left( \int dw\, e^{\psi_w(w)} \right)^n \tag{154}
$$
where $ς∼N(0,1)$ and the potential is
$$
\psi_w(w) = \log P_W(w) - \frac{1}{2} \hat{V}_w w^2 + \left( \sqrt{\hat{Q}_w}\, \varsigma + u \hat{m}_w \right) w ; \tag{155}
$$
and similarly
$$
\begin{aligned}
\int \prod_{a,k} dh_k^a\, e^{\psi_h^{(n)}(h;\bar{s})} ={} & \int \prod_{a,k} dh_k^a\, e^{-\beta \bar{s} \sum_a \ell(y h_K^a) + \sum_{k=0}^{K-1} \left( \frac{1}{2} \hat{Q}_k h_k^T J h_k - \frac{1}{2} \hat{V}_k h_k^T h_k + y \hat{m}_k^T h_k \right) + \sum_{k<l}^{K-1} \left( \hat{Q}_{k,l} h_k^T J h_l - \hat{V}_{k,l} h_k^T h_l \right)} \\
& e^{-\frac{1}{2} (h_0 - \sqrt{\mu}\, y m_w)^T \left( \frac{1}{V_w} I - \frac{Q_w}{V_w^2} J \right) (h_0 - \sqrt{\mu}\, y m_w) - \frac{n}{2} \frac{Q_w}{V_w} - \frac{n}{2} \log V_w} \\
& e^{-\frac{1}{2} \sum_{k,l}^K (h_k - c_k h_{k-1} - \lambda y m_{k-1})^T \left( (V^{-1})_{k-1,l-1} I - (V^{-1} Q V^{-1})_{k-1,l-1} J \right) (h_l - c_l h_{l-1} - \lambda y m_{l-1}) - \frac{n}{2} \operatorname{tr}(V^{-1} Q) - \frac{n}{2} \log\det V} \\
={} & \mathbb{E}_{\xi,\chi,\zeta} \left( \int \prod_{k=0}^K dh_k\, e^{\psi_h(h;\bar{s})} \right)^n
\end{aligned} \tag{156}
$$
where $ξ∼N(0,I_K)$ , $χ∼N(0,I_K)$ , $ζ∼N(0,1)$ and the potential is
$$
\begin{aligned}
\psi_h(h;\bar{s}) ={} & -\beta \bar{s}\, \ell(y h_K) - \frac{1}{2} h_{<K}^T \hat{V} h_{<K} + \left( \xi^T \hat{Q}^{1/2} + y \hat{m}^T \right) h_{<K} \\
& + \log \mathcal{N}\left( h_0 \,\middle|\, \sqrt{\mu}\, y m_w + \sqrt{Q_w}\, \zeta ;\, V_w \right) + \log \mathcal{N}\left( h_{>0} \,\middle|\, c \odot h_{<K} + \lambda y m + Q^{1/2} \chi ;\, V \right) ;
\end{aligned} \tag{158}
$$
where $h_{>0}=(h_1,\dots,h_K)\in\mathbb{R}^K$, $h_{<K}=(h_0,\dots,h_{K-1})\in\mathbb{R}^K$, $c\odot h_{<K}=(c_1h_0,\dots,c_Kh_{K-1})$, $\hat{m}=(\hat{m}_0,\dots,\hat{m}_{K-1})\in\mathbb{R}^K$, $m=(m_0,\dots,m_{K-1})\in\mathbb{R}^K$, and $\hat{Q}$ and $\hat{V}$ are the $K\times K$ symmetric matrices filled with $(\hat{Q}_k)_{0\le k\le K-1}$ and $(\hat{V}_k)_{0\le k\le K-1}$ on the diagonal, and $(\hat{Q}_{k,l})_{0\le k<l\le K-1}$ and $(\hat{V}_{k,l})_{0\le k<l\le K-1}$ off the diagonal. We used that $\mathbb{E}_\zeta e^{-\frac{n}{2}\frac{Q_w}{V_w}\zeta^2}=e^{-\frac{n}{2}\frac{Q_w}{V_w}}$ in the limit $n\to 0$ to factorize $\sqrt{Q_w}\,\zeta$, and the same for $Q^{1/2}\chi$.
We pursue the computation:
$$
\begin{aligned}
\mathbb{E} Z^n \propto{} & \int d\hat{m}_w\, dm_w\, e^{N n \hat{m}_w m_w}\, d\hat{Q}_w\, dQ_w\, d\hat{V}_w\, dV_w\, e^{N \frac{n}{2} (\hat{V}_w V_w + \hat{V}_w Q_w - V_w \hat{Q}_w)} \prod_{k=0}^{K-1} d\hat{m}_k\, dm_k\, e^{N n \hat{m}^T m} \\
& \prod_{k=0}^{K-1} d\hat{Q}_k\, dQ_k\, d\hat{V}_k\, dV_k \prod_{k<l}^{K-1} d\hat{Q}_{k,l}\, dQ_{k,l}\, d\hat{V}_{k,l}\, dV_{k,l}\, e^{N \frac{n}{2} \operatorname{tr}\left( \hat{V} V + \hat{V} Q - V \hat{Q} \right)} \\
& \left[ \mathbb{E}_{u,\varsigma} \left( \int dw\, e^{\psi_w(w)} \right)^n \right]^{N/\alpha} \left[ \mathbb{E}_{y,\xi,\chi,\zeta} \left( \int \prod_{k=0}^K dh_k\, e^{\psi_h(h;s)} \right)^n \right]^{\rho N} \left[ \mathbb{E}_{y,\xi,\chi,\zeta} \left( \int \prod_{k=0}^K dh_k\, e^{\psi_h(h;s')} \right)^n \right]^{(1-\rho) N} \\
:={} & \int d\Theta\, d\hat{\Theta}\, e^{N \varphi^{(n)}(\Theta,\hat{\Theta})} .
\end{aligned} \tag{159}
$$
where $Θ=\{m_w,Q_w,V_w,m,Q,V\}$ and $\hat{Θ}=\{\hat{m}_w,\hat{Q}_w,\hat{V}_w,\hat{m},\hat{Q},\hat{V}\}$ are the sets of the order parameters. We can now take the limit $N→∞$ thanks to Laplace’s method.
$$
-\beta f \propto \frac{1}{N} \frac{\partial}{\partial n}\bigg|_{n=0} \int d\Theta\, d\hat{\Theta}\, e^{N \varphi^{(n)}(\Theta,\hat{\Theta})} = \operatorname*{extr}_{\Theta,\hat{\Theta}} \frac{\partial}{\partial n}\bigg|_{n=0} \varphi^{(n)}(\Theta,\hat{\Theta}) := \operatorname*{extr}_{\Theta,\hat{\Theta}} \varphi(\Theta,\hat{\Theta}) , \tag{161}
$$
where we extremize the following free entropy $φ$ :
$$
\begin{aligned}
\varphi ={} & \frac{1}{2} \left( \hat{V}_w V_w + \hat{V}_w Q_w - V_w \hat{Q}_w \right) - \hat{m}_w m_w + \frac{1}{2} \operatorname{tr}\left( \hat{V} V + \hat{V} Q - V \hat{Q} \right) - \hat{m}^T m \\
& + \frac{1}{\alpha} \mathbb{E}_{u,\varsigma} \log \int dw\, e^{\psi_w(w)} + \rho\, \mathbb{E}_{y,\xi,\zeta,\chi} \log \int \prod_{k=0}^K dh_k\, e^{\psi_h(h;s)} + (1-\rho)\, \mathbb{E}_{y,\xi,\zeta,\chi} \log \int \prod_{k=0}^K dh_k\, e^{\psi_h(h;s')} .
\end{aligned} \tag{164}
$$
We take the limit $\beta\to\infty$. Later we will differentiate $\varphi$ with respect to the order parameters or to $\bar{s}$, and these derivatives simplify in that limit. We introduce the measures
$$
dP_w = \frac{dw\, e^{\psi_w(w)}}{\int dw\, e^{\psi_w(w)}} , \quad dP_h = \frac{\prod_{k=0}^K dh_k\, e^{\psi_h(h;\bar{s}=1)}}{\int \prod_{k=0}^K dh_k\, e^{\psi_h(h;\bar{s}=1)}} , \quad dP_h' = \frac{\prod_{k=0}^K dh_k\, e^{\psi_h(h;\bar{s}=0)}}{\int \prod_{k=0}^K dh_k\, e^{\psi_h(h;\bar{s}=0)}} . \tag{165}
$$
We have to rescale the order parameters to avoid a degenerate solution when $\beta\to\infty$ (we recall that, in $\psi_w$, $\log P_W(w) \propto \beta$). We take
$$
\begin{aligned}
& \hat{m}_w \to \beta \hat{m}_w , \quad \hat{Q}_w \to \beta^2 \hat{Q}_w , \quad \hat{V}_w \to \beta \hat{V}_w , \quad V_w \to \beta^{-1} V_w , \\
& \hat{m} \to \beta \hat{m} , \quad \hat{Q} \to \beta^2 \hat{Q} , \quad \hat{V} \to \beta \hat{V} , \quad V \to \beta^{-1} V .
\end{aligned} \tag{166}
$$
So we obtain that $f=-\varphi$. Then $dP_w$, $dP_h$ and $dP_h'$ are peaked around their maxima and can be approximated by Gaussian measures. We define
$$
w^* = \operatorname*{argmax}_w \psi_w(w) , \quad h^* = \operatorname*{argmax}_h \psi_h(h;\bar{s}=1) , \quad h'^* = \operatorname*{argmax}_h \psi_h(h;\bar{s}=0) . \tag{168}
$$
Then, for a function $g$ of $h$, the expectation under $P_h$ is $\mathbb{E}_{P_h} g(h) = g(h^*)$ and the covariance is $\mathrm{Cov}_{P_h}(h) = -\left(\nabla\nabla \psi_h(h^*)\right)^{-1}$, with $\nabla\nabla$ the Hessian; and similarly for $dP_w$ and $dP_h'$.
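For $\ell_2$ regularization, $\gamma(w)=w^2/2$, the maximizer $w^*$ is explicit and the resulting self-consistent equation $m_w = \frac{1}{\alpha}\mathbb{E}_{u,\varsigma}\, u w^*$ can be checked by Monte Carlo. The sketch below uses illustrative (hypothetical) values of the conjugate order parameters:

```python
import numpy as np

# Monte Carlo check of the Laplace/argmax step for gamma(w) = w^2/2:
# the maximizer of psi_w (after the beta rescaling) is
#   w* = (sqrt(Qhat_w) * varsigma + u * mhat_w) / (r + Vhat_w),
# so m_w = (1/alpha) E[u w*] should equal mhat_w / (alpha (r + Vhat_w)).
# Parameter values are illustrative, not a solution of the fixed point.
rng = np.random.default_rng(2)
alpha, r = 2.0, 5.0
mhat_w, Qhat_w, Vhat_w = 0.7, 0.3, 0.4     # hypothetical conjugate parameters

n_samples = 1_000_000
u = rng.standard_normal(n_samples)         # latent Gaussian u
sig = rng.standard_normal(n_samples)       # replica-coupling Gaussian varsigma
w_star = (np.sqrt(Qhat_w) * sig + u * mhat_w) / (r + Vhat_w)

m_w_mc = np.mean(u * w_star) / alpha
m_w_th = mhat_w / (alpha * (r + Vhat_w))
print(m_w_mc, m_w_th)
```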
Last we compute the expected errors and accuracies. We differentiate the free energy $f$ with respect to $s$ and $s^\prime$ to obtain that
$$
E_{\mathrm{train}} = \mathbb{E}_{y,\xi,\zeta,\chi}\, \ell(y h_K^*) , \quad E_{\mathrm{test}} = \mathbb{E}_{y,\xi,\zeta,\chi}\, \ell(y h_K'^*) . \tag{169}
$$
Augmenting $H$ with the observable $\frac{1}{|\hat{R}|}\sum_{i\in\hat{R}} \delta_{y_i = \operatorname{sign}(h(w)_i)}$ and following the same steps gives the expected accuracies
$$
\mathrm{Acc}_{\mathrm{train}} = \mathbb{E}_{y,\xi,\zeta,\chi}\, \delta_{y = \operatorname{sign}(h_K^*)} , \quad \mathrm{Acc}_{\mathrm{test}} = \mathbb{E}_{y,\xi,\zeta,\chi}\, \delta_{y = \operatorname{sign}(h_K'^*)} . \tag{170}
$$
### A.2 Self-consistent equations
The two formulas (169) and (170) above are valid only at the values of the order parameters that extremize the free entropy. We seek the extremizer of $\varphi$. The extremality condition $\nabla_{\Theta,\hat{\Theta}}\varphi = 0$ gives the following self-consistent equations:
$$
\begin{aligned}
& m_w = \frac{1}{\alpha} \mathbb{E}_{u,\varsigma}\, u w^* \qquad m = \mathbb{E}_{y,\xi,\zeta,\chi}\, y \left( \rho\, h_{<K}^* + (1-\rho)\, h_{<K}'^* \right) \\
& Q_w = \frac{1}{\alpha} \mathbb{E}_{u,\varsigma} (w^*)^2 \qquad Q = \mathbb{E}_{y,\xi,\zeta,\chi} \left( \rho\, (h_{<K}^*)^{\otimes 2} + (1-\rho)\, (h_{<K}'^*)^{\otimes 2} \right) \\
& V_w = \frac{1}{\alpha} \frac{1}{\sqrt{\hat{Q}_w}} \mathbb{E}_{u,\varsigma}\, \varsigma w^* \qquad V = \mathbb{E}_{y,\xi,\zeta,\chi} \left( \rho\, \mathrm{Cov}_{P_h}(h_{<K}) + (1-\rho)\, \mathrm{Cov}_{P_{h'}}(h_{<K}) \right) \\
& \hat{m}_w = \frac{\sqrt{\mu}}{V_w} \mathbb{E}_{y,\xi,\zeta,\chi}\, y \left( \rho\, (h_0^* - \sqrt{\mu}\, y m_w) + (1-\rho)\, (h_0'^* - \sqrt{\mu}\, y m_w) \right) \\
& \hat{Q}_w = \frac{1}{V_w^2} \mathbb{E}_{y,\xi,\zeta,\chi} \left( \rho\, (h_0^* - \sqrt{\mu}\, y m_w - \sqrt{Q_w}\, \zeta)^2 + (1-\rho)\, (h_0'^* - \sqrt{\mu}\, y m_w - \sqrt{Q_w}\, \zeta)^2 \right) \\
& \hat{V}_w = \frac{1}{V_w} - \frac{1}{V_w^2} \mathbb{E}_{y,\xi,\zeta,\chi} \left( \rho\, \mathrm{Cov}_{P_h}(h_0) + (1-\rho)\, \mathrm{Cov}_{P_{h'}}(h_0) \right) \\
& \hat{m} = \lambda V^{-1} \mathbb{E}_{y,\xi,\zeta,\chi}\, y \left( \rho\, (h_{>0}^* - c \odot h_{<K}^* - \lambda y m) + (1-\rho)\, (h_{>0}'^* - c \odot h_{<K}'^* - \lambda y m) \right) \\
& \hat{Q} = V^{-1} \mathbb{E}_{y,\xi,\zeta,\chi} \left( \rho\, (h_{>0}^* - c \odot h_{<K}^* - \lambda y m - Q^{1/2} \chi)^{\otimes 2} + (1-\rho)\, (h_{>0}'^* - c \odot h_{<K}'^* - \lambda y m - Q^{1/2} \chi)^{\otimes 2} \right) V^{-1} \\
& \hat{V} = V^{-1} - V^{-1} \mathbb{E}_{y,\xi,\zeta,\chi} \left( \rho\, \mathrm{Cov}_{P_h}(h_{>0} - c \odot h_{<K}) + (1-\rho)\, \mathrm{Cov}_{P_{h'}}(h_{>0} - c \odot h_{<K}) \right) V^{-1}
\end{aligned} \tag{171}
$$
We introduced the covariance $\mathrm{Cov}_P(x) = \mathbb{E}_P(x x^T) - \mathbb{E}_P(x) \mathbb{E}_P(x^T)$ and the outer product $x^{\otimes 2} = x x^T$. We used Stein's lemma to simplify the derivatives of $Q^{1/2}$ and $\hat{Q}^{1/2}$, and to transform the expression of $\hat{V}_w$ into a form in terms of covariances that is more accurate for numerical computation. We used the identities $2 x^T Q^{1/2} \frac{\partial Q^{1/2}}{\partial E_{k,l}} x = x^T E_{k,l} x$ and $-x^T V \frac{\partial V^{-1}}{\partial E_{k,l}} V x = x^T E_{k,l} x$ for any elementary matrix $E_{k,l}$ and any vector $x$. We also have $\nabla_V \log\det V = V^{-1}$, by the cofactor expansion. Last, we kept the leading order in $\beta$ with the approximations $Q + V \approx Q$ and $\hat{Q} - \hat{V} \approx \hat{Q}$.
These self-consistent equations are reproduced in the main part III.1.1.
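In practice, systems of self-consistent equations like these are solved by damped fixed-point iteration. The sketch below shows the generic scheme with a toy scalar update map standing in for the true one (which, in the actual computation, evaluates the expectations over $y,\xi,\zeta,\chi$ at $h^*$ and $h'^*$):

```python
import numpy as np

# Schematic damped fixed-point iteration:
#   theta_{t+1} = (1 - eta) * theta_t + eta * F(theta_t).
def solve_fixed_point(F, theta0, eta=0.5, tol=1e-10, max_iter=10_000):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        theta_new = (1 - eta) * theta + eta * F(theta)
        if np.max(np.abs(theta_new - theta)) < tol:
            return theta_new
        theta = theta_new
    return theta

# Toy contraction standing in for the true update map of the order parameters.
F = lambda th: np.tanh(2 * th) + 0.1
theta_star = solve_fixed_point(F, np.array([0.5]))
print(theta_star, F(theta_star))           # the two coincide at convergence
```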
### A.3 Solution for ridge regression
We take quadratic $\ell$ and $\gamma$. Moreover we assume there are no residual connections, $c=0$; this greatly simplifies the analysis since the covariances of $h$ under $P_h$ and $P_h'$ become diagonal. We have
$$
\begin{aligned}
& \mathrm{Cov}_{P_h}(h) = \mathrm{diag}\left( \frac{V_w}{1 + V_w \hat{V}_0}, \frac{V_0}{1 + V_0 \hat{V}_1}, \dots, \frac{V_{K-2}}{1 + V_{K-2} \hat{V}_{K-1}}, \frac{V_{K-1}}{1 + V_{K-1}} \right) \\
& \mathrm{Cov}_{P_{h'}}(h) = \mathrm{diag}\left( \frac{V_w}{1 + V_w \hat{V}_0}, \frac{V_0}{1 + V_0 \hat{V}_1}, \dots, \frac{V_{K-2}}{1 + V_{K-2} \hat{V}_{K-1}}, V_{K-1} \right) \\
& h^* = \mathrm{Cov}_{P_h}(h) \left( \begin{pmatrix} \hat{Q}^{1/2} \xi + y \hat{m} \\ y \end{pmatrix} + \begin{pmatrix} \frac{1}{V_w} (\sqrt{\mu}\, y m_w + \sqrt{Q_w}\, \zeta) \\ V^{-1} \left( \lambda y m + Q^{1/2} \chi \right) \end{pmatrix} \right) \\
& h'^* = \mathrm{Cov}_{P_{h'}}(h) \left( \begin{pmatrix} \hat{Q}^{1/2} \xi + y \hat{m} \\ 0 \end{pmatrix} + \begin{pmatrix} \frac{1}{V_w} (\sqrt{\mu}\, y m_w + \sqrt{Q_w}\, \zeta) \\ V^{-1} \left( \lambda y m + Q^{1/2} \chi \right) \end{pmatrix} \right)
\end{aligned} \tag{180}
$$
where diag denotes the diagonal matrix with the given entries. We packed the elements into block vectors of size $K+1$. The self-consistent equations can then be written explicitly:
$$
\begin{aligned}
& m_w = \frac{1}{\alpha} \frac{\hat{m}_w}{r + \hat{V}_w} \qquad V_w = \frac{1}{\alpha} \frac{1}{r + \hat{V}_w} \qquad Q_w = \frac{1}{\alpha} \frac{\hat{Q}_w + \hat{m}_w^2}{(r + \hat{V}_w)^2} \\
& m = V \left( \hat{m} + \begin{pmatrix} \sqrt{\mu}\, \frac{m_w}{V_w} \\ \lambda V_{<K-1}^{-1} m_{<K-1} \end{pmatrix} \right) \qquad V = \mathrm{diag}\left( \frac{V_w}{1 + V_w \hat{V}_0}, \frac{V_0}{1 + V_0 \hat{V}_1}, \dots, \frac{V_{K-2}}{1 + V_{K-2} \hat{V}_{K-1}} \right) \\
& \hat{m}_w = \frac{\sqrt{\mu}}{V_w} (m_0 - \sqrt{\mu}\, m_w) \qquad \hat{V}_w = \frac{\hat{V}_0}{1 + V_w \hat{V}_0} \\
& \hat{m} = \lambda \hat{V} \left( \begin{pmatrix} \hat{V}_{>0}^{-1} \hat{m}_{>0} \\ 1 \end{pmatrix} - \lambda m \right) \qquad \hat{V} = \mathrm{diag}\left( \frac{\hat{V}_1}{1 + V_0 \hat{V}_1}, \dots, \frac{\hat{V}_{K-1}}{1 + V_{K-2} \hat{V}_{K-1}}, \frac{\rho}{1 + V_{K-1}} \right)
\end{aligned} \tag{184}
$$
and
$$
\begin{aligned}
& \hat{Q}_w = \frac{V_0^2}{V_w^2} \hat{Q}_{0,0} + \left( \frac{V_0}{V_w} - 1 \right)^2 \frac{Q_w}{V_w^2} + \frac{\hat{m}_w^2}{\mu} \\
& Q = V \left( \hat{Q} + \begin{pmatrix} \frac{Q_w}{V_w^2} & 0 \\ 0 & V_{<K-1}^{-1} Q_{<K-1} V_{<K-1}^{-1} \end{pmatrix} \right) V + m^{\otimes 2} \\
& \hat{Q} = \hat{V} \left( \begin{pmatrix} \hat{V}_{>0}^{-1} \hat{Q}_{>0} \hat{V}_{>0}^{-1} & 0 \\ 0 & 0 \end{pmatrix} + \rho \begin{pmatrix} I_{K-1} & 0 \\ 0 & \rho^{-1} \end{pmatrix} Q \begin{pmatrix} I_{K-1} & 0 \\ 0 & \rho^{-1} \end{pmatrix} + (1-\rho) \begin{pmatrix} I_{K-1} & 0 \\ 0 & 0 \end{pmatrix} Q \begin{pmatrix} I_{K-1} & 0 \\ 0 & 0 \end{pmatrix} \right) \hat{V} \\
& \phantom{\hat{Q} =}\; + \rho \begin{pmatrix} \frac{1}{\lambda} \hat{m}_{<K-1} \\ \frac{1}{\lambda\rho} \hat{m}_{K-1} \end{pmatrix}^{\otimes 2} + (1-\rho) \begin{pmatrix} \frac{1}{\lambda} \hat{m}_{<K-1} \\ 0 \end{pmatrix}^{\otimes 2}
\end{aligned} \tag{188}
$$
We used the notations $m_{<K-1}=(m_k)_{0\le k<K-1}$, $\hat{m}_{<K-1}=(\hat{m}_k)_{0\le k<K-1}$, $\hat{m}_{>0}=(\hat{m}_k)_{0<k\le K-1}$, $Q_{<K-1}=(Q_{k,l})_{0\le k,l<K-1}$, $Q_{>0}=(Q_{k,l})_{0<k,l\le K-1}$, $V_{<K-1}=(V_{k,l})_{0\le k,l<K-1}$ and $V_{>0}=(V_{k,l})_{0<k,l\le K-1}$. We simplified the equations by combining the expressions of $V$, $\hat{V}$, $m$ and $\hat{m}$: the above system of equations is equivalent to the generic equations only at the fixed point. The expected losses and accuracies are
$$
\begin{aligned}
& E_{\mathrm{train}} = \frac{1}{2\rho} \hat{Q}_{K-1,K-1} \qquad E_{\mathrm{test}} = \frac{1}{2\rho} (1 + V_{K-1,K-1})^2 \hat{Q}_{K-1,K-1} \\
& \mathrm{Acc}_{\mathrm{train}} = \frac{1}{2} \left( 1 + \mathrm{erf}\left( \frac{V_{K-1,K-1} + \lambda m_{K-1}}{\sqrt{2 Q_{K-1,K-1}}} \right) \right) \qquad \mathrm{Acc}_{\mathrm{test}} = \frac{1}{2} \left( 1 + \mathrm{erf}\left( \frac{\lambda m_{K-1}}{\sqrt{2 Q_{K-1,K-1}}} \right) \right) .
\end{aligned} \tag{191}
$$
To obtain a simple solution we take the limit $r→∞$ . The solution of this system is then
$$
\begin{aligned}
& m_w = \frac{\rho \sqrt{\mu}}{\alpha r} \lambda^K \qquad V_w = \frac{1}{\alpha r} \qquad Q_w = \frac{1}{\alpha r^2} \left( \rho + \rho^2 \mu \lambda^{2K} + \rho^2 \sum_{l=1}^K \lambda^{2l} \right) \\
& m_k = \frac{\rho}{\alpha r} \left( \mu \lambda^{K+k} + \sum_{l=0}^k \lambda^{K-k+2l} \right) \qquad V_{k,k} = \frac{1}{\alpha r} \\
& \hat{m}_w = \rho \sqrt{\mu}\, \lambda^K \qquad \hat{V}_w = \rho \qquad \hat{Q}_w = \rho + \rho^2 \sum_{l=1}^K \lambda^{2l} \\
& \hat{m}_k = \rho \lambda^{K-k} \qquad \hat{V}_{k,k} = \rho \qquad \hat{Q}_{k,k} = \rho + \rho^2 \sum_{l=1}^{K-1-k} \lambda^{2l}
\end{aligned} \tag{193}
$$
and
$$
Q_{k,k} = \frac{\rho}{\alpha^2 r^2} \left( \alpha \left( 1 + \rho \mu \lambda^{2K} + \rho \sum_{l=1}^K \lambda^{2l} \right) + \sum_{m=0}^k \left( 1 + \rho \sum_{l=1}^{K-1-m} \lambda^{2l} + \rho \left( \mu \lambda^{K+m} + \sum_{l=0}^m \lambda^{K-m+2l} \right)^2 \right) \right) \tag{197}
$$
We did not specify the off-diagonal parts of $Q$ and $\hat{Q}$ since they do not enter the computation of the losses and accuracies. The expressions for $m$ and $Q$ are reproduced in the main part III.1.2.
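The closed-form large-$r$ expressions are straightforward to evaluate numerically. The sketch below, with illustrative parameter values, computes the test accuracy of eq. (191) from eqs. (193) and (197); note that $m_{K-1} \sim 1/r$ and $\sqrt{Q_{K-1,K-1}} \sim 1/r$, so the accuracy is independent of $r$ at this order.

```python
from math import erf, sqrt

# Evaluate the large-r ridge solution, eqs. (193) and (197), and the test
# accuracy of eq. (191).  Parameter values are illustrative.
lam, mu, rho, alpha, K, r = 1.2, 1.0, 0.1, 2.0, 3, 100.0

def m_k(k):
    # eq. (193): m_k = rho/(alpha r) (mu lam^{K+k} + sum_{l=0}^k lam^{K-k+2l})
    return rho / (alpha * r) * (mu * lam**(K + k)
                                + sum(lam**(K - k + 2 * l) for l in range(k + 1)))

def Q_kk(k):
    # eq. (197)
    outer = alpha * (1 + rho * mu * lam**(2 * K)
                     + rho * sum(lam**(2 * l) for l in range(1, K + 1)))
    inner = sum(1 + rho * sum(lam**(2 * l) for l in range(1, K - m))
                + rho * (mu * lam**(K + m)
                         + sum(lam**(K - m + 2 * l) for l in range(m + 1)))**2
                for m in range(k + 1))
    return rho / (alpha**2 * r**2) * (outer + inner)

# eq. (191): Acc_test = (1 + erf(lam m_{K-1} / sqrt(2 Q_{K-1,K-1}))) / 2
acc_test = 0.5 * (1 + erf(lam * m_k(K - 1) / sqrt(2 * Q_kk(K - 1))))
print(acc_test)
```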
## Appendix B Asymptotic characterization of the continuous GCN, for asymmetric and symmetrized graphs
In this part we derive the asymptotic characterization of the continuous GCN for both the asymmetric and symmetrized graphs $\tilde{A}$ and $\tilde{A}^s$ . As shown in the main section III.2 this architecture is particularly relevant since it can be close to the Bayes-optimality.
We start by discretizing the GCN and deriving its free energy and the self-consistent equations on its order parameters. Then we take the continuous limit $K→∞$ , jointly with an expansion around large regularization $r$ . The derivation of the free energy and of the self-consistent equations follows the same steps as in the previous section A; in particular for the asymmetric case the expressions are identical up to the point where the continuous limit is taken.
To deal with both cases, asymmetric or symmetrized, we define $(\delta_e, \tilde{A}^e, A^{g,e}, \lambda^e, \Xi^e) \in \{(0, \tilde{A}, A^g, \lambda, \Xi),\, (1, \tilde{A}^s, A^{g,s}, \lambda^s, \Xi^s)\}$. In particular $\delta_e = 0$ for the asymmetric case and $\delta_e = 1$ for the symmetrized one. We recall that $\tilde{A}^s$ is the symmetrized $\tilde{A}$, with effective signal $\lambda^s = \sqrt{2}\lambda$. $\tilde{A}^e$ admits the following Gaussian equivalent [54, 22, 20]:
$$
\tilde{A}^e \approx A^{g,e} = \frac{\lambda^e}{\sqrt{N}} y y^T + \Xi^e , \tag{198}
$$
with $\Xi_{ij}$ i.i.d. standard Gaussian for all $i$ and $j$, while $\Xi^s$ is drawn from the Gaussian orthogonal ensemble.
### B.1 Derivation of the free energy
The continuous GCN is defined by the output function
$$
h(w) = e^{\frac{t}{\sqrt{N}} \tilde{A}^e}\, \frac{1}{\sqrt{N}} X w . \tag{199}
$$
Its discretization at finite $K$ is
$$
h(w) = h_K , \qquad h_k = \left( I_N + \frac{t}{K\sqrt{N}} \tilde{A}^e \right) h_{k-1} , \qquad h_0 = \frac{1}{\sqrt{N}} X w . \tag{200}
$$
It can be mapped onto the discrete GCN of the previous section A by taking residual connections $c_k = K/t$, up to a global rescaling of the hidden states.
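A quick numerical check of this discretization, with a random placeholder matrix in place of the graph, shows $\left(I + \frac{t}{K\sqrt{N}}A\right)^K h_0$ converging to $e^{\frac{t}{\sqrt{N}}A} h_0$ as $K$ grows:

```python
import numpy as np
from scipy.linalg import expm

# Convergence of the finite-K discretization to the continuous GCN output.
rng = np.random.default_rng(3)
N, t = 50, 1.5
A = rng.standard_normal((N, N))            # stand-in for the graph matrix
h0 = rng.standard_normal(N)

h_cont = expm(t / np.sqrt(N) * A) @ h0     # continuous GCN, eq. (199)
errs = []
for K in (4, 16, 64):
    h = h0.copy()
    for _ in range(K):
        h = h + (t / K) * (A @ h) / np.sqrt(N)   # one discrete step, eq. (200)
    errs.append(np.linalg.norm(h - h_cont) / np.linalg.norm(h_cont))

print(errs)   # relative error decreases roughly as 1/K
```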
The free energy is $-\beta N f = \partial_n \mathbb{E}_{u,\Xi^e,W,y} Z^n \big|_{n=0}$, where the partition function is
$$
Z = \int \prod_\nu^M dw_\nu\, e^{-\beta r \gamma(w_\nu)}\, e^{-\beta s \sum_{i\in R} \ell(y_i h(w)_i) - \beta s' \sum_{i\in R'} \ell(y_i h(w)_i)} . \tag{201}
$$
The expectation of the replicated partition function is
$$
\begin{aligned}
\mathbb{E} Z^n \propto{} & \mathbb{E}_{u,\Xi^e,W,y} \int \prod_a^n \prod_\nu^M dw_\nu^a\, e^{-\beta r \gamma(w_\nu^a)} \prod_a^n \prod_i^N \prod_{k=0}^K dh_{i,k}^a\, dq_{i,k}^a\, e^{-\beta s \sum_{a,\, i\in R} \ell(y_i h_{i,K}^a) - \beta s' \sum_{a,\, i\in R'} \ell(y_i h_{i,K}^a)} \\
& e^{\sum_{a,i} \sum_{k=1}^K i q_{i,k}^a \left( \frac{K}{t} h_{i,k}^a - \frac{1}{\sqrt{N}} \sum_j \left( \sqrt{N} \frac{K}{t} \delta_{i,j} + \frac{\lambda^e}{\sqrt{N}} y_i y_j + \Xi^e_{ij} \right) h_{j,k-1}^a \right) + \sum_{a,i} i q_{i,0}^a \left( h_{i,0}^a - \frac{1}{\sqrt{N}} \sum_\nu \left( \sqrt{\frac{\mu}{N}} y_i u_\nu + W_{i\nu} \right) w_\nu^a \right)} \\
={} & \mathbb{E}_{u,y} \int \prod_{a,\nu} dw_\nu^a\, e^{-\beta r \gamma(w_\nu^a)} \prod_{a,i,k} dh_{i,k}^a\, dq_{i,k}^a\, e^{-\beta s \sum_{a,\, i\in R} \ell(y_i h_{i,K}^a) - \beta s' \sum_{a,\, i\in R'} \ell(y_i h_{i,K}^a) + i \sum_{a,i,k>0} q_{i,k}^a \left( \frac{K}{t} (h_{i,k}^a - h_{i,k-1}^a) - \frac{\lambda^e}{\sqrt{N}} y_i \sum_j y_j h_{j,k-1}^a \right)} \\
& e^{-\frac{1}{2N} \sum_{i,j} \sum_{a,b} \sum_{k>0,\, l>0} \left( q_{i,k}^a h_{j,k-1}^a q_{i,l}^b h_{j,l-1}^b + \delta_e\, q_{i,k}^a h_{j,k-1}^a q_{j,l}^b h_{i,l-1}^b \right) - i \sum_{a,i} \frac{\sqrt{\mu}}{N} y_i q_{i,0}^a \sum_\nu u_\nu w_\nu^a - \frac{1}{2N} \sum_{i,\nu,a,b} q_{i,0}^a q_{i,0}^b w_\nu^a w_\nu^b} .
\end{aligned} \tag{202}
$$
Compared to part A, the symmetry of $\Xi^s$ makes its expectation produce an additional cross term. We symmetrized $\sum_{i<j}$ into $\sum_{i,j}$, neglecting the diagonal terms. We introduce new order parameters between $h$ and its conjugate $q$. We set, for all $a$ and $b$ and for $0 < k \le K$ and $0 < l \le K$,
$$
\begin{aligned}
& m_w^a = \frac{1}{N} \sum_\nu u_\nu w_\nu^a , \quad Q_w^{ab} = \frac{1}{N} \sum_\nu w_\nu^a w_\nu^b , \\
& m_k^a = \frac{1}{N} \sum_j y_j h_{j,k-1}^a , \quad Q_{h,kl}^{ab} = \frac{1}{N} \sum_j h_{j,k-1}^a h_{j,l-1}^b , \\
& Q_{q,kl}^{ab} = \frac{1}{N} \sum_j q_{j,k}^a q_{j,l}^b , \quad Q_{qh,kl}^{ab} = \frac{1}{N} \sum_j q_{j,k}^a h_{j,l-1}^b .
\end{aligned} \tag{204}
$$
We introduce these quantities via Dirac delta functions. Their conjugates are $\hat{m}_w^a$, $\hat{Q}_w^{ab}$, $\hat{V}_w^{ab}$, $\hat{m}^a$, $\hat{Q}^{ab}$ and $\hat{V}^{ab}$. We factorize the $\nu$ and $i$ indices. We leverage the replica-symmetric ansatz: we assume that for all $a$ and $b$
$$
m_w^a = m_w , \quad \hat{m}_w^a = -\hat{m}_w , \quad m_k^a = m_k , \quad \hat{m}_k^a = -\hat{m}_k \tag{207}
$$
and
$$
\begin{aligned}
& Q_w^{ab} = Q_w + V_w \delta_{a,b} , && \hat{Q}_w^{ab} = -\hat{Q}_w + \tfrac{1}{2} (\hat{V}_w + \hat{Q}_w) \delta_{a,b} , \\
& Q_{h,kl}^{ab} = Q_{h,kl} + V_{h,kl} \delta_{a,b} , && \hat{Q}_{h,kk}^{ab} = -\hat{Q}_{h,kk} + \tfrac{1}{2} (\hat{V}_{h,kk} + \hat{Q}_{h,kk}) \delta_{a,b} , \quad \hat{Q}_{h,kl}^{ab} = -\hat{Q}_{h,kl} + \hat{V}_{h,kl} \delta_{a,b} , \\
& Q_{q,kl}^{ab} = Q_{q,kl} + V_{q,kl} \delta_{a,b} , && \hat{Q}_{q,kk}^{ab} = -\hat{Q}_{q,kk} + \tfrac{1}{2} (\hat{V}_{q,kk} + \hat{Q}_{q,kk}) \delta_{a,b} , \quad \hat{Q}_{q,kl}^{ab} = -\hat{Q}_{q,kl} + \hat{V}_{q,kl} \delta_{a,b} , \\
& Q_{qh,kl}^{ab} = Q_{qh,kl} + V_{qh,kl} \delta_{a,b} , && \hat{Q}_{qh,kk}^{ab} = -\hat{Q}_{qh,kk} + \hat{V}_{qh,kk} \delta_{a,b} , \quad \hat{Q}_{qh,kl}^{ab} = -\hat{Q}_{qh,kl} + \hat{V}_{qh,kl} \delta_{a,b} .
\end{aligned} \tag{208}
$$
$\delta_{a,b}$ is a Kronecker delta between $a$ and $b$. $Q_h$, $Q_q$, $Q_{qh}$, $V_h$, $V_q$, $V_{qh}$, and their conjugates, written with a hat, are $K\times K$ matrices that we pack into the following $2K\times 2K$ symmetric block matrices:
$$
Q = \begin{pmatrix} Q_q & Q_{qh} \\ Q_{qh}^T & Q_h \end{pmatrix} , \quad V = \begin{pmatrix} V_q & V_{qh} \\ V_{qh}^T & V_h \end{pmatrix} , \quad \hat{Q} = \begin{pmatrix} \hat{Q}_q & \hat{Q}_{qh} \\ \hat{Q}_{qh}^T & \hat{Q}_h \end{pmatrix} , \quad \hat{V} = \begin{pmatrix} \hat{V}_q & \hat{V}_{qh} \\ \hat{V}_{qh}^T & \hat{V}_h \end{pmatrix} . \tag{212}
$$
We obtain that
$$
\begin{aligned}
\mathbb{E} Z^n \propto{} & \int d\hat{Q}_w\, d\hat{V}_w\, dQ_w\, dV_w\, d\hat{Q}\, d\hat{V}\, dQ\, dV\, e^{\frac{nN}{2} \left( \hat{V}_w V_w + \hat{V}_w Q_w - V_w \hat{Q}_w + \operatorname{tr}(\hat{V} V + \hat{V} Q - V \hat{Q}) - \operatorname{tr}(V_q V_h + V_q Q_h + V_h Q_q + \delta_e V_{qh}^2 + 2 \delta_e V_{qh} Q_{qh}) \right)} \\
& d\hat{m}_w\, dm_w\, d\hat{m}\, dm\, e^{-nN (\hat{m}_w m_w + \hat{m}^T m)} \left[ \mathbb{E}_u \int \prod_a dw^a\, e^{\psi_w^{(n)}(w)} \right]^{N/\alpha} \\
& \left[ \mathbb{E}_y \int \prod_{a,k} dh_k^a\, dq_k^a\, e^{\psi_h^{(n)}(h,q;s)} \right]^{\rho N} \left[ \mathbb{E}_y \int \prod_{a,k} dh_k^a\, dq_k^a\, e^{\psi_h^{(n)}(h,q;s')} \right]^{(1-\rho) N} \\
:={} & \int d\Theta\, d\hat{\Theta}\, e^{N \varphi^{(n)}(\Theta,\hat{\Theta})} ,
\end{aligned} \tag{214}
$$
with $Θ=\{m_w,Q_w,V_w,m,Q,V\}$ and $\hat{Θ}=\{\hat{m}_w,\hat{Q}_w,\hat{V}_w,\hat{m},\hat{Q},\hat{V}\}$ the sets of order parameters and
$$
\begin{aligned}
\psi_w^{(n)}(w) ={} & -\beta r \sum_a \gamma(w^a) - \frac{1}{2} \hat{V}_w \sum_a (w^a)^2 + \hat{Q}_w \sum_{a,b} w^a w^b + u \hat{m}_w \sum_a w^a \\
\psi_h^{(n)}(h,q;\bar{s}) ={} & -\beta \bar{s} \sum_a \ell(y h_K^a) - \frac{1}{2} V_w \sum_a (q_0^a)^2 + Q_w \sum_{a,b} q_0^a q_0^b - \frac{1}{2} \sum_a \begin{pmatrix} q_{>0}^a \\ h_{<K}^a \end{pmatrix}^T \hat{V} \begin{pmatrix} q_{>0}^a \\ h_{<K}^a \end{pmatrix} + \sum_{a,b} \begin{pmatrix} q_{>0}^a \\ h_{<K}^a \end{pmatrix}^T \hat{Q} \begin{pmatrix} q_{>0}^b \\ h_{<K}^b \end{pmatrix} \\
& + y \hat{m}^T \sum_a h_{<K}^a + i \sum_a (q_{>0}^a)^T \left( \frac{K}{t} (h_{>0}^a - h_{<K}^a) - \lambda^e y m \right) - i \sqrt{\mu}\, y m_w \sum_a q_0^a
\end{aligned} \tag{216}
$$
$u$ is a scalar standard Gaussian and $y$ is a scalar Rademacher variable. We use the notation $q_{>0}^a \in \mathbb{R}^K$ for $(q_k^a)_{k>0}$, and similarly for $h_{>0}^a$ and $h_{<K}^a = (h_k^a)_{k<K}$. We packed them into vectors of size $2K$.
We take the limit $N→∞$ thanks to Laplace’s method.
$$
-\beta f \propto \frac{1}{N} \frac{\partial}{\partial n}\bigg|_{n=0} \int d\Theta\, d\hat{\Theta}\, e^{N \varphi^{(n)}(\Theta,\hat{\Theta})} = \operatorname*{extr}_{\Theta,\hat{\Theta}} \frac{\partial}{\partial n}\bigg|_{n=0} \varphi^{(n)}(\Theta,\hat{\Theta}) := \operatorname*{extr}_{\Theta,\hat{\Theta}} \varphi(\Theta,\hat{\Theta}) , \tag{218}
$$
where we extremize the following free entropy $φ$ :
$$
\begin{aligned}
\varphi ={} & \frac{1}{2} (V_w \hat{V}_w + \hat{V}_w Q_w - V_w \hat{Q}_w) + \frac{1}{2} \operatorname{tr}(V_q \hat{V}_q + \hat{V}_q Q_q - V_q \hat{Q}_q) + \frac{1}{2} \operatorname{tr}(V_h \hat{V}_h + \hat{V}_h Q_h - V_h \hat{Q}_h) \\
& + \operatorname{tr}(V_{qh} \hat{V}_{qh}^T + \hat{V}_{qh} Q_{qh}^T - V_{qh} \hat{Q}_{qh}^T) - \frac{1}{2} \operatorname{tr}(V_q V_h + V_q Q_h + Q_q V_h + \delta_e V_{qh}^2 + 2 \delta_e V_{qh} Q_{qh}) - m_w \hat{m}_w - m^T \hat{m} \\
& + \frac{1}{\alpha} \mathbb{E}_{u,\varsigma} \log \int dw\, e^{\psi_w(w)} + \rho\, \mathbb{E}_{y,\zeta,\chi} \log \int dq\, dh\, e^{\psi_{qh}(q,h;s)} + (1-\rho)\, \mathbb{E}_{y,\zeta,\chi} \log \int dq\, dh\, e^{\psi_{qh}(q,h;s')} .
\end{aligned} \tag{221}
$$
We factorized the replicas and took the derivative with respect to $n$ by introducing independent standard Gaussian random variables $\varsigma \in \mathbb{R}$, $\zeta = \left(\begin{smallmatrix} \zeta_q \\ \zeta_h \end{smallmatrix}\right) \in \mathbb{R}^{2K}$ and $\chi \in \mathbb{R}$. The potentials are
$$
\begin{aligned}
\psi_w(w) ={} & -\beta r \gamma(w) - \frac{1}{2} \hat{V}_w w^2 + \left( \sqrt{\hat{Q}_w}\, \varsigma + u \hat{m}_w \right) w \\
\psi_{qh}(q,h;\bar{s}) ={} & -\beta \bar{s}\, \ell(y h_K) - \frac{1}{2} V_w q_0^2 - \frac{1}{2} \begin{pmatrix} q_{>0} \\ h_{<K} \end{pmatrix}^T \hat{V} \begin{pmatrix} q_{>0} \\ h_{<K} \end{pmatrix} + \begin{pmatrix} q_{>0} \\ h_{<K} \end{pmatrix}^T \hat{Q}^{1/2} \begin{pmatrix} \zeta_q \\ \zeta_h \end{pmatrix} \\
& + y h_{<K}^T \hat{m} + i q^T \left( \begin{pmatrix} 1/K & \\ & I/t \end{pmatrix} D h - \begin{pmatrix} y \sqrt{\mu}\, m_w + \sqrt{Q_w}\, \chi \\ y \lambda^e m \end{pmatrix} \right)
\end{aligned} \tag{222}
$$
We can already extremize $\varphi$ with respect to $Q$ and $V$, which yields the following equalities:
$$
\begin{aligned}
& \hat{V}_q = V_h , \quad V_q = \hat{V}_h , \quad \hat{V}_{qh} = \delta_e V_{qh}^T , \\
& \hat{Q}_q = -Q_h , \quad Q_q = -\hat{Q}_h , \quad \hat{Q}_{qh} = -\delta_e Q_{qh}^T .
\end{aligned} \tag{224}
$$
In particular this shows that in the asymmetric case, where $\delta_e = 0$, one has $\hat{V}_{qh} = \hat{Q}_{qh} = 0$ and as a consequence $V_{qh} = Q_{qh} = 0$; we then recover the potential $\psi_h$ previously derived in part A.
We assume that $\ell$ is quadratic, so that $\psi_{qh}$ can be written as the quadratic potential below. Later we will take the limit $r\to\infty$, where $h$ is small and where $\ell$ can effectively be expanded around $0$ as a quadratic potential.
$$
\psi_{qh}(q,h;\bar{s}) = -\frac{1}{2} \begin{pmatrix} q \\ h \end{pmatrix}^T \begin{pmatrix} G_q & -i G_{qh} \\ -i G_{qh}^T & G_h \end{pmatrix} \begin{pmatrix} q \\ h \end{pmatrix} + \begin{pmatrix} q \\ h \end{pmatrix}^T \begin{pmatrix} -i B_q \\ B_h \end{pmatrix} \tag{226}
$$
with
$$
\begin{aligned}
& G_q = \begin{pmatrix} V_w & 0 \\ 0 & \hat{V}_q \end{pmatrix} , \quad G_h = \begin{pmatrix} \hat{V}_h & 0 \\ 0 & \beta \bar{s} \end{pmatrix} , \quad G_{qh} = \begin{pmatrix} 1/K & 0 \\ 0 & I_K/t \end{pmatrix} D + \begin{pmatrix} 0 & 0 \\ i \hat{V}_{qh} & 0 \end{pmatrix} , \quad D = K \begin{pmatrix} 1 & & & 0 \\ -1 & \ddots & & \\ & \ddots & \ddots & \\ 0 & & -1 & 1 \end{pmatrix} , \\
& B_q = \begin{pmatrix} \sqrt{Q_w}\, \chi \\ i \left( \hat{Q}^{1/2} \zeta \right)_q \end{pmatrix} + y \begin{pmatrix} \sqrt{\mu}\, m_w \\ \lambda^e m \end{pmatrix} , \quad B_h = \begin{pmatrix} \left( \hat{Q}^{1/2} \zeta \right)_h \\ 0 \end{pmatrix} + y \begin{pmatrix} \hat{m} \\ \beta \bar{s} \end{pmatrix} , \quad \begin{pmatrix} (\hat{Q}^{1/2} \zeta)_q \\ (\hat{Q}^{1/2} \zeta)_h \end{pmatrix} = \begin{pmatrix} \hat{Q}_q & \hat{Q}_{qh} \\ \hat{Q}_{qh}^T & \hat{Q}_h \end{pmatrix}^{1/2} \begin{pmatrix} \zeta_q \\ \zeta_h \end{pmatrix} .
\end{aligned} \tag{227}
$$
$G_q$, $G_h$, $G_{qh}$ and $D$ are in $\mathbb{R}^{(K+1)\times(K+1)}$. $D$ is the discrete derivative. $B_q$ and $B_h$ are in $\mathbb{R}^{K+1}$, and $\left(\begin{smallmatrix} q \\ h \end{smallmatrix}\right)$ is in $\mathbb{R}^{2(K+1)}$. We can marginalize $e^{\psi_{qh}}$ over $q$:
$$
\begin{aligned}
& \int dq\, dh\, e^{\psi_{qh}(q,h;\bar{s})} = \int dh\, e^{\psi_h(h;\bar{s})} \\
& \psi_h(h;\bar{s}) = -\frac{1}{2} h^T G_h h + h^T B_h - \frac{1}{2} \log\det G_q - \frac{1}{2} (G_{qh} h - B_q)^T G_q^{-1} (G_{qh} h - B_q) \\
& \phantom{\psi_h(h;\bar{s})} = -\frac{1}{2} h^T G h + h^T \left( B_h + D_{qh}^T G_0^{-1} B \right) - \frac{1}{2} \log\det G_q ,
\end{aligned} \tag{229}
$$
where we set
$$
G = G_h + D_{qh}^T G_0^{-1} D_{qh} , \quad G_0 = \begin{pmatrix} K^2 V_w & 0 \\ 0 & t^2 V_h \end{pmatrix} , \quad D_{qh} = D - t \begin{pmatrix} 0 & 0 \\ -i \delta_e V_{qh}^T & 0 \end{pmatrix} , \quad B = \begin{pmatrix} K \sqrt{Q_w}\, \chi \\ i t \left( \hat{Q}^{1/2} \zeta \right)_q \end{pmatrix} + y \begin{pmatrix} K \sqrt{\mu}\, m_w \\ \lambda^e t\, m \end{pmatrix} . \tag{232}
$$
Eq. (231) is the potential eq. (65) given in the main part, up to a term independent of $h$ .
We take the limit $\beta\to\infty$. As before we introduce the measures $dP_w$, $dP_{qh}$, $dP_{qh}'$, $dP_h$ and $dP_h'$, whose unnormalized densities are $e^{\psi_w(w)}$, $e^{\psi_{qh}(h,q;s)}$, $e^{\psi_{qh}(h,q;s')}$, $e^{\psi_h(h;s)}$ and $e^{\psi_h(h;s')}$. We use Laplace's method to evaluate them. We have to rescale the order parameters to avoid a degenerate solution. We take
$$
\begin{aligned}
& m_w \to m_w , \quad Q_w \to Q_w , \quad V_w \to V_w/\beta , \quad \hat{m}_w \to \beta \hat{m}_w , \quad \hat{Q}_w \to \beta^2 \hat{Q}_w , \quad \hat{V}_w \to \beta \hat{V}_w , \\
& m \to m , \quad Q_h \to Q_h , \quad V_h \to V_h/\beta , \quad \hat{m} \to \beta \hat{m} , \quad \hat{Q}_h \to \beta^2 \hat{Q}_h , \quad \hat{V}_h \to \beta \hat{V}_h , \\
& Q_{qh} \to \beta Q_{qh} , \quad V_{qh} \to V_{qh} .
\end{aligned} \tag{236}
$$
We take this scaling for $Q_{qh}$ and $V_{qh}$ because we want $D_{qh}$ and $B$ to be of order one while $G$, $B_h$ and $G_0^{-1}$ are of order $\beta$. Taking the matrix square root we obtain the block-wise scaling
$$
\hat{Q}^{1/2} \to \begin{pmatrix} 1 & 1 \\ 1 & \beta \end{pmatrix} \odot \hat{Q}^{1/2} , \tag{241}
$$
which indeed gives $(\hat{Q}^{1/2}\zeta)_q$ of order one and $(\hat{Q}^{1/2}\zeta)_h$ of order $\beta$. As a consequence we obtain that $f = -\varphi$ and that $P_w$, $P_h$ and $P_h'$ are peaked around their respective maxima $w^*$, $h^*$ and $h'^*$, and that they can be approximated by Gaussian measures. Notice that $P_{qh}$ is not peaked with respect to its $q$ variable, which has to be integrated over its whole range; this leads to the marginal $P_h$ and the potential $\psi_h$ of eq. (231).
Last, differentiating the free energy $f$ with respect to $s$ and $s^\prime$ we obtain the expected errors and accuracies:
$$
E_{\mathrm{train}} = \mathbb{E}_{y,\zeta,\xi}\, \ell(y h_K^*) , \quad \mathrm{Acc}_{\mathrm{train}} = \mathbb{E}_{y,\zeta,\xi}\, \delta_{y = \operatorname{sign}(h_K^*)} , \quad E_{\mathrm{test}} = \mathbb{E}_{y,\zeta,\xi}\, \ell(y h_K'^*) , \quad \mathrm{Acc}_{\mathrm{test}} = \mathbb{E}_{y,\zeta,\xi}\, \delta_{y = \operatorname{sign}(h_K'^*)} . \tag{242}
$$
### B.2 Self-consistent equations
The extremality condition $\nabla_{\Theta,\hat{\Theta}} \varphi = 0$ gives the following self-consistent equations on the order parameters. $P$ is the operator that linearly combines quantities evaluated at $h^*$ (taken with $\bar{s}=1$) and at $h'^*$ (taken with $\bar{s}=0$), with weights $\rho$ and $1-\rho$: $P(g(h)) = \rho\, g(h^*) + (1-\rho)\, g(h'^*)$. We assume $\ell_2$ regularization, i.e. $\gamma(w) = w^2/2$.
$$
\begin{aligned}
& m_w = \frac{1}{\alpha} \frac{\hat{m}_w}{r + \hat{V}_w} \qquad Q_w = \frac{1}{\alpha} \frac{\hat{Q}_w + \hat{m}_w^2}{(r + \hat{V}_w)^2} \qquad V_w = \frac{1}{\alpha} \frac{1}{r + \hat{V}_w} \\
& \begin{pmatrix} \hat{m}_w \\ \hat{m} \\ m \\ \cdot \end{pmatrix} = \begin{pmatrix} K \sqrt{\mu} & & 0 \\ & \lambda^e t I_K & \\ 0 & & I_{K+1} \end{pmatrix} \mathbb{E}_{y,\xi,\zeta}\, y\, P\begin{pmatrix} G_0^{-1} (D_{qh} h - B) \\ h \end{pmatrix} \\
& \begin{pmatrix} \hat{Q}_w & & & \cdot \\ & \hat{Q}_h & Q_{qh} & \\ & Q_{qh}^T & Q_h & \\ \cdot & & & \cdot \end{pmatrix} = \begin{pmatrix} K & & 0 \\ & t I_K & \\ 0 & & I_{K+1} \end{pmatrix} \mathbb{E}_{y,\xi,\zeta}\, P\left( \begin{pmatrix} G_0^{-1} (D_{qh} h - B) \\ h \end{pmatrix}^{\otimes 2} \right) \begin{pmatrix} K & & 0 \\ & t I_K & \\ 0 & & I_{K+1} \end{pmatrix} \\
& \begin{pmatrix} \hat{V}_w & & & \cdot \\ & \hat{V}_h & V_{qh} & \\ & V_{qh}^T & V_h & \\ \cdot & & & \cdot \end{pmatrix} = P\left( \mathrm{Cov}_{\psi_{qh}} \begin{pmatrix} q \\ h \end{pmatrix} \right)
\end{aligned} \tag{244}
$$
We use the notation $·$ for unspecified padding to reach vectors of size $2(K+1)$ and matrices of size $2(K+1)× 2(K+1)$ .
The extremizer $h^*$ of $ψ_h$ is
$$
\displaystyle h^*=G^{-1}\left(B_h+D_{qh}^TG_0^{-1}B\right) . \tag{250}
$$
It has to be plugged into the fixed-point equations (247 - 248) and the expectation over the disorder has to be taken.
As to the variances, eq. (249), we have $(∇^2ψ_{qh})^{-1}\left(\begin{smallmatrix}q\\ h\end{smallmatrix}\right)=\left(\begin{smallmatrix}G_q&-iG_{qh}\\ -iG_{qh}^T&G_h\end{smallmatrix}\right)^{-1}$ and, using the Schur complement ($G_q$ being invertible), one obtains
$$
\displaystyle\left(\begin{smallmatrix}·&·\\
-iV_{qh}&·\end{smallmatrix}\right)=tP\left(G_0^{-1}D_{qh}G^{-1}\right) , \quad \left(\begin{smallmatrix}V_h&·\\
·&·\end{smallmatrix}\right)=P\left(G^{-1}\right) , \quad \left(\begin{smallmatrix}\hat{V}_w&·\\
·&\hat{V}_h\end{smallmatrix}\right)=\left(\begin{smallmatrix}K^2&0\\
0&t^2I_K\end{smallmatrix}\right)P\left(G_0^{-1}-G_0^{-1}D_{qh}G^{-1}D_{qh}^TG_0^{-1}\right) . \tag{251}
$$
How to continue the computation and solve these equations is detailed in the main part III.2.2.
### B.3 Solution in the continuous limit at large $r$
We report the final values of the order parameters, given in the main part III.2.1. We set $x=k/K$ and $z=l/K$, continuous indices ranging from 0 to 1. We define the resolvents
$$
\displaystyleφ(x)=\left\{\begin{array}{cl}e^{λ^etx}&\text{if }δ_e=0\\
∑_{ν>0}^{∞}ν(λ^e)^{ν-1}\frac{I_ν(2tx)}{tx}&\text{if }δ_e=1\end{array}\right. , \quad Φ(x,z)=\left\{\begin{array}{cl}I_0(2t√{xz})&\text{if }δ_e=0\\
\frac{I_1(2t(x+z))}{t(x+z)}&\text{if }δ_e=1\end{array}\right. , \tag{256}
$$
with $I_ν$ the modified Bessel function of the first kind of order $ν$. The effective inverse derivative is
$$
\displaystyle V_{qh}(x,z)=θ(z-x)(z-x)^{-1}I_1(2t(z-x)) , \quad D_{qh}^{-1}(x,z)=D_{qh}^{-1,T}(z,x)=\left\{\begin{array}{cl}θ(x-z)&\text{if }δ_e=0\\
\frac{1}{t}V_{qh}(z,x)&\text{if }δ_e=1\end{array}\right. , \tag{260}
$$
with $θ$ the step function.
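As a consistency check, in the asymmetric case ($δ_e=0$), where $D_{qh}^{-1}(x,z)=θ(x-z)$, the defining relation of the resolvent (eq. (103), quoted in part B.5) reduces to the Volterra equation $Φ(x,z)=1+t^2∫_0^x∫_0^z Φ(x^\prime,z^\prime)dx^\prime dz^\prime$, which $Φ(x,z)=I_0(2t√{xz})$ indeed solves. A minimal numerical sketch of this check (the helper names are ours, not from the supplementary code):

```python
import math

def phi0(x, z, t, terms=30):
    # Φ(x,z) = I_0(2 t sqrt(x z)) = Σ_n (t² x z)^n / (n!)²  (asymmetric case, δ_e = 0)
    tot, term = 0.0, 1.0
    for n in range(terms):
        tot += term
        term *= t * t * x * z / ((n + 1) ** 2)
    return tot

def volterra_rhs(x, z, t, m=200):
    # midpoint-rule evaluation of 1 + t² ∫_0^x ∫_0^z Φ(x',z') dx' dz'
    hx, hz = x / m, z / m
    tot = 0.0
    for i in range(m):
        xp = (i + 0.5) * hx
        for j in range(m):
            tot += phi0(xp, (j + 0.5) * hz, t)
    return 1.0 + t * t * tot * hx * hz
```

At, e.g., $(x,z,t)=(0.8,0.6,1.2)$ the two sides agree up to the quadrature error.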
The solution of the fixed-point equations, in the continuous limit $K→∞$, at the first (constant) order in $1/r$, is
$$
\displaystyle V_w=\frac{1}{rα} , \quad V_h(x,z)=V_wΦ(x,z) , \quad \hat{V}_h(1-x,1-z)=t^2ρΦ(x,z) , \quad \hat{V}_w=t^{-2}\hat{V}_h(0,0) ,
\displaystyle\hat{m}(1-x)=ρλ^etφ(x) , \quad \hat{m}_w=√{μ}\frac{1}{λ^et}\hat{m}(0) , \quad m_w=\frac{\hat{m}_w}{rα} ,
\displaystyle m(x)=(1+μ)\frac{m_w}{√{μ}}φ(x)+\frac{t}{λ^e}∫_0^xdx^\prime∫_0^1dx^{\prime\prime}\,φ(x-x^\prime)V_h(x^\prime,x^{\prime\prime})\hat{m}(x^{\prime\prime}) ,
\displaystyle\hat{Q}_w=t^{-2}\hat{Q}_h(0,0) , \quad Q_w=\frac{\hat{Q}_w+\hat{m}_w^2}{r^2α} ,
\displaystyle\hat{Q}_h(1-x,1-z)=t^2∫_{0^-,0^-}^{x,z}dx^\prime dz^\prime\,Φ(x-x^\prime,z-z^\prime)\left[P(\hat{m}^{⊗2})(1-x^\prime,1-z^\prime)\right] ,
\displaystyle Q_{qh}(1-x,z)=t∫_{0^-,0^-}^{x,z}dx^\prime dz^\prime\,Φ(x-x^\prime,z-z^\prime)\Bigg[P(\hat{m})(1-x^\prime)\left(λ^etm(z^\prime)+√{μ}m_wδ(z^\prime)\right)
\displaystyle\hskip 18.49988pt{}+∫_{0,0^-}^{1^+,1}dx^{\prime\prime}dz^{\prime\prime}\left(\hat{Q}_h(1-x^\prime,x^{\prime\prime})+P(\hat{m}^{⊗2})(1-x^\prime,x^{\prime\prime})\right)D_{qh}^{-1}(x^{\prime\prime},z^{\prime\prime})G_0(z^{\prime\prime},z^\prime)\Bigg] ,
\displaystyle Q_h(x,z)=∫_{0^-,0^-}^{x,z}dx^\prime dz^\prime\,Φ(x-x^\prime,z-z^\prime)\Bigg[\hat{Q}_wδ(x^\prime,z^\prime)+\left(λ^etm(x^\prime)+√{μ}m_wδ(x^\prime)\right)\left(λ^etm(z^\prime)+√{μ}m_wδ(z^\prime)\right)
\displaystyle\hskip 18.49988pt{}+∫_{0^-,0}^{1,1^+}dx^{\prime\prime}dx^{\prime\prime\prime}\,G_0(x^\prime,x^{\prime\prime})D_{qh}^{-1,T}(x^{\prime\prime},x^{\prime\prime\prime})\left(tδ_eQ_{qh}(x^{\prime\prime\prime},z^\prime)+P(\hat{m})(x^{\prime\prime\prime})\left(λ^etm(z^\prime)+√{μ}m_wδ(z^\prime)\right)\right)
\displaystyle\hskip 18.49988pt{}+∫_{0,0^-}^{1^+,1}dz^{\prime\prime\prime}dz^{\prime\prime}\left(tδ_eQ_{qh}(z^{\prime\prime\prime},x^\prime)+\left(λ^etm(x^\prime)+√{μ}m_wδ(x^\prime)\right)P(\hat{m})(z^{\prime\prime\prime})\right)D_{qh}^{-1}(z^{\prime\prime\prime},z^{\prime\prime})G_0(z^{\prime\prime},z^\prime)
\displaystyle\hskip 18.49988pt{}+∫_{0^-,0,0,0^-}^{1,1^+,1^+,1}dx^{\prime\prime}dx^{\prime\prime\prime}dz^{\prime\prime\prime}dz^{\prime\prime}\,G_0(x^\prime,x^{\prime\prime})D_{qh}^{-1,T}(x^{\prime\prime},x^{\prime\prime\prime})\left(\hat{Q}_h(x^{\prime\prime\prime},z^{\prime\prime\prime})+P(\hat{m}^{⊗2})(x^{\prime\prime\prime},z^{\prime\prime\prime})\right)D_{qh}^{-1}(z^{\prime\prime\prime},z^{\prime\prime})G_0(z^{\prime\prime},z^\prime)\Bigg] ; \tag{264}
$$
where we set
$$
\displaystyle P(\hat{m})(x)=\hat{m}(x)+ρδ(1-x) , \quad P(\hat{m}^{⊗2})(x,z)=ρ\left(\hat{m}(x)+δ(1-x)\right)\left(\hat{m}(z)+δ(1-z)\right)+(1-ρ)\hat{m}(x)\hat{m}(z) , \quad G_0(x,z)=t^2V_h(x,z)+V_wδ(x,z) . \tag{277}
$$
The test and train accuracies are
$$
\displaystyle Acc_{test}=E_{y,ξ,ζ,χ}\,δ_{y=\mathrm{sign}(h^{\prime*}(1))}=E_{ξ,ζ,χ}\,δ_{0<√{μ}m_w+K∫_0^1dx\,V(1,x)\hat{m}(x)+λ t∫_0^1dx\,m(x)+√{Q_w}ζ+K∫_0^1dxdz\,V(1,x)\hat{Q}^{1/2}(x,z)ξ(z)+t∫_0^1dxdz\,Q^{1/2}(x,z)χ(z)}
\displaystyle=\frac{1}{2}\left(1+erf\left(\frac{√{μ}m_w+K∫_0^1dx\,V(1,x)\hat{m}(x)+λ t∫_0^1dx\,m(x)}{√{2}√{Q_w+K^2∫_0^1dxdz\,V(1,x)\hat{Q}(x,z)V(z,1)+t^2∫_0^1dxdz\,Q(x,z)}}\right)\right)
\displaystyle=\frac{1}{2}\left(1+erf\left(\frac{m(1)-ρ V(1,1)}{√{2}√{Q(1,1)-m(1)^2-ρ(1-ρ)V(1,1)^2}}\right)\right) \tag{283}
$$
and
$$
\displaystyle Acc_{train}=E_{y,ξ,ζ,χ}\,δ_{y=\mathrm{sign}(h^*(1))}=E_{y,ξ,ζ,χ}\,δ_{y=\mathrm{sign}(h^{\prime*}(1)+V(1,1)y)}=\frac{1}{2}\left(1+erf\left(\frac{m(1)+(1-ρ)V(1,1)}{√{2}√{Q(1,1)-m(1)^2-ρ(1-ρ)V(1,1)^2}}\right)\right)
$$
To obtain the last expressions we integrated $m$ and $Q$ by parts, using the self-consistent conditions they satisfy.
### B.4 Higher orders in $1/r$ : how to pursue the computation
The solution given in the main part III.2.1 and reproduced above is for infinite regularization $r$, keeping only the first constant order. We briefly show how to pursue the computation to any order.
The self-consistent equations for $V_{qh}$, $V_h$ and $\hat{V}_h$ at any order can be phrased as, rewriting eqs. (251 - 253) and extending the matrices by continuity:
$$
\displaystyle\frac{1}{t}V_{qh}=P\left(D_{qh}^{-1,T}∑_{a≥0}\left(-G_hD_{qh}^{-1}G_0D_{qh}^{-1,T}\right)^a\right) , \quad V_h=D_{qh}^{-1}P\left(G_0∑_{a≥0}\left(-D_{qh}^{-1,T}G_hD_{qh}^{-1}G_0\right)^a\right)D_{qh}^{-1,T} , \quad \hat{V}_h=t^2D_{qh}^{-1,T}P\left(G_h∑_{a≥0}\left(-D_{qh}^{-1}G_0D_{qh}^{-1,T}G_h\right)^a\right)D_{qh}^{-1} \tag{287}
$$
where we remind that $G_0=t^2V_h+V_wδ(x,z)=O(1/r)$, $G_h=\hat{V}_h+\bar{s}δ(1-x,1-z)$ and $D_{qh}=D-tδ_eV_{qh}^T$. These equations form a system of non-linear integral equations. A perturbative expansion in powers of $1/r$ should allow one to solve it. At each order one has to solve linear integral equations whose resolvent, for $V_h$ and $\hat{V}_h$, is $Φ$, the resolvent previously determined at constant order. The perturbations have to be summed, and the resulting $V_{qh}$, $V_h$ and $\hat{V}_h$ can be used to express $h^*$, $h^{\prime*}$ and the other order parameters.
### B.5 Interpretation of terms of DMFT: computation
We prove the relations given in the main part III.2.2, which state an equivalence between the order parameters $V_h$, $V_{qh}$ and $\hat{V}_h$ stemming from the replica computation and the correlation and response functions of the dynamical process that $h$ follows. We assume that the regularization $r$ is large and we derive the equalities at the constant order.
We introduce the tilting field $η(x)∈ℝ^N$ and the tilted Hamiltonian as
$$
\displaystyle\frac{dh}{dx}(x)=\frac{t}{√{N}}\tilde{A}^eh(x)+η(x) , \quad h(x)=∫_0^xdx^\prime\,e^{(x-x^\prime)\frac{t}{√{N}}\tilde{A}^e}\left(η(x^\prime)+δ(x^\prime)\frac{1}{√{N}}Xw\right) , \quad H(η)=\frac{1}{2}(y-h(1))^TR(y-h(1))+\frac{r}{2}w^Tw , \tag{290}
$$
where $R∈ℝ^{N×N}$ is diagonal and accounts for the train and test nodes. We write $⟨·⟩_β$ for the expectation under the density $e^{-βH(η)}/Z$ (normalized only at $η=0$; $Z$ is not a function of $η$).
For $V_h$ we have:
$$
\displaystyle\frac{β}{N}\mathrm{Tr}\left[⟨h(x)h(z)^T⟩_β-⟨h(x)⟩_β⟨h(z)^T⟩_β\right]\Big|_{η=0}=\frac{1}{N}\mathrm{Tr}\left(e^{\frac{tx}{√{N}}\tilde{A}^e}\frac{1}{N}X\left(⟨ww^T⟩_β-⟨w⟩_β⟨w^T⟩_β\right)X^Te^{\frac{tz}{√{N}}(\tilde{A}^e)^T}\right)=\frac{V_w}{N}\left\{\begin{array}{lc}\mathrm{Tr}\left(e^{\frac{tx}{√{N}}\tilde{A}}e^{\frac{tz}{√{N}}\tilde{A}^T}\right)&\text{if }δ_e=0\\
\mathrm{Tr}\left(e^{\frac{tx+tz}{√{N}}\tilde{A}^s}\right)&\text{if }δ_e=1\end{array}\right. . \tag{293}
$$
We used that in the large regularization limit the covariance of $w$ is $I_M/r$ and $V_w=1/(rα)$. We distinguish the two cases, symmetrized or not. For the symmetrized case we have
$$
\displaystyle\frac{V_w}{N}\mathrm{Tr}\left(e^{\frac{tx+tz}{√{N}}\tilde{A}^s}\right)=V_w∫_{-2}^{+2}\frac{d\hat{λ}}{2π}√{4-\hat{λ}^2}\,e^{\hat{λ}t(x+z)}=V_w\frac{I_1(2t(x+z))}{t(x+z)} , \tag{298}
$$
where we used that the spectrum of $\tilde{A}^s/√{N}$ follows the semi-circle law up to negligible corrections. For the asymmetric case we expand the two exponentials; $\tilde{A}≈Ξ$ has independent Gaussian entries.
$$
\displaystyle\frac{V_w}{N}\mathrm{Tr}\left(e^{\frac{tx}{√{N}}\tilde{A}}e^{\frac{tz}{√{N}}\tilde{A}^T}\right)=∑_{n,m≥0}\frac{V_w}{N^{1+\frac{n+m}{2}}}\frac{(tx)^n(tz)^m}{n!\,m!}∑_{i_1,…,i_n}∑_{j_1,…,j_m}Ξ_{i_1i_2}\cdots Ξ_{i_{n-1}i_n}Ξ_{i_nj_1}Ξ_{j_2j_1}Ξ_{j_3j_2}\cdots Ξ_{i_1j_m}=V_w∑_n\frac{(t^2xz)^n}{(n!)^2}=V_wI_0(2t√{xz}) . \tag{300}
$$
In the sum only the terms with $m=n$ and $j_2=i_n,…,j_m=i_2$ contribute. Consequently in both cases we obtain that
$$
\displaystyle V_h(x,z)=\frac{β}{N}\mathrm{Tr}\left[⟨h(x)h(z)^T⟩_β-⟨h(x)⟩_β⟨h(z)^T⟩_β\right]\Big|_{η=0} . \tag{303}
$$
$V_h$ is the correlation function between the states $h(x)∈ℝ^N$ of the network under the dynamics defined by the Hamiltonian (290).
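The semi-circle integral used in eq. (298), $∫_{-2}^{2}\frac{dλ}{2π}√{4-λ^2}\,e^{λs}=I_1(2s)/s$, can be checked numerically; a small sketch (the helper names are ours):

```python
import math

def i1_ratio(s, terms=40):
    # I_1(2s)/s = Σ_{n≥0} s^(2n) / (n! (n+1)!)
    tot, term = 0.0, 1.0
    for n in range(terms):
        tot += term
        term *= s * s / ((n + 1) * (n + 2))
    return tot

def semicircle_laplace(s, m=20000):
    # midpoint rule for ∫_{-2}^{2} dλ/(2π) sqrt(4-λ²) e^{λ s}
    h = 4.0 / m
    tot = 0.0
    for i in range(m):
        lam = -2.0 + (i + 0.5) * h
        tot += math.sqrt(4.0 - lam * lam) * math.exp(lam * s)
    return tot * h / (2 * math.pi)
```

At $s=0$ the integral reduces to the normalization of the semi-circle density, and $I_1(2s)/s→1$.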
This derivation can be used to compute the resolvent $Φ=V_h/V_w$ in the symmetrized case, instead of solving the integral equation that defines it, eq. (103), that is $Φ(x,z)=D_{qh}^{-1}(t^2Φ(x,z)+δ(x,z))D_{qh}^{-1,T}$. As a consequence of the two equivalent definitions we obtain the following mathematical identity, for all $x$ and $z$:
$$
\displaystyle∫_{0,0}^{x,z}dx^\prime dz^\prime\,\frac{I_1(2(x-x^\prime))}{x-x^\prime}\frac{I_1(2(x^\prime+z^\prime))}{x^\prime+z^\prime}\frac{I_1(2(z-z^\prime))}{z-z^\prime}=\frac{I_1(2(x+z))}{x+z}-\frac{I_1(2x)I_1(2z)}{xz} . \tag{304}
$$
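The identity (304) can be verified numerically; a direct check with our own helper names (not from the supplementary code):

```python
import math

def i1_over(u, terms=30):
    # I_1(2u)/u = Σ_{n≥0} u^(2n) / (n! (n+1)!), the kernel appearing in (304)
    tot, term = 0.0, 1.0
    for n in range(terms):
        tot += term
        term *= u * u / ((n + 1) * (n + 2))
    return tot

def lhs(x, z, m=150):
    # midpoint rule for the double integral on the left-hand side of (304)
    hx, hz = x / m, z / m
    tot = 0.0
    for i in range(m):
        xp = (i + 0.5) * hx
        fa = i1_over(x - xp)
        for j in range(m):
            zp = (j + 0.5) * hz
            tot += fa * i1_over(xp + zp) * i1_over(z - zp)
    return tot * hx * hz

def rhs(x, z):
    # right-hand side of (304): I_1(2(x+z))/(x+z) - I_1(2x) I_1(2z)/(x z)
    return i1_over(x + z) - i1_over(x) * i1_over(z)
```

One can also confirm that the two sides agree order by order in a Taylor expansion around $x=z=0$, where both reduce to $xz$ at leading order.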
For $V_qh$ we have:
$$
\displaystyle\frac{t}{N}\frac{∂}{∂η(z)}⟨h(x)⟩_β\Big|_{η=0}=\frac{t}{N}\mathrm{Tr}\left(e^{(x-z)\frac{t}{√{N}}\tilde{A}^e}\right)θ(x-z)=\left\{\begin{array}{lc}θ(x-z)&\text{if }δ_e=0\\
θ(x-z)(x-z)^{-1}I_1(2t(x-z))&\text{if }δ_e=1\end{array}\right.=V_{qh}(x,z) . \tag{305}
$$
We neglected the terms of order $1/r$ stemming from $w$. We integrated over the spectrum of $\tilde{A}^e$, which follows the semi-circle law (symmetrized case) or the circular law (asymmetric case) up to negligible corrections. We obtain that $V_{qh}$ is the response function of $h$.
Last for $\hat{V}_h$ we have:
$$
\displaystyle\frac{t^2}{β^2N}\frac{∂^2}{∂η(x)∂η(z)}⟨1⟩_β\Big|_{η=0}=\frac{t^2}{N}\mathrm{Tr}\left[R⟨(y-h(1))^{⊗2}⟩_β\Big|_{η=0}\,R\,e^{(1-z)\frac{t}{√{N}}\tilde{A}^e}e^{(1-x)\frac{t}{√{N}}(\tilde{A}^e)^T}\right]=\frac{ρt^2}{N}\mathrm{Tr}\left(e^{(1-z)\frac{t}{√{N}}\tilde{A}^e}e^{(1-x)\frac{t}{√{N}}(\tilde{A}^e)^T}\right)=\hat{V}_h(x,z) . \tag{311}
$$
We neglected the terms of order $1/β$ obtained by differentiating $e^{-βH}$ only once, and those of order $1/r$, i.e. $y-h(1)≈y$. We obtain that $\hat{V}_h$ is the correlation function between the responses.
### B.6 Limiting cases
To gain insight into the behaviour of the test accuracy and to make connections with already studied models, we expand (283) around the limiting cases $t→0$ and $t→∞$.
At $t→ 0$ we use that $φ(x)=1+λ^etx+O(t^2)$ and $Φ(x,z)=1+O(t^2)$ ; this simplifies several terms. We obtain the following expansions at the first order in $t$ :
$$
\displaystyle V_w=\frac{1}{rα} , \quad V(x,z)=\frac{1}{rα} , \quad \hat{m}_w=ρ√{μ} , \quad \hat{m}(x)=ρλ^et , \quad m_w=\frac{ρ}{rα}√{μ} , \quad m(x)=\frac{ρ}{rα}(1+μ)(1+λ^et(x+1)) ,
\displaystyle\hat{Q}_w=ρ , \quad \hat{Q}_h(x,z)=0 , \quad Q_w=\frac{ρ+ρ^2μ}{rα} , \quad Q_{qh}=O(t) , \quad Q_h(1,1)=Q_w+m(0)^2+ρ(1-ρ)V_w^2+2\frac{ρ^2}{r^2α^2}(1+μ)^2λ^et . \tag{315}
$$
Plugging them into eq. (283) we obtain the expression given in the main part III.2.3:
$$
Acc_{test}=\frac{1}{2}\left(1+erf\left(\frac{1}{√{2}}√{\frac{ρ}{α}}\frac{μ+λ^et(2+μ)}{√{1+ρμ}}\right)\right) . \tag{321}
$$
At $t→∞$ we assume that $λ^e>1$. We distinguish the two cases, asymmetric or symmetrized. For the asymmetric case we have $φ(x)=\exp(λ^etx)$ and $\logΦ(x,z)=Θ(2t√{xz})$. For the symmetrized case we have
$$
\displaystyleφ(x)=\frac{1}{tx}\frac{∂}{∂λ^e}∑_{ν≥0}^{∞}(λ^e)^νI_ν(2tx)≈\frac{1}{tx}\frac{∂}{∂λ^e}∑_{ν=-∞}^{+∞}(λ^e)^νI_ν(2tx)=\frac{1}{tx}\frac{∂}{∂λ^e}e^{tx(λ^e+1/λ^e)}=\left(1-(λ^e)^{-2}\right)e^{tx(λ^e+1/λ^e)} \tag{322}
$$
and $\logΦ(x,z)=Θ(2t(x+z))$. In the two cases, only the few dominant terms, scaling like $e^{2λ^et}$ or $e^{2(λ^e+1/λ^e)t}$, survive in (283). We obtain
$$
\displaystyle Acc_{test}≈\frac{1}{2}\left(1+erf\left(\frac{m(1)}{√{2}√{Q(1,1)-m(1)^2}}\right)\right) , \quad m(x)=\frac{ρ}{rα}φ(1)φ(x)(1+μ+C(λ^e)) ,
\displaystyle C(λ^e)=∫_0^∞dx^\prime dz^\prime\left\{\begin{array}{cl}I_0(2√{x^\prime z^\prime})e^{-(x^\prime+z^\prime)λ^e}&\text{if }δ_e=0\\
\frac{I_1(2(x^\prime+z^\prime))}{x^\prime+z^\prime}e^{-(x^\prime+z^\prime)(λ^e+1/λ^e)}&\text{if }δ_e=1\end{array}\right. , \quad Q(1,1)≈∫_0^1dx^\prime dz^\prime\,Φ(1-x^\prime,1-z^\prime)(λ^e)^2t^2m(x^\prime)m(z^\prime)
$$
where in $m$ we performed the changes of variables $x^\prime→x^\prime/t$ and $z^\prime→z^\prime/t$ and took the limit $t→∞$ in the integration bounds to remove the dependence on $t$ and $x$. Performing the change of variables $1-x^\prime→x^\prime/t$ and $1-z^\prime→z^\prime/t$ in $Q(1,1)$ we can express $Acc_{test}$ solely in terms of $C(λ^e)$. Last we use the identity
$$
C(λ^e)=\frac{1}{(λ^e)^2-1} , \tag{332}
$$
valid in both the asymmetric and symmetrized cases, to obtain the expression given in the main part III.2.3:
$$
\displaystyle Acc_{test}\underset{t→∞}{\longrightarrow}\frac{1}{2}\left(1+erf\left(\frac{λ^eq_{PCA}}{√{2}}\right)\right) , \quad q_{PCA}=√{1-(λ^e)^{-2}} . \tag{333}
$$
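The two Bessel identities used in this limit — the generating function $∑_{ν=-∞}^{∞}(λ^e)^νI_ν(2tx)=e^{tx(λ^e+1/λ^e)}$ of eq. (322) and $C(λ^e)=1/((λ^e)^2-1)$ of eq. (332), the latter checked here in the asymmetric case — can be verified numerically; a sketch with our own helper names:

```python
import math

def bessel_i(nu, w, terms=60):
    # modified Bessel function of the first kind, integer order, by its power series
    nu = abs(nu)  # I_{-ν} = I_ν for integer ν
    term = (w / 2) ** nu / math.factorial(nu)
    tot = 0.0
    for k in range(terms):
        tot += term
        term *= (w / 2) ** 2 / ((k + 1) * (k + 1 + nu))
    return tot

def generating_sum(lam, s, n=25):
    # truncated two-sided sum Σ_{ν=-n}^{n} λ^ν I_ν(2s); should approach e^{s(λ+1/λ)}
    return sum(lam ** nu * bessel_i(nu, 2 * s) for nu in range(-n, n + 1))

def c_integral(lam, cut=20.0, m=200):
    # midpoint rule for ∫_0^∞∫_0^∞ I_0(2 sqrt(x z)) e^{-(x+z) λ} dx dz,
    # with the tails truncated at `cut` (they are exponentially small for λ > 1)
    h = cut / m
    tot = 0.0
    for i in range(m):
        x = (i + 0.5) * h
        ex = math.exp(-lam * x)
        for j in range(m):
            z = (j + 0.5) * h
            tot += bessel_i(0, 2 * math.sqrt(x * z)) * ex * math.exp(-lam * z)
    return tot * h * h
```

The truncations converge quickly for $λ^e>1$, the regime assumed in this limit.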
## Appendix C State-evolution equations for the Bayes-optimal performance
The Bayes-optimal (BO) performance for semi-supervised classification on the binary CSBM can be computed thanks to the following iterative state-evolution equations, which were derived in [22, 39].
The equations were derived for a symmetric graph. We map the asymmetric $\tilde{A}$ to a symmetric matrix by the symmetrization $(\tilde{A}+\tilde{A}^T)/√{2}$. Thus the BO performance on an asymmetric $A$ is the BO performance on the symmetrized $A$ with effective signal strength $λ^s=√{2}λ$.
Let $m_y^0$ and $m_u^0$ be the initial conditions. The state-evolution equations are
$$
\displaystyle m_u^{t+1}=\frac{μ m_y^t}{1+μ m_y^t} , \quad m^t=\frac{μ}{α}m_u^t+(λ^s)^2m_y^{t-1} , \quad m_y^t=ρ+(1-ρ)E_W\left[\tanh\left(m^t+√{m^t}W\right)\right] , \tag{335}
$$
where $W$ is a standard scalar Gaussian variable. These equations are iterated until convergence to a fixed point $(m,m_y,m_u)$. Then the BO test accuracy is
$$
Acc_{test}=\frac{1}{2}\left(1+erf\left(√{m/2}\right)\right) . \tag{338}
$$
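The iteration can be sketched as follows (a minimal implementation; we collapse the time indices of eq. (335) to simultaneous updates, which leaves the fixed point unchanged, and evaluate $E_W$ by quadrature; the helper names are ours):

```python
import math

def gauss_expect(f, m=1000, cut=8.0):
    # E_W[f(W)] for a standard Gaussian W, by the midpoint rule on [-cut, cut]
    h = 2 * cut / m
    tot = 0.0
    for i in range(m):
        w = -cut + (i + 0.5) * h
        tot += f(w) * math.exp(-w * w / 2)
    return tot * h / math.sqrt(2 * math.pi)

def bayes_optimal_accuracy(mu, alpha, rho, lam_s, iters=200, tol=1e-10):
    # iterate the state-evolution equations (335) to their fixed point
    m_y = 0.5  # initial condition
    m = 0.0
    for _ in range(iters):
        m_u = mu * m_y / (1 + mu * m_y)
        m = mu / alpha * m_u + lam_s ** 2 * m_y
        m_y_new = rho + (1 - rho) * gauss_expect(
            lambda w: math.tanh(m + math.sqrt(m) * w))
        if abs(m_y_new - m_y) < tol:
            m_y = m_y_new
            break
        m_y = m_y_new
    return 0.5 * (1 + math.erf(math.sqrt(m / 2)))  # eq. (338)
```

A larger $λ^s$ gives a larger fixed-point overlap $m$ and hence a higher BO accuracy.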
In the large $λ$ limit we have $m_y→ 1$ and
$$
\log(1-Acc_test)\underset{λ→∞}{∼}-λ^2 . \tag{339}
$$
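This asymptotic follows from eq. (338): with $m_y→1$ the overlap is dominated by $m≈(λ^s)^2=2λ^2$, so $1-Acc_{test}=\frac{1}{2}\mathrm{erfc}(√{m/2})≈\frac{1}{2}\mathrm{erfc}(λ)∼e^{-λ^2}$. A quick numerical check, keeping only this dominant term of $m$ (the function name is ours):

```python
import math

def log_error_rate(lam):
    # log(1 - Acc_test) with Acc_test = (1 + erf(sqrt(m/2)))/2 and m ≈ 2 λ²
    m = 2 * lam ** 2
    return math.log(0.5 * math.erfc(math.sqrt(m / 2)))
```

The ratio $\log(1-Acc_{test})/λ^2$ approaches $-1$ as $λ$ grows, up to $O(\log λ/λ^2)$ corrections.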
## Appendix D Details on the numerics
For the discrete GCN, the system of fixed-point equations (37 – 48) is solved by iterating it until convergence. The iterations are stable up to $K≈4$ and no damping is necessary. The integration over $(ξ,ζ,χ)$ is done by Hermite quadrature (quadratic loss) or Monte-Carlo sampling (logistic loss) over about $10^6$ samples. For the logistic loss $h^*$ has to be computed by Newton's method. The whole computation then takes around one minute on a single CPU.
For the continuous GCN, equation (126) is evaluated by a trapezoidal integration scheme with about a hundred discretization points. In the nested integrals of $Q(1,1)$, $\hat{Q}$ can be evaluated only once at each discretization point. The whole computation takes a few seconds.
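The Gaussian expectations of indicator functions that give the accuracies (as in eq. (283)) reduce to error functions; this reduction, and the Monte-Carlo sampling mentioned above, can be cross-checked on a scalar toy case (the helper names are ours, independent of the supplementary code):

```python
import math
import random

def acc_monte_carlo(a, q, n=200000, seed=0):
    # Monte-Carlo estimate of E_ζ[δ_{0 < a + sqrt(q) ζ}] with ζ a standard Gaussian
    rng = random.Random(seed)
    return sum(a + math.sqrt(q) * rng.gauss(0.0, 1.0) > 0 for _ in range(n)) / n

def acc_closed_form(a, q):
    # the same Gaussian integral in closed form: (1 + erf(a / sqrt(2 q))) / 2
    return 0.5 * (1 + math.erf(a / math.sqrt(2 * q)))
```

The Monte-Carlo estimate matches the closed form up to the $O(1/√{n})$ statistical error.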
We provide the code to evaluate our predictions in the supplementary material.
## Appendix E Supplementary figures
In this section we provide the supplementary figures of part III.2.3. They show the convergence to the continuous limit with respect to $K$ and $r$ , and that the continuous limit can be close to the optimality. We also provide the supplementary figures of part III.1.3, that compare the GCN on symmetric and asymmetric graphs, and that show the train error versus the residual connection strength.
### Asymmetric graph
The following figures support the discussion of part III.2.3 for the asymmetric graph. They compare the theoretical predictions for the continuous GCN to numerical simulations of the trained network. They show the convergence towards the limit $r→∞$ and that the continuous GCN improves over its discretizations at finite $K$.
<details>
<summary>x8.png Details</summary>

Left panel: accuracy of the continuous GCN versus $t$ for $(λ,μ)=(1.5,2),(1,3),(0.7,3)$ (lines), with simulation points at $r=10^{-1},10^0,10^1,10^2$ (dots).
</details>
<details>
<summary>x9.png Details</summary>

Right panel: $Acc_{test}$ versus $t$ for $λ=2,1,0.5$ (lines), with simulation points at $r=10^{-1},10^0,10^1,10^2$ (dots).
</details>
Figure 8: Predicted test accuracy $Acc_{test}$ of the continuous GCN, at $r=∞$. Left: for $α=1$ and $ρ=0.1$; right: for $α=2$, $μ=1$ and $ρ=0.3$. The performance of the continuous GCN is given by eq. (126). Dots: numerical simulations of the continuous GCN for $N=7×10^3$ and $d=30$, trained with the quadratic loss, averaged over ten experiments.
<details>
<summary>x10.png Details</summary>

$Acc_{test}$ versus $t$ for the continuous GCN at $(λ,μ)=(1.5,2),(1,3),(0.7,3)$, compared to its discretizations at $K=2,4,16$.
* **Practical Implication:** The choice of model configuration involves a trade-off. If one can precisely control `t` to stay near 0.5, a high-peak model like `λ=1.5, μ=2` or `K=16` is preferable. If `t` is expected to vary or be less controlled, a more stable model like `K=2` might be more reliable, albeit with lower maximum accuracy.
</details>
<details>
<summary>x11.png Details</summary>

Line chart: test accuracy $Acc_test$ versus $t$ for $λ=2, 1, 0.5$ (solid lines) and $K=16, 4, 2$ (dash-dot lines). The curves split into three performance tiers with peaks near $0.97$, $0.83$ and $0.70$; higher $λ$ and $K$ land in the higher tiers, and the lower tiers decline visibly after their peak.
</details>
Figure 9: Predicted test accuracy $Acc_test$ of the continuous GCN at $r=∞$. Left: for $α=1$ and $ρ=0.1$; right: for $α=2$, $μ=1$ and $ρ=0.3$. The performance of the continuous GCN is given by eq. (126), while for its discretization at finite $K$ it is obtained by numerically solving the fixed-point equations (87 - 94).
### Symmetrized graph
The following figures support the discussion of part III.2.3 for the symmetrized graph. They compare the theoretical predictions for the continuous GCN to numerical simulations of the trained network and show the convergence towards the limit $r→∞$.
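At the linear level, the convergence of the finite-$K$ GCN to its continuous limit is the standard fact that $K$ repeated convolution steps $(I + tA/K)^K$ approach the heat kernel $e^{tA}$, with an $O(1/K)$ discretization error. A minimal numpy sketch (the operator $A$ below is a generic symmetric matrix with $O(1)$ spectrum, standing in for the rescaled graph, not the CSBM adjacency itself):

```python
import numpy as np

rng = np.random.default_rng(0)
N, t = 200, 1.0

# generic symmetric operator with O(1) spectrum
W = rng.standard_normal((N, N))
A = (W + W.T) / np.sqrt(2 * N)

x = rng.standard_normal(N)

# continuous diffusion e^{tA} x, computed via the eigendecomposition of A
evals, evecs = np.linalg.eigh(A)
target = evecs @ (np.exp(t * evals) * (evecs.T @ x))

# K discrete convolution steps x <- x + (t/K) A x, i.e. (I + tA/K)^K x
errs = []
for K in (2, 4, 16, 64):
    xt = x.copy()
    for _ in range(K):
        xt = xt + (t / K) * (A @ xt)
    errs.append(np.linalg.norm(xt - target) / np.linalg.norm(target))

print(errs)  # error shrinks roughly as 1/K
```

The shrinking discretization error is consistent with the finite-$K$ curves in the figures collapsing onto the continuous prediction as $K$ grows.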
<details>
<summary>x12.png Details</summary>

Line chart: test accuracy $Acc_test$ versus $t$. Lines: continuous-GCN predictions for $(λ^*=1.5, μ=3)$, $(λ^*=1, μ=2)$ and $(λ^*=0.7, μ=1)$; dots: data for $r=10^0, 10^1, 10^2, 10^3$, moving up towards the lines as $r$ increases. All curves peak for $t$ between roughly $0.8$ and $1.5$ and decline slowly afterwards.
</details>
Figure 10: Predicted test accuracy $Acc_test$ of the continuous GCN at $r=∞$ for a symmetrized graph; $α=4$, $ρ=0.1$. We recall that $λ^s=√{2}λ$. The performance of the continuous GCN is given by eq. (126). Dots: numerical simulations of the continuous GCN for $N=10^4$ and $d=30$, trained with the quadratic loss, averaged over ten experiments.
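The factor in $λ^s=√{2}λ$ can be illustrated on a toy spiked-matrix surrogate (an illustrative sketch, not the paper's CSBM graph): symmetrizing an asymmetric matrix as $(A+A^T)/√{2}$ keeps the noise entries at unit variance while multiplying the rank-one signal by $√{2}$, which can push a spike above the spectral detection threshold.

```python
import numpy as np

rng = np.random.default_rng(1)
N, lam = 1000, 0.85                  # spike strength chosen below the symmetric threshold 1

y = rng.choice([-1.0, 1.0], size=N)  # "community" labels
W = rng.standard_normal((N, N))      # asymmetric iid noise
A = (lam / N) * np.outer(y, y) + W / np.sqrt(N)

# symmetrization: the noise (W + W.T)/sqrt(2) keeps unit entry variance,
# while the rank-one signal is multiplied by sqrt(2), i.e. lam -> sqrt(2)*lam
A_sym = (A + A.T) / np.sqrt(2)

def top_overlap(M):
    """|<leading eigenvector, y>| / sqrt(N), between 0 and 1."""
    vals, vecs = np.linalg.eig(M)
    v = vecs[:, np.argmax(vals.real)]
    return abs(np.vdot(v, y)) / np.sqrt(N)

ov_asym, ov_sym = top_overlap(A), top_overlap(A_sym)
print(ov_asym, ov_sym)  # the symmetrized effective spike sqrt(2)*0.85 > 1 is detectable
```

With $λ=0.85$ the asymmetric spike is undetectable, while the symmetrized one has effective strength $√{2}·0.85≈1.2>1$ and its top eigenvector correlates with $y$.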
<details>
<summary>x13.png Details</summary>

Line chart: test accuracy versus $t$. Lines: predictions for $(λ^*=1.5, μ=2)$, $(λ^*=1, μ=3)$ and $(λ^*=0.7, μ=3)$; dots: simulations at $r=10^0, 10^1, 10^2, 10^3$, approaching the $r=∞$ prediction as $r$ grows. Accuracy peaks at small $t$ (between roughly $0.2$ and $0.7$) and then decreases, more steeply for the smaller $λ^*$.
</details>
<details>
<summary>x14.png Details</summary>

Line chart with error bars: test accuracy $Acc_test$ versus $t$ for $λ^s=2, 1, 0.5$ (lines), with simulation points at $r=10^0, 10^1, 10^2, 10^3$ converging towards the theoretical curves as $r$ increases. The $λ^s=2$ curve plateaus near $0.97$, while the lower $λ^s$ curves peak at small $t$ and then decline.
</details>
Figure 11: Predicted test accuracy $Acc_test$ of the continuous GCN at $r=∞$ for a symmetrized graph. Left: for $α=1$ and $ρ=0.1$; right: for $α=2$, $μ=1$ and $ρ=0.3$. We recall that $λ^s=√{2}λ$. The performance of the continuous GCN is given by eq. (126). Dots: numerical simulations of the continuous GCN for $N=7× 10^3$ and $d=30$, trained with the quadratic loss, averaged over ten experiments.
### Comparison with optimality
The following figures support the discussion of part III.2.3. They show how the optimal diffusion time $t^*$ varies with the parameters of the model, and they compare the performance of the optimal continuous GCN and of its discrete counterpart to the Bayes-optimal one.
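The existence of an interior optimal diffusion time, with oversmoothing beyond it, can be reproduced in a small toy experiment (a sketch with illustrative parameters, not the paper's setting): when noisy node observations are diffused on an SBM graph with the heat kernel $e^{tA}$, accuracy first improves as neighbours are averaged, then collapses once the Perron mode dominates and washes out the community structure.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 800
y = np.repeat([1.0, -1.0], N // 2)            # two balanced communities

c_in, c_out = 14.0, 4.0                       # illustrative within/between average degrees
P = np.where(np.outer(y, y) > 0, c_in / N, c_out / N)
A = np.triu(rng.random((N, N)) < P, 1).astype(float)
A = A + A.T                                   # undirected SBM adjacency
A /= np.sqrt((c_in + c_out) / 2)              # rescale to an O(1) spectrum

x = y + 2.0 * rng.standard_normal(N)          # noisy node observations

evals, evecs = np.linalg.eigh(A)
def diffuse(t):                               # heat kernel e^{tA} applied to x
    return evecs @ (np.exp(t * evals) * (evecs.T @ x))

ts = np.linspace(0.0, 14.0, 57)
accs = [float(np.mean(np.sign(diffuse(t)) == y)) for t in ts]
t_star = float(ts[int(np.argmax(accs))])
print(t_star, max(accs), accs[0], accs[-1])   # interior optimum, then oversmoothing
```

Accuracy rises above its $t=0$ value, peaks at an intermediate $t^*$, and falls back towards chance at large $t$: a caricature of the trade-off that the figures below quantify exactly.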
<details>
<summary>x15.png Details</summary>

Line chart: accuracy versus $λ$ for the Bayes-optimal estimator, $K=∞$ (symmetrized and non-symmetrized graphs) and $K=16, 4, 2, 1$. All GCN curves start near $0.72$ at $λ=0$, ordered as $K=∞$ (symmetrized) $>$ $K=∞$ $>$ $K=16$ $>$ $K=4$ $>$ $K=2$ $>$ $K=1$, and converge towards the Bayes-optimal curve as $λ$ grows. Inset: $ℓ^*$ versus $λ$ for the symmetrized $K=∞$ curve, peaking around $λ≈1.3$.
</details>
<details>
<summary>x16.png Details</summary>

Line chart of the test accuracy $Acc_{test}$ (y-axis, 0.65 to 1.00) vs $λ$ (x-axis, 0.0 to 2.5). Seven series: Bayes-optimal (dotted black), $K=∞$ with symmetrized graph (solid red), $K=∞$ (solid red), $K=16$ and $K=4$ (dashed red), $K=2$ and $K=1$ (dashed black). All curves grow sigmoidally with $λ$ from ≈0.63–0.65 at $λ=0$; the accuracy increases monotonically with $K$, and the $K=∞$ curves closely approach the Bayes-optimal line at large $λ$. Inset (top left): the optimal time $t^*$ vs $λ$ for the two $K=∞$ series, increasing from 0 and peaking around $λ≈1.5$ before decreasing.
</details>
Figure 12: Predicted test accuracy $Acc_{test}$ of the continuous GCN and of its discrete counterpart of depth $K$, at the optimal time $t^*$ and $r=∞$. Left: for $α=1$, $μ=2$ and $ρ=0.1$; right: for $α=2$, $μ=1$ and $ρ=0.3$. The performance of the continuous GCN ($K=∞$) is given by eq. (126), while for its discretization at finite $K$ it is obtained by numerically solving the fixed-point equations (87–94). Inset: the maximizer $t^*$ at $K=∞$.
<details>
<summary>x17.png Details</summary>

Heatmap of the gap $Acc_{BO-GCN}$ vs $λ$ (x-axis, 0.25 to 2.00) and $μ$ (y-axis, 0.25 to 2.00); the color bar ranges from about 0.02 (dark purple) to 0.12 (bright yellow). The gap is largest at low $λ$ and high $μ$ (top-left corner, ≈0.12) and decreases smoothly along the diagonal toward high $λ$ and low $μ$ (bottom-right corner, ≈0.02).
</details>
<details>
<summary>x18.png Details</summary>

Heatmap of the gap $Acc_{BO-GCN}$ vs $λ$ (x-axis, 0.25 to 2.00) and $μ$ (y-axis, 0.25 to 2.00); the color bar ranges from about 0.005 (dark purple) to 0.040 (bright yellow). The gap is largest around $λ≈0.75$–$1.00$ and $μ≈0.25$–$0.50$, drops sharply once $λ$ exceeds 1, and is smallest at high $λ$ and high $μ$ (top-right corner).
</details>
Figure 13: Gap to Bayes-optimality. Predicted difference between the Bayes-optimal test accuracy and the test accuracy of the continuous GCN at the optimal time $t^*$ and $r=∞$, vs the two signal strengths $λ$ and $μ$. Left: for $α=1$ and $ρ=0.1$; right: for $α=2$ and $ρ=0.3$. The performance of the continuous GCN is given by eq. (126).
### Comparison between symmetric and asymmetric graphs
The following figure supports the claim of part III.1.3 that, at the same $λ$, the performance of the GCN depends little on whether the graph is symmetric or not, and that the GCN is not able to exploit the supplementary information carried by the asymmetry.
<details>
<summary>x19.png Details</summary>

Four panels plotting the test accuracy (y-axis, 0.65 to 1.00) vs the self-loop intensity $c$ (first three panels, 0.0 to 2.0) or the time $t$ (fourth panel, 0 to 2). Panel titles: $K=1$, logistic loss; $K=1$, quadratic loss; $K=2$, quadratic loss; continuous, quadratic loss. The horizontal Bayes-optimal lines differ markedly: ≈0.99 for asymmetric $A$ (dotted blue) vs ≈0.94 for symmetric $A$ (dashed red). In contrast, the GCN curves for asymmetric $A$ (blue markers) and symmetric $A$ (red markers) nearly coincide, both at $r=10^2$ (circles) and at $r=10^{-2}$ (crosses); the $r=10^2$ curves lie above the $r=10^{-2}$ ones, and the accuracy peaks at an intermediate $c$ or $t$.
</details>
Figure 14: Test accuracy of the GCN, on an asymmetric $A$ and on its symmetric counterpart, obtained by setting $A_{ij}$ equal to $A_{ji}$ for all $i<j$. $α=4$, $λ=1.5$, $μ=3$ and $ρ=0.1$. Lines: predictions. Dots: numerical simulations of the GCN for $N=10^4$ and $d=30$, averaged over ten experiments.
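The symmetrization used in this comparison takes a few lines; below is a minimal NumPy sketch, assuming the convention that the upper triangle of $A$ is copied onto the lower one (the caption does not fix which triangle is kept, and either choice serves the comparison).

```python
import numpy as np

def symmetrize(A: np.ndarray) -> np.ndarray:
    """Symmetric counterpart of A: set A_ji = A_ij for all i < j.

    One possible convention (keep the upper triangle); keeping the
    lower triangle instead would be equally valid.
    """
    S = A.copy()
    i, j = np.triu_indices(A.shape[0], k=1)  # all index pairs with i < j
    S[j, i] = S[i, j]                        # copy A_ij onto A_ji
    return S

# Toy check on a small random directed 0/1 adjacency matrix.
rng = np.random.default_rng(0)
A = (rng.random((5, 5)) < 0.4).astype(float)
S = symmetrize(A)
assert np.allclose(S, S.T)                        # result is symmetric
assert np.allclose(np.triu(S, 1), np.triu(A, 1))  # upper triangle unchanged
```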
### Train error
The following figure displays the train error $E_{train}$ of eq. (18) vs the self-loop intensity $c$, in the same settings as fig. 2 of part III.1. It shows in particular that treating $c$ as a parameter trained to minimize the train error would degrade the performance, since it would drive $c→∞$. Consequently, $c$ should be treated as a hyperparameter, tuned to maximize the test accuracy, as is done in the main part of the article.
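The distinction between the two criteria can be made concrete with a toy sketch. The curves below are synthetic stand-ins, not the model's actual profiles: they only mimic the qualitative shapes of a train error that decreases monotonically in $c$ and a test accuracy that peaks at an intermediate $c$.

```python
import numpy as np

# Synthetic profiles over a grid of self-loop intensities c:
# a monotonically decreasing train error and a test accuracy
# peaking at an intermediate c (qualitative shapes only).
cs = np.linspace(0.0, 5.0, 101)
E_train = 0.3 * np.exp(-0.8 * cs)
Acc_test = 0.85 - 0.05 * (cs - 1.0) ** 2

# Training c to minimize E_train selects the edge of the grid,
# i.e. c -> infinity on an unbounded grid ...
c_by_train = cs[np.argmin(E_train)]
# ... while tuning c as a hyperparameter on the test accuracy
# selects a finite intermediate value.
c_by_test = cs[np.argmax(Acc_test)]
```

On this grid, `c_by_train` lands on the largest available $c$ while `c_by_test` picks an interior optimum, which is the behavior the figure illustrates.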
<details>
<summary>x20.png Details</summary>

A 2×3 grid of plots of the train error $E_{train}$ (y-axis) vs $c$ (x-axis), one column per depth $K=1, 2, 3$; top row with logistic loss, bottom row with quadratic loss. Each panel contains four curves, for $r=10^4$ (cyan), $r=10^2$ (teal), $r=10^0$ (green) and $r=10^{-2}$ (brown). $E_{train}$ grows with $r$: it stays near its maximal value for $r=10^4$ and is close to zero for $r=10^{-2}$. At intermediate $r$ the train error decreases with $c$, the more steeply the larger $K$ (apart from a small local bump of the $r=10^0$ curve for $K=3$ with quadratic loss), without reaching a minimum at any finite $c$.
</details>
Figure 15: Predicted train error $E_\mathrm{train}$ for different values of $K$. Top: for $λ=1.5$, $μ=3$ and logistic loss; bottom: for $λ=1$, $μ=2$ and quadratic loss; $α=4$ and $ρ=0.1$. We take $c_k=c$ for all $k$. Dots: numerical simulations of the GCN for $N=10^4$ and $d=30$, averaged over ten experiments.
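The dots in such figures come from direct numerical simulation of the GCN on data drawn from the contextual stochastic block model. A minimal small-scale sketch of this kind of simulation is given below; it is not the paper's actual code, and the sampling scheme, the form of the convolution `X ← (I + (c/K) Ã) X`, and all parameter values (`N`, `d`, `lam`, `mu`, `c_avg`, `rho`, `K`, `c`, `r`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative small-scale parameters (the paper's simulations use N = 1e4, d = 30).
N, d = 400, 100        # number of nodes, feature dimension
lam, mu = 1.0, 2.0     # graph and feature signal-to-noise ratios
c_avg = 10.0           # average degree of the SBM graph
rho = 0.1              # fraction of revealed (training) labels
K, c, r = 2, 1.0, 1.0  # depth, convolution strength, ridge regularization

# --- contextual stochastic block model (CSBM) sample ---
y = rng.choice([-1.0, 1.0], size=N)       # community labels
u = rng.standard_normal(d)                # latent feature direction
X = np.sqrt(mu / N) * np.outer(y, u) + rng.standard_normal((N, d))

# SBM adjacency with affinities (c_avg ± lam*sqrt(c_avg)) / N
p_in = (c_avg + lam * np.sqrt(c_avg)) / N
p_out = (c_avg - lam * np.sqrt(c_avg)) / N
probs = np.where(np.outer(y, y) > 0, p_in, p_out)
A = (rng.random((N, N)) < probs).astype(float)
A = np.triu(A, 1)
A = A + A.T                               # symmetric, no self-loops

# centered, rescaled adjacency so the convolution operator is O(1)
A_tilde = (A - c_avg / N) / np.sqrt(c_avg)

# --- K steps of linear graph convolution ---
H = X.copy()
for _ in range(K):
    H = H + (c / K) * (A_tilde @ H)

# --- ridge regression (quadratic loss) on a fraction rho of the nodes ---
train = rng.random(N) < rho
w = np.linalg.solve(H[train].T @ H[train] + r * np.eye(d), H[train].T @ y[train])
pred = np.sign(H @ w)

train_err = np.mean(pred[train] != y[train])
test_err = np.mean(pred[~train] != y[~train])
print(f"train error: {train_err:.3f}, test error: {test_err:.3f}")
```

Averaging such runs over many independent samples (ten experiments in the figure) gives the dots that are compared to the replica prediction.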
## References
- Wang et al. [2023] Y. Wang, Z. Li, and A. Barati Farimani, Graph neural networks for molecules, in Machine Learning in Molecular Sciences (Springer International Publishing, 2023) p. 21–66, arXiv:2209.05582.
- Li et al. [2022] M. M. Li, K. Huang, and M. Zitnik, Graph representation learning in biomedicine and healthcare, Nature Biomedical Engineering 6, 1353–1369 (2022), arXiv:2104.04883.
- Bessadok et al. [2021] A. Bessadok, M. A. Mahjoub, and I. Rekik, Graph neural networks in network neuroscience (2021), arXiv:2106.03535.
- Sanchez-Gonzalez et al. [2020] A. Sanchez-Gonzalez, J. Godwin, T. Pfaff, R. Ying, J. Leskovec, and P. W. Battaglia, Learning to simulate complex physics with graph networks, in Proceedings of the 37th International Conference on Machine Learning (2020) arXiv:2002.09405.
- Shlomi et al. [2020] J. Shlomi, P. Battaglia, and J.-R. Vlimant, Graph neural networks in particle physics, Machine Learning: Science and Technology 2 (2020), arXiv:2007.13681.
- Peng et al. [2021] Y. Peng, B. Choi, and J. Xu, Graph learning for combinatorial optimization: A survey of state-of-the-art, Data Science and Engineering 6, 119 (2021), arXiv:2008.12646.
- Cappart et al. [2023] Q. Cappart, D. Chételat, E. Khalil, A. Lodi, C. Morris, and P. Veličković, Combinatorial optimization and reasoning with graph neural networks, Journal of Machine Learning Research 24, 1 (2023), arXiv:2102.09544.
- Morris et al. [2024] C. Morris, F. Frasca, N. Dym, H. Maron, I. I. Ceylan, R. Levie, D. Lim, M. Bronstein, M. Grohe, and S. Jegelka, Position: Future directions in the theory of graph machine learning, in Proceedings of the 41st International Conference on Machine Learning (2024).
- Li et al. [2018] Q. Li, Z. Han, and X.-M. Wu, Deeper insights into graph convolutional networks for semi-supervised learning, in Thirty-Second AAAI Conference on Artificial Intelligence (2018) arXiv:1801.07606.
- Oono and Suzuki [2020] K. Oono and T. Suzuki, Graph neural networks exponentially lose expressive power for node classification, in International Conference on Learning Representations (2020) arXiv:1905.10947.
- Li et al. [2019] G. Li, M. Müller, A. Thabet, and B. Ghanem, DeepGCNs: Can GCNs go as deep as CNNs?, in ICCV (2019) arXiv:1904.03751.
- Chen et al. [2020] M. Chen, Z. Wei, Z. Huang, B. Ding, and Y. Li, Simple and deep graph convolutional networks, in Proceedings of the 37th International Conference on Machine Learning (2020) arXiv:2007.02133.
- Ju et al. [2023] H. Ju, D. Li, A. Sharma, and H. R. Zhang, Generalization in graph neural networks: Improved PAC-Bayesian bounds on graph diffusion, in AISTATS (2023) arXiv:2302.04451.
- Tang and Liu [2023] H. Tang and Y. Liu, Towards understanding the generalization of graph neural networks (2023), arXiv:2305.08048.
- Cong et al. [2021] W. Cong, M. Ramezani, and M. Mahdavi, On provable benefits of depth in training graph convolutional networks, in 35th Conference on Neural Information Processing Systems (2021) arXiv:2110.15174.
- Esser et al. [2021] P. M. Esser, L. C. Vankadara, and D. Ghoshdastidar, Learning theory can (sometimes) explain generalisation in graph neural networks, in 35th Conference on Neural Information Processing Systems (2021) arXiv:2112.03968.
- Seung et al. [1992] H. S. Seung, H. Sompolinsky, and N. Tishby, Statistical mechanics of learning from examples, Physical Review A 45, 6056 (1992).
- Loureiro et al. [2021] B. Loureiro, C. Gerbelot, H. Cui, S. Goldt, F. Krzakala, M. Mezard, and L. Zdeborová, Learning curves of generic features maps for realistic datasets with a teacher-student model, Advances in Neural Information Processing Systems 34, 18137 (2021).
- Mei and Montanari [2022] S. Mei and A. Montanari, The generalization error of random features regression: Precise asymptotics and the double descent curve, Communications on Pure and Applied Mathematics 75, 667 (2022).
- Shi et al. [2023] C. Shi, L. Pan, H. Hu, and I. Dokmanić, Homophily modulates double descent generalization in graph convolution networks, PNAS 121 (2023), arXiv:2212.13069.
- Yan and Sarkar [2021] B. Yan and P. Sarkar, Covariate regularized community detection in sparse graphs, Journal of the American Statistical Association 116, 734 (2021), arXiv:1607.02675.
- Deshpande et al. [2018] Y. Deshpande, S. Sen, A. Montanari, and E. Mossel, Contextual stochastic block models, in Advances in Neural Information Processing Systems, Vol. 31, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (2018) arXiv:1807.09596.
- Chien et al. [2021] E. Chien, J. Peng, P. Li, and O. Milenkovic, Adaptive universal generalized PageRank graph neural network, in International Conference on Learning Representations (2021) arXiv:2006.07988.
- Fu et al. [2022] G. Fu, P. Zhao, and Y. Bian, p-Laplacian based graph neural networks, in Proceedings of the 39th International Conference on Machine Learning (2022) arXiv:2111.07337.
- Lei et al. [2022] R. Lei, Z. Wang, Y. Li, B. Ding, and Z. Wei, EvenNet: Ignoring odd-hop neighbors improves robustness of graph neural networks, in 36th Conference on Neural Information Processing Systems (2022) arXiv:2205.13892.
- Duranthon and Zdeborová [2024a] O. Duranthon and L. Zdeborová, Asymptotic generalization error of a single-layer graph convolutional network, in The Learning on Graphs Conference (2024) arXiv:2402.03818.
- Chen et al. [2018] R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud, Neural ordinary differential equations, in 32nd Conference on Neural Information Processing Systems (2018) arXiv:1806.07366.
- Kipf and Welling [2017] T. N. Kipf and M. Welling, Semi-supervised classification with graph convolutional networks, in International Conference on Learning Representations (2017) arXiv:1609.02907.
- Cui et al. [2023] H. Cui, F. Krzakala, and L. Zdeborová, Bayes-optimal learning of deep random networks of extensive-width, in Proceedings of the 40th International Conference on Machine Learning (2023) arXiv:2302.00375.
- McCallum et al. [2000] A. K. McCallum, K. Nigam, J. Rennie, and K. Seymore, Automating the construction of internet portals with machine learning, Information Retrieval 3, 127–163 (2000).
- Shchur et al. [2018] O. Shchur, M. Mumme, A. Bojchevski, and S. Günnemann, Pitfalls of graph neural network evaluation (2018), arXiv:1811.05868.
- Giles et al. [1998] C. L. Giles, K. D. Bollacker, and S. Lawrence, CiteSeer: An automatic citation indexing system, in Proceedings of the Third ACM Conference on Digital Libraries (1998) p. 89–98.
- Sen et al. [2008] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad, Collective classification in network data, AI Magazine 29 (2008).
- Baranwal et al. [2021] A. Baranwal, K. Fountoulakis, and A. Jagannath, Graph convolution for semi-supervised classification: Improved linear separability and out-of-distribution generalization, in Proceedings of the 38th International Conference on Machine Learning (2021) arXiv:2102.06966.
- Baranwal et al. [2023] A. Baranwal, K. Fountoulakis, and A. Jagannath, Optimality of message-passing architectures for sparse graphs, in 37th Conference on Neural Information Processing Systems (2023) arXiv:2305.10391.
- Wang et al. [2024] R. Wang, A. Baranwal, and K. Fountoulakis, Analysis of corrected graph convolutions (2024), arXiv:2405.13987.
- Mignacco et al. [2020] F. Mignacco, F. Krzakala, Y. M. Lu, and L. Zdeborová, The role of regularization in classification of high-dimensional noisy Gaussian mixture, in International Conference on Learning Representations (2020) arXiv:2002.11544.
- Aubin et al. [2020] B. Aubin, F. Krzakala, Y. M. Lu, and L. Zdeborová, Generalization error in high-dimensional perceptrons: Approaching Bayes error with convex optimization, in Advances in Neural Information Processing Systems (2020) arXiv:2006.06560.
- Duranthon and Zdeborová [2024b] O. Duranthon and L. Zdeborová, Optimal inference in contextual stochastic block models, Transactions on Machine Learning Research (2024b), arXiv:2306.07948.
- Keriven [2022] N. Keriven, Not too little, not too much: a theoretical analysis of graph (over)smoothing, in 36th Conference on Neural Information Processing Systems (2022) arXiv:2205.12156.
- He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, in IEEE Conference on Computer Vision and Pattern Recognition (2016) arXiv:1512.03385.
- Pham et al. [2017] T. Pham, T. Tran, D. Phung, and S. Venkatesh, Column networks for collective classification, in AAAI (2017) arXiv:1609.04508.
- Xu et al. [2021] K. Xu, M. Zhang, S. Jegelka, and K. Kawaguchi, Optimization of graph neural networks: Implicit acceleration by skip connections and more depth, in Proceedings of the 38th International Conference on Machine Learning (2021) arXiv:2105.04550.
- Sander et al. [2022] M. E. Sander, P. Ablin, and G. Peyré, Do residual neural networks discretize neural ordinary differential equations?, in 36th Conference on Neural Information Processing Systems (2022) arXiv:2205.14612.
- Ling et al. [2016] J. Ling, A. Kurzawski, and J. Templeton, Reynolds averaged turbulence modelling using deep neural networks with embedded invariance, Journal of Fluid Mechanics 807, 155–166 (2016).
- Rackauckas et al. [2020] C. Rackauckas, Y. Ma, J. Martensen, C. Warner, K. Zubov, R. Supekar, D. Skinner, A. Ramadhan, and A. Edelman, Universal differential equations for scientific machine learning (2020), arXiv:2001.04385.
- Marion [2023] P. Marion, Generalization bounds for neural ordinary differential equations and deep residual networks (2023), arXiv:2305.06648.
- Poli et al. [2019] M. Poli, S. Massaroli, J. Park, A. Yamashita, H. Asama, and J. Park, Graph neural ordinary differential equations (2019), arXiv:1911.07532.
- Xhonneux et al. [2020] L.-P. A. C. Xhonneux, M. Qu, and J. Tang, Continuous graph neural networks, in Proceedings of the 37th International Conference on Machine Learning (2020) arXiv:1912.00967.
- Han et al. [2023] A. Han, D. Shi, L. Lin, and J. Gao, From continuous dynamics to graph neural networks: Neural diffusion and beyond (2023), arXiv:2310.10121.
- Lu and Sen [2020] C. Lu and S. Sen, Contextual stochastic block model: Sharp thresholds and contiguity (2020), arXiv:2011.09841.
- Wu et al. [2019] F. Wu, T. Zhang, A. H. de Souza Jr., C. Fifty, T. Yu, and K. Q. Weinberger, Simplifying graph convolutional networks, in Proceedings of the 36th International Conference on Machine Learning (2019) arXiv:1902.07153.
- Zhu and Koniusz [2021] H. Zhu and P. Koniusz, Simple spectral graph convolution, in International Conference on Learning Representations (2021).
- Lesieur et al. [2017] T. Lesieur, F. Krzakala, and L. Zdeborová, Constrained low-rank matrix estimation: Phase transitions, approximate message passing and applications, Journal of Statistical Mechanics: Theory and Experiment 2017, 073403 (2017), arXiv:1701.00858.
- Duranthon and Zdeborová [2023] O. Duranthon and L. Zdeborová, Neural-prior stochastic block model, Mach. Learn.: Sci. Technol. (2023), arXiv:2303.09995.
- Baik et al. [2005] J. Baik, G. B. Arous, and S. Péché, Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices, Annals of Probability 33, 1643 (2005).