# Nearest-Neighbours Neural Network architecture for efficient sampling of statistical physics models
**Authors**: Luca Maria Del Bono, Federico Ricci-Tersenghi, Francesco Zamponi
> Dipartimento di Fisica, Sapienza Università di Roma, Piazzale Aldo Moro 5, Rome 00185, Italy; CNR-Nanotec, Rome unit, Piazzale Aldo Moro 5, Rome 00185, Italy
> Dipartimento di Fisica, Sapienza Università di Roma, Piazzale Aldo Moro 5, Rome 00185, Italy; CNR-Nanotec, Rome unit, Piazzale Aldo Moro 5, Rome 00185, Italy; INFN, sezione di Roma1, Piazzale Aldo Moro 5, Rome 00185, Italy
> Dipartimento di Fisica, Sapienza Università di Roma, Piazzale Aldo Moro 5, Rome 00185, Italy
Corresponding author: lucamaria.delbono@uniroma1.it
## Abstract
The task of efficiently sampling the Gibbs-Boltzmann distribution of disordered systems is important both for the theoretical understanding of these models and for the solution of practical optimization problems. Unfortunately, this task is known to be hard, especially for spin-glass-like problems at low temperatures. Recently, many attempts have been made to tackle the problem by mixing classical Monte Carlo schemes with newly devised Neural Networks that learn to propose smart moves. In this article, we introduce the Nearest-Neighbours Neural Network (4N) architecture, a physically interpretable deep architecture whose number of parameters scales linearly with the size of the system and that can be applied to a large variety of topologies. We show that the 4N architecture can accurately learn the Gibbs-Boltzmann distribution for a prototypical spin-glass model, the two-dimensional Edwards-Anderson model, and specifically for some of its most difficult instances. In particular, it captures properties such as the energy, the correlation function and the overlap probability distribution. Finally, we show that the performance of 4N increases with the number of layers, in a way that clearly connects to the correlation length of the system, thus providing a simple and interpretable criterion for choosing the optimal depth.
## Introduction
A central problem in many fields of science, from statistical physics to computer science and artificial intelligence, is that of sampling from a complex probability distribution over a large number of variables. Very often, such a probability takes the form of a Gibbs-Boltzmann distribution: given a set of $N\gg 1$ random variables $\boldsymbol{\sigma}=\{\sigma_{1},\cdots,\sigma_{N}\}$ , the distribution is written in the form
$$
P_{\text{GB}}(\boldsymbol{\sigma})=\frac{e^{-\beta\mathcal{H}(\boldsymbol{\sigma})}}{\mathcal{Z}(\beta)}\ . \tag{1}
$$
In statistical physics language, the normalization constant $\mathcal{Z}(\beta)$ is called the partition function, $\mathcal{H}(\boldsymbol{\sigma})$ is called the Hamiltonian function and $\beta=1/T$ is the inverse temperature, which corresponds to a global rescaling of $\mathcal{H}(\boldsymbol{\sigma})$ .
By itself, Eq. (1) is just a definition, i.e. $\mathcal{H}(\boldsymbol{\sigma})=-\frac{1}{\beta}\log P_{\text{GB}}( \boldsymbol{\sigma})+\text{const}$ ; the central idea of Gibbs-Boltzmann distributions is that $\mathcal{H}(\boldsymbol{\sigma})$ is expanded as a sum of ‘local’ interactions. For instance, in the special case of binary (Ising) variables, $\sigma_{i}\in\{-1,+1\}$ , one can always write
$$
\mathcal{H}=-\sum_{i}H_{i}\sigma_{i}-\sum_{i<j}J_{ij}\sigma_{i}\sigma_{j}-\sum_{i<j<k}J_{ijk}\sigma_{i}\sigma_{j}\sigma_{k}+\cdots \tag{2}
$$
In many applications, as in physics (e.g. spin glasses), inference (e.g. maximum entropy models) and artificial intelligence (e.g. Boltzmann machines), the expansion in Eq. (2) is truncated to the pairwise terms, thus neglecting higher-order interactions. This leads to a Hamiltonian
$$
\mathcal{H}=-\sum_{i}H_{i}\sigma_{i}-\sum_{i<j}J_{ij}\sigma_{i}\sigma_{j} \tag{3}
$$
parametrized by ‘local external fields’ $H_{i}$ and ‘pairwise couplings’ $J_{ij}$ . In physics applications such as spin glasses, these are often chosen to be independent and identically distributed random variables, e.g. $H_{i}\sim\mathcal{N}(0,H^{2})$ and $J_{ij}\sim\mathcal{N}(0,J^{2}/N)$ . In Boltzmann Machines, instead, the fields and couplings are learned by maximizing the likelihood of a given training set.
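To make Eq. (3) concrete, the sketch below evaluates the pairwise Hamiltonian for a small system with Gaussian fields and couplings. This is a minimal numpy illustration; the function and variable names are ours, not from the paper.

```python
import numpy as np

def energy(sigma, H, J):
    """Pairwise Hamiltonian of Eq. (3): -sum_i H_i s_i - sum_{i<j} J_ij s_i s_j.

    J is a symmetric N x N matrix with zero diagonal; the factor 1/2
    compensates for counting each pair (i, j) twice in the quadratic form.
    """
    return -H @ sigma - 0.5 * sigma @ J @ sigma

# Random instance with H_i ~ N(0, 1) and J_ij ~ N(0, 1/N), as in the text.
rng = np.random.default_rng(0)
N = 8
H = rng.normal(0.0, 1.0, size=N)
J = rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, N))
J = (J + J.T) / 2          # symmetrize
np.fill_diagonal(J, 0.0)   # no self-couplings
sigma = rng.choice([-1, 1], size=N).astype(float)
E = energy(sigma, H, J)
```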
In many cases, dealing with Eq. (1) analytically, e.g. computing expectation values of interesting quantities, is impossible, and one resorts to numerical computations. A universal strategy is to use local Markov Chain Monte Carlo (MCMC) methods, which have the advantage of being applicable to a wide range of different systems. In these methods, one proposes a local move, typically flipping a single spin, $\sigma_{i}\to-\sigma_{i}$ , and accepts or rejects the move in such a way as to guarantee that, after many iterations, the configuration $\boldsymbol{\sigma}$ is distributed according to Eq. (1). Unfortunately, these methods are difficult (if not altogether impossible) to apply in many hard-to-sample problems for which the convergence time is very large, which, in practice, ends up requiring a huge amount of computational time. For instance, finding the ground state of the Sherrington-Kirkpatrick model (a particular case of sampling at $T\to 0$ ) is known to be an NP-hard problem, and sampling a class of optimization problems is known to take a time scaling exponentially with $N$ . The training of Boltzmann Machines is also known to suffer from this kind of problem.
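The single-spin-flip Metropolis scheme described above can be sketched as follows, assuming the pairwise Hamiltonian of Eq. (3) with zero fields; the function name and conventions are illustrative, not from the paper.

```python
import numpy as np

def metropolis_sweep(sigma, J, beta, rng):
    """One sweep of single-spin-flip Metropolis for H = -sum_{i<j} J_ij s_i s_j.

    J is symmetric with zero diagonal, so the energy change of flipping
    spin i is dE = 2 s_i h_i with local field h_i = sum_j J_ij s_j.
    """
    N = len(sigma)
    for i in rng.permutation(N):
        dE = 2.0 * sigma[i] * (J[i] @ sigma)
        # Metropolis rule: accept with probability min[1, exp(-beta dE)].
        if dE <= 0 or rng.random() < np.exp(-beta * dE):
            sigma[i] = -sigma[i]
    return sigma
```

At $\beta=0$ every proposed flip is accepted, so a single sweep flips every spin; this is a convenient sanity check of the acceptance rule.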
These inconveniences can be avoided by using system-specific global algorithms that leverage properties of the system under examination in order to gain a significant speedup. Instead of flipping one spin at a time, these methods construct smart moves that update a large number of spins simultaneously. An example is the Wolff algorithm [1] for the Ising model. Another class of algorithms uses unphysical moves or extended variable spaces, such as the Swap Monte Carlo for glass-forming models of particles [2]. The downside of these algorithms is that they cannot generally be transferred from one system to another, meaning that one has to develop new techniques specifically for each system of interest. Yet another approach is Parallel Tempering (PT) [3], or one of its modifications [4], which considers a group of systems at different temperatures and alternates between simple MCMC moves and moves that swap the systems at two different temperatures. While PT is a universal strategy, its drawback is that, in order to produce low-temperature configurations, one has to simulate the system at a ladder of higher temperatures, which can be computationally expensive and redundant.
The rise of machine learning technology has sparked a new line of research aimed at using Neural Networks to enhance Monte Carlo algorithms, in the wake of similar approaches in many-body quantum physics [5], molecular dynamics [6] and the study of glasses [7, 8, 9]. The key idea is to use the network to propose new states with a probability $P_{\text{NN}}$ that (i) can be efficiently sampled and (ii) is close to the Gibbs-Boltzmann distribution, e.g., with respect to the Kullback-Leibler (KL) divergence $D_{\text{KL}}$ :
$$
D_{\text{KL}}(P_{\text{GB}}\parallel P_{\text{NN}})=\sum_{\boldsymbol{\sigma}}P_{\text{GB}}(\boldsymbol{\sigma})\log\frac{P_{\text{GB}}(\boldsymbol{\sigma})}{P_{\text{NN}}(\boldsymbol{\sigma})}\ . \tag{4}
$$
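As a toy illustration of Eq. (4), the KL divergence between two discrete distributions can be computed directly; this is illustrative code, not from the paper.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) log(p(x)/q(x)) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute zero to the sum
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# D_KL vanishes iff the two distributions coincide, and is asymmetric:
# D_KL(p || q) differs from D_KL(q || p) in general.
p = np.array([0.7, 0.3])
q = np.array([0.5, 0.5])
```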
The proposed configurations can be used in a Metropolis scheme, accepting them with probability:
$$
\begin{split}\text{Acc}\left[\boldsymbol{\sigma}\rightarrow\boldsymbol{\sigma}^{\prime}\right]&=\min\left[1,\frac{P_{\text{GB}}(\boldsymbol{\sigma}^{\prime})\times P_{\text{NN}}(\boldsymbol{\sigma})}{P_{\text{GB}}(\boldsymbol{\sigma})\times P_{\text{NN}}(\boldsymbol{\sigma}^{\prime})}\right]\\
&=\min\left[1,\frac{e^{-\beta\mathcal{H}(\boldsymbol{\sigma}^{\prime})}\times P_{\text{NN}}(\boldsymbol{\sigma})}{e^{-\beta\mathcal{H}(\boldsymbol{\sigma})}\times P_{\text{NN}}(\boldsymbol{\sigma}^{\prime})}\right].\end{split} \tag{5}
$$
Because $\mathcal{H}$ can be computed in polynomial time and $P_{\text{NN}}$ can be sampled efficiently by hypothesis, this approach ensures an efficient sampling of the Gibbs-Boltzmann probability, as long as (i) $P_{\text{NN}}$ covers well the support of $P_{\text{GB}}$ (i.e., the so-called mode collapse is avoided) and (ii) the acceptance probability in Eq. (5) is significantly different from zero for most of the generated configurations. This scheme can also be combined with standard local Monte Carlo moves to improve efficiency [10].
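The acceptance step of Eq. (5) can be sketched as follows. Here `energy_fn` and `log_p_nn` are hypothetical placeholders for the model Hamiltonian and the network's exact log-likelihood; working in log space avoids overflow in the exponentials.

```python
import numpy as np

def nn_metropolis_accept(sigma, sigma_new, beta, energy_fn, log_p_nn, rng):
    """Accept or reject a network-proposed configuration as in Eq. (5)."""
    # log of the Metropolis ratio in Eq. (5)
    log_ratio = (-beta * (energy_fn(sigma_new) - energy_fn(sigma))
                 + log_p_nn(sigma) - log_p_nn(sigma_new))
    if log_ratio >= 0:  # min[1, ratio] = 1: always accept
        return True
    return bool(np.log(rng.random()) < log_ratio)
```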
The main problem is how to train the Neural Network to minimize Eq. (4); in fact, the computation of $D_{\rm KL}$ requires sampling from $P_{\text{GB}}$ . A possible solution is to use $D_{\text{KL}}(P_{\text{NN}}\parallel P_{\text{GB}})$ instead [11], but this has been shown to be prone to mode collapse [12]. Another proposed solution is to set up an iterative procedure, called ‘Sequential Tempering’ (ST), in which one first learns $P_{\text{NN}}$ at high temperature, where sampling from $P_{\text{GB}}$ is possible, and then gradually decreases the temperature, updating $P_{\text{NN}}$ slowly [13]. A variety of new methods and techniques have thus been proposed to tackle the problem, such as autoregressive models [11, 13, 14], normalizing flows [15, 10, 16, 17], and diffusion models [18]. It is still unclear whether these strategies actually perform well on hard problems and challenge already-existing algorithms [12], and the first steps towards a theoretical understanding of their performance are only now being taken [19, 20]. The main drawback of these architectures is that the number of parameters scales poorly with the system size, typically as $N^{\alpha}$ with $\alpha>1$ , while local Monte Carlo requires a number of operations scaling linearly in $N$ . Furthermore, these architectures – with the notable exception of TwoBo [21], which directly inspired our work – are not informed about the physical properties of the model, and in particular about the structure of its interaction graph and correlations. It should be noted that the related problem of finding the ground state of the system (which coincides with zero-temperature sampling) has been tackled using similar ideas, again with positive [22, 23, 24, 25, 26] and negative results [27, 28, 29, 30].
In this paper, we introduce the Nearest Neighbours Neural Network, or 4N for short, a Graph Neural Network-inspired architecture that implements an autoregressive scheme to learn the Gibbs-Boltzmann distribution. This architecture has a number of parameters scaling linearly in the system size and can sample new configurations in $\mathcal{O}(N)$ time, thus achieving the best scaling one could hope for in such architectures. Moreover, 4N has a straightforward physical interpretation: in particular, the choice of the number of layers can be directly linked to the correlation length of the model under study. Finally, the architecture can easily be applied to essentially any statistical system, such as lattices in higher dimensions or random graphs, and is thus more general than other architectures. As a proof of concept, we evaluate the 4N architecture on the two-dimensional Edwards-Anderson spin glass model, a standard benchmark also used in previous work [13, 12]. We demonstrate that the model succeeds in accurately learning the Gibbs-Boltzmann distribution, especially for some of the most challenging model instances. Notably, it precisely captures properties such as the energy, the correlation function, and the overlap probability distribution.
## State of the art
Some common architectures used for $P_{\text{NN}}$ are normalizing flows [15, 10, 16, 17] and diffusion models [18]. In this paper, we will however focus on autoregressive models [11], which make use of the factorization:
$$
P_{\text{GB}}(\boldsymbol{\sigma})=P(\sigma_{1})P(\sigma_{2}\mid\sigma_{1})P(\sigma_{3}\mid\sigma_{1},\sigma_{2})\cdots P(\sigma_{N}\mid\sigma_{1},\cdots,\sigma_{N-1})=\prod_{i=1}^{N}P(\sigma_{i}\mid\boldsymbol{\sigma}_{<i})\ , \tag{6}
$$
where $\boldsymbol{\sigma}_{<i}=(\sigma_{1},\sigma_{2},\ldots,\sigma_{i-1})$ , so that states can be generated by ancestral sampling: one first generates $\sigma_{1}$ , then $\sigma_{2}$ conditioned on the sampled value of $\sigma_{1}$ , then $\sigma_{3}$ conditioned on $\sigma_{1}$ and $\sigma_{2}$ , and so on. The factorization in (6) is exact, but computing $P(\sigma_{i}\mid\boldsymbol{\sigma}_{<i})$ exactly generally requires a number of operations scaling exponentially with $N$ . Hence, in practice, $P_{\text{GB}}$ is approximated by a $P_{\text{NN}}$ of the form in Eq. (6), in which each individual term $P(\sigma_{i}\mid\boldsymbol{\sigma}_{<i})$ is approximated using a small set of parameters. Note that the unsupervised problem of approximating $P_{\text{GB}}$ is now formally translated into a supervised problem of learning the output, i.e. the probability $\pi_{i}\equiv P(\sigma_{i}=+1\mid\boldsymbol{\sigma}_{<i})$ , as a function of the input $\boldsymbol{\sigma}_{<i}$ . The specific way in which this approximation is carried out depends on the architecture. In this section, we describe some common autoregressive architectures found in the literature for approximating the function $\pi_{i}(\boldsymbol{\sigma}_{<i})$ .
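Ancestral sampling from the factorization in Eq. (6) can be sketched as follows; `pi_fn` is a placeholder for any model of the conditionals $\pi_i$, and the naming is ours.

```python
import numpy as np

def ancestral_sample(pi_fn, N, rng):
    """Draw one configuration from the autoregressive factorization, Eq. (6).

    pi_fn(i, sigma) returns P(sigma_i = +1 | sigma_{<i}); entries of sigma
    with index >= i are masked to 0 and must be ignored by the model.
    """
    sigma = np.zeros(N)
    log_p = 0.0
    for i in range(N):
        pi = pi_fn(i, sigma)
        sigma[i] = 1.0 if rng.random() < pi else -1.0
        log_p += np.log(pi if sigma[i] > 0 else 1.0 - pi)
    return sigma, log_p  # log_p is log P_NN(sigma), as needed in Eq. (5)
```

Returning the log-likelihood alongside the sample is convenient, since the same quantity enters the Metropolis acceptance ratio of Eq. (5).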
- In the Neural Autoregressive Distribution Estimator (NADE) architecture [13], the input $\boldsymbol{\sigma}_{<i}$ is encoded into a vector $\mathbf{y}_{i}$ of size $N_{h}$ using
$$
\mathbf{y}_{i}=\Psi\left(\underline{\underline{A}}\,\boldsymbol{\sigma}_{<i}+\mathbf{B}\right), \tag{7}
$$
where $\mathbf{y}_{i}\in\mathbb{R}^{N_{h}}$ , $\underline{\underline{A}}\in\mathbb{R}^{N_{h}\times N}$ and $\boldsymbol{\sigma}_{<i}$ is the vector of spins in which the spins $l\geq i$ have been masked, i.e. $\boldsymbol{\sigma}_{<i}=(\sigma_{1},\sigma_{2},\dots,\sigma_{i-1},0,\dots,0)$ . $\mathbf{B}\in\mathbb{R}^{N_{h}}$ is the bias vector and $\Psi$ is an element-wise activation function. Note that $\underline{\underline{A}}$ and $\mathbf{B}$ do not depend on the index $i$ , i.e. they are shared across all spins. The information from the hidden layer is then passed to a fully connected layer
$$
\psi_{i}=\Psi\left(\mathbf{V}_{i}\cdot\mathbf{y}_{i}+C_{i}\right) \tag{8}
$$
where $\mathbf{V}_{i}\in\mathbb{R}^{N_{h}}$ and $C_{i}\in\mathbb{R}$ . Finally, the output probability is obtained by applying a sigmoid function to the output, $\pi_{i}=S(\psi_{i})$ . The NADE architecture uses $\mathcal{O}(N\times N_{h})$ parameters. However, the choice of the hidden dimension $N_{h}$ , and how it should be related to $N$ , is not obvious. For this architecture to work well in practical applications, $N_{h}$ needs to be $\mathcal{O}(N)$ , thus leading to $\mathcal{O}(N^{2})$ parameters [13].
- The Masked Autoencoder for Distribution Estimation (MADE) [31] is a generic dense architecture with the addition of the autoregressive requirement, obtained by setting, for all layers $l$ in the network, the weight $W^{l}_{ij}$ between node $j$ at layer $l$ and node $i$ at layer $l+1$ to zero when $j\geq i$ , so that each output depends only on inputs with smaller index. Nonlinear activation functions are added between consecutive layers, and the width of the hidden layers can be increased. Despite its high expressive power, the MADE architecture has at least $\mathcal{O}(N^{2})$ parameters, which makes it scale poorly.
- Autoregressive Convolutional Neural Network (CNN) architectures, such as PixelCNN [32, 33], implement a CNN structure while imposing the autoregressive property. These architectures typically use translationally invariant kernels, which are ill-suited to the study of disordered models.
- The TwoBo (two-body interaction) architecture [21] is derived from the observation that, given the Gibbs-Boltzmann probability in Eq. (1) and the Hamiltonian in Eq. (3), the probability $\pi_{i}$ can be exactly rewritten as [34]
$$
\pi_{i}(\boldsymbol{\sigma}_{<i})=S\left(2\beta h^{i}_{i}+2\beta H_{i}+\rho_{i}(\boldsymbol{h}^{i})\right), \tag{9}
$$
where $S$ is the sigmoid function, $\rho_{i}$ is a highly non-trivial function and the $\boldsymbol{h}^{i}=\{h^{i}_{i},h^{i}_{i+1},\cdots h^{i}_{N}\}$ are defined as:
$$
h^{i}_{l}=\sum_{f<i}J_{lf}\sigma_{f}\quad\text{for}\quad l\geq i. \tag{10}
$$
Therefore, the TwoBo architecture first computes the vector $\boldsymbol{h}^{i}$ , then approximates $\rho_{i}(\boldsymbol{h}^{i})$ using a Neural Network (for instance, a Multilayer Perceptron), and finally computes Eq. (9) using a skip connection (for the term $2\beta h^{i}_{i}$ ) and a sigmoid activation. By construction, in a two-dimensional model with $N=L^{2}$ variables and nearest-neighbor interactions, for a given $i$ only $\mathcal{O}(L=\sqrt{N})$ of the $h^{i}_{l}$ are non-zero, see Ref. [21, Fig. 1]. Therefore, the TwoBo architecture has $\mathcal{O}(N^{\frac{3}{2}})$ parameters in the two-dimensional case. However, the scaling worsens in higher dimensions $d$ , becoming $\mathcal{O}(N^{\frac{2d-1}{d}})$ , and is the same as MADE (i.e. $\mathcal{O}(N^{2})$ ) for random graphs. Additionally, the choice of the $h^{i}_{l}$ , although mathematically justified, does not have a clear physical interpretation: the TwoBo architecture takes into account faraway spins which, however, should not matter much when $N$ is large.
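For concreteness, the fields of Eq. (10) can be computed in a single matrix-vector product. This is an illustrative sketch assuming a dense coupling matrix `J`; on sparse topologies most entries vanish, as noted above.

```python
import numpy as np

def twobo_fields(i, sigma, J):
    """Fields h^i_l = sum_{f<i} J_{lf} sigma_f for l >= i, as in Eq. (10).

    Each h^i_l is the field induced on spin l by the already-fixed
    spins sigma_{<i}. Returns the vector (h^i_i, h^i_{i+1}, ..., h^i_N).
    """
    return J[i:, :i] @ sigma[:i]
```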
To summarize, the NADE, MADE and TwoBo architectures all need $\mathcal{O}(N^{2})$ parameters for generic models defined on random graph structures. TwoBo has fewer parameters for models defined on $d$ -dimensional lattices, but still $\mathcal{O}(N^{\frac{2d-1}{d}})\gg\mathcal{O}(N)$ . CNN architectures with translationally invariant kernels have fewer parameters, but are not suitable for disordered systems, are usually applied in $d=2$ , and remain limited to $d\leq 3$ .
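As an aside on the MADE construction discussed above, the two masks of a one-hidden-layer network can be built with the degree-assignment rule of Ref. [31]; this is a schematic illustration with our own variable names, not the paper's code.

```python
import numpy as np

def made_masks(N, N_h, rng):
    """Binary masks enforcing autoregressivity in a one-hidden-layer MADE.

    Input j is assigned degree j+1 (i.e. 1..N); each hidden unit gets a
    random degree m_k in {1, ..., N-1}. A hidden unit sees inputs with
    degree <= m_k, and output i sees hidden units with m_k < i, so output
    i depends only on inputs j < i.
    """
    m = rng.integers(1, N, size=N_h)  # hidden-unit degrees in {1, ..., N-1}
    degrees = np.arange(1, N + 1)
    mask_in = (m[:, None] >= degrees[None, :]).astype(float)   # N_h x N
    mask_out = (degrees[:, None] > m[None, :]).astype(float)   # N x N_h
    return mask_in, mask_out
```

Multiplying the two masks gives the input-output dependency pattern, which must be strictly lower-triangular for the factorization of Eq. (6) to hold.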
To solve these problems, we present here the new, physically motivated 4N architecture, which has $\mathcal{O}(\mathcal{A}N)$ parameters, with a prefactor $\mathcal{A}$ that remains finite for $N\to\infty$ and can be interpreted in terms of the correlation length of the system, as we show in the next section.
Figure 1: Schematic representation of the 4N architecture implemented on a two-dimensional lattice model with nearest-neighbors interactions. (a) We want to estimate the probability $\pi_{i}\equiv P(\sigma_{i}=+1\mid\boldsymbol{\sigma}_{<i})$ that the spin $i$ (in purple) has value $+1$ given the spins $<i$ (in black and white). The spins $>i$ (in grey) have not yet been fixed. (b) We compute the values of the local fields $h_{l}^{i,(0)}$ as in Eq. (10), for $l$ in a neighborhood of $i$ of size $\ell$ , indicated by a purple contour in (a). Notice that spins $l$ that are not in the neighborhood of one of the fixed spins have $h_{l}^{i,(0)}=0$ and are indicated in white in (b). (c) In the main cycle of the algorithm, we apply a series of $\ell$ GNN layers with the addition of skip connections. (d) The final GNN layer is not followed by a skip connection and yields the final values of the field $h_{i}^{i,(\ell)}$ . (e) $\pi_{i}$ is estimated by applying a sigmoid function to $h_{i}^{i,(\ell)}$ . Periodic (or other) boundary conditions can be considered, but are not shown here for clarity.
## Results
### The 4N architecture
The 4N architecture (Fig. 1) computes the probability $\pi_{i}\equiv P(\sigma_{i}=+1\mid\boldsymbol{\sigma}_{<i})$ by propagating the $h_{l}^{i}$ , defined as in TwoBo, Eq. (10), through a Graph Neural Network (GNN) architecture. The variable $h_{l}^{i}$ is interpreted as the local field induced by the frozen variables $\boldsymbol{\sigma}_{<i}$ on spin $l$ . Note that in 4N, we consider all possible values of $l=1,\cdots,N$ , not only $l\geq i$ as in TwoBo. The crucial difference with TwoBo is that only the fields in a finite neighborhood of variable $i$ need to be considered, because the initial values of the fields are propagated through a series of GNN layers that take into account the locality of the physical problem. During propagation, fields are combined with layer- and edge-dependent weights together with the couplings $J_{ij}$ and the inverse temperature $\beta$ . After each layer of the network (except the final one), a skip connection to the initial configuration is added. After the final layer, a sigmoidal activation is applied to find $\pi_{i}$ . The network has a total number of parameters equal to $(c+1)\times\ell\times N$ , where $c$ is the average number of neighbors for each spin. Details are given in the Methods section.
More precisely, because GNN updates are local, and there are $\ell$ of them, in order to compute the final field on site $i$ , hence $\pi_{i}$ , we only need to consider the set of fields corresponding to sites at distance at most $\ell$ ; the number of operations therefore scales proportionally to the number of such sites, which we call $\mathcal{B}(\ell)$ . Because we need to compute $\pi_{i}$ for each of the $N$ sites, the generation of a new configuration then requires $\mathcal{O}(\mathcal{B}(\ell)\times N)$ steps. We recall that in a $d$ -dimensional lattice $\mathcal{B}(\ell)\propto\ell^{d}$ , while for random graphs $\mathcal{B}(\ell)\propto c^{\ell}$ . It is therefore crucial to assess how the number of layers $\ell$ should scale with the system size $N$ .
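The propagation just described can be sketched schematically as below. This is only an illustration consistent with the text (a weighted, coupling- and temperature-dependent aggregation over neighbors, with skip connections to the initial fields after every layer but the last); the actual update rule is specified in the Methods section, and all names here are ours.

```python
import numpy as np

def fourn_pi(i, h0, adj, J, W, beta):
    """Illustrative 4N-style propagation (NOT the paper's exact update rule).

    h0: initial fields of Eq. (10), extended to all sites.
    adj: {site: list of neighboring sites} describing the interaction graph.
    J: {(l, n): coupling}; W: per-layer dict of learned edge weights.
    """
    h = h0.copy()
    n_layers = len(W)
    for layer in range(n_layers):
        new_h = np.zeros_like(h)
        for l, neighbors in adj.items():
            # combine neighboring fields with edge weights, couplings and beta
            new_h[l] = sum(W[layer][(l, n)] * beta * J[(l, n)] * h[n]
                           for n in neighbors)
        # skip connection to the initial fields, except after the last layer
        h = new_h if layer == n_layers - 1 else new_h + h0
    return 1.0 / (1.0 + np.exp(-h[i]))  # pi_i = S(h_i^(ell))
```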
Figure 2: Relative difference in mean energy between the configurations generated by the 4N architecture and by PT, for different numbers of layers $\ell$ and different values of the inverse temperature $\beta$ . Solid lines indicate averages over 10 different instances, while the dashed lines and the colored area identify the region corresponding to plus or minus one standard deviation.
### The benchmark Gibbs-Boltzmann distribution
In this paper, in order to test the 4N architecture on a prototypically hard-to-sample problem, we will consider the Edwards-Anderson spin glass model, described by the Hamiltonian in Eq. (3) for Ising spins, where we set $H_{i}=0$ for simplicity. The couplings are chosen to be non-zero only for neighboring pairs $\langle i,j\rangle$ on a two-dimensional square lattice. Non-zero couplings $J_{ij}$ are independent random variables, identically distributed according to either a normal or a Rademacher distribution. This model was also used as a benchmark in Refs. [13, 12]. All the results reported in the rest of the section are for a $16\times 16$ square lattice, hence with $N=256$ spins, which guarantees a large enough system to investigate correlations at low temperatures while keeping the problem tractable. In the SI we also report some data for a $32\times 32$ lattice.
While this model has no finite-temperature spin glass transition, sampling it via standard local MCMC becomes extremely hard for $\beta\gtrsim 1.5$ [12]. To this day, the best option for sampling this model at lower temperatures is parallel tempering (PT), which we used to construct a sample of $M=2^{16}$ equilibrium configurations for several instances of the model and at several temperatures. Details are given in the Methods section.
The PT-generated training set is used to train the 4N architecture, independently for each instance and each temperature. Remember that the goal here is to test the expressivity of the architecture, which is why we train it in the best possible situation, i.e. by maximizing the likelihood of equilibrium configurations at each temperature. Once the 4N model has been trained, we perform several comparisons that are reported in the following.
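Training by maximum likelihood amounts to minimizing the negative log-likelihood of the PT-generated configurations under the autoregressive model. Schematically (with `pi_fn` standing in for the trained network's conditional probabilities; the function is our sketch, not the paper's code):

```python
import numpy as np

def nll(configs, pi_fn):
    """Mean negative log-likelihood of a set of configurations under an
    autoregressive model, using the factorization of Eq. (6).

    pi_fn(i, sigma) returns P(sigma_i = +1 | sigma_{<i}).
    """
    total = 0.0
    for sigma in configs:
        for i in range(len(sigma)):
            pi = pi_fn(i, sigma)
            total += -np.log(pi if sigma[i] > 0 else 1.0 - pi)
    return total / len(configs)
```

A trivial sanity check: under the uniform model $\pi_i=1/2$, every $N$-spin configuration has likelihood $2^{-N}$, so the mean NLL is $N\log 2$.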
### Energy
We begin by comparing the mean energy of the configurations generated by the model with that of the configurations generated via parallel tempering, which is important because the energy plays a central role in the Metropolis reweighting scheme of Eq. (5). Fig. 2 shows the relative energy difference, as a function of the inverse temperature $\beta$ and of the number of layers $\ell$ , averaged over 10 different instances of the disorder. First, it is clear that adding layers improves the results remarkably. We stress that, since the activation functions of our network are linear, adding layers only helps in taking into account spins that are further away. Second, we notice that the relative error found when using 6 layers or more is small at all temperatures considered. Finally, we notice that the behaviour of the error is non-monotonic: it appears to decrease at low temperatures even for networks with few layers.
[Figure x3.png: $|\langle\log P^{\ell}_{\text{AR}}\rangle_{\text{data}}/S|$ as a function of the number of layers $\ell$ (from 2 to 10), for several temperatures from $T=0.58$ to $T=2.83$ ; the dependence on $\ell$ weakens as $T$ increases.]
<details>
<summary>x4.png Details</summary>

### Visual Description
## Line Graph: Relationship Between Number of Layers and Log P_AR^10 Difference
### Overview
The graph illustrates the relationship between the number of layers (ℓ) and the absolute difference in logarithmic values of ⟨log P_AR^10⟩_data for various temperature (T) conditions. Six distinct data series are plotted, each corresponding to a specific T value, with trends showing how the difference evolves as the number of layers increases from 2 to 10.
### Components/Axes
- **X-axis**: "Number of layers, ℓ" (integer values from 2 to 10).
- **Y-axis**: "|⟨log P_AR^10⟩_data − ⟨log P_AR^10⟩_data|" (logarithmic scale, ranging from 0.00 to 2.00).
- **Legend**: Located in the top-right corner, mapping six T values to colors:
- **T = 0.58** (dark blue, solid circles)
- **T = 0.72** (purple, solid circles)
- **T = 0.86** (magenta, solid circles)
- **T = 1.02** (pink, solid circles)
- **T = 1.41** (orange, solid circles)
- **T = 1.95** (yellow, solid circles)
- **T = 2.83** (red, solid circles)
- **Line Style**: All lines are dashed, with markers (circles) at each data point.
### Detailed Analysis
1. **T = 0.58 (dark blue)**:
- Starts at ~2.00 for ℓ = 2, decreasing sharply to ~0.05 by ℓ = 10.
- Steepest decline among all series.
2. **T = 0.72 (purple)**:
- Begins at ~1.80 for ℓ = 2, dropping to ~0.10 by ℓ = 10.
- Intermediate slope compared to other series.
3. **T = 0.86 (magenta)**:
- Initial value ~1.60 at ℓ = 2, decreasing to ~0.15 at ℓ = 10.
- Moderate decline rate.
4. **T = 1.02 (pink)**:
- Starts at ~1.40 for ℓ = 2, falling to ~0.20 at ℓ = 10.
- Slower decline than lower T values.
5. **T = 1.41 (orange)**:
- Begins at ~0.50 for ℓ = 2, dropping to ~0.05 at ℓ = 10.
- Shallow slope, minimal change across layers.
6. **T = 1.95 (yellow)**:
- Starts at ~0.30 for ℓ = 2, decreasing to ~0.02 at ℓ = 10.
- Very gradual decline.
7. **T = 2.83 (red)**:
- Initial value ~0.10 at ℓ = 2, remaining near 0.00 for ℓ ≥ 4.
- Near-zero difference after ℓ = 4.
### Key Observations
- **Inverse Relationship**: Higher T values correlate with smaller absolute differences in ⟨log P_AR^10⟩_data.
- **Convergence**: All series approach zero as ℓ increases, with higher T values stabilizing faster.
- **Layer Sensitivity**: Lower T values (e.g., 0.58) exhibit stronger dependence on layer count, while higher T values (e.g., 2.83) show minimal layer-dependent variation.
- **Data Point Consistency**: All markers align precisely with their respective legend colors, confirming accurate series identification.
### Interpretation
The data suggests that temperature (T) modulates the sensitivity of ⟨log P_AR^10⟩_data to the number of layers. Lower T values amplify layer-dependent variations, while higher T values suppress these effects, potentially indicating a saturation or equilibrium state. The convergence of all series toward zero at higher layers implies that beyond a critical layer count (ℓ > 8), the system reaches a stable configuration where T has negligible impact. This could reflect a phase transition or critical threshold in the modeled system, where increasing layers homogenize the logarithmic differences regardless of T. The stark contrast between T = 0.58 (high sensitivity) and T = 2.83 (near-zero sensitivity) highlights T as a critical parameter governing system behavior.
</details>
Figure 3: (a) Ratio between the cross-entropy $|\langle\log P^{\ell}_{\text{NN}}\rangle_{\text{data}}|$ and the Gibbs-Boltzmann entropy $S_{\text{GB}}$ for different values of the temperature $T$ and of the number of layers $\ell$ . Both $|\langle\log P^{\ell}_{\text{NN}}\rangle_{\text{data}}|$ and $S_{\text{GB}}$ are averaged over 10 samples. Dashed lines are exponential fits of the form $Ae^{-\ell/\overline{\ell}}+C$ . (b) Absolute difference between $\langle\log P^{\ell}_{\text{NN}}\rangle_{\text{data}}$ at various $\ell$ and at $\ell=10$ . Dashed lines are exponential fits of the form $Ae^{-\ell/\overline{\ell}}$ .
[Figure 4 panels: $P(q)$ vs $q$ for $\ell = 2, 4, 6, 10$. (a) $T = 1.95$, instance 1; (b) $T = 0.72$, instance 1; (c) $T = 0.31$, instance 1; (d) $T = 0.31$, instance 2.]
Figure 4: Examples of the probability distribution of the overlap, $P(q)$ , for different instances of the disorder and different temperatures. Solid color lines correspond to configurations generated using 4N, while dashed black lines correspond to the $P(q)$ obtained from equilibrium configurations generated via PT. (a) At high temperature, even a couple of layers are enough to reproduce the distribution well. (b) At lower temperatures more layers are required, as expected because of the increase of the correlation length and of the complexity of the Gibbs-Boltzmann distribution. (c) Even lower temperature for the same instance. (d) A test on a more complex instance shows that 4N is capable of learning even non-trivial forms of $P(q)$ .
### Entropy and Kullback-Leibler divergence
Next, we consider the Kullback-Leibler divergence between the learned probability distribution $P_{\text{NN}}$ and the target probability distribution, $D_{\text{KL}}(P_{\text{GB}}\parallel P_{\text{NN}})$ . Denoting by $\langle\,\cdot\,\rangle_{\text{GB}}$ the average with respect to $P_{\text{GB}}$ , we have
$$
D_{\text{KL}}(P_{\text{GB}}\parallel P_{\text{NN}})=\langle\log P_{\text{GB}}
\rangle_{\text{GB}}-\langle\log P_{\text{NN}}\rangle_{\text{GB}}\ . \tag{11}
$$
The first term $\langle\log P_{\text{GB}}\rangle_{\text{GB}}=-S_{\text{GB}}(T)$ is minus the entropy of the Gibbs-Boltzmann distribution and does not depend on $P_{\text{NN}}$ , hence we focus on the cross-entropy $-\langle\log P_{\text{NN}}\rangle_{\text{GB}}$ as a function of the number of layers for different temperatures. Minimizing this quantity corresponds to minimizing the KL divergence. In the case of perfect learning, i.e. $D_{\text{KL}}=0$ , we have $-\langle\log P_{\text{NN}}\rangle_{\text{GB}}=S_{\text{GB}}(T)$ . In Fig. 3 we compare the cross-entropy, estimated as $-\langle\log P_{\text{NN}}\rangle_{\text{data}}$ on the data generated by PT, with $S_{\text{GB}}(T)$ computed using thermodynamic integration. We find a good match for large enough $\ell$ , indicating that the 4N model is accurately trained and does not suffer from mode collapse.
In order to study more carefully how the number of layers affects the accuracy of the training, in Fig. 3 we also show the difference $|\langle\log P^{\ell}_{\text{NN}}\rangle_{\text{data}}-\langle\log P^{10}_{\text{NN}}\rangle_{\text{data}}|$ as a function of the number of layers $\ell$ for different temperatures. The plots are compatible with an exponential decay of the form $Ae^{-\ell/\overline{\ell}}$ , with $\overline{\ell}$ thus being an estimate of the number of layers above which the 4N architecture becomes accurate. We therefore need to understand how $\overline{\ell}$ changes with temperature and system size; to do so, we introduce appropriate correlation functions next.
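As a concrete illustration, the cross-entropy estimate and the decay scale $\overline{\ell}$ can be extracted as below. This is a sketch under stated assumptions: a pure exponential decay with no offset $C$, as in the fit of Fig. 3(b), with a log-linear least-squares fit standing in for whatever fitting procedure the authors actually used.

```python
import numpy as np

def cross_entropy(log_p_nn_data):
    """Monte Carlo estimate of -<log P_NN>_GB from the log-probabilities
    that the trained network assigns to PT-equilibrated configurations."""
    return -np.mean(log_p_nn_data)

def decay_scale(ells, deltas):
    """Estimate lbar from |<log P^l> - <log P^10>| ~ A exp(-l/lbar).
    Since log(delta) = log A - l/lbar is linear in l, an ordinary
    least-squares line fit recovers lbar = -1/slope."""
    slope, _intercept = np.polyfit(ells, np.log(deltas), 1)
    return -1.0 / slope
```

For data that truly decays exponentially, `decay_scale` recovers $\overline{\ell}$ exactly; on noisy measurements it gives the same least-squares estimate as fitting $A e^{-\ell/\overline{\ell}}$ in log space.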
[Figure 5 panels: $G$ vs $r$ (logarithmic scale) for $\ell = 2, 4, 6, 10$ and PT data. (a) $T = 1.02$; (b) $T = 0.45$.]
Figure 5: Comparison between the spatial correlation functions obtained from 4N and those obtained from PT, at different high (a) and low (b) temperatures and for different number of layers. The grey dashed lines correspond to the value $1/e$ that defines the correlation length $\xi$ . Data are averaged over 10 instances.
### Overlap distribution function
In order to show that the 4N architecture is able to capture more complex features of the Gibbs-Boltzmann distribution, we have considered a more sensitive observable: the probability distribution of the overlap, $P(q)$ . The overlap $q$ between two configurations of spins $\boldsymbol{\sigma}^{1}=\{\sigma^{1}_{1},\sigma^{1}_{2},\dots,\sigma^{1}_{N}\}$ and $\boldsymbol{\sigma}^{2}=\{\sigma^{2}_{1},\sigma^{2}_{2},\dots,\sigma^{2}_{N}\}$ is a central quantity in spin glass physics, and is defined as:
$$
q_{12}=\frac{1}{N}\boldsymbol{\sigma}^{1}\cdot\boldsymbol{\sigma}^{2}=\frac{1}
{N}\sum_{i=1}^{N}\sigma^{1}_{i}\sigma^{2}_{i}. \tag{12}
$$
The histogram of $P(q)$ is obtained from many pairs of configurations, independently taken either from the PT-generated training set or generated with the 4N architecture. Examples for two different instances at several temperatures are shown in Fig. 4. At high temperatures, even very few layers are able to reproduce the distribution of the overlaps. When going to lower temperatures, and to more complex shapes of $P(q)$ , however, the number of layers required for a good approximation of $P(q)$ increases. This result is compatible with the intuitive idea that more layers are needed to reproduce a system at low temperatures, where the correlation length is large.
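Estimating $P(q)$ from sampled configurations amounts to applying Eq. (12) to many independent pairs and histogramming the result; a minimal numpy sketch (array names are illustrative):

```python
import numpy as np

def overlap_histogram(configs_a, configs_b, bins=51):
    """P(q) from pairs of independently sampled configurations.
    configs_a, configs_b : (n_pairs, N) arrays of +/-1 spins; pair i gives
    q_i = (1/N) sigma^1_i . sigma^2_i, i.e. Eq. (12)."""
    q = np.mean(configs_a * configs_b, axis=1)
    hist, edges = np.histogram(q, bins=bins, range=(-1.0, 1.0), density=True)
    return q, hist, edges
```

The same routine serves both data sources: pairs drawn from the PT training set and pairs generated by the 4N network, whose histograms are then overlaid as in Fig. 4.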
As shown in the SI, the performance of 4N is on par with that of other algorithms with more parameters. Moreover, with a suitable increase in the number of layers, 4N also achieves satisfactory results as the system size increases.
[Figure 6 panels: (a) correlation length $\xi$ and decay parameter $\overline{\ell}$ vs $\beta$; (b) $|\langle\log P^{\ell}_{\text{NN}}\rangle_{\text{data}}-\langle\log P^{10}_{\text{NN}}\rangle_{\text{data}}|/N$ vs $\ell$ for $L = 12, 16, 20, 24$.]
Figure 6: (a) Comparison between the correlation length $\xi$ and the decay parameter $\overline{\ell}$ of the exponential fit as a function of the inverse temperature $\beta$ . Both data sets are averaged over 10 instances. The data are limited to $T\geq 0.45$ because below that temperature the finite size of the sample prevents us from obtaining a reliable estimate of the correlation length. (b) Absolute difference between $\langle\log P^{\ell}_{\text{NN}}\rangle_{\text{data}}$ at various $\ell$ and at $\ell=10$ for $T=1.02$ and different system sizes. Data are averaged over ten different instances of the disorder and the colored areas identify regions corresponding to plus or minus one standard deviation. A version of this plot in linear-logarithmic scale is available in the SI.
### Spatial correlation function
To better quantify spatial correlations, a proper correlation function for a spin glass system can be defined from the overlap field as [35]
$$
G(r)=\frac{1}{\mathcal{N}(r)}\sum_{i,j:d_{ij}=r}\overline{\langle q_{i}q_{j}\rangle}\ , \tag{13}
$$
where $r=d_{ij}$ is the lattice distance of sites $i,j$ , ${q_{i}=\sigma^{1}_{i}\sigma^{2}_{i}}$ is the local overlap, and $\mathcal{N}(r)=\sum_{i,j:d_{ij}=r}1$ is the number of pairs of sites at distance $r$ . Note that in this way $G(0)=1$ by construction. Results for $G(r)$ at various temperatures are shown in Fig. 5. It is clear that increasing the number of layers leads to a better agreement at lower temperatures. Interestingly, architectures with a lower number of layers produce a smaller correlation function at any distance, which shows that upon increasing the number of layers the 4N architecture is effectively learning the correlations of the Gibbs-Boltzmann measure.
To quantify this intuition, we extract from the exact $G(r)$ of the EA model a correlation length $\xi$ , from the relation $G(\xi)=1/e$ . Then, we compare $\xi$ with $\overline{\ell}$ , i.e. the number of layers above which the model becomes accurate, extracted from Fig. 3. These two quantities are compared in Fig. 6 as a function of temperature, showing that they are roughly proportional. In fact, $\overline{\ell}$ seems to grow even more slowly than $\xi$ at low temperatures. In Fig. 6 we also reproduce the results of Fig. 3b for different sizes of the system. The curves coincide for large enough values of $N$ , showing that the required number of layers saturates to a size-independent value upon increasing the system size. We are thus able to introduce a physical criterion to fix the optimal number of layers needed for the 4N architecture to perform well: $\ell$ must be of the order of the correlation length $\xi$ of the model at that temperature.
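As a concrete illustration, the estimator of Eq. (13) and the extraction of $\xi$ from $G(\xi)=1/e$ can be sketched in NumPy as follows. This is not the code used in the paper: the function names are ours, we assume the Manhattan lattice distance for $d_{ij}$, and the $1/e$ crossing is located by linear interpolation.

```python
import numpy as np

def correlation_function(s1, s2):
    """Estimator of G(r), Eq. (13), from two replicas s1, s2 of shape (M, L, L).

    s1 and s2 hold M equilibrium configurations (+-1) of the same disorder
    instance; the lattice distance d_ij is taken to be the Manhattan distance.
    """
    M, L, _ = s1.shape
    q = (s1 * s2).reshape(M, -1)        # local overlaps q_i, shape (M, N)
    C = q.T @ q / M                     # thermal averages <q_i q_j>
    xs, ys = np.meshgrid(np.arange(L), np.arange(L), indexing="ij")
    x, y = xs.ravel(), ys.ravel()
    d = np.abs(x[:, None] - x[None, :]) + np.abs(y[:, None] - y[None, :])
    rs = np.arange(d.max() + 1)
    G = np.array([C[d == r].mean() for r in rs])   # (1 / N(r)) sum over pairs
    return rs, G

def correlation_length(rs, G):
    """xi from the crossing G(xi) = 1/e, by linear interpolation."""
    below = np.where(G < 1.0 / np.e)[0]
    if len(below) == 0:
        return float(rs[-1])            # xi not resolved on this lattice
    k = below[0]
    r0, r1, g0, g1 = rs[k - 1], rs[k], G[k - 1], G[k]
    return r0 + (g0 - 1.0 / np.e) / (g0 - g1) * (r1 - r0)
```

Note that $G(0)=1$ is automatic here, since $q_i^2=1$ for every sample.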
We recall that the 4N architecture requires $(c+1)\times\ell\times N$ parameters, where $c$ is the average connectivity of the graph ( $c=2d$ for a $d$ -dimensional lattice), and $\mathcal{O}(\mathcal{B}(\ell)\times N)$ operations to generate a new configuration. We thus conclude that, for a model with a correlation length $\xi$ that remains finite for $N\to\infty$ , we can choose $\ell$ proportional to $\xi$ , hence achieving the desired linear scaling with system size both for the number of parameters and the number of operations.
## Discussion
In this work we have introduced a new, physics-inspired 4N architecture for the approximation of the Gibbs-Boltzmann probability distribution of a spin glass system of $N$ spins. The 4N architecture is inspired by the previously introduced TwoBo [21], but it implements explicitly the locality of the physical model via a Graph Neural Network structure.
The 4N network possesses many desirable properties. (i) The architecture has a physical interpretation, with the required number of layers being directly linked to the correlation length of the original data. (ii) As a consequence, if the correlation length of the model is finite, the required number of parameters for a good fit scales linearly in the number of spins, thus greatly improving the scalability with respect to previous architectures. (iii) The architecture is very general, and can be applied to lattices in all dimensions as well as arbitrary interaction graphs without losing the good scaling of the number of parameters, as long as the average connectivity remains finite. It can also be generalized to systems with more than two-body interactions by using a factor graph representation in the GNN layers.
As a benchmark study, we have tested that the 4N architecture is powerful enough to reproduce key properties of the two-dimensional Edwards-Anderson spin glass model, such as the energy, the correlation function and the probability distribution of the overlap.
Having tested that 4N achieves the best possible scaling with system size while being expressive enough, the next step is to check whether it is possible to apply an iterative, self-consistent procedure to automatically train the network at lower and lower temperatures, without the aid of training data generated with a different algorithm (here, parallel tempering). This would allow one to set up a simulated tempering procedure with a good scaling in the number of variables, with applications not only in the simulation of disordered systems but also in the solution of optimization problems. Whether setting up such a procedure is actually possible will be the focus of future work.
## Methods
### Details of the 4N architecture
The Nearest-Neighbours Neural Network (4N) architecture uses a GNN structure to respect the locality of the problem, as illustrated in Fig. 1. The input, given by the local fields computed from the already fixed spins, is passed through $\ell$ layers of weights $W_{ij}^{k}$ . Each weight corresponds to a directed nearest neighbor link $i\leftarrow j$ (including $i=j$ ) and to a layer $k=1,\dots,\ell$ ; therefore the network has a total number of parameters equal to $(c+1)\times\ell\times N$ , i.e. linear in $N$ as long as the number of layers and the model connectivity remain finite. A showcase of the wall-clock times required for the training of the network and for the generation of new configurations is shown in the SI.
The operations performed in order to compute ${\pi_{i}=P(\sigma_{i}=1\mid\boldsymbol{\sigma}_{<i})}$ are the following:
1. initialize the local fields from the input spins $\boldsymbol{\sigma}_{<i}$ :
$$
h_{l}^{i,(0)}=\sum_{f(<i)}J_{lf}\sigma_{f}\ ,\qquad\forall l:|i-l|\leq\ell\ . \tag{14}
$$
2. iterate the GNN layers $\forall k=1,\cdots,\ell-1$ :
$$
h_{l}^{i,(k)}=\phi^{k}\left[\beta W_{ll}^{k}h_{l}^{i,(k-1)}+\sum_{j\in\partial l}\beta J_{lj}W_{lj}^{k}h_{j}^{i,(k-1)}\right]+h_{l}^{i,(0)}\ , \tag{15}
$$
where the sum over $j\in\partial l$ runs over the neighbors of site $l$ . The fields are computed $\forall l:|i-l|\leq\ell-k$ , see Fig. 1.
3. compute the field on site $i$ at the final layer,
$$
h_{i}^{i,(\ell)}=\phi^{\ell}\left[\beta W_{ii}^{\ell}h_{i}^{i,(\ell-1)}+\sum_{j\in\partial i}\beta J_{ij}W_{ij}^{\ell}h_{j}^{i,(\ell-1)}\right]\ .
$$
4. output $\pi_{i}=P(\sigma_{i}=1|\boldsymbol{\sigma}_{<i})=S[2h_{i}^{i,(\ell)}]$ .
Here $S(x)=1/(1+e^{-x})$ is the sigmoid function and $\phi^{k}(x)$ are layer-specific functions that can be used to introduce non-linearities in the system. The fields $h_{l}^{i,(0)}$ correspond to $\xi_{il}$ in the notation of the TwoBo architecture. Differently from TwoBo, however, 4N uses all the $h_{l}^{i,(0)}$ , instead of only those corresponding to spins that have not yet been fixed.
In practice, we have observed that the network with identity functions, i.e. $\phi^{k}(x)=x$ for all $k$ , is sufficiently expressive, and using non-linearities does not seem to improve the results. Moreover, linear functions have the advantage of being easily explainable and faster; therefore, we have chosen to use them in this work.
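With identity activations, the four steps above can be sketched directly in NumPy. This is an illustrative implementation, not the code used in the paper; for brevity the fields are updated on all sites rather than only for $|i-l|\leq\ell-k$, which yields the same $h_{i}^{i,(\ell)}$ because information travels by one lattice step per layer.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conditional_prob(i, sigma, fixed, J, W, beta, neighbors, ell):
    """Sketch of the 4N forward pass with identity activations phi^k(x) = x.

    i         : site whose conditional probability is computed
    sigma     : N spins (+-1); only entries with fixed[f] = True are used
    fixed     : fixed[f] is True if spin f precedes i in the autoregressive order
    J         : (N, N) coupling matrix, nonzero only on nearest-neighbour links
    W         : list of ell weight matrices W^k of shape (N, N)
    neighbors : neighbors[l] = list of nearest neighbours of site l
    """
    N = len(sigma)
    # step 1: initial local fields from the already fixed spins
    h0 = np.array([sum(J[l, f] * sigma[f] for f in neighbors[l] if fixed[f])
                   for l in range(N)])
    h = h0.copy()
    # step 2: iterate the GNN layers k = 1, ..., ell-1, with a skip connection to h0
    for k in range(ell - 1):
        new = np.empty(N)
        for l in range(N):
            new[l] = beta * W[k][l, l] * h[l] + sum(
                beta * J[l, j] * W[k][l, j] * h[j] for j in neighbors[l])
        h = new + h0
    # step 3: final layer evaluated on site i only, without skip connection
    hi = beta * W[ell - 1][i, i] * h[i] + sum(
        beta * J[i, j] * W[ell - 1][i, j] * h[j] for j in neighbors[i])
    # step 4: sigmoid output pi_i = S[2 h_i]
    return sigmoid(2.0 * hi)
```

At $\beta=0$ the field vanishes and the output is $1/2$ for any couplings, as expected.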
### Spin update order
Autoregressive models such as 4N are based on an ordering of the spins to implement Eq. (6). One can choose a simple raster scan (i.e., from left to right along each row, processing the rows sequentially from top to bottom) or any other scheme of choice, see e.g. Refs. [14, 36] for a discussion. In this work, we have adopted a spiral scheme, in which the spins at the edges of the system are updated first, and then one moves in spirals towards the central spins. As shown in the SI, 4N appears to perform better, for equal training conditions, when the spins are updated following a spiral scheme rather than, for instance, a raster scheme.
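The text does not spell out the exact spiral convention; one natural implementation (boundary first, clockwise from the top-left corner, moving inwards) is:

```python
def spiral_order(L):
    """One possible spiral ordering of an L x L lattice: boundary spins first,
    then moving inwards towards the centre (clockwise from the top-left corner)."""
    top, bottom, left, right = 0, L - 1, 0, L - 1
    order = []
    while top <= bottom and left <= right:
        order += [(top, c) for c in range(left, right + 1)]                 # top row
        order += [(r, right) for r in range(top + 1, bottom + 1)]           # right column
        if top < bottom:
            order += [(bottom, c) for c in range(right - 1, left - 1, -1)]  # bottom row
        if left < right:
            order += [(r, left) for r in range(bottom - 1, top, -1)]        # left column
        top, bottom, left, right = top + 1, bottom - 1, left + 1, right - 1
    return order
```

Each site is visited exactly once, and the central site comes last.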
### Building the training data
We use a training set of $M=2^{16}$ configurations obtained at several temperatures using a Parallel Tempering algorithm for different instances of the couplings $J_{ij}$ , and for a $16\times16$ two-dimensional square lattice with open boundary conditions. Data were gathered in runs of $2^{27}$ steps, by discarding the first $2^{26}$ steps for thermalization and then taking one configuration every $2^{10}$ steps. Temperature swap moves, in which one swaps two configurations at nearby temperatures, were attempted once every 32 normal Metropolis sweeps. The choice of $M=2^{16}$ was motivated by previous studies of the same model [12].
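The temperature swap move follows the standard Parallel Tempering acceptance rule; a minimal sketch (the function name and data layout are ours, and for brevity only the energies are tracked, whereas a real simulation exchanges the configurations as well):

```python
import numpy as np

def attempt_swap(E, beta, k, rng):
    """Attempt a parallel tempering swap between replicas k and k+1.

    E    : list of current replica energies (E[k] lives at inverse temperature beta[k])
    beta : sorted inverse temperatures of the ladder
    The move is accepted with probability
    min(1, exp[(beta[k+1] - beta[k]) * (E[k+1] - E[k])]).
    """
    d = (beta[k + 1] - beta[k]) * (E[k + 1] - E[k])
    if d >= 0 or rng.random() < np.exp(d):
        # exchange the two replicas (here represented by their energies only)
        E[k], E[k + 1] = E[k + 1], E[k]
        return True
    return False
```

A swap that moves the lower-energy configuration to the lower temperature ($d\geq 0$) is always accepted.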
### Model training
The training has been carried out by maximizing the likelihood of the data, a procedure which is equivalent to minimizing the KL divergence $D_{\text{KL}}(P_{\text{GB}}\parallel P_{\text{NN}})$ , with $P_{\text{GB}}$ estimated by the training set. We used an ADAM optimizer with batch size 128 for 80 epochs and a learning rate of 0.001 with exponential decay in time. Training is stopped if the loss has not improved in the last 10 steps, and the lowest-loss model is retained.
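The stated equivalence between maximizing the likelihood and minimizing $D_{\text{KL}}(P_{\text{GB}}\parallel P_{\text{NN}})$ can be checked numerically on a toy distribution, since the two objectives differ only by the model-independent entropy of $P_{\text{GB}}$. The probabilities below are arbitrary stand-ins, not quantities from the paper:

```python
import numpy as np

def nll(p_model, samples):
    """Negative log-likelihood of the samples under a model distribution."""
    return -np.mean(np.log(p_model[samples]))

rng = np.random.default_rng(0)
p_gb = np.array([0.5, 0.3, 0.2])    # toy stand-in for the Gibbs-Boltzmann law
p_nn = np.array([0.4, 0.4, 0.2])    # toy stand-in for the network's law
samples = rng.choice(3, size=200_000, p=p_gb)

s_gb = -np.sum(p_gb * np.log(p_gb))         # entropy of the data distribution
d_kl = np.sum(p_gb * np.log(p_gb / p_nn))   # D_KL(P_GB || P_NN)
# the empirical NLL equals D_KL + S_GB up to sampling noise, so minimizing
# it over the model parameters is the same as minimizing D_KL
assert abs(nll(p_nn, samples) - (d_kl + s_gb)) < 1e-2
```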
Acknowledgements. We especially thank Indaco Biazzo and Giuseppe Carleo for several important discussions on their TwoBo architecture that motivated and inspired the present study. We also thank Giulio Biroli, Patrick Charbonneau, Simone Ciarella, Marylou Gabrié, Leonardo Galliano, Jeanne Trinquier, Martin Weigt and Lenka Zdeborová for useful discussions. We acknowledge support from the computational infrastructure DARIAH.IT, PON Project code PIR01_00022, National Research Council of Italy. The research has received financial support from the “National Centre for HPC, Big Data and Quantum Computing - HPC”, Project CN_00000013, CUP B83C22002940006, NRP Mission 4 Component 2 Investment 1.5, Funded by the European Union - NextGenerationEU. Author LMDB acknowledges funding from the Bando Ricerca Scientifica 2024 - Avvio alla Ricerca (D.R. No. 1179/2024) of Sapienza Università di Roma, project B83C24005280001 – MaLeDiSSi.
## References
- [1] Ulli Wolff. Collective monte carlo updating for spin systems. Physical Review Letters, 62(4):361–364, 1989. Publisher: American Physical Society.
- [2] Andrea Ninarello, Ludovic Berthier, and Daniele Coslovich. Models and algorithms for the next generation of glass transition studies. Phys. Rev. X, 7:021039, Jun 2017.
- [3] E. Marinari and G. Parisi. Simulated tempering: A new monte carlo scheme. Europhysics Letters, 19(6):451, jul 1992.
- [4] J. Houdayer. A cluster monte carlo algorithm for 2-dimensional spin glasses. The European Physical Journal B - Condensed Matter and Complex Systems, 22(4):479–484, 2001.
- [5] Giuseppe Carleo and Matthias Troyer. Solving the quantum many-body problem with artificial neural networks. Science, 355(6325):602–606, 2017. Publisher: American Association for the Advancement of Science.
- [6] Samuel Tamagnone, Alessandro Laio, and Marylou Gabrié. Coarse grained molecular dynamics with normalizing flows. arXiv preprint arXiv:2406.01524, 2024.
- [7] Gerhard Jung, Rinske M Alkemade, Victor Bapst, Daniele Coslovich, Laura Filion, François P Landes, Andrea Liu, Francesco Saverio Pezzicoli, Hayato Shiba, Giovanni Volpe, et al. Roadmap on machine learning glassy liquids. arXiv preprint arXiv:2311.14752, 2023.
- [8] Leonardo Galliano, Riccardo Rende, and Daniele Coslovich. Policy-guided monte carlo on general state spaces: Application to glass-forming mixtures. arXiv preprint arXiv:2407.03275, 2024.
- [9] Simone Ciarella, Dmytro Khomenko, Ludovic Berthier, Felix C Mocanu, David R Reichman, Camille Scalliet, and Francesco Zamponi. Finding defects in glasses through machine learning. Nature Communications, 14(1):4229, 2023.
- [10] Marylou Gabrié, Grant M. Rotskoff, and Eric Vanden-Eijnden. Adaptive monte carlo augmented with normalizing flows. Proceedings of the National Academy of Sciences, 119(10):e2109420119, 2022. Publisher: Proceedings of the National Academy of Sciences.
- [11] Dian Wu, Lei Wang, and Pan Zhang. Solving statistical mechanics using variational autoregressive networks. Physical Review Letters, 122(8):080602, 2019. Publisher: American Physical Society.
- [12] Simone Ciarella, Jeanne Trinquier, Martin Weigt, and Francesco Zamponi. Machine-learning-assisted monte carlo fails at sampling computationally hard problems. Machine Learning: Science and Technology, 4(1):010501, 2023.
- [13] B. McNaughton, M. V. Milošević, A. Perali, and S. Pilati. Boosting monte carlo simulations of spin glasses using autoregressive neural networks. Physical Review E, 101(5):053312, 2020. Publisher: American Physical Society.
- [14] Jeanne Trinquier, Guido Uguzzoni, Andrea Pagnani, Francesco Zamponi, and Martin Weigt. Efficient generative modeling of protein sequences using simple autoregressive models. Nature communications, 12(1):5800, 2021.
- [15] Jonas Köhler, Leon Klein, and Frank Noé. Equivariant flows: Exact likelihood generative learning for symmetric densities.
- [16] Manuel Dibak, Leon Klein, Andreas Krämer, and Frank Noé. Temperature steerable flows and boltzmann generators. Physical Review Research, 4(4):L042005, 2022. Publisher: American Physical Society.
- [17] Michele Invernizzi, Andreas Krämer, Cecilia Clementi, and Frank Noé. Skipping the replica exchange ladder with normalizing flows. The Journal of Physical Chemistry Letters, 13(50):11643–11649, 2022.
- [18] Wenlin Chen, Mingtian Zhang, Brooks Paige, José Miguel Hernández-Lobato, and David Barber. Diffusive gibbs sampling. arXiv preprint arXiv:2402.03008, 2024.
- [19] Davide Ghio, Yatin Dandi, Florent Krzakala, and Lenka Zdeborová. Sampling with flows, diffusion and autoregressive neural networks: A spin-glass perspective. arXiv preprint arXiv:2308.14085, 2023.
- [20] Michael Kilgour and Lena Simine. Inside the black box: A physical basis for the effectiveness of deep generative models of amorphous materials. Journal of Computational Physics, 452:110885, 2022.
- [21] Indaco Biazzo, Dian Wu, and Giuseppe Carleo. Sparse autoregressive neural networks for classical spin systems.
- [22] Martin J. A. Schuetz, J. Kyle Brubaker, and Helmut G. Katzgraber. Combinatorial optimization with physics-inspired graph neural networks. Nature Machine Intelligence, 4(4):367–377, 2022. Publisher: Nature Publishing Group.
- [23] Martin J. A. Schuetz, J. Kyle Brubaker, Zhihuai Zhu, and Helmut G. Katzgraber. Graph coloring with physics-inspired graph neural networks. Physical Review Research, 4(4):043131, 2022. Publisher: American Physical Society.
- [24] Changjun Fan, Mutian Shen, Zohar Nussinov, Zhong Liu, Yizhou Sun, and Yang-Yu Liu. Searching for spin glass ground states through deep reinforcement learning. Nature Communications, 14(1):725, February 2023. Publisher: Nature Publishing Group.
- [25] Daria Pugacheva, Andrei Ermakov, Igor Lyskov, Ilya Makarov, and Yuriy Zotov. Enhancing gnns performance on combinatorial optimization by recurrent feature update. arXiv:2407.16468, 2024.
- [26] Lorenzo Colantonio, Andrea Cacioppo, Federico Scarpati, and Stefano Giagu. Efficient graph coloring with neural networks: A physics-inspired approach for large graphs. arXiv:2408.01503, 2024.
- [27] Maria Chiara Angelini and Federico Ricci-Tersenghi. Modern graph neural networks do worse than classical greedy algorithms in solving combinatorial optimization problems like maximum independent set. Nature Machine Intelligence, 5(1):29–31, December 2022. arXiv:2206.13211 [cond-mat].
- [28] Stefan Boettcher. Inability of a graph neural network heuristic to outperform greedy algorithms in solving combinatorial optimization problems. Nature Machine Intelligence, 5(1):24–25, 2023.
- [29] Stefan Boettcher. Deep reinforced learning heuristic tested on spin-glass ground states: The larger picture. Nature Communications, 14(1):5658, 2023.
- [30] Changjun Fan, Mutian Shen, Zohar Nussinov, Zhong Liu, Yizhou Sun, and Yang-Yu Liu. Reply to: Deep reinforced learning heuristic tested on spin-glass ground states: The larger picture. Nature communications, 14(1):5659, 2023.
- [31] Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. Made: Masked autoencoder for distribution estimation. In International conference on machine learning, pages 881–889. PMLR, 2015.
- [32] Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. Advances in neural information processing systems, 29, 2016.
- [33] Ata Madanchi, Michael Kilgour, Frederik Zysk, Thomas D Kühne, and Lena Simine. Simulations of disordered matter in 3d with the morphological autoregressive protocol (map) and convolutional neural networks. The Journal of Chemical Physics, 160(2), 2024.
- [34] Indaco Biazzo. The autoregressive neural network architecture of the boltzmann distribution of pairwise interacting spins systems. Communications Physics, 6(1):296, 2023.
- [35] LA Fernandez, E Marinari, V Martin-Mayor, Giorgio Parisi, and JJ Ruiz-Lorenzo. Universal critical behavior of the two-dimensional ising spin glass. Physical Review B, 94(2):024402, 2016.
- [36] Yi Hong Teoh and Roger G Melko. Autoregressive model path dependence near ising criticality. arXiv:2408.15715, 2024.
## S1 Computational details
We now introduce a simple way to efficiently generate data using the 4N architecture. As long as the activation functions $\phi^{k}$ of the different layers of 4N are linear, all the intermediate message passing layers can be substituted with a single matrix multiplication. Indeed, let us consider, as in the main text, identity activation functions. In this case, we can rewrite the matrix implementing the $t$-th layer, $\underline{\underline{A}}^{t}$ , as $\underline{\underline{A}}^{t}=\beta(\underline{\underline{J}}+\underline{ \underline{I}})\odot\underline{\underline{W}}^{t}$ , where $\beta$ is the inverse temperature, $\underline{\underline{J}}$ is the coupling matrix, $\underline{\underline{I}}$ is the identity matrix, and $\underline{\underline{W}}^{t}$ is the $t$-th layer weight matrix. Here, $\odot$ is the elementwise multiplication. Now we observe that (in 4N we put skip connections after all but the last layer)
$$
\displaystyle\mathbf{h}^{t}=\underline{\underline{A}}^{t}\mathbf{h}^{t-1}=
\underline{\underline{A}}^{t}(\underline{\underline{A}}^{t-1}\mathbf{h}^{t-2}+
h^{0})=\dots=(\underline{\underline{A}}^{t}\underline{\underline{A}}^{t-1}
\dots\underline{\underline{A}}^{1}+\underline{\underline{A}}^{t}\underline{
\underline{A}}^{t-1}\dots\underline{\underline{A}}^{2}+\dots+\underline{
\underline{A}}^{t}\underline{\underline{A}}^{t-1}+\underline{\underline{A}}^{t
})\mathbf{h}^{0}. \tag{16}
$$
Now, remembering that, if the input configuration $\mathbf{s}$ is properly masked, we have $\mathbf{h}^{0}=\underline{\underline{J}}\mathbf{s}$ , it follows that
$$
\displaystyle\mathbf{h}^{t}=(\underline{\underline{A}}^{t}\underline{
\underline{A}}^{t-1}\dots\underline{\underline{A}}^{1}+\underline{\underline{A
}}^{t}\underline{\underline{A}}^{t-1}\dots\underline{\underline{A}}^{2}+\dots+
\underline{\underline{A}}^{t}\underline{\underline{A}}^{t-1}+\underline{
\underline{A}}^{t})\underline{\underline{J}}\mathbf{s}, \tag{17}
$$
which can be rewritten in matrix form as $\mathbf{h}^{t}=\underline{\underline{M}}\mathbf{s}$ by defining
$$
\displaystyle\underline{\underline{M}}=(\underline{\underline{A}}^{t}
\underline{\underline{A}}^{t-1}\dots\underline{\underline{A}}^{1}+\underline{
\underline{A}}^{t}\underline{\underline{A}}^{t-1}\dots\underline{\underline{A}
}^{2}+\dots+\underline{\underline{A}}^{t}\underline{\underline{A}}^{t-1}+
\underline{\underline{A}}^{t})\underline{\underline{J}}, \tag{18}
$$
and the final output is obtained simply by taking the elementwise sigmoid of $\mathbf{h}^{t}$ . Notice that the $i$ -th row of the matrix $\underline{\underline{M}}$ has zeroes in all columns corresponding to spins at distance larger than $\ell+1$ from spin $i$ ; that is, the number of non-zero elements per row is constant with system size as long as $\ell$ is kept fixed. Therefore, generation can simply be implemented as a series of vector products, each considering only the nonzero elements of the matrix $\underline{\underline{M}}$ . This allows one to compute the $i$ -th element of the vector $\mathbf{h}^{t}$ in a time $\mathcal{O}(1)$ in system size, therefore keeping the generation procedure $\mathcal{O}(N)$ . Moreover, the training time scales as $\mathcal{O}(N)$ .
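The collapse of the linear layers into the single matrix $\underline{\underline{M}}$ of Eq. (18) can be verified numerically. This is a toy check with random couplings and weights; the full, unmasked configuration is used, so that $\mathbf{h}^{0}=\underline{\underline{J}}\mathbf{s}$ holds trivially:

```python
import numpy as np

rng = np.random.default_rng(1)
N, t, beta = 6, 3, 0.7
# random symmetric couplings and per-layer weights; A^k = beta (J + I) * W^k elementwise
J = rng.normal(size=(N, N))
J = (J + J.T) / 2
np.fill_diagonal(J, 0.0)
A = [beta * (J + np.eye(N)) * rng.normal(size=(N, N)) for _ in range(t)]
s = rng.choice([-1.0, 1.0], size=N)

# layer-by-layer evaluation: skip connections after all but the last layer
h0 = J @ s
h = h0.copy()
for k in range(t - 1):
    h = A[k] @ h + h0
h = A[t - 1] @ h

# collapsed matrix M of Eq. (18): sum of the suffix products of the A^k
M = np.zeros((N, N))
prod = np.eye(N)
for k in range(t - 1, -1, -1):
    prod = prod @ A[k]        # A^t, then A^t A^{t-1}, ..., then A^t ... A^1
    M += prod
M = M @ J

# a single matrix-vector product reproduces the full stack of layers
assert np.allclose(h, M @ s)
```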
We point out that we can directly train the whole matrix $\underline{\underline{M}}$ , with the additional constraint of setting to zero the elements corresponding to spins at distance larger than $\ell+1$ . Conceptually, this approach can be interpreted as a reduced MADE, in which the weight matrix is not only triangular but has additional zeros. We will refer to this approach as $\ell$ -MADE in the following. This method is easier to implement and can be faster in the training and generation processes, but it has the disadvantage of requiring more parameters than the standard 4N approach, e.g. $\mathcal{O}(\ell^{d}N)$ for a $d$-dimensional lattice against the $(c+1)\ell N$ of 4N. Moreover, it loses the physical interpretation of 4N, as it can no longer be interpreted as a series of fields propagating on the lattice.
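The sparsity pattern of the $\ell$-MADE weight matrix can be sketched as follows (the function name is ours; we assume the Manhattan lattice distance and an autoregressive order passed as a list of flattened site indices):

```python
import numpy as np

def lmade_mask(L, ell, order):
    """Boolean sparsity mask for the l-MADE weight matrix on an L x L lattice.

    Entry (i, j) is kept only if spin j precedes spin i in the autoregressive
    order AND lies within lattice (Manhattan) distance ell + 1 of spin i.
    `order` lists the flattened site indices in update order.
    """
    N = L * L
    pos = np.array([(k // L, k % L) for k in range(N)])
    rank = np.empty(N, dtype=int)
    for r, site in enumerate(order):
        rank[site] = r
    mask = np.zeros((N, N), dtype=bool)
    for i in range(N):
        for j in range(N):
            d = abs(pos[i, 0] - pos[j, 0]) + abs(pos[i, 1] - pos[j, 1])
            mask[i, j] = rank[j] < rank[i] and d <= ell + 1
    return mask
```

Combining the autoregressive (triangular) constraint with the distance cutoff is what reduces the parameter count from $\mathcal{O}(N^2)$ to $\mathcal{O}(\ell^{d}N)$.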
Wall-clock generation and training times are shown in Fig. S1. A comparison between Fig. 3b, obtained using 4N, and its analogue obtained using $\ell$ -MADE is shown in Fig. S2.
Figure S1: Scatter plots of the generation (left) and training (right) elapsed real times at different system sizes for $\ell$ -MADE. Both training and generation have been performed using dummy data (and therefore $N$ need not be the square of an integer). For generation, we generate 10000 configurations of size $N$ five times. For training, we train a model with $\ell=2$ for 40 epochs using an Adam optimizer once per each value of $N$ . In both cases, elapsed times scale linearly with the system's size, as further evidenced by the linear fits (black lines).
Figure S2: Ratio between the cross-entropy $|\langle\log P^{\ell}_{\text{NN}}\rangle_{\text{data}}|$ and the Gibbs-Boltzmann entropy $S_{\text{GB}}$ for different values of the temperature $T$ and of the parameter $\ell$ for 4N (a, left) and $\ell$ -MADE (b, right). Both $|\langle\log P^{\ell}_{\text{NN}}\rangle_{\text{data}}|$ and $S_{\text{GB}}$ are averaged over 10 samples.
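As a concrete illustration of the plotted ratio, the following sketch computes it from per-sample model log-probabilities and an entropy estimate; the function name and the toy numbers are ours, not from the paper's code. If the network were exact, $\langle\log P_{\text{NN}}\rangle_{\text{data}} = -S_{\text{GB}}$ on average (entropies in nats), so the ratio would tend to 1 from above.

```python
import numpy as np

def cross_entropy_ratio(logp_nn, s_gb):
    """Ratio |<log P_NN>_data| / S_GB, as in Figure S2.

    logp_nn : array of log P_NN(s) evaluated on equilibrium samples s
    s_gb    : estimate of the Gibbs-Boltzmann entropy of the system
    """
    cross_entropy = abs(np.mean(logp_nn))  # |<log P_NN>_data|
    return cross_entropy / s_gb

# Toy check (synthetic numbers): a model whose log-probabilities fluctuate
# mildly around -S_GB yields a ratio close to 1.
rng = np.random.default_rng(0)
s_gb = 150.0
logp = -s_gb + rng.normal(0.0, 1.0, size=10_000)
ratio = cross_entropy_ratio(logp, s_gb)
```

Since the cross-entropy is bounded below by the entropy, the ratio is $\geq 1$, with equality only for a perfect model.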
## S2 Comparison between different architectures
<details>
<summary>x17.png Details</summary>

Panel: overlap probability distribution $P(q)$ versus $q$ at $T = 0.31$, instance 2. Curves: PT (dashed black), 4N with $\ell = 10$, MADE, NADE and TwoBo (solid colors).
</details>
<details>
<summary>x18.png Details</summary>

Panel: overlap probability distribution $P(q)$ versus $q$ at $T = 0.31$, instance 3. Curves: PT (dashed black), 4N with $\ell = 10$, MADE, NADE and TwoBo (solid colors).
</details>
Figure S3: Comparison of the probability distribution of the overlap obtained using different architectures: 4N with $\ell=10$ layers (12160 parameters), shallow MADE (32640 parameters), NADE with $N_{h}=64$ (32768 parameters) and TwoBo (4111 parameters). The black dashed line is obtained using data gathered via Parallel Tempering (PT), while solid colored lines are obtained using the different architectures. A qualitative analysis shows that 4N performs comparably to MADE and NADE despite having fewer parameters, and better than TwoBo. Although TwoBo has fewer parameters in this $N=256$ case, 4N scales better with the system size, $\mathcal{O}(N)$ compared to $\mathcal{O}(N^{\frac{3}{2}})$, as pointed out in the main text. The TwoBo implementation comes from the authors' GitHub repository (https://github.com/cqsl/sparse-twobody-arnn) and was slightly modified to support supervised learning (its results could likely be improved by a different choice of the training hyperparameters).
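The scaling argument in the caption can be made concrete with a back-of-the-envelope extrapolation. This is only a sketch: it assumes the counts quoted above for $N=256$ (12160 parameters for 4N with $\ell=10$, 4111 for TwoBo) are dominated by their leading $\mathcal{O}(N)$ and $\mathcal{O}(N^{3/2})$ terms, and the function names are ours.

```python
def params_4n(n, c1=12160 / 256):
    """4N parameter count, assuming pure O(N) scaling fitted at N = 256."""
    return c1 * n

def params_twobo(n, c2=4111 / 256**1.5):
    """TwoBo parameter count, assuming pure O(N^{3/2}) scaling fitted at N = 256."""
    return c2 * n**1.5

# At N = 256 TwoBo is smaller, but the O(N) scaling of 4N wins for
# large enough lattices: find the first size where 4N becomes leaner.
crossover = next(n for n in range(256, 10_000)
                 if params_4n(n) <= params_twobo(n))
```

Under these assumptions the two curves cross near $N \approx 2.2\times 10^{3}$, consistent with the caption's claim that 4N is preferable for larger systems.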
## S3 Comparison between different update sequences
<details>
<summary>x19.png Details</summary>

Panel: overlap probability distribution $P(q)$ versus $q$ at $T = 0.31$, instance 2. Curves: PT (dashed black), Spiral (solid red) and Raster (solid yellow) update sequences.
</details>
<details>
<summary>x20.png Details</summary>

Panel: overlap probability distribution $P(q)$ versus $q$ at $T = 0.31$, instance 3. Curves: PT (dashed black), Spiral (solid red) and Raster (solid yellow) update sequences.
</details>
Figure S4: Comparison of the probability distribution of the overlap obtained using different update sequences of the spins for $\ell=10$. The black dashed line is obtained using data gathered via Parallel Tempering (PT), while solid colored lines are obtained using 4N with different update sequences. Spiral: spins are updated following a spiral going from the outside to the inside of the lattice; Raster: spins are updated row by row from left to right. The spiral update performs much better than the raster update for the 4N architecture under equal training conditions, something that is not observed when training, for instance, the MADE architecture. This effect could be related to the locality characteristics of the 4N architecture.
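The two update sequences can be made explicit. A minimal sketch that generates both orderings for an $L \times L$ lattice (function names are ours; the paper's code may differ, e.g. in the spiral's direction or starting corner):

```python
def raster_order(L):
    """Row-by-row, left-to-right sweep of an L x L lattice."""
    return [(i, j) for i in range(L) for j in range(L)]

def spiral_order(L):
    """Clockwise spiral from the outer boundary toward the centre."""
    top, bottom, left, right = 0, L - 1, 0, L - 1
    order = []
    while top <= bottom and left <= right:
        order += [(top, j) for j in range(left, right + 1)]          # top row
        order += [(i, right) for i in range(top + 1, bottom + 1)]    # right column
        if top < bottom:                                             # bottom row
            order += [(bottom, j) for j in range(right - 1, left - 1, -1)]
        if left < right:                                             # left column
            order += [(i, left) for i in range(bottom - 1, top, -1)]
        top, bottom, left, right = top + 1, bottom - 1, left + 1, right - 1
    return order

# Example: orderings for a 3 x 3 lattice.
spiral3 = spiral_order(3)  # [(0,0), (0,1), (0,2), (1,2), (2,2), (2,1), (2,0), (1,0), (1,1)]
```

Both orderings visit every site exactly once; they differ only in how the autoregressive conditioning propagates across the lattice, which is what the figure probes.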
## S4 Cross-entropy plots in lin-lin and lin-log scale
<details>
<summary>x21.png Details</summary>

Panel: $|\langle\log P^{\ell}_{\text{NN}}\rangle_{\text{data}} - \langle\log P^{10}_{\text{NN}}\rangle_{\text{data}}|/N$ versus the number of layers $\ell$, lin-lin scale, for $(L, N)$ = (12, 144), (16, 256), (20, 400), (24, 576). Shaded bands: plus or minus one standard deviation over disorder instances.
</details>
<details>
<summary>x22.png Details</summary>

Panel: $|\langle\log P^{\ell}_{\text{NN}}\rangle_{\text{data}} - \langle\log P^{10}_{\text{NN}}\rangle_{\text{data}}|/N$ versus the number of layers $\ell$, lin-log scale, for $(L, N)$ = (12, 144), (16, 256), (20, 400), (24, 576). Shaded bands: plus or minus one standard deviation over disorder instances.
</details>
Figure S5: Absolute difference between $\langle\log P^{\ell}_{\text{NN}}\rangle_{\text{data}}$ at various $\ell$ and at $\ell=10$, for $T=1.02$ and different system sizes, in lin-lin and lin-log scale. Data are averaged over ten different instances of the disorder; the colored areas identify regions corresponding to plus or minus one standard deviation.
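The plotted quantity can be computed as follows; a sketch on synthetic data, assuming one has the data-averaged log-likelihood $\langle\log P^{\ell}_{\text{NN}}\rangle_{\text{data}}$ for each disorder instance and each depth $\ell$ (array shapes and names are ours).

```python
import numpy as np

def depth_convergence(logp, n_spins, ref=-1):
    """Per-spin absolute difference of <log P>_data between each depth
    and the deepest network, as plotted in Figure S5.

    logp : array of shape (n_instances, n_depths) holding the
           data-averaged log-likelihood for each instance and depth
    Returns (mean, std) over instances, each of shape (n_depths,).
    """
    diff = np.abs(logp - logp[:, [ref]]) / n_spins
    return diff.mean(axis=0), diff.std(axis=0)

# Synthetic example: 10 instances, depths l = 2..10, with a likelihood
# gap that shrinks as the depth increases (toy numbers, not real data).
rng = np.random.default_rng(1)
depths = np.arange(2, 11)
logp = -100.0 + 0.5 / depths + rng.normal(0.0, 1e-3, size=(10, depths.size))
mean, std = depth_convergence(logp, n_spins=256)
```

By construction the difference vanishes at the reference depth and decreases with $\ell$, mirroring the saturation seen in the figure.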