# Nearest-Neighbours Neural Network architecture for efficient sampling of statistical physics models
**Authors**: Luca Maria Del Bono, Federico Ricci-Tersenghi, Francesco Zamponi
> Dipartimento di Fisica, Sapienza Università di Roma, Piazzale Aldo Moro 5, Rome 00185, Italy; CNR-Nanotec, Rome unit, Piazzale Aldo Moro 5, Rome 00185, Italy
> Dipartimento di Fisica, Sapienza Università di Roma, Piazzale Aldo Moro 5, Rome 00185, Italy; CNR-Nanotec, Rome unit, Piazzale Aldo Moro 5, Rome 00185, Italy; INFN, sezione di Roma1, Piazzale Aldo Moro 5, Rome 00185, Italy
> Dipartimento di Fisica, Sapienza Università di Roma, Piazzale Aldo Moro 5, Rome 00185, Italy
**Corresponding author**: lucamaria.delbono@uniroma1.it
Abstract
The task of efficiently sampling the Gibbs-Boltzmann distribution of disordered systems is important both for the theoretical understanding of these models and for the solution of practical optimization problems. Unfortunately, this task is known to be hard, especially for spin-glass-like problems at low temperatures. Recently, many attempts have been made to tackle the problem by mixing classical Monte Carlo schemes with newly devised Neural Networks that learn to propose smart moves. In this article, we introduce the Nearest-Neighbours Neural Network (4N) architecture, a physically interpretable deep architecture whose number of parameters scales linearly with the size of the system and which can be applied to a large variety of topologies. We show that the 4N architecture can accurately learn the Gibbs-Boltzmann distribution of a prototypical spin-glass model, the two-dimensional Edwards-Anderson model, including some of its most difficult instances. In particular, it captures properties such as the energy, the correlation function and the overlap probability distribution. Finally, we show that the 4N performance increases with the number of layers, in a way that clearly connects to the correlation length of the system, thus providing a simple and interpretable criterion to choose the optimal depth.
Introduction
A central problem in many fields of science, from statistical physics to computer science and artificial intelligence, is that of sampling from a complex probability distribution over a large number of variables. More specifically, a very common such probability is a Gibbs-Boltzmann distribution. Given a set of $N\gg 1$ random variables $\boldsymbol{\sigma}=\{\sigma_{1},\dots,\sigma_{N}\}$, such a distribution is written in the form
$$
P_{\text{GB}}(\boldsymbol{\sigma})=\frac{e^{-\beta\mathcal{H}(\boldsymbol{\sigma})}}{\mathcal{Z}(\beta)}\ . \tag{1}
$$
In statistical physics language, the normalization constant $\mathcal{Z}(\beta)$ is called the partition function, $\mathcal{H}(\boldsymbol{\sigma})$ is the Hamiltonian, and $\beta=1/T$ is the inverse temperature, corresponding to a global rescaling of $\mathcal{H}(\boldsymbol{\sigma})$.
While Eq. (1) is, in itself, just a definition, i.e. $\mathcal{H}(\boldsymbol{\sigma})=-\frac{1}{\beta}\log P_{\text{GB}}(\boldsymbol{\sigma})+\text{const}$, the central idea of Gibbs-Boltzmann distributions is that $\mathcal{H}(\boldsymbol{\sigma})$ is expanded as a sum of ‘local’ interactions. For instance, in the special case of binary (Ising) variables, $\sigma_{i}\in\{-1,+1\}$, one can always write
$$
\mathcal{H}=-\sum_{i}H_{i}\sigma_{i}-\sum_{i<j}J_{ij}\sigma_{i}\sigma_{j}-\sum_{i<j<k}J_{ijk}\sigma_{i}\sigma_{j}\sigma_{k}+\cdots \tag{2}
$$
In many applications, such as spin glasses in physics, maximum-entropy models in inference, and Boltzmann machines in artificial intelligence, the expansion in Eq. (2) is truncated at the pairwise terms, thus neglecting higher-order interactions. This leads to the Hamiltonian
$$
\mathcal{H}=-\sum_{i}H_{i}\sigma_{i}-\sum_{i<j}J_{ij}\sigma_{i}\sigma_{j} \tag{3}
$$
parametrized by ‘local external fields’ $H_{i}$ and ‘pairwise couplings’ $J_{ij}$ . In physics applications such as spin glasses, these are often chosen to be independent and identically distributed random variables, e.g. $H_{i}\sim\mathcal{N}(0,H^{2})$ and $J_{ij}\sim\mathcal{N}(0,J^{2}/N)$ . In Boltzmann Machines, instead, the fields and couplings are learned by maximizing the likelihood of a given training set.
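As an illustration, the disorder and the energy of Eq. (3) can be generated and evaluated in a few lines. The following is a minimal sketch (variable and function names are ours, not from any reference implementation), using Gaussian fields and couplings:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
# H_i ~ N(0, H^2) with H = 1, and J_ij ~ N(0, J^2/N) with J = 1, kept for i < j only
H_field = rng.normal(0.0, 1.0, size=N)
J = np.triu(rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, N)), k=1)

def energy(sigma):
    """Pairwise Hamiltonian of Eq. (3): -sum_i H_i s_i - sum_{i<j} J_ij s_i s_j."""
    return float(-H_field @ sigma - sigma @ J @ sigma)

sigma = rng.choice([-1, 1], size=N)
E = energy(sigma)
```

For the Edwards-Anderson benchmark used later in the paper, the couplings would instead be non-zero only on the edges of a square lattice.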
In many cases, dealing with Eq. (1) analytically, e.g. computing expectation values of interesting quantities, is impossible, and one resorts to numerical computations. A universal strategy is to use local Markov Chain Monte Carlo (MCMC) methods, which have the advantage of being applicable to a wide range of different systems. In these methods, one proposes a local move, typically flipping a single spin, $\sigma_{i}\to-\sigma_{i}$, and accepts or rejects the move in such a way as to guarantee that, after many iterations, the configuration $\boldsymbol{\sigma}$ is distributed according to Eq. (1). Unfortunately, these methods become impractical for many hard-to-sample problems, whose convergence time is so large that the required computational time is prohibitive. For instance, finding the ground state of the Sherrington-Kirkpatrick model (a particular case of sampling at $T\to 0$) is known to be an NP-hard problem, and sampling a class of optimization problems is known to take a time scaling exponentially with $N$. The training of Boltzmann Machines is also known to suffer from this kind of problem.
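The local MCMC scheme just described can be sketched as follows, here for the pairwise Hamiltonian of Eq. (3) with zero external fields; this is a minimal single-spin-flip Metropolis sweep, not an optimized implementation:

```python
import numpy as np

def metropolis_sweep(sigma, J, beta, rng):
    """One sweep (N attempted flips) of single-spin-flip Metropolis for
    H = -sum_{i<j} J_ij s_i s_j, with J a symmetric matrix with zero diagonal."""
    N = len(sigma)
    for _ in range(N):
        i = rng.integers(N)
        dE = 2.0 * sigma[i] * (J[i] @ sigma)  # energy change of flipping spin i
        if dE <= 0 or rng.random() < np.exp(-beta * dE):
            sigma[i] = -sigma[i]
    return sigma

# Toy check: a small mean-field ferromagnet (all J_ij = 1/N) orders at low temperature
rng = np.random.default_rng(3)
N = 16
J = (np.ones((N, N)) - np.eye(N)) / N
sigma = rng.choice([-1.0, 1.0], size=N)
for _ in range(200):
    sigma = metropolis_sweep(sigma, J, beta=3.0, rng=rng)
m = abs(sigma.mean())
```

On a ferromagnet this chain equilibrates quickly; on a spin glass at low temperature it is exactly this dynamics that gets stuck.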
These inconveniences can be avoided by using system-specific global algorithms that leverage properties of the system under examination in order to gain a significant speedup. Instead of flipping one spin at a time, these methods are able to construct smart moves, which update a large number of spins simultaneously. An example is the Wolff algorithm [1] for the Ising model. Another class of algorithms uses unphysical moves or extended variable spaces, such as the Swap Monte Carlo for glass-forming models of particles [2]. The downside of these algorithms is that they cannot generally be transferred from one system to another, meaning that one has to develop new techniques specifically for each system of interest. Yet another class is Parallel Tempering (PT) [3], or one of its modifications [4], which considers a group of systems at different temperatures and alternates between simple MCMC moves and moves that swap the systems at two different temperatures. While PT is a universal strategy, its drawback is that, in order to produce low-temperature configurations, one has to simulate the system at a ladder of higher temperatures, which can be computationally expensive and redundant.
The rise of machine learning technology has sparked a new line of research aimed at using Neural Networks to enhance Monte Carlo algorithms, in the wake of similar approaches in many-body quantum physics [5], molecular dynamics [6] and the study of glasses [7, 8, 9]. The key idea is to use the network to propose new states with a probability $P_{\text{NN}}$ that (i) can be efficiently sampled and (ii) is close to the Gibbs-Boltzmann distribution, e.g., with respect to the Kullback-Leibler (KL) divergence $D_{\text{KL}}$ :
$$
D_{\text{KL}}(P_{\text{GB}}\parallel P_{\text{NN}})=\sum_{\boldsymbol{\sigma}}P_{\text{GB}}(\boldsymbol{\sigma})\log\frac{P_{\text{GB}}(\boldsymbol{\sigma})}{P_{\text{NN}}(\boldsymbol{\sigma})}\ . \tag{4}
$$
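For a small, enumerable configuration space, Eq. (4) can be evaluated directly; a minimal sketch (function name ours):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) of Eq. (4), for distributions given as arrays of
    probabilities over the same (small, enumerable) configuration space."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p = 0 contribute zero by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
```

Note the asymmetry of the divergence: $D_{\text{KL}}(P_{\text{GB}}\parallel P_{\text{NN}})$ and $D_{\text{KL}}(P_{\text{NN}}\parallel P_{\text{GB}})$ differ, which is exactly the point discussed below when choosing a training objective.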
The proposed configurations can be used in a Metropolis scheme, accepting them with probability:
$$
\begin{split}\text{Acc}\left[\boldsymbol{\sigma}\rightarrow\boldsymbol{\sigma}^{\prime}\right]&=\min\left[1,\frac{P_{\text{GB}}(\boldsymbol{\sigma}^{\prime})\times P_{\text{NN}}(\boldsymbol{\sigma})}{P_{\text{GB}}(\boldsymbol{\sigma})\times P_{\text{NN}}(\boldsymbol{\sigma}^{\prime})}\right]\\&=\min\left[1,\frac{e^{-\beta\mathcal{H}(\boldsymbol{\sigma}^{\prime})}\times P_{\text{NN}}(\boldsymbol{\sigma})}{e^{-\beta\mathcal{H}(\boldsymbol{\sigma})}\times P_{\text{NN}}(\boldsymbol{\sigma}^{\prime})}\right].\end{split} \tag{5}
$$
Because $\mathcal{H}$ can be computed in polynomial time and $P_{\text{NN}}$ can be sampled efficiently by hypothesis, this approach ensures an efficient sampling of the Gibbs-Boltzmann probability, as long as (i) $P_{\text{NN}}$ covers well the support of $P_{\text{GB}}$ (i.e., the so-called mode collapse is avoided) and (ii) the acceptance probability in Eq. (5) is significantly different from zero for most of the generated configurations. This scheme can also be combined with standard local Monte Carlo moves to improve efficiency [10].
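In practice, the acceptance step of Eq. (5) is conveniently computed in log space, since the Boltzmann weights and $P_{\text{NN}}$ can easily under- or overflow; a minimal sketch (function name ours):

```python
import numpy as np

def accept_prob(E_old, E_new, logq_old, logq_new, beta):
    """Acceptance probability of Eq. (5) for a learned proposal P_NN.

    logq_* = log P_NN of the current and proposed configurations; working in
    log space avoids under/overflow of the Boltzmann weights.
    """
    log_ratio = -beta * (E_new - E_old) + logq_old - logq_new
    return float(np.exp(min(0.0, log_ratio)))  # = min(1, e^{log_ratio})
```

A simple sanity check is detailed balance: the ratio of forward and backward acceptance probabilities must equal the ratio of target-times-proposal weights.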
The main problem is how to train the Neural Network to minimize Eq. (4); in fact, the computation of $D_{\text{KL}}$ requires sampling from $P_{\text{GB}}$. A possible solution is to use $D_{\text{KL}}(P_{\text{NN}}\parallel P_{\text{GB}})$ instead [11], but this has been shown to be prone to mode collapse [12]. Another proposed solution is to set up an iterative procedure, called ‘Sequential Tempering’ (ST), in which one first learns $P_{\text{NN}}$ at high temperature, where sampling from $P_{\text{GB}}$ is possible, and then gradually decreases the temperature, updating $P_{\text{NN}}$ slowly [13]. A variety of new methods and techniques have thus been proposed to tackle the problem, such as autoregressive models [11, 13, 14], normalizing flows [15, 10, 16, 17], and diffusion models [18]. It is still unclear whether this kind of strategy actually performs well on hard problems and can challenge existing algorithms [12]. The first steps towards a theoretical understanding of the performance of such algorithms are just being made [19, 20]. The main drawback of these architectures is that the number of parameters scales poorly with the system size, typically as $N^{\alpha}$ with $\alpha>1$, while local Monte Carlo requires a number of operations scaling linearly in $N$. Furthermore, these architectures – with the notable exception of TwoBo [21], which directly inspired our work – are not informed about the physical properties of the model, and in particular about the structure of its interaction graph and correlations. It should be noted that the related problem of finding the ground state of the system (which coincides with zero-temperature sampling) has been tackled using similar ideas, again with positive [22, 23, 24, 25, 26] and negative results [27, 28, 29, 30].
In this paper, we introduce the Nearest Neighbours Neural Network, or 4N for short, a Graph Neural Network-inspired architecture that implements an autoregressive scheme to learn the Gibbs-Boltzmann distribution. This architecture has a number of parameters scaling linearly in the system size and can sample new configurations in $\mathcal{O}(N)$ time, thus achieving the best possible scaling one could hope for with such architectures. Moreover, 4N has a straightforward physical interpretation. In particular, the choice of the number of layers can be directly linked to the correlation length of the model under study. Finally, the architecture can easily be applied to essentially any statistical system, such as lattices in higher dimensions or random graphs, and it is thus more general than other architectures. As a proof of concept, we evaluate the 4N architecture on the two-dimensional Edwards-Anderson spin glass model, a standard benchmark also used in previous work [13, 12]. We demonstrate that the model succeeds in accurately learning the Gibbs-Boltzmann distribution, especially for some of the most challenging model instances. Notably, it precisely captures properties such as the energy, the correlation function, and the overlap probability distribution.
State of the art
Some common architectures used for $P_{\text{NN}}$ are normalizing flows [15, 10, 16, 17] and diffusion models [18]. In this paper, we will however focus on autoregressive models [11], which make use of the factorization:
$$
P_{\text{GB}}(\boldsymbol{\sigma})=P(\sigma_{1})\,P(\sigma_{2}\mid\sigma_{1})\,P(\sigma_{3}\mid\sigma_{1},\sigma_{2})\cdots P(\sigma_{N}\mid\sigma_{1},\dots,\sigma_{N-1})=\prod_{i=1}^{N}P(\sigma_{i}\mid\boldsymbol{\sigma}_{<i})\ , \tag{6}
$$
where $\boldsymbol{\sigma}_{<i}=(\sigma_{1},\sigma_{2},\dots,\sigma_{i-1})$, so that states can be generated using ancestral sampling, i.e. generating first $\sigma_{1}$, then $\sigma_{2}$ conditioned on the sampled value of $\sigma_{1}$, then $\sigma_{3}$ conditioned on $\sigma_{1}$ and $\sigma_{2}$, and so on. The factorization in Eq. (6) is exact, but computing $P(\sigma_{i}\mid\boldsymbol{\sigma}_{<i})$ exactly generally requires a number of operations scaling exponentially with $N$. Hence, in practice, $P_{\text{GB}}$ is approximated by a $P_{\text{NN}}$ of the form in Eq. (6), in which each individual term $P(\sigma_{i}\mid\boldsymbol{\sigma}_{<i})$ is approximated using a small set of parameters. Note that the unsupervised problem of approximating $P_{\text{GB}}$ is now formally translated into a supervised problem of learning the output, i.e. the probability $\pi_{i}\equiv P(\sigma_{i}=+1\mid\boldsymbol{\sigma}_{<i})$, as a function of the input $\boldsymbol{\sigma}_{<i}$. The specific way in which this approximation is carried out depends on the architecture. In this section, we describe some common autoregressive architectures found in the literature for approximating the function $\pi_{i}(\boldsymbol{\sigma}_{<i})$.
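Ancestral sampling from any such autoregressive model can be sketched as follows (function names ours). Note that the log-probability $\log P_{\text{NN}}(\boldsymbol{\sigma})$, needed in the acceptance step of Eq. (5), is accumulated at no extra cost; the toy conditional used here is simply an unbiased coin:

```python
import numpy as np

def ancestral_sample(pi_fn, N, rng):
    """Generate one configuration via the factorization of Eq. (6):
    draw sigma_1, then sigma_2 given sigma_1, and so on.

    pi_fn(i, sigma) returns P(sigma_i = +1 | sigma_{<i}); the entries
    l >= i of sigma are still masked (zero) when pi_fn is called.
    """
    sigma = np.zeros(N)
    log_p = 0.0
    for i in range(N):
        pi = pi_fn(i, sigma)
        s = +1 if rng.random() < pi else -1
        log_p += np.log(pi if s == +1 else 1.0 - pi)
        sigma[i] = s
    return sigma, log_p  # log P_NN(sigma) comes for free

# Toy conditional: an unbiased coin for every spin
rng = np.random.default_rng(1)
sigma, log_p = ancestral_sample(lambda i, s: 0.5, N=10, rng=rng)
```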
- In the Neural Autoregressive Distribution Estimator (NADE) architecture [13], the input $\boldsymbol{\sigma}_{<i}$ is encoded into a vector $\mathbf{y}_{i}$ of size $N_{h}$ using
$$
\mathbf{y}_{i}=\Psi\left(\underline{\underline{A}}\boldsymbol{\sigma}_{<i}+\mathbf{B}\right), \tag{7}
$$
where $\mathbf{y}_{i}\in\mathbb{R}^{N_{h}}$, $\underline{\underline{A}}\in\mathbb{R}^{N_{h}\times N}$ and $\boldsymbol{\sigma}_{<i}$ is the vector of the spins in which the spins $l\geq i$ have been masked, i.e. $\boldsymbol{\sigma}_{<i}=(\sigma_{1},\sigma_{2},\dots,\sigma_{i-1},0,\dots,0)$. $\mathbf{B}\in\mathbb{R}^{N_{h}}$ is the bias vector and $\Psi$ is an element-wise activation function. Note that $\underline{\underline{A}}$ and $\mathbf{B}$ do not depend on the index $i$, i.e. they are shared across all spins. The information from the hidden layer is then passed to a fully connected layer
$$
\psi_{i}=\Psi\left(\mathbf{V}_{i}\cdot\mathbf{y}_{i}+C_{i}\right) \tag{8}
$$
where $\mathbf{V}_{i}\in\mathbb{R}^{N_{h}}$ and $C_{i}\in\mathbb{R}$. Finally, the output probability is obtained by applying a sigmoid function to the output, $\pi_{i}=S(\psi_{i})$. The NADE architecture uses $\mathcal{O}(N\times N_{h})$ parameters. However, the choice of the hidden dimension $N_{h}$, and how it should be related to $N$, is not obvious. For this architecture to work well in practical applications, $N_{h}$ needs to be $\mathcal{O}(N)$, thus leading to $\mathcal{O}(N^{2})$ parameters [13].
- The Masked Autoencoder for Distribution Estimation (MADE) [31] is a generic dense architecture with the addition of the autoregressive requirement, obtained by setting, for all layers $l$ in the network, the weight $W^{l}_{ij}$ between node $j$ at layer $l$ and node $i$ at layer $l+1$ to zero whenever $j\geq i$. Nonlinear activation functions are added between consecutive layers, and the width of the hidden layers can also be increased. The MADE architecture, despite having high expressive power, has at least $\mathcal{O}(N^{2})$ parameters, which makes it poorly scalable.
- Autoregressive Convolutional Neural Network (CNN) architectures, such as PixelCNN [32, 33], implement a CNN structure on which the autoregressive property is imposed. These architectures typically use translationally invariant kernels, which are ill-suited to the study of disordered models.
- The TwoBo (two-body interaction) architecture [21] is derived from the observation that, given the Gibbs-Boltzmann probability in Eq. (1) and the Hamiltonian in Eq. (3), the probability $\pi_{i}$ can be exactly rewritten as [34]
$$
\pi_{i}(\boldsymbol{\sigma}_{<i})=S\left(2\beta h^{i}_{i}+2\beta H_{i}+\rho_{i}(\boldsymbol{h}^{i})\right), \tag{9}
$$
where $S$ is the sigmoid function, $\rho_{i}$ is a highly non-trivial function and the $\boldsymbol{h}^{i}=\{h^{i}_{i},h^{i}_{i+1},\dots,h^{i}_{N}\}$ are defined as:
$$
h^{i}_{l}=\sum_{f<i}J_{lf}\sigma_{f}\quad\text{for}\quad l\geq i. \tag{10}
$$
Therefore, the TwoBo architecture first computes the vector $\boldsymbol{h}^{i}$, then approximates $\rho_{i}(\boldsymbol{h}^{i})$ using a Neural Network (for instance, a Multilayer Perceptron), and finally computes Eq. (9) using a skip connection (for the term $2\beta h^{i}_{i}$) and a sigmoid activation. By construction, in a two-dimensional model with $N=L^{2}$ variables and nearest-neighbor interactions, for given $i$ only $\mathcal{O}(L=\sqrt{N})$ of the $h^{i}_{l}$ are non-zero, see Ref. [21, Fig. 1]. Therefore, the TwoBo architecture has $\mathcal{O}(N^{\frac{3}{2}})$ parameters in the two-dimensional case. However, the scaling becomes worse in higher dimensions $d$, i.e. $\mathcal{O}(N^{\frac{2d-1}{d}})$, and becomes the same as MADE (i.e. $\mathcal{O}(N^{2})$) for random graphs. Additionally, the choice of the $h^{i}_{l}$, although justified mathematically, does not have a clear physical interpretation: the TwoBo architecture takes into account faraway spins which, however, should not matter much when $N$ is large.
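The fields of Eq. (10) are straightforward to compute from the coupling matrix; a minimal sketch (function name ours):

```python
import numpy as np

def twobo_fields(J, sigma, i):
    """Fields of Eq. (10), h^i_l = sum_{f<i} J_lf sigma_f for l >= i:
    the field induced on each still-unfixed spin l by the fixed spins f < i."""
    return np.array([J[l, :i] @ sigma[:i] for l in range(i, len(sigma))])
```

For example, on three spins with an explicit coupling matrix one can check the definition term by term.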
To summarize, the NADE, MADE and TwoBo architectures all need $\mathcal{O}(N^{2})$ parameters for generic models defined on random graph structures. TwoBo has fewer parameters for models defined on $d$-dimensional lattices, but still $\mathcal{O}(N^{\frac{2d-1}{d}})\gg\mathcal{O}(N)$. CNN architectures with translationally invariant kernels have fewer parameters, but are not suitable for disordered systems; they are usually applied in $d=2$ and remain limited to $d\leq 3$.
To solve these problems, we present here the new, physically motivated 4N architecture that has $\mathcal{O}(\mathcal{A}N)$ parameters, with a prefactor $\mathcal{A}$ that remains finite for $N\to\infty$ and is interpreted as the correlation length of the system, as we show in the next section.
Figure 1: Schematic representation of the 4N architecture implemented on a two-dimensional lattice model with nearest-neighbors interactions. (a) We want to estimate the probability $\pi_{i}\equiv P(\sigma_{i}=+1\mid\boldsymbol{\sigma}_{<i})$ that the spin $i$ (in purple) has value $+1$ given the spins $<i$ (in black and white). The spins $>i$ (in grey) have not yet been fixed. (b) We compute the values of the local fields $h_{l}^{i,(0)}$ as in Eq. (10), for $l$ in a neighborhood of $i$ of size $\ell$, indicated by a purple contour in (a). Notice that spins $l$ that are not in the neighborhood of one of the fixed spins have $h_{l}^{i,(0)}=0$ and are indicated in white in (b). (c) In the main cycle of the algorithm, we apply a series of $\ell$ GNN layers with the addition of skip connections. (d) The final GNN layer is not followed by a skip connection and yields the final value of the field $h_{i}^{i,(\ell)}$. (e) $\pi_{i}$ is estimated by applying a sigmoid function to $h_{i}^{i,(\ell)}$. Periodic (or other) boundary conditions can be considered, but are not shown here for clarity.
Results
The 4N architecture
The 4N architecture (Fig. 1) computes the probability $\pi_{i}\equiv P(\sigma_{i}=+1\mid\boldsymbol{\sigma}_{<i})$ by propagating the $h_{l}^{i}$, defined as in TwoBo, Eq. (10), through a Graph Neural Network (GNN) architecture. The variable $h_{l}^{i}$ is interpreted as the local field induced by the frozen variables $\boldsymbol{\sigma}_{<i}$ on spin $l$. Note that in 4N, we consider all possible values of $l=1,\dots,N$, not only $l\geq i$ as in TwoBo. The crucial difference with TwoBo is that only the fields in a finite neighborhood of variable $i$ need to be considered, because the initial values of the fields are propagated through a series of GNN layers that take into account the locality of the physical problem. During propagation, fields are combined with layer- and edge-dependent weights together with the couplings $J_{ij}$ and the inverse temperature $\beta$. After each layer of the network (except the final one), a skip connection to the initial configuration is added. After the final layer, a sigmoidal activation is applied to find $\pi_{i}$. The network has a total number of parameters equal to $(c+1)\times\ell\times N$, where $c$ is the average number of neighbors of each spin. Details are given in the Methods section.
More precisely, because GNN updates are local, and there are $\ell$ of them, in order to compute the final field on site $i$, and hence $\pi_{i}$, we only need to consider the set of fields corresponding to sites at distance at most $\ell$; hence the number of operations scales proportionally to the number of such sites, which we call $\mathcal{B}(\ell)$. Because we need to compute $\pi_{i}$ for each of the $N$ sites, the generation of a new configuration then requires $\mathcal{O}(\mathcal{B}(\ell)\times N)$ steps. We recall that in a $d$-dimensional lattice $\mathcal{B}(\ell)\propto\ell^{d}$, while for random graphs $\mathcal{B}(\ell)\propto c^{\ell}$. It is therefore crucial to assess how the number of layers $\ell$ should scale with the system size $N$.
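The locality argument above can be illustrated with a deliberately simplified forward pass. The following is only a structural sketch (the exact combination of edge weights, couplings $J_{ij}$ and $\beta$ is specified in the Methods section): it shows that information travels one edge per layer, with skip connections after every layer except the last, so that a site at distance greater than $\ell$ from the non-zero initial fields is never reached.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fourn_fields(h0, adj, W):
    """Structural sketch of a 4N-style forward pass: propagate the initial
    fields h0 through ell local layers with per-layer, per-edge weights W[k]
    (supported on the interaction graph adj), adding a skip connection to h0
    after every layer except the last.  pi_i is then S(h_i)."""
    ell = len(W)
    h = h0.copy()
    for k in range(ell):
        h = (W[k] * adj) @ h          # local update: neighbours only
        if k < ell - 1:
            h += h0                   # skip connection to the initial fields
    return h

# Toy check on a 4-site chain: information travels one edge per layer
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
W = [np.ones((4, 4)), np.ones((4, 4))]   # ell = 2 layers, unit edge weights
h = fourn_fields(np.array([1.0, 0.0, 0.0, 0.0]), adj, W)
```

With $\ell=2$ layers, the field on the last site of the chain (at graph distance 3 from the source) remains exactly zero, which is the locality property exploited in the cost estimate above.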
Figure 2: Relative difference in mean energy between the configurations generated by the 4N architecture and by PT, for different numbers of layers $\ell$ and different values of the inverse temperature $\beta$ . Solid lines indicate averages over 10 different instances, while the dashed lines and the colored area identify the region corresponding to plus or minus one standard deviation.
The benchmark Gibbs-Boltzmann distribution
In this paper, in order to test the 4N architecture on a prototypically hard-to-sample problem, we will consider the Edwards-Anderson spin glass model, described by the Hamiltonian in Eq. (3) for Ising spins, where we set $H_{i}=0$ for simplicity. The couplings are chosen to be non-zero only for neighboring pairs $\langle i,j\rangle$ on a two-dimensional square lattice. Non-zero couplings $J_{ij}$ are independent random variables, identically distributed according to either a normal or a Rademacher distribution. This model was also used as a benchmark in Refs. [13, 12]. All the results reported in the rest of the section are for a $16\times 16$ square lattice, hence with $N=256$ spins, which guarantees a system large enough to investigate correlations at low temperatures while keeping the problem tractable. In the SI we also report some data for a $32\times 32$ lattice.
While this model has no finite-temperature spin glass transition, sampling it via standard local MCMC becomes extremely hard for $\beta\gtrsim 1.5$ [12]. To this day, the best option for sampling this model at lower temperatures is parallel tempering (PT), which we used to construct a sample of $M=2^{16}$ equilibrium configurations for several instances of the model and at several temperatures. Details are given in the Methods section.
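For reference, the temperature-swap move at the heart of PT has a simple closed-form acceptance probability; a minimal sketch (function name ours):

```python
import numpy as np

def pt_swap_prob(beta1, E1, beta2, E2):
    """Parallel-tempering swap move: exchanging the configurations held at
    inverse temperatures beta1 and beta2 is accepted with probability
    min(1, exp((beta1 - beta2) * (E1 - E2))), which preserves detailed
    balance of the joint distribution over the whole temperature ladder."""
    return float(np.exp(min(0.0, (beta1 - beta2) * (E1 - E2))))
```

In particular, a swap that hands the lower-energy configuration to the colder replica is always accepted.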
The PT-generated training set is used to train the 4N architecture, independently for each instance and each temperature. Remember that the goal here is to test the expressivity of the architecture, which is why we train it in the best possible situation, i.e. by maximizing the likelihood of equilibrium configurations at each temperature. Once the 4N model has been trained, we perform several comparisons that are reported in the following.
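The training objective just described, i.e. maximizing the likelihood of the PT-generated equilibrium configurations, amounts to minimizing the average negative log-likelihood under the autoregressive factorization of Eq. (6); a minimal sketch for a generic conditional model (function names ours):

```python
import numpy as np

def neg_log_likelihood(samples, pi_fn):
    """Average negative log-likelihood of a set of equilibrium configurations
    under an autoregressive model, -< sum_i log P(sigma_i | sigma_{<i}) >_data.

    pi_fn(i, sigma) -> P(sigma_i = +1 | sigma_{<i}) is any conditional model.
    """
    nll = 0.0
    for sigma in samples:
        for i in range(len(sigma)):
            pi = pi_fn(i, sigma)
            nll -= np.log(pi) if sigma[i] == 1 else np.log(1.0 - pi)
    return nll / len(samples)
```

In practice this loss is minimized by gradient descent on the network parameters; here it only serves to make the objective explicit.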
Energy
We begin by comparing the mean energy of the configurations generated by the model with the mean energy of the configurations generated via parallel tempering, which is important because the energy plays a central role in the Metropolis reweighting scheme of Eq. (5). Fig. 2 shows the relative energy difference, as a function of inverse temperature $\beta$ and number of layers $\ell$, averaged over 10 different instances of the disorder. First, it is clear that adding layers improves results remarkably. We stress that, since the activation functions of our network are linear, adding layers only helps in taking into account spins that are further away. Second, we notice that the relative error found when using 6 layers or more is small for all temperatures considered. Finally, we notice that the behaviour of the error is non-monotonic, and indeed appears to decrease for low temperatures even for networks with few layers.
Figure 3: (a) Ratio between the cross-entropy $|\langle\log P^{\ell}_{\text{NN}}\rangle_{\text{data}}|$ and the Gibbs-Boltzmann entropy $S_{\text{GB}}$ for different values of the temperature $T$ and of the number of layers $\ell$ . Both $|\langle\log P^{\ell}_{\text{NN}}\rangle_{\text{data}}|$ and $S_{\text{GB}}$ are averaged over 10 samples. Dashed lines are exponential fits of the form $Ae^{-\ell/\overline{\ell}}+C$ . (b) Absolute difference between $\langle\log P^{\ell}_{\text{NN}}\rangle_{\text{data}}$ at various $\ell$ and at $\ell=10$ . Dashed lines are exponential fits of the form $Ae^{-\ell/\overline{\ell}}$ .
[Figure 4 panels: (a) $T=1.95$, instance 1; (b) $T=0.72$, instance 1; (c) $T=0.31$, instance 1; (d) $T=0.31$, instance 2.]
Figure 4: Examples of the probability distribution of the overlap, $P(q)$ , for different instances of the disorder and different temperatures. Solid color lines correspond to configurations generated using 4N, while dashed black lines correspond to the $P(q)$ obtained from equilibrium configurations generated via PT. (a) At high temperature, even a couple of layers are enough to reproduce the distribution well. (b) At lower temperatures more layers are required, as expected because of the increase of the correlation length and of the complexity of the Gibbs-Boltzmann distribution. (c) Even lower temperature for the same instance. (d) A test on a more complex instance shows that 4N is capable of learning even non-trivial forms of $P(q)$ .
Entropy and Kullback-Leibler divergence
Next, we consider the Kullback-Leibler divergence between the learned probability distribution $P_{\text{NN}}$ and the target Gibbs-Boltzmann distribution $P_{\text{GB}}$, $D_{\text{KL}}(P_{\text{GB}}\parallel P_{\text{NN}})$. Denoting by $\langle\,\cdot\,\rangle_{\text{GB}}$ the average with respect to $P_{\text{GB}}$, we have
$$
D_{\text{KL}}(P_{\text{GB}}\parallel P_{\text{NN}})=\langle\log P_{\text{GB}}\rangle_{\text{GB}}-\langle\log P_{\text{NN}}\rangle_{\text{GB}}\ . \tag{11}
$$
The first term $\langle\log P_{\text{GB}}\rangle_{\text{GB}}=-S_{\text{GB}}(T)$ is minus the entropy of the Gibbs-Boltzmann distribution and does not depend on $P_{\text{NN}}$ , hence we focus on the cross-entropy $-\langle\log P_{\text{NN}}\rangle_{\text{GB}}$ as a function of the number of layers for different temperatures. Minimizing this quantity corresponds to minimizing the KL divergence. In the case of perfect learning, i.e. $D_{\text{KL}}=0$ , we have $-\langle\log P_{\text{NN}}\rangle_{\text{GB}}=S_{\text{GB}}(T)$ . In Fig. 3 we compare the cross-entropy, estimated as $-\langle\log P_{\text{NN}}\rangle_{\text{data}}$ on the data generated by PT, with $S_{\text{GB}}(T)$ computed using thermodynamic integration. We find a good match for large enough $\ell$ , indicating an accurate training of the 4N model that does not suffer from mode collapse.
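The thermodynamic-integration estimate of $S_{\text{GB}}$ follows from $d\log Z/d\beta=-\langle E\rangle$ and $S=\log Z+\beta\langle E\rangle$, integrating $\langle E\rangle(\beta)$ down from $\beta=0$, where $\log Z=N\log 2$. A minimal numpy sketch of one standard route (trapezoidal rule; the grid and inputs are illustrative):

```python
import numpy as np

def entropy_thermo_integration(betas, E_mean, n_spins):
    """S(beta) = N log 2 + beta <E>(beta) - int_0^beta <E>(b) db.

    Uses d(log Z)/d(beta) = -<E> and S = log Z + beta <E>.
    `betas` must start at 0; the integral is done with the trapezoidal rule.
    """
    integral = np.concatenate(
        ([0.0], np.cumsum(0.5 * (E_mean[1:] + E_mean[:-1]) * np.diff(betas)))
    )
    log_z = n_spins * np.log(2.0) - integral
    return log_z + betas * E_mean

# Sanity check on a single spin in unit field, where <E>(beta) = -tanh(beta):
betas = np.linspace(0.0, 2.0, 2001)
S_num = entropy_thermo_integration(betas, -np.tanh(betas), n_spins=1)
```

For the single spin the exact result is $S(\beta)=\log(2\cosh\beta)-\beta\tanh\beta$, which the numerical integration reproduces to the accuracy of the grid.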
In order to study more carefully how the number of layers affects the accuracy of the training, in Fig. 3 we also show the difference $|\langle\log P^{\ell}_{\text{NN}}\rangle_{\text{data}}-\langle\log P^{10}_{\text{NN}}\rangle_{\text{data}}|$ as a function of the number of layers $\ell$ for different temperatures. The plots are compatible with an exponential decay of the form $Ae^{-\ell/\overline{\ell}}$, with $\overline{\ell}$ thus providing an estimate of the number of layers above which the 4N architecture becomes accurate. We then need to understand how $\overline{\ell}$ changes with temperature and system size; to do so, we introduce appropriate correlation functions, as we do next.
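Extracting $\overline{\ell}$ amounts to a log-linear fit. A sketch on synthetic data with the pure exponential form (no offset; a fit to Fig. 3a would also carry the constant $C$):

```python
import numpy as np

ells = np.arange(2, 11)
delta = 2.0 * np.exp(-ells / 1.8)       # stand-in for the measured differences

# A * exp(-ell / lbar)  =>  log(delta) is linear in ell, with slope -1/lbar
slope, intercept = np.polyfit(ells, np.log(delta), deg=1)
lbar = -1.0 / slope
A = np.exp(intercept)
```

On exact synthetic data the fit recovers the generating parameters; on noisy data a weighted fit (or a nonlinear least-squares fit) would be more robust, since the log transform amplifies noise at small `delta`.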
[Figure 5 panels: (a) $T=1.02$; (b) $T=0.45$.]
Figure 5: Comparison between the spatial correlation functions obtained from 4N and those obtained from PT, at high (a) and low (b) temperature and for different numbers of layers. The grey dashed lines correspond to the value $1/e$ that defines the correlation length $\xi$ . Data are averaged over 10 instances.
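Operationally, the correlation length is the scale at which $G(r)$ drops below $1/e$. A minimal sketch of extracting it from a sampled correlation curve, using linear interpolation between the bracketing points (our own convention for the interpolation; the paper's may differ):

```python
import numpy as np

def correlation_length(r, G, threshold=np.exp(-1.0)):
    """Distance of the first crossing of G(r) below threshold, linearly interpolated."""
    below = np.flatnonzero(G < threshold)
    if below.size == 0:
        return np.inf                     # correlation length exceeds the sampled range
    i = below[0]
    if i == 0:
        return r[0]
    # interpolate between (r[i-1], G[i-1]) and (r[i], G[i])
    t = (G[i - 1] - threshold) / (G[i - 1] - G[i])
    return r[i - 1] + t * (r[i] - r[i - 1])

# Sanity check: for G(r) = exp(-r / xi) the crossing sits at r = xi.
r = np.arange(0, 8, dtype=float)
xi_est = correlation_length(r, np.exp(-r / 2.5))
```

On a coarse integer grid the interpolation introduces a small bias (here a few percent), which shrinks with finer sampling of $G(r)$.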
Overlap distribution function
In order to show that the 4N architecture is able to capture more complex features of the Gibbs-Boltzmann distribution, we have considered a more sensitive observable: the probability distribution of the overlap, $P(q)$ . The overlap $q$ between two spin configurations $\boldsymbol{\sigma}^{1}=\{\sigma^{1}_{1},\sigma^{1}_{2},\dots,\sigma^{1}_{N}\}$ and $\boldsymbol{\sigma}^{2}=\{\sigma^{2}_{1},\sigma^{2}_{2},\dots,\sigma^{2}_{N}\}$ is a central quantity in spin-glass physics, and is defined as:
$$
q_{12}=\frac{1}{N}\boldsymbol{\sigma}^{1}\cdot\boldsymbol{\sigma}^{2}=\frac{1}{N}\sum_{i=1}^{N}\sigma^{1}_{i}\sigma^{2}_{i}. \tag{12}
$$
The histogram of $P(q)$ is obtained from many pairs of configurations, taken independently either from the PT-generated training set or from samples generated with the 4N architecture. Examples for two different instances at several temperatures are shown in Fig. 4. At high temperature, even very few layers are able to reproduce the distribution of the overlaps. When going to lower temperatures, and to more complex shapes of $P(q)$ , however, the number of layers required to obtain a good approximation of $P(q)$ increases. This result is compatible with the intuitive idea that more layers are needed to reproduce a system at low temperatures, where the correlation length is large.
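Estimating $P(q)$ from Eq. (12) is a pairing-and-histogram operation; a minimal sketch on placeholder configurations (random spins here, standing in for 4N- or PT-generated samples):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 100, 2000
samples = rng.choice([-1, 1], size=(M, N))   # stand-in for sampled configurations

# Overlaps of random disjoint pairs, q = (1/N) sigma1 . sigma2  (Eq. (12))
perm = rng.permutation(M)
s1, s2 = samples[perm[: M // 2]], samples[perm[M // 2:]]
q = np.einsum("ij,ij->i", s1, s2) / N

# Normalized histogram approximating P(q)
hist, edges = np.histogram(q, bins=np.linspace(-1, 1, 41), density=True)
```

For uncorrelated (paramagnetic-like) samples the resulting histogram concentrates around $q=0$ with width $\sim 1/\sqrt{N}$; structure away from zero, as in Fig. 4(c,d), signals a non-trivial overlap distribution.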
As shown in the SI, the performance of 4N is on par with that of other algorithms with more parameters. Moreover, with a suitable increase in the number of layers, 4N also achieves satisfactory results as the system size increases.
Figure 6: (a) Comparison between the correlation length $\xi$ and the decay parameter $\overline{\ell}$ of the exponential fit as a function of the inverse temperature $\beta$ . Both data sets are averaged over ten instances. The data are limited to $T≥ 0.45$ because below that temperature the finite size of the sample prevents us from obtaining a reliable estimate of the correlation length. (b) Absolute difference between $\langle\log P^{\ell}_{\text{NN}}\rangle_{\text{data}}$ at various $\ell$ and at $\ell=10$ for $T=1.02$ and different system sizes. Data are averaged over ten different instances of the disorder and the colored areas identify regions corresponding to plus or minus one standard deviation. A version of this plot in linear-logarithmic scale is available in the SI.
Spatial correlation function
To better quantify spatial correlations, a proper correlation function for a spin glass system can be defined from the overlap field as [35]
$$
G(r)=\frac{1}{\mathcal{N}(r)}\sum_{i,j:d_{ij}=r}\overline{\langle q_{i}q_{j}\rangle}\ , \tag{13}
$$
where $r=d_{ij}$ is the lattice distance between sites $i,j$ , ${q_{i}=\sigma^{1}_{i}\sigma^{2}_{i}}$ is the local overlap, and $\mathcal{N}(r)=\sum_{i,j:d_{ij}=r}1$ is the number of pairs of sites at distance $r$ . Note that in this way $G(0)=1$ by construction. Results for $G(r)$ at various temperatures are shown in Fig. 5. It is clear that increasing the number of layers leads to a better agreement at lower temperatures. Interestingly, architectures with a smaller number of layers produce a smaller correlation function at any distance, which shows that upon increasing the number of layers the 4N architecture is effectively learning the correlations of the Gibbs-Boltzmann measure.
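Eq. (13) can be estimated from paired replicas as in the sketch below; for a single disorder instance only the thermal (sample) average is taken, and using the Manhattan metric for $d_{ij}$ is an illustrative assumption:

```python
import numpy as np
from collections import defaultdict

def spin_glass_correlation(samples1, samples2, L):
    """Estimate G(r) of Eq. (13) on an L x L open lattice.

    samples1, samples2: arrays of shape (n_samples, L*L) holding the two
    replicas (+/-1 spins); the local overlap is q_i = s1_i * s2_i.
    The lattice distance d_ij is taken here as the Manhattan distance,
    an assumption -- any lattice metric can be substituted.
    """
    q = samples1 * samples2                  # local overlaps, one row per sample
    qq = q.T @ q / q.shape[0]                # sample average of q_i q_j
    sums, counts = defaultdict(float), defaultdict(int)
    for i in range(L * L):
        xi, yi = divmod(i, L)
        for j in range(L * L):
            xj, yj = divmod(j, L)
            r = abs(xi - xj) + abs(yi - yj)
            sums[r] += qq[i, j]
            counts[r] += 1
    rs = sorted(sums)
    return np.array(rs), np.array([sums[r] / counts[r] for r in rs])
```

Note that $G(0)=1$ holds automatically, since $q_i^2=1$ for every sample.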
To quantify this intuition, we extract from the exact $G(r)$ of the EA model a correlation length $\xi$ , from the relation $G(\xi)=1/e$ . Then, we compare $\xi$ with $\overline{\ell}$ , i.e. the number of layers above which the model becomes accurate, extracted from Fig. 3. These two quantities are compared in Fig. 6 as a function of temperature, showing that they are roughly proportional. In fact, $\overline{\ell}$ seems to grow even more slowly than $\xi$ at low temperatures. In Fig. 6 we also reproduce the results of Fig. 3b for different system sizes. The curves coincide for large enough values of $N$ , showing that the required value of $\ell$ saturates to a size-independent value as the system size increases. We are thus able to introduce a physical criterion to fix the optimal number of layers needed for the 4N architecture to perform well: $\ell$ must be of the order of the correlation length $\xi$ of the model at that temperature.
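The extraction of $\xi$ from the condition $G(\xi)=1/e$ can be implemented, for instance, by interpolating the measured $G(r)$; the linear interpolation below is our own choice, not necessarily the one used for the figure:

```python
import numpy as np

def correlation_length(r, G, threshold=1.0 / np.e):
    """Extract xi from G(xi) = 1/e by linearly interpolating between the
    first pair of consecutive points where G crosses the threshold."""
    for k in range(len(G) - 1):
        if G[k] >= threshold > G[k + 1]:
            t = (G[k] - threshold) / (G[k] - G[k + 1])
            return r[k] + t * (r[k + 1] - r[k])
    return None  # G never drops below 1/e within the measured range

# sanity check: a pure exponential G(r) = exp(-r/2) must give xi close to 2
r = np.arange(10)
G = np.exp(-r / 2.0)
xi = correlation_length(r, G)
```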
We recall that the 4N architecture requires $(c+1)×\ell× N$ parameters, where $c$ is the average connectivity of the graph ( $c=2d$ for a $d$ -dimensional lattice), and $\mathcal{O}(\mathcal{B}(\ell)× N)$ operations to generate a new configuration. We thus conclude that, for a model with a correlation length $\xi$ that remains finite for $N→∞$ , we can choose $\ell$ proportional to $\xi$ , hence achieving the desired linear scaling with system size both for the number of parameters and the number of operations.
Discussion
In this work we have introduced a new, physics-inspired 4N architecture for the approximation of the Gibbs-Boltzmann probability distribution of a spin glass system of $N$ spins. The 4N architecture is inspired by the previously introduced TwoBo [21], but it implements explicitly the locality of the physical model via a Graph Neural Network structure.
The 4N network possesses many desirable properties. (i) The architecture has a physical interpretation, with the required number of layers being directly linked to the correlation length of the original data. (ii) As a consequence, if the correlation length of the model is finite, the required number of parameters for a good fit scales linearly in the number of spins, thus greatly improving the scalability with respect to previous architectures. (iii) The architecture is very general, and can be applied to lattices in all dimensions as well as arbitrary interaction graphs without losing the good scaling of the number of parameters, as long as the average connectivity remains finite. It can also be generalized to systems with more than two-body interactions by using a factor graph representation in the GNN layers.
As a benchmark study, we have tested that the 4N architecture is powerful enough to reproduce key properties of the two-dimensional Edwards-Anderson spin glass model, such as the energy, the correlation function and the probability distribution of the overlap.
Having verified that 4N achieves the best possible scaling with system size while remaining expressive enough, the next step is to check whether it is possible to apply an iterative, self-consistent procedure to automatically train the network at lower and lower temperatures without the aid of training data generated with a different algorithm (here, parallel tempering). This would make it possible to set up a simulated tempering procedure with good scaling in the number of variables, with applications not only to the simulation of disordered systems but also to the solution of optimization problems. Whether setting up such a procedure is actually possible will be the focus of future work.
Methods
Details of the 4N architecture
The Nearest-Neighbours Neural Network (4N) architecture uses a GNN structure to respect the locality of the problem, as illustrated in Fig. 1. The input, given by the local fields computed from the already fixed spins, is passed through $\ell$ layers of weights $W_{ij}^{k}$ . Each weight corresponds to a directed nearest neighbor link $i← j$ (including $i=j$ ) and $k=1,...,\ell$ , therefore the network has a total number of parameters equal to $(c+1)×\ell× N$ , i.e. linear in $N$ as long as the number of layers and the model connectivity remain finite. A summary of the actual wall-clock times required for training the network and for generating new configurations is given in the SI.
The operations performed in order to compute ${\pi_{i}=P(\sigma_{i}=1\mid\boldsymbol{\sigma}_{<i})}$ are the following:
1. initialize the local fields from the input spins $\boldsymbol{\sigma}_{<i}$ :
$$
h_{l}^{i,(0)}=\sum_{f(<i)}J_{lf}\sigma_{f}\ ,\qquad\forall l:|i-l|\leq\ell\ . \tag{14}
$$
2. iterate the GNN layers $\forall k=1,\dots,\ell-1$ :
$$
h_{l}^{i,(k)}=\phi^{k}\left[\beta W_{ll}^{k}h_{l}^{i,(k-1)}+\sum_{j\in\partial l}\beta J_{lj}W_{lj}^{k}h_{j}^{i,(k-1)}\right]+h_{l}^{i,(0)}\ , \tag{15}
$$
where the sum over $j\in\partial l$ runs over the neighbors of site $l$ . The fields are computed $\forall l:|i-l|\leq\ell-k$ , see Fig. 1.
3. compute the field on site $i$ at the final layer,
$$
h_{i}^{i,(\ell)}=\phi^{\ell}\left[\beta W_{ii}^{\ell}h_{i}^{i,(\ell-1)}+\sum_{j\in\partial i}\beta J_{ij}W_{ij}^{\ell}h_{j}^{i,(\ell-1)}\right]\ .
$$
4. output $\pi_{i}=P(\sigma_{i}=1|\boldsymbol{\sigma}_{<i})=S[2h_{i}^{i,(\ell)}]$ .
Here $S(x)=1/(1+e^{-x})$ is the sigmoid function and $\phi^{k}(x)$ are layer-specific functions that can be used to introduce non-linearities in the network. The fields $h_{l}^{i,(0)}$ correspond to $\xi_{il}$ in the notation of the TwoBo architecture. Differently from TwoBo, however, 4N uses all the $h_{l}^{i,(0)}$ , instead of only those corresponding to spins that have not yet been fixed.
In practice, we have observed that the network with identity functions, i.e. $\phi^{k}(x)=x$ for all $k$ , is sufficiently expressive, and using non-linearities does not seem to improve the results. Moreover, linear functions have the advantage of being easily explainable and faster; we have therefore chosen to use them in this work.
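Steps 1-4 with identity activations $\phi^{k}(x)=x$ can be condensed into the following dense-matrix sketch (an illustration of the equations, not the authors' implementation; in 4N the weight matrices $W^{k}$ are supported only on links and on the diagonal, and fields are only propagated within distance $\ell$ of $i$, while here everything is kept dense for clarity):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def four_n_pi(i, sigma, J, W, beta):
    """Sketch of the 4N forward pass (steps 1-4) with identity activations.

    i     : site whose conditional probability is computed
    sigma : full configuration (+/-1); only sigma_{<i} is actually used
    J     : (N, N) symmetric coupling matrix with zero diagonal
    W     : list of ell weight matrices (dense here for clarity)
    """
    N = len(sigma)
    masked = np.where(np.arange(N) < i, sigma, 0)      # keep only fixed spins
    h = h0 = J @ masked                                # step 1: initial fields
    for Wk in W[:-1]:                                  # step 2: layers + skip
        h = beta * ((J + np.eye(N)) * Wk) @ h + h0
    h = beta * ((J + np.eye(N)) * W[-1]) @ h           # step 3: final layer
    return sigmoid(2.0 * h[i])                         # step 4: output pi_i
```

The masking guarantees that $\pi_i$ depends only on the already fixed spins $\boldsymbol{\sigma}_{<i}$, as required by the autoregressive structure.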
Spin update order
Autoregressive models such as 4N are based on an ordering of spins to implement Eq. (6). One can choose a simple raster scan (i.e., from left to right on one row, updating the rows sequentially from top to bottom) or any other scheme of choice, see e.g. Refs. [14, 36] for a discussion. In this work, we have adopted a spiral scheme, in which the spins at the edges of the system are updated first, and then one moves in spirals towards the central spins. As shown in the SI, it appears that 4N performs better, under equal training conditions, when the spins are updated following a spiral scheme rather than, for instance, a raster scheme.
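For concreteness, a spiral ordering can be generated as follows (a sketch: the traversal direction is not specified in the text, so the clockwise, outermost-first choice here is ours):

```python
def spiral_order(L):
    """Sites of an L x L lattice listed boundary-first, moving inward
    ring by ring (sites are indexed row-major, i = row * L + col)."""
    order = []
    top, bottom, left, right = 0, L - 1, 0, L - 1
    while top <= bottom and left <= right:
        for c in range(left, right + 1):              # top row, left to right
            order.append(top * L + c)
        for r in range(top + 1, bottom + 1):          # right column, downwards
            order.append(r * L + right)
        if top < bottom:
            for c in range(right - 1, left - 1, -1):  # bottom row, right to left
                order.append(bottom * L + c)
        if left < right:
            for r in range(bottom - 1, top, -1):      # left column, upwards
                order.append(r * L + left)
        top, bottom, left, right = top + 1, bottom - 1, left + 1, right - 1
    return order
```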
Building the training data
We use a training set of $M=2^{16}$ configurations obtained at several temperatures using a Parallel Tempering algorithm for different instances of the couplings $J_{ij}$ , on a $16\times 16$ two-dimensional square lattice with open boundary conditions. Data were gathered in runs of $2^{27}$ steps, discarding the first $2^{26}$ steps for thermalization and then taking one configuration every $2^{10}$ steps. Temperature swap moves, in which one swaps two configurations at nearby temperatures, were attempted once every 32 standard Metropolis sweeps. The choice of $M=2^{16}$ was motivated by previous studies of the same model [12].
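The temperature swap move follows the standard parallel tempering acceptance rule, sketched below (the function name is ours):

```python
import numpy as np

def pt_swap_accept(E1, E2, beta1, beta2, rng):
    """Standard Metropolis rule for a parallel tempering swap between
    configurations at inverse temperatures beta1 and beta2 with energies
    E1 and E2: accept with probability min(1, exp((beta1-beta2)*(E1-E2)))."""
    delta = (beta1 - beta2) * (E1 - E2)
    return rng.random() < np.exp(min(0.0, delta))
```

A swap that lowers the energy at the colder temperature is always accepted; otherwise it is accepted with the Boltzmann-weighted probability, which preserves detailed balance in the extended ensemble.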
Model training
The training has been carried out by maximizing the likelihood of the data, a procedure equivalent to minimizing the KL divergence $D_{\text{KL}}(P_{\text{GB}}\parallel P_{\text{NN}})$ , with $P_{\text{GB}}$ estimated from the training set. We used an ADAM optimizer with batch size 128 for 80 epochs and a learning rate of 0.001 with exponential decay in time. Training is stopped if the loss has not improved in the last 10 steps, and the lowest-loss model is retained.
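The training objective can be written per configuration as a negative log-likelihood over the autoregressive conditionals; a minimal sketch (names are ours) is:

```python
import numpy as np

def nll_per_sample(sigma, pi):
    """Negative log-likelihood of one configuration under an autoregressive
    model, where pi[i] = P(sigma_i = +1 | sigma_{<i}) and sigma_i = +/-1.
    Averaged over the training set, minimizing this loss is equivalent to
    minimizing D_KL(P_GB || P_NN) up to the (constant) entropy of P_GB."""
    p = np.where(sigma == 1, pi, 1.0 - pi)   # probability of the observed spin
    return -np.log(p).sum()

# example: three spins with model probabilities 0.9, 0.8, 0.5
# for the observed values
sigma = np.array([1, -1, 1])
pi = np.array([0.9, 0.2, 0.5])
loss = nll_per_sample(sigma, pi)
```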
Acknowledgements. We especially thank Indaco Biazzo and Giuseppe Carleo for several important discussions on their TwoBo architecture that motivated and inspired the present study. We also thank Giulio Biroli, Patrick Charbonneau, Simone Ciarella, Marylou Gabrié, Leonardo Galliano, Jeanne Trinquier, Martin Weigt and Lenka Zdeborová for useful discussions. We acknowledge support from the computational infrastructure DARIAH.IT, PON Project code PIR01_00022, National Research Council of Italy. The research has received financial support from the “National Centre for HPC, Big Data and Quantum Computing - HPC”, Project CN_00000013, CUP B83C22002940006, NRP Mission 4 Component 2 Investment 1.5, Funded by the European Union - NextGenerationEU. Author LMDB acknowledges funding from the Bando Ricerca Scientifica 2024 - Avvio alla Ricerca (D.R. No. 1179/2024) of Sapienza Università di Roma, project B83C24005280001 – MaLeDiSSi.
References
- [1] Ulli Wolff. Collective monte carlo updating for spin systems. Physical Review Letters, 62(4):361–364, 1989. Publisher: American Physical Society.
- [2] Andrea Ninarello, Ludovic Berthier, and Daniele Coslovich. Models and algorithms for the next generation of glass transition studies. Phys. Rev. X, 7:021039, Jun 2017.
- [3] E. Marinari and G. Parisi. Simulated tempering: A new monte carlo scheme. Europhysics Letters, 19(6):451, jul 1992.
- [4] J. Houdayer. A cluster monte carlo algorithm for 2-dimensional spin glasses. The European Physical Journal B - Condensed Matter and Complex Systems, 22(4):479–484, 2001.
- [5] Giuseppe Carleo and Matthias Troyer. Solving the quantum many-body problem with artificial neural networks. Science, 355(6325):602–606, 2017. Publisher: American Association for the Advancement of Science.
- [6] Samuel Tamagnone, Alessandro Laio, and Marylou Gabrié. Coarse grained molecular dynamics with normalizing flows. arXiv preprint arXiv:2406.01524, 2024.
- [7] Gerhard Jung, Rinske M Alkemade, Victor Bapst, Daniele Coslovich, Laura Filion, François P Landes, Andrea Liu, Francesco Saverio Pezzicoli, Hayato Shiba, Giovanni Volpe, et al. Roadmap on machine learning glassy liquids. arXiv preprint arXiv:2311.14752, 2023.
- [8] Leonardo Galliano, Riccardo Rende, and Daniele Coslovich. Policy-guided monte carlo on general state spaces: Application to glass-forming mixtures. arXiv preprint arXiv:2407.03275, 2024.
- [9] Simone Ciarella, Dmytro Khomenko, Ludovic Berthier, Felix C Mocanu, David R Reichman, Camille Scalliet, and Francesco Zamponi. Finding defects in glasses through machine learning. Nature Communications, 14(1):4229, 2023.
- [10] Marylou Gabrié, Grant M. Rotskoff, and Eric Vanden-Eijnden. Adaptive monte carlo augmented with normalizing flows. Proceedings of the National Academy of Sciences, 119(10):e2109420119, 2022. Publisher: Proceedings of the National Academy of Sciences.
- [11] Dian Wu, Lei Wang, and Pan Zhang. Solving statistical mechanics using variational autoregressive networks. Physical Review Letters, 122(8):080602, 2019. Publisher: American Physical Society.
- [12] Simone Ciarella, Jeanne Trinquier, Martin Weigt, and Francesco Zamponi. Machine-learning-assisted monte carlo fails at sampling computationally hard problems. Machine Learning: Science and Technology, 4(1):010501, 2023.
- [13] B. McNaughton, M. V. Milošević, A. Perali, and S. Pilati. Boosting monte carlo simulations of spin glasses using autoregressive neural networks. Physical Review E, 101(5):053312, 2020. Publisher: American Physical Society.
- [14] Jeanne Trinquier, Guido Uguzzoni, Andrea Pagnani, Francesco Zamponi, and Martin Weigt. Efficient generative modeling of protein sequences using simple autoregressive models. Nature communications, 12(1):5800, 2021.
- [15] Jonas Köhler, Leon Klein, and Frank Noé. Equivariant flows: Exact likelihood generative learning for symmetric densities.
- [16] Manuel Dibak, Leon Klein, Andreas Krämer, and Frank Noé. Temperature steerable flows and boltzmann generators. Physical Review Research, 4(4):L042005, 2022. Publisher: American Physical Society.
- [17] Michele Invernizzi, Andreas Krämer, Cecilia Clementi, and Frank Noé. Skipping the replica exchange ladder with normalizing flows. The Journal of Physical Chemistry Letters, 13(50):11643–11649, 2022.
- [18] Wenlin Chen, Mingtian Zhang, Brooks Paige, José Miguel Hernández-Lobato, and David Barber. Diffusive gibbs sampling. arXiv preprint arXiv:2402.03008, 2024.
- [19] Davide Ghio, Yatin Dandi, Florent Krzakala, and Lenka Zdeborová. Sampling with flows, diffusion and autoregressive neural networks: A spin-glass perspective. arXiv preprint arXiv:2308.14085, 2023.
- [20] Michael Kilgour and Lena Simine. Inside the black box: A physical basis for the effectiveness of deep generative models of amorphous materials. Journal of Computational Physics, 452:110885, 2022.
- [21] Indaco Biazzo, Dian Wu, and Giuseppe Carleo. Sparse autoregressive neural networks for classical spin systems.
- [22] Martin J. A. Schuetz, J. Kyle Brubaker, and Helmut G. Katzgraber. Combinatorial optimization with physics-inspired graph neural networks. Nature Machine Intelligence, 4(4):367–377, 2022. Publisher: Nature Publishing Group.
- [23] Martin J. A. Schuetz, J. Kyle Brubaker, Zhihuai Zhu, and Helmut G. Katzgraber. Graph coloring with physics-inspired graph neural networks. Physical Review Research, 4(4):043131, 2022. Publisher: American Physical Society.
- [24] Changjun Fan, Mutian Shen, Zohar Nussinov, Zhong Liu, Yizhou Sun, and Yang-Yu Liu. Searching for spin glass ground states through deep reinforcement learning. Nature Communications, 14(1):725, February 2023. Publisher: Nature Publishing Group.
- [25] Daria Pugacheva, Andrei Ermakov, Igor Lyskov, Ilya Makarov, and Yuriy Zotov. Enhancing gnns performance on combinatorial optimization by recurrent feature update. arXiv:2407.16468, 2024.
- [26] Lorenzo Colantonio, Andrea Cacioppo, Federico Scarpati, and Stefano Giagu. Efficient graph coloring with neural networks: A physics-inspired approach for large graphs. arXiv:2408.01503, 2024.
- [27] Maria Chiara Angelini and Federico Ricci-Tersenghi. Modern graph neural networks do worse than classical greedy algorithms in solving combinatorial optimization problems like maximum independent set. Nature Machine Intelligence, 5(1):29–31, December 2022. arXiv:2206.13211 [cond-mat].
- [28] Stefan Boettcher. Inability of a graph neural network heuristic to outperform greedy algorithms in solving combinatorial optimization problems. Nature Machine Intelligence, 5(1):24–25, 2023.
- [29] Stefan Boettcher. Deep reinforced learning heuristic tested on spin-glass ground states: The larger picture. Nature Communications, 14(1):5658, 2023.
- [30] Changjun Fan, Mutian Shen, Zohar Nussinov, Zhong Liu, Yizhou Sun, and Yang-Yu Liu. Reply to: Deep reinforced learning heuristic tested on spin-glass ground states: The larger picture. Nature communications, 14(1):5659, 2023.
- [31] Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. Made: Masked autoencoder for distribution estimation. In International conference on machine learning, pages 881–889. PMLR, 2015.
- [32] Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. Advances in neural information processing systems, 29, 2016.
- [33] Ata Madanchi, Michael Kilgour, Frederik Zysk, Thomas D Kühne, and Lena Simine. Simulations of disordered matter in 3d with the morphological autoregressive protocol (map) and convolutional neural networks. The Journal of Chemical Physics, 160(2), 2024.
- [34] Indaco Biazzo. The autoregressive neural network architecture of the boltzmann distribution of pairwise interacting spins systems. Communications Physics, 6(1):296, 2023.
- [35] LA Fernandez, E Marinari, V Martin-Mayor, Giorgio Parisi, and JJ Ruiz-Lorenzo. Universal critical behavior of the two-dimensional ising spin glass. Physical Review B, 94(2):024402, 2016.
- [36] Yi Hong Teoh and Roger G Melko. Autoregressive model path dependence near ising criticality. arXiv:2408.15715, 2024.
S1 Computational details
We now describe an easy way to efficiently generate data using the 4N architecture. As long as the activation functions $\phi^{k}$ of the different layers of 4N are linear, all the intermediate message-passing layers can be collapsed into a single matrix multiplication. Indeed, let us consider, as in the main text, identity activation functions. In this case, we can write the matrix implementing the $t$-th layer as $\underline{\underline{A}}^{t}=\beta(\underline{\underline{J}}+\underline{\underline{I}})\odot\underline{\underline{W}}^{t}$ , where $\beta$ is the inverse temperature, $\underline{\underline{J}}$ is the coupling matrix, $\underline{\underline{I}}$ is the identity matrix, and $\underline{\underline{W}}^{t}$ is the $t$-th layer weight matrix. Here, $\odot$ denotes elementwise multiplication. Now we observe that (in 4N we put skip connections after all but the last layer)
$$
\mathbf{h}^{t}=\underline{\underline{A}}^{t}\mathbf{h}^{t-1}=\underline{\underline{A}}^{t}(\underline{\underline{A}}^{t-1}\mathbf{h}^{t-2}+\mathbf{h}^{0})=\dots=(\underline{\underline{A}}^{t}\underline{\underline{A}}^{t-1}\dots\underline{\underline{A}}^{1}+\underline{\underline{A}}^{t}\underline{\underline{A}}^{t-1}\dots\underline{\underline{A}}^{2}+\dots+\underline{\underline{A}}^{t}\underline{\underline{A}}^{t-1}+\underline{\underline{A}}^{t})\mathbf{h}^{0}. \tag{16}
$$
Now, remembering that, if the input configuration $\mathbf{s}$ is properly masked, we have $\mathbf{h}^{0}=\underline{\underline{J}}\mathbf{s}$ , it follows that
$$
\mathbf{h}^{t}=(\underline{\underline{A}}^{t}\underline{\underline{A}}^{t-1}\dots\underline{\underline{A}}^{1}+\underline{\underline{A}}^{t}\underline{\underline{A}}^{t-1}\dots\underline{\underline{A}}^{2}+\dots+\underline{\underline{A}}^{t}\underline{\underline{A}}^{t-1}+\underline{\underline{A}}^{t})\underline{\underline{J}}\mathbf{s}, \tag{17}
$$
which can be rewritten in matrix form as $\mathbf{h}^{t}=\underline{\underline{M}}\mathbf{s}$ by defining
$$
\underline{\underline{M}}=(\underline{\underline{A}}^{t}\underline{\underline{A}}^{t-1}\dots\underline{\underline{A}}^{1}+\underline{\underline{A}}^{t}\underline{\underline{A}}^{t-1}\dots\underline{\underline{A}}^{2}+\dots+\underline{\underline{A}}^{t}\underline{\underline{A}}^{t-1}+\underline{\underline{A}}^{t})\underline{\underline{J}}, \tag{18}
$$
and the final output is obtained simply by taking the elementwise sigmoid of $\mathbf{h}^{t}$ . Notice that the $i$-th row of the matrix $\underline{\underline{M}}$ has zeroes in all columns corresponding to spins at distance larger than $\ell+1$ from spin $i$ ; that is, the number of non-zero elements per row is constant with system size as long as $\ell$ is kept fixed. Therefore, generation can simply be implemented as a series of vector products, each considering only the nonzero elements of the matrix $\underline{\underline{M}}$ . This allows one to compute the $i$-th element of the vector $\mathbf{h}^{t}$ in a time $\mathcal{O}(1)$ in system size, thus keeping the generation procedure $\mathcal{O}(N)$ . Moreover, the training time scales as $\mathcal{O}(N)$ .
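The collapse of the linear layers into the single matrix $\underline{\underline{M}}$ of Eq. (18) can be checked numerically against the layer-by-layer recursion with skip connections, as in this sketch with random dense matrices:

```python
import numpy as np

def collapse_layers(As, J):
    """Build M of Eq. (18) from layer matrices A^1 ... A^t:
    M = (A^t...A^1 + A^t...A^2 + ... + A^t A^{t-1} + A^t) J."""
    S = np.zeros_like(As[0])
    prod = np.eye(As[0].shape[0])
    for k in range(len(As) - 1, -1, -1):   # accumulate A^t ... A^k, k = t..1
        prod = prod @ As[k]
        S = S + prod
    return S @ J

# compare against the layer-by-layer recursion
rng = np.random.default_rng(0)
N, t = 8, 3
J = rng.normal(size=(N, N)); J = (J + J.T) / 2.0; np.fill_diagonal(J, 0.0)
As = [rng.normal(size=(N, N)) for _ in range(t)]
s = rng.choice([-1.0, 1.0], size=N)
h0 = J @ s
h = h0.copy()
for k in range(t - 1):        # skip connections after all but the last layer
    h = As[k] @ h + h0
h = As[t - 1] @ h             # last layer: no skip connection
M = collapse_layers(As, J)    # M @ s should reproduce h
```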
We point out that we can also train directly the whole matrix $\underline{\underline{M}}$ , with the additional constraint of setting to zero the elements corresponding to spins at distance larger than $\ell+1$ . Conceptually, this approach can be interpreted as a reduced MADE, in which the weight matrix is not only triangular but has additional zeros. We will refer to this approach as $\ell$ -MADE in the following. This method is easier to implement and can be faster in training and generation, but it has the disadvantage of requiring more parameters than the standard 4N approach, e.g. $\mathcal{O}(\ell^{d}N)$ for a $d$-dimensional lattice against the $(c+1)\ell N$ of 4N. Moreover, it loses the physical interpretation of 4N, as it can no longer be interpreted as a series of fields propagating on the lattice.
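The $\ell$-MADE constraint can be encoded as a sparsity mask on $\underline{\underline{M}}$, for example as follows (the Manhattan metric and the raster ordering are our illustrative assumptions):

```python
import numpy as np

def ell_made_mask(L, ell, order=None):
    """Boolean mask for the l-MADE matrix M on an L x L lattice:
    M[i, j] may be nonzero only if j precedes i in the autoregressive
    order AND sites i, j are within lattice distance ell + 1
    (Manhattan distance is assumed here as the lattice metric)."""
    N = L * L
    if order is None:
        order = list(range(N))            # raster order by default
    rank = {site: k for k, site in enumerate(order)}
    mask = np.zeros((N, N), dtype=bool)
    for i in range(N):
        xi, yi = divmod(i, L)
        for j in range(N):
            xj, yj = divmod(j, L)
            close = abs(xi - xj) + abs(yi - yj) <= ell + 1
            mask[i, j] = close and rank[j] < rank[i]
    return mask
```

With a large $\ell$ the mask reduces to the strictly triangular structure of a standard MADE; the distance cutoff is what removes the extra, physically irrelevant parameters.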
Wall-clock generation and training times are shown in Fig. S1. A comparison between Fig. 3b obtained using 4N and its analogue obtained using $\ell$ -MADE is shown in Fig. S2.
Figure S1: Scatter plots of the generation (left) and training (right) elapsed real times at different system sizes for $\ell$ -MADE. Both training and generation have been performed using dummy data (therefore $N$ need not be the square of an integer). For generation, we generate 10000 configurations of size $N$ five times. For training, we train a model with $\ell=2$ for 40 epochs using an Adam optimizer once per value of $N$ . In both cases, elapsed times scale linearly with the system size, as further evidenced by the linear fits (black lines).
<details>
<summary>x15.png Details</summary>

### Visual Description
Panel (a): scatter plot of $|\langle\log P_{\text{AR}}^{\ell}\rangle_{\text{data}}|/S$ versus the number of layers $\ell$ (2 to 10), for seven temperatures from $T=0.58$ (dark blue) to $T=2.83$ (yellow). At the lowest temperatures the curves decrease steadily with $\ell$ (e.g. from about 1.085 at $\ell=2$ to about 1.025 at $\ell=10$ for $T=0.58$), while for $T\geq 1.41$ they remain nearly flat and close to 1.00. The higher the temperature, the closer the ratio is to 1, indicating that the distribution is easier to learn.
</details>
<details>
<summary>x16.png Details</summary>

### Visual Description
Panel (b): same quantity as panel (a), $|\langle\log P_{\text{AR}}^{\ell}\rangle_{\text{data}}|/S$ versus $\ell$ (2 to 10), for the same seven temperatures. The low-temperature curves again decrease with $\ell$ (e.g. from about 1.075 at $\ell=2$ to about 1.02 at $\ell=10$ for $T=0.58$), while the curves for $T\geq 1.41$ remain nearly constant around 1.002-1.003, and the rate of decrease diminishes as $\ell$ grows.
</details>
Figure S2: Ratio between the cross-entropy $|\langle\log P^{\ell}_{\text{NN}}\rangle_{\text{data}}|$ and the Gibbs-Boltzmann entropy $S_{\text{GB}}$ for different values of the temperature $T$ and of the parameter $\ell$ for 4N (a, left) and $\ell$ -MADE (b, right). Both $|\langle\log P^{\ell}_{\text{NN}}\rangle_{\text{data}}|$ and $S_{\text{GB}}$ are averaged over 10 samples.
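The quantity plotted in Figure S2 is the cross-entropy of the network on Gibbs-Boltzmann samples divided by the Gibbs-Boltzmann entropy; a ratio of 1 means the network matches the target entropy on average. A minimal sketch (function and variable names are illustrative, not the paper's code):

```python
import numpy as np

def cross_entropy_ratio(logp_model, S_GB):
    """Ratio |<log P_NN>_data| / S_GB as in Figure S2.

    logp_model: model log-probabilities of configurations sampled from
    the Gibbs-Boltzmann distribution; S_GB: Gibbs-Boltzmann entropy.
    """
    return np.abs(np.mean(logp_model)) / S_GB

# Sanity check: a model that is exactly uniform over 4 states has
# cross-entropy log(4), equal to the entropy of the uniform target.
logp = np.log(np.full(1000, 0.25))
ratio = cross_entropy_ratio(logp, np.log(4.0))
```

Ratios above 1, as in Figure S2 at low temperature, signal that the network needs extra nats per configuration compared to the optimal code.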
S2 Comparison between different architectures
<details>
<summary>x17.png Details</summary>

### Visual Description
Line chart titled "T = 0.31, Instance 2": $P(q)$ versus $q\in[-1,1]$ for PT (black dashed), 4N with $\ell=10$ (gold), MADE (teal), NADE (red) and TwoBo (green). All curves are symmetric and bimodal, with peaks near $q=\pm 0.75$ (height about 0.75) and a minimum near $q=0$. PT, MADE, NADE and 4N lie close to one another; TwoBo has more pronounced peaks (about 0.95) and a deeper minimum.
</details>
<details>
<summary>x18.png Details</summary>

### Visual Description
Line chart titled "T = 0.31, Instance 3": $P(q)$ versus $q\in[-1,1]$ for the same five curves (PT black dashed, 4N gold, MADE teal, NADE red, TwoBo green). All distributions are bimodal with peaks near $q=\pm 0.75$ of height about 0.8; the models differ mainly in the depth of the minimum near $q=0$, where MADE and NADE track PT closely while 4N and TwoBo show a deeper dip.
</details>
Figure S3: Comparison of the probability distribution of the overlap obtained using different architectures: 4N with $\ell=10$ layers (12160 parameters), shallow MADE (32640 parameters), NADE with $N_{h}=64$ (32768 parameters) and TwoBo (4111 parameters). The black dashed line is obtained using data gathered via Parallel Tempering (PT), while solid color lines are obtained using the different architectures. A qualitative analysis shows that 4N performs comparably to MADE and NADE despite having fewer parameters, and better than TwoBo. Although TwoBo has fewer parameters in this $N=256$ case, 4N scales better with the system size, $\mathcal{O}(N)$ compared to $\mathcal{O}(N^{\frac{3}{2}})$ , as pointed out in the main text. The TwoBo implementation comes from the authors' GitHub repository (https://github.com/cqsl/sparse-twobody-arnn) and was slightly modified to support supervised learning (its results could likely be improved by a different choice of the training hyperparameters).
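The overlap distributions compared in Figure S3 can be estimated from two independent sets of sampled configurations via $q=\frac{1}{N}\sum_i s_i^a s_i^b$. A minimal sketch, assuming $\pm 1$ spin configurations stored as arrays (an illustrative helper, not the code used in the paper):

```python
import numpy as np

def overlap_histogram(configs_a, configs_b, bins=50):
    """Estimate P(q) from two independent sets of +-1 spin configurations.

    configs_a, configs_b: arrays of shape (n_samples, N); the overlap of
    each replica pair is q = (1/N) sum_i s_i^a s_i^b.
    Returns a normalized histogram over q in [-1, 1].
    """
    q = np.mean(configs_a * configs_b, axis=1)
    hist, edges = np.histogram(q, bins=bins, range=(-1.0, 1.0), density=True)
    return hist, edges

# Example with independent random spins: P(q) concentrates near q = 0.
rng = np.random.default_rng(0)
s_a = rng.choice([-1, 1], size=(2000, 256))
s_b = rng.choice([-1, 1], size=(2000, 256))
hist, edges = overlap_histogram(s_a, s_b)
```

For a spin glass at low temperature the samples are correlated and $P(q)$ develops the bimodal structure visible in Figure S3.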
S3 Comparison between different update sequences
<details>
<summary>x19.png Details</summary>

### Visual Description
Line chart titled "T = 0.31, Instance 2": $P(q)$ versus $q\in[-1,1]$ for PT (black dashed), Spiral (red) and Raster (gold). PT and Spiral show similar symmetric double-peaked distributions with peaks near $q=\pm 0.75$ (heights about 0.85 and 0.75 respectively), whereas Raster collapses to a single sharp peak of height about 1.75 centred at $q=0$.
</details>
<details>
<summary>x20.png Details</summary>

### Visual Description
Line chart titled "T = 0.31, Instance 3": $P(q)$ versus $q\in[-1,1]$ for PT (black dashed), Spiral (red) and Raster (gold). PT and Spiral are symmetric with peaks near $q=\pm 0.75$ (heights about 0.96 and 0.78) and a secondary peak near $q=0$; Raster instead concentrates its weight near the centre, peaking at about 1.16 around $q=\pm 0.25$.
</details>
Figure S4: Comparison of the probability distribution of the overlap obtained using different update sequences of the spins for $\ell=10$ . The black dashed line is obtained using data gathered via Parallel Tempering (PT), while solid color lines are obtained using 4N with different update sequences. Spiral: spins are updated following a spiral going from the outside to the inside of the lattice; Raster: spins are updated row by row, from left to right. The spiral update performs much better than the raster update for the 4N architecture under equal training conditions, something that is not observed when training, for instance, the MADE architecture. This effect could be related to the locality of the 4N architecture.
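The two update sequences can be made concrete as orderings of the sites of an $L\times L$ lattice. The sketch below is an illustrative implementation (the paper may use a different convention, e.g. for the direction of the spiral):

```python
def raster_order(L):
    """Raster update sequence: row by row, left to right."""
    return [(i, j) for i in range(L) for j in range(L)]

def spiral_order(L):
    """Spiral update sequence: from the outer boundary inward."""
    top, bottom, left, right = 0, L - 1, 0, L - 1
    order = []
    while top <= bottom and left <= right:
        # Top row, left to right; then right column, top to bottom.
        order += [(top, j) for j in range(left, right + 1)]
        order += [(i, right) for i in range(top + 1, bottom + 1)]
        if top < bottom:  # bottom row, right to left
            order += [(bottom, j) for j in range(right - 1, left - 1, -1)]
        if left < right:  # left column, bottom to top
            order += [(i, left) for i in range(bottom - 1, top, -1)]
        top, bottom, left, right = top + 1, bottom - 1, left + 1, right - 1
    return order
```

Both orderings visit each of the $N=L^2$ sites exactly once; only the autoregressive conditioning structure they induce differs.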
S4 Cross-entropy plots in lin-lin and lin-log scale
<details>
<summary>x21.png Details</summary>

### Visual Description
Panel (b): lin-lin plot of $(|\langle\log P_{\text{AR}}^{\ell}\rangle_{\text{data}}-\langle\log P_{\text{AR}}^{10}\rangle_{\text{data}}|)/N$ versus the number of layers $\ell$ (2 to 10) for four system sizes, from $L=12$, $N=144$ (dark blue) to $L=24$, $N=576$ (yellow). All curves decrease monotonically from roughly 0.005-0.006 at $\ell=2$ towards 0 at $\ell=10$, with the smallest system consistently highest; shaded bands around each curve indicate plus or minus one standard deviation.
</details>
<details>
<summary>x22.png Details</summary>

### Visual Description
Panel (a): same quantity as panel (b), $(|\langle\log P_{\text{AR}}^{\ell}\rangle_{\text{data}}-\langle\log P_{\text{AR}}^{10}\rangle_{\text{data}}|)/N$, plotted with a logarithmic $y$-axis (ticks at $10^{-4}$ and $10^{-3}$). For all four system sizes the curves decay close to exponentially in $\ell$, with larger systems consistently lower; the relative uncertainty (shaded bands) grows as the difference shrinks.
</details>
Figure S5: Absolute difference between $\langle\log P^{\ell}_{\text{NN}}\rangle_{\text{data}}$ at various $\ell$ and at $\ell=10$ , for $T=1.02$ and different system sizes, in lin-lin and lin-log scale. Data are averaged over ten different instances of the disorder, and the colored areas identify regions corresponding to plus or minus one standard deviation.
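The per-spin convergence gap plotted in Figure S5 is straightforward to compute from the model log-probabilities at two depths; below is a minimal illustrative helper (names are assumptions, not the paper's code):

```python
import numpy as np

def convergence_gap(logp_l, logp_ref, N):
    """Per-spin gap (|<log P^l>_data - <log P^ref>_data|) / N, as in Figure S5.

    logp_l, logp_ref: model log-probabilities of the same test
    configurations at depth l and at the reference depth (l = 10 here).
    """
    return abs(np.mean(logp_l) - np.mean(logp_ref)) / N

# Example: a constant 1-nat gap per configuration on an N = 256 system.
gap = convergence_gap([-100.0] * 5, [-99.0] * 5, 256)
```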