# Chapter 0 Statistical Mechanics and Artificial Neural Networks: Principles, Models, and Applications
## Abstract
The field of neuroscience and the development of artificial neural networks (ANNs) have mutually influenced each other, drawing from and contributing to many concepts initially developed in statistical mechanics. Notably, Hopfield networks and Boltzmann machines are versions of the Ising model, a model extensively studied in statistical mechanics for over a century. In the first part of this chapter, we provide an overview of the principles, models, and applications of ANNs, highlighting their connections to statistical mechanics and statistical learning theory.
Artificial neural networks can be seen as high-dimensional mathematical functions, and understanding the geometric properties of their loss landscapes (i.e., the high-dimensional space on which one wishes to find extrema or saddles) can provide valuable insights into their optimization behavior, generalization abilities, and overall performance. Visualizing these functions can help us design better optimization methods and improve their generalization abilities. Thus, the second part of this chapter focuses on quantifying geometric properties and visualizing loss functions associated with deep ANNs.
Contents
1. Introduction
    1. The McCulloch–Pitts calculus and Hebbian learning
    2. The Perceptron
    3. Objective Bayes, entropy, and information theory
    4. Connections to the Ising model
2. Hopfield network
3. Boltzmann machine learning
    1. Boltzmann machines
    2. Restricted Boltzmann machines
    3. Derivation of the learning algorithm
4. Loss landscapes of artificial neural networks
    1. Principal curvature in random projections
    2. Extracting curvature information
    3. Hessian directions
5. Conclusions
### 1 Introduction
There is a long tradition of studying learning systems within statistical mechanics. [1, 2, 3] Some versions of artificial neural networks (ANNs) were, in fact, inspired by the Ising model, which was proposed in 1920 by Wilhelm Lenz as a model of ferromagnetism and solved in one dimension by his doctoral student, Ernst Ising, in 1924. [4, 5, 6] In the language of machine learning, the Ising model can be considered a non-learning recurrent neural network (RNN).
Contributions to ANN research extend beyond statistical mechanics, however, requiring a broad range of scientific disciplines. Historically, these efforts have been marked by eras of interdisciplinary collaboration referred to as cybernetics [7, 8], connectionism [9, 10], artificial intelligence [11, 12], and machine learning [13, 14]. To emphasize the importance of multidisciplinary contributions in ANN research, we will often highlight the research disciplines of individuals who have played pivotal roles in the advancement of ANNs.
#### 1 The McCulloch–Pitts calculus and Hebbian learning
An influential paper by the neuroscientist Warren S. McCulloch and the logician Walter Pitts published in 1943 is often regarded as inaugurating neural network research, [15] but in fact there was an active research community at that time working on the mathematics of neural activity. [16] McCulloch and Pitts’s pivotal contribution was to join propositional logic and Alan Turing’s mathematical notion of computation [17] to propose a calculus for creating compositional representations of neural behavior. Their work was later taken up by the mathematician Stephen C. Kleene [18], who made parts of it more accessible. Kleene comments on the work by McCulloch and Pitts in his 1956 article [18]: “The present memorandum is partly an exposition of the McCulloch–Pitts results; but we found the part of their paper which treats of arbitrary nets obscure; so we have proceeded independently here.”
Contemporaneously, the psychologist Donald O. Hebb formulated in his 1949 book “The Organization of Behavior” [19] a neurophysiological postulate, which is known today as “Hebb’s rule” or “Hebbian learning”:

> Let us assume then that the persistence or repetition of a reverberatory activity (or “trace”) tends to induce lasting cellular changes that add to its stability. … When an axon of cell $A$ is near enough to excite a cell $B$ and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that $A$’s efficiency, as one of the cells firing $B$, is increased.
>
> (Donald O. Hebb, “The Organization of Behavior”, 1949)

In the modern literature, Hebbian learning is often summarized by the slogan “neurons wire together if they fire together”. [20]
#### 2 The Perceptron
In 1958, the psychologist Frank Rosenblatt developed the perceptron [21, 22], which is essentially a binary classifier that is based on a single artificial neuron and a Hebbian-type learning rule. The artificial neuron of this type is sometimes referred to as “McCulloch–Pitts neuron”, owing to its connection to the foundational work of McCulloch and Pitts [15]. Some authors distinguish McCulloch–Pitts neurons from perceptrons by noting that the former handle binary signals, while the latter can process real-valued inputs. In parallel to Rosenblatt’s work, the electrical engineers Bernard Widrow and Ted Hoff developed a similar ANN named ADALINE (Adaptive Linear Neuron). [23]
We show a schematic of a perceptron in Fig. 1. It receives inputs $x_1,x_2,\dots,x_n$ that are weighted with $w_1,w_2,\dots,w_n$. Within the summer (indicated by a $Σ$ symbol), the artificial neuron computes $∑_iw_ix_i+b$, where $b$ is a bias term. This quantity is then used as input of a step activation function (indicated by a step symbol). The output of this function is $y$. Such simple artificial neurons can be used to implement logical functions including AND, OR, and NOT. For example, consider a perceptron with Heaviside activation, two binary inputs $x_1,x_2∈\{0,1\}$, weights $w_1=w_2=1$, and bias $b=-1.5$. We can readily verify that $y=x_1\wedge x_2$ (see the short sketch after this paragraph). An example of a logical function that cannot be represented by a single perceptron is XOR. This was shown by the computer scientists Marvin L. Minsky and Seymour A. Papert in their 1969 book “Perceptrons: An Introduction to Computational Geometry” [11]. Based on their analysis of perceptrons, they considered the extension of research on this topic to be “sterile”.
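As a quick sanity check of the AND construction above, the following minimal sketch (ours, not part of the original perceptron literature) implements a McCulloch–Pitts-style neuron with Heaviside activation and verifies the truth table:

```python
# Minimal perceptron sketch (illustrative; weights and bias chosen as in the text).
import itertools

def heaviside(x):
    """Step activation: returns 1 for x >= 0 and 0 otherwise."""
    return 1 if x >= 0 else 0

def perceptron(x1, x2, w=(1.0, 1.0), b=-1.5):
    """Single artificial neuron computing heaviside(w1*x1 + w2*x2 + b)."""
    return heaviside(w[0] * x1 + w[1] * x2 + b)

for x1, x2 in itertools.product([0, 1], repeat=2):
    assert perceptron(x1, x2) == (x1 and x2)  # reproduces the logical AND
    print(x1, x2, "->", perceptron(x1, x2))
```

The same weights with bias $b=-0.5$ would instead implement OR.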
Problems like the one associated with representing XOR using a single perceptron can be overcome by employing multilayer perceptrons (MLPs) or, in a broader sense, multilayer feedforward networks equipped with various types of activation functions. [24, 25, 26] One effective way to train such multilayered structures is reverse mode automatic differentiation (i.e., backpropagation). [27, 28, 29, 30, 31]
Notice that perceptrons and other ANNs predominantly operate on real-valued data. Nevertheless, there is an expanding body of research focused on ANNs that are designed to handle complex-valued and, in some cases, quaternion-based information. [32, 33, 34, 35, 36, 37, 38]
Figure 1: A schematic of a perceptron with output $y$ , inputs $x_1,x_2,\dots,x_n$ , corresponding weights $w_1,w_2,\dots,w_n$ , and a bias term $b$ . Traditionally, the inputs and outputs of perceptrons were considered to be binary (e.g., $\{-1,1\}$ or $\{0,1\}$ ), while weights and biases are generally represented as real numbers. The symbol $Σ$ signifies the summation process wherein the artificial neuron computes $∑_iw_ix_i+b$ . This computed value then becomes the input for a step activation function, denoted by a step symbol. We denote the final output as $y$ . While the original perceptron employed a step activation function, contemporary applications frequently utilize a variety of functions including sigmoid, $\tanh$ , and rectified linear unit (ReLU).
#### 3 Objective Bayes, entropy, and information theory
Over the same period of time, seminal work in probability and statistics appeared that would later prove crucial to the development of ANNs. In the early 20th century, a renewed interest in inverse “Bayesian” inference [39, 40] met strong resistance from the emerging frequentist school of statistics led by Jerzy Neyman, Karl Pearson, and especially Ronald Fisher, who advocated his own fiducial inference as a better alternative to Bayesian inference [41, 42]. In response to criticism that Bayesian methods were irredeemably subjective, Harold Jeffreys presented a methodology for “objectively” assigning prior probabilities based on symmetry principles and invariance properties. [43]
Edwin Jaynes, building on Jeffreys’s ideas for objective Bayesian priors and Shannon’s quantification of uncertainty [44], proposed the Maximum Entropy (MaxEnt) principle as a rationality criterion for assigning prior probabilities that are at once consistent with known information while remaining maximally uncommitted about what remains uncertain [45, 46, 47]. Jaynes envisioned the MaxEnt principle as the basis for a “logic of science” [48]. For instance, he proposed a reinterpretation of the principles of thermodynamics in information-theoretic terms, in which macro-states of a physical system are understood through the partial information they provide about the possible underlying microstates, showing how the original Boltzmann–Gibbs interpretation of entropy could be replaced by a more general framework based on Shannon entropy. That said, objective Bayesianism does not remove subjective judgment [49], MaxEnt is far from a universal principle for encoding prior information [50], and the terms ‘objective’ and ‘subjective’ are outdated. [51, 52]
For example, the MaxEnt principle can be used to select the least biased distribution from many common exponential-family distributions by imposing constraints that match the expectation values of their sufficient statistics. That is, given constraints on the expected values of some functions $T(x)$ of the data, the probability distribution that maximizes entropy under those constraints often takes the form of an exponential-family distribution
$$
p(x\midθ)=h(x) \exp\left(θ^⊤T(x)-A(θ)\right) , \tag{1}
$$
where $h(x)$ is the underlying base measure, $θ$ is the natural parameter of the distribution, $A(θ)$ is the log normalizer, and $T(x)$ is the sufficient statistic determined by MaxEnt under the given constraints. However, MaxEnt-derived distributions depend on contingencies of the data and on the availability of the right constraints for the derivation to go through, contingencies that undermine viewing MaxEnt as a fundamental principle of rational inference.
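To make the recipe in Eq. (1) concrete, consider a standard textbook example (ours, not drawn from the cited references): maximizing the differential entropy of a density $p(x)$ on $[0,∞)$ subject to normalization and a fixed mean $⟨x⟩=μ$. Introducing Lagrange multipliers $λ_0$ and $λ_1$ for the two constraints and setting the functional derivative of the Lagrangian to zero gives
$$
-\ln p(x)-1-λ_0-λ_1x=0 \quad\Longrightarrow\quad p(x)∝ e^{-λ_1x} ,
$$
and imposing the constraints fixes $λ_1=1/μ$, so that $p(x\midμ)=e^{-x/μ}/μ$. This is an exponential-family distribution of the form (1) with $h(x)=1$, $T(x)=x$, natural parameter $θ=-1/μ$, and $A(θ)=-\ln(-θ)$.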
Nevertheless, Jaynes’s notion of treating probability as an extension of logic [48, 53] and probabilistic inference as governed by coherence conditions on information states [54, 55] has proved useful in optimization (e.g., cross-entropy loss [56]), entropy-based regularization of ANNs [57], and Bayesian neural networks. [58]
#### 4 Connections to the Ising model
Returning to the Ising model, a modified version of it was endowed with a learning algorithm by the mathematical neuroscientist Shun’ichi Amari in 1972 and has been employed for learning patterns and sequences of patterns. [59] Amari’s self-organizing ANN is an early model of associative memory that was later popularized by the physicist John Hopfield. [60] Despite its limited representational capacity and its limited use in practical applications, the study of Hopfield-type learning systems remains an active research area. [61, 36, 62, 63]
The Hopfield network shares its Hamiltonian with the Ising model. Its learning algorithm is deterministic and aims at minimizing the “energy” of the system. One problem with the deterministic learning algorithm of Hopfield networks is that it cannot escape local minima. A solution to this problem is to employ a stochastic learning rule akin to the Monte-Carlo methods that are used to generate Ising configurations. [64] In 1985, David H. Ackley, Geoffrey E. Hinton, and Terrence J. Sejnowski proposed such a learning algorithm for an Ising-type ANN, which they called a “Boltzmann machine” (BM). [65, 66, 67, 68] In 1986, Smolensky introduced the concept of the “Harmonium” [69], which later evolved into restricted Boltzmann machines (RBMs). In contrast to Boltzmann machines, RBMs benefit from more efficient training algorithms that were developed in the early 2000s. [70, 71] Since then, RBMs have been applied in various contexts, such as reducing the dimensionality of datasets [72], studying phase transitions [73, 74, 3, 75], and representing wave functions [76, 77].
Our historical overview of ANNs aims at emphasizing the strong interplay between statistical mechanics and related fields in the investigation of learning systems. Given the extensive span of over eight decades since the pioneering work of McCulloch and Pitts, summarizing the history of ANNs within a few pages cannot do justice to its depth and significance. We therefore refer the reader to the excellent review by Jürgen Schmidhuber (see Ref. 78) for a more detailed overview of the history of deep learning in ANNs.
As computing power continues to increase, deep (i.e., multilayer) ANNs have, over the past decade, become pervasive across various scientific domains. [79] Different established and newly developed ANN architectures have not only been employed to tackle scientific problems but have also found extensive applications in tasks such as image classification and natural language processing. [80, 81, 82, 83]
Finally, we would also like to emphasize an area of research that has significantly advanced due to the availability of automatic differentiation [84] and the utilization of ANNs as universal function approximators. [85, 86, 87] This area resides at the intersection of non-linear dynamics [88, 89], control theory [90, 91, 92, 93, 94, 95, 96, 97, 98], and dynamics-informed learning, where researchers often aim to integrate ANNs and their associated learning algorithms with domain-specific knowledge. [99, 100, 101, 102, 103, 104] This integration introduces an inductive bias that facilitates effective learning.
The outline of this chapter is as follows. To better illustrate the connections between the Ising model and ANNs, we devote Secs. 2 and 3 to the Hopfield model and Boltzmann machines, respectively. In Sec. 4, we discuss how mathematical tools from high-dimensional probability and differential geometry can help us study the loss landscape of deep ANNs. [105] We conclude this chapter in Sec. 5.
### 2 Hopfield network
A Hopfield network is a complete (i.e., fully connected) undirected graph in which each node is an artificial neuron of McCulloch–Pitts type (see Fig. 1). We show an example of a Hopfield network with $n=5$ neurons in Fig. 2. We use $x_i∈\{-1,1\}$ to denote the state of neuron $i$ ($i∈\{1,\dots,n\}$). Because the underlying graph is fully connected, neuron $i$ receives $n-1$ inputs $x_j$ ($j≠ i$). The inputs $x_j$ associated with neuron $i$ are assigned weights $w_{ij}∈ℝ$. In a Hopfield network, weights are assumed to be symmetric (i.e., $w_{ij}=w_{ji}$), and self-weights are considered to be absent (i.e., $w_{ii}=0$).
Figure 2: An example of a Hopfield network with $n=5$ neurons. Each neuron is connected to all other neurons through black edges, representing both inputs and outputs. (Adapted from 106.)
With these definitions in place, the activation of neuron $i$ is given by
$$
a_i=∑_jw_{ij}x_j+b_i , \tag{2}
$$
where $b_i$ is the bias term of neuron $i$ .
Regarding connections between the Ising model and Hopfield networks, the states of artificial neurons align with the binary values employed in the Ising model to model the orientations of elementary magnets, often referred to as classical “spins”. The weights in a Hopfield network represent the potentially diverse couplings between different elementary magnets. Lastly, the bias terms assume the function of external magnetic fields that act on these elementary magnets. Given these structural similarities between the Ising model and Hopfield networks, it is natural to assign them the Ising-type energy function
$$
E=-\frac{1}{2}∑_{i,j}w_{ij}x_ix_j-∑_ib_ix_i . \tag{3}
$$
Hopfield networks are not just a collection of interconnected artificial neurons, but they are dynamical systems whose states $x_i$ evolve in discrete time according to
$$
x_i←\begin{cases}1 ,&\text{if }a_i≥ 0\\
-1 ,&\text{otherwise} ,\end{cases} \tag{4}
$$
where $a_i$ is the activation of neuron $i$ [see Eq. (2)].
For asynchronous updates in which the state of one neuron is updated at a time, the update rule (4) has an interesting property: Under this update rule, the energy $E$ [see Eq. (3)] of a Hopfield network never increases.
The proof of this statement is straightforward. We are interested in the cases where an update of $x_i$ causes it to change sign; otherwise, the energy stays constant. Consider the two cases (i) $a_i=∑_jw_{ij}x_j+b_i<0$ with $x_i=1$, and (ii) $a_i≥ 0$ with $x_i=-1$. In the first case, the energy difference is
$$
Δ E_i=E(x_i=-1)-E(x_i=1)=2\Bigl(b_i+∑_jw_{ij}x_j\Bigr)=2a_i . \tag{5}
$$
Notice that $Δ E_i<0$ because $a_i=∑_jw_{ij}x_j+b_i<0$.
In the second case, we have
$$
Δ E_i=E(x_i=1)-E(x_i=-1)=-2\Bigl(b_i+∑_jw_{ij}x_j\Bigr)=-2a_i . \tag{6}
$$
The energy difference satisfies $Δ E_i≤ 0$ because $a_i≥ 0$ . In summary, we have shown that $Δ E_i≤ 0$ for all changes of the state variable $x_i$ according to update rule (4).
Equations (5) and (6) also show that the energy difference associated with a change of sign in $x_i$ is $2a_i$ and $-2a_i$ for $a_i<0$ and $a_i≥ 0$, respectively. Absorbing the bias term $b_i$ into the weights $w_{ij}$ by associating an extra, always-active unit with every node in the network yields for the corresponding energy differences
$$
Δ E_i=± 2∑_jw_{ij}x_j . \tag{7}
$$
One potential application of Hopfield networks is to store and retrieve information in local minima by adjusting their weights. For instance, consider the task of storing $N$ binary (black and white) patterns in a Hopfield network. Let these $N$ patterns be represented by the set $\{p_i^ν=± 1|1≤ i≤ n\}$ with $ν∈\{1,\dots,N\}$ .
To learn the weights that represent our binary patterns, we can apply the Hebbian-type rule
$$
w_{ij}=w_{ji}=\frac{1}{N}∑_{ν=1}^Np_i^νp_j^ν . \tag{8}
$$
In Hopfield networks with a large number of neurons, the capacity for storing and retrieving patterns is limited to approximately 14% of the total number of neurons. [107] However, it is possible to enhance this capacity through the use of modified versions of the learning rule (8). [108]
After applying the Hebbian learning rule to adjust the weights of a given Hopfield network, we may be interested in studying the evolution of different initial configurations $\{x_i\}$ according to update rule (4). A Hopfield network accurately represents a pattern $\{p_i^ν=± 1|1≤ i≤ n\}$ if $x_i$ remains equal to $p_i^ν$ both before and after an update for all $i$ (meaning $\{p_i^ν\}$ is a fixed point of the system). Depending on the number of stored patterns and the chosen initial configuration, the Hopfield network may converge to a local minimum that does not align with the desired pattern. To mitigate this behavior, one can employ a stochastic update rule instead of the deterministic rule (4). This will be the topic that we are going to discuss in the following section.
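The following minimal sketch (ours, using the conventions of this section and zero biases) stores a few random patterns with the Hebbian rule (8) and retrieves one of them from a corrupted initial configuration via asynchronous updates (4):

```python
# Hopfield network sketch: Hebbian storage, Eq. (8), and asynchronous retrieval, Eq. (4).
import numpy as np

rng = np.random.default_rng(0)
n, N = 100, 3                                   # number of neurons, number of patterns
patterns = rng.choice([-1, 1], size=(N, n))     # p_i^nu in {-1, 1}

W = patterns.T @ patterns / N                   # Hebbian weights, Eq. (8)
np.fill_diagonal(W, 0.0)                        # no self-weights, w_ii = 0

x = patterns[0].copy()                          # start from pattern 0 ...
flip = rng.choice(n, size=n // 10, replace=False)
x[flip] *= -1                                   # ... with 10% of the spins flipped

for _ in range(10):                             # asynchronous sweeps; energy (3) never increases
    for i in rng.permutation(n):
        x[i] = 1 if W[i] @ x >= 0 else -1

print("overlap with stored pattern:", x @ patterns[0] / n)   # close to 1 if retrieval succeeds
```

With only a few stored patterns relative to $n$, the overlap typically returns to 1; pushing the ratio of patterns to neurons beyond the capacity limit mentioned above makes retrieval fail.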
### 3 Boltzmann machine learning
In Hopfield networks, if we start with an initial configuration that is close enough to a desired local energy minimum, we can recover the corresponding pattern $\{p_i^ν\}$. However, for certain applications, relying solely on the deterministic update rule (4) may not be enough, as it never accepts transitions that are associated with an increase in energy. In constraint satisfaction tasks, we often require learning algorithms to have the capability to occasionally accept transitions to configurations of higher energy and move the system under consideration away from local minima towards a global minimum. A method that is commonly used in this context is the M(RT)$^2$ algorithm introduced by Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller in their seminal 1953 paper “Equation of State Calculations by Fast Computing Machines”. [109, 110] This algorithm became the basis of many optimization methods, such as simulated annealing. [111]
In the 1980s, the M(RT)$^2$ algorithm was adapted to equip Hopfield-type systems with a stochastic update rule, in which neuron $i$ is activated (set to $1$) regardless of its current state, with probability
$$
σ_i≡σ(Δ E_i/T)=σ(2a_i/T)=\frac{1}{1+\exp(-Δ E_i/T)} . \tag{9}
$$
Otherwise, it is set to $-1$. The corresponding ANNs were dubbed “Boltzmann machines”. [65, 66, 67, 68]
Figure 3: The sigmoid function $σ_i≡σ(Δ E_i/T)$ [see Eq. (9)] as function of $Δ E_i$ for $T=0.5,1,2$ .
In Eq. (9), the quantity $Δ E_i=2a_i$ is the energy difference between an inactive neuron $i$ and an active one [see Eq. (5)]. We use the convention employed in Refs. 67, 68. The authors of Ref. 66 use the convention that $Δ E_i=-2a_i$ and $σ_i≡σ(Δ E_i/T)=1/(1+\exp(Δ E_i/T))$. The function $σ(x)=1/\left(1+\exp(-x)\right)$ is the sigmoid function, and the parameter $T$ serves as an equivalent of temperature. In the limit $T→ 0$, we recover the deterministic update rule (4). In Fig. 3, we show $σ(Δ E_i/T)$ as a function of $Δ E_i$ for $T=0.5,1,2$.
Examining Eqs. (3) and (9), we notice that we are simulating a system akin to the Ising model with Glauber (heat bath) dynamics. [112, 64] Because Glauber dynamics satisfy the detailed balance condition, Boltzmann machines will eventually reach thermal equilibrium. The corresponding probabilities $p_{\mathrm{eq}}(X)$ and $p_{\mathrm{eq}}(Y)$ for the ANN to be in states $X$ and $Y$, respectively, satisfy (we set the Boltzmann constant $k_B$ to 1)
$$
\frac{p_{\mathrm{eq}}(Y)}{p_{\mathrm{eq}}(X)}=\exp\left(-\frac{E(Y)-E(X)}{T}\right) . \tag{10}
$$
In other words, the Boltzmann distribution provides the relative probability $p_{\mathrm{eq}}(Y)/p_{\mathrm{eq}}(X)$ associated with the states $X$ and $Y$ of a “thermalized” Boltzmann machine. Regardless of the initial configuration, at a given temperature $T$, the stochastic update rule in which neurons are activated with probability $σ_i$ drives the system to a thermal equilibrium in which the probability of a configuration is determined solely by its energy.
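A minimal sketch of the stochastic update rule (9) (ours; the weight matrix, biases, and temperature are placeholder inputs) could look as follows:

```python
# One asynchronous sweep of the stochastic (Glauber/heat-bath) update rule, Eq. (9).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stochastic_sweep(x, W, b, T, rng):
    """Update every neuron once: set x_i = +1 with probability sigma(2 a_i / T), else -1."""
    for i in rng.permutation(len(x)):
        a_i = W[i] @ x + b[i]                      # activation, Eq. (2)
        if rng.random() < sigmoid(2.0 * a_i / T):  # Delta E_i = 2 a_i, Eq. (5)
            x[i] = 1
        else:
            x[i] = -1
    return x
```

Repeating such sweeps at fixed $T$ lets the network relax towards the Boltzmann distribution (10).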
#### 1 Boltzmann machines
Unlike Hopfield networks, Boltzmann machines have two types of nodes: visible units and hidden units. We denote the corresponding sets of visible and hidden units by $V$ and $H$ , respectively. Notice that the set $H$ may be empty. In Fig. 4, we show an example of a Boltzmann machine with five visible units and three hidden units.
During the training of a Boltzmann machine, the visible units are “clamped” to the environment, which means they are set to binary vectors drawn from an empirical distribution. Hidden units may be used to account for constraints involving more than two visible units.
Figure 4: Boltzmann machines consist of visible units (blue) and hidden units (red). In the shown example, there are five visible units $\{v_i\}$ ($i∈\{1,\dots,5\}$) and three hidden units $\{h_j\}$ ($j∈\{1,2,3\}$). Similar to Hopfield networks, the network architecture of a Boltzmann machine is complete.
We denote the probability distribution of all configurations of visible units $\{ν\}$ in a freely running network as $P^\prime\left(\{ν\}\right)$. Here, “freely running” means that no external inputs are fixed (or “clamped”) to the visible units. We derive the distribution $P^\prime\left(\{ν\}\right)$ by summing (i.e., marginalizing) over the corresponding joint probability distribution. That is,
$$
P^\prime\left(\{ν\}\right)=∑_{\{h\}}P^\prime\left(\{ν\},\{h\}\right) , \tag{11}
$$
where the summation is performed over all possible configurations of hidden units $\{h\}$ .
Our objective is now to devise a method such that $P^\prime\left(\{ν\}\right)$ converges to the unknown environment (i.e., data) distribution $P\left(\{ν\}\right)$. To do so, we quantify the disparity between $P^\prime\left(\{ν\}\right)$ and $P\left(\{ν\}\right)$ using the Kullback–Leibler (KL) divergence (or relative entropy)
$$
G(P,P^\prime)=∑_{\{ν\}}P\left(\{ν\}\right)\ln\left[\frac{P\left(\{ν\}\right)}{P^\prime\left(\{ν\}\right)}\right] . \tag{12}
$$
To minimize the KL divergence $G(P,P^\prime)$ , we perform a gradient descent according to
$$
\frac{∂ G}{∂ w_{ij}}=-\frac{1}{T}\left(p_{ij}-p^\prime_{ij}\right) , \tag{13}
$$
where $p_{ij}$ represents the probability of both units $i$ and $j$ being active when the environment dictates the states of the visible units, and $p^\prime_{ij}$ is the corresponding probability in a freely running network without any connection to the environment. [66, 67, 68] We derive Eq. (13) in Sec. 3.
Both probabilities $p_{ij}$ and $p_{ij}^\prime$ are measured once the Boltzmann machine has reached thermal equilibrium. Subsequently, the weights $w_{ij}$ of the network are updated according to
$$
Δ w_{ij}=ε\left(p_{ij}-p^\prime_{ij}\right) , \tag{14}
$$
where $ε$ is the learning rate. To reach thermal equilibrium, the states of the visible and hidden units are updated using the update probability (9).
In summary, the steps relevant for training a Boltzmann machine are as follows.

1. Clamp the input data (environment distribution) to the visible units.
2. Update the state of all hidden units according to Eq. (9) until the system reaches thermal equilibrium and then compute $p_{ij}$.
3. Unclamp the input data from the visible units.
4. Update the state of all neurons according to Eq. (9) until the system reaches thermal equilibrium and then compute $p_{ij}^\prime$.
5. Update all weights according to Eq. (14) and return to step 2, or stop if the weight updates are sufficiently small.

After training a Boltzmann machine, we can unclamp the visible units from the environment and generate samples to evaluate their quality. To do this, we use various initial configurations and activate neurons according to Eq. (9) until thermal equilibrium is reached. If the Boltzmann machine was trained successfully, the distribution of states for the unclamped visible units should align with the environment distribution.
#### 2 Restricted Boltzmann machines
Boltzmann machines are not widely used in general learning tasks. Their impracticality arises from the significant computational burden associated with reaching thermal equilibrium, especially for large system sizes.
Figure 5: Restricted Boltzmann machines consist of a visible layer (blue) and a hidden layer (red). In the shown example, the respective layers comprise six visible units $\{v_i\}$ ($i∈\{1,\dots,6\}$) and four hidden units $\{h_j\}$ ($j∈\{1,\dots,4\}$). The network structure of an RBM is bipartite and undirected.
Restricted Boltzmann machines provide an ANN structure that can be trained more efficiently by omitting connections within the visible layer and within the hidden layer (see Fig. 5). Because of these missing intra-layer connections, the network architecture of an RBM is bipartite.
In RBMs, updates for visible and hidden units are performed alternately. Because there are no connections within each layer, we can update all units within each layer in parallel. Specifically, visible unit $v_i$ is activated with conditional probability
$$
p(v_i=1|\{h_j\})=σ\Bigl(∑_jw_{ij}h_j+b_i\Bigr) , \tag{15}
$$
where $b_i$ is the bias of visible unit $v_i$ and $\{h_j\}$ is a given configuration of hidden units. We then activate all hidden units based on their conditional probabilities
$$
p(h_j=1|\{v_i\})=σ\Bigl(∑_iw_{ij}v_i+c_j\Bigr) , \tag{16}
$$
where $h_j$ represents hidden unit $j$ , $c_j$ is its associated bias, and $\{v_i\}$ denotes a specific configuration of visible units. This technique of sampling is referred to as “block Gibbs sampling”.
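The alternating updates of Eqs. (15) and (16) can be sketched as follows (ours; we assume $\{0,1\}$-valued units and a weight matrix `W` of shape $n_v× n_h$ purely for illustration):

```python
# Block Gibbs sampling in an RBM: all hidden units given the visible layer, and vice versa.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, c, rng):
    """Sample h ~ p(h_j = 1 | {v_i}), Eq. (16); returns the sample and the probabilities."""
    p_h = sigmoid(v @ W + c)
    return (rng.random(p_h.shape) < p_h).astype(float), p_h

def sample_visible(h, W, b, rng):
    """Sample v ~ p(v_i = 1 | {h_j}), Eq. (15); returns the sample and the probabilities."""
    p_v = sigmoid(h @ W.T + b)
    return (rng.random(p_v.shape) < p_v).astype(float), p_v
```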
Training an RBM shares similarities with training a BM. The key difference lies in the need to consider the bipartite network structure in the weight update equation (14). For an RBM, the weight updates are
$$
Δ w_{ij}=ε\left(⟨v_ih_j⟩_{\mathrm{data}}-⟨v_ih_j⟩_{\mathrm{model}}\right) . \tag{17}
$$
Instead of sampling configurations to compute $⟨v_ih_j⟩_{\mathrm{data}}$ and $⟨v_ih_j⟩_{\mathrm{model}}$ at thermal equilibrium, we can employ only a few relaxation steps. This approach is known as “contrastive divergence”. [70, 113] The corresponding weight updates are
$$
Δ w_{ij}^{\mathrm{CD}}=ε\left(⟨v_ih_j⟩_{\mathrm{data}}-⟨v_ih_j⟩_{\mathrm{model}}^{k}\right) . \tag{18}
$$
The superscript $k$ indicates the number of block Gibbs updates performed. For a more detailed discussion on the contrastive divergence training of RBMs, see Ref. 114.
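Reusing the `sample_hidden` and `sample_visible` helpers from the sketch above, a CD-$k$ weight update of the form (18) might be implemented as follows (again a sketch under the same assumptions, not a reference implementation):

```python
# Contrastive divergence (CD-k) weight update, Eq. (18), for a batch of binary visible vectors.
# Assumes sample_hidden and sample_visible from the block Gibbs sketch above are in scope.
import numpy as np

def cd_k_update(v_data, W, b, c, rng, k=1, eps=0.01):
    """Return W + eps * (<v h>_data - <v h>_model^k), averaged over the batch (k >= 1)."""
    h_sample, p_h_data = sample_hidden(v_data, W, c, rng)
    v_model = v_data
    for _ in range(k):                                   # k block Gibbs steps
        v_model, _ = sample_visible(h_sample, W, b, rng)
        h_sample, p_h_model = sample_hidden(v_model, W, c, rng)
    grad = (v_data.T @ p_h_data - v_model.T @ p_h_model) / len(v_data)
    return W + eps * grad
```

Using the conditional probabilities rather than binary samples in the correlation terms is a common variance-reduction choice.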
Figure 6: Snapshots of $32× 32$ Ising configurations are shown for $T∈\{1.5,2.5,4\}$. These configurations are derived from both M(RT)$^2$ and RBM samples. The quantity $T_c=2/\ln(1+√{2})≈ 2.269$ is the critical temperature of the two-dimensional Ising model.
Restricted Boltzmann machines have been employed in diverse contexts, including dimensionality reduction of datasets [72], the study of phase transitions [73, 74, 3, 75], and the representation of wave functions [76, 77]. In Fig. 6, we show three snapshots of Ising configurations, generated using both M(RT)$^2$ sampling and RBMs, each comprising $32× 32$ spins. The RBM was trained using $20× 10^4$ realizations of Ising configurations sampled at various temperatures. [75]
#### 3 Derivation of the learning algorithm
To derive Eq. (13), we follow the approach outlined in Ref. 68. Notice that the environment distribution $P\left(\{ν\}\right)$ does not depend on $w_{ij}$. Hence, we have
$$
\frac{∂ G}{∂ w_{ij}}=-∑_{\{ν\}}\frac{P\left(\{ν\}\right)}{P^\prime\left(\{ν\}\right)}\frac{∂ P^\prime\left(\{ν\}\right)}{∂ w_{ij}} . \tag{19}
$$
Next, we wish to compute the gradient $∂ P^\prime(\{ν\})/∂ w_{ij}$. In a freely running BM, the equilibrium distribution of the visible units follows a Boltzmann distribution. That is,
$$
P^\prime(\{ν\})=∑_{\{h\}}P^\prime\left(\{ν\},\{h\}\right)=\frac{∑_{\{h\}}e^{-E\left(\{ν,h\}\right)/T}}{∑_{\{ν,h\}}e^{-E\left(\{ν,h\}\right)/T}} . \tag{20}
$$
Here, the quantity
$$
E\left(\{ν,h\}\right)=-\frac{1}{2}∑_{i,j}w_{ij}x_i^{\{ν,h\}}x_j^{\{ν,h\}} \tag{21}
$$
is the Ising-type energy function in which field (or bias) terms are absorbed in the weights $w_{ij}$ [see Eqs. (3) and (7)]. For a BM in state $\{ν,h\}$, we denote the state of neuron $i$ by $x_i^{\{ν,h\}}$. Using
$$
\frac{∂ e^{-E\left(\{ν,h\}\right)/T}}{∂ w_{ij}}=\frac{1}{T}x_i^{\{ν,h\}}x_j^{\{ν,h\}}e^{-E\left(\{ν,h\}\right)/T} \tag{22}
$$
yields
$$
\displaystyle\begin{split}\frac{∂ P^\prime\left(\{ν\}\right)}{∂ w_{ij}}&=\frac{\frac{1}{T}∑_{\{h\}}x_i^{\{ν,h\}}x_j^{\{ν,h\}}e^{-E\left(\{ν,h\}\right)/T}}{∑_{\{ν,h\}}e^{-E\left(\{ν,h\}\right)/T}}\\
&\quad-\frac{∑_{\{h\}}e^{-E\left(\{ν,h\}\right)/T}\,\frac{1}{T}∑_{\{ν,h\}}x_i^{\{ν,h\}}x_j^{\{ν,h\}}e^{-E\left(\{ν,h\}\right)/T}}{\left(∑_{\{ν,h\}}e^{-E\left(\{ν,h\}\right)/T}\right)^2}\\
&=\frac{1}{T}\left[∑_{\{h\}}x_i^{\{ν,h\}}x_j^{\{ν,h\}}P^\prime\left(\{ν\},\{h\}\right)-P^\prime(\{ν\})∑_{\{ν,h\}}x_i^{\{ν,h\}}x_j^{\{ν,h\}}P^\prime\left(\{ν\},\{h\}\right)\right] .\end{split} \tag{23}
$$
We will now substitute this result into Eq. (19) to obtain
$$
\displaystyle\begin{split}\frac{∂ G}{∂ w_{ij}}&=-∑_{\{ν\}}\frac{P\left(\{ν\}\right)}{P^\prime\left(\{ν\}\right)}\frac{1}{T}\left[∑_{\{h\}}x_i^{\{ν,h\}}x_j^{\{ν,h\}}P^\prime\left(\{ν\},\{h\}\right)-P^\prime\left(\{ν\}\right)∑_{\{ν,h\}}x_i^{\{ν,h\}}x_j^{\{ν,h\}}P^\prime\left(\{ν\},\{h\}\right)\right]\\
&=-\frac{1}{T}\left[∑_{\{ν,h\}}x_i^{\{ν,h\}}x_j^{\{ν,h\}}P\left(\{ν\},\{h\}\right)-∑_{\{ν,h\}}x_i^{\{ν,h\}}x_j^{\{ν,h\}}P^\prime\left(\{ν\},\{h\}\right)\right] ,\end{split} \tag{24}
$$
where we used that $∑_{\{ν\}}P\left(\{ν\}\right)=1$ and $\frac{P\left(\{ν\}\right)}{P^\prime\left(\{ν\}\right)}P^\prime\left(\{ν\},\{h\}\right)=P\left(\{ν\},\{h\}\right)$. The latter equation follows from the definitions of the joint probability distributions
$$
P\left(\{ν\},\{h\}\right)=P\left(\{h\}|\{ν\}\right)P\left(\{ν\}\right) \tag{25}
$$
and
$$
P^\prime\left(\{ν\},\{h\}\right)=P^\prime\left(\{h\}|\{ν\}\right)P^\prime\left(\{ν\}\right) , \tag{26}
$$
with $P\left(\{h\}|\{ν\}\right)=P^\prime\left(\{h\}|\{ν\}\right)$. The conditional distributions $P(\{h\}|\{ν\})$ and $P^\prime(\{h\}|\{ν\})$ are identical, as the probability of observing a certain hidden state given a visible one is independent of the origin of the visible state in an equilibrated system. In other words, for the conditional distributions in an equilibrated system, it is irrelevant whether the visible state is provided by the environment or generated by a freely running machine. Defining
$$
p_{ij}\coloneqq∑_{\{ν,h\}}x_i^{\{ν,h\}}x_j^{\{ν,h\}}P\left(\{ν\},\{h\}\right) \tag{27}
$$
and
$$
p_{ij}^\prime\coloneqq∑_{\{ν,h\}}x_i^{\{ν,h\}}x_j^{\{ν,h\}}P^\prime\left(\{ν\},\{h\}\right) , \tag{28}
$$
we finally obtain Eq. (13).
### 4 Loss landscapes of artificial neural networks
Training an ANN involves minimizing a given loss function such as the KL divergence [see Eq. (12)]. The loss landscapes of ANNs are affected by several factors, including structural properties [86, 87, 115] and a range of implementation attributes [116, 117, 118]. Knowledge of their precise effects on learning performance, however, remains incomplete.
One path to a better understanding of the relationships between ANN structure, implementation attributes, and learning performance is a more in-depth analysis of the geometric properties of loss landscapes. For instance, Keskar et al. [119] analyze the local curvature around candidate minimizers via the spectrum of the underlying Hessian to characterize the flatness and sharpness of loss minima, and Dinh et al. [120] demonstrate that reparameterizations can render flat minima sharp without affecting generalization properties. Even so, one challenge to the study of geometric properties of loss landscapes is high dimensionality. To meet that challenge, several works propose to visualize the curvature around a given point by projecting the curvature properties of high-dimensional loss functions onto lower-dimensional (and often random) subspaces of two or three dimensions [121, 122, 123, 124, 125]. Horoi et al. [125], building on this approach, pursue improvements in learning by dynamically sampling points in projected low-loss regions surrounding local minima during training.
However, visualizing the curvature of high-dimensional loss landscapes relies on accurate projections of curvature properties to lower dimensions. Unfortunately, random projections do not preserve curvature information and thus do not afford accurate low-dimensional representations of it. This argument is given in Sec. 1 and illustrated with a simulation example in Sec. 2. Principal curvatures in a low-dimensional projection are nevertheless given by functions of weighted ensemble means of the Hessian elements in the original, high-dimensional space, a result also established in Sec. 1. Instead of using random projections to visualize loss functions, we propose to analyze projections along dominant Hessian directions associated with the largest-magnitude positive and negative principal curvatures.
#### 1 Principal curvature in random projections
To describe the connection between the principal curvature of a loss function $L(θ)$ with ANN parameters $θ∈ℝ^N$ and that associated with a lower-dimensional, random projection, we provide in Sec. 1 an overview of concepts from differential geometry [126, 127, 128] that are useful to mathematically describe curvature in high-dimensional spaces. In Secs. 1 and 1, we show the relationship between principal curvature and random projections. Finally, in Sec. 1, we highlight a relationship between measures of curvature presented in Sec. 1 and Hessian trace estimates, which affords an alternative to Hutchinson’s method for computing unbiased Hessian trace estimates.
Differential and information geometry concepts
In differential geometry, the principal curvatures are the eigenvalues of the shape operator (or Weingarten map; some authors distinguish between the shape operator and the Weingarten map depending on whether the change of the underlying tangent vector is described in the original manifold or in Euclidean space, see, e.g., chapter 3 in [129]), a linear endomorphism defined on the tangent space $T_p$ of $L$ at a point $p$. For the original high-dimensional space, we have $(θ,L(θ))⊆ℝ^{N+1}$ and there are $N$ principal curvatures $κ_1^θ≥κ_2^θ≥\dots≥κ_N^θ$. At a non-degenerate critical point $θ^*$ where the gradient $∇_θL$ vanishes, the matrix of the shape operator is given by the Hessian $H_θ$ with elements $(H_θ)_{ij}=∂^2L/(∂θ_i∂θ_j)$ ($i,j∈\{1,\dots,N\}$). A critical point is degenerate if the Hessian $H_θ$ at this point is singular (i.e., $\det(H_θ)=0$). At degenerate critical points, one cannot use the eigenvalues of $H_θ$ to determine whether the critical point is a minimum (positive-definite $H_θ$) or a maximum (negative-definite $H_θ$). Geometrically, at a degenerate critical point, a quadratic approximation fails to capture the local behavior of the function under study. Some works refer to the Hessian as the “curvature matrix” [130] or use it to characterize the curvature properties of $L(θ)$ [122]. In the vicinity of a non-degenerate critical point $θ^*$, the eigenvalues of the Hessian $H_θ$ are the principal curvatures and describe the loss function in the eigenbasis of $H_θ$ according to
$$
L(θ^*+Δθ)=L(θ^*)+\frac{1}{2}∑_{i=1}^Nκ_i^θΔθ_i^2 . \tag{29}
$$
The Morse lemma states that, if a critical point $θ^*$ of $L(θ)$ is non-degenerate, then there exists a chart $(\tilde{θ}_1,\dots,\tilde{θ}_N)$ in a neighborhood of $θ^*$ such that
$$
L(\tilde{θ})=-\tilde{θ}^2_1-⋯-\tilde{θ}^2_i+\tilde{θ}^2_{i+1}+⋯+\tilde{θ}^2_N+L(θ^*) , \tag{30}
$$
where $\tilde{θ}_j(θ^*)=0$ for $j∈\{1,\dots,N\}$. The loss function $L(\tilde{θ})$ in Eq. (30) is decreasing along the first $i$ directions and increasing along the remaining directions $i+1$ to $N$. Further, the index $i$ of a critical point $θ^*$ is the number of negative eigenvalues of the Hessian $H_θ$ at that point.
In the standard basis, the Hessian is
$$
H_θ\coloneqq∇_θ∇_θL(θ)=\begin{pmatrix}\frac{∂^2L}{∂θ_1^2}&⋯&\frac{∂^2L}{∂θ_1∂θ_N}\\
⋮&⋱&⋮\\
\frac{∂^2L}{∂θ_N∂θ_1}&⋯&\frac{∂^2L}{∂θ_N^2}\end{pmatrix} . \tag{31}
$$
Random projections
To graphically explore an $N$ -dimensional loss function $L$ around a critical point $θ^*$ , one may wish to work in a lower-dimensional representation. For example, a two-dimensional projection of $L$ around $θ^*$ is provided by
$$
L(θ^*+αη+βδ) , \tag{32}
$$
where the parameters $α,β∈ℝ$ scale the directions $η,δ∈ℝ^N$ . The corresponding graph representation is $(α,β,L(α,β))⊆ℝ^3$ .
In high-dimensional spaces, there are vastly more almost-orthogonal directions than exactly orthogonal ones. In fact, if the dimension of the space is large enough, two random vectors will, with high probability, be close to orthogonal [131]. Following this result, many related works [122, 124, 125] use random Gaussian directions with independent and identically distributed vector elements $η_i,δ_i∼N(0,1)$ ($i∈\{1,\dots,N\}$).
The scalar product of two random Gaussian vectors $η,δ$ can be written as a sum of differences of chi-squared distributed random variables because
$$
∑_{i=1}^Nη_iδ_i=∑_{i=1}^N\frac{1}{4}(η_i+δ_i)^2-\frac{1}{4}(η_i-δ_i)^2=∑_{i=1}^N\frac{1}{2}X_i^2-\frac{1}{2}Y_i^2 , \tag{33}
$$
where $X_i,Y_i∼N(0,1)$ .
Notice that $η,δ$ are almost orthogonal, which can be proved using a concentration inequality for chi-squared distributed random variables. For further details, see Ref. 105.
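A quick numerical illustration (ours) of this concentration effect: the cosine similarity between two independent Gaussian directions is of order $1/√{N}$ and shrinks as the dimension grows.

```python
# Cosine similarity of two random Gaussian directions concentrates around zero as N grows.
import numpy as np

rng = np.random.default_rng(1)
for N in (10, 1_000, 100_000):
    eta, delta = rng.standard_normal(N), rng.standard_normal(N)
    cos = eta @ delta / (np.linalg.norm(eta) * np.linalg.norm(delta))
    print(f"N = {N:>6}:  cos(eta, delta) = {cos:+.4f}")
```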
Principal curvature
With the form of random Gaussian projections in hand, we now analyze the principal curvatures in both the original and lower-dimensional spaces. The Hessian associated with the two-dimensional loss projection (32) is
$$
\displaystyle\begin{split}H_{α,β}&=\begin{pmatrix}\frac{∂^2L}{∂α^2}&\frac{∂^2L}{∂α∂β}\\
\frac{∂^2L}{∂β∂α}&\frac{∂^2L}{∂β^2}\end{pmatrix}\\
&=\begin{pmatrix}∑_{i,j}η_iη_j\frac{∂^2L}{∂θ_i∂θ_j}&∑_{i,j}η_iδ_j\frac{∂^2L}{∂θ_i∂θ_j}\\
∑_{i,j}η_iδ_j\frac{∂^2L}{∂θ_i∂θ_j}&∑_{i,j}δ_iδ_j\frac{∂^2L}{∂θ_i∂θ_j}\end{pmatrix}\\
&=\begin{pmatrix}(H_θ)_{ij}η^iη^j&(H_θ)_{ij}η^iδ^j\\
(H_θ)_{ij}η^iδ^j&(H_θ)_{ij}δ^iδ^j\end{pmatrix} ,\end{split} \tag{34}
$$
where we use Einstein notation in the last equality.
Because the elements of $δ,η$ are distributed according to a standard normal distribution, the second derivatives of the loss function $L$ in Eq. (34) have prefactors that are products of standard normal variables and, hence, can be expressed in terms of differences of chi-squared distributed random variables as in Eq. (33). To summarize, the elements of $H_{α,β}$ are sums of second derivatives of $L$ in the original space, weighted with chi-squared-distributed prefactors.
The principal curvatures $κ_±^{α,β}$ (i.e., the eigenvalues of $H_{α,β}$) are
$$
κ_±^{α,β}=\frac{1}{2}\left(A+C±√{4B^2+(A-C)^2}\right) , \tag{35}
$$
where $A=(H_θ)_{ij}η^iη^j$, $B=(H_θ)_{ij}η^iδ^j$, and $C=(H_θ)_{ij}δ^iδ^j$. To the best of our knowledge, a closed, analytic expression for the distribution of the quantities $A,B,C$ is not yet known [132, 133, 134, 135].
Returning to principal curvature, since $∑_{i,j}a_{ij}η^iη^j=∑_ia_{ii}η^iη^i+∑_{i≠ j}a_{ij}η^iη^j$ ($a_{ij}∈ℝ$), we find that $\mathbb{E}[A]=\mathbb{E}[C]={(H_θ)^i}_i$ and $\mathbb{E}[B]=0$, where ${(H_θ)^i}_i≡tr(H_θ)=∑_{i=1}^Nκ_i^θ$. The expected values of the quantities $A$, $B$, and $C$ correspond to ensemble means (43) in the limit $S→∞$, where $S$ is the number of independent realizations of the underlying random variable. To show that the expected values of $a_{ij}η^iη^j$ ($i≠ j$) or $a_{ij}η^iδ^j$ vanish, one can either invoke the independence of $η^i,η^j$ ($i≠ j$) and of $η^i,δ^j$, or transform both products into corresponding differences of two chi-squared random variables with the same mean [see Eq. (33)].
Hence, the expected, dimension-reduced Hessian (34) is
$$
\mathbb{E}[H_{α,β}]=\begin{pmatrix}{(H_θ)^i}_i&0\\
0&{(H_θ)^i}_i\end{pmatrix} . \tag{36}
$$
The corresponding eigenvalue (or principal curvature) $\bar{κ}^{α,β}$ is therefore given by the sum over all principal curvatures in the original space (i.e., $\bar{κ}^{α,β}=∑_{i=1}^Nκ_i^θ$). Hence, the value of the principal curvature $\bar{κ}^{α,β}$ in the expected dimension-reduced space will be either positive (if the positive principal curvatures in the original space dominate), negative (if the negative principal curvatures in the original space dominate), or close to zero (if the positive and negative principal curvatures in the original space cancel each other). As a result, saddle points will not appear as such in the expected random projection.
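The following sketch (ours) illustrates this effect for a synthetic saddle with $H_θ=\mathrm{diag}(1,\dots,1,-1,\dots,-1)$: the trace, and hence the expected projected curvature, vanishes, so individual random projections fluctuate around a flat profile and hide the saddle.

```python
# Hessian of a 2D random projection, Eq. (34), for a synthetic saddle with tr(H_theta) = 0.
import numpy as np

rng = np.random.default_rng(2)
n = 500
H_theta = np.diag(np.concatenate([np.ones(n), -np.ones(n)]))   # N = 2n principal curvatures, +1 and -1

eta, delta = rng.standard_normal(2 * n), rng.standard_normal(2 * n)
H_ab = np.array([[eta   @ H_theta @ eta,   eta   @ H_theta @ delta],
                 [delta @ H_theta @ eta,   delta @ H_theta @ delta]])   # Eq. (34)

print("principal curvatures of the projection:", np.linalg.eigvalsh(H_ab))
print("tr(H_theta) =", np.trace(H_theta))   # 0: the expected projected Hessian is the zero matrix
```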
In addition to the connection between $\bar{κ}^{α,β}$ and the principal curvatures $κ_i^θ$, we now provide an overview of additional mathematical relations between different curvature measures that are useful to quantify curvature properties of high-dimensional loss functions and their two-dimensional random projections.
Table 1: Quantities to characterize curvature.

| Symbol | Definition |
| --- | --- |
| $H_θ∈ℝ^{N× N}$ | Hessian in original loss space |
| $κ_i^θ∈ℝ$ | principal curvatures in original loss space with $i∈\{1,\dots,N\}$ (i.e., the eigenvalues of $H_θ$) |
| $H_{α,β}∈ℝ^{2× 2}$ | Hessian in a two-dimensional projection of an $N$-dimensional loss function |
| $κ_±^{α,β}∈ℝ$ | principal curvatures in a two-dimensional loss projection (i.e., the eigenvalues of $H_{α,β}$) |
| $\bar{κ}^{α,β}∈ℝ$ | principal curvature in the expected, two-dimensional loss projection (i.e., the eigenvalue of $\mathbb{E}[H_{α,β}]$) |
| $H∈ℝ$ | mean curvature (i.e., $∑_{i=1}^Nκ_i^θ/N=\bar{κ}^{α,β}/N$) |
Invoking Eq. (35), we can relate $\bar{κ}^{α,β}$ to $tr(H_θ)$ and $κ_±^{α,β}$. Because $κ_+^{α,β}+κ_-^{α,β}=A+C$, we have
$$
tr(H_θ)=\bar{κ}^{α,β}=∑_{i=1}^Nκ_i^θ=\frac{1}{2}\left(E[κ_-^{α,β}]+E[κ_+^{α,β}]\right) . \tag{37}
$$
The mean curvature $H$ in the original space is related to $\bar{κ}^{α,β}$ via
$$
H=\frac{1}{N}\bar{κ}^{α,β}=\frac{1}{N}∑_{i=1}^Nκ_i^θ . \tag{38}
$$
We summarize the definitions of the employed Hessians and curvature measures in Tab. 1.
The appeal of random projections is that pairwise distances between points in a high-dimensional space can be nearly preserved by a lower-dimensional linear embedding, affording a low-dimensional representation of mean and variance information with minimal distortion [136]. The relationship between random Gaussian directions and principal curvature is less straightforward. Our results show that the principal curvatures $κ_±^{α,β}$ in a two-dimensional loss projection are weighted averages of the Hessian elements $(H_θ)_{ij}$ in the original space, not weighted averages of the principal curvatures $κ_i^θ$ as claimed by Ref. 122. Similar arguments apply to projections with dimension larger than 2.
Hessian trace estimates
Finally, we point to a connection between curvature measures and Hessian trace estimates. A common way of estimating $tr(H_θ)$ without explicitly computing all eigenvalues of $H_θ$ is based on Hutchinson’s method [137] and randomized numerical linear algebra [138, 139]. The basic idea behind this approach is to (i) use a random vector $z∈ℝ^N$ with elements $z_i$ that are distributed according to a distribution function with zero mean and unit variance (e.g., a Rademacher distribution with $\Pr\left(z_i=± 1\right)=1/2$), and (ii) compute $z^⊤H_θz$, an unbiased estimator of $tr(H_θ)$. That is,
$$
tr(H_θ)=E[z^⊤H_θz] . \tag{39}
$$
Recall that Eq. (37) shows that the principal curvature of the expected random loss projection, $\bar{κ}^{α,β}$, is equal to $tr(H_θ)$. Instead of estimating $tr(H_θ)$ using Hutchinson’s method (39), an alternative Hutchinson-type estimate of this quantity is provided by the mean of the expected values of $κ_-^{α,β}$ and $κ_+^{α,β}$ [see Eq. (37)].
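As a brief illustration (ours, with a random symmetric positive semidefinite matrix standing in for a Hessian), Hutchinson's estimator (39) averages $z^⊤H_θz$ over random probe vectors:

```python
# Hutchinson trace estimate, Eq. (39), with Rademacher probe vectors.
import numpy as np

rng = np.random.default_rng(3)
N, S = 200, 5_000
A = rng.standard_normal((N, N))
H = A.T @ A / N                                    # symmetric stand-in for a Hessian

z = rng.choice([-1.0, 1.0], size=(S, N))           # Rademacher probes, zero mean, unit variance
estimates = ((z @ H) * z).sum(axis=1)              # z^T H z for each probe
print("estimate:", estimates.mean(), "   exact trace:", np.trace(H))
```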
#### 2 Extracting curvature information
We now study two examples that will help build intuition. In the first example, presented in Sec. 2, we study a critical point $θ^*$ of an $N$-dimensional loss function $L(θ)$ for which (i) all principal curvatures have the same magnitude and (ii) the number of positive curvature directions is equal to the number of negative curvature directions. In this example, saddles can be correctly detected by ensemble means, but success depends on the averaging process used. In the second example, presented in Sec. 2, we use a loss function associated with an unequal number of negative and positive curvature directions. For the different curvature measures derived in Sec. 1, we find that random projections cannot identify the underlying saddle point. In Sec. 2, we will use the two example loss functions to discuss how curvature-based Hessian trace estimates relate to those obtained with the original Hutchinson’s method.
Equal number of curvature directions
The loss function of our first example is
$$
L(θ)=\frac{1}{2}θ_{2n+1}\left(∑_{i=1}^nθ_i^2-θ_{i+n}^2\right) , \quad n∈ℤ_+ , \tag{40}
$$
where we set $N=2n+1$ . A critical point $θ^*$ of the loss function (40) satisfies
$$
(∇_θL)(θ^*)=\left(\begin{array}{c}θ^*_1θ^*_{2n+1}\\
⋮\\
θ^*_nθ^*_{2n+1}\\
-θ^*_{n+1}θ^*_{2n+1}\\
⋮\\
-θ^*_{2n}θ^*_{2n+1}\\
\frac{1}{2}\left(∑_{i=1}^n(θ^*_i)^2-(θ^*_{i+n})^2\right)\end{array}\right)=0 . \tag{41}
$$
The Hessian at the critical point $θ^*=(θ^*_1,\dots,θ^*_{2n},θ^*_{2n+1})=(0,\dots,0,1)$ is
$$
H_θ=\mathrm{diag}(\underbrace{1,\dots,1}_{n\ \text{times}},\underbrace{-1,\dots,-1}_{n\ \text{times}},0) . \tag{42}
$$
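A quick numerical check of Eq. (42) (ours, using a simple finite-difference Hessian for $n=3$):

```python
# Finite-difference Hessian of the loss (40) at theta* = (0, ..., 0, 1) for n = 3 (N = 7).
import numpy as np

n = 3
N = 2 * n + 1

def loss(theta):
    return 0.5 * theta[-1] * (np.sum(theta[:n] ** 2) - np.sum(theta[n:2 * n] ** 2))

theta_star = np.zeros(N)
theta_star[-1] = 1.0

h = 1e-4
E = np.eye(N)
H = np.zeros((N, N))
for i in range(N):
    for j in range(N):
        H[i, j] = (loss(theta_star + h * (E[i] + E[j])) - loss(theta_star + h * E[i])
                   - loss(theta_star + h * E[j]) + loss(theta_star)) / h ** 2

print(np.round(H, 3))   # approximately diag(1, 1, 1, -1, -1, -1, 0), cf. Eq. (42)
```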
[Figure: panels (a) and (c) show the deviations of the ensemble means $⟨(H_{α,β})_{ij}⟩$ from their expected values $\mathbb{E}[(H_{α,β})_{ij}]$ as a function of the number of samples $S$, for $(i,j)=(1,1)$, $(2,2)$, and $(1,2)=(2,1)$; all curves converge to zero as $S$ increases.]
[Image: hessian_alpha_beta_right.png — Fig. 7 (b,d): ensemble means of the principal curvatures ⟨κ_±^{α,β}⟩ (solid black) and ⟨κ̃_±^{α,β}⟩ (dashed red) as a function of the number of samples S.]
Figure 7: Convergence of the ensemble mean (43) of Hessian elements and curvature measures as a function of the number of random projections $S$. (a,c) The deviation of the ensemble means $⟨(H_{α,β})_{ij}⟩$ ($i,j∈\{1,2\}$) of Hessian elements from the corresponding expected values as a function of $S$. Notice that the expected value of the diagonal elements $(H_{α,β})_{11}$ and $(H_{α,β})_{22}$ is equal to $\bar{κ}^{α,β}$ (i.e., to the sum of principal curvatures in the original space) [see Eqs. (36) and (37)]. A relatively large number of random projections between $10^3$ and $10^4$ is required to keep the deviations at values smaller than about 2–4. (b,d) The ensemble means $⟨κ_±^{α,β}⟩$ [see Eq. (35)] and $⟨\tilde{κ}_±^{α,β}⟩$ [see Eq. (44)] as a function of $S$. Dashed grey lines represent $\bar{κ}^{α,β}=\mathrm{tr}(H_θ)$. In panels (a,b) and (c,d), the $N$-dimensional loss functions are given by Eqs. (40) and (45), respectively. We evaluate the corresponding Hessians (42) and (46) at the saddle point $θ^*=(θ^*_1,\dots,θ^*_{2n},θ^*_{2n+1})=(0,\dots,0,1)$. In both loss functions, we set $n=500$ and in loss function (45) we set $\tilde{n}=800$.
Because $H_θ$ has positive and negative eigenvalues, the critical point is a saddle. The corresponding principal curvatures are $κ_i^θ∈\{-1,1\}$ ($i∈\{1,\dots,N-1\}$) and $κ_N^θ=0$. In this example, the mean curvature $H$, as defined in Eq. (38), is equal to 0. According to Eq. (36), the principal curvature $\bar{κ}^{α,β}$ associated with the expected, dimension-reduced Hessian $H_{α,β}$ is also equal to 0, erroneously indicating an apparently flat loss landscape if one were to use $\bar{κ}^{α,β}$ as the main measure of curvature. To compare the convergence of different curvature measures as a function of the number of loss projections $S$, we will now study the ensemble mean
$$
⟨X⟩=\frac{1}{S}∑_{k=1}^{S}X^{(k)} \tag{43}
$$
of different quantities of interest $X$ such as Hessian elements and principal curvature measures in dimension-reduced space. Here, $X^{(k)}$ is the $k$-th realization (or sample) of $X$.
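As a minimal illustration of the ensemble mean (43), the following Python sketch estimates the elements of the dimension-reduced Hessian $H_{α,β}$ by Monte Carlo sampling. It assumes that, for standard-normal directions $η,δ$, the elements are the quadratic forms $A=η^⊤H_θη$, $B=η^⊤H_θδ$, and $C=δ^⊤H_θδ$ (an assumption consistent with the expected values quoted above; the exact normalization in Eq. (36) may differ) and that the Hessian (42) is $\mathrm{diag}(1,\dots,1,-1,\dots,-1,0)$ with $n$ positive and $n$ negative entries.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 500
# assumed diagonal of the Hessian (42): n curvatures +1, n curvatures -1, one flat direction
d = np.concatenate([np.ones(n), -np.ones(n), [0.0]])
N = d.size  # N = 2n + 1

S = 10_000
A, B, C = np.empty(S), np.empty(S), np.empty(S)
for k in range(S):
    eta, delta = rng.standard_normal(N), rng.standard_normal(N)
    A[k] = np.sum(d * eta**2)       # (H_{alpha,beta})_{11}; E[A] = tr(H_theta) = 0
    B[k] = np.sum(d * eta * delta)  # (H_{alpha,beta})_{12} = (H_{alpha,beta})_{21}; E[B] = 0
    C[k] = np.sum(d * delta**2)     # (H_{alpha,beta})_{22}; E[C] = tr(H_theta) = 0

# ensemble means (43) after S samples; plotting the running means against S
# reproduces the qualitative convergence behavior of Fig. 7 (a)
print(A.mean(), B.mean(), C.mean())
```

Under these assumptions, the printed deviations from zero should be of order one for $S=10^4$, in line with the convergence behavior reported in Fig. 7 (a).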
[Image: curvature_pdfs_left.png — Fig. 8 (a): probability density functions of the principal curvatures κ_-^{α,β} (red) and κ_+^{α,β} (black).]
[Image: curvature_pdfs_right.png — Fig. 8 (b): probability density functions of the principal curvatures κ_-^{α,β} (red) and κ_+^{α,β} (black).]
Figure 8: Distribution of principal curvatures $κ_-^{α,β}$ (red bars) and $κ_+^{α,β}$ (black bars). In panels (a) and (b), the loss functions are given by Eqs. (40) and (45), respectively. We evaluate the corresponding Hessians (42) and (46) at the saddle point $θ^*=(θ^*_1,\dots,θ^*_{2n},θ^*_{2n+1})=(0,\dots,0,1)$. In both loss functions, we set $n=500$ and in loss function (45) we set $\tilde{n}=800$. In panel (a), the probability $\Pr(κ_+^{α,β}κ_-^{α,β}>0)$ that the critical point in the lower-dimensional, random projection does not appear as a saddle is about 0.3, and it is 1 in panel (b). Histograms are based on 10,000 random projections used to compute $κ_±^{α,β}$. Solid grey lines are Gaussian approximations of the empirical distributions.
We first study the dependence of the ensemble means $⟨(H_{α,β})_{ij}⟩$ ($i,j∈\{1,2\}$) of elements of the dimension-reduced Hessian $H_{α,β}$ on the number of samples $S$. According to Eq. (36), the diagonal elements of $\mathds{E}[H_{α,β}]$ are proportional to the mean curvature of the original high-dimensional space and are thus useful for examining curvature properties of high-dimensional loss functions. Figure 7 (a) shows the convergence of the ensemble means $⟨(H_{α,β})_{ij}⟩$ toward the expected values $\mathds{E}[(H_{α,β})_{ij}]$ as a function of $S$. Note that $\mathds{E}[(H_{α,β})_{ij}]=0$ for all $i,j$. For a few dozen loss projections, the deviations of some of the ensemble means from the corresponding expected values exceed 20. A relatively large number of loss projections $S$ between $10^3$ and $10^4$ is required to keep these deviations smaller than about 2–4. The solid black and red lines in Fig. 7 (b), respectively, show the ensemble means $⟨κ_±^{α,β}⟩$ and
$$
⟨\tilde{κ}_±^{α,β}⟩=\frac{1}{2}\left(⟨A⟩+⟨C⟩±\sqrt{4⟨B⟩^2+(⟨A⟩-⟨C⟩)^2}\right) \tag{44}
$$
as a function of $S$ . Since $\mathds{E}[A]=\mathds{E}[C]={(H_θ)^i}_i$ and $\mathds{E}[B]=0$ [see Eq. (36)], we have that $⟨\tilde{κ}_±^α,β⟩=\bar{κ}^α,β$ in the limit $S→∞$ . In the current example, the ensemble means $⟨\tilde{κ}_±^α,β⟩$ thus approach $\bar{κ}^α,β=0$ for large numbers of samples $S$ , represented by the dashed red lines in Fig. 7 (b). The ensemble means $⟨κ_±^α,β⟩$ converge towards values of opposite sign, indicating a saddle point.
For a sample size of $S=10^4$, we show the distribution of the principal curvatures $κ_±^{α,β}$ in Fig. 8 (a). We observe that the distributions are plausibly Gaussian. We also calculate the probability $\Pr(κ_+^{α,β}κ_-^{α,β}>0)$ that the critical point in the lower-dimensional, random projection does not appear as a saddle (i.e., $κ_+^{α,β}κ_-^{α,β}>0$). For the example shown in Fig. 8 (a), we find that $\Pr(κ_+^{α,β}κ_-^{α,β}>0)≈ 0.3$. That is, in about 30% of the simulated projections, the lower-dimensional loss landscape wrongly suggests that the critical point is not a saddle.
Our first example, which is based on the loss function (40), shows that the principal curvatures in the lower-dimensional representation of $L(θ)$ may capture the saddle behavior in the original space if one computes ensemble means $⟨κ_±^{α,β}⟩$ in the lower-dimensional space [see Fig. 7 (b)]. However, if one first calculates ensemble means of the elements of the dimension-reduced Hessian $H_{α,β}$ to infer $⟨\tilde{κ}_±^{α,β}⟩$, the loss landscape appears to be flat in this example. We thus conclude that different ways of computing ensemble means (either before or after calculating the principal curvatures) may lead to different results with respect to the “flatness” of a dimension-reduced loss landscape.
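To make the order of averaging concrete, the following sketch (under the same assumptions as in the previous code example) compares $⟨κ_±^{α,β}⟩$, obtained by averaging the per-sample eigenvalues of the $2\times 2$ reduced Hessian, with $\tilde{κ}_±^{α,β}$ of Eq. (44), obtained from the averaged elements $⟨A⟩$, $⟨B⟩$, and $⟨C⟩$.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 500
d = np.concatenate([np.ones(n), -np.ones(n), [0.0]])  # assumed diagonal of the Hessian (42)
N = d.size

S = 10_000
A, B, C = np.empty(S), np.empty(S), np.empty(S)
kappa = np.empty((S, 2))  # per-sample principal curvatures (kappa_-, kappa_+)
for k in range(S):
    eta, delta = rng.standard_normal(N), rng.standard_normal(N)
    A[k], B[k], C[k] = np.sum(d * eta**2), np.sum(d * eta * delta), np.sum(d * delta**2)
    kappa[k] = np.linalg.eigvalsh(np.array([[A[k], B[k]], [B[k], C[k]]]))

# averaging AFTER computing curvatures: the two means have opposite signs (the saddle is visible)
print("<kappa_->, <kappa_+>:", kappa[:, 0].mean(), kappa[:, 1].mean())

# averaging BEFORE computing curvatures [Eq. (44)]: both values stay close to zero (apparently flat)
disc = np.sqrt(4.0 * B.mean() ** 2 + (A.mean() - C.mean()) ** 2)
print("kappa_tilde_-, kappa_tilde_+:",
      0.5 * (A.mean() + C.mean() - disc), 0.5 * (A.mean() + C.mean() + disc))
```

The exact magnitudes depend on the assumed normalization, but the qualitative contrast between the two averaging orders mirrors the distinction drawn above and the behavior shown in Fig. 7 (b).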
**Unequal number of curvature directions**
In the next example, we will show that random projections cannot identify certain saddle points regardless of the underlying averaging process. We consider the loss function
$$
L(θ)=\frac{1}{2}θ_{2n+1}\left(∑_{i=1}^{\tilde{n}}θ_i^2-∑_{i=\tilde{n}+1}^{2n}θ_i^2\right) , \quad n∈ℤ_+ , \; n<\tilde{n}≤ 2n , \tag{45}
$$
where we use the convention $∑_{i=a}^{b}(·)=0$ if $a>b$. The Hessian at the critical point $(θ^*_1,\dots,θ^*_{2n},θ^*_{2n+1})=(0,\dots,0,1)$ is
$$
H_θ=\mathrm{diag}(\underbrace{1,\dots,1}_{\tilde{n}\text{ times}},\underbrace{-1,\dots,-1}_{2n-\tilde{n}\text{ times}},0) . \tag{46}
$$
As in the previous example, the critical point is again a saddle, but the mean curvature is $H=2(\tilde{n}-n)/N>0$. In the following numerical experiments, we set $n=500$ and $\tilde{n}=800$. Figure 7 (c) shows that the ensemble means $⟨(H_{α,β})_{ij}⟩$ converge towards the expected values $\mathds{E}[(H_{α,β})_{ij}]$ as the number of samples increases. We again observe that a relatively large number of random loss projections $S$ between $10^3$ and $10^4$ is required to keep the deviations of the ensemble means from their corresponding expected values small. Because of the dominance of positive principal curvatures $κ_i^θ$ in the original space, the corresponding ensemble means of principal curvatures (i.e., $⟨κ_±^{α,β}⟩$ and $⟨\tilde{κ}_±^{α,β}⟩$) in the lower-dimensional representation approach positive values [see Fig. 7 (d)]. The distribution of $κ_±^{α,β}$ indicates that the probability of observing a saddle in the lower-dimensional loss landscape is vanishingly small [see Fig. 8 (b)]. In this second example, both ways of computing ensemble means, before and after calculating the lower-dimensional principal curvatures, mistakenly suggest that the saddle in the original space is a minimum in dimension-reduced space.
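In the sketches above, switching to the loss function (45) amounts to replacing the assumed Hessian diagonal; a quick check of its trace (a hedged illustration, not the authors' code):

```python
import numpy as np

n, n_tilde = 500, 800
# assumed diagonal of the Hessian (46): n_tilde entries +1, 2n - n_tilde entries -1, one 0
d = np.concatenate([np.ones(n_tilde), -np.ones(2 * n - n_tilde), [0.0]])
print(d.sum())  # tr(H_theta) = 2 * (n_tilde - n) = 600
```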
To summarize, for both loss functions (40) and (45), the saddle point $θ^*=(0,\dots,0,1)$ in the original loss function $L(θ)$ is often misrepresented in lower-dimensional representations $L(θ+αη+βδ)$ if random directions are used. Depending on (i) the employed curvature measure and (ii) the index of the underlying Hessian $H_θ$ in the original space, the saddle $θ^*=(0,\dots,0,1)$ may appear erroneously as either a minimum, a maximum, or an almost flat region.
If the critical point were a minimum or maximum (i.e., a critical point associated with a positive definite or negative definite Hessian $H_θ$), it would be correctly represented in the corresponding expected random projection because the sign of its principal curvature $\bar{κ}^{α,β}$, which is proportional to the sum of all eigenvalues $κ_i^θ$ of the Hessian $H_θ$ in the original loss space, would be equal to the sign of the principal curvatures $κ_i^θ$. However, such points are scarce in high-dimensional loss spaces [140, 141].
Finally, because random projections may fail to correctly identify saddle information, it does not appear advisable to use “flatness” around a critical point in a lower-dimensional random loss projection as a measure of generalization error [122].
**Curvature-based Hessian trace estimation**
[Image: trace_estimate_top.png — Fig. 9 (a): Hutchinson (⟨z^⊤H_θz⟩) and curvature-based (⟨κ^α⟩) estimates of tr(H_θ) as a function of the number of samples S.]
[Image: trace_estimate_bottom.png — Fig. 9 (b): Hutchinson (⟨z^⊤H_θz⟩) and curvature-based (⟨κ^α⟩) estimates of tr(H_θ) as a function of the number of samples S.]
Figure 9: Estimating the trace of the Hessian $H_θ$. In panels (a) and (b), the loss functions are given by Eqs. (40) and (45), respectively. We evaluate the corresponding Hessians (42) and (46) at the saddle point $θ^*=(θ^*_1,\dots,θ^*_{2n},θ^*_{2n+1})=(0,\dots,0,1)$. In both loss functions, we set $n=500$ and in loss function (45) we set $\tilde{n}=800$. Solid black and dash-dotted red lines represent Hutchinson [$⟨z^⊤H_θz⟩$; see Eq. (39)] and curvature-based ($⟨κ^α⟩$) estimates of $\mathrm{tr}(H_θ)$, respectively. We compute ensemble means $⟨·⟩$ as defined in Eq. (43) for different numbers of random projections $S$. The trace estimates in panels (a) and (b), respectively, converge towards the true trace values $\mathrm{tr}(H_θ)=0$ and $\mathrm{tr}(H_θ)=600$, which are indicated by dashed grey lines. In both methods, the same random vectors with elements that are distributed according to a standard normal distribution are used. For the curvature-based estimation of $\mathrm{tr}(H_θ)$, we perform least-squares fits of $L(θ^*+αη)$ over an interval $α∈[-0.05,0.05]$.
In accordance with Sec. 1, we now use the loss functions (40) and (45) to compare the convergence behavior of the original Hutchinson method (39) and the curvature-based trace estimation (37). Instead of two random directions, we use one random Gaussian direction $η$ and perform least-squares fits for 50 equidistant values of $α$ in the interval $[-0.05,0.05]$ to extract estimates of $\mathrm{tr}(H_θ)$ from $L(θ^*+αη)$. We use the same random Gaussian directions in Hutchinson’s method.
Figure 9 shows how the Hutchinson and curvature-based trace estimates converge towards the true trace values, 0 for the loss function (40) and 600 for the loss function (45) with $n=500$ and $\tilde{n}=800$ . Given that we use the same random vectors in both methods, their convergence behavior towards the true trace value is similar.
With the curvature-based method, one can produce Hutchinson-type trace estimates without computing Hessian-vector products. However, it requires the user to specify an appropriate interval for $α$ so that the mean curvature can be properly estimated in a quadratic-approximation regime. It also requires a sufficiently large number of points in this interval. Therefore, it may be less accurate than the original Hutchinson method.
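The sketch below contrasts the two trace estimators on the loss function (45). It is an illustration under the assumptions stated earlier (diagonal Hessian (46), standard-normal directions, a quadratic least-squares fit of $L(θ^*+αη)$ over 50 equidistant points in $α∈[-0.05,0.05]$), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

n, n_tilde = 500, 800
d = np.concatenate([np.ones(n_tilde), -np.ones(2 * n - n_tilde), [0.0]])  # assumed diagonal of the Hessian (46)
theta_star = np.zeros(2 * n + 1)
theta_star[-1] = 1.0

def loss(theta):
    # loss function (45): 0.5 * theta_{2n+1} * (sum_{i<=n_tilde} theta_i^2 - sum_{n_tilde<i<=2n} theta_i^2)
    return 0.5 * theta[-1] * (np.sum(theta[:n_tilde] ** 2) - np.sum(theta[n_tilde:2 * n] ** 2))

alphas = np.linspace(-0.05, 0.05, 50)
S = 1000
hutchinson, curvature_based = np.empty(S), np.empty(S)
for k in range(S):
    eta = rng.standard_normal(2 * n + 1)
    hutchinson[k] = np.sum(d * eta**2)        # z^T H_theta z, Eq. (39)
    values = np.array([loss(theta_star + a * eta) for a in alphas])
    coeffs = np.polyfit(alphas, values, 2)    # quadratic least-squares fit of the 1D loss projection
    curvature_based[k] = 2.0 * coeffs[0]      # twice the alpha^2 coefficient approximates eta^T H_theta eta

# both ensemble means approach tr(H_theta) = 600 as S grows
print("Hutchinson:      ", hutchinson.mean())
print("curvature-based: ", curvature_based.mean())
```

Because the same random vectors feed both estimators, their running means track each other closely, as in Fig. 9 (b); the curvature-based estimate additionally depends on the chosen fit interval and number of fit points.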
#### 3 Hessian directions
[Image: loss_projection_top.png — Fig. 10 (a,b): dimension-reduced loss along Hessian directions η, δ; (a) saddle (κ_+^{α,β}κ_-^{α,β} < 0), (b) minimum (κ_+^{α,β} > 0, κ_-^{α,β} > 0).]
[Image: loss_projection_bottom.png — Fig. 10 (c,d): (c) maximum along Hessian directions (κ_+^{α,β} < 0, κ_-^{α,β} < 0); (d) apparent minimum along random directions η, δ.]
Figure 10: Dimension-reduced loss $L(θ^*+αη+βδ)$ of Eq. (45) with $n=900$ and $\tilde{n}=1000$ for different directions $η,δ$. (a–c) The directions $η,δ$ correspond to eigenvectors of the Hessian $H_θ$ [see Eq. (46)] of the loss function (45). If the eigenvalues associated with $η,δ$ have different signs, the corresponding loss landscape is a saddle, as depicted in panel (a). If the eigenvalues associated with $η,δ$ have the same sign, the corresponding loss landscape is either a minimum (both signs are positive), as shown in panel (b), or a maximum (both signs are negative), as shown in panel (c). Because positive eigenvalues outnumber negative ones in $H_θ$ ($\tilde{n}=1000$ versus $2n-\tilde{n}=800$), a projection onto a dimension-reduced space spanned by random directions $η,δ$ is often associated with an apparently convex loss landscape. An example of such an apparent minimum is shown in panel (d).
Given the described shortcomings of random projections, which may incorrectly identify saddles of high-dimensional loss functions, we suggest using Hessian directions (i.e., the eigenbasis of $H_θ$) as the directions $η,δ$ in $L(θ^*+αη+βδ)$.
For Eq. (45) with $n=900$ and $\tilde{n}=1000$, we show projections along different Hessian directions in Fig. 10 (a–c). We observe that different Hessian directions indicate different types of critical points in dimension-reduced space. If the eigenvalues associated with the Hessian directions $η,δ$ have different signs, the corresponding lower-dimensional loss landscape is a saddle [see Fig. 10 (a)]. If both eigenvalues have the same sign, the loss landscape is either a minimum [see Fig. 10 (b): both signs are positive] or a maximum [see Fig. 10 (c): both signs are negative]. If one instead uses a random projection, the resulting lower-dimensional loss landscape often appears to be a minimum in this example [see Fig. 10 (d)]. To quantify the proportion of random projections that correctly identify the saddle with $n=900$ and $\tilde{n}=1000$, we generated 10,000 realizations of $κ_±^{α,β}$ [see Eq. (35)]. We find that the signs of $κ_±^{α,β}$ were different in only about 0.5% of all simulated realizations. That is, in this example the principal curvatures $κ_±^{α,β}$ indicate a saddle in only about 0.5% of the studied projections.
In accordance with related works that use Hessian information [142, 143], our proposal uses Hessian-vector products (HVPs). We first compute the largest-magnitude eigenvalue and then use an annihilation method [144] to compute the second largest-magnitude eigenvalue of opposite sign (see Algorithm 1). The corresponding eigenvectors are the dominant Hessian directions. Other deflation techniques can be used to find additional Hessian directions.
We employ HVPs to compute Hessian directions without an explicit representation of $H_θ$ using the identity
$$
∇_θ[(∇_θL)^⊤v]=(∇_θ∇_θL)v+(∇_θL)^⊤∇_θv=H_θv . \tag{47}
$$
In the first step, the gradient $(∇_θL)^⊤$ is computed using reverse-mode automatic differentiation (AD), and the scalar product $[(∇_θL)^⊤v]$ is then formed. In the second step, we again apply reverse-mode AD to the computational graph associated with the scalar product $[(∇_θL)^⊤v]$. Because the vector $v$ does not depend on $θ$ (i.e., $∇_θv=0$), the result is $H_θv$. One may also use forward-mode AD in the second step to obtain a more memory-efficient implementation.
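A hedged sketch of this double reverse-mode construction, written with PyTorch (the chapter does not prescribe a particular AD framework), evaluates $H_θv$ without ever forming $H_θ$ explicitly:

```python
import torch

def hvp(loss_fn, theta, v):
    """Hessian-vector product H_theta @ v via two reverse-mode passes [Eq. (47)]."""
    theta = theta.detach().requires_grad_(True)
    # first pass: gradient of the loss, kept on the computational graph
    grad, = torch.autograd.grad(loss_fn(theta), theta, create_graph=True)
    # scalar product (grad L)^T v; v is a constant with respect to theta
    scalar = torch.dot(grad, v)
    # second pass: differentiate the scalar product, which yields H_theta @ v
    hv, = torch.autograd.grad(scalar, theta)
    return hv

# toy check with the loss function (45) for small, assumed sizes n = 2, n_tilde = 3
n, n_tilde = 2, 3
def loss_fn(theta):
    return 0.5 * theta[-1] * (theta[:n_tilde].pow(2).sum() - theta[n_tilde:2 * n].pow(2).sum())

theta_star = torch.zeros(2 * n + 1)
theta_star[-1] = 1.0
v = torch.randn(2 * n + 1)
print(hvp(loss_fn, theta_star, v))  # equals diag(1, 1, 1, -1, 0) @ v at the saddle point
```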
Algorithm 1 Compute dominant Hessian directions
1: $L_1$ = LinearOperator((N, N), matvec=hvp) ▷ initialize linear operator for HVP calculation
2: eigval1, eigvec1 = solve_lm_evp($L_1$) ▷ compute largest-magnitude eigenvalue and corresponding eigenvector associated with operator $L_1$
3: shifted_hvp(vec) = hvp(vec) - eigval1*vec ▷ define shifted HVP
4: $L_2$ = LinearOperator((N, N), matvec=shifted_hvp) ▷ initialize linear operator for shifted HVP calculation
5: eigval2, eigvec2 = solve_lm_evp($L_2$) ▷ compute largest-magnitude eigenvalue and corresponding eigenvector associated with operator $L_2$
6: eigval2 += eigval1
7: if eigval1 >= 0 then
8:  maxeigval, maxeigvec = eigval1, eigvec1
9:  mineigval, mineigvec = eigval2, eigvec2
10: else
11:  maxeigval, maxeigvec = eigval2, eigvec2
12:  mineigval, mineigvec = eigval1, eigvec1
13: end if
14: return maxeigval, maxeigvec, mineigval, mineigvec
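The listing below is a minimal SciPy rendering of Algorithm 1. It assumes that `hvp(v)` returns $H_θv$ as a NumPy array (for instance via the double-backward construction above) and uses `scipy.sparse.linalg.eigsh` with `which="LM"` in place of the largest-magnitude eigensolver `solve_lm_evp`; it is an illustrative sketch rather than the authors' code.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

def dominant_hessian_directions(hvp, N):
    """Return (maxeigval, maxeigvec, mineigval, mineigvec) of the Hessian [Algorithm 1]."""
    # steps 1-2: largest-magnitude eigenpair of H_theta
    L1 = LinearOperator((N, N), matvec=hvp)
    val1, vec1 = eigsh(L1, k=1, which="LM")
    eigval1, eigvec1 = val1[0], vec1[:, 0]

    # steps 3-6: annihilation step; shift the spectrum by eigval1 and undo the shift afterwards
    L2 = LinearOperator((N, N), matvec=lambda v: hvp(v) - eigval1 * v)
    val2, vec2 = eigsh(L2, k=1, which="LM")
    eigval2, eigvec2 = val2[0] + eigval1, vec2[:, 0]

    # steps 7-14: order the two eigenpairs according to the sign of eigval1
    if eigval1 >= 0:
        return eigval1, eigvec1, eigval2, eigvec2
    return eigval2, eigvec2, eigval1, eigvec1

# toy check with the diagonal Hessian (46) for n = 500, n_tilde = 800
n, n_tilde = 500, 800
d = np.concatenate([np.ones(n_tilde), -np.ones(2 * n - n_tilde), [0.0]])
maxval, _, minval, _ = dominant_hessian_directions(lambda v: d * v, d.size)
print(maxval, minval)  # approximately +1 and -1, the dominant Hessian directions' eigenvalues
```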
**Image classification**
Figure 11: Loss landscape projections for ResNet-56. (a,c) The projection directions $η,δ$ are given by the eigenvectors associated with the largest and smallest eigenvalues of the Hessian $H_θ$ , respectively. The zoomed inset in panel (c) shows the loss landscape for $(α,β)∈[-0.01,0.005]×[-0.05,0.05]$ . We observe a decreasing loss along the negative $β$ -axis. (b,d) The projection directions $η,δ$ are given by random vectors. The domains of $(α,β)$ in panels (a,b) and (c,d) are $[-1,1]×[-1,1]$ and $[-0.05,0.05]×[-0.05,0.05]$ , respectively. All shown cross-entropy loss landscapes are based on evaluating the CIFAR-10 training dataset that consists of 50,000 images.
As an illustration of a loss landscape of an ANN employed in image classification, we focus on the ResNet-56 architecture that has been trained as detailed in Ref. 122 on CIFAR-10 using stochastic gradient descent (SGD) with Nesterov momentum. The number of parameters of this ANN is 855,770. The training and test losses at the local optimum found by SGD are $9.20×10^{-4}$ and 0.29, respectively; the corresponding accuracies are 100.00% and 93.66%, respectively.
In accordance with Ref. 122, we apply filter normalization to random directions. This and related normalization methods are often employed when generating random projections. One reason is that simply adding random vectors to the parameters of a neural-network loss function (or the parameters of other functions) ignores the differing scales of the parameters associated with different parts of that function. As a result, random perturbations may be too small or too large to properly resolve the influence of certain parameters on a given function.
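For completeness, a minimal sketch of filter normalization in the spirit of Ref. 122 is shown below; each filter (or row, for fully connected layers) of the random direction is rescaled to the norm of the corresponding filter of the trained parameters, and all names are illustrative rather than taken from a specific implementation.

```python
import torch

def filter_normalize(direction, params, eps=1e-10):
    """Rescale each filter of a random direction so that it has the same norm
    as the corresponding filter of the trained parameters (cf. Ref. 122)."""
    normalized = []
    for d, w in zip(direction, params):
        d = d.clone()
        if d.dim() <= 1:
            # Biases and normalization parameters are commonly set to zero.
            d.zero_()
        else:
            # The leading dimension indexes filters (convolutions) or rows (dense layers).
            for i in range(d.shape[0]):
                d[i] *= w[i].norm() / (d[i].norm() + eps)
        normalized.append(d)
    return normalized

# Illustrative usage for a model with parameter list params (hypothetical):
# direction = filter_normalize([torch.randn_like(p) for p in params], params)
```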
When calculating Hessian directions, we are directly taking into account the parameterization of the underlying functions that we want to visualize. Therefore, there is no need for an additional rescaling of different parts of the perturbation vector. Still, reparameterizations of a neural network can result in changes of curvature properties (see, e.g., Theorem 4 in Ref. 120).
Figure 11 shows two-dimensional projections of the loss function (cross-entropy loss) around a local critical point. The smallest and largest eigenvalues are $-16.4$ and $5007.9$, respectively. This means that the critical point found by SGD is a saddle whose largest negative curvature is, in magnitude, more than two orders of magnitude smaller than the largest positive curvature at that point. The saddle point is clearly visible in Fig. 11 (a,c). We observe in the zoomed inset in Fig. 11 (c) that the loss decreases along the negative $β$-axis.
If random directions are used, the corresponding projections indicate that the optimizer converged to a local minimum and not to a saddle point [see Fig. 11 (b,d)]. Overall, the ResNet-56 visualizations that we show in Fig. 11 exhibit structural similarities to those that we generated using the simple loss model (45) [see Fig. 10 (a,d)].
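The projected surfaces in Figs. 10 and 11 are obtained by evaluating the loss on a grid of perturbations of the trained parameters along the two chosen directions. A minimal sketch of this evaluation is given below; `loss_fn`, `theta_star`, `eta`, and `delta` are assumed to be a scalar loss of a flattened parameter vector, the trained parameters, and the two (Hessian or random) directions, respectively.

```python
import numpy as np
import torch

def projected_loss_surface(loss_fn, theta_star, eta, delta, alphas, betas):
    """Evaluate L(theta* + alpha*eta + beta*delta) on a two-dimensional grid."""
    surface = np.zeros((len(alphas), len(betas)))
    with torch.no_grad():
        for i, a in enumerate(alphas):
            for j, b in enumerate(betas):
                theta = theta_star + a * eta + b * delta
                surface[i, j] = loss_fn(theta).item()
    return surface

# Example grid matching the outer panels of Fig. 11 (illustrative):
# alphas = betas = np.linspace(-1.0, 1.0, 51)
# surface = projected_loss_surface(loss_fn, theta_star, eta, delta, alphas, betas)
```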
Function approximation
Figure 12: Heatmaps of the mean squared error (MSE) loss along random and dominant Hessian directions for a function-approximation task. (a–d) Loss heatmaps for a fully connected neural network (FCNN) with 2 layers and 100 ReLU activations per layer [(a,c): training data, (b,d): test data]. (e–h) Loss heatmaps for an FCNN with 10 layers and 100 ReLU activations per layer [(e,g): training data, (f,h): test data]. Random directions are used in panels (a,b,e,f) while Hessian directions are used in panels (c,d,g,h). Green and red regions indicate small and large mean squared error (MSE) loss values, respectively. Black crosses indicate the positions of loss minima in the shown domain. Both neural networks, FCNN 1 and FCNN 2, are trained to approximate the smooth one-dimensional function (48).
As another example, we compare loss function visualizations that are based on random and Hessian directions in a function-approximation task. In accordance with Ref. 145, we consider the smooth one-dimensional function
$$
f(x)=\log(\sin(10x)+2)+\sin(x) , \tag{48}
$$
where $x∈[-1,1)$. To approximate $f(x)$, we use two fully connected neural networks (FCNNs) with 2 and 10 layers, respectively. Each layer has 100 ReLU activations. The numbers of parameters of the 2- and 10-layer architectures are 20,501 and 101,301, respectively. The training data is based on 50 points that are sampled uniformly at random from the interval $[-1,1)$. We train both ANNs using a mean squared error (MSE) loss function and SGD with a learning rate of 0.1. The 2- and 10-layer architectures are trained for 100,000 and 50,000 epochs, respectively, to reach loss values of less than $10^{-4}$. The best model was saved and evaluated by calculating the MSE loss for 1,000 points that were sampled uniformly at random from the interval $[-1,1)$. We present an animation illustrating the evolution of random and Hessian loss projections during training at https://vimeo.com/787174225.
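A minimal PyTorch sketch of this setup is given below for the smaller network; the layer arrangement is our assumption, chosen so that the parameter count matches the stated 20,501, and the training loop mirrors the procedure described above.

```python
import torch
import torch.nn as nn

# Target function from Eq. (48).
f = lambda x: torch.log(torch.sin(10 * x) + 2) + torch.sin(x)

# Training data: 50 points sampled uniformly at random from [-1, 1).
x_train = 2 * torch.rand(50, 1) - 1
y_train = f(x_train)

# One possible FCNN 1 architecture consistent with 20,501 parameters (assumption).
model = nn.Sequential(
    nn.Linear(1, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 1),
)

loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(100_000):
    optimizer.zero_grad()
    loss = loss_fn(model(x_train), y_train)
    loss.backward()
    optimizer.step()
```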
Figure 12 shows heatmaps of the training and test loss landscapes of both ANNs along random and dominant Hessian directions. Black crosses in Fig. 12 indicate loss minima. In line with the previous example, which focused on an image classification ANN, we observe for both FCNNs that random projections are associated with loss values that increase along both directions $δ,η$ and for both training and test data [see Fig. 12 (a,b,e,f)]. For these projections, we find that the loss minima are very close to or at the origin of the loss space. The situation is different in the loss projections that are based on Hessian directions. Figure 12 (c) shows that the loss minimum for the 2-layer FCNN is not located at the origin but at $(α,β)≈(0,0.04)$ .
We find that the value of the training loss at that point is more than 9% smaller than the smallest training loss found in a random projection plot. The corresponding test loss is about 1% smaller than the test loss associated with the smallest training loss in the random projection plot. For the 10-layer FCNN, the smallest training loss in the Hessian direction projection is more than 26% smaller than the smallest training loss in the randomly projected loss landscape [see Fig. 12 (e,g)]. The corresponding test loss in the Hessian direction plot is about 6% smaller than the corresponding test loss minimum in the random direction plot [see Fig. 12 (f,h)]. Notice that both the training and test loss minima in the random direction heatmaps in Fig. 12 (e,f) are located at the origin while they are located at $(α,β)≈(0,-0.05)$ in the Hessian direction heatmaps in Fig. 12 (g,h).
These results show that it is possible to improve training loss values along the dominant negative-curvature direction. Smaller loss values may also be achievable with further gradient-based training iterations, so one has to weigh the tradeoffs between gradient-based and curvature-based optimization methods.
### 5 Conclusions
Over several decades, research at the intersection of statistical mechanics, neuroscience, and computer science has significantly enhanced our understanding of information processing in both living systems and machines. With increasing computing power, numerous learning algorithms conceived in the latter half of the 20th century have found widespread application in tasks such as image classification, natural language processing, and anomaly detection.
The Ising model, one of the most thoroughly studied models in statistical mechanics, shares close connections with ANNs such as Hopfield networks and Boltzmann machines. In the first part of this chapter, we described these connections and provided an illustrative example of a Boltzmann machine that learned to generate new Ising configurations based on a set of training samples.
Despite decades of active research in artificial intelligence and machine learning, providing mechanistic insights into the representational capacity and learning dynamics of modern ANNs remains challenging due to their high dimensionality and complex, non-linear structure. Visualizing the loss landscapes associated with ANNs poses a particular challenge, as one must appropriately reduce dimensionality to visualize landscapes in two or three dimensions. To put it in the words of Ker-Chau Li, “There are too many directions to project a high-dimensional data set and unguided plotting can be time-consuming and fruitless”. One approach is to employ random projections. However, this method often inaccurately depicts saddle points in high-dimensional loss landscapes as apparent minima in low-dimensional random projections.
In the second part of this chapter, we discussed various curvature measures that can be employed to characterize critical points in high-dimensional loss landscapes. Additionally, we outlined how Hessian directions (i.e., the eigenvectors of the Hessian at a critical point) can inform a more structured approach to projecting high-dimensional functions onto lower-dimensional representations.
While much of contemporary research in machine learning is driven by practical applications, incorporating theoretical contributions rooted in concepts from statistical mechanics and related disciplines holds promise for continuing to guide the development of learning algorithms.
### Acknowledgements
LB acknowledges financial support from hessian.AI and the ARO through grant W911NF-23-1-0129.
## References
- 1. L. Zdeborová and F. Krzakala, Statistical physics of inference: Thresholds and algorithms, Advances in Physics. 65 (5), 453–552 (2016).
- 2. G. Carleo, I. Cirac, K. Cranmer, L. Daudet, M. Schuld, N. Tishby, L. Vogt-Maranto, and L. Zdeborová, Machine learning and the physical sciences, Reviews of Modern Physics. 91 (4), 045002 (2019).
- 3. P. Mehta, M. Bukov, C.-H. Wang, A. G. Day, C. Richardson, C. K. Fisher, and D. J. Schwab, A high-bias, low-variance introduction to machine learning for physicists, Physics Reports. 810, 1–124 (2019).
- 4. D. J. Amit, H. Gutfreund, and H. Sompolinsky, Spin-glass models of neural networks, Physical Review A. 32 (2), 1007 (1985).
- 5. T. Ising, R. Folk, R. Kenna, B. Berche, and Y. Holovatch, The fate of Ernst Ising and the fate of his model, Journal of Physical Studies. 21, 3002 (2017).
- 6. R. Folk. The survival of Ernst Ising and the struggle to solve his model. In ed. Y. Holovatch, Order, Disorder and Criticality: Advanced Problems of Phase Transition Theory, pp. 1–77. World Scientific, Singapore (2023).
- 7. A. Rosenblueth, N. Wiener, and J. Bigelow, Behavior, purpose and teleology, Philosophy of Science. 10 (1), 18–24 (1943).
- 8. N. Wiener, Cybernetics or Control and Communication in the Animal and the Machine. MIT Press, Boston, MA, USA (1948).
- 9. E. Thorndike, Fundamentals of Learning. Teachers College, Columbia University, New York City, NY, USA (1932).
- 10. I. S. Berkeley, The curious case of connectionism, Open Philosophy. 2 (1), 190–205 (2019).
- 11. M. Minsky and S. Papert, Perceptrons: An Introduction to Computational Geometry, 2nd edn. MIT Press, Cambridge, MA, USA (1972).
- 12. S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 3rd edn. Prentice Hall, Upper Saddle River, NJ, USA (2009).
- 13. V. Vapnik and A. Chervonenkis, Theory of Pattern Recognition: Statistical Learning Problems. Nauka, Moscow, Russia (1974). In Russian.
- 14. T. M. Mitchell, Machine Learning. McGraw-Hill, New York City, NY, USA (1997).
- 15. W. S. McCulloch and W. Pitts, A logical calculus of the ideas immanent in nervous activity, The Bulletin of Mathematical Biophysics. 5, 115–133 (1943).
- 16. T. H. Abraham, Nicolas Rashevsky’s mathematical biophysics, Journal of the History of Biology. 37 (2), 333–385 (2004).
- 17. A. Turing, On computable numbers with an application to the Entscheidungsproblem, Proceedings of the London Mathematical Society. 58, 230–265 (1937).
- 18. S. C. Kleene, Representation of Events in Nerve Nets and Finite Automata, In eds. C. E. Shannon and J. McCarthy, Automata Studies. (AM-34), vol. 34, pp. 3–42. Princeton University Press, Princeton, NJ, USA (1956).
- 19. D. O. Hebb, The organization of behavior: A neuropsychological theory. Wiley & Sons, New York City, NY, USA (1949).
- 20. S. Löwel and W. Singer, Selection of intrinsic horizontal connections in the visual cortex by correlated neuronal activity, Science. 255 (5041), 209–212 (1992).
- 21. F. Rosenblatt, The perceptron: a probabilistic model for information storage and organization in the brain, Psychological Review. 65 (6), 386 (1958).
- 22. F. Rosenblatt et al., Principles of neurodynamics: Perceptrons and the theory of brain mechanisms. vol. 55, Spartan Books, Washington, DC, USA (1962).
- 23. B. Widrow and M. Hoff. Adaptive switching circuits. In 1960 IRE WESCON Convention Record, pp. 96–104 (1960).
- 24. S. Amari, A theory of adaptive pattern classifiers, IEEE Transactions on Electronic Computers. EC-16 (3), 299–307 (1967).
- 25. A. G. Ivakhnenko and V. G. Lapa, Cybernetics and Forecasting Techniques. American Elsevier Publishing Company, Philadelphia, PA, USA (1967).
- 26. A. G. Ivakhnenko, Polynomial theory of complex systems, IEEE Transactions on Systems, Man, and Cybernetics. (4), 364–378 (1971).
- 27. H. J. Kelley, Gradient theory of optimal flight paths, Ars Journal. 30 (10), 947–954 (1960).
- 28. S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a taylor expansion of the local rounding errors. Master’s thesis, University of Helsinki (1970).
- 29. S. Linnainmaa, Taylor expansion of the accumulated rounding error, BIT Numerical Mathematics. 16 (2), 146–160 (1976).
- 30. P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In eds. R. F. Drenick and F. Kozin, System Modeling and Optimization, pp. 762–770, Springer Berlin Heidelberg, Berlin, Heidelberg, DE (1982).
- 31. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning representations by back-propagating errors, Nature. 323 (6088), 533–536 (1986).
- 32. A. J. Noest. Phasor neural networks. In ed. D. Z. Anderson, Neural Information Processing Systems, Denver, Colorado, USA, 1987, pp. 584–591, American Institue of Physics (1987).
- 33. A. J. Noest, Discrete-state phasor neural networks, Physical Review A. 38 (4), 2196 (1988).
- 34. A. J. Noest, Associative memory in sparse phasor neural networks, Europhysics Letters. 6 (5), 469 (1988).
- 35. M. Kobayashi, Exceptional reducibility of complex-valued neural networks, IEEE Transactions on Neural Networks. 21 (7), 1060–1072 (2010).
- 36. T. Minemoto, T. Isokawa, H. Nishimura, and N. Matsui, Quaternionic multistate Hopfield neural network with extended projection rule, Artificial Life and Robotics. 21, 106–111 (2016).
- 37. H. Zhang, M. Gu, X. Jiang, J. Thompson, H. Cai, S. Paesani, R. Santagati, A. Laing, Y. Zhang, M. Yung, et al., An optical neural chip for implementing complex-valued neural network, Nature Communications. 12 (1), 457 (2021).
- 38. L. Böttcher and M. A. Porter, Complex networks with complex weights, Physical Review E. 109 (2), 024314 (2024).
- 39. T. Bayes, An essay towards solving a problem in the doctrine of chances, Philosophical Transactions of the Royal Society of London. 53, 370–418 (1763).
- 40. P. Laplace, Mémoire sur la probabilité des causes par les événements, Mémoires de Mathématique et de Physique Présentés à l'Académie Royale des Sciences, par Divers Savans, Lûs dans ses Assemblées. 6, 621–656 (1774).
- 41. T. Seidenfeld, R. A. Fisher’s fiducial argument and Bayes’ theorem, Statistical Science. 7, 358–368 (1992).
- 42. T. Seidenfeld. Forbidden fruit: When Epistemic Probability may not take a bite of the Bayesian apple. In eds. W. Harper and G. Wheeler, Probability and Inference: Essays in Honor of Henry E. Kyburg, Jr. King’s College Publications, London (2007).
- 43. H. Jeffreys, The Theory of Probability, 3rd edn. Oxford Classic Texts in the Physical Sciences, Oxford University Press, Oxford, UK (1939).
- 44. C. E. Shannon, A mathematical theory of communication, The Bell System Technical Journal. 27 (3), 379–423 (1948).
- 45. E. T. Jaynes, Information theory and statistical mechanics, Physical Review. 106 (4), 620 (1957).
- 46. E. T. Jaynes, Information theory and statistical mechanics. II, Physical Review. 108 (2), 171 (1957).
- 47. E. T. Jaynes, The well-posed problem, Foundations of Physics. 3 (4), 477–493 (1973).
- 48. E. T. Jaynes, Probability Theory: The Logic of Science. Cambridge University Press, Cambridge, UK (2003).
- 49. T. Seidenfeld, Why I am not an objective Bayesian., Theory and Decision. 11 (4), 413–440 (1979).
- 50. G. Wheeler, Objective Bayesianism and the problem of non-convex evidence, The British Journal for the Philosophy of Science. 63 (3), 841–50 (2012).
- 51. I. Hacking. Let’s not talk about objectivity. In eds. F. Padovani, A. Richardson, and J. Y. Tsou, Objectivity in Science, vol. 310, Boston Studies in the Philosophy and History of Science, pp. 19–33. Springer, Cham, CH (2015).
- 52. A. Gelman, Beyond subjective and objective in statistics, Journal of the Royal Statistical Society. Series A. 180 (4), 967–1033 (2017).
- 53. R. Haenni, J.-W. Romeijn, G. Wheeler, and J. Williamson. Logical relations in a statistical problem. In eds. B. Löwe, E. Pacuit, and J.-W. Romeijn, Foundations of the Formal Sciences VI: Reasoning about probabilities and probabilistic reasoning, vol. 16, Studies in Logic, pp. 49–79, College Publications, London, UK (2009).
- 54. G. de Cooman, Belief models: An order-theoretic investigation, Annals of Mathematics and Artificial Intelligence. 45, 5–34 (2005).
- 55. R. Haenni, J.-W. Romeijn, G. Wheeler, and J. Williamson, Probabilistic Logics and Probabilistic Networks. Synthese Library, Springer, Dordrecht (2011).
- 56. D. R. Cox, The regression analysis of binary sequences, Journal of the Royal Statistical Society. 20 (2), 215–242 (1958).
- 57. G. E. Hinton and D. van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In ed. L. Pitt, Proceedings of the Sixth Annual ACM Conference on Computational Learning Theory, COLT 1993, Santa Cruz, CA, USA, July 26-28, 1993, pp. 5–13, ACM (1993).
- 58. R. M. Neal, Bayesian learning for neural networks. vol. 118, Lecture Notes in Statistics, Springer-Verlag, New York City, NY, USA (1996).
- 59. S.-I. Amari, Learning patterns and pattern sequences by self-organizing nets of threshold elements, IEEE Transactions on Computers. 100 (11), 1197–1206 (1972).
- 60. J. J. Hopfield, Neural networks and physical systems with emergent collective computational abilities, Proceedings of the National Academy of Sciences of the United States of America. 79 (8), 2554–2558 (1982).
- 61. T. Isokawa, H. Nishimura, N. Kamiura, and N. Matsui, Associative memory in quaternionic Hopfield neural network, International Journal of Neural Systems. 18 (02), 135–145 (2008).
- 62. H. Ramsauer, B. Schäfl, J. Lehner, P. Seidl, M. Widrich, L. Gruber, M. Holzleitner, T. Adler, D. P. Kreil, M. K. Kopp, G. Klambauer, J. Brandstetter, and S. Hochreiter. Hopfield networks is all you need. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 (2021).
- 63. M. Benedetti, E. Ventura, E. Marinari, G. Ruocco, and F. Zamponi, Supervised perceptron learning vs unsupervised Hebbian unlearning: Approaching optimal memory retrieval in Hopfield-like networks, The Journal of Chemical Physics. 156 (10) (2022).
- 64. L. Böttcher and H. J. Herrmann, Computational Statistical Physics. Cambridge University Press, Cambridge, UK (2021).
- 65. S. E. Fahlman, G. E. Hinton, and T. J. Sejnowski. Massively Parallel Architectures for AI: NETL, Thistle, and Boltzmann Machines. In ed. M. R. Genesereth, Proceedings of the National Conference on Artificial Intelligence, Washington, DC, USA, August 22-26, 1983, pp. 109–113, AAAI Press (1983).
- 66. G. E. Hinton and T. J. Sejnowski. Analyzing cooperative computation. In Proceedings of the Fifth Annual Conference of the Cognitive Science Society, Rochester, NY, USA (1983).
- 67. G. E. Hinton and T. J. Sejnowski. Optimal perceptual inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, vol. 448, pp. 448–453 (1983).
- 68. D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, A learning algorithm for Boltzmann machines, Cognitive Science. 9 (1), 147–169 (1985).
- 69. P. Smolensky. Information processing in dynamical systems: Foundations of harmony theory. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, pp. 194–281. MIT Press, Cambridge, MA, USA (1986).
- 70. G. E. Hinton, Training products of experts by minimizing contrastive divergence, Neural Computation. 14 (8), 1771–1800 (2002).
- 71. G. E. Hinton, S. Osindero, and Y.-W. Teh, A fast learning algorithm for deep belief nets, Neural computation. 18 (7), 1527–1554 (2006).
- 72. G. E. Hinton and R. R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science. 313 (5786), 504–507 (2006).
- 73. D. Kim and D.-H. Kim, Smallest neural network to learn the Ising criticality, Physical Review E. 98 (2), 022138 (2018).
- 74. S. Efthymiou, M. J. Beach, and R. G. Melko, Super-resolving the Ising model with convolutional neural networks, Physical Review B. 99 (7), 075113 (2019).
- 75. F. D’Angelo and L. Böttcher, Learning the Ising model with generative neural networks, Physical Review Research. 2 (2), 023266 (2020).
- 76. G. Carleo and M. Troyer, Solving the quantum many-body problem with artificial neural networks, Science. 355 (6325), 602–606 (2017).
- 77. G. Torlai, G. Mazzola, J. Carrasquilla, M. Troyer, R. Melko, and G. Carleo, Neural-network quantum state tomography, Nature Physics. 14 (5), 447–450 (2018).
- 78. J. Schmidhuber, Deep learning in neural networks: An overview, Neural Networks. 61, 85–117 (2015).
- 79. J. Dean, A golden decade of deep learning: Computing systems & applications, Daedalus. 151 (2), 58–74 (2022).
- 80. S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Computation. 9 (8), 1735–1780 (1997).
- 81. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE. 86 (11), 2278–2324 (1998).
- 82. I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial nets. In eds. Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pp. 2672–2680 (2014).
- 83. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In eds. I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 5998–6008 (2017).
- 84. A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, Automatic differentiation in PyTorch (2017).
- 85. G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals and Systems. 2 (4), 303–314 (1989).
- 86. K. Hornik, M. Stinchcombe, and H. White, Multilayer feedforward networks are universal approximators, Neural Networks. 2 (5), 359–366 (1989).
- 87. K. Hornik, Approximation capabilities of multilayer feedforward networks, Neural Networks. 4 (2), 251–257 (1991).
- 88. Y.-J. Wang and C.-T. Lin, Runge-Kutta neural network for identification of dynamical systems in high accuracy, IEEE Transactions on Neural Networks. 9 (2), 294–307 (1998).
- 89. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud. Neural ordinary differential equations. In eds. S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp. 6572–6583 (2018).
- 90. E. Kaiser, J. N. Kutz, and S. L. Brunton, Data-driven discovery of Koopman eigenfunctions for control, Machine Learning: Science and Technology. 2 (3), 035023 (2021).
- 91. F. Schäfer, P. Sekatski, M. Koppenhöfer, C. Bruder, and M. Kloc, Control of stochastic quantum dynamics by differentiable programming, Machine Learning: Science and Technology. 2 (3), 035004 (2021).
- 92. L. Böttcher, N. Antulov-Fantulin, and T. Asikis, AI Pontryagin or how artificial neural networks learn to control dynamical systems, Nature Communications. 13 (1), 333 (2022).
- 93. T. Asikis, L. Böttcher, and N. Antulov-Fantulin, Neural ordinary differential equation control of dynamics on graphs, Physical Review Research. 4 (1), 013221 (2022).
- 94. L. Böttcher and T. Asikis, Near-optimal control of dynamical systems with neural ordinary differential equations, Machine Learning: Science and Technology. 3 (4), 045004 (2022).
- 95. L. Böttcher, T. Asikis, and I. Fragkos, Control of dual-sourcing inventory systems using recurrent neural networks, INFORMS Journal on Computing. 35 (6), 1308–1328 (2023).
- 96. L. Böttcher. Gradient-free training of neural ODEs for system identification and control using ensemble Kalman inversion. In ICML Workshop on New Frontiers in Learning, Control, and Dynamical Systems, Honolulu, HI, USA, 2023 (2023).
- 97. D. Ruiz-Balet and E. Zuazua, Neural ODE control for classification, approximation, and transport, SIAM Review. 65 (3), 735–773 (2023).
- 98. T. Asikis. Towards recommendations for value sensitive sustainable consumption. In NeurIPS 2023 Workshop on Tackling Climate Change with Machine Learning: Blending New and Existing Knowledge Systems (2023). URL https://nips.cc/virtual/2023/76939.
- 99. M. Raissi, P. Perdikaris, and G. E. Karniadakis, Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations, Journal of Computational physics. 378, 686–707 (2019).
- 100. S. L. Brunton and J. N. Kutz, Data-driven science and engineering: Machine learning, dynamical systems, and control. Cambridge University Press, Cambridge, UK (2022).
- 101. S. Mowlavi and S. Nabi, Optimal control of PDEs using physics-informed neural networks, Journal of Computational Physics. 473, 111731 (2023).
- 102. C. Fronk and L. Petzold, Interpretable polynomial neural ordinary differential equations, Chaos: An Interdisciplinary Journal of Nonlinear Science. 33 (4) (2023).
- 103. M. Xia, L. Böttcher, and T. Chou, Spectrally adapted physics-informed neural networks for solving unbounded domain problems, Machine Learning: Science and Technology. 4 (2), 025024 (2023).
- 104. L. Böttcher, L. L. Fonseca, and R. C. Laubenbacher, Control of medical digital twins with artificial neural networks, arXiv preprint arXiv:2403.13851 (2024).
- 105. L. Böttcher and G. Wheeler, Visualizing high-dimensional loss landscapes with Hessian directions, Journal of Statistical Mechanics: Theory and Experiment. p. 023401 (2024).
- 106. M. Thoma. LaTeX Examples: Hopfield Network. URL https://github.com/MartinThoma/LaTeX-examples/tree/master/tikz/hopfield-network (Accessed: September 16, 2023).
- 107. J. Hertz, R. G. Palmer, and A. S. Krogh, Introduction to the Theory of Neural Computation, 1st edn. Perseus Publishing, New York City, New York, USA (1991).
- 108. A. J. Storkey. Increasing the capacity of a Hopfield network without sacrificing functionality. In eds. W. Gerstner, A. Germond, M. Hasler, and J. Nicoud, Artificial Neural Networks - ICANN ’97, 7th International Conference, Lausanne, Switzerland, October 8-10, 1997, Proceedings, vol. 1327, Lecture Notes in Computer Science, pp. 451–456, Springer (1997).
- 109. N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller, Equation of state calculations by fast computing machines, The Journal of Chemical Physics. 21 (6), 1087–1092 (1953).
- 110. J. E. Gubernatis, Marshall Rosenbluth and the Metropolis algorithm, Physics of Plasmas. 12 (5) (2005).
- 111. S. Kirkpatrick, C. D. Gelatt Jr, and M. P. Vecchi, Optimization by simulated annealing, Science. 220 (4598), 671–680 (1983).
- 112. R. J. Glauber, Time-dependent statistics of the Ising model, Journal of Mathematical Physics. 4 (2), 294–307 (1963).
- 113. M. Á. Carreira-Perpiñán and G. E. Hinton. On contrastive divergence learning. In eds. R. G. Cowell and Z. Ghahramani, Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, AISTATS 2005, Bridgetown, Barbados, January 6-8, 2005, Society for Artificial Intelligence and Statistics (2005).
- 114. G. Hinton, A practical guide to training restricted Boltzmann machines, Momentum. 9 (1), 926 (2010).
- 115. S. Park, C. Yun, J. Lee, and J. Shin. Minimum width for universal approximation. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 (2021).
- 116. M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In eds. M. Balcan and K. Q. Weinberger, Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, vol. 48, JMLR Workshop and Conference Proceedings, pp. 1225–1234, JMLR.org (2016).
- 117. A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 4148–4158 (2017).
- 118. D. Choi, C. J. Shallue, Z. Nado, J. Lee, C. J. Maddison, and G. E. Dahl, On empirical comparisons of optimizers for deep learning, arXiv preprint arXiv:1910.05446 (2019).
- 119. N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings (2017).
- 120. L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio. Sharp minima can generalize for deep nets. In eds. D. Precup and Y. W. Teh, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, vol. 70, Proceedings of Machine Learning Research, pp. 1019–1028 (2017).
- 121. I. J. Goodfellow and O. Vinyals. Qualitatively characterizing neural network optimization problems. In eds. Y. Bengio and Y. LeCun, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015).
- 122. H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein. Visualizing the loss landscape of neural nets. In eds. S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp. 6391–6401 (2018).
- 123. Y. Hao, L. Dong, F. Wei, and K. Xu. Visualizing and understanding the effectiveness of BERT. In eds. K. Inui, J. Jiang, V. Ng, and X. Wan, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pp. 4141–4150, Association for Computational Linguistics (2019).
- 124. D. Wu, S. Xia, and Y. Wang. Adversarial weight perturbation helps robust generalization. In eds. H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual (2020).
- 125. S. Horoi, J. Huang, B. Rieck, G. Lajoie, G. Wolf, and S. Krishnaswamy. Exploring the geometry and topology of neural network loss landscapes. In eds. T. Bouadi, É. Fromont, and E. Hüllermeier, Advances in Intelligent Data Analysis XX - 20th International Symposium on Intelligent Data Analysis, IDA 2022, Rennes, France, April 20-22, 2022, Proceedings, vol. 13205, Lecture Notes in Computer Science, pp. 171–184, Springer (2022).
- 126. J. M. Lee, Riemannian manifolds: An introduction to curvature. vol. 176, Springer Science & Business Media, New York City, NY, USA (2006).
- 127. M. Berger and B. Gostiaux, Differential Geometry: Manifolds, Curves, and Surfaces. Springer Science & Business Media, New York City, NY, USA (2012).
- 128. W. Kühnel, Differential Geometry. American Mathematical Society, Providence, RI, USA (2015).
- 129. K. Crane, Discrete differential geometry: An applied introduction, Notices of the AMS, Communication. pp. 1153–1159 (2018).
- 130. J. Martens, New insights and perspectives on the natural gradient method, Journal of Machine Learning Research. 21 (146), 1–76 (2020).
- 131. R. Hecht-Nielsen, Context vectors: general purpose approximate meaning representations self-organized from raw data, Computational Intelligence: Imitating Life, IEEE Press. pp. 43–56 (1994).
- 132. R. B. Davies, Algorithm AS 155: The distribution of a linear combination of $χ^2$ random variables, Applied Statistics. 29 (3), 323–333 (1980).
- 133. B. Laurent and P. Massart, Adaptive estimation of a quadratic functional by model selection, Annals of Statistics. 28 (5), 1302–1338 (2000).
- 134. J. Bausch, On the efficient calculation of a linear combination of chi-square random variables with an application in counting string vacua, Journal of Physics A: Mathematical and Theoretical. 46 (50), 505202 (2013).
- 135. T. Chen and T. Lumley, Numerical evaluation of methods approximating the distribution of a large quadratic form in normal variables, Computational Statistics & Data Analysis. 139, 75–81 (2019).
- 136. J. Matoušek, On variants of the Johnson–Lindenstrauss lemma, Random Structures and Algorithms. 33 (2), 142–156 (2008).
- 137. M. F. Hutchinson, A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines, Communications in Statistics-Simulation and Computation. 18 (3), 1059–1076 (1989).
- 138. Z. Bai, G. Fahey, and G. Golub, Some large-scale matrix computation problems, Journal of Computational and Applied Mathematics. 74 (1-2), 71–89 (1996).
- 139. H. Avron and S. Toledo, Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix, Journal of the ACM (JACM). 58 (2), 1–34 (2011).
- 140. Y. N. Dauphin, R. Pascanu, Ç. Gülçehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In eds. Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pp. 2933–2941 (2014).
- 141. P. Baldi and K. Hornik, Neural networks and principal component analysis: Learning from examples without local minima, Neural Networks. 2 (1), 53–58 (1989).
- 142. Z. Yao, A. Gholami, K. Keutzer, and M. W. Mahoney. PyHessian: Neural networks through the lens of the Hessian. In eds. X. Wu, C. Jermaine, L. Xiong, X. Hu, O. Kotevska, S. Lu, W. Xu, S. Aluru, C. Zhai, E. Al-Masri, Z. Chen, and J. Saltz, 2020 IEEE International Conference on Big Data (IEEE BigData 2020), Atlanta, GA, USA, December 10-13, 2020, pp. 581–590, IEEE (2020).
- 143. Z. Liao and M. W. Mahoney. Hessian eigenspectra of more realistic nonlinear models. In eds. M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, and J. W. Vaughan, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pp. 20104–20117 (2021).
- 144. R. L. Burden, J. D. Faires, and A. M. Burden, Numerical Analysis, 10th edn. Cengage Learning, Boston, MA, USA (2015).
- 145. B. Adcock and N. Dexter, The gap between theory and practice in function approximation with deep neural networks, SIAM Journal on Mathematics of Data Science. 3 (2), 624–655 (2021).