# Statistical Learning Analysis of Physics-Informed Neural Networks
**Authors**: David Barajas-Solano
## Abstract
We study the training and performance of physics-informed learning for initial and boundary value problems (IBVPs) with physics-informed neural networks (PINNs) from a statistical learning perspective. Specifically, we restrict ourselves to parameterizations with hard initial and boundary condition constraints and reformulate the problem of estimating PINN parameters as a statistical learning problem. From this perspective, the physics penalty on the IBVP residuals can be better understood not as a regularizing term but as an infinite source of indirect data, and the learning process as fitting the PINN distribution of residuals $p(y\mid x,t,w)q(x,t)$ to the true data-generating distribution $\delta(0)q(x,t)$ by minimizing the Kullback-Leibler divergence between the true and PINN distributions. Furthermore, this analysis shows that physics-informed learning with PINNs is a singular learning problem, and we employ singular learning theory tools, namely the so-called Local Learning Coefficient [7], to analyze the estimates of PINN parameters obtained via stochastic optimization for a heat equation IBVP. Finally, we discuss the implications of this analysis for the quantification of predictive uncertainty and the extrapolation capacity of PINNs.
keywords: Physics-informed learning, physics-informed neural networks, statistical learning, singular learning theory
Pacific Northwest National Laboratory, Richland, WA 99352, USA
## 1 Introduction
Physics-informed machine learning with physics-informed neural networks (PINNs) (see [6] for a comprehensive overview) has been widely advanced and successfully employed in recent years to approximate the solution of initial and boundary value problems (IBVPs) in computational mathematics. The central mechanism of physics-informed learning is model parameter identification by minimizing "physics penalty" terms penalizing the governing equation and initial and boundary condition residuals predicted by a given set of model parameters. While various significant longstanding issues with the use of PINNs for physics-informed learning, such as the spectral bias, have been addressed, it remains an active area of research, and our understanding of various features of the problem, such as the loss landscape and the extrapolation capacity of PINN models, remains incomplete.
It has been widely recognized that traditional statistical tools are inadequate for analyzing deep learning. Watanabe's "singular learning theory" (SLT) (see [13] for a review) attributes this inadequacy to the singular character of learning with deep learning models and aims to address these limitations. The development of efficient deep learning and Bayesian inference tools and frameworks [3, 15, 2, 1] has enabled the application of SLT to practical applications of deep learning, leading to various insights into the features of the loss landscape, the generalization capacity of models, and the training process [7, 4, 10].
Given that physics-informed learning with PINNs is also singular due to its use of neural networks as the backbone architecture, we propose in this work to use SLT to analyze the training and performance of PINNs. The physics penalty in physics-informed learning precludes an out-of-the-box application of SLT, so we first reformulate the PINN learning problem as a statistical learning problem by restricting our attention to PINN parameterizations with hard initial and boundary condition constraints. Employing this formulation, it can be seen that the PINN learning problem is equivalent to the statistical learning problem of fitting the PINN distribution of governing equation residuals $p(y\mid x,t,w)q(x,t)$ to the true data-generating distribution $\delta(0)q(x,t)$ corresponding to zero residuals $y(x,t)\equiv 0$ for all $(x,t)\in\Omega\times(0,T]$, by minimizing the Kullback-Leibler divergence between the true and PINN distributions. We argue this implies that the physics penalty in physics-informed learning can be better understood as an infinite source of indirect data (with the residuals interpreted as indirect observations, given that they are a function of the PINN prediction) rather than as a regularizing term, as it is commonly interpreted in the literature. Furthermore, we employ SLT's so-called Local Learning Coefficient (LLC) to study the features of the PINN loss for a heat equation IBVP.
This manuscript is structured as follows: In section 2 we reformulate and analyze the physics-informed learning problem as a singular statistical learning problem. In section 3 we discuss the numerical approximation of the LLC and compute it for a heat equation IBVP. Finally, further implications of this analysis are discussed in section 4.
## 2 Statistical learning interpretation of physics-informed learning for IBVPs
Consider the well-posed initial and boundary value problem
$$
\begin{aligned}
\mathcal{L}(u(x,t),x,t)&=0 &&\forall x\in\Omega,\ t\in(0,T],\\
\mathcal{B}(u(x,t),x,t)&=0 &&\forall x\in\partial\Omega,\ t\in[0,T],\\
u(x,0)&=u_{0}(x) &&\forall x\in\Omega,
\end{aligned} \tag{1}
$$
where $\Omega$ denotes the IBVP's spatial domain, $\partial\Omega$ the domain's boundary, $T$ the maximum simulation time, and $u(x,t)$ the IBVP's solution. In the PINN framework, we approximate the solution $u(x,t)$ using an overparameterized deep learning-based model $u(x,t,w)$, usually a neural network model $\text{NN}_{w}(x,t)$, where $w\in W$ denotes the model parameters and $W\subset\mathbb{R}^{d}$ denotes the parameter space. For simplicity, we assume that the model $u(x,t,w)$ is constructed in such a way that the initial and boundary conditions are identically satisfied for any model parameter, that is,
$$
\mathcal{B}(u(x,t,w),x,t)=0\quad\forall x\in\partial\Omega,\ t\in[0,T],\ w\in W,\qquad u(x,0,w)=u_{0}(x)\quad\forall x\in\Omega,\ w\in W,
$$
via hard-constraints parameterizations (e.g., [8, 9]). The model parameters are then identified by minimizing the squared residuals of the governing equation evaluated at $n$ "residual evaluation" points $\{x_{i},t_{i}\}^{n}_{i=1}$, that is, by minimizing the loss
$$
L^{\text{PINN}}_{n}(w)\coloneqq\frac{1}{n}\sum^{n}_{i=1}\left[\mathcal{L}(u(x_{i},t_{i},w),x_{i},t_{i})\right]^{2}. \tag{4}
$$
In the physics-informed-only limit, the PINN framework aims to approximate the solution without using direct observations of the solution, that is, without using labeled data pairs of the form $(x_{i},t_{i},u(x_{i},t_{i}))$, which leads many authors to refer to it as an "unsupervised" learning framework. An alternative view of the PINN learning problem is that it actually does use labeled data, but of the form $(x_{i},t_{i},\mathcal{L}(u(x_{i},t_{i}),x_{i},t_{i})\equiv 0)$, as the PINN formulation encourages the residuals of the parameterized model, $\mathcal{L}(u(x,t,w),x,t)$, to be as close to zero as possible. Crucially, "residual" data of this form is "infinite" in the sense that such pairs can be constructed for all $x\in\Omega$, $t\in(0,T]$. In fact, the loss (4) can be interpreted as a sample approximation of the volume-and-time-averaged loss $L^{\text{PINN}}(w)\coloneqq\int_{\Omega\times(0,T]}\left[\mathcal{L}(u(x,t,w),x,t)\right]^{2}q(x,t)\,\mathrm{d}x\,\mathrm{d}t$ for a certain weighting function $q(x,t)$.
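The sample-approximation claim is easy to verify numerically. The sketch below is a toy illustration (not the paper's setup): it uses a hypothetical residual field $R(x,t)=xt$ on $\Omega\times(0,T]=(0,1)\times(0,1]$ with uniform $q(x,t)$, for which the exact volume-and-time-averaged loss is $\int x^{2}t^{2}\,\mathrm{d}x\,\mathrm{d}t=1/9$:

```python
import numpy as np

# Toy residual field R(x, t) = x * t; the exact mean of R^2 over the
# unit square under uniform q(x, t) is 1/9.
def residual(x, t):
    return x * t

rng = np.random.default_rng(0)
n = 200_000
x, t = rng.uniform(0.0, 1.0, size=(2, n))  # i.i.d. draws from uniform q(x, t)

# Sample PINN-style loss (4): mean of squared residuals at the evaluation points,
# a Monte Carlo estimate of the volume-and-time-averaged loss
L_n = np.mean(residual(x, t) ** 2)
```

As $n$ grows, `L_n` converges to the exact average $1/9$, which is the sense in which the finite-sample loss approximates $L^{\text{PINN}}(w)$.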
We can then establish a parallel with statistical learning theory. Namely, we can introduce a density over the inputs, $q(x,t)$, and the true data-generating distribution $q(x,t,y)=\delta(0)q(x,t)$, where $y$ denotes the equation residuals. For a given error model $p(y\mid\mathcal{L}(u(x,t,w),x,t))$, the PINN model induces a conditional distribution of residuals $p(y\mid x,t,w)$. For example, for the Gaussian error model $p(y\mid\mathcal{L}(u(x,t,w),x,t))=\mathcal{N}(y\mid\mathcal{L}(u(x,t,w),x,t),\sigma^{2})$, the conditional distribution of residuals is of the form
$$
p(y\mid x,t,w)\propto\exp\left\{-\frac{1}{2\sigma^{2}}\left[y-\mathcal{L}(u(x,t,w),x,t)\right]^{2}\right\}. \tag{5}
$$
The PINN framework can then be understood as aiming to fit the model $p(x,t,y\mid w)\coloneqq p(y\mid x,t,w)q(x,t)$ to the data-generating distribution by minimizing the sample negative log-likelihood $L_{n}(w)$ of the training dataset $\{(x_{i},t_{i},y_{i}\equiv 0)\}^{n}_{i=1}$ drawn i.i.d. from $q(x,t,y)$, with
$$
L_{n}(w)\coloneqq-\frac{1}{n}\sum^{n}_{i=1}\log p(y_{i}\equiv 0\mid x_{i},t_{i},w). \tag{6}
$$
Associated with this negative log-likelihood we also have the population negative log-likelihood $L(w)\coloneqq-\mathbb{E}_{q(x,t,y)}\log p(y\mid x,t,w)$. Note that for the Gaussian error model introduced above we have $L^{\text{PINN}}_{n}(w)=2\sigma^{2}L_{n}(w)$ up to an additive constant, so that learning by minimizing the PINN loss (4) and the negative log-likelihood (6) are equivalent. Note also that $L^{\text{PINN}}(w)=2\sigma^{2}L(w)$ up to an additive constant if the volume-and-time averaging function is taken to be the same as the data-generating distribution of the inputs, $q(x,t)$. Therefore, it follows that PINN training is equivalent to minimizing the Kullback-Leibler divergence between the true and model distributions of the residuals.
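The affine relation between the two losses can be checked numerically. The sketch below is illustrative (arbitrary stand-in residual values, not the paper's experiment); it verifies that $L^{\text{PINN}}_{n}(w)=2\sigma^{2}L_{n}(w)$ up to the additive constant $2\sigma^{2}\log(\sigma\sqrt{2\pi})$ coming from the Gaussian normalization:

```python
import numpy as np

rng = np.random.default_rng(1)
r = rng.normal(size=100)  # stand-in residuals L(u(x_i, t_i, w), x_i, t_i)
sigma = 0.3

# PINN loss (4): mean squared residual
L_pinn = np.mean(r ** 2)

# Negative log-likelihood (6) of y_i = 0 under the Gaussian error model (5)
L_n = np.mean(0.5 * r ** 2 / sigma ** 2 + np.log(sigma * np.sqrt(2 * np.pi)))

# The two losses agree up to scale 2 * sigma^2 and an additive constant
const = 2 * sigma ** 2 * np.log(sigma * np.sqrt(2 * np.pi))
assert np.isclose(L_pinn, 2 * sigma ** 2 * L_n - const)
```

Since the scale and the constant do not depend on $w$, both losses have the same minimizers.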
This equivalent reformulation of the PINN learning problem allows us to employ the tools of statistical learning theory to analyze the training of PINNs. Specifically, we observe that, because the neural network-based forward model $\text{NN}_{w}(x,t)$ is a singular model, the PINN model $p(y\mid x,t,w)$ is also singular; that is, the parameter-to-distribution map $w\mapsto p(x,t,y\mid w)$ is not one-to-one, and the Fisher information matrix $I(w)\coloneqq\mathbb{E}_{p(x,t,y\mid w)}\left[\nabla_{w}\log p(x,t,y\mid w)\,\nabla_{w}\log p(x,t,y\mid w)^{\top}\right]$ is not everywhere positive definite in $W$. As a consequence, the PINN loss (4) is not characterized by a discrete, measure-zero, possibly infinite set of local minima that can be locally approximated quadratically. Instead, the PINN loss landscape, as is generally the case for deep learning loss landscapes, is characterized by "flat" minima, crucially even in the limit of the number of residual evaluation points $n\to\infty$. This feature allows us to better understand the role of the residual data in physics-informed learning.
Namely, it has been widely observed that the solutions to the PINN problem $\operatorname*{arg\,min}_{w}L^{\text{PINN}}_{n}(w)$ for a given loss tolerance $\epsilon$ are not unique even for large numbers of residual evaluation points, and regardless of the scheme used to generate such points (uniformly distributed points, low-discrepancy sequences such as Latin hypercube sampling, random sampling, etc.), while on the other hand it has also been observed that increasing $n$ seems to "regularize" or "smooth" the PINN loss landscape (e.g., [11]). Although one may interpret this observed smoothing as similar to the regularizing effect of explicit regularization (e.g., $\ell_{2}$ regularization or the use of a proper prior on the parameters), in fact increasing $n$ does not sharpen the loss landscape into locally quadratic local minima as explicit regularization does. Therefore, residual data (and correspondingly, the physics penalty) in physics-informed learning should not be understood as a regularizer but simply as a source of indirect data.
As estimates of the solution to the PINN problem are not unique, it is necessary to characterize them through other means besides their loss value. In this work we employ the SLT-based LLC metric, which aims to characterize the "flatness" of the loss landscape of singular models. For completeness, we reproduce the definition of the LLC (introduced in [7]) in the sequel. Let $w^{\star}$ denote a local minimum and $B(w^{\star})$ a sufficiently small neighborhood of $w^{\star}$ such that $L(w)\geq L(w^{\star})$ for all $w\in B(w^{\star})$; furthermore, let $B(w^{\star},\epsilon)\coloneqq\{w\in B(w^{\star})\mid L(w)-L(w^{\star})<\epsilon\}$ denote the subset of the neighborhood of $w^{\star}$ with loss within a certain tolerance $\epsilon$, and $V(\epsilon)$ denote the volume of this subset, that is,
$$
V(\epsilon)\coloneqq\operatorname{vol}(B(w^{\star},\epsilon))=\int_{B(w^{\star},\epsilon)}\mathrm{d}w.
$$
Then, the LLC is defined as the rational number $\lambda(w^{\star})$ such that the volume scales as $V(\epsilon)\propto\epsilon^{\lambda(w^{\star})}$ as $\epsilon\to 0$ (up to logarithmic factors). The LLC can be interpreted as a measure of flatness because it corresponds to the rate at which the volume of solutions within the tolerance $\epsilon$ decreases as $\epsilon\to 0$, with smaller LLC values corresponding to slower loss of volume.
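As a sanity check on this definition, consider a regular minimum, where $L$ is locally quadratic with positive-definite Hessian $H$; then $B(w^{\star},\epsilon)$ is approximately the ellipsoid $\{w:\frac{1}{2}(w-w^{\star})^{\top}H(w-w^{\star})<\epsilon\}$, whose volume is

$$
V(\epsilon)=\frac{(2\pi\epsilon)^{d/2}}{\Gamma(d/2+1)\sqrt{\det H}}\propto\epsilon^{d/2},
$$

so the LLC of a regular minimum is $d/2$. Singular minima have smaller LLCs because their flat directions lose volume more slowly as $\epsilon\to 0$.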
As its name implies, the LLC is the local analog of the SLT (global) learning coefficient $\lambda$, which figures in the asymptotic expansion of the free energy of a singular statistical model. Specifically, $\lambda(w^{\star})\equiv\lambda$ if $w^{\star}$ is a global minimum and $B(w^{\star})$ is taken to be the entire parameter space $W$. The free energy $F_{n}$ is defined as
$$
F_{n}\coloneqq-\log\int\exp\{-nL_{n}(w)\}\varphi(w)\,\mathrm{d}w,
$$
where $\varphi(w)$ is a prior on the model parameters. Then, according to the SLT, the free energy for a singular model has the asymptotic expansion [13]
$$
F_{n}=nL_{n}(w_{0})+\lambda\log n+O_{P}(\log\log n),\quad n\to\infty,
$$
where $w_{0}$ is the parameter vector that minimizes the Kullback-Leibler divergence between the true data-generating distribution and the estimated distribution $p(y\mid x,t,w)q(x,t)$ , whereas for a regular model it has the expansion
$$
F_{n}=nL_{n}(\hat{w})+\frac{d}{2}\log n+O_{P}(1),\quad n\to\infty,
$$
where $\hat{w}$ is the maximum likelihood estimator of the model parameters and $d$ is the number of model parameters. The learning coefficient can then be interpreted as measuring model complexity, as it is the singular-model analog of a regular model's complexity $d/2$. Therefore, by computing the LLC around different estimates of the PINN solution we can gain insight into the PINN loss landscape.
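For intuition, the $\frac{d}{2}\log n$ term in the regular expansion follows from a Laplace approximation of the free-energy integral around $\hat{w}$, valid when the Hessian $\nabla^{2}_{w}L_{n}(\hat{w})$ is positive definite:

$$
F_{n}\approx nL_{n}(\hat{w})-\log\left[\varphi(\hat{w})\left(\frac{2\pi}{n}\right)^{d/2}\det\bigl(\nabla^{2}_{w}L_{n}(\hat{w})\bigr)^{-1/2}\right]=nL_{n}(\hat{w})+\frac{d}{2}\log n+O_{P}(1).
$$

This approximation fails for singular models, where the Hessian is degenerate at the minima; SLT's resolution-of-singularities machinery replaces $d/2$ with the learning coefficient $\lambda\leq d/2$ [13].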
## 3 Numerical estimation of the local learning coefficient
### 3.1 MCMC-based local learning coefficient estimator
While there may exist other approaches to estimating the LLC (namely, via Imai's estimator [5] or Watanabe's estimator [14] of the real log-canonical threshold), in this work we choose to employ the estimation framework introduced in [7], which we briefly describe as follows. Let $B_{\gamma}(w^{\star})$ denote a small ball of radius $\gamma$ centered around $w^{\star}$. We introduce the local (around $w^{\star}$) analog of the free energy and its asymptotic expansion:
$$
\begin{split}F_{n}(B_{\gamma}(w^{\star}))&=-\log\int_{B_{\gamma}(w^{\star})}\exp\{-nL_{n}(w)\}\varphi(w)\,\mathrm{d}w\\
&=nL_{n}(w^{\star})+\lambda(w^{\star})\log n+O_{P}(\log\log n).\end{split} \tag{7}
$$
We then replace the hard restriction of the integral above to $B_{\gamma}(w^{\star})$ with a softer Gaussian weighting around $w^{\star}$ with covariance $\gamma^{-1}I_{d}$, where $\gamma>0$ is a scale parameter. Let $\mathbb{E}_{w\mid w^{\star},\beta,\gamma}[\cdot]$ denote the expectation with respect to the tempered distribution
$$
p(w\mid w^{\star},\beta,\gamma)\propto\exp\left\{-n\beta L_{n}(w)-\frac{\gamma}{2}\|w-w^{\star}\|^{2}_{2}\right\}
$$
with inverse temperature $\beta\equiv 1/\log n$; then, $\mathbb{E}_{w\mid w^{\star},\beta,\gamma}[nL_{n}(w)]$ is a good estimator of the local free energy, depending on the choice of $\gamma$. Substituting this approximation into (7), we obtain the estimator
$$
\hat{\lambda}_{\gamma}(w^{\star})\coloneqq n\beta\left[\mathbb{E}_{w\mid w^{\star},\beta,\gamma}[L_{n}(w)]-L_{n}(w^{\star})\right]. \tag{8}
$$
In this work we compute this estimate by drawing samples $w\mid w^{\star},\beta,\gamma$ via Markov chain Monte Carlo (MCMC) simulation. Specifically, we employ the No-U-Turn Sampler (NUTS) algorithm [3]. Furthermore, to evaluate $L_{n}(w)$ we choose a fixed set of $n$ residual evaluation points drawn randomly from $q(x,t)$, with $n$ large enough that we can employ the asymptotic expansion (7) and that eq. (8) is valid. Note that $\hat{\lambda}_{\gamma}(w^{\star})$ is nonnegative if $w^{\star}$ is a local minimum of $L_{n}(w)$, so that $\mathbb{E}_{w\mid w^{\star},\beta,\gamma}[L_{n}(w)]-L_{n}(w^{\star})\geq 0$. In fact, a negative estimate of the LLC implies that the MCMC chain was able to visit a significant volume of parameter space for which $L_{n}(w)<L_{n}(w^{\star})$, which means that $w^{\star}$ is not an actual minimum of $L_{n}(w)$. Given that minimization of the PINN loss via stochastic optimization algorithms is not guaranteed to produce exact minima, in practice it is possible to obtain negative estimates of the LLC.
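A minimal end-to-end sketch of estimator (8) is given below, under stated simplifying assumptions: a toy singular model $y=w_{1}w_{2}x+\varepsilon$ (not the PINN of section 3, but a model whose product parameterization has known learning coefficient $\lambda=1/2$, smaller than the regular value $d/2=1$) and a plain random-walk Metropolis sampler in place of NUTS:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000
x = rng.normal(size=n)
beta = 1.0 / np.log(n)   # inverse temperature beta = 1 / log n
gamma = 1.0              # scale of the Gaussian localizer
w_star = np.zeros(2)     # minimizer: any w with w1 * w2 = 0

def L_n(w):
    # Sample negative log-likelihood (up to a constant) for data y_i = 0
    # under the model y = w1 * w2 * x with unit observation noise
    return 0.5 * np.mean((w[0] * w[1] * x) ** 2)

def neg_log_tempered(w):
    # Tempered, localized target: n * beta * L_n(w) + (gamma / 2) ||w - w*||^2
    return n * beta * L_n(w) + 0.5 * gamma * np.sum((w - w_star) ** 2)

# Random-walk Metropolis over the tempered distribution (NUTS stand-in)
w = w_star.copy()
samples = []
for step in range(20_000):
    prop = w + 0.2 * rng.normal(size=2)
    if np.log(rng.uniform()) < neg_log_tempered(w) - neg_log_tempered(prop):
        w = prop
    samples.append(L_n(w))

# LLC estimator (8), discarding burn-in; theory predicts values near 1/2
lam_hat = n * beta * (np.mean(samples[5_000:]) - L_n(w_star))
```

The estimate fluctuates with the chain, but it stays well below the regular complexity $d/2=1$, reflecting the flat valleys $w_{1}w_{2}=0$ of this singular model.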
### 3.2 Numerical experiments
To analyze the PINN loss landscape for a concrete case, we compute and analyze the LLC for the following physics-informed learning problem: We consider the heat equation IBVP
$$
\partial_{t}u-\partial_{xx}u=f(x,t)\quad\forall x\in(0,2),\ t\in(0,2],
$$
where $f(x,t)$ is chosen so that the solution to the IBVP is
$$
u(x,t)=e^{-t}\sin(\pi x).
$$
To exactly enforce initial and boundary conditions, we employ the hard-constraints parameterization of the PINN model
$$
u(x,t,w)=\sin(\pi x)+tx(x-1)\text{NN}_{w}(x,t). \tag{9}
$$
The neural network model $\text{NN}_{w}$ is taken to be a multi-layer perceptron (MLP) with 2 hidden layers of $50$ units each. This model has a total number of parameters $d=20{,}601$. This relatively small model ensures that the wall-clock time of computing multiple MCMC chains is reasonable. The model of eq. (9) is implemented using the Flax NNX library [2], and MCMC sampling is performed using the BlackJAX library [1].
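The hard-constraint structure of (9) can be checked directly: at $t=0$ the correction term vanishes regardless of the network, so the initial condition $u(x,0)=\sin(\pi x)$ holds for any parameters. The sketch below uses a stand-in random MLP (hypothetical, untrained, with two hidden layers of width 50 as in the text) rather than the trained Flax model:

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in MLP with 2 hidden layers of 50 tanh units (random, untrained weights)
sizes = [2, 50, 50, 1]
params = [(rng.normal(size=(a, b)), rng.normal(size=b))
          for a, b in zip(sizes[:-1], sizes[1:])]

def nn(x, t):
    h = np.stack([x, t], axis=-1)
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)
    W, b = params[-1]
    return (h @ W + b)[..., 0]

def u(x, t):
    # Hard-constraints parameterization (9)
    return np.sin(np.pi * x) + t * x * (x - 1) * nn(x, t)

x = np.linspace(0.0, 2.0, 101)
# At t = 0 the correction term vanishes, so u(x, 0) = sin(pi x) exactly,
# and at x = 0 the solution vanishes for all t
assert np.allclose(u(x, np.zeros_like(x)), np.sin(np.pi * x))
assert np.allclose(u(np.zeros(5), np.linspace(0.0, 2.0, 5)), 0.0)
```

Because these constraints hold identically in $w$, training only needs to drive the interior residual of (9) to zero.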
The parameters of the model are estimated by minimizing the PINN loss of eq. (4) using the Adam stochastic optimization algorithm for $50{,}000$ iterations. At each iteration, a new batch of residual evaluation points is drawn from $q(x,t)$ and used to evaluate the PINN loss and its gradient. Note that these batches of residual evaluation points are distinct from the set used to evaluate $L_{n}(w)$ in eq. (8), which is fixed for a given experiment. Figure 1 shows the PINN solution $u(x,t,w^{\star})$ for $w^{\star}$ estimated using a batch size of $32$ and a learning rate of $1\times 10^{-4}$. This solution has an associated PINN loss of $1.854\times 10^{-5}$, which, although small, shows that the true data-generating distribution $\delta(0)q(x,t)$ is not realizable within the model class (9) with the MLP architecture indicated above, due to the finite capacity of the class.
(a) PINN solution $u(x,t,w^{\star})$
(b) Pointwise errors $u(x,t)-u(x,t,w^{\star})$
Figure 1: PINN solution and estimation error computed with batch size of $32$ and Adam learning rate of $1\times 10^{-4}.$
LLC estimates are computed for batch sizes of $8$, $16$, and $32$, and for Adam learning rates $\eta$ of $1\times 10^{-3}$ and $1\times 10^{-4}$. Furthermore, for all experiments we compute the LLC estimate using a fixed set of $256$ residual evaluation points drawn randomly from $q(x,t)$. To compute the LLC, we employ the Gaussian likelihood of residuals (5). As such, the LLC estimate depends not only on the choice of $\gamma$ but also on the standard deviation $\sigma$. For each experiment we generate 2 MCMC chains, and each chain is initialized with a warmup adaptation interval.
To inform the choice of hyperparameters of the LLC estimator we make the following observations: First, both hyperparameters control the volume of the local neighborhood employed to estimate the LLC, in the case of $\gamma$ by expanding or contracting the prior centered around $w^{\star}$, and in the case of $\sigma$ by increasing or decreasing the range of values $L_{n}(w)$ can take within the local neighborhood. Second, given that the true data-generating distribution is not realizable, the PINN loss will always be non-zero. Therefore, for the true zero residuals to be resolvable with non-trivial probability by the PINN model, we must choose $\sigma$ larger than the average optimal PINN residual (the square root of the optimal PINN loss).
Given that $\gamma$ controls the volume of the local neighborhood, we take the value $\gamma=1$. For the architecture described above, we find that the optimal PINN loss is $O(10^{-5})$, so that $\sigma$ must be chosen to be $3\times 10^{-3}$ or larger. Given that $\sigma$ also controls the volume of the local neighborhood, we take the conservative value $\sigma=1$. It is not entirely obvious whether this conservative choice overestimates or underestimates the LLC. Numerical experiments not presented in this work showed that the small choice $\sigma=1\times 10^{-2}$ often led to negative LLC estimates and was therefore rejected. LLC values estimated for $\sigma=1\times 10^{-1}$ were consistently slightly larger than those for $\sigma=1$, which indicates that the class of PINN models that are more physics-respecting has slightly more complexity than a less physics-respecting class. Nevertheless, given that the difference in LLC estimates is small, we proceeded with the conservative choice.
(Figure: train and test loss vs. iteration (left) and LLC estimate vs. iteration (right), for learning rates $\eta=1\times 10^{-4}$ and $\eta=1\times 10^{-3}$.)
(a) Batch size $=8$
<details>
<summary>x4.png Details</summary>

Two panels. Left: training and test loss vs. iteration (log scale) for Adam learning rates η = 1 × 10⁻⁴ and η = 1 × 10⁻³; the training losses fall from ≈10⁻¹ to ≈5 × 10⁻³, with the test losses remaining above them. Right: local learning coefficient vs. iteration for the same learning rates, rising from ≈7.2 to a plateau near 9.2–9.6.
</details>
(b) Batch size $=16$
<details>
<summary>x5.png Details</summary>

Two panels. Left: training and test loss vs. iteration (log scale) for Adam learning rates η = 1 × 10⁻⁴ and η = 1 × 10⁻³; the training losses fall below 10⁻², while the test losses plateau higher. Right: local learning coefficient vs. iteration with shaded confidence bands, settling near 9.5 for both learning rates after a mid-training dip.
</details>
(c) Batch size $=32$
Figure 2: Training and test loss histories (left, shown every 100 iterations) and LLC estimates with 95% confidence intervals (right, shown every 1,000 iterations) for various values of the batch size and Adam learning rate.
Figure 2 summarizes the results of the computational experiments. The left panels show the history of the training and test losses every $100$ iterations. Here, the test loss is the average of $(u(x,t,w)-u(x,t))^{2}$ evaluated over a $51\times 51$ uniform grid of $(x,t)$ pairs. The right panels show the LLC estimates computed every $1,000$ iterations. Care must be taken when interpreting the smaller LLC estimates seen at earlier iterations: early in training, the parameter estimate $w_{t}$ is not a minimum, so the MCMC chains cover regions of parameter space for which $L_{n}(w)<L_{n}(w_{t})$, leading to small and possibly negative LLC estimates. Modulo this observation, it can be seen that across all experiments (batch size and learning rate values), the estimated LLC is $\hat{\lambda}(w^{\star})\approx 9.5$. Additional experiments conducted by varying the pseudo-random number generator seed used to initialize $w$ also lead to the same estimate of the LLC (see fig. 3).
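The test loss described above is straightforward to compute; the following is a minimal NumPy sketch, where `u_pinn` (the trained network) and `u_exact` (the reference solution) are hypothetical callables and the domain bounds are placeholders.

```python
import numpy as np

def grid_test_loss(u_pinn, u_exact, x_range=(0.0, 1.0), t_range=(0.0, 1.0), m=51):
    """Average of (u(x,t,w) - u(x,t))^2 over an m-by-m uniform grid of (x, t) pairs."""
    x = np.linspace(x_range[0], x_range[1], m)
    t = np.linspace(t_range[0], t_range[1], m)
    X, T = np.meshgrid(x, t, indexing="ij")  # all m*m grid pairs
    return float(np.mean((u_pinn(X, T) - u_exact(X, T)) ** 2))
```

With vectorized callables, a single call evaluates the full $51\times 51$ grid at once.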
This number is remarkable for two reasons. First, despite the estimates of the local minimum $w^{\star}$ being vastly different across experiments, the LLC estimate is very consistent. Second, $\hat{\lambda}(w^{\star})$ is significantly smaller than the number of parameters $d=20{,}601$ of the PINN model; for a regular model the LLC equals $d/2$, so this estimate corresponds to an effective model with a much smaller, $O(10)$, number of parameters. These results indicate that the PINN solution corresponds to a very flat region of parameter space, and that different choices of initialization and stochastic optimization hyperparameters lead to the same region. As seen in figs. 2 and 3, the lower learning rate of $1\times 10^{-4}$, due to reduced stochastic optimization noise, takes a larger number of iterations to arrive at this region than the larger learning rate of $1\times 10^{-3}$.
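For concreteness, the SGLD-based LLC estimator of [7], $\hat{\lambda}(w^{\star})=n\beta^{\star}\left(\mathbb{E}_{w}[L_{n}(w)]-L_{n}(w^{\star})\right)$ with $\beta^{\star}=1/\log n$ and a Gaussian localization of strength $\gamma$ at $w^{\star}$, can be sketched in NumPy as follows. The stepsize, chain length, and $\gamma$ below are illustrative assumptions, and `loss_fn`/`grad_fn` are hypothetical full-batch loss and gradient callables, not the paper's implementation.

```python
import numpy as np

def estimate_llc(loss_fn, grad_fn, w_star, n, num_steps=5000, burn_in=1000,
                 eps=1e-5, gamma=100.0, seed=0):
    """SGLD estimate of the local learning coefficient at w_star."""
    rng = np.random.default_rng(seed)
    beta = 1.0 / np.log(n)  # inverse temperature beta* = 1 / log n
    w = w_star.copy()
    losses = []
    for step in range(num_steps):
        # Gradient of the localized, tempered potential
        # U(w) = n * beta * L_n(w) + (gamma / 2) * ||w - w_star||^2.
        drift = n * beta * grad_fn(w) + gamma * (w - w_star)
        w = w - 0.5 * eps * drift + rng.normal(0.0, np.sqrt(eps), size=w.shape)
        if step >= burn_in:
            losses.append(loss_fn(w))
    return n * beta * (np.mean(losses) - loss_fn(w_star))
```

As a sanity check, for a regular quadratic loss in $d$ dimensions the estimator should return roughly $d/2$, up to localization bias and sampling noise.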
<details>
<summary>x6.png Details</summary>

Two panels of local learning coefficient vs. iteration (10,000 to 50,000) for batch sizes 8, 16, and 32; left panel η = 1 × 10⁻⁴, right panel η = 1 × 10⁻³. After an initial transient, all traces settle near 9.5.
</details>
(a) First initialization
<details>
<summary>x7.png Details</summary>

Two panels of local learning coefficient vs. iteration (10,000 to 50,000) for batch sizes 8, 16, and 32, with shaded confidence bands; left panel η = 1 × 10⁻⁴, right panel η = 1 × 10⁻³. After an initial transient, all traces settle near 9.5.
</details>
(b) Second initialization
Figure 3: LLC estimates with 95% confidence intervals shown every 1,000 iterations, computed for various values of batch size and Adam learning rate $\eta$ and for two initializations. Left: $\eta=1\times 10^{-4}$. Right: $\eta=1\times 10^{-3}$.
## 4 Discussion
A significant limitation of the proposed reformulation of the physics-informed learning problem is that it is limited to PINN models with hard initial and boundary condition constraints. In practice, physics-informed learning is often carried out without such hard constraints by penalizing not only the residual of the governing equation, $y_{\text{GE}}(x,t)\coloneqq\mathcal{L}(u(x,t,w),x,t)$, but also the residuals of the initial conditions, $y_{\text{IC}}(x)\coloneqq u(x,0,w)$, and boundary conditions, $y_{\text{BC}}(x,t)\coloneqq\mathcal{B}(u(x,t,w),x,t)$. These residuals can also be regarded as observables of the PINN model. As such, a possible approach to reformulating the physics-informed learning problem with multiple residuals is to introduce a single vector observable $y\coloneqq[y_{\text{GE}},y_{\text{IC}},y_{\text{BC}}]$ and consider the statistical learning problem with model $p(y\mid x,t,w)q(x,t)$. In such a formulation, the number of samples $n$ would be the same for all residuals, which goes against the common practice of using different numbers of residual evaluation points for each term of the PINN loss. Nevertheless, we note that, as indicated in section 2, PINN learning with randomized batches approximates minimizing the population loss $L(w)\equiv\lim_{n\to\infty}L_{n}(w)$, which establishes a possible equivalence between the physics-informed learning problem with multiple residuals and the reformulation described above. Furthermore, special care must be taken when considering the weighting coefficients commonly used in physics-informed learning to weight the different residuals in the PINN loss. These problems will be considered in future work.
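The vector-observable formulation can be sketched, for a single residual evaluation point, as follows. The residual callables `r_ge`, `r_ic`, and `r_bc` are hypothetical stand-ins for the governing-equation, initial-condition, and boundary-condition residuals of a concrete PINN, and the weights mimic the weighting coefficients discussed above.

```python
import numpy as np

def vector_observable(w, x, t, r_ge, r_ic, r_bc):
    """Stack the three residuals into a single observable y = [y_GE, y_IC, y_BC]."""
    return np.array([r_ge(w, x, t), r_ic(w, x), r_bc(w, x, t)])

def sample_loss(w, x, t, r_ge, r_ic, r_bc, weights=(1.0, 1.0, 1.0)):
    """Weighted squared norm of the vector observable at one (x, t) sample.
    Averaging over (x, t) ~ q(x, t) gives a single combined population loss in
    which all residuals share the same number n of samples."""
    y = vector_observable(w, x, t, r_ge, r_ic, r_bc)
    return float(np.dot(weights, y ** 2))
```

Note that every residual is evaluated at the same $(x,t)$ sample, which is precisely where this formulation departs from the common practice of separate residual point sets.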
The presented statistical learning analysis of the physics-informed learning problem has interesting implications for the quantification of uncertainty in PINN predictions. Bayesian uncertainty quantification (UQ) of PINN predictions is often carried out by choosing a prior on the model parameters and a fixed set of residual evaluation points and computing samples from the posterior distribution of PINN parameters (see, e.g., [17, 16]). As noted in [17], MCMC sampling of the PINN loss is computationally highly challenging due to the singular nature of PINN models, but such an exhaustive exploration of parameter space may not be necessary: solution candidates may differ significantly in parameter space yet correspond to only a limited range of loss values, and thus a limited range of possible predictions. In other words, this analysis justifies taking a function-space view of Bayesian UQ for physics-informed learning due to the singular nature of the PINN models involved.
<details>
<summary>x8.png Details</summary>

Heatmap of a pointwise error over x ∈ [0, 2] and t ∈ [0, 4], with a horizontal dashed red line at t = 2.0. Values are near zero for t < 2 and grow to ≈0.175 near (x, t) ≈ (1.5, 3.5).
</details>
(a) $|u(x,t)-u(x,t,w^{0})|$
<details>
<summary>x9.png Details</summary>

Contour plot of a pointwise error over x ∈ [0, 2] and t ∈ [0, 4], with a horizontal dashed red line at t = 2.0. Values grow with t, peaking at ≈0.30 near (x, t) ≈ (1.0, 3.5).
</details>
(b) $|u(x,t)-u(x,t,w^{1})|$
Figure 4: PINN pointwise absolute estimation errors for two different parameter vectors, $w^{0}$ and $w^{1}$, obtained via different initializations of the PINN model parameters. Both parameter estimates were computed using a batch size of 32 and an Adam learning rate of $1\times 10^{-4}$ over $100,000$ iterations. Despite both cases resulting in similar loss values $L_{n}(w)$ ($O(10^{-5})$) and similar LLCs, $\hat{\lambda}(w^{0})\approx\hat{\lambda}(w^{1})\approx 9.5$, they exhibit noticeably different extrapolation behavior.
Finally, another implication of the presented analysis relates to the extrapolation capacity of PINNs. It is understood that PINNs, particularly time-dependent PINNs, exhibit limited extrapolation capacity for times outside the training window $(0,T]$ (see, e.g., [12]). The presented analysis indicates that this is not a feature of the IBVP being solved, of the number and arrangement of residual evaluation points in space and time, or of the optimization algorithm, but rather a consequence of physics-informed learning estimating model parameters by effectively minimizing $L(w)\propto\mathbb{E}_{q(x,t)}[\mathcal{L}(u(x,t,w),x,t)^{2}]$, that is, the expected loss over the input-generating distribution $q(x,t)$, whose support is the spatio-temporal training window. While the PINN solution candidates along the flat minima all fall within a limited range of loss values, there is no guarantee that this holds for a different loss $L^{\prime}(w)\propto\mathbb{E}_{q^{\prime}(x,t)}[\mathcal{L}(u(x,t,w),x,t)^{2}]$, because the structure of the flat minima is tightly tied to the definition of the original loss, and thus to the input-generating distribution and the spatio-temporal training window. In fact, it can be expected that two parameter vectors $w$ with losses $L(w)$ within tolerance will lead to vastly different extrapolation losses $L^{\prime}(w)$ (e.g., see fig. 4), consistent with extrapolation results reported in the literature.
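The dependence of the loss on the input-generating distribution can be illustrated with a small Monte Carlo sketch. The toy residual and window bounds below are illustrative assumptions, not the paper's heat equation setup: the residual is zero inside a training window $t\le 2$ and nonzero beyond it, so $L(w)$ over the training window is zero while $L^{\prime}(w)$ over the extended window is not.

```python
import numpy as np

def expected_sq_residual(residual, x_range, t_range, num_samples=20000, seed=0):
    """Monte Carlo estimate of E_q[residual(x, t)^2] for q uniform on
    x_range x t_range, i.e., the population loss up to a constant."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(x_range[0], x_range[1], num_samples)
    t = rng.uniform(t_range[0], t_range[1], num_samples)
    return float(np.mean(residual(x, t) ** 2))

# Toy residual: exactly zero inside the training window t <= 2.
residual = lambda x, t: np.where(t <= 2.0, 0.0, np.sin(np.pi * x))

loss_train = expected_sq_residual(residual, (0.0, 2.0), (0.0, 2.0))   # L(w)
loss_extrap = expected_sq_residual(residual, (0.0, 2.0), (0.0, 4.0))  # L'(w)
```

Here `loss_train` vanishes while `loss_extrap` is approximately 0.25, mirroring how parameter vectors indistinguishable under $L(w)$ can differ sharply under $L^{\prime}(w)$.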
## 5 Conclusions
We have presented in this work a reformulation of physics-informed learning for IBVPs as a statistical learning problem, valid for parameterizations with hard constraints. Employing this reformulation, we argue that the physics penalty in physics-informed learning is not a regularizer, as it is often interpreted (see, e.g., [6]), but rather an infinite source of data, and that physics-informed learning with randomized batches of residual evaluation points is equivalent to minimizing $L(w)\propto\mathbb{E}_{q(x,t)}[\mathcal{L}(u(x,t,w),x,t)^{2}]$. Furthermore, we show that the PINN loss landscape is not characterized by discrete, measure-zero local minima, even in the limit $n\to\infty$ of the number of residual evaluation points, but by very flat local minima corresponding to a low-complexity (relative to the number of model parameters) model class. We characterize the flatness of the local minima of the PINN loss landscape by computing the LLC for a heat equation IBVP for various initializations and choices of learning hyperparameters, and find that across all initializations and hyperparameters studied the estimated LLC is very similar, indicating that the neighborhoods around the different PINN parameter estimates obtained across the experiments exhibit similar flatness properties. Finally, we discuss the implications of the proposed analysis for the quantification of uncertainty in PINN predictions and for the extrapolation capacity of PINN models.
## Acknowledgments
This material is based upon work supported by the U.S. Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing Research. Pacific Northwest National Laboratory is operated by Battelle for the DOE under Contract DE-AC05-76RL01830.
## References
- [1] A. Cabezas, A. Corenflos, J. Lao, and R. Louf (2024) BlackJAX: composable Bayesian inference in JAX. arXiv:2402.10797.
- [2] Flax: a neural network library and ecosystem for JAX.
- [3] M. D. Hoffman and A. Gelman (2014) The No-U-Turn Sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res. 15 (1), pp. 1593–1623.
- [4] J. Hoogland, G. Wang, M. Farrugia-Roberts, L. Carroll, S. Wei, and D. Murfet (2025) Loss landscape degeneracy and stagewise development in transformers. Transactions on Machine Learning Research.
- [5] T. Imai (2019) Estimating real log canonical thresholds. arXiv:1906.01341.
- [6] G. E. Karniadakis, I. G. Kevrekidis, L. Lu, P. Perdikaris, S. Wang, and L. Yang (2021) Physics-informed machine learning. Nature Reviews Physics 3 (6), pp. 422–440.
- [7] E. Lau, Z. Furman, G. Wang, D. Murfet, and S. Wei (2025) The local learning coefficient: a singularity-aware complexity measure. In Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 258, pp. 244–252.
- [8] X. Li, J. Deng, J. Wu, S. Zhang, W. Li, and Y. Wang (2024) Physical informed neural networks with soft and hard boundary constraints for solving advection-diffusion equations using Fourier expansions. Computers & Mathematics with Applications 159, pp. 60–75.
- [9] L. Lu, R. Pestourie, W. Yao, Z. Wang, F. Verdugo, and S. G. Johnson (2021) Physics-informed neural networks with hard constraints for inverse design. SIAM Journal on Scientific Computing 43 (6), pp. B1105–B1132.
- [10] A. Muhamed and V. Smith (2025) The geometry of forgetting: analyzing machine unlearning through local learning coefficients. In ICML 2025 Workshop on Reliable and Responsible Foundation Models.
- [11] A. M. Tartakovsky and Y. Zong (2024) Physics-informed machine learning method with space-time Karhunen-Loève expansions for forward and inverse partial differential equations. Journal of Computational Physics 499, pp. 112723.
- [12] Y. Wang, Y. Yao, and Z. Gao (2025) An extrapolation-driven network architecture for physics-informed deep learning. Neural Networks 183, pp. 106998.
- [13] S. Watanabe (2009) Algebraic geometry and statistical learning theory. Vol. 25, Cambridge University Press.
- [14] S. Watanabe (2013) A widely applicable Bayesian information criterion. J. Mach. Learn. Res. 14 (1), pp. 867–897.
- [15] M. Welling and Y. W. Teh (2011) Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning, ICML'11, Madison, WI, USA, pp. 681–688.
- [16] L. Yang, X. Meng, and G. E. Karniadakis (2021) B-PINNs: Bayesian physics-informed neural networks for forward and inverse PDE problems with noisy data. Journal of Computational Physics 425, pp. 109913.
- [17] Y. Zong, D. A. Barajas-Solano, and A. M. Tartakovsky (2025) Randomized physics-informed neural networks for Bayesian data assimilation. Comput. Methods Appl. Mech. Eng. 436, pp. 117670.