# Statistical Learning Analysis of Physics-Informed Neural Networks
**Authors**: David Barajas-Solano
> David.Barajas-Solano@pnnl.gov
Abstract
We study the training and performance of physics-informed learning for initial and boundary value problems (IBVPs) with physics-informed neural networks (PINNs) from a statistical learning perspective. Specifically, we restrict ourselves to parameterizations with hard initial and boundary condition constraints and reformulate the problem of estimating PINN parameters as a statistical learning problem. From this perspective, the physics penalty on the IBVP residuals is better understood not as a regularizing term but as an infinite source of indirect data, and the learning process as fitting the PINN distribution of residuals $p(y\mid x,t,w)q(x,t)$ to the true data-generating distribution $\delta(0)q(x,t)$ by minimizing the Kullback-Leibler divergence between the true and PINN distributions. Furthermore, this analysis shows that physics-informed learning with PINNs is a singular learning problem, and we employ tools from singular learning theory, namely the so-called Local Learning Coefficient [7], to analyze the estimates of PINN parameters obtained via stochastic optimization for a heat equation IBVP. Finally, we discuss the implications of this analysis for the quantification of predictive uncertainty of PINNs and for the extrapolation capacity of PINNs.
keywords: Physics-informed learning, physics-informed neural networks, statistical learning, singular learning theory
Pacific Northwest National Laboratory
1 Introduction
Physics-informed machine learning with physics-informed neural networks (PINNs) (see [6] for a comprehensive overview) has been widely advanced and successfully employed in recent years to approximate the solution of initial and boundary value problems (IBVPs) in computational mathematics. The central mechanism of physics-informed learning is model parameter identification by minimizing “physics penalty” terms penalizing the governing equation and initial and boundary condition residuals predicted by a given set of model parameters. While significant longstanding issues in the use of PINNs for physics-informed learning, such as the spectral bias, have been addressed, it remains an active area of research, and our understanding of various features of the problem, such as the loss landscape and the extrapolation capacity of PINN models, remains incomplete.
It has been widely recognized that traditional statistical tools are inadequate for analyzing deep learning. Watanabe’s “singular learning theory” (SLT) (see [13] for a review) recognizes that this is due to the singular character of learning with deep learning models and aims to address these limitations. The development of efficient deep learning and Bayesian inference tools and frameworks [3, 15, 2, 1] has enabled the application of SLT to analyzing practical applications of deep learning, leading to various insights on the features of the loss landscape, the generalization capacity of the model, and the training process [7, 4, 10].
Given that physics-informed learning with PINNs is also singular due to its use of neural networks as the backbone architecture, we propose in this work to use SLT to analyze the training and performance of PINNs. The physics penalty in physics-informed learning precludes an out-of-the-box application of SLT, so we first reformulate the PINN learning problem as a statistical learning problem by restricting our attention to PINN parameterizations with hard initial and boundary condition constraints. Under this formulation, the PINN learning problem is equivalent to the statistical learning problem of fitting the PINN distribution of governing equation residuals $p(y\mid x,t,w)q(x,t)$ to the true data-generating distribution $\delta(0)q(x,t)$ corresponding to zero residuals $y(x,t)\equiv 0$ for all $(x,t)\in\Omega\times(0,T]$, by minimizing the Kullback-Leibler divergence between the true and PINN distributions. We argue this implies that the physics penalty in physics-informed learning is better understood as an infinite source of indirect data (with the residuals interpreted as indirect observations, given that they are a function of the PINN prediction) rather than as a regularizing term, as it is commonly interpreted in the literature. Furthermore, we employ SLT’s so-called Local Learning Coefficient (LLC) to study the features of the PINN loss for a heat equation IBVP.
This manuscript is structured as follows: In section 2 we reformulate and analyze the physics-informed learning problem as a singular statistical learning problem. In section 3 we discuss the numerical approximation of the LLC and compute it for a heat equation IBVP. Finally, further implications of this analysis are discussed in section 4.
2 Statistical learning interpretation of physics-informed learning for IBVPs
Consider the well-posed initial and boundary value problem
$$
\begin{aligned}
\mathcal{L}(u(x,t),x,t) &= 0 &&\forall x\in\Omega,\ t\in(0,T],\\
\mathcal{B}(u(x,t),x,t) &= 0 &&\forall x\in\partial\Omega,\ t\in[0,T],\\
u(x,0) &= u_{0}(x) &&\forall x\in\Omega,
\end{aligned} \tag{1}
$$
where $\Omega$ denotes the IBVP’s spatial domain, $\partial\Omega$ the domain’s boundary, $T$ the maximum simulation time, and $u(x,t)$ the IBVP’s solution. In the PINN framework, we approximate the solution $u(x,t)$ using an overparameterized deep learning-based model $u(x,t,w)$, usually a neural network model $\text{NN}_{w}(x,t)$, where $w\in W$ denotes the model parameters and $W\subset\mathbb{R}^{d}$ denotes the parameter space. For simplicity, we assume that the model $u(x,t,w)$ is constructed in such a way that the initial and boundary conditions are identically satisfied for any model parameter, that is,
$$
\mathcal{B}(u(x,t,w),x,t)=0\quad\forall x\in\partial\Omega,\ t\in(0,T],\ w\in W, \tag{2}
$$
$$
u(x,0,w)=u_{0}(x)\quad\forall x\in\Omega,\ w\in W, \tag{3}
$$
via hard-constraints parameterizations (e.g., [8, 9]). The model parameters are then identified by minimizing the squared residuals of the governing equation evaluated at $n$ “residual evaluation” points $\{x_{i},t_{i}\}^{n}_{i=1}$ , that is, by minimizing the loss
$$
L^{\text{PINN}}_{n}(w)\coloneqq\frac{1}{n}\sum^{n}_{i=1}\left[\mathcal{L}(u(x_{i},t_{i},w),x_{i},t_{i})\right]^{2}. \tag{4}
$$
In the physics-informed-only limit, the PINN framework aims to approximate the solution without using direct observations of the solution, that is, without using labeled data pairs of the form $(x_{i},t_{i},u(x_{i},t_{i}))$, which leads many authors to refer to it as an “unsupervised” learning framework. An alternative view of the PINN learning problem is that it actually does use labeled data, but of the form $(x_{i},t_{i},\mathcal{L}(u(x_{i},t_{i}),x_{i},t_{i})\equiv 0)$, as the PINN formulation encourages the residuals of the parameterized model, $\mathcal{L}(u(x,t,w),x,t)$, to be as close to zero as possible. Crucially, “residual” data of this form is “infinite” in the sense that such pairs can be constructed for all $x\in\Omega$, $t\in(0,T]$. In fact, the loss 4 can be interpreted as a sample approximation of the volume-and-time-averaged loss $L^{\text{PINN}}(w)\coloneqq\int_{\Omega\times(0,T]}\left[\mathcal{L}(u(x,t,w),x,t)\right]^{2}q(x,t)\,\mathrm{d}x\,\mathrm{d}t$ for a certain weighting function $q(x,t)$.
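As an illustration of this sample-approximation view, the following minimal numpy sketch (function names are illustrative, and finite differences stand in for the automatic differentiation used in the experiments of section 3) evaluates a loss of the form 4 at collocation points drawn from a uniform $q(x,t)$, for the manufactured heat-equation solution used later in section 3:

```python
import numpy as np

rng = np.random.default_rng(0)

def residual(u, f, x, t, h=1e-3):
    # Governing-equation residual L(u) = u_t - u_xx - f, via central differences
    u_t = (u(x, t + h) - u(x, t - h)) / (2.0 * h)
    u_xx = (u(x + h, t) - 2.0 * u(x, t) + u(x - h, t)) / h**2
    return u_t - u_xx - f(x, t)

def pinn_loss(u, f, n=4096):
    # Sample approximation of the volume-and-time-averaged loss, q(x,t) uniform
    x = rng.uniform(0.0, 2.0, n)
    t = rng.uniform(0.0, 2.0, n)
    return np.mean(residual(u, f, x, t) ** 2)

# Manufactured solution u = e^{-t} sin(pi x) and its matching forcing term
u_exact = lambda x, t: np.exp(-t) * np.sin(np.pi * x)
f = lambda x, t: (np.pi**2 - 1.0) * np.exp(-t) * np.sin(np.pi * x)

loss = pinn_loss(u_exact, f)
print(loss)  # tiny: only finite-difference truncation error remains
```

For a candidate function that does not satisfy the governing equation, the same estimator returns an $O(1)$ loss, so minimizing it over model parameters drives the residuals toward zero at the sampled points.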
We can then establish a parallel with statistical learning theory. Namely, we can introduce a density over the inputs, $q(x,t)$, and the true data-generating distribution $q(x,t,y)=\delta(0)q(x,t)$, where $y$ denotes the equation residuals. For a given error model $p(y\mid\mathcal{L}(u(x,t,w),x,t))$, the PINN model induces a conditional distribution of residuals $p(y\mid x,t,w)$. For example, for the Gaussian error model $p(y\mid\mathcal{L}(u(x,t,w),x,t))=\mathcal{N}(y\mid\mathcal{L}(u(x,t,w),x,t),\sigma^{2})$, the likelihood of the zero-residual observations is of the form
$$
p(y\equiv 0\mid x,t,w)\propto\exp\left\{-\frac{1}{2\sigma^{2}}\left[\mathcal{L}(u(x,t,w),x,t)\right]^{2}\right\}. \tag{5}
$$
The PINN framework can then be understood as aiming to fit the model $p(x,t,y\mid w)\coloneqq p(y\mid x,t,w)q(x,t)$ to the data-generating distribution by minimizing the sample negative log-likelihood $L_{n}(w)$ of the training dataset $\{(x_{i},t_{i},y_{i}\equiv 0)\}^{n}_{i=1}$ drawn i.i.d. from $q(x,t,y)$, with
$$
L_{n}(w)\coloneqq-\frac{1}{n}\sum^{n}_{i=1}\log p(y_{i}\equiv 0\mid x_{i},t_{i},w). \tag{6}
$$
Associated with this sample negative log-likelihood we also have the population negative log-likelihood $L(w)\coloneqq-\mathbb{E}_{q(x,t,y)}\log p(y\mid x,t,w)$. Note that for the Gaussian error model introduced above we have that $L^{\text{PINN}}_{n}(w)=2\sigma^{2}L_{n}(w)$ up to an additive constant, so that learning by minimizing the PINN loss 4 and learning by minimizing the negative log-likelihood 6 are equivalent. Note also that $L^{\text{PINN}}(w)=2\sigma^{2}L(w)$ up to an additive constant if the volume-and-time weighting function is taken to be the same as the data-generating distribution of the inputs, $q(x,t)$. Therefore, it follows that PINN training is equivalent to minimizing the Kullback-Leibler divergence between the true and model distributions of the residuals.
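The equivalence for the Gaussian error model can be checked directly: evaluating the negative log-likelihood at the zero-residual observations gives

$$
-\log p(y_{i}\equiv 0\mid x_{i},t_{i},w)=\frac{1}{2\sigma^{2}}\left[\mathcal{L}(u(x_{i},t_{i},w),x_{i},t_{i})\right]^{2}+\log\left(\sigma\sqrt{2\pi}\right),
$$

so that averaging over the $n$ residual evaluation points yields $L_{n}(w)=L^{\text{PINN}}_{n}(w)/(2\sigma^{2})+\log(\sigma\sqrt{2\pi})$, with the additive constant independent of $w$.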
This equivalent reformulation of the PINN learning problem allows us to employ the tools of statistical learning theory to analyze the training of PINNs. Specifically, we observe that, because the neural network-based forward model $\text{NN}_{w}(x,t)$ is a singular model, the PINN model $p(y\mid x,t,w)$ is also singular; that is, the parameter-to-distribution map $w\mapsto p(x,t,y\mid w)$ is not one-to-one, and the Fisher information matrix $I(w)\coloneqq\mathbb{E}_{p(x,t,y\mid w)}\left[\nabla_{w}\log p(x,t,y\mid w)\,\nabla_{w}\log p(x,t,y\mid w)^{\top}\right]$ is not everywhere positive definite in $W$. As a consequence, the PINN loss 4 is not characterized by a discrete, measure-zero, possibly infinite set of local minima that can be locally approximated quadratically. Instead, the PINN loss landscape, as is generally the case for deep learning loss landscapes, is characterized by “flat” minima, crucially even in the limit of the number of residual evaluation points $n\to\infty$. This feature allows us to better understand the role of the residual data in physics-informed learning.
Namely, it has been widely observed that the solutions to the PINN problem $\operatorname*{arg\,min}_{w}L^{\text{PINN}}_{n}(w)$ for a given loss tolerance $\epsilon$ are not unique even for a large number of residual evaluation points, and regardless of the scheme used to generate such points (uniform grids, low-discrepancy sequences, Latin hypercube sampling, random sampling, etc.), while on the other hand it has also been observed that increasing $n$ seems to “regularize” or “smoothen” the PINN loss landscape (e.g., [11]). Although one may interpret this observed smoothing as similar to the regularizing effect of explicit regularization (e.g., $\ell_{2}$ regularization or the use of a proper prior on the parameters), in fact increasing $n$ does not sharpen the loss landscape into locally quadratic minima as explicit regularization does. Therefore, residual data (and correspondingly, the physics penalty) in physics-informed learning should not be understood as a regularizer but simply as a source of indirect data.
As estimates of the solution to the PINN problem are not unique, it is necessary to characterize them through other means besides their loss value. In this work we employ the SLT-based LLC metric, which aims to characterize the “flatness” of the loss landscape of singular models. For completeness, we reproduce the definition of the LLC (introduced in [7]) in the sequel. Let $w^{\star}$ denote a local minimum and $B(w^{\star})$ a sufficiently small neighborhood of $w^{\star}$ such that $L(w)\geq L(w^{\star})$ for all $w\in B(w^{\star})$; furthermore, let $B(w^{\star},\epsilon)\coloneqq\{w\in B(w^{\star})\mid L(w)-L(w^{\star})<\epsilon\}$ denote the subset of the neighborhood of $w^{\star}$ with loss within a certain tolerance $\epsilon$, and let $V(\epsilon)$ denote the volume of this subset, that is,
$$
V(\epsilon)\coloneqq\operatorname{vol}(B(w^{\star},\epsilon))=\int_{B(w^{\star},\epsilon)}\mathrm{d}w.
$$
Then, the LLC is defined as the rational number $\lambda(w^{\star})$ such that the volume scales as $V(\epsilon)\propto\epsilon^{\lambda(w^{\star})}$ as $\epsilon\to 0$. The LLC can be interpreted as a measure of flatness because it corresponds to the rate at which the volume of solutions within the tolerance $\epsilon$ decreases as $\epsilon\to 0$, with smaller LLC values corresponding to a slower loss of volume.
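This volume-scaling definition can be probed numerically on toy losses. The sketch below (illustrative, not from the experiments in this work) estimates the exponent by Monte Carlo for a regular two-parameter quadratic minimum, where $\lambda=d/2=1$, and for the singular loss $(w_{1}w_{2})^{2}$, whose level-set volume decays as $\epsilon^{1/2}$ up to a logarithmic factor, so its fitted slope falls well below the regular value:

```python
import numpy as np

rng = np.random.default_rng(0)

def volume_slope(loss, eps_grid, n=500_000):
    """Fit the exponent lambda in V(eps) ~ eps^lambda by Monte Carlo.

    V(eps) is estimated as the fraction of uniform samples in [-1, 1]^2
    whose loss falls below eps; lambda is the log-log slope."""
    w = rng.uniform(-1.0, 1.0, size=(n, 2))
    vals = loss(w)
    fracs = np.array([np.mean(vals < eps) for eps in eps_grid])
    slope, _ = np.polyfit(np.log(eps_grid), np.log(fracs), 1)
    return slope

eps_grid = np.logspace(-3, -1, 5)

# Regular quadratic minimum: V(eps) is a disk of area pi*eps, so lambda = d/2 = 1
slope_regular = volume_slope(lambda w: w[:, 0]**2 + w[:, 1]**2, eps_grid)

# Singular minimum, flat along both axes: V(eps) ~ eps^(1/2) log(1/eps)
slope_singular = volume_slope(lambda w: (w[:, 0] * w[:, 1])**2, eps_grid)

print(slope_regular, slope_singular)  # first ≈ 1 (regular), second well below it
```

The logarithmic correction is why the fitted singular slope sits somewhat below $1/2$ over a finite range of $\epsilon$.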
As its name implies, the LLC is the local analog of the SLT (global) learning coefficient $\lambda$, which figures in the asymptotic expansion of the free energy of a singular statistical model. Specifically, $\lambda(w^{\star})\equiv\lambda$ if $w^{\star}$ is a global minimum and $B(w^{\star})$ is taken to be the entire parameter space $W$. The free energy $F_{n}$ is defined as
$$
F_{n}\coloneqq-\log\int\exp\{-nL_{n}(w)\}\varphi(w)\,\mathrm{d}w,
$$
where $\varphi(w)$ is a prior on the model parameters. Then, according to the SLT, the free energy for a singular model has the asymptotic expansion [13]
$$
F_{n}=nL_{n}(w_{0})+\lambda\log n+O_{P}(\log\log n),\quad n\to\infty,
$$
where $w_{0}$ is the parameter vector that minimizes the Kullback-Leibler divergence between the true data-generating distribution and the estimated distribution $p(y\mid x,t,w)q(x,t)$ , whereas for a regular model it has the expansion
$$
F_{n}=nL_{n}(\hat{w})+\frac{d}{2}\log n+O_{P}(1),\quad n\to\infty,
$$
where $\hat{w}$ is the maximum likelihood estimator of the model parameters, and $d$ the number of model parameters. The learning coefficient can then be interpreted as measuring model complexity, as it is the singular-model analog of the regular-model complexity $d/2$. Therefore, by computing the LLC around different estimates of the PINN solution we can gain some insight into the PINN loss landscape.
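For context, the $\frac{d}{2}\log n$ term in the regular-model expansion follows from a Laplace approximation of the free-energy integral around $\hat{w}$, where the loss is locally quadratic with a positive definite Hessian $H$:

$$
\int\exp\{-nL_{n}(w)\}\varphi(w)\,\mathrm{d}w\approx\exp\{-nL_{n}(\hat{w})\}\,\varphi(\hat{w})\left(\frac{2\pi}{n}\right)^{d/2}\left(\det H\right)^{-1/2},
$$

and taking the negative logarithm gives $F_{n}=nL_{n}(\hat{w})+\frac{d}{2}\log n+O_{P}(1)$. For singular models the Hessian is degenerate at the minimum and this approximation breaks down, which is precisely why $\lambda$ replaces $d/2$.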
3 Numerical estimation of the local learning coefficient
3.1 MCMC-based local learning coefficient estimator
While other approaches to estimating the LLC may exist (namely, via Imai’s estimator [5] or Watanabe’s estimator [14] of the real log-canonical threshold), in this work we choose to employ the estimation framework introduced in [7], which we briefly describe as follows. Let $B_{\gamma}(w^{\star})$ denote a small ball of radius $\gamma$ centered around $w^{\star}$. We introduce the analog, local to $w^{\star}$, of the free energy, together with its asymptotic expansion:
$$
\begin{split}F_{n}(B_{\gamma}(w^{\star}))&=-\log\int_{B_{\gamma}(w^{\star})}\exp\{-nL_{n}(w)\}\varphi(w)\,\mathrm{d}w\\
&=nL_{n}(w^{\star})+\lambda(w^{\star})\log n+O_{P}(\log\log n).\end{split} \tag{7}
$$
We then replace the hard restriction of the integral above to $B_{\gamma}(w^{\star})$ with a softer Gaussian weighting around $w^{\star}$ with covariance $\gamma^{-1}I_{d}$, where $\gamma>0$ is a scale parameter. Let $\mathbb{E}_{w\mid w^{\star},\beta,\gamma}[\,\cdot\,]$ denote the expectation with respect to the tempered distribution
$$
p(w\mid w^{\star},\beta,\gamma)\propto\exp\left\{-n\beta L_{n}(w)-\frac{\gamma}{2}\|w-w^{\star}\|^{2}_{2}\right\}
$$
with inverse temperature $\beta\equiv 1/\log n$; then, $\mathbb{E}_{w\mid w^{\star},\beta,\gamma}[nL_{n}(w)]$ is a good estimator of the free energy of this tempered distribution, which, depending on the choice of $\gamma$, is an approximation to the local free energy. Substituting this approximation into 7, we obtain the estimator
$$
\hat{\lambda}_{\gamma}(w^{\star})\coloneqq n\beta\left[\mathbb{E}_{w\mid w^{\star},\beta,\gamma}[L_{n}(w)]-L_{n}(w^{\star})\right]. \tag{8}
$$
In this work we compute this estimate by drawing samples $w\mid w^{\star},\beta,\gamma$ via Markov chain Monte Carlo (MCMC) simulation. Specifically, we employ the No-U-Turn Sampler (NUTS) algorithm [3]. Furthermore, to evaluate $L_{n}(w)$ we choose a fixed set of $n$ residual evaluation points drawn randomly from $q(x,t)$, with $n$ large enough that we can employ the asymptotic expansion 7 and that eq. 8 is valid. Note that $\hat{\lambda}_{\gamma}(w^{\star})$ is non-negative when $w^{\star}$ is a local minimum of $L_{n}(w)$, since then $\mathbb{E}_{w\mid w^{\star},\beta,\gamma}[L_{n}(w)]-L_{n}(w^{\star})\geq 0$. Conversely, negative estimates of the LLC imply that the MCMC chain was able to visit a significant volume of parameter space for which $L_{n}(w)<L_{n}(w^{\star})$, which means that $w^{\star}$ is not an actual minimum of $L_{n}(w)$. Given that minimization of the PINN loss via stochastic optimization algorithms is not guaranteed to produce exact minima, in practice it is possible to obtain negative estimates of the LLC.
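As a sanity check of eq. 8, the following toy numpy sketch (a random-walk Metropolis sampler stands in for NUTS, and the quadratic loss is illustrative) recovers the regular-model value $d/2=1$ for a two-parameter quadratic minimum:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy regular loss with known minimum w* = 0: L_n(w) = L* + ||w||^2 / 2.
# For a regular d-dimensional quadratic minimum the LLC should equal d/2.
d, n, gamma = 2, 10_000, 1.0
beta = 1.0 / np.log(n)
L_star = 0.123
L_n = lambda w: L_star + 0.5 * (w @ w)

def log_target(w, w_star):
    # Tempered, localized posterior: exp{-n beta L_n(w) - gamma/2 ||w - w*||^2}
    return -n * beta * L_n(w) - 0.5 * gamma * np.sum((w - w_star) ** 2)

# Random-walk Metropolis sampling of the tempered distribution
w_star = np.zeros(d)
w = w_star.copy()
lp = log_target(w, w_star)
samples = []
for it in range(60_000):
    prop = w + 0.05 * rng.standard_normal(d)
    lp_prop = log_target(prop, w_star)
    if np.log(rng.uniform()) < lp_prop - lp:
        w, lp = prop, lp_prop
    if it >= 10_000:  # discard warmup
        samples.append(L_n(w))

# Eq. (8): lambda_hat = n beta (E[L_n(w)] - L_n(w*))
llc = n * beta * (np.mean(samples) - L_n(w_star))
print(llc)  # close to d/2 = 1
```

For this quadratic loss the tempered distribution is Gaussian with per-coordinate variance $1/(n\beta+\gamma)$, so the exact value of the estimator is $\frac{d}{2}\,n\beta/(n\beta+\gamma)\approx d/2$, which the sampler reproduces up to Monte Carlo error.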
3.2 Numerical experiments
To analyze the PINN loss landscape for a concrete case, we compute and analyze the LLC for the following physics-informed learning problem: We consider the heat equation IBVP
$$
\partial_{t}u-\partial_{xx}u=f(x,t)\quad\forall x\in(0,2),\ t\in(0,2],
$$
where $f(x,t)$ is chosen so that the solution to the IBVP is
$$
u(x,t)=e^{-t}\sin(\pi x).
$$
To exactly enforce initial and boundary conditions, we employ the hard-constraints parameterization of the PINN model
$$
u(x,t,w)=\sin(\pi x)+tx(x-1)\text{NN}_{w}(x,t). \tag{9}
$$
The neural network model $\text{NN}_{w}$ is taken to be a multi-layer perceptron (MLP) with 2 hidden layers of $50$ units each. This model has a total of $d=20{,}601$ parameters. This relatively small model ensures that the wall-clock time of computing multiple MCMC chains is reasonable. The model eq. 9 is implemented using the Flax NNX library [2], and MCMC sampling is performed using the BlackJAX library [1].
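The effect of the hard-constraints parameterization eq. 9 can be seen in a small numpy sketch (the MLP below is a random-weight stand-in for the Flax NNX model, with an illustrative architecture): the initial condition holds identically, for any parameter values, because the factor $t$ multiplies the network output.

```python
import numpy as np

rng = np.random.default_rng(2)

def mlp(params, x, t):
    # Tiny tanh MLP stand-in for NN_w(x, t); the architecture is illustrative
    h = np.tanh(np.outer(x, params["W1"][0]) + np.outer(t, params["W1"][1]) + params["b1"])
    return h @ params["W2"] + params["b2"]

def u(params, x, t):
    # Hard-constraint trial function of eq. (9)
    return np.sin(np.pi * x) + t * x * (x - 1) * mlp(params, x, t)

params = {
    "W1": rng.standard_normal((2, 16)),
    "b1": rng.standard_normal(16),
    "W2": rng.standard_normal(16),
    "b2": rng.standard_normal(),
}

x = np.linspace(0.0, 2.0, 11)
# Initial condition u(x, 0) = sin(pi x) holds identically, independent of params
ic_error = np.max(np.abs(u(params, x, np.zeros_like(x)) - np.sin(np.pi * x)))
print(ic_error)  # 0.0 exactly: the factor t kills the NN term at t = 0
```

Because the constraint is satisfied structurally, no initial-condition penalty term is needed in the loss; only the governing-equation residual is penalized.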
The parameters of the model are estimated by minimizing the PINN loss eq. 4 using the Adam stochastic optimization algorithm for $50,000$ iterations. At each iteration, a new batch of residual evaluation points is drawn from $q(x,t)$ and used to evaluate the PINN loss and its gradient. Note that these batches of residual evaluation points are distinct from the set used to evaluate $L_{n}(w)$ in eq. 8, which is fixed for a given experiment. Figure 1 shows the PINN solution $u(x,t,w^{\star})$ for $w^{\star}$ estimated using a batch size of $32$ and a learning rate of $1× 10^{-4}$. This solution has an associated PINN loss of $1.854× 10^{-5}$, which, although small, shows that the true data-generating distribution $\delta(0)q(x,t)$ is not realizable within the model class 9 with the MLP architecture indicated above, due to the finite capacity of the class.
(a) PINN solution $u(x,t,w^{\star})$
(b) Pointwise errors $u(x,t)-u(x,t,w^{\star})$
Figure 1: PINN solution and estimation error computed with batch size of $32$ and Adam learning rate of $1× 10^{-4}.$
LLC estimates are computed for batch sizes of $8$, $16$, and $32$, and for Adam learning rates $\eta$ of $1× 10^{-3}$ and $1× 10^{-4}$. Furthermore, for all experiments we compute the LLC estimate using a fixed set of $256$ residual evaluation points drawn randomly from $q(x,t)$. To compute the LLC, we employ the Gaussian likelihood of residuals (5). As such, the LLC estimate depends not only on the choice of $\gamma$ but also on the standard deviation $\sigma$. For each experiment we generate 2 MCMC chains, and each chain is initialized with a warmup adaptation interval.
To inform the choice of hyperparameters of the LLC estimator we make the following observations: First, both hyperparameters control the volume of the local neighborhood employed to estimate the LLC, in the case of $\gamma$ by expanding or contracting the prior centered around $w^{\star}$, and in the case of $\sigma$ by increasing or decreasing the range of values $L_{n}(w)$ can take within the local neighborhood. Second, given that the true data-generating distribution is not realizable, the PINN loss will always be non-zero. Therefore, for the true zero residuals to be resolvable with non-trivial probability by the PINN model, we must choose $\sigma$ larger than the average optimal PINN residual (the square root of the optimal PINN loss).
Given that $\gamma$ controls the volume of the local neighborhood, we take the value $\gamma=1$. For the architecture described above, we find that the optimal PINN loss is $O(10^{-5})$, so that $\sigma$ must be chosen to be $3× 10^{-3}$ or larger. Given that $\sigma$ also controls the volume of the local neighborhood, we take the conservative value of $\sigma=1$. It is not entirely obvious whether this conservative choice overestimates or underestimates the LLC. Numerical experiments not presented in this work showed that the small choice $\sigma=1× 10^{-2}$ often led to negative LLC estimates and therefore was rejected. LLC values estimated for $\sigma=1× 10^{-1}$ were consistently slightly larger than those for $\sigma=1$, which indicates that the class of PINN models that are more physics-respecting has slightly more complexity than a less physics-respecting class. Nevertheless, given that the difference in LLC estimates is small, we proceeded with the conservative choice.
[Figure: train and test loss (left) and local learning coefficient estimates with confidence bands (right) versus Adam iteration, for learning rates $\eta=1× 10^{-4}$ and $\eta=1× 10^{-3}$.]
(a) Batch size $=8$
(b) Batch size $=16$
(c) Batch size $=32$
Figure 2: Training and test histories (left, shown every 100 iterations), and LLC estimates with 95% confidence intervals (right, shown every 1,000 iterations) for various values of batch size and Adam learning rate.
Figure 2 summarizes the results of the computational experiments. The left panels show the history of the training and test losses every $100$ iterations. Here, the test loss is the average of $(u(x,t,w)-u(x,t))^{2}$ evaluated over a $51\times 51$ uniform grid of $(x,t)$ pairs. The right panels show the LLC estimates computed every $1{,}000$ iterations. Here, care must be taken when interpreting the smaller LLC estimates seen at earlier iterations: at early iterations, the parameter estimate $w_{t}$ is not a minimum, and therefore the MCMC chains cover regions of parameter space in which $L_{n}(w)<L_{n}(w_{t})$, leading to small and possibly negative values. Modulo this observation, it can be seen that across all experiments (batch size and learning rate values), the estimated LLC is $\hat{\lambda}(w^{\star})\approx 9.5$. Additional experiments conducted by varying the pseudo-random number generator seed used to initialize $w$ also lead to the same estimate of the LLC (see fig. 3).
This number is remarkable for two reasons. First, despite the estimates of the local minimum $w^{\star}$ being vastly different across experiments, the LLC estimate is very consistent. Second, $\hat{\lambda}(w^{\star})$ is significantly smaller than the number of parameters $d=20,601$ of the PINN model: if the PINN model were regular, the LLC would equal $d/2$, whereas the estimate corresponds to an effective complexity of only $O(10)$ parameters. These results indicate that the PINN solution corresponds to a very flat region of parameter space, and that different choices of initialization and stochastic optimization hyperparameters lead to the same region. As seen in figs. 2 and 3, the lower learning rate of $1\times 10^{-4}$, due to reduced stochastic optimization noise, takes a larger number of iterations to arrive at this region than the larger learning rate of $1\times 10^{-3}$.
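As a concrete illustration of the LLC estimation procedure, the following is a minimal SGLD sketch of the estimator of [7], $\hat{\lambda}(w^{\star}) = n\beta\,(\mathbb{E}_{\beta}[L_{n}(w)] - L_{n}(w^{\star}))$ with $\beta = 1/\log n$. The quadratic test loss, localization strength `gamma`, step size `eps`, and chain length are illustrative choices, not the paper's settings, and burn-in handling is omitted for brevity:

```python
import numpy as np

def estimate_llc(loss, grad, w_star, n, gamma=1.0, eps=1e-4, n_steps=4000, seed=0):
    """SGLD sketch of the LLC estimator of [7]:
    lambda_hat = n * beta * (E_beta[L_n(w)] - L_n(w_star)), beta = 1 / log n,
    where the expectation is over a posterior tempered at beta and localized
    at w_star by a Gaussian restraint of strength gamma."""
    rng = np.random.default_rng(seed)
    beta = 1.0 / np.log(n)
    w = np.array(w_star, dtype=float)
    losses = []
    for _ in range(n_steps):
        # Langevin step on U(w) = n*beta*L_n(w) + (gamma/2) * ||w - w_star||^2
        drift = n * beta * grad(w) + gamma * (w - w_star)
        w = w - 0.5 * eps * drift + np.sqrt(eps) * rng.standard_normal(w.shape)
        losses.append(loss(w))
    return n * beta * (np.mean(losses) - loss(np.asarray(w_star, dtype=float)))

# Sanity check on a regular model, where the LLC equals d/2: for a quadratic
# loss in d = 2 dimensions the estimate should land near 1.
lam = estimate_llc(lambda w: 0.5 * w @ w, lambda w: w, np.zeros(2), n=10_000)
```

For a singular loss the same chain is run around the stochastic-optimization estimate $w_{t}$, which is the source of the early-iteration artifacts noted above.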
(a) First initialization
(b) Second initialization
Figure 3: LLC estimates with 95% confidence intervals shown every 1,000 iterations, computed for various values of batch size and Adam learning rate $\eta$ and for two initializations. Left: $\eta=1\times 10^{-4}$. Right: $\eta=1\times 10^{-3}$.
4 Discussion
A significant limitation of the proposed reformulation of the physics-informed learning problem is that it is limited to PINN models with hard initial and boundary condition constraints. In practice, physics-informed learning is often carried out without such hard constraints by penalizing not only the residual of the governing equation, $y_{\text{GE}}(x,t)\coloneqq\mathcal{L}(u(x,t,w),x,t)$, but also the residual of the initial conditions, $y_{\text{IC}}(x,w)\coloneqq u(x,0,w)$, and of the boundary conditions, $y_{\text{BC}}\coloneqq\mathcal{B}(u(x,t,w),x,t)$. These residuals can also be regarded as observables of the PINN model. As such, a possible approach to reformulating the physics-informed learning problem with multiple residuals is to introduce a single vector observable $y\coloneqq[y_{\text{GE}},y_{\text{IC}},y_{\text{BC}}]$ and consider the statistical learning problem with model $p(y\mid x,t,w)q(x,t)$. In such a formulation, the number of samples $n$ would be the same for all residuals, which goes against the common practice of using different numbers of residual evaluation points for each term of the PINN loss. Nevertheless, we note that, as indicated in section 2, PINN learning with randomized batches approximates minimizing the population loss $L(w)\equiv\lim_{n\to\infty}L_{n}(w)$, which establishes a possible equivalence between the physics-informed learning problem with multiple residuals and the reformulation described above. Furthermore, special care must be taken when considering the weighting coefficients commonly used in physics-informed learning to weight the different residuals in the PINN loss. These problems will be considered in future work.
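As a concrete sketch of the vector-observable formulation, consider a hypothetical single-mode ansatz for the heat equation $u_t = u_{xx}$ on $(0,1)$ with homogeneous Dirichlet boundary conditions and initial condition $\sin(\pi x)$ (a toy stand-in with closed-form derivatives, not the paper's PINN). All three residuals are evaluated on the same shared batch of $(x,t)$ samples, as the formulation requires:

```python
import numpy as np

def u(x, t, w):
    """Toy single-mode ansatz u(x, t, w) = w0 * sin(pi x) * exp(w1 t)."""
    return w[0] * np.sin(np.pi * x) * np.exp(w[1] * t)

def residuals(x, t, w):
    """Vector observable y = [y_GE, y_IC, y_BC], all on the same (x, t) batch."""
    u_t = w[0] * w[1] * np.sin(np.pi * x) * np.exp(w[1] * t)
    u_xx = -w[0] * np.pi**2 * np.sin(np.pi * x) * np.exp(w[1] * t)
    y_ge = u_t - u_xx                        # governing equation u_t - u_xx
    y_ic = u(x, 0.0, w) - np.sin(np.pi * x)  # initial condition u(x,0) = sin(pi x)
    y_bc = np.array([u(0.0, t, w), u(1.0, t, w)])  # Dirichlet boundaries
    return y_ge, y_ic, y_bc

def loss(w, xs, ts):
    """Single empirical loss over one shared batch of residual evaluation points."""
    y_ge, y_ic, y_bc = residuals(xs, ts, w)
    return np.mean(y_ge**2) + np.mean(y_ic**2) + np.mean(y_bc**2)

xs = np.linspace(0.0, 1.0, 21)
ts = np.linspace(0.0, 1.0, 21)
w_exact = np.array([1.0, -np.pi**2])  # exact solution: u = sin(pi x) exp(-pi^2 t)
```

At `w_exact` all three components of $y$ vanish, so the loss is zero, mirroring the true data-generating distribution $\delta(0)q(x,t)$ of section 2.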
The presented statistical learning analysis of the physics-informed learning problem has interesting implications for the quantification of uncertainty in PINN predictions. Bayesian uncertainty quantification (UQ) of PINN predictions is often carried out by choosing a prior on the model parameters and a fixed set of residual evaluation points, and computing samples from the posterior distribution of PINN parameters (see, e.g., [17, 16]). As noted in [17], MCMC sampling of the PINN loss is computationally challenging due to the singular nature of PINN models, but such an exhaustive exploration of parameter space may not be necessary: solution candidates may differ significantly in parameter space yet correspond to only a limited range of loss values, and thus to a limited range of possible predictions. In other words, because of the singular nature of the PINN models involved, this analysis justifies taking a function-space view of Bayesian UQ for physics-informed learning.
(a) $|u(x,t)-u(x,t,w^{0})|$
(b) $|u(x,t)-u(x,t,w^{1})|$
Figure 4: PINN pointwise absolute estimation errors for two different parameter vectors, $w^{0}$ and $w^{1}$, obtained via different initializations of the PINN model parameters. Both parameter estimates were computed using a batch size of 32 and an Adam learning rate of $1\times 10^{-4}$ over $100,000$ iterations. Although both cases result in similar loss $L_{n}(w)$ values ($O(10^{-5})$) and similar LLCs $\hat{\lambda}(w^{0})\approx\hat{\lambda}(w^{1})\approx 9.5$, they exhibit noticeably different extrapolation behavior.
Finally, another implication of the presented analysis relates to the extrapolation capacity of PINNs. It is understood that PINNs, particularly time-dependent PINNs, exhibit limited extrapolation capacity for times outside the training window $(0,T]$ (see, e.g., [12]). The presented analysis indicates that this is not a feature of the IBVP being solved, of the number and arrangement of residual evaluation points in space and time, or of the optimization algorithm; rather, it arises because physics-informed learning estimates model parameters by effectively minimizing $L(w)\propto\mathbb{E}_{q(x,t)}[\mathcal{L}(u(x,t,w),x,t)^{2}]$, that is, the expected loss over the input-generating distribution $q(x,t)$, whose support is the spatio-temporal training window. While PINN solution candidates along the flat minima all fall within a limited range of loss values, there is no guarantee that this holds for a different loss $L^{\prime}(w)\propto\mathbb{E}_{q^{\prime}(x,t)}[\mathcal{L}(u(x,t,w),x,t)^{2}]$, because the structure of the flat minima is tightly tied to the definition of the original loss, which in turn is tied to the input-generating distribution and thus to the spatio-temporal training window. In fact, two parameter vectors $w$ with losses $L(w)$ within tolerance can be expected to lead to vastly different extrapolation losses $L^{\prime}(w)$ (e.g., see fig. 4), consistent with extrapolation results reported in the literature.
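The dependence of the effective objective on the support of $q(x,t)$ can be made concrete with a Monte Carlo sketch. The residual function below is hypothetical, constructed to be small inside a training window $t\in(0,2]$ and to grow beyond it; the point is only that changing the sampling window changes the objective itself:

```python
import numpy as np

def population_loss(residual, t_window, n=100_000, seed=0):
    """Monte Carlo estimate of E_{q(x,t)}[residual(x,t)^2], with q uniform on
    (0,1) x t_window; changing t_window changes the objective itself."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, n)
    t = rng.uniform(t_window[0], t_window[1], n)
    return np.mean(residual(x, t) ** 2)

# Hypothetical residual: tiny inside the training window (0, 2], growing past it.
def r(x, t):
    return 1e-3 * np.exp(2.0 * np.maximum(t - 2.0, 0.0)) * np.sin(np.pi * x)

L_train = population_loss(r, (0.0, 2.0))  # loss L(w) over the training window
L_extra = population_loss(r, (2.0, 4.0))  # loss L'(w) over the extrapolation window
```

A parameter vector achieving a tiny `L_train` carries no control over `L_extra`, which here is orders of magnitude larger.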
5 Conclusions
We have presented in this work a reformulation of physics-informed learning for IBVPs as a statistical learning problem, valid for hard-constraint parameterizations. Employing this reformulation, we argue that the physics penalty in physics-informed learning should be understood not as a regularizer, as it is often interpreted (see, e.g., [6]), but as an infinite source of data, and that physics-informed learning with randomized batches of residual evaluation points is equivalent to minimizing $L(w)\propto\mathbb{E}_{q(x,t)}[\mathcal{L}(u(x,t,w),x,t)^{2}]$. Furthermore, we show that the PINN loss landscape is characterized not by discrete, measure-zero local minima, even in the limit $n\to\infty$ of the number of residual evaluation points, but by very flat local minima corresponding to a model class of low complexity relative to the number of model parameters. We characterize the flatness of local minima of the PINN loss landscape by computing the LLC for a heat equation IBVP and various initializations and choices of learning hyperparameters, and find that the estimated LLC is very similar across all initializations and hyperparameters studied, indicating that the neighborhoods around the different PINN parameter estimates obtained across the experiments exhibit similar flatness properties. Finally, we discuss the implications of the proposed analysis for the quantification of uncertainty in PINN predictions and for the extrapolation capacity of PINN models.
Acknowledgments
This material is based upon work supported by the U.S. Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing Research. Pacific Northwest National Laboratory is operated by Battelle for the DOE under Contract DE-AC05-76RL01830.
References
- [1] A. Cabezas, A. Corenflos, J. Lao, and R. Louf (2024) BlackJAX: composable Bayesian inference in JAX. External Links: 2402.10797 Cited by: §1, §3.2.
- [2] Flax: a neural network library and ecosystem for JAX External Links: Link Cited by: §1, §3.2.
- [3] M. D. Homan and A. Gelman (2014-01) The no-u-turn sampler: adaptively setting path lengths in hamiltonian monte carlo. J. Mach. Learn. Res. 15 (1), pp. 1593–1623. External Links: ISSN 1532-4435 Cited by: §1, §3.1.
- [4] J. Hoogland, G. Wang, M. Farrugia-Roberts, L. Carroll, S. Wei, and D. Murfet (2025) Loss landscape degeneracy and stagewise development in transformers. Transactions on Machine Learning Research. Note: External Links: ISSN 2835-8856, Link Cited by: §1.
- [5] T. Imai (2019-06) Estimating Real Log Canonical Thresholds. arXiv e-prints, pp. arXiv:1906.01341. External Links: Document, 1906.01341 Cited by: §3.1.
- [6] G. E. Karniadakis, I. G. Kevrekidis, L. Lu, P. Perdikaris, S. Wang, and L. Yang (2021) Physics-informed machine learning. Nature Reviews Physics 3 (6), pp. 422–440. Cited by: §1, §5.
- [7] E. Lau, Z. Furman, G. Wang, D. Murfet, and S. Wei (2025-03–05 May) The local learning coefficient: a singularity-aware complexity measure. In Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, Y. Li, S. Mandt, S. Agrawal, and E. Khan (Eds.), Proceedings of Machine Learning Research, Vol. 258, pp. 244–252. External Links: Link Cited by: §1, §2, §3.1, Statistical Learning Analysis of Physics-Informed Neural Networks.
- [8] X. Li, J. Deng, J. Wu, S. Zhang, W. Li, and Y. Wang (2024) Physical informed neural networks with soft and hard boundary constraints for solving advection-diffusion equations using fourier expansions. Computers & Mathematics with Applications 159, pp. 60–75. External Links: ISSN 0898-1221, Document, Link Cited by: §2.
- [9] L. Lu, R. Pestourie, W. Yao, Z. Wang, F. Verdugo, and S. G. Johnson (2021) Physics-informed neural networks with hard constraints for inverse design. SIAM Journal on Scientific Computing 43 (6), pp. B1105–B1132. External Links: Document, Link, https://doi.org/10.1137/21M1397908 Cited by: §2.
- [10] A. Muhamed and V. Smith (2025) The geometry of forgetting: analyzing machine unlearning through local learning coefficients. In ICML 2025 Workshop on Reliable and Responsible Foundation Models, External Links: Link Cited by: §1.
- [11] A. M. Tartakovsky and Y. Zong (2024) Physics-informed machine learning method with space-time karhunen-loève expansions for forward and inverse partial differential equations. Journal of Computational Physics 499, pp. 112723. External Links: ISSN 0021-9991, Document, Link Cited by: §2.
- [12] Y. Wang, Y. Yao, and Z. Gao (2025) An extrapolation-driven network architecture for physics-informed deep learning. Neural Networks 183, pp. 106998. External Links: ISSN 0893-6080, Document, Link Cited by: §4.
- [13] S. Watanabe (2009) Algebraic geometry and statistical learning theory. Vol. 25, Cambridge University Press. Cited by: §1, §2.
- [14] S. Watanabe (2013-03) A widely applicable bayesian information criterion. J. Mach. Learn. Res. 14 (1), pp. 867–897. External Links: ISSN 1532-4435 Cited by: §3.1.
- [15] M. Welling and Y. W. Teh (2011) Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11, Madison, WI, USA, pp. 681–688. External Links: ISBN 9781450306195 Cited by: §1.
- [16] L. Yang, X. Meng, and G. E. Karniadakis (2021) B-pinns: bayesian physics-informed neural networks for forward and inverse pde problems with noisy data. Journal of Computational Physics 425, pp. 109913. External Links: ISSN 0021-9991, Document, Link Cited by: §4.
- [17] Y. Zong, D. A. Barajas-Solano, and A. M. Tartakovsky (2025) Randomized Physics-informed Neural Networks for Bayesian Data Assimilation. Comput. Methods Appl. Mech. Eng. 436, pp. 117670. External Links: Document Cited by: §4.