# Deep learning methods for inverse problems using connections between proximal operators and Hamilton–Jacobi equations
**Authors**: Oluwatosin Akande, Gabriel P. Langlois, Akwum Onwunta
> Industrial and Systems Engineering, Lehigh University, 200 West Packer Avenue, Bethlehem, PA 18015, USA.
> Department of Mathematics, University of Illinois Urbana-Champaign, Urbana, IL, USA.
> Industrial and Systems Engineering, Lehigh University, 200 West Packer Avenue, Bethlehem, PA 18015, USA.
## Abstract
Inverse problems are important mathematical problems that seek to recover model parameters from noisy data. Since inverse problems are often ill-posed, they require regularization or incorporation of prior information about the underlying model or unknown variables. Proximal operators, ubiquitous in nonsmooth optimization, are central to this because they provide a flexible and convenient way to encode priors and build efficient iterative algorithms. They have also recently become key to modern machine learning methods, e.g., in plug-and-play methods with learned denoisers and in deep neural architectures that learn the priors underlying proximal operators. The latter was developed partly due to recent work characterizing proximal operators of nonconvex priors as subdifferentials of convex potentials. In this work, we propose to leverage connections between proximal operators and Hamilton–Jacobi partial differential equations (HJ PDEs) to develop novel deep learning architectures for learning the prior. In contrast to other existing methods, we learn the prior directly, without recourse to inverting the learned proximal operator after training. We present several numerical results that demonstrate the efficiency of the proposed method in high dimensions.
## 1 Introduction
Inverse problems are ubiquitous mathematical problems that primarily aim at recovering model parameters from noisy data. They arise in many scientific and engineering applications, e.g., recovering an image from noisy measurements, deblurring, tomographic reconstruction, and compressive sensing [AEOV2023, bertero2021introduction, isakov2017inverse, arridge2019solving]. Since inverse problems are often ill-posed, it is essential to include regularization or prior information about the underlying model or unknown variables. Proximal operators are central to this: they provide a flexible and computationally convenient way to encode priors and to build efficient iterative algorithms (e.g., proximal (sub)gradients, the alternating direction method of multipliers, and other splitting methods). More recently, proximal operators have become key ingredients for state-of-the-art machine learning methods, e.g., plug-and-play methods that replace explicit regularizers by learned denoisers [hu2023plug, jia2025plug], and deep neural architectures that parameterize proximal maps or their gradients, such as learned proximal networks (LPNs) [fang2024whats]. These developments have made proximal methods practical and powerful computational tools.
Formally, the proximal operator of a proper function $J\colon\mathbb{R}^{n}\to\mathbb{R}\cup\{+\infty\}$ is defined via observed data $\bm{x}\in\mathbb{R}^{n}$ , a parameter $t>0$ , and the minimization problem
$$
S(\bm{x},t)=\min_{\bm{y}\in\mathbb{R}^{n}}\left\{\frac{1}{2t}\left\|{\bm{x}-\bm{y}}\right\|_{2}^{2}+J(\bm{y})\right\}. \tag{1}
$$
The proximal operator $\text{prox}_{tJ}\colon\mathbb{R}^{n}\rightrightarrows\mathbb{R}^{n}$ is the set-valued function
$$
\text{prox}_{tJ}(\bm{x})=\operatorname*{arg\,min}_{\bm{y}\in\mathbb{R}^{n}}\left\{\frac{1}{2t}\left\|{\bm{x}-\bm{y}}\right\|_{2}^{2}+J(\bm{y})\right\}. \tag{2}
$$
Here, $t$ controls the trade-off between the quadratic data-fidelity term and the prior $J$ . In practice one often works directly with $\text{prox}_{tJ}$ rather than the prior.
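As a concrete illustration (ours, not from the references), Eqs. 1 and 2 can be evaluated by brute force on a grid in one dimension; for $J=|\cdot|$ the minimizer recovers the classical soft-thresholding operator.

```python
import numpy as np

def moreau_envelope_1d(J, x, t, grid):
    """Brute-force evaluation of Eq. (1) and Eq. (2) on a one-dimensional grid.

    J    : callable prior, evaluated pointwise on the grid
    x    : query point
    t    : positive trade-off parameter
    grid : candidate minimizers y
    """
    vals = 0.5 / t * (x - grid) ** 2 + J(grid)
    k = np.argmin(vals)       # index of a (discretized) global minimizer
    return vals[k], grid[k]   # S(x, t) and one element of prox_{tJ}(x)

# For J = |.| and x > t, prox_{tJ}(x) = x - t (soft-thresholding).
grid = np.linspace(-5.0, 5.0, 100001)
S, y_star = moreau_envelope_1d(np.abs, 1.0, 0.4, grid)
print(S, y_star)  # approximately 0.8 and 0.6
```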
The recent work of Gribonval and Nikolova [gribonval2020characterization] in nonsmooth optimization has extended the characterization of proximal operators with convex priors to those with nonconvex priors, showing in particular they are functions that are subdifferentials of certain convex potentials. These properties, in particular, were used in [fang2024whats] to develop new deep learning methods, called learned proximal networks (LPNs), to learn from data the underlying prior of a proximal operator.
The paper [gribonval2020characterization] did not, however, discuss the well-established, existing connections between proximal operators and Hamilton–Jacobi Partial Differential Equations (HJ PDEs) [darbon2015convex, darbon2016algorithms, darbon2021bayesian, chaudhari2018deep, osher2023hamilton]. To see these connections, consider the following HJ PDE with quadratic Hamiltonian function and whose initial data is the prior $J$ :
$$
\begin{dcases}\frac{\partial S}{\partial t}(\bm{x},t)+\frac{1}{2}\left\|{\nabla_{\bm{x}}S(\bm{x},t)}\right\|_{2}^{2}=0,&\ (\bm{x},t)\in\mathbb{R}^{n}\times(0,+\infty),\\
S(\bm{x},0)=J(\bm{x}),&\ \bm{x}\in\mathbb{R}^{n}.\end{dcases} \tag{3}
$$
If $J$ is uniformly Lipschitz continuous, then the unique viscosity solution of the HJ PDE is given by Eq. 1. Moreover, at a point of differentiability $\bm{x}$ , there holds
$$
\text{prox}_{tJ}(\bm{x})=\bm{x}-t\nabla_{\bm{x}}S(\bm{x},t). \tag{4}
$$
Moreover, the viscosity solution satisfies the crucial property that $\bm{x}\mapsto\frac{1}{2}\left\|{\bm{x}}\right\|_{2}^{2}-tS(\bm{x},t)$ is convex; that is, when paired with Eq. 4, the function $\text{prox}_{tJ}(\bm{x})$ is obtained by differentiating a convex function. This formally connects proximal operators to HJ PDEs. We emphasize that this connection was previously known and established; the (stronger) characterization obtained in [gribonval2020characterization] was, to the best of our knowledge, unknown in the theory of HJ PDEs.
In this paper, we leverage the theory of viscosity solutions of HJ PDEs to develop novel deep learning methods to learn from data the prior function $J$ in Eq. 2. To describe our approach, consider the case when the solution $(\bm{x},t)\mapsto S(\bm{x},t)$ to the HJ PDE Eq. 3 is known. (We will consider the case when only samples of it are known in the next paragraph.) This problem was investigated in [barron1999regularity, claudel2011convex, colombo2020initial, esteve2020inverse, misztela2020initial]. In particular, [esteve2020inverse] showed that when $\bm{x}\mapsto S(\bm{x},t)$ is uniformly Lipschitz continuous and $\bm{x}\mapsto\frac{1}{2}\left\|{\bm{x}}\right\|_{2}^{2}-tS(\bm{x},t)$ is convex, there exists a prior $J$ that can recover $S(\bm{x},t)$ exactly. Moreover, there is a natural candidate for the prior, obtained by reversing the time in the HJ PDE Eq. 3 and using $(\bm{x},t)\mapsto S(\bm{x},t)$ as the terminal condition. The resulting backward viscosity solution yields the prior $J_{\text{BVS}}\colon\mathbb{R}^{n}\to\mathbb{R}$ which admits the representation formula
$$
J_{\text{BVS}}(\bm{y})=\sup_{\bm{x}\in\mathbb{R}^{n}}\left\{S(\bm{x},t)-\frac{1}{2t}\left\|{\bm{x}-\bm{y}}\right\|_{2}^{2}\right\}. \tag{5}
$$
Here, $J(\bm{y})\geqslant J_{\text{BVS}}(\bm{y})$ for every $\bm{y}\in\mathbb{R}^{n}$ , with $J_{\text{BVS}}(\bm{y})=J(\bm{y})$ whenever $\bm{y}=\bm{x}-t\nabla_{\bm{x}}S(\bm{x},t)$ , where $\bm{x}$ is a point of differentiability of $\bm{x}\mapsto S(\bm{x},t)$ . Moreover,
$$
\inf_{\bm{y}\in\mathbb{R}^{n}}\left\{\frac{1}{2t}\left\|{\bm{x}-\bm{y}}\right\|_{2}^{2}+J_{\text{BVS}}(\bm{y})\right\}=S(\bm{x},t)\ \text{for every $\bm{x}\in\mathbb{R}^{n}$.}
$$
Thus the prior $J_{\text{BVS}}$ recovers the function $\bm{x}\mapsto S(\bm{x},t)$ , although in general $\text{prox}_{tJ}$ and $\text{prox}_{tJ_{\text{BVS}}}$ may not agree everywhere. Nonetheless, this provides a principled way to estimate the prior, at least when $S(\bm{x},t)$ is known.
We focus in this paper on the case when $\bm{x}\mapsto S(\bm{x},t)$ is unknown but we have access to samples $\{\bm{x}_{k},S(\bm{x}_{k},t),\nabla_{\bm{x}}S(\bm{x}_{k},t)\}_{k=1}^{K}$ with $t$ fixed. We propose to learn the prior $\bm{y}\mapsto J_{\text{BVS}}(\bm{y})$ by leveraging the crucial fact that $\bm{y}\mapsto tJ_{\text{BVS}}(\bm{y})+\frac{1}{2}\left\|{\bm{y}}\right\|_{2}^{2}$ is convex, thus enabling approaches based on deep learning and convex neural networks.
Related works: Hamilton–Jacobi PDEs are important to many scientific and engineering applications arising in e.g., optimal control [Bardi1997Optimal, fleming2006controlled, mceneaney2006max, parkinson2018optimal] and physics [Caratheodory1965CalculusI, Caratheodory1967CalculusII], inverse problems for imaging sciences [darbon2015convex, darbon2019decomposition, darbon2021bayesian, darbon2021connecting, darbon2022hamilton], optimal transport [meng2024primal, onken2021ot], game theory [BARRON1984213, Evans1984Differential, ruthotto2020machine], and machine learning [chen2024leveraging, zou2024leveraging]. Recent works focus on developing specialized solution methods for solving high-dimensional HJ PDEs, using, e.g., representation formulas or deep learning methods. These specialized methods leverage certain properties of HJ PDEs, including stochastic aspects and representation formulas [bardi1998hopf, mceneaney2006max, darbon2016algorithms, darbon2022hamilton], to approximate solutions to HJ PDEs more accurately and efficiently than general-purpose methods. See, e.g., [meng2022sympocnet, darbon2023neural, darbon2021some, darbon2020overcoming, park2025neural] for recent works along these lines and [meng2025recent] for a review of the state-of-the-art numerical methods for HJ PDEs.
Deep learning methods have become popular for computing solutions to high-dimensional PDEs as well as their inverse problems. They are popular because neural networks can be trained on data to approximate high-dimensional, nonlinear functions using efficient optimization algorithms. They have been used to approximate solutions to PDEs without any discretization with numerical grids, and for this reason they can overcome, or at least mitigate, the curse of dimensionality. There is a fairly comprehensive literature on deep learning methods for solving PDEs in general, e.g., see [beck2020overview, cuomo2022scientific, karniadakis2021physics].
Organization of this paper: We present background information on proximal operators, Hamilton–Jacobi equations, and convex neural networks in Section 2. Next, we discuss recent results concerning the inverse problem for Hamilton–Jacobi equations when the solution is available, and how they relate to proximal operators and learning priors in inverse problems, in Section 3. Our main theoretical results are presented in Section 4, where we study the inverse problem for Hamilton–Jacobi equations when only incomplete information is available about its solution. We suggest via arguments from max-plus algebra theory for Hamilton–Jacobi PDEs how to learn from data the solution to a certain Hamilton–Jacobi terminal value problem, which can then be used as an estimate for learning the prior function in a proximal operator. We present in Section 5 some numerical experiments for learning the initial data of certain Hamilton–Jacobi PDEs using convex neural networks and the theory of inverse Hamilton–Jacobi PDEs. Finally, we summarize our results in Section 6.
## 2 Background
We present here some background on proximal operators, HJ PDEs, connections between them, and convex neural networks. For comprehensive references, we refer the reader to [cannarsa2004semiconcave, evans2022partial, rockafellar2009variational].
### 2.1 Proximal operators
Let $J\colon\mathbb{R}^{n}\to\mathbb{R}\cup\{+\infty\}$ denote a proper function (i.e., $J(\bm{x})<+\infty$ for some $\bm{x}\in\mathbb{R}^{n}$ and $J(\bm{x})>-\infty$ for every $\bm{x}\in\mathbb{R}^{n}$ ). Consider the minimization problem $(\bm{x},t)\mapsto S(\bm{x},t)$ defined in Eq. 1 and its proximal operator $(\bm{x},t)\mapsto\text{prox}_{tJ}(\bm{x})$ defined in Eq. 2. We say a function $f_{t}\colon\mathbb{R}^{n}\to\mathbb{R}^{n}$ is a proximal operator of $tJ$ if $f_{t}(\bm{x})\in\text{prox}_{tJ}(\bm{x})$ for every $\bm{x}\in\mathbb{R}^{n}$ . Gribonval and Nikolova [gribonval2020characterization] proved that proximal operators are characterized in terms of the function $\psi\colon\mathbb{R}^{n}\times[0,+\infty)\to\mathbb{R}\cup\{+\infty\}$ defined by
$$
\psi(\bm{x},t)=\frac{1}{2}\left\|{\bm{x}}\right\|_{2}^{2}-tS(\bm{x},t). \tag{6}
$$
**Theorem 2.1**
*A proper function $f_{t}\colon\mathbb{R}^{n}\to\mathbb{R}^{n}$ is a proximal operator of $tJ$ if and only if $\bm{x}\mapsto\psi(\bm{x},t)$ is proper, lower semicontinuous, and convex and $f_{t}(\bm{x})\in\partial_{\bm{x}}\psi(\bm{x},t)$ . Moreover, $f_{t}$ is uniformly Lipschitz continuous with constant $L>0$ if and only if $\bm{x}\mapsto(1-1/L)\left\|{\bm{x}}\right\|_{2}^{2}/2+tJ(\bm{x})$ is proper, lower semicontinuous and convex.*
**Proof 2.2**
*See [gribonval2020characterization, Theorem 3 and Proposition 2].*
The characterization of proximal operators in Theorem 2.1 is closely related to the concepts of semiconcave and semiconvex functions.
**Definition 2.3**
*Let $\mathcal{C}\subset\mathbb{R}^{n}$ . We say $g\colon\mathcal{C}\to\mathbb{R}$ is $C$ -semiconcave with $C\geqslant 0$ if it is continuous and
$$
\lambda g(\bm{x}_{1})+(1-\lambda)g(\bm{x}_{2})-g(\lambda\bm{x}_{1}+(1-\lambda)\bm{x}_{2})\leqslant\lambda(1-\lambda)\frac{C}{2}\left\|{\bm{x}_{1}-\bm{x}_{2}}\right\|_{2}^{2}
$$
for every $\bm{x}_{1},\bm{x}_{2}\in\mathcal{C}$ and $\lambda\in[0,1]$ such that $\lambda\bm{x}_{1}+(1-\lambda)\bm{x}_{2}\in\mathcal{C}$ . We say $g$ is semiconvex if $-g$ is semiconcave.*
**Remark 2.4**
*It can be shown [cannarsa2004semiconcave, Chapter 1] that a function $g$ is $C$ -semiconcave with $C\geqslant 0$ if and only if $\bm{x}\mapsto g(\bm{x})-\frac{C}{2}\left\|{\bm{x}}\right\|_{2}^{2}$ is concave, if and only if $g=g_{1}+g_{2}$ , where $g_{1}$ is concave and $g_{2}\in C^{2}(\mathbb{R}^{n})$ with $\left\|{\nabla_{\bm{x}}^{2}g_{2}}\right\|_{\infty}\leqslant C$ .*
Combining formula Eq. 6, Definition 2.3 and Remark 2.4, we find $\bm{x}\mapsto\psi(\bm{x},t)$ is convex if and only if $\bm{x}\mapsto tS(\bm{x},t)$ is semiconcave. We will see later that semiconcavity is an important concept in the theory of HJ PDEs for characterizing their generalized solutions. But before moving on to present some background on HJ PDEs, we give below an instructive example.
**Example 2.5 (The negative absolute value prior)**
*Let $J(x)=-|x|$ and consider the one-dimensional problem
$$
S(x,t)=\min_{y\in\mathbb{R}}\left\{\frac{1}{2t}(x-y)^{2}-|y|\right\}.
$$
A global minimizer $y^{*}$ of this problem satisfies the first-order optimality condition
$$
0\in(y^{*}-x)/t-\partial|y^{*}|\iff\begin{cases}y^{*}=x+t,&\ \text{if $y^{*}>0$,}\\
0\in[x-t,x+t]\ \text{(i.e., $x\in[-t,t]$)},&\ \text{if $y^{*}=0$},\\
y^{*}=x-t,&\ \text{if $y^{*}<0$.}\end{cases}
$$
If $x>t$ , the only minimizer is $y^{*}=x+t$ . Likewise, if $x<-t$ , the only minimizer is $y^{*}=x-t$ . In either case, $S(x,t)=-|x|-\frac{t}{2}$ . If $0<x\leqslant t$ , there are several local minima (at $0$ , at $x+t$ , and, when $x<t$ , at $x-t$ ), but the global minimum is attained at $x+t$ and yields $S(x,t)=-\frac{t}{2}-x$ . Likewise, if $-t\leqslant x<0$ , the global minimum is attained at $x-t$ and yields $S(x,t)=-\frac{t}{2}+x$ . Finally, if $x=0$ , there are three local minima, $-t$ , $0$ , and $t$ ; the global minimum is attained at both $-t$ and $t$ , yielding $S(0,t)=-t/2$ . Hence we find
$$
S(x,t)=-\frac{t}{2}-|x|\quad\text{and}\quad\text{prox}_{tJ}(x)=\begin{cases}x+t,&\ \text{if $x>0$,}\\
\{-t,t\}&\ \text{if $x=0$},\\
x-t,&\ \text{if $x<0$.}\end{cases} \tag{7}
$$
Thus, a selection $f_{t}(x)\in\text{prox}_{tJ}(x)$ differs only at $x=0$ . In any case, the function $x\mapsto\psi(x,t)$ in Theorem 2.1 and its subdifferential $x\mapsto\partial_{x}\psi(x,t)$ are given by
$$
\psi(x,t)=\frac{1}{2}x^{2}-tS(x,t)=\frac{1}{2}x^{2}+t|x|+\frac{t^{2}}{2}\quad\text{and}\quad\partial_{x}\psi(x,t)=\begin{cases}x+t,\,&\text{if $x>0$},\\
[-t,t],\,&\text{if $x=0$},\\
x-t,\,&\text{if $x<0$}.\end{cases}
$$
We see that any selection $f_{t}(x)\in\text{prox}_{tJ}(x)$ satisfies $f_{t}(x)\in\partial_{x}\psi(x,t)$ .*
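Example 2.5 can be checked numerically with the same brute-force discretization used above; a small sketch of ours:

```python
import numpy as np

# Numerical check of Example 2.5: for J(y) = -|y| we expect
# S(x, t) = -t/2 - |x|, with minimizer x + t*sign(x) (and ±t at x = 0).
t = 0.7
grid = np.linspace(-20.0, 20.0, 400001)
for x in (-2.0, -0.3, 0.0, 0.3, 2.0):
    vals = 0.5 / t * (x - grid) ** 2 - np.abs(grid)
    k = np.argmin(vals)
    assert abs(vals[k] - (-t / 2 - abs(x))) < 1e-3
    print(x, grid[k])  # prints x + t*sign(x), or -t/+t when x = 0
```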
### 2.2 Hamilton–Jacobi Equations
In this section, we briefly review some elements of the theory of HJ PDEs, including the method of characteristics, viscosity solutions of HJ PDEs, and the Lax–Oleinik formula, and discuss how these concepts tie together to proximal operators. The discussion is not comprehensive; see [evans2022partial] and references therein for a more detailed treatment. To ease the presentation, we consider only the first-order HJ PDEs Eq. 3.
#### 2.2.1 Characteristic equations
The characteristic equations of Eq. 3 are given by the dynamical system
$$
\begin{cases}\dot{\bm{x}}(t)&=\bm{p}(t),\\
\dot{\bm{p}}(t)&=0,\\
\dot{\bm{z}}(t)&=\frac{1}{2}\left\|{\bm{p}(t)}\right\|_{2}^{2},\end{cases} \tag{8}
$$
where $\bm{z}(t)=S(\bm{x}(t),t)$ and $\bm{z}(0)=J(\bm{x}(0))$ . Here, $t\mapsto\bm{p}(t)$ is constant with $\bm{p}(t)\equiv\bm{p}(0)\in\mathbb{R}^{n}$ . The characteristic line that arises from $\bm{x}(0)\in\mathbb{R}^{n}$ is $\bm{x}(t)=\bm{x}(0)+t\bm{p}(0)$ , and so $\bm{z}(t)=\bm{z}(0)+\frac{t}{2}\left\|{\bm{p}(0)}\right\|_{2}^{2}$ . Taken together, we find
$$
S(\bm{x}(t),t)=J(\bm{x}(0))+\frac{t}{2}\left\|{\bm{p}(0)}\right\|_{2}^{2}.
$$
Writing $\bm{x}(t)\equiv\bm{x}$ and $\bm{p}(0)=\nabla_{\bm{x}}S(\bm{x},t)$ (assuming formally that the spatial gradient exists at $\bm{x}$ ) then $\bm{x}(0)=\bm{x}-t\nabla_{\bm{x}}S(\bm{x},t)$ , and so we find the representation
$$
S(\bm{x},t)=\frac{t}{2}\left\|{\nabla_{\bm{x}}S(\bm{x},t)}\right\|_{2}^{2}+J(\bm{x}-t\nabla_{\bm{x}}S(\bm{x},t)). \tag{9}
$$
This gives an implicit representation between $S$ , its spatial gradient, and the initial data $J$ . Next, we turn to the explicit representation of solutions to Eq. 3.
#### 2.2.2 Viscosity solutions and the Lax–Oleinik formula
The initial value problem Eq. 3 (and HJ PDEs with general Hamiltonians) may not have a unique generalized solution, i.e., those satisfying the HJ PDE almost everywhere along with the initial condition $S(\bm{x},0)=J(\bm{x})$ .
**Example 2.6**
*Let $J\equiv 0$ in Eq. 3 and take $n=1$ . The corresponding HJ PDE has infinitely many solutions: For instance, the functions $S_{1}$ and $S_{2}$ given by
$$
S_{1}(x,t)=0,\quad S_{2}(x,t)=\begin{cases}0,\,&\text{if $|x|\geqslant t/2$},\\
x-\frac{t}{2},\,&\text{if $0\leqslant x\leqslant t/2$},\\
-x-\frac{t}{2},\,&\text{if $-\frac{t}{2}\leqslant x\leqslant 0$},\end{cases}
$$
satisfy $S_{1}(x,0)=S_{2}(x,0)=J(x)=0$ and both solve the corresponding HJ PDE almost everywhere.*
The notion of viscosity solution was introduced in [crandall1983viscosity] to solve this problem. Under appropriate conditions (see [bardi1998hopf, crandall1992user, crandall1983viscosity]), the viscosity solution is unique and admits a representation formula. Specifically, for the initial value problem Eq. 3 with uniformly Lipschitz continuous initial data $J$ , the unique viscosity solution is given by the Lax–Oleinik formula (with quadratic Hamiltonian)
$$
S(\bm{x},t)=\inf_{\bm{y}\in\mathbb{R}^{n}}\left\{\frac{1}{2t}\left\|{\bm{x}-\bm{y}}\right\|_{2}^{2}+J(\bm{y})\right\}. \tag{10}
$$
The (unique) viscosity solution has two important properties. First, the function $\bm{x}\mapsto S(\bm{x},t)$ is (1/t)-semiconcave. This is equivalent to requiring the function $\bm{x}\mapsto\psi(\bm{x},t)$ defined in Eq. 6 to be convex, exactly as stipulated in Theorem 2.1. Second, at any point of differentiability of $\bm{x}\mapsto S(\bm{x},t)$ , there holds
$$
\nabla_{\bm{x}}S(\bm{x},t)=\frac{\bm{x}-f_{t}(\bm{x})}{t}\iff f_{t}(\bm{x})=\bm{x}-t\nabla_{\bm{x}}S(\bm{x},t), \tag{11}
$$
where $f_{t}(\bm{x})$ denotes a global minimizer in Eq. 10. Note that substituting this expression into formula Eq. 9, obtained from the characteristic equations, yields Eq. 10, as expected.
**Example 2.7 (The negative absolute value prior, continued.)**
*Let $J(x)=-|x|$ in the (one-dimensional) first-order HJ PDE Eq. 3. The function $J$ is uniformly Lipschitz continuous and, as such, the Lax–Oleinik formula $S(x,t)=-\frac{t}{2}-|x|$ is the unique viscosity solution of the corresponding HJ PDE. Note $x\mapsto S(x,t)$ is differentiable everywhere except at $x=0$ and $\text{prox}_{tJ}(x)=x-t\nabla_{x}S(x,t)$ everywhere except at $x=0$ (see Eq. 7).*
In summary, a function $f_{t}$ satisfying $f_{t}(\bm{x})=\bm{x}-t\nabla_{\bm{x}}S(\bm{x},t)$ at points of differentiability is a proximal operator of $tJ$ whenever $(\bm{x},t)\mapsto S(\bm{x},t)$ is the viscosity solution of the HJ initial value problem Eq. 3. The minimization problem underlying $\text{prox}_{tJ}(\bm{x})$ is exactly the Lax–Oleinik representation formula of the viscosity solution of Eq. 3. We will see in the next section how to leverage these connections for learning the prior when $\bm{x}\mapsto J(\bm{x})$ is not available but $(\bm{x},t)\mapsto S(\bm{x},t)$ is. But before proceeding, we briefly review convex neural networks, which will be used later in this work.
### 2.3 Convex neural networks
Convex Neural Networks, specifically Input Convex Neural Networks (ICNN), were introduced by [amos2017input] to allow for the efficient optimization of neural networks within structured prediction and reinforcement learning tasks. The core premise of an ICNN is to constrain the network architecture such that the output is a convex function with respect to the input.
To achieve convexity, the network typically employs a recursive structure for $k=0,\dots,j-1$
$$
\bm{z}_{k+1}=g(\bm{W}_{k}\bm{z}_{k}+\bm{H}_{k}\bm{y}+\bm{b}_{k}),\qquad f(\bm{y};\theta)=\bm{z}_{j}, \tag{12}
$$
where $\bm{y}$ and $\bm{z}_{k}$ denote the input to the network and the hidden features at layer $k$ , respectively, and $g$ is the activation function (by convention, the first layer omits the $\bm{W}$ term). To guarantee convexity of the output with respect to the input $\bm{y}$ , two constraints are imposed on the parameters and the activation function: (i) the weights $\bm{W}_{k}$ , which connect the previous hidden layer to the current one, must be non-negative ( $\bm{W}_{k}\geqslant 0$ entrywise), and (ii) the activation function $g$ must be convex and non-decreasing [fang2024whats].
Following [fang2024whats, Proposition 3.1], Fang et al. leverage the ICNN architecture and the characterization of proximal operators to develop learned proximal networks (LPNs) for inverse problems. LPNs require stricter conditions than standard ICNNs: while standard ICNNs often use ReLU activations, LPNs require the activation function $g$ to be twice continuously differentiable. This smoothness is essential to ensure that the proximal operator is the gradient of a twice continuously differentiable function [gribonval2020characterization, Theorem 2]. Consequently, LPNs typically use smooth activations such as the softplus function, a smooth approximation of ReLU with sharpness parameter $\beta$ [fang2024whats, Section 3].
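To make the architecture concrete, below is a minimal PyTorch sketch of the recursion Eq. 12 under the LPN-style constraints; the class name and the softplus reparameterization of the nonnegative weights are our own choices and may differ from the official LPN implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICNN(nn.Module):
    """Minimal input-convex network implementing the recursion in Eq. (12):
    z_{k+1} = g(W_k z_k + H_k y + b_k), with f(y; theta) = z_j.

    The output is convex in y because the hidden-to-hidden weights W_k are
    kept nonnegative (here via a softplus reparameterization) and the
    activation g = softplus is convex, non-decreasing, and C^2, matching
    the smoothness LPNs require.
    """

    def __init__(self, dim, hidden=256, layers=2, beta=5.0):
        super().__init__()
        self.beta = beta
        # H_k maps the input y into every layer; these weights are unconstrained.
        self.H = nn.ModuleList([nn.Linear(dim, hidden) for _ in range(layers)])
        # Raw parameters, reparameterized to nonnegative weights in forward().
        self.W_raw = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(hidden, hidden)) for _ in range(layers - 1)]
        )
        self.out_raw = nn.Parameter(0.01 * torch.randn(1, hidden))
        self.out_bias = nn.Parameter(torch.zeros(1))

    def forward(self, y):
        z = F.softplus(self.H[0](y), beta=self.beta)
        for H_k, W_raw in zip(self.H[1:], self.W_raw):
            W_k = F.softplus(W_raw)  # W_k >= 0 preserves convexity in y
            z = F.softplus(z @ W_k.T + H_k(y), beta=self.beta)
        # The output layer must also use nonnegative weights.
        return z @ F.softplus(self.out_raw).T + self.out_bias

# Per Theorem 2.1, the proximal map is then the gradient of this potential:
# y = torch.randn(32, 8, requires_grad=True)
# prox = torch.autograd.grad(ICNN(8)(y).sum(), y)[0]
```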
## 3 Connections between learning priors and the inverse problem for Hamilton–Jacobi Equations
In this section, we discuss the inverse problem of learning the prior in the proximal operator Eq. 2: given $t>0$ and some function $\bm{x}\mapsto S(\bm{x},t)$ , assess whether there exists a prior function $J$ that can recover $\bm{x}\mapsto S(\bm{x},t)$ and, if so, estimate it. Due to the connections between proximal operators and HJ equations discussed in Subsections 2.1 – 2.2, our starting point will be to discuss the inverse problem from the point of view of HJ equations.
We summarize in the next subsection some of the main results for this problem, based on the results of [esteve2020inverse] and other related works [claudel2011convex, colombo2020initial, misztela2020initial].
### 3.1 Reachability and inverse problems for Hamilton–Jacobi equations
We consider here the inverse problem associated with the HJ initial value problem Eq. 3: given $t>0$ and a function $(\bm{x},t)\mapsto S(\bm{x},t)$ , identify the set of initial data $J\colon\mathbb{R}^{n}\to\mathbb{R}$ such that the viscosity solution of Eq. 3 coincides with $S(\bm{x},t)$ at time $t$ . That is, we wish to characterize the set
$$
I_{t}(S)\coloneqq\left\{J\colon\mathbb{R}^{n}\to\mathbb{R}\ \text{uniformly Lipschitz continuous}\,:\,\text{$S(\bm{x},t)$ is obtained from Eq. 3 at time $t$}\right\}. \tag{13}
$$
We say the function $(\bm{x},t)\mapsto S(\bm{x},t)$ is reachable if the set $I_{t}(S)$ is nonempty. The main reachability result for the initial value problem Eq. 3 is the following:
**Theorem 3.1**
*Suppose $\bm{x}\mapsto S(\bm{x},t)$ is uniformly Lipschitz continuous. Then the set $I_{t}(S)$ defined in Eq. 13 is nonempty if and only if $\bm{x}\mapsto tS(\bm{x},t)$ is semiconcave.*
**Proof 3.2**
*This follows from [esteve2020inverse, Theorem 2.2, Theorem 6.1, and Definition 6.2].*
Now, assume $(\bm{x},t)\mapsto S(\bm{x},t)$ is reachable. What can be said about the nonempty set $I_{t}(S)$ ? Since $(\bm{x},t)\mapsto S(\bm{x},t)$ is obtained from evolving forward in time the prior function $J$ from $0$ to $t$ according to Eq. 3, a natural approach is to do the opposite: evolve backward in time from $t$ to $0$ the function $\bm{x}\mapsto S(\bm{x},t)$ . That is, we consider the terminal value problem
$$
\begin{dcases}\frac{\partial w}{\partial\tau}(\bm{y},\tau)+\frac{1}{2}\left\|{\nabla_{\bm{y}}w(\bm{y},\tau)}\right\|_{2}^{2}=0,&\ (\bm{y},\tau)\in\mathbb{R}^{n}\times[0,t),\\
w(\bm{y},t)=S(\bm{y},t),&\ \bm{y}\in\mathbb{R}^{n}.\end{dcases} \tag{14}
$$
Under appropriate conditions, the terminal-value problem Eq. 14 has a unique viscosity solution:
**Theorem 3.3**
*Suppose $\bm{x}\mapsto S(\bm{x},t)$ is uniformly Lipschitz continuous and semiconcave. Then the viscosity solution of the terminal-value problem Eq. 14 exists, is unique, and is given by the representation formula
$$
w(\bm{y},\tau)=\sup_{\bm{x}\in\mathbb{R}^{n}}\left\{S(\bm{x},t)-\frac{1}{2(t-\tau)}\left\|{\bm{x}-\bm{y}}\right\|_{2}^{2}\right\},\quad 0\leqslant\tau<t. \tag{15}
$$
Moreover, the function $\bm{y}\mapsto(t-\tau)w(\bm{y},\tau)$ is semiconvex with unit constant.*
**Proof 3.4**
*See [barron1999regularity, Section 4, Equation 4.4.2] and [cannarsa2004semiconcave, Chapter 1].*
The viscosity solution of Eq. 14 is sometimes called the backward viscosity solution (BVS) to distinguish it from the viscosity solution of the initial value problem Eq. 3. The BVS at $\tau=0$ corresponds to fully evolving the function $\bm{x}\mapsto S(\bm{x},t)$ backward in time. In what follows, we write $J_{\text{BVS}}\coloneqq w(\cdot,0)$ . We can use Eq. 6 to write
$$
tJ_{\text{BVS}}(\bm{y})+\frac{1}{2}\left\|{\bm{y}}\right\|_{2}^{2}=\sup_{\bm{x}\in\mathbb{R}^{n}}\left\{\langle\bm{x},\bm{y}\rangle-\psi(\bm{x},t)\right\}. \tag{16}
$$
The right-hand side is the convex conjugate of $\bm{x}\mapsto\psi(\bm{x},t)$ evaluated at $\bm{y}$ , which is well-defined because $\bm{x}\mapsto\psi(\bm{x},t)$ is proper, lower semicontinuous, and convex.
Theorem 3.3 suggests that $J_{\text{BVS}}$ is an initial condition that can reach $\bm{x}\mapsto S(\bm{x},t)$ . The next result stipulates that this is correct and that $J_{\text{BVS}}$ is “optimal”, in the sense that it bounds from below any other reachable initial condition $J\in I_{t}(S)$ .
**Theorem 3.5**
*Let $J_{\text{BVS}}$ denote the solution of the backward HJ terminal value problem Eq. 14 at time $\tau=0$ . Then $J\in I_{t}(S)$ if and only if $J(\bm{y})\geqslant J_{\text{BVS}}(\bm{y})$ for every $\bm{y}\in\mathbb{R}^{n}$ , with equality for every $\bm{y}\in X_{t}(S)$ , where
$$
X_{t}(S)\coloneqq\left\{\bm{x}-t\nabla_{\bm{x}}S(\bm{x},t):\text{$\bm{x}\mapsto S(\bm{x},t)$ is differentiable at $\bm{x}\in\mathbb{R}^{n}$}\right\}.
$$*
**Proof 3.6**
*See [esteve2020inverse, Theorems 2.3 and 2.4].*
Theorem 3.5 stipulates that $J_{\text{BVS}}$ equals $J$ everywhere on the set $X_{t}(S)$ and bounds it from below elsewhere. This is a fundamental consequence of the semiconcavity of $\bm{x}\mapsto S(\bm{x},t)$ , which regularizes the backward viscosity solution of Eq. 14. We illustrate this below with the negative absolute value prior.
**Example 3.7 (The negative absolute value prior, continued.)**
*Let $J(x)=-|x|$ in the (one-dimensional) first-order HJ PDE Eq. 3. Recall that the unique viscosity solution is given by the Lax–Oleinik formula $S(x,t)=-\frac{t}{2}-|x|$ . We now compute the corresponding unique backward viscosity solution to the terminal-value problem Eq. 14, which is well-defined because $x\mapsto S(x,t)$ is uniformly Lipschitz continuous and concave. We have
$$
J_{\text{BVS}}(x)=\sup_{y\in\mathbb{R}}\left\{-\frac{t}{2}-|y|-\frac{1}{2t}(x-y)^{2}\right\}=-\frac{t}{2}-\inf_{y\in\mathbb{R}}\left\{\frac{1}{2t}(x-y)^{2}+|y|\right\}.
$$
The infimum on the right-hand side is attained at the proximal operator of the function $y\mapsto|y|$ , which is the soft-thresholding operator:
$$
\operatorname*{arg\,min}_{y\in\mathbb{R}}\left\{\frac{1}{2t}(x-y)^{2}+|y|\right\}=\begin{cases}x-t,\,&\text{if $x>t$},\\
0,\,&\text{if $x\in[-t,t]$},\\
x+t,\,&\text{if $x<-t$}.\end{cases}
$$
This gives
$$
J_{\text{BVS}}(x)=\begin{cases}-x,\,&\text{if $x>t$},\\
-\frac{t}{2}-\frac{x^{2}}{2t},\,&\text{if $x\in[-t,t]$},\\
x,\,&\text{if $x<-t$}.\end{cases}
$$
Here, a simple calculation shows $X_{t}(S)=(-\infty,-t)\cup(t,+\infty)$ , and we find $J(x)>J_{\text{BVS}}(x)$ on $(-t,t)$ , as expected from Theorem 3.5. Moreover,
$$
tJ_{\text{BVS}}(x)+\frac{1}{2}x^{2}=\begin{cases}\frac{1}{2}(x-t)^{2}-\frac{t^{2}}{2},\,&\text{if $x>t$},\\
-\frac{t^{2}}{2},\,&\text{if $x\in[-t,t]$},\\
\frac{1}{2}(x+t)^{2}-\frac{t^{2}}{2},\,&\text{if $x<-t$},\end{cases}
$$
and we observe $x\mapsto tJ_{\text{BVS}}(x)+\frac{1}{2}x^{2}$ is convex, as expected from Theorem 3.3.*
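These closed forms also give a convenient numerical cross-check of Eq. 16 via a discrete Legendre transform; a small sketch of ours:

```python
import numpy as np

# Cross-check of Eq. (16) against the closed forms of Examples 2.5 and 3.7.
t = 0.5
xs = np.linspace(-30.0, 30.0, 200001)
psi = 0.5 * xs**2 + t * np.abs(xs) + 0.5 * t**2  # psi(x, t) from Example 2.5

for y in (-1.0, -0.2, 0.0, 0.2, 1.0):
    conj = np.max(y * xs - psi)  # sup_x { <x, y> - psi(x, t) }
    J_bvs = -abs(y) if abs(y) > t else -t / 2 - y**2 / (2 * t)  # Example 3.7
    assert abs(conj - (t * J_bvs + 0.5 * y**2)) < 1e-3  # Eq. (16)
```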
The results here apply when the function $\bm{x}\mapsto S(\bm{x},t)$ is known. What happens when only a finite set of values of this function are available?
## 4 Learning priors and the inverse problem for Hamilton–Jacobi Equations with incomplete information
In this section, we consider the inverse problem of learning the prior in the proximal operator Eq. 2 with incomplete information: given $t>0$ and a set of samples $\{\bm{x}_{k},S(\bm{x}_{k},t),\nabla_{\bm{x}}S(\bm{x}_{k},t)\}_{k=1}^{K}$ , estimate the prior $J$ that best recovers $\bm{x}\mapsto S(\bm{x},t)$ . Recall from Theorem 3.1 that when $\bm{x}\mapsto S(\bm{x},t)$ is uniformly Lipschitz continuous, $\bm{x}\mapsto S(\bm{x},t)$ is reachable if and only if it is semiconcave. In this case, the prior $\bm{x}\mapsto J_{\text{BVS}}(\bm{x})$ obtained from the HJ terminal value problem Eq. 14 provides a prior function that recovers $(\bm{x},t)\mapsto S(\bm{x},t)$ exactly. Hence we will focus on studying how to approximate the prior $J_{\text{BVS}}$ from a set of samples.
Note that if the triplet $(\bm{x}_{k},S(\bm{x}_{k},t),\nabla_{\bm{x}}S(\bm{x}_{k},t))$ is known, then (i) the function $\bm{x}\mapsto S(\bm{x},t)$ is differentiable at $\bm{x}_{k}$ and (ii) the unique minimizer in the Lax–Oleinik formula Eq. 10 can be represented via Eq. 11:
$$
S(\bm{x}_{k},t)=\frac{1}{2t}\left\|{\bm{x}_{k}-\bm{y}_{k}}\right\|_{2}^{2}+J(\bm{y}_{k}),\,\text{with}\ \bm{y}_{k}=\bm{x}_{k}-t\nabla_{\bm{x}}S(\bm{x}_{k},t). \tag{17}
$$
Moreover, Theorem 3.5 and formula Eq. 6 imply that $J(\bm{y}_{k})=J_{\text{BVS}}(\bm{y}_{k})$ and that $\bm{y}\mapsto tJ_{\text{BVS}}(\bm{y})+\frac{1}{2}\left\|{\bm{y}}\right\|_{2}^{2}$ is convex. Thus one possible approach for estimating $J_{\text{BVS}}$ is to approximate $\bm{y}\mapsto tJ_{\text{BVS}}(\bm{y})+\frac{1}{2}\left\|{\bm{y}}\right\|_{2}^{2}$ piecewise from below at the points $\left\{\bm{y}_{k}\right\}_{k=1}^{K}$ .
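In code, this conversion from samples of the solution to samples of the prior (Eq. 17 together with Theorem 3.5) is immediate; a sketch of ours, with hypothetical array names:

```python
import numpy as np

def samples_to_prior_values(x, S, gradS, t):
    """Convert samples {x_k, S(x_k, t), grad_x S(x_k, t)} into prior samples
    {y_k, J_BVS(y_k)} via Eq. (17):
        y_k = x_k - t grad_x S(x_k, t),
        J_BVS(y_k) = S(x_k, t) - ||x_k - y_k||^2 / (2 t).

    x: (K, n) array, S: (K,) array, gradS: (K, n) array.
    """
    y = x - t * gradS
    J = S - np.sum((x - y) ** 2, axis=1) / (2 * t)
    return y, J
```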
We consider the problem of approximating $J_{\text{BVS}}$ piecewise from below and its implications in Section 4.1. This approximation problem turns out to be closely related to max-plus algebra theory for approximating solutions to HJ PDEs [akian2006max, fleming2000max, gaubert2011curse]; we discuss this in Section 4.2. We then consider in Section 4.3 the more general problem of learning a convex function to approximate $\bm{y}\mapsto tJ_{\text{BVS}}(\bm{y})+\frac{1}{2}\left\|{\bm{y}}\right\|_{2}^{2}$ directly, applying the discussions in Sections 4.1 and 4.2.
### 4.1 Piecewise approximations
We consider here piecewise approximations of the prior $\bm{y}\mapsto J_{\text{BVS}}(\bm{y})$ using the samples $\{\bm{x}_{k},S(\bm{x}_{k},t),\nabla_{\bm{x}}S(\bm{x}_{k},t)\}_{k=1}^{K}$ and formula Eq. 17. We consider first using a piecewise affine minorant (PAM) approximation, and then, assuming some regularity on $J_{\text{BVS}}$ , using a piecewise quadratic minorant (PQM) approximation.
#### 4.1.1 Piecewise affine approximation
We first consider the PAM approximation of the convex function $\bm{y}\mapsto tJ_{\text{BVS}}(\bm{y})+\frac{1}{2}\left\|{\bm{y}}\right\|_{2}^{2}$ :
$$
tJ_{\text{PAM}}(\bm{y})+\frac{1}{2}\left\|{\bm{y}}\right\|_{2}^{2}\coloneqq\max_{k\in\{1,\dots,K\}}\left\{tJ_{\text{BVS}}(\bm{y}_{k})+\frac{1}{2}\left\|{\bm{y}_{k}}\right\|_{2}^{2}+\left\langle\bm{x}_{k},\bm{y}-\bm{y}_{k}\right\rangle\right\}. \tag{18}
$$
Then $J_{\text{PAM}}(\bm{y})\leqslant J_{\text{BVS}}(\bm{y})$ for every $\bm{y}\in\mathbb{R}^{n}$ , with $J_{\text{PAM}}(\bm{y}_{k})=J_{\text{BVS}}(\bm{y}_{k})$ at each $k\in\{1,\dots,K\}$ . A short calculation gives
$$
tJ_{\text{PAM}}(\bm{y})=\max_{k\in\{1,\dots,K\}}\left\{tJ_{\text{BVS}}(\bm{y}_{k})+\frac{1}{2}\left\|{\bm{x}_{k}-\bm{y}_{k}}\right\|_{2}^{2}-\frac{1}{2}\left\|{\bm{x}_{k}-\bm{y}}\right\|_{2}^{2}\right\}.
$$
How good is $J_{\text{PAM}}$ as initial condition for the HJ PDE Eq. 3? In light of Theorem 3.5, $J_{\text{PAM}}$ , unsurprisingly, cannot reconstruct $\bm{x}\mapsto S(\bm{x},t)$ . Indeed, a formal calculation yields
$$
\inf_{\bm{y}\in\mathbb{R}^{n}}\left\{\frac{1}{2t}\left\|{\bm{x}-\bm{y}}\right\|_{2}^{2}+J_{\text{PAM}}(\bm{y})\right\}=\begin{cases}S(\bm{x}_{k},t)&\ \text{if $\bm{x}=\bm{x}_{k}$, $k\in\{1,\dots,K\}$},\\
+\infty,&\ \text{otherwise}.\end{cases} \tag{19}
$$
See Section A.1 for details. Thus approximating $J_{\text{BVS}}$ via its PAM approximation recovers the samples $\{S(\bm{x}_{k},t)\}_{k=1}^{K}$ but nothing else.
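For reference, the closed form for $tJ_{\text{PAM}}$ above admits a direct vectorized evaluation; a sketch of ours:

```python
import numpy as np

def J_PAM(y_query, x, y, tJ, t):
    """Evaluate the closed form of t J_PAM below Eq. (18) at rows of y_query.

    x, y : (K, n) arrays of x_k and y_k = x_k - t grad_x S(x_k, t)
    tJ   : (K,) array of the values t J_BVS(y_k)
    """
    c = tJ + 0.5 * np.sum((x - y) ** 2, axis=1)                           # (K,)
    d = 0.5 * np.sum((x[None, :, :] - y_query[:, None, :]) ** 2, axis=2)  # (Q, K)
    return (c[None, :] - d).max(axis=1) / t                               # (Q,)
```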
#### 4.1.2 Piecewise quadratic approximation
Here, we assume $\bm{y}\mapsto tJ_{\text{BVS}}(\bm{y})$ is semiconvex with constant $1-\alpha$ with $\alpha>0$ , so that $\bm{y}\mapsto tJ_{\text{BVS}}(\bm{y})+\frac{1}{2}\left\|{\bm{y}}\right\|_{2}^{2}$ is $\alpha$ -strongly convex. We can then approximate this strongly convex function via its PQMs:
$$
tJ_{\text{PQM}}(\bm{y})+\frac{1}{2}\left\|{\bm{y}}\right\|_{2}^{2}\coloneqq\max_{k\in\{1,\dots,K\}}\left\{tJ_{\text{BVS}}(\bm{y}_{k})+\frac{1}{2}\left\|{\bm{y}_{k}}\right\|_{2}^{2}+\left\langle\bm{x}_{k},\bm{y}-\bm{y}_{k}\right\rangle+\frac{\alpha}{2}\left\|{\bm{y}-\bm{y}_{k}}\right\|_{2}^{2}\right\}.
$$
Then, $J_{\text{PQM}}(\bm{y})\leqslant J_{\text{BVS}}(\bm{y})$ for every $\bm{y}\in\mathbb{R}^{n}$ , with $J_{\text{PQM}}(\bm{y}_{k})=J_{\text{BVS}}(\bm{y}_{k})$ at each $k\in\{1,\dots,K\}$ . Moreover, a short calculation gives
$$
tJ_{\text{PQM}}(\bm{y})=\max_{k\in\{1,\dots,K\}}\left\{tJ_{\text{BVS}}(\bm{y}_{k})+\frac{1}{2}\left\|{\bm{x}_{k}-\bm{y}_{k}}\right\|_{2}^{2}-\frac{1}{2}\left\|{\bm{x}_{k}-\bm{y}}\right\|_{2}^{2}+\frac{\alpha}{2}\left\|{\bm{y}-\bm{y}_{k}}\right\|_{2}^{2}\right\}. \tag{20}
$$
How good is $J_{\text{PQM}}$ as an initial condition for the HJ PDE Eq. 3? Again, in light of Theorem 3.5, $J_{\text{PQM}}$ cannot reconstruct $\bm{x}\mapsto S(\bm{x},t)$ . Nonetheless, a formal calculation yields
$$
\inf_{\bm{y}\in\mathbb{R}^{n}}\left\{\frac{1}{2t}\left\|{\bm{x}-\bm{y}}\right\|_{2}^{2}+J_{\text{PQM}}(\bm{y})\right\}=\frac{1}{2t}\left\|{\bm{x}-\bm{y}_{k}}\right\|_{2}^{2}+\frac{1}{2t\alpha}\left\|{\bm{x}-\bm{x}_{k}}\right\|_{2}^{2} \tag{21}
$$
for some $k\in\{1,\dots,K\}$ . See Section A.2 for more details. Hence $J_{\text{PQM}}$ leads to an approximation of $(\bm{x},t)\mapsto S(\bm{x},t)$ that is finite everywhere. In the next section, we describe how max-plus algebra theory [akian2006max, fleming2000max, gaubert2011curse] can be used to quantify the approximation errors more precisely.
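The PQM evaluates analogously to the PAM, with the extra quadratic term from Eq. 20; a sketch of ours:

```python
import numpy as np

def J_PQM(y_query, x, y, tJ, t, alpha):
    """Evaluate Eq. (20): the PAM terms plus the quadratic (alpha/2)||y - y_k||^2."""
    c = tJ + 0.5 * np.sum((x - y) ** 2, axis=1)
    d = 0.5 * np.sum((x[None, :, :] - y_query[:, None, :]) ** 2, axis=2)
    q = 0.5 * alpha * np.sum((y[None, :, :] - y_query[:, None, :]) ** 2, axis=2)
    return (c[None, :] - d + q).max(axis=1) / t
```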
### 4.2 Max-plus algebra theory for Hamilton–Jacobi PDEs and approximation results
We consider here max-plus algebra techniques for approximating solutions to certain HJ PDEs. Let $\alpha>0$ and let $\Psi\colon\mathbb{R}^{n}\to\mathbb{R}$ denote a $(1-\alpha)$ -semiconvex function. Following [gaubert2011curse, Section III], we approximate $\Psi$ using $K$ vectors $\{\bm{p}_{k}\}_{k=1}^{K}\subset\mathbb{R}^{n}$ , the $K$ semiconvex functions $\bm{y}\mapsto\langle\bm{p}_{k},\bm{y}\rangle-\frac{1}{2}\left\|{\bm{y}}\right\|_{2}^{2}$ , and a function $a\colon\mathbb{R}^{n}\to\mathbb{R}\cup\{+\infty\}$ :
$$
\Psi_{\text{MP}}(\bm{y})\coloneqq\max_{k\in\{1,\dots,K\}}\left\{\langle\bm{p}_{k},\bm{y}\rangle-\frac{1}{2}\left\|{\bm{y}}\right\|_{2}^{2}-a(\bm{p}_{k})\right\}. \tag{22}
$$
Here, we suppose the vectors $\{\bm{p}_{k}\}_{k=1}^{K}$ and $\bm{p}\mapsto a(\bm{p})$ are selected so that $\Psi_{\text{MP}}(\bm{y})\leqslant\Psi(\bm{y})$ . As discussed in Section 4.1, such a selection is possible via piecewise quadratic minorants of the $\alpha$ -strongly convex function $\bm{y}\mapsto\Psi(\bm{y})+\frac{1}{2}\left\|{\bm{y}}\right\|_{2}^{2}$ . Let $\mathcal{Y}$ denote a full-dimensional compact, convex subset of $\mathbb{R}^{n}$ and consider the $L_{\infty}$ error
$$
\epsilon_{\infty}(\Psi,K,\mathcal{Y},\Psi_{\text{MP}})\coloneqq\sup_{\bm{y}\in\mathcal{Y}}|\Psi(\bm{y})-\Psi_{\text{MP}}(\bm{y})|.
$$
Furthermore, we define the corresponding minimal $L_{\infty}$ error as
$$
\delta_{\infty}(\Psi,K,\mathcal{Y})=\inf_{\Psi_{\text{MP}}\leqslant\Psi}\epsilon_{\infty}(\Psi,K,\mathcal{Y},\Psi_{\text{MP}}).
$$
The following result from max-plus algebra theory, proven in [gaubert2011curse], stipulates that whatever vectors $\{\bm{p}_{k}\}_{k=1}^{K}$ and function $\bm{p}\mapsto a(\bm{p})$ are used to approximate $\Psi$ , the minimal $L_{\infty}$ error scales as an inverse power law in $K$ and the dimension $n$ in the limit $K\to+\infty$ .
**Theorem 4.1 (Gaubert et al. (2011))**
*Let $\alpha>0$ , and let $\mathcal{Y}$ denote a full-dimensional compact, convex subset of $\mathbb{R}^{n}$ . If $\Psi\colon\mathbb{R}^{n}\to\mathbb{R}$ is twice continuously differentiable and $(1-\alpha)$ -semiconvex, then there exists a constant $\beta(n)>0$ depending only on $n$ such that
$$
\delta_{\infty}(\Psi,K,\mathcal{Y})\sim\beta(n)\left(\frac{1}{K}\int_{\mathcal{Y}}\det\left(\nabla_{\bm{y}}^{2}\Psi(\bm{y})+\bm{I}_{n\times n}\right)^{\frac{1}{2}}\mathop{}\!d\bm{y}\right)^{2/n} \tag{23}
$$
as $K\to+\infty$ .*
Thus the minimal $L_{\infty}$ error is $\Omega(1/K^{2/n})$ as $K\to+\infty$ , with a prefactor that is smaller the smaller the determinant $\det\left(\nabla_{\bm{y}}^{2}\Psi(\bm{y})+\bm{I}_{n\times n}\right)$ is over $\mathcal{Y}$ .
### 4.3 Applications to the inverse problem for Hamilton–Jacobi Equations
We consider here the problem of quantifying approximations of the prior function $\bm{y}\mapsto J_{\text{BVS}}(\bm{y})$ when the latter is sufficiently regular and when we have access to the samples $\{\bm{x}_{k},S(\bm{x}_{k},t),\nabla_{\bm{x}}S(\bm{x}_{k},t)\}_{k=1}^{K}$ . Max-plus algebra theory provides us with a first approximation result:
**Corollary 4.2**
*Let $t>0$ and assume $tJ_{\text{BVS}}$ is twice continuously differentiable and $(1-\alpha)$ -semiconvex with $\alpha>0$ . Let $\mathcal{Y}$ denote a full-dimensional compact, convex set of $\mathbb{R}^{n}$ . Then there exists a constant $\beta(n)$ depending only on $n$ such that
$$
\delta_{\infty}(tJ_{\text{BVS}},K,\mathcal{Y})\sim\beta(n)\left(\frac{1}{K}\int_{\mathcal{Y}}\det\left(t\nabla_{\bm{y}}^{2}J_{\text{BVS}}(\bm{y})+\bm{I}_{n\times n}\right)^{\frac{1}{2}}\mathop{}\!d\bm{y}\right)^{2/n} \tag{24}
$$
as $K\to+\infty$ .*
**Proof 4.3**
*Immediate from Theorem 4.1, because $tJ_{\text{BVS}}$ satisfies all of its assumptions.*
Corollary 4.2 provides a lower bound for the approximation error of $J_{\text{BVS}}$ relative to $J_{\text{PQM}}$ . Indeed, Theorem 4.1 and Corollary 4.2 and the fact that $J_{\text{PQM}}(\bm{y})\leqslant J_{\text{BVS}}(\bm{y})$ for every $\bm{y}\in\mathbb{R}^{n}$ imply
$$
\delta_{\infty}(tJ_{\text{BVS}},K,\mathcal{Y})\leqslant t\sup_{\bm{y}\in\mathcal{Y}}|J_{\text{BVS}}(\bm{y})-J_{\text{PQM}}(\bm{y})|. \tag{25}
$$
Thus in this case $J_{\text{PQM}}$ approximates $J_{\text{BVS}}$ from below with error $\Omega(1/K^{2/n})$ as $K\to+\infty$ . We show below that a similar upper bound holds using any reachable function $\tilde{J}\in I_{t}(S)$ .
**Theorem 4.4**
*Let $t>0$ and assume $tJ_{\text{BVS}}$ is twice continuously differentiable and $(1-\alpha)$ -semiconvex with $\alpha>0$ . Let $\mathcal{Y}$ denote a full-dimensional compact, convex set of $\mathbb{R}^{n}$ and let $\tilde{J}\in I_{t}(S)$ denote a function that can reach $\bm{x}\mapsto S(\bm{x},t)$ . Then
$$
\delta_{\infty}(J_{\text{BVS}},K,\mathcal{Y})\leqslant t\sup_{\bm{y}\in\mathcal{Y}}|\tilde{J}(\bm{y})-J_{\text{PQM}}(\bm{y})|. \tag{26}
$$*
**Proof 4.5**
*First, note Theorem 3.5 implies $\tilde{J}(\bm{y})\geqslant J_{\text{BVS}}(\bm{y})$ for every $\bm{y}\in\mathbb{R}^{n}$ , with equality whenever $\bm{y}=\bm{x}-t\nabla_{\bm{x}}S(\bm{x},t)$ for some point $\bm{x}$ at which $\bm{x}\mapsto S(\bm{x},t)$ is differentiable. Thus
$$
t\tilde{J}(\bm{y})-tJ_{\text{BVS}}(\bm{y})=(t\tilde{J}(\bm{y})-tJ_{\text{PQM}}(\bm{y}))+(tJ_{\text{PQM}}(\bm{y})-tJ_{\text{BVS}}(\bm{y}))\geqslant 0,
$$
which we rearrange to get
$$
tJ_{\text{BVS}}(\bm{y})-tJ_{\text{PQM}}(\bm{y})\leqslant t\tilde{J}(\bm{y})-tJ_{\text{PQM}}(\bm{y}).
$$
Since the set $\mathcal{Y}$ is a compact and convex set, $\sup_{\bm{y}\in\mathcal{Y}}|tJ_{\text{BVS}}(\bm{y})-tJ_{\text{PQM}}(\bm{y})|$ is finite and attained in $\mathcal{Y}$ , say at $\bm{y}^{*}$ . Combining this with the inequality above yields
$$
t\sup_{\bm{y}\in\mathcal{Y}}|J_{\text{BVS}}(\bm{y})-J_{\text{PQM}}(\bm{y})|\leqslant t\tilde{J}(\bm{y}^{*})-tJ_{\text{PQM}}(\bm{y}^{*})\leqslant t\sup_{\bm{y}\in\mathcal{Y}}|\tilde{J}(\bm{y})-J_{\text{PQM}}(\bm{y})|.
$$
Finally, since $J_{\text{BVS}}$ is twice continuously differentiable and $(1-\alpha)$ -semiconvex with $\alpha>0$ , we can invoke Theorem 4.1 with $\Psi\equiv J_{\text{BVS}}$ to get
$$
\delta_{\infty}(J_{\text{BVS}},K,\mathcal{Y})\leqslant t\sup_{\bm{y}\in\mathcal{Y}}|\tilde{J}(\bm{y})-J_{\text{PQM}}(\bm{y})|,
$$
that is, inequality Eq. 26 holds. This concludes the proof.*
Theorem 4.4 suggests it is possible to learn $J_{\text{BVS}}$ via a function $\tilde{J}$ that is twice continuously differentiable and $(1-\alpha)$ -semiconvex, and to assess the approximation error using the right-hand side of Eq. 26 as a proxy, in particular by driving $\sup_{\bm{y}\in\mathcal{Y}}|\tilde{J}(\bm{y})-J_{\text{PQM}}(\bm{y})|$ to zero with sufficiently many samples and appropriate training of $\tilde{J}$ .
In the next section, we consider the problem of learning this function using deep neural networks, specifically learned proximal networks [fang2024whats], to enforce the semiconvexity property required for $\tilde{J}$ .
## 5 Numerical results
We evaluate Learned Proximal Networks (LPNs) for approximating the proximal operators of nonconvex and concave priors. While LPNs [fang2024whats] are theoretically grounded in convex analysis (parameterizing the proximal operator as the gradient of a convex potential $\psi$ ), these experiments investigate their behavior when trained on data generated from fundamentally nonconvex and concave landscapes. All experiments use the official LPN implementation. The network is trained via supervised learning, minimizing the mean squared error (MSE) or the L1 loss between the network output and the true value. We use an LPN with $2$ layers and $256$ hidden units with softplus activation ( $\beta=5$ ) to ensure $C^{2}$ smoothness. The model is trained using the Adam optimizer with an initial learning rate of $10^{-3}$ , decreased by a factor of $10$ every $10^{5}$ epochs, for a total of $5\times 10^{5}$ epochs.
The data generation process for all experiments is as follows: $N$ samples $y_{i}$ are drawn uniformly from the hypercube $[-a,a]^{d}$ , where $a=4$ and $d\in\{2,4,8,16,32,64\}$ is the dimension. We choose $N=3\times 10^{4}$ for $d\in\{2,4,8,16\}$ and $N=4\times 10^{4}$ for $d\in\{32,64\}$ .
We also trained a second LPN to recover the prior at arbitrary points and compared its performance to the “invert” method (find $y$ such that $f_{\theta}(y)=x$ ) used in [fang2024whats] for recovering the prior from its proximal operator. Our second LPN is based on the relationship that the nonconvex prior $J(x)$ can be approximated using the convex conjugate of the learned potential $\psi(y)$ . Specifically, we compute:
$$
J(x)\approx G(x)-\frac{1}{2}\|x\|^{2} \tag{27}
$$
where $G(x)=\psi^{*}(x)$ denotes the convex conjugate of the potential $\psi_{\theta}(y)$ learned by the first LPN. We generate a new dataset $\{(x_{k},G_{k})\}$ using the trained first LPN $\psi_{\theta}$ : (i) the gradients of the first network evaluated at the original sample points $y_{k}$ ,
$$
x_{k}=\nabla_{y}\psi_{\theta}(y_{k}), \tag{28}
$$
and (ii) the values of the Legendre transform corresponding to each point,
$$
G_{k}=\langle x_{k},y_{k}\rangle-\psi_{\theta}(y_{k}). \tag{29}
$$
The network $\phi_{G}$ is trained to map the gradients $x_{k}$ to the conjugate values $G_{k}$ by minimizing the Mean Squared Error (MSE). The optimization is performed using the Adam optimizer with the same parameters as used in the first LPN. Once the second LPN is trained, the estimated non-convex prior $\hat{J}(x)$ is recovered via
$$
\hat{J}(x)=\phi_{G}(x)-\frac{1}{2}\|x\|^{2}. \tag{30}
$$
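A sketch of this conjugate-data pipeline (Eqs. 28 and 29), with our own function names standing in for the official LPN API:

```python
import torch

def make_conjugate_dataset(psi_theta, y_samples):
    """Generate {(x_k, G_k)} from a trained convex potential psi_theta
    (e.g., the ICNN sketched in Section 2.3) via Eqs. (28) and (29)."""
    y = y_samples.clone().requires_grad_(True)
    psi = psi_theta(y).squeeze(-1)            # psi_theta(y_k), shape (K,)
    x = torch.autograd.grad(psi.sum(), y)[0]  # Eq. (28): x_k = grad psi(y_k)
    G = (x * y).sum(dim=1) - psi              # Eq. (29): Legendre values
    return x.detach(), G.detach()

# After training phi_G on (x_k, G_k), Eq. (30) recovers the prior estimate:
# J_hat = phi_G(x) - 0.5 * (x ** 2).sum(dim=1)
```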
### 5.1 Convex prior
We will benchmark our approach with the prior $J(\bm{x})=\left\|{\bm{x}}\right\|_{1}$ . For this example, we have
$$
\left(\operatorname*{arg\,min}_{\bm{y}\in\mathbb{R}^{n}}\left\{\frac{1}{2t}\left\|{\bm{x}-\bm{y}}\right\|_{2}^{2}+\left\|{\bm{y}}\right\|_{1}\right\}\right)_{j}=\operatorname*{arg\,min}_{y_{j}\in\mathbb{R}}\left\{\frac{1}{2t}(x_{j}-y_{j})^{2}+|y_{j}|\right\},\quad j=1,\dots,n,
$$
i.e., the problem is separable and each component is given by soft-thresholding.
With this, we can evaluate $S(\bm{x},t)$ and the LPN target function $\bm{x}\mapsto\Psi(\bm{x})\coloneqq\frac{1}{2}\left\|{\bm{x}}\right\|_{2}^{2}-tS(\bm{x},t)$ .
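The per-coordinate infimum has the closed Huber-type form $\min_{y_{j}}\{\frac{1}{2t}(x_{j}-y_{j})^{2}+|y_{j}|\}=x_{j}^{2}/(2t)$ if $|x_{j}|\leqslant t$ and $|x_{j}|-t/2$ otherwise; a sketch of ours for generating the reference values:

```python
import numpy as np

def S_l1(x, t):
    """Reference S(x, t) for J = ||.||_1: a sum of per-coordinate Moreau
    envelopes of |.| (Huber-type functions)."""
    a = np.abs(x)
    return np.where(a <= t, a**2 / (2 * t), a - t / 2).sum(axis=-1)

def psi_ref(x, t):
    """Reference Psi(x) = ||x||_2^2 / 2 - t S(x, t), the convex LPN target."""
    return 0.5 * np.sum(x**2, axis=-1) - t * S_l1(x, t)
```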
Table 1: Mean square errors of LPN $\psi$ and prior $J$ with 2 layers and 256 neurons in the convex L1 prior example.
| Dimension | LPN ( $\psi$ ) | Prior ( $J$ ) |
| --- | --- | --- |
| 2D | $1.04\times 10^{-5}$ | $3.33\times 10^{-5}$ |
| 4D | $2.97\times 10^{-5}$ | $2.17\times 10^{-4}$ |
| 8D | $1.05\times 10^{-4}$ | $7.25\times 10^{-4}$ |
| 16D | $5.27\times 10^{-3}$ | $2.11\times 10^{-3}$ |
| 32D | $1.6\times 10^{-1}$ | $4.03\times 10^{-2}$ |
| 64D | $2.89\times 10^{-6}$ | $2.69\times 10^{-3}$ |



Figure 1: The cross sections of the convex function $\psi(x)$ for dimension $8$ (top). The bottom row compares the cross sections of the prior function from “invert LPN” (left) and our trained second LPN method (right).
### 5.2 Non-convex prior
#### Min-plus algebra example
For this example, the prior is
$$
J(\bm{x})=\min\left(\frac{1}{2\sigma_{1}}\left\|{\bm{x}-\mu_{1}}\right\|_{2}^{2},\frac{1}{2\sigma_{2}}\left\|{\bm{x}-\mu_{2}}\right\|_{2}^{2}\right).
$$
We use $\mu_{1}=(1,0,\dots,0)$ , $\mu_{2}=\bm{1}/\sqrt{n}$ , and $\sigma_{1}=\sigma_{2}=1.0$ .
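For this prior, the Moreau envelope has a closed form because the minimum over $\bm{y}$ commutes with the minimum over the two quadratic branches, and the envelope of $\frac{1}{2\sigma}\left\|{\bm{y}-\mu}\right\|_{2}^{2}$ is $\frac{1}{2(\sigma+t)}\left\|{\bm{x}-\mu}\right\|_{2}^{2}$ ; a sketch of ours for the reference solution:

```python
import numpy as np

def S_minplus(x, t, mus, sigmas):
    """Reference S(x, t) for the min-plus prior: the envelope of a minimum of
    quadratics is the minimum of the individual envelopes, and the envelope
    of ||y - mu||^2 / (2 sigma) is ||x - mu||^2 / (2 (sigma + t))."""
    envs = [np.sum((x - mu) ** 2, axis=-1) / (2.0 * (s + t))
            for mu, s in zip(mus, sigmas)]
    return np.minimum.reduce(envs)
```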
Table 2: Mean square errors of LPN $\psi$ and prior $J$ with 2 layers and 256 neurons in the min-plus example.
| Dimension | LPN ( $\psi$ ) | Prior ( $J$ ) |
| --- | --- | --- |
| 2D | $3.33\times 10^{-6}$ | $5.73\times 10^{-7}$ |
| 4D | $7.64\times 10^{-6}$ | $4.92\times 10^{-6}$ |
| 8D | $3.64\times 10^{-5}$ | $1.20\times 10^{-4}$ |
| 16D | $1.99\times 10^{-4}$ | $3.44\times 10^{-4}$ |
| 32D | $1.16\times 10^{-3}$ | $1.33\times 10^{-3}$ |
| 64D | $2.32\times 10^{-9}$ | $5.21\times 10^{-5}$ |
![Cross sections (x₁,0) and (0,x₂,0) of the convex function, Dim 8: LPN (solid blue) vs. Ref (dashed orange).](https://arxiv.org/html/2507.08344v2/extracted/6608384/images/exp_1_minplus_8D_LPN.png)
![Cross sections (x₁,0) and (0,x₂,0) of the prior function, Dim 8: LPN (blue) vs. Ref (orange).](https://arxiv.org/html/2507.08344v2/extracted/6608384/images/exp_1_minplus_8D_Pr1.png)
![Cross sections of the prior function along x₁ and x₂, Dim 8: LPN 2 (solid blue) vs. reference (dashed).](https://arxiv.org/html/2507.08344v2/extracted/6608384/images/exp_1_minplus_8D_pr2.png)
Figure 2: The cross sections of the convex function $\psi(x)$ for dimension $8$ (top). The bottom row compares the cross sections of the prior function from “invert LPN” (left) and our trained second LPN method (right).
![Cross sections (x₁,0) and (0,x₂,0) of the convex function, Dim 32: LPN (solid blue) vs. Ref (dashed orange).](https://arxiv.org/html/2507.08344v2/extracted/6608384/images/exp_1_minplus_32D_LPN.png)
![Cross sections (x₁,0) and (0,x₂,0) of the prior function, Dim 32: LPN (blue) vs. Ref (orange).](https://arxiv.org/html/2507.08344v2/extracted/6608384/images/exp_1_minplus_32D_Pr1.png)
![Cross sections of the prior function along x₁ and x₂, Dim 32: LPN 2 (solid blue) vs. reference (dashed).](https://arxiv.org/html/2507.08344v2/extracted/6608384/images/exp_1_minplus_32D_pr2.png)
Figure 3: The cross sections of the convex function $\psi(x)$ for dimension $32$ (top). The bottom row compares the cross sections of the prior function from “invert LPN” (left) and our trained second LPN method (right).
### 5.3 Concave prior
For this example, we use
$$
J(\bm{x})=-\left\|{\bm{x}}\right\|_{2}^{2}/4.
$$
This is a challenging example because, technically, $J$ is not uniformly Lipschitz continuous (although numerically we can get around this by “Huberizing” the prior). We use this prior because we have an exact solution for this problem. It is also somewhat challenging for a convex LPN network: by the theory of HJ PDEs, the function $J+\frac{1}{4}\left\|{\bm{x}}\right\|_{2}^{2}$ is convex and $J+\frac{1}{2}\left\|{\bm{x}}\right\|_{2}^{2}$ is strongly convex, so an LPN (which is not inherently strongly convex) may fail to capture the strong convexity, making this function more challenging to learn.
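For reference, a brief sketch of the exact solution (assuming $0<t<2$, so that the objective in Eq. 1 remains strongly convex in $\bm{y}$): setting the gradient of $\bm{y}\mapsto\frac{1}{2t}\left\|{\bm{x}-\bm{y}}\right\|_{2}^{2}-\frac{1}{4}\left\|{\bm{y}}\right\|_{2}^{2}$ to zero and substituting the minimizer back gives
$$
\text{prox}_{tJ}(\bm{x})=\frac{2\bm{x}}{2-t},\qquad S(\bm{x},t)=-\frac{\left\|{\bm{x}}\right\|_{2}^{2}}{2(2-t)}.
$$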
Table 3: Mean square errors of LPN $\psi$ and prior $J$ with 2 layers and 256 neurons in the concave prior example.
| Dimension | LPN ($\psi$) | Prior ($J$) |
| --- | --- | --- |
| 2D | $7.00E-7$ | $1.57E-6$ |
| 4D | $2.74E-5$ | $7.70E-5$ |
| 8D | $5.58E-4$ | $7.91E-4$ |
| 16D | $3.69E-3$ | $3.28E-3$ |
| 32D | $8.70E-2$ | $3.01E-2$ |
| 64D | $6.23E-6$ | $1.87E-3$ |
![Cross sections J(x₁,0,…) and J(0,x₂,0,…), Dim 8: LPN (learned J, solid blue) vs. true J(x) = −¼‖x‖₂² (dashed).](https://arxiv.org/html/2507.08344v2/extracted/6608384/images/exp_quadratic_concave_prior_8D_PR1.png)
![Cross sections J(x₁,0,…) and J(0,x₂,0,…), Dim 8: LPN 2 (solid blue) vs. true J(x) = −¼‖x‖₂² (dashed orange).](https://arxiv.org/html/2507.08344v2/extracted/6608384/images/exp_quadratic_concave_prior_8D_PR2.png)
Figure 4: The cross sections of the prior function for dimension $8$ from “invert LPN” (left) and our trained second LPN method (right).
### 5.4 Negative $\ell_{1}$ norm
See Eq. 32 for the minimum value and Eq. 33 for the proximal point. For this example, we consider $J(\bm{y})=-\left\|{\bm{y}}\right\|_{1}$ . Let $n=1$ for simplicity and consider the one-dimensional problem
$$
S(x,t)=\min_{y\in\mathbb{R}}\left\{\frac{1}{2t}(x-y)^{2}-|y|\right\}. \tag{31}
$$
The function $y\mapsto(x-y)^{2}/(2t)-|y|$ is differentiable everywhere except at $y=0$ . A stationary point of this function satisfies
$$
0\in\frac{y-x}{t}-\partial|y|\iff\begin{cases}y=x-t,&\ \text{if $y<0$,}\\
0\in[x-t,x+t],&\ \text{if $y=0$},\\
y=x+t,&\ \text{if $y>0$.}\end{cases}
$$
If $x>t$ , the only minimum is $x+t$ , in which case we have
$$
S(x,t)=\frac{1}{2t}(x+t-x)^{2}-|x+t|=\frac{t}{2}-(x+t)=-\frac{t}{2}-x.
$$
If $0<x\leqslant t$ , the stationary points are $x-t$ , $0$ , and $x+t$ , and the global minimum is attained at $x+t$ , again yielding $S(x,t)=-\frac{t}{2}-x$ . If $x=0$ , the stationary points are $-t$ , $0$ , and $t$ ; the global minimum is attained at either $-t$ or $t$ , yielding $S(0,t)=-\frac{t}{2}$ . If $-t\leqslant x<0$ , the stationary points are $x-t$ , $0$ , and $x+t$ , and the global minimum is attained at $x-t$ , yielding $S(x,t)=-\frac{t}{2}+x$ . If $x<-t$ , the only minimum is $x-t$ , in which case we have $S(x,t)=-\frac{t}{2}+x$ . Hence
$$
S(x,t)=-\frac{t}{2}-|x|. \tag{32}
$$
In particular, its gradient in $x$ satisfies
$$
\nabla_{x}S(x,t)\in\begin{cases}1&\ \text{if $x<0$},\\
[-1,1]&\ \text{if $x=0$},\\
-1&\ \text{if $x>0$}.\end{cases}
$$
Moreover,
$$
\operatorname*{arg\,min}_{y\in\mathbb{R}}\left\{\frac{1}{2t}(x-y)^{2}-|y|\right\}=\begin{cases}x-t&\ \text{if $x<0$},\\
\{-t,t\}&\ \text{if $x=0$},\\
x+t&\ \text{if $x>0$}.\end{cases} \tag{33}
$$
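As a sanity check, here is a short NumPy snippet (names are ours) comparing Eq. 32 and a single-valued selection of Eq. 33 against brute-force minimization of Eq. 31.
```python
import numpy as np

def S_neg_l1(x, t):
    """Closed-form value from Eq. (32): S(x,t) = -t/2 - |x|."""
    return -t / 2.0 - np.abs(x)

def prox_neg_l1(x, t):
    """Single-valued selection of Eq. (33); at x = 0 the minimizers are the
    pair {-t, t}, and we return +t by convention."""
    return np.where(x < 0, x - t, x + t)

x, t = 0.7, 0.3
ys = np.linspace(-5.0, 5.0, 200_001)
vals = (x - ys) ** 2 / (2.0 * t) - np.abs(ys)  # objective of Eq. (31)
assert abs(vals.min() - S_neg_l1(x, t)) < 1e-6
assert abs(ys[vals.argmin()] - prox_neg_l1(x, t)) < 1e-3
```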
Table 4: Mean square errors of LPN $\psi$ and prior $J$ with 2 layers and 256 neurons in the negative $\ell_{1}$ norm example.
| Dimension | LPN ($\psi$) | Prior ($J$) |
| --- | --- | --- |
| 2D | $6.59E-5$ | $5.20E-6$ |
| 4D | $3.15E-4$ | $3.17E-5$ |
| 8D | $2.12E-3$ | $2.94E-4$ |
| 16D | $8.01E-3$ | $4.49E-2$ |
| 32D | $1.55E-1$ | $2.29E-2$ |
| 64D | $6.42E-4$ | $4.49E-3$ |
![Cross sections (x₁,0) and (0,x₂,0) of the convex function, Dim 4: LPN (solid blue) vs. Ref (dashed orange).](https://arxiv.org/html/2507.08344v2/extracted/6608384/images/exp_NegL1_prior_4D_LPN.png)
![Cross sections (x₁,0) and (0,x₂,0) of the convex function, Dim 8: LPN (solid blue) vs. Ref (dashed orange).](https://arxiv.org/html/2507.08344v2/extracted/6608384/images/exp_NegL1_prior_8D_LPN.png)
Figure 5: The cross sections of the convex function $\psi(x)$ for dimension $4$ (left) and $8$ (right).
Tables 1, 2, 3, and 4 quantify the scalability of the LPN approach across dimensions ranging from $2$ D to $64$ D. The results indicate that the method performs with high accuracy in lower dimensions ( $2$ D through $8$ D), achieving mean square errors in the range of $10^{-7}$ to $10^{-4}$ . Performance degrades as the dimensionality increases, particularly at $16$ D and $32$ D, where the errors rise to between $10^{-3}$ and $10^{-1}$ . The weaker results in higher dimensions may be due to insufficient training time and to our use of a simple architecture, which likely needs to be more expressive for higher-dimensional problems. While the error for the recovered prior $J$ is generally slightly higher than that for the LPN $\psi$ , this is expected given the added difficulty of recovering non-convex, non-smooth, or concave functions. The errors nevertheless remain low overall, supporting our method’s effectiveness even in high-dimensional spaces.
The top rows of Figures 1, 2, and 3, together with Figure 5, demonstrate that the LPN accurately learns the cross sections of the convex function $\psi(x)$ for dimensions $4$ , $8$ , and $32$ , closely matching the reference function, with the most significant deviation in Figure 3 (dimension $32$ ). The bottom rows of these figures compare the “invert LPN” method (left) and our trained second LPN method (right). In all cases, our direct method recovers the original prior: the sharp V-shaped reconstructions in Figure 1, the non-convex prior in Figures 2 and 3, and the concave quadratic prior in Figure 4, despite its challenging nature.
Figure 5 shows the cross sections of the LPN value function against the ground truth for dimensions $4$ (left) and $8$ (right). In both instances, the LPN approximation exhibits a tight fit to the analytical reference, closely capturing the characteristic V-shape and the non-smooth geometry of the underlying function. The errors for $64$ dimensions are consistently lower than expected in all examples, yet the method does not yield a good visual approximation of the cross sections there. We cannot yet explain this behaviour in the $64$ D cases; we suspect it is due to the way we sample the hypercube.
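One way to see why this is plausible: under uniform sampling of a hypercube such as $[-4,4]^{n}$ , essentially no samples land near the coordinate axes along which the cross sections are plotted once $n$ is large. The snippet below illustrates this point (it is our illustration of the conjecture, not the sampling code used in the experiments).
```python
import numpy as np

rng = np.random.default_rng(0)
for n in (2, 8, 64):
    x = rng.uniform(-4.0, 4.0, size=(100_000, n))
    # A sample is "near the x1-axis" if every coordinate except the first
    # lies within 0.5 of zero (the region where the cross sections live).
    near_axis = (np.abs(x[:, 1:]) < 0.5).all(axis=1)
    print(n, near_axis.mean())  # ~1/8 for n = 2, essentially 0 for n >= 8
```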
## 6 Discussion
In this work, we leveraged the theory of viscosity solutions of HJ PDEs to develop novel deep learning numerical methods to learn, from data, the underlying prior of the proximal operator Eq. 2 yielding $(\bm{x},t)\mapsto S(\bm{x},t)$ defined in Eq. 1. Our approach built on the existing connections between proximal operators and HJ PDEs, crucially the fact that $(\bm{x},t)\mapsto S(\bm{x},t)$ is obtained from the solution to an HJ PDE, and in particular on the theory for the inverse problem for HJ equations. As discussed in Section 3, this theory shows that while there may be infinitely many priors that recover Eq. 1, there is a natural choice, obtained by reversing time in the HJ PDE Eq. 14 and using the value of the proximal operator $(\bm{x},t)\mapsto S(\bm{x},t)$ as initial condition. The resulting backward viscosity solution yields a prior $J_{\text{BVS}}$ that reconstructs $(\bm{x},t)\mapsto S(\bm{x},t)$ and is also semiconvex. In Section 4, we considered the case where only samples of the proximal operator and its values are available, and used techniques from max-plus algebra to derive characterizations and error properties of $J_{\text{BVS}}$ with respect to convex functions approximating it from above. Finally, in Section 5 we proposed to learn the prior $J_{\text{BVS}}$ by training a convex neural network, specifically a learned proximal network, on a function of the form $\bm{y}\mapsto\tilde{J}(\bm{y})+\frac{1}{2}\left\|{\bm{y}}\right\|_{2}^{2}$ from data $\{\bm{x}_{k},S(\bm{x}_{k},t),\nabla_{\bm{x}}S(\bm{x}_{k},t)\}_{k=1}^{K}$ via Eq. 17. We presented several numerical results that demonstrate the efficiency of our proposed method in high dimensions.
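For concreteness, here is a minimal sketch of an input-convex network at the scale used in our experiments (2 layers, 256 neurons); it is an illustrative ICNN parameterization, not a verbatim transcription of the LPN architecture or of the training loss Eq. 17. Following the LPN construction, the learned proximal map is recovered as the gradient of the convex potential via automatic differentiation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICNN(nn.Module):
    """Minimal input-convex network psi(y) (an illustrative sketch): convexity
    in y holds because the z-path weights are clamped to be nonnegative and
    softplus is convex and nondecreasing."""

    def __init__(self, dim, width=256, depth=2):
        super().__init__()
        self.Wy = nn.ModuleList(nn.Linear(dim, width) for _ in range(depth))
        self.Wz = nn.ModuleList(nn.Linear(width, width, bias=False)
                                for _ in range(depth - 1))
        self.out_y = nn.Linear(dim, 1)
        self.out_z = nn.Linear(width, 1, bias=False)

    def forward(self, y):
        z = F.softplus(self.Wy[0](y))
        for Wy, Wz in zip(self.Wy[1:], self.Wz):
            # Nonnegative weights on the z-path preserve convexity in y.
            z = F.softplus(Wy(y) + F.linear(z, Wz.weight.clamp(min=0)))
        return self.out_y(y) + F.linear(z, self.out_z.weight.clamp(min=0))

# The learned proximal map is the gradient of the convex potential psi.
psi = ICNN(dim=8)
x = torch.randn(16, 8, requires_grad=True)
prox = torch.autograd.grad(psi(x).sum(), x, create_graph=True)[0]
```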
While this work focused on proximal operators, we expect our approach to extend to a broad class of Bregman divergences, as recent results in the theory of inverse problems for HJ equations suggest [esteve2020inverse]. Another potential direction, when the value of the proximal operator $(\bm{x},t)\mapsto S(\bm{x},t)$ is known, would be to learn the prior $J$ using Monte Carlo sampling strategies, as recently proposed in [park2025neural] for the forward problem of HJ equations (i.e., learning $(\bm{x},t)\mapsto S(\bm{x},t)$ from a known $J$ ). In the longer term, it would be interesting to devise similar deep learning methods for the inverse problem of HJ equations with possibly time- or state-dependent Hamiltonians, which are relevant to optimal control problems.
## Appendix A Calculations
### A.1 Formal calculation of Eq. 19
We have
$$
\inf_{\bm{y}\in\mathbb{R}^{n}}\left\{\frac{1}{2t}\left\|{\bm{x}-\bm{y}}\right\|_{2}^{2}+J_{\text{PAM}}(\bm{y})\right\}=\inf_{\bm{y}\in\mathbb{R}^{n}}\left\{\frac{1}{2t}\left\|{\bm{x}-\bm{y}}\right\|_{2}^{2}+\cdots\right.
$$
### A.2 Formal calculation of Eq. 21
Formally, we have
$$
t\nabla_{\bm{y}}J_{\text{PQM}}(\bm{y})=\alpha(\bm{y}-\bm{y}_{k})+\bm{x}_{k}-\bm{y}
$$
for some $k\in\{1,\dots,K\}$ . Similarly,
$$
\hat{\bm{y}}=\operatorname*{arg\,min}_{\bm{y}\in\mathbb{R}^{n}}\left\{\frac{1}{2t}\left\|{\bm{x}-\bm{y}}\right\|_{2}^{2}+J_{\text{PQM}}(\bm{y})\right\}\iff\bm{0}=\frac{\hat{\bm{y}}-\bm{x}}{t}+\frac{\alpha}{t}(\hat{\bm{y}}-\bm{y}_{k})+\frac{\bm{x}_{k}-\hat{\bm{y}}}{t}\iff\hat{\bm{y}}=\bm{y}_{k}+\frac{\bm{x}-\bm{x}_{k}}{\alpha}.
$$
In addition,
$$
\bm{x}-\hat{\bm{y}}=\frac{\bm{x}_{k}-(1-\alpha)\bm{x}}{\alpha}-\bm{y}_{k}\implies\frac{1}{2t}\left\|{\bm{x}-\hat{\bm{y}}}\right\|_{2}^{2}=\frac{1}{2t\alpha^{2}}\left\|{\bm{x}_{k}-\bm{x}+\alpha(\bm{x}-\bm{y}_{k})}\right\|_{2}^{2},
$$
and
$$
J_{\text{PQM}}(\hat{\bm{y}})=J(\bm{y}_{k})+\frac{1}{2t}\left\|{\bm{x}_{k}-\bm{y}_{k}}\right\|_{2}^{2}-\frac{1}{2t}\left\|{\bm{x}_{k}-\hat{\bm{y}}}\right\|_{2}^{2}+\frac{\alpha}{2t}\left\|{\hat{\bm{y}}-\bm{y}_{k}}\right\|_{2}^{2}
$$
From this, we deduce
$$
\inf_{\bm{y}\in\mathbb{R}^{n}}\left\{\frac{1}{2t}\left\|{\bm{x}-\bm{y}}\right\|_{2}^{2}+J_{\text{PQM}}(\bm{y})\right\}=\frac{1}{2t\alpha^{2}}\left\|{\bm{x}_{k}-\bm{x}+\alpha(\bm{x}-\bm{y}_{k})}\right\|_{2}^{2}+J_{\text{PQM}}(\hat{\bm{y}})
$$
for some $k\in\{1,\dots,K\}$ .
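A quick numerical check of the algebra above (variable names and sample values are ours):
```python
import numpy as np

rng = np.random.default_rng(1)
n, t, alpha = 4, 0.5, 0.3
x, xk, yk = rng.normal(size=(3, n))

yhat = yk + (x - xk) / alpha  # solves the stationarity condition above
lhs = x - yhat
rhs = (xk - (1.0 - alpha) * x) / alpha - yk
assert np.allclose(lhs, rhs)
assert np.isclose(np.sum(lhs ** 2) / (2.0 * t),
                  np.sum((xk - x + alpha * (x - yk)) ** 2)
                  / (2.0 * t * alpha ** 2))
```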