# Deep learning methods for inverse problems using connections between proximal operators and Hamilton–Jacobi equations
**Authors**: Oluwatosin Akande, Gabriel P. Langlois, Akwum Onwunta
> Industrial and Systems Engineering, Lehigh University, 200 West Packer Avenue, Bethlehem, PA 18015, USA
> Department of Mathematics, University of Illinois Urbana-Champaign, Chicago, IL, USA
> Industrial and Systems Engineering, Lehigh University, 200 West Packer Avenue, Bethlehem, PA 18015, USA
## Abstract
Inverse problems are important mathematical problems that seek to recover model parameters from noisy data. Since inverse problems are often ill-posed, they require regularization or incorporation of prior information about the underlying model or unknown variables. Proximal operators, ubiquitous in nonsmooth optimization, are central to this because they provide a flexible and convenient way to encode priors and build efficient iterative algorithms. They have also recently become key to modern machine learning methods, e.g., plug-and-play methods with learned denoisers and deep neural architectures for learning the priors underlying proximal operators. The latter were developed partly due to recent work characterizing proximal operators of nonconvex priors as subdifferentials of convex potentials. In this work, we propose to leverage connections between proximal operators and Hamilton–Jacobi partial differential equations (HJ PDEs) to develop novel deep learning architectures for learning the prior. In contrast to other existing methods, we learn the prior directly, without recourse to inverting the learned proximal operator after training. We present several numerical results that demonstrate the efficiency of the proposed method in high dimensions.
## 1 Introduction
Inverse problems are ubiquitous mathematical problems that primarily aim at recovering model parameters from noisy data. They arise in many scientific and engineering applications for, e.g., recovering an image from noisy measurements, deblurring, tomographic reconstruction, and compressive sensing [AEOV2023, bertero2021introduction, isakov2017inverse, arridge2019solving]. Since inverse problems are often ill-posed, it is essential to include regularization or prior information about the underlying model or unknown variables. Proximal operators are central to this: they provide a flexible and computationally convenient way to encode priors and to build efficient iterative algorithms (e.g., proximal (sub)gradients, the alternating direction method of multipliers, and other splitting methods). More recently, proximal operators have become key ingredients for state-of-the-art machine learning methods, e.g., plug-and-play methods that replace explicit regularizers by learned denoisers [hu2023plug, jia2025plug], and deep neural architectures that parameterize proximal maps or their gradients, such as learned proximal networks (LPNs) [fang2024whats]. These developments have made proximal methods practical and powerful computational tools.
Formally, the proximal operator of a proper function $J\colon\mathbb{R}^{n}\to\mathbb{R}\cup\{+\infty\}$ is defined via observed data $\bm{x}\in\mathbb{R}^{n}$ , a parameter $t>0$ , and the minimization problem
$$
S(\bm{x},t)=\min_{\bm{y}\in\mathbb{R}^{n}}\left\{\frac{1}{2t}\left\|{\bm{x}-\bm{y}}\right\|_{2}^{2}+J(\bm{y})\right\}. \tag{1}
$$
The proximal operator $\text{prox}_{tJ}\colon\mathbb{R}^{n}\rightrightarrows\mathbb{R}^{n}$ is the set-valued function
$$
\text{prox}_{tJ}(\bm{x})=\operatorname*{arg\,min}_{\bm{y}\in\mathbb{R}^{n}}\left\{\frac{1}{2t}\left\|{\bm{x}-\bm{y}}\right\|_{2}^{2}+J(\bm{y})\right\}. \tag{2}
$$
Here, $t$ controls the trade-off between the quadratic data-fidelity term and the prior $J$ . In practice one often works directly with $\text{prox}_{tJ}$ rather than the prior.
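For example, when $J(\bm{y})=\left\|{\bm{y}}\right\|_{1}$, the minimization in Eq. 2 separates across coordinates and $\text{prox}_{tJ}$ reduces to componentwise soft-thresholding. The following minimal NumPy sketch (the helper names are ours, for illustration) evaluates this proximal operator and the corresponding value $S(\bm{x},t)$:

```python
import numpy as np

def prox_l1(x, t):
    """Proximal operator of t*||.||_1: componentwise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def moreau_envelope_l1(x, t):
    """S(x,t) = min_y { ||x - y||^2/(2t) + ||y||_1 }, evaluated at the minimizer."""
    y = prox_l1(x, t)
    return np.sum((x - y) ** 2) / (2.0 * t) + np.sum(np.abs(y))

x = np.array([1.5, -0.2, 0.7])
t = 0.5
print(prox_l1(x, t))             # [ 1.  -0.   0.2]
print(moreau_envelope_l1(x, t))  # value of S(x, t)
```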
The recent work of Gribonval and Nikolova [gribonval2020characterization] in nonsmooth optimization has extended the characterization of proximal operators with convex priors to those with nonconvex priors, showing in particular they are functions that are subdifferentials of certain convex potentials. These properties, in particular, were used in [fang2024whats] to develop new deep learning methods, called learned proximal networks (LPNs), to learn from data the underlying prior of a proximal operator.
The paper [gribonval2020characterization] did not, however, discuss the well-established, existing connections between proximal operators and Hamilton–Jacobi Partial Differential Equations (HJ PDEs) [darbon2015convex, darbon2016algorithms, darbon2021bayesian, chaudhari2018deep, osher2023hamilton]. To see these connections, consider the following HJ PDE with quadratic Hamiltonian function and whose initial data is the prior $J$ :
$$
\begin{dcases}\frac{\partial S}{\partial t}(\bm{x},t)+\frac{1}{2}\left\|{\nabla_{\bm{x}}S(\bm{x},t)}\right\|_{2}^{2}=0,&\ (\bm{x},t)\in\mathbb{R}^{n}\times(0,+\infty),\\
S(\bm{x},0)=J(\bm{x}),&\ \bm{x}\in\mathbb{R}^{n}.\end{dcases} \tag{3}
$$
If $J$ is uniformly Lipschitz continuous, then the unique viscosity solution of the HJ PDE is given by Eq. 1. Moreover, at a point of differentiability $\bm{x}$ , there holds
$$
\text{prox}_{tJ}(\bm{x})=\bm{x}-t\nabla_{\bm{x}}S(\bm{x},t). \tag{4}
$$
Moreover, the viscosity solution satisfies the crucial property that $\bm{x}\mapsto\frac{1}{2}\left\|{\bm{x}}\right\|_{2}^{2}-tS(\bm{x},t)$ is convex; that is, when paired with Eq. 4, the function $\text{prox}_{tJ}(\bm{x})$ is obtained from differentiating a convex function. This formally connects proximal operators to HJ PDEs. We emphasize that these connections were previously known and established, whereas the (stronger) characterization obtained in [gribonval2020characterization] was, to the best of our knowledge, unknown in the theory of HJ PDEs.
In this paper, we leverage the theory of viscosity solutions of HJ PDEs to develop novel deep learning methods to learn from data the prior function $J$ in Eq. 2. To describe our approach, consider the case when the solution $(\bm{x},t)\mapsto S(\bm{x},t)$ to the HJ PDE Eq. 3 is known. (We will consider the case when only samples of it are known in the next paragraph.) This problem was investigated in [barron1999regularity, claudel2011convex, colombo2020initial, esteve2020inverse, misztela2020initial]. In particular, [esteve2020inverse] showed that when $\bm{x}\mapsto S(\bm{x},t)$ is uniformly Lipschitz continuous and $\bm{x}\mapsto\frac{1}{2}\left\|{\bm{x}}\right\|_{2}^{2}-tS(\bm{x},t)$ is convex, there exists a prior $J$ that can recover $S(\bm{x},t)$ exactly. Moreover, there is a natural candidate for the prior, obtained by reversing the time in the HJ PDE Eq. 3 and using $(\bm{x},t)\mapsto S(\bm{x},t)$ as the terminal condition. The resulting backward viscosity solution yields the prior $J_{\text{BVS}}\colon\mathbb{R}^{n}\to\mathbb{R}$ which admits the representation formula
$$
J_{\text{BVS}}(\bm{y})=\sup_{\bm{x}\in\mathbb{R}^{n}}\left\{S(\bm{x},t)-\frac{1}{2t}\left\|{\bm{x}-\bm{y}}\right\|_{2}^{2}\right\}. \tag{5}
$$
Here, $J(\bm{y})\geqslant J_{\text{BVS}}(\bm{y})$ for every $\bm{y}\in\mathbb{R}^{n}$ , with $J_{\text{BVS}}(\bm{y})=J(\bm{y})$ whenever $\bm{y}=\bm{x}-t\nabla_{\bm{x}}S(\bm{x},t)$ , where $\bm{x}$ is a point of differentiability of $\bm{x}\mapsto S(\bm{x},t)$ . Moreover,
$$
\inf_{\bm{y}\in\mathbb{R}^{n}}\left\{\frac{1}{2t}\left\|{\bm{x}-\bm{y}}\right\|_{2}^{2}+J_{\text{BVS}}(\bm{y})\right\}=S(\bm{x},t)\ \text{for every $\bm{x}\in\mathbb{R}^{n}$.}
$$
Thus the prior $J_{\text{BVS}}$ recovers the function $\bm{x}\mapsto S(\bm{x},t)$ , although in general $\text{prox}_{tJ}$ and $\text{prox}_{tJ_{\text{BVS}}}$ may not agree everywhere. Nonetheless, this provides a principled way to estimate the prior, at least when $S(\bm{x},t)$ is known.
We focus in this paper on the case when $\bm{x}\mapsto S(\bm{x},t)$ is unknown but we have access to some samples $\{\bm{x}_{k},S(\bm{x}_{k},t),\nabla_{\bm{x}}S(\bm{x}_{k},t)\}_{k=1}^{K}$ with $t$ fixed. We propose to learn the prior $\bm{y}\mapsto J_{\text{BVS}}(\bm{y})$ by leveraging the crucial fact that $\bm{y}\mapsto tJ_{\text{BVS}}(\bm{y})+\frac{1}{2}\left\|{\bm{y}}\right\|_{2}^{2}$ is convex, thus enabling approaches based on deep learning and convex neural networks.
Related works: Hamilton–Jacobi PDEs are important to many scientific and engineering applications arising in e.g., optimal control [Bardi1997Optimal, fleming2006controlled, mceneaney2006max, parkinson2018optimal] and physics [Caratheodory1965CalculusI, Caratheodory1967CalculusII], inverse problems for imaging sciences [darbon2015convex, darbon2019decomposition, darbon2021bayesian, darbon2021connecting, darbon2022hamilton], optimal transport [meng2024primal, onken2021ot], game theory [BARRON1984213, Evans1984Differential, ruthotto2020machine], and machine learning [chen2024leveraging, zou2024leveraging]. Recent works focus on developing specialized solution methods for solving high-dimensional HJ PDEs, using, e.g., representation formulas or deep learning methods. These specialized methods leverage certain properties of HJ PDEs, including stochastic aspects and representation formulas [bardi1998hopf, mceneaney2006max, darbon2016algorithms, darbon2022hamilton], to approximate solutions to HJ PDEs more accurately and efficiently than general-purpose methods. See, e.g., [meng2022sympocnet, darbon2023neural, darbon2021some, darbon2020overcoming, park2025neural] for recent works along these lines and [meng2025recent] for a review of the state-of-the-art numerical methods for HJ PDEs.
Deep learning methods have become popular for computing solutions to high-dimensional PDEs as well as their inverse problems. They are popular because neural networks can be trained on data to approximate high-dimensional, nonlinear functions using efficient optimization algorithms. They have been used to approximate solutions to PDEs without any discretization with numerical grids, and for this reason they can overcome, or at least mitigate, the curse of dimensionality. There is a fairly comprehensive literature on deep learning methods for solving PDEs in general, e.g., see [beck2020overview, cuomo2022scientific, karniadakis2021physics].
Organization of this paper: We present background information on proximal operators, Hamilton–Jacobi equations, and convex neural networks in Section 2. Next, we discuss recent results concerning the inverse problem for Hamilton–Jacobi equations when the solution is available, and how they relate to proximal operators and learning priors in inverse problems, in Section 3. Our main theoretical results are presented in Section 4, where we study the inverse problem for Hamilton–Jacobi equations when only incomplete information is available about its solution. We suggest, via arguments from max-plus algebra theory for Hamilton–Jacobi PDEs, how to learn from data the solution to a certain Hamilton–Jacobi terminal value problem, which can then be used as an estimate for learning the prior function in a proximal operator. We present in Section 5 some numerical experiments for learning the initial data of certain Hamilton–Jacobi PDEs using convex neural networks and the theory of inverse Hamilton–Jacobi PDEs. Finally, we summarize our results in Section 6.
## 2 Background
We present here some background on proximal operators, HJ PDEs, connections between them, and convex neural networks. For comprehensive references, we refer the reader to [cannarsa2004semiconcave, evans2022partial, rockafellar2009variational].
### 2.1 Proximal operators
Let $J\colon\mathbb{R}^{n}\to\mathbb{R}\cup\{+\infty\}$ denote a proper function (i.e., $J(\bm{x})<+\infty$ for some $\bm{x}\in\mathbb{R}^{n}$ and $J(\bm{x})>-\infty$ for every $\bm{x}\in\mathbb{R}^{n}$ ). Consider the minimization problem $(\bm{x},t)\mapsto S(\bm{x},t)$ defined in Eq. 1 and its proximal operator $(\bm{x},t)\mapsto\text{prox}_{tJ}(\bm{x})$ defined in Eq. 2. We say a proper function $f_{t}\colon\mathbb{R}^{n}\to\mathbb{R}^{n}$ is a proximal operator of $tJ$ if $f_{t}(\bm{x})\in\text{prox}_{tJ}(\bm{x})$ for every $\bm{x}\in\mathbb{R}^{n}$ . Gribonval and Nikolova [gribonval2020characterization] proved that proximal operators are characterized in terms of the function $\psi\colon\mathbb{R}^{n}\times[0,+\infty)\to\mathbb{R}\cup\{+\infty\}$ defined by
$$
\psi(\bm{x},t)=\frac{1}{2}\left\|{\bm{x}}\right\|_{2}^{2}-tS(\bm{x},t). \tag{6}
$$
**Theorem 2.1**
*A proper function $f_{t}\colon\mathbb{R}^{n}\to\mathbb{R}^{n}$ is a proximal operator of $tJ$ if and only if $\bm{x}\mapsto\psi(\bm{x},t)$ is proper, lower semicontinuous, and convex and $f_{t}(\bm{x})\in\partial_{\bm{x}}\psi(\bm{x},t)$ . Moreover, $f_{t}$ is uniformly Lipschitz continuous with constant $L>0$ if and only if $\bm{x}\mapsto(1-1/L)\left\|{\bm{x}}\right\|_{2}^{2}/2+tJ(\bm{x})$ is proper, lower semicontinuous and convex.*
**Proof 2.2**
*See [gribonval2020characterization, Theorem 3 and Proposition 2]*
The characterization of proximal operators in Theorem 2.1 is closely related to the concepts of semiconcave and semiconvex functions.
**Definition 2.3**
*Let $\mathcal{C}\subset\mathbb{R}^{n}$ . We say $g\colon\mathcal{C}\to\mathbb{R}$ is $C$ -semiconcave with $C\geqslant 0$ if it is continuous and
$$
\lambda g(\bm{x}_{1})+(1-\lambda)g(\bm{x}_{2})-g(\lambda\bm{x}_{1}+(1-\lambda)\bm{x}_{2})\leqslant\lambda(1-\lambda)\frac{C}{2}\left\|{\bm{x}_{1}-\bm{x}_{2}}\right\|_{2}^{2}
$$
for every $\bm{x}_{1},\bm{x}_{2}\in\mathcal{C}$ such that $\lambda\bm{x}_{1}+(1-\lambda)\bm{x}_{2}\in\mathcal{C}$ and every $\lambda\in[0,1]$ . We say $g$ is semiconvex if $-g$ is semiconcave.*
**Remark 2.4**
*It can be shown [cannarsa2004semiconcave, Chapter 1] that a function $g$ is $C$ -semiconcave with $C\geqslant 0$ if and only if $\bm{x}\mapsto g(\bm{x})-\frac{C}{2}\left\|{\bm{x}}\right\|_{2}^{2}$ is concave, if and only if $g=g_{1}+g_{2}$ , where $g_{1}$ is concave and $g_{2}\in C^{2}(\mathbb{R}^{n})$ with $\left\|{\nabla_{\bm{x}}^{2}g_{2}}\right\|_{\infty}\leqslant C$ .*
Combining formula Eq. 6, Definition 2.3 and Remark 2.4, we find $\bm{x}\mapsto\psi(\bm{x},t)$ is convex if and only if $\bm{x}\mapsto tS(\bm{x},t)$ is semiconcave. We will see later that semiconcavity is an important concept in the theory of HJ PDEs for characterizing their generalized solutions. But before moving on to present some background on HJ PDEs, we give below an instructive example.
**Example 2.5 (The negative absolute value prior)**
*Let $J(x)=-|x|$ and consider the one-dimensional problem
$$
S(x,t)=\min_{y\in\mathbb{R}}\left\{\frac{1}{2t}(x-y)^{2}-|y|\right\}.
$$
A global minimum $y^{*}$ of this problem satisfies the first-order optimality condition
$$
0\in(y^{*}-x)/t-\partial|y^{*}|\iff y^{*}\in\begin{cases}x+t,&\ \text{if $y^{*}>0$,}\\
[x-t,x+t]&\ \text{if $y^{*}=0$},\\
x-t,&\ \text{if $y^{*}<0$.}\end{cases}
$$
If $x>t$ , the only critical point is $y^{*}=x+t$ and $S(x,t)=-x-\frac{t}{2}$ . Likewise, if $x<-t$ , the only critical point is $y^{*}=x-t$ and $S(x,t)=x-\frac{t}{2}$ ; in both cases, $S(x,t)=-\frac{t}{2}-|x|$ . If $0<x\leqslant t$ , the critical points are $0$ , $x+t$ , and (when $x<t$ ) $x-t$ , but the global minimum is attained at $x+t$ and yields $S(x,t)=-\frac{t}{2}-x$ . Likewise, if $-t\leqslant x<0$ , the global minimum is attained at $x-t$ and yields $S(x,t)=-\frac{t}{2}+x$ . Finally, if $x=0$ , the critical points are $-t$ , $0$ , and $t$ ; the global minimum is attained at both $-t$ and $t$ , yielding $S(0,t)=-t/2$ . Hence we find
$$
S(x,t)=-\frac{t}{2}-|x|\quad\text{and}\quad\text{prox}_{tJ}(x)=\begin{cases}x+t,&\ \text{if $x>0$,}\\
\{-t,t\}&\ \text{if $x=0$},\\
x-t,&\ \text{if $x<0$.}\end{cases} \tag{7}
$$
Thus, different selections $f_{t}(x)\in\text{prox}_{tJ}(x)$ can differ only at $x=0$ . In any case, the function $x\mapsto\psi(x,t)$ in Theorem 2.1 and its subdifferential $x\mapsto\partial_{x}\psi(x,t)$ are given by
$$
\psi(x,t)=\frac{1}{2}x^{2}-tS(x,t)=\frac{1}{2}x^{2}+t|x|+\frac{t^{2}}{2}\quad\text{and}\quad\partial_{x}\psi(x,t)=\begin{cases}x+t,\,&\text{if $x>0$},\\
[-t,t],\,&\text{if $x=0$},\\
x-t,\,&\text{if $x<0$}.\end{cases}
$$
We see that any selection $f_{t}(x)\in\text{prox}_{tJ}(x)$ satisfies $f_{t}(x)\in\partial_{x}\psi(x,t)$ .*
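As a quick numerical sanity check of Eq. 7, one can minimize the objective of Example 2.5 by brute force on a fine grid and compare against the closed-form expressions; a minimal sketch (the grid bounds and resolution below are our own illustrative choices):

```python
import numpy as np

def S_bruteforce(x, t, grid):
    """Approximate S(x,t) = min_y { (x-y)^2/(2t) - |y| } by grid search."""
    vals = (x - grid) ** 2 / (2.0 * t) - np.abs(grid)
    k = np.argmin(vals)
    return vals[k], grid[k]

t = 0.3
grid = np.linspace(-10.0, 10.0, 200001)
for x in [-1.2, -0.05, 0.0, 0.4, 2.0]:
    S_num, y_star = S_bruteforce(x, t, grid)
    S_exact = -t / 2.0 - abs(x)                               # Eq. 7
    prox_exact = x + t if x > 0 else (x - t if x < 0 else t)  # one selection at x = 0
    print(f"x={x:+.2f}  S_num={S_num:+.4f}  S_exact={S_exact:+.4f}  "
          f"y*={y_star:+.4f}  prox={prox_exact:+.4f}")
```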
### 2.2 Hamilton–Jacobi Equations
In this section, we briefly review some elements of the theory of HJ PDEs, including the method of characteristics, viscosity solutions of HJ PDEs, and the Lax–Oleinik formula, and discuss how these concepts tie together to proximal operators. The discussion is not comprehensive; see [evans2022partial] and references therein for a more detailed treatment. To ease the presentation, we consider only the first-order HJ PDEs Eq. 3.
#### 2.2.1 Characteristic equations
The characteristic equations of Eq. 3 are given by the dynamical system
$$
\begin{cases}\dot{\bm{x}}(t)&=\bm{p}(t),\\
\dot{\bm{p}}(t)&=0,\\
\dot{\bm{z}}(t)&=\frac{1}{2}\left\|{\bm{p}(t)}\right\|_{2}^{2},\end{cases} \tag{8}
$$
where $\bm{z}(t)=S(\bm{x}(t),t)$ and $\bm{z}(0)=J(\bm{x}(0))$ . Here, $t\mapsto\bm{p}(t)$ is constant with $\bm{p}(t)\equiv\bm{p}(0)\in\mathbb{R}^{n}$ . The characteristic line that arises from $\bm{x}(0)\in\mathbb{R}^{n}$ is $\bm{x}(t)=\bm{x}(0)+t\bm{p}(0)$ , and so $\bm{z}(t)=\bm{z}(0)+\frac{t}{2}\left\|{\bm{p}(0)}\right\|_{2}^{2}$ . Taken together, we find
$$
S(\bm{x}(t),t)=J(\bm{x}(0))+\frac{t}{2}\left\|{\bm{p}(0)}\right\|_{2}^{2}.
$$
Writing $\bm{x}(t)\equiv\bm{x}$ and $\bm{p}(0)=\nabla_{\bm{x}}S(\bm{x},t)$ (assuming formally that the spatial gradient exists at $\bm{x}$ ), we have $\bm{x}(0)=\bm{x}-t\nabla_{\bm{x}}S(\bm{x},t)$ , and so we find the representation
$$
S(\bm{x},t)=\frac{t}{2}\left\|{\nabla_{\bm{x}}S(\bm{x},t)}\right\|_{2}^{2}+J(\bm{x}-t\nabla_{\bm{x}}S(\bm{x},t)). \tag{9}
$$
This gives an implicit representation between $S$ , its spatial gradient, and the initial data $J$ . Next, we turn to the explicit representation of solutions to Eq. 3.
#### 2.2.2 Viscosity solutions and the Lax–Oleinik formula
The initial value problem Eq. 3 (and HJ PDEs with general Hamiltonians) may not have a unique generalized solution, i.e., a function satisfying the HJ PDE almost everywhere along with the initial condition $S(\bm{x},0)=J(\bm{x})$ .
**Example 2.6**
*Let $J\equiv 0$ in Eq. 3 and take $n=1$ . The corresponding HJ PDE has infinitely many solutions: For instance, the functions $S_{1}$ and $S_{2}$ given by
$$
S_{1}(x,t)=0,\quad S_{2}(x,t)=\begin{cases}0,\,&\text{if $|x|\geqslant t/2$},\\
x-t/2,\,&\text{if $0\leqslant x\leqslant t/2$},\\
-x-t/2,\,&\text{if $-t/2\leqslant x\leqslant 0$},\end{cases}
$$
satisfy $S_{1}(x,0)=S_{2}(x,0)=J(x)=0$ and both solve the corresponding HJ PDE almost everywhere.*
The notion of viscosity solution was introduced in [crandall1983viscosity] to solve this problem. Under appropriate conditions (see [bardi1998hopf, crandall1992user, crandall1983viscosity]), the viscosity solution is unique and admits a representation formula. Specifically, for the initial value problem Eq. 3 with uniformly Lipschitz continuous initial data $J$ , the unique viscosity solution is given by the Lax–Oleinik formula (with quadratic Hamiltonian)
$$
S(\bm{x},t)=\inf_{\bm{y}\in\mathbb{R}^{n}}\left\{\frac{1}{2t}\left\|{\bm{x}-\bm{y}}\right\|_{2}^{2}+J(\bm{y})\right\}. \tag{10}
$$
The (unique) viscosity solution has two important properties. First, the function $\bm{x}\mapsto S(\bm{x},t)$ is (1/t)-semiconcave. This is equivalent to requiring the function $\bm{x}\mapsto\psi(\bm{x},t)$ defined in Eq. 6 to be convex, exactly as stipulated in Theorem 2.1. Second, at any point of differentiability of $\bm{x}\mapsto S(\bm{x},t)$ , there holds
$$
\nabla_{\bm{x}}S(\bm{x},t)=\frac{\bm{x}-f_{t}(\bm{x})}{t}\iff f_{t}(\bm{x})=\bm{x}-t\nabla_{\bm{x}}S(\bm{x},t), \tag{11}
$$
where $f_{t}(\bm{x})$ denotes a global minimizer in Eq. 10. Note that substituting this expression into formula Eq. 9 obtained from the characteristic equations recovers Eq. 10 evaluated at its minimizer, as expected.
**Example 2.7 (The negative absolute value prior, continued.)**
*Let $J(x)=-|x|$ in the (one-dimensional) first-order HJ PDE Eq. 3. The function $J$ is uniformly Lipschitz continuous and, as such, the Lax–Oleinik formula $S(x,t)=-\frac{t}{2}-|x|$ is the unique viscosity solution of the corresponding HJ PDE. Note $\bm{x}\mapsto S(\bm{x},t)$ is differentiable everywhere except at $\bm{x}=0$ and $\text{prox}_{tJ}(\bm{x})=\bm{x}-t\nabla_{\bm{x}}S(\bm{x},t)$ everywhere except at $\bm{x}=0$ (see (7)).*
In summary, when $(\bm{x},t)\mapsto S(\bm{x},t)$ is the viscosity solution of the HJ initial value problem Eq. 3, a proper function $f_{t}$ satisfying Eq. 11 at points of differentiability is a proximal operator of $tJ$ . The minimization problem underlying $\text{prox}_{tJ}(\bm{x})$ is exactly the Lax–Oleinik representation formula of the viscosity solution of Eq. 3. We will see in the next section how to leverage these connections for learning the prior when $\bm{x}\mapsto J(\bm{x})$ is not available but $(\bm{x},t)\mapsto S(\bm{x},t)$ is available. But before proceeding, we briefly review convex neural networks, which will be used later in this work.
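The convexity of $x\mapsto\psi(x,t)$ for the viscosity solution can also be checked numerically by discretizing the Lax–Oleinik formula on a grid and verifying that the second differences of $\frac{1}{2}x^{2}-tS(x,t)$ are nonnegative. A one-dimensional sketch (the prior, grid, and helper names are our own illustrative choices):

```python
import numpy as np

def lax_oleinik_1d(xs, t, J, ys):
    """S(x,t) = min_y { (x-y)^2/(2t) + J(y) }, evaluated on a grid of x values."""
    vals = (xs[:, None] - ys[None, :]) ** 2 / (2.0 * t) + J(ys)[None, :]
    return vals.min(axis=1)

J = lambda y: -np.abs(np.sin(y))    # a nonconvex, uniformly Lipschitz prior (our choice)
t = 0.4
xs = np.linspace(-3.0, 3.0, 1201)
ys = np.linspace(-10.0, 10.0, 4001)
S = lax_oleinik_1d(xs, t, J, ys)

psi = 0.5 * xs**2 - t * S           # should be convex by Theorem 2.1
second_diff = psi[:-2] - 2.0 * psi[1:-1] + psi[2:]
print("min second difference of psi:", second_diff.min())  # >= 0 up to floating-point error
```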
### 2.3 Convex neural networks
Convex Neural Networks, specifically Input Convex Neural Networks (ICNN), were introduced by [amos2017input] to allow for the efficient optimization of neural networks within structured prediction and reinforcement learning tasks. The core premise of an ICNN is to constrain the network architecture such that the output is a convex function with respect to the input.
To achieve convexity, the network typically employs a recursive structure for $k=0,\dots,j-1$
$$
\bm{z}_{k+1}=g(\bm{W}_{k}\bm{z}_{k}+\bm{H}_{k}\bm{y}+\bm{b}_{k}),\qquad f(\bm{y};\theta)=\bm{z}_{j}, \tag{12}
$$
where $\bm{y}$ , $\bm{z}_{k}$ represent the input to the network and the hidden features at layer $k$ , respectively, and $g$ is the activation function. To guarantee the convexity of the output with respect to the input $\bm{y}$ , specific constraints are imposed on the parameters and the activation function, which are (i) the weights $\bm{W}_{k}$ , which connect the previous hidden layer to the current one, must be non-negative ( $\bm{W}_{k}\geqslant 0$ ), and (ii) the activation function $g$ must be convex and non-decreasing [fang2024whats].
Following [fang2024whats, Proposition 3.1], Fang et al. leverage the ICNN architecture and the characterization of proximal operators to develop Learned Proximal Networks (LPNs) for inverse problems. LPNs require stricter conditions than standard ICNNs. While standard ICNNs often use ReLU activations, LPNs require the activation function $g$ to be twice continuously differentiable. This smoothness is essential to ensure that the proximal operator is the gradient of a twice continuously differentiable function [gribonval2020characterization, Theorem 2]. Consequently, LPNs typically utilize smooth activations such as the softplus function, a smooth approximation of ReLU with sharpness parameter $\beta$ [fang2024whats, Section 3].
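For concreteness, a minimal input-convex network of the form Eq. 12 can be written in a few lines of PyTorch. The sketch below is our own simplified variant, not the official LPN implementation; it enforces the nonnegativity of the weights $\bm{W}_{k}$ by clamping and uses a softplus activation so that the output is convex and smooth in $\bm{y}$. Its gradient with respect to the input then plays the role of the learned proximal operator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleICNN(nn.Module):
    """Input-convex network: z_{k+1} = g(W_k z_k + H_k y + b_k), W_k >= 0, g convex nondecreasing."""
    def __init__(self, dim, hidden=256, layers=2, beta=5.0):
        super().__init__()
        self.beta = beta
        self.H = nn.ModuleList([nn.Linear(dim, hidden) for _ in range(layers)] +
                               [nn.Linear(dim, 1)])
        self.W = nn.ModuleList([nn.Linear(hidden, hidden, bias=False) for _ in range(layers - 1)] +
                               [nn.Linear(hidden, 1, bias=False)])

    def forward(self, y):
        z = F.softplus(self.H[0](y), beta=self.beta)
        for k in range(len(self.W) - 1):
            Wk = self.W[k].weight.clamp(min=0.0)   # clamping keeps W_k nonnegative, hence convexity in y
            z = F.softplus(z @ Wk.T + self.H[k + 1](y), beta=self.beta)
        W_last = self.W[-1].weight.clamp(min=0.0)
        return z @ W_last.T + self.H[-1](y)        # convex, C^2 potential psi_theta(y)

psi = SimpleICNN(dim=8)
y = torch.randn(4, 8, requires_grad=True)
grad = torch.autograd.grad(psi(y).sum(), y)[0]     # f_theta(y) = grad_y psi_theta(y)
print(grad.shape)                                  # torch.Size([4, 8])
```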
## 3 Connections between learning priors and the inverse problem for Hamilton–Jacobi Equations
In this section, we discuss the inverse problem of learning the prior in the proximal operator Eq. 2: given $t>0$ and some function $\bm{x}\mapsto S(\bm{x},t)$ , assess whether there exists a prior function $J$ that can recover $\bm{x}\mapsto S(\bm{x},t)$ and, if so, estimate it. Due to the connections between proximal operators and HJ Equations, as discussed in Subsections 2.1 – 2.2, our starting point will be to discuss the inverse problem from the point of view of HJ Equations.
We summarize in the next subsection some of the main results for this problem, based on the results of [esteve2020inverse] and other related works [claudel2011convex, colombo2020initial, misztela2020initial].
### 3.1 Reachability and inverse problems for Hamilton–Jacobi equations
We consider here the inverse problem associated to the HJ initial value problem Eq. 3: given $t>0$ and a function $(\bm{x},t)\mapsto S(\bm{x},t)$ , identify the set of initial data $J\colon\mathbb{R}^{n}\to\mathbb{R}$ such that the viscosity solution of Eq. 3 coincides with $S(\bm{x},t)$ . That is, we wish to characterize the set
$$
I_{t}(S)\coloneqq\left\{J\colon\mathbb{R}^{n}\to\mathbb{R}\ \text{uniformly Lipschitz continuous}\,:\,\text{$S(\bm{x},t)$ is obtained from Eq. 3 at time $t$}\right\}. \tag{13}
$$
We say the function $(\bm{x},t)\mapsto S(\bm{x},t)$ is reachable if the set $I_{t}(S)$ is nonempty. The main reachability result for the initial value problem Eq. 3 is the following:
**Theorem 3.1**
*Suppose $\bm{x}\mapsto S(\bm{x},t)$ is uniformly Lipschitz continuous. Then the set $I_{t}(S)$ defined in Eq. 13 is nonempty if and only if $\bm{x}\mapsto tS(\bm{x},t)$ is semiconcave.*
**Proof 3.2**
*This follows from [esteve2020inverse, Theorem 2.2, Theorem 6.1, and Definition 6.2].*
Now, assume $(\bm{x},t)\mapsto S(\bm{x},t)$ is reachable. What can be said about the nonempty set $I_{t}(S)$ ? Since $(\bm{x},t)\mapsto S(\bm{x},t)$ is obtained from evolving forward in time the prior function $J$ from $0$ to $t$ according to Eq. 3, a natural approach is to do the opposite: evolve backward in time from $t$ to $0$ the function $\bm{x}\mapsto S(\bm{x},t)$ . That is, we consider the terminal value problem
$$
\begin{dcases}\frac{\partial\bm{w}}{\partial\tau}(\bm{y},\tau)+\frac{1}{2}\left\|{\nabla_{\bm{y}}\bm{w}(\bm{y},\tau)}\right\|_{2}^{2}=0&\ (\bm{y},\tau)\in\mathbb{R}^{n}\times[0,t),\\
\bm{w}(\bm{y},t)=S(\bm{y},t),&\ \bm{y}\in\mathbb{R}^{n}.\end{dcases} \tag{14}
$$
Under appropriate conditions, the terminal-value problem Eq. 14 has a unique viscosity solution:
**Theorem 3.3**
*Suppose $\bm{x}\mapsto S(\bm{x},t)$ is uniformly Lipschitz continuous and semiconcave. Then the viscosity solution of the terminal-value problem Eq. 14 exists, is unique, and is given by the representation formula
$$
\bm{w}(\bm{y},\tau)=\sup_{\bm{x}\in\mathbb{R}^{n}}\left\{S(\bm{x},t)-\frac{1}{2(t-\tau)}\left\|{\bm{x}-\bm{y}}\right\|_{2}^{2}\right\}. \tag{15}
$$
Moreover, the function $\bm{y}\mapsto(t-\tau)\bm{w}(\bm{y},\tau)$ is semiconvex with unit constant.*
**Proof 3.4**
*See [barron1999regularity, Section 4, Equation 4.4.2] and [cannarsa2004semiconcave, Chapter 1].*
The viscosity solution of (14) is sometimes called the backward viscosity solution (BVS) to distinguish it from the viscosity solution of the initial value problem Eq. 3. The BVS at $\tau=0$ corresponds to fully evolving backward in time the function $\bm{x}\mapsto S(\bm{x},t)$ . In what follows, we write $J_{\text{BVS}}\coloneqq\bm{w}(\cdot,0)$ . We can use Eq. 6 to write
$$
tJ_{\text{BVS}}(\bm{y})+\frac{1}{2}\left\|{\bm{y}}\right\|_{2}^{2}=\sup_{\bm{x}\in\mathbb{R}^{n}}\left\{\langle\bm{x},\bm{y}\rangle-\psi(\bm{x},t)\right\}. \tag{16}
$$
The right hand side is the convex conjugate of $\bm{x}\mapsto\psi(\bm{x},t)$ evaluated at $\bm{y}$ , which is well-defined because $\bm{x}\mapsto\psi(\bm{x},t)$ is proper, lower semicontinuous and convex.
Theorem 3.3 suggests that $J_{\text{BVS}}$ is an initial condition that can reach $\bm{x}\mapsto S(\bm{x},t)$ . The next result stipulates that this is correct and that it is “optimal”, in the sense that it bounds from below any other reachable initial condition $J\in I_{t}(S)$ .
**Theorem 3.5**
*Let $J_{\text{BVS}}$ denote the solution of the backward HJ terminal value problem Eq. 14 at time $\tau=0$ . Then $J\in I_{t}(S)$ if and only if $J(\bm{y})\geqslant J_{\text{BVS}}(\bm{y})$ for every $\bm{y}\in\mathbb{R}^{n}$ , with equality for every $\bm{y}\in X_{t}(S)$ , where
$$
X_{t}(S)\coloneqq\left\{\bm{x}-t\nabla_{\bm{x}}S(\bm{x},t):\text{$\bm{x}\mapsto S(\bm{x},t)$ is differentiable at $\bm{x}\in\mathbb{R}^{n}$}\right\}.
$$*
**Proof 3.6**
*See [esteve2020inverse, Theorems 2.3 and 2.4].*
Theorem 3.5 stipulates that $J_{\text{BVS}}$ equals $J$ on the set $X_{t}(S)$ and bounds it from below elsewhere. This is a fundamental consequence of the semiconcavity of $\bm{x}\mapsto S(\bm{x},t)$ , which regularizes the backward viscosity solution of Eq. 14. We illustrate this below with the negative absolute value prior.
**Example 3.7 (The negative absolute value prior, continued.)**
*Let $J(x)=-|x|$ in the (one-dimensional) first-order HJ PDE Eq. 3. Recall that the unique viscosity solution is given by the Lax–Oleinik formula $S(x,t)=-\frac{t}{2}-|x|$ . We now would like to compute the corresponding unique backward viscosity solution to the terminal-value problem Eq. 14. The solution is well-defined because $\bm{x}\mapsto S(\bm{x},t)$ is uniformly Lipschitz continuous and concave. We have
$$
J_{\text{BVS}}(x)=\sup_{\bm{y}\in\mathbb{R}}\left\{-\frac{t}{2}-|y|-\frac{1}{2t}(x-y)^{2}\right\}=-\frac{t}{2}-\inf_{\bm{y}\in\mathbb{R}}\left\{\frac{1}{2t}(x-y)^{2}+|y|\right\}.
$$
The infimum on the right hand side is attained at the proximal operator of the function $y\mapsto|y|$ , which is the soft-thresholding operator:
$$
\operatorname*{arg\,min}_{y\in\mathbb{R}}\left\{\frac{1}{2t}(x-y)^{2}+|y|\right\}=\begin{cases}x-t,\,&\text{if $x>t$},\\
0,\,&\text{if $x\in[-t,t]$},\\
x+t,\,&\text{if $x<-t$}.\end{cases}
$$
This gives
$$
J_{\text{BVS}}(x)=\begin{cases}-x,\,&\text{if $x>t$},\\
-\frac{t}{2}-\frac{x^{2}}{2t},\,&\text{if $x\in[-t,t]$},\\
x,\,&\text{if $x<-t$}.\end{cases}
$$
Here, a simple calculation shows $X_{t}(S)=(-\infty,-t]\cup[t,+\infty)$ , and we find $J(x)>J_{\text{BVS}}(x)$ on $(-t,t)$ , as expected from Theorem 3.5. Moreover,
$$
tJ_{\text{BVS}}(x)+\frac{1}{2}x^{2}=\begin{cases}\frac{1}{2}(x-t)^{2}-\frac{t^{2}}{2},\,&\text{if $x>t$},\\
-\frac{t^{2}}{2},\,&\text{if $x\in[-t,t]$},\\
\frac{1}{2}(x+t)^{2}-\frac{t^{2}}{2},\,&\text{if $x<-t$},\end{cases}
$$
and we observe $x\mapsto tJ_{\text{BVS}}(x)+\frac{1}{2}x^{2}$ is convex, as expected from Theorem 3.3.*
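The backward viscosity solution of Example 3.7 can also be reproduced numerically by discretizing the supremum in Eq. 5; a short sketch (grid bounds, resolution, and helper names are our own choices):

```python
import numpy as np

def J_bvs_1d(ys, t, S_vals, xs):
    """J_BVS(y) = sup_x { S(x,t) - (x-y)^2/(2t) }, discretized over a grid of x values."""
    vals = S_vals[None, :] - (xs[None, :] - ys[:, None]) ** 2 / (2.0 * t)
    return vals.max(axis=1)

t = 0.5
xs = np.linspace(-8.0, 8.0, 8001)
S_vals = -t / 2.0 - np.abs(xs)        # viscosity solution for J(x) = -|x|
ys = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

J_num = J_bvs_1d(ys, t, S_vals, xs)
J_exact = np.where(np.abs(ys) > t, -np.abs(ys), -t / 2.0 - ys**2 / (2.0 * t))
print(np.abs(J_num - J_exact).max())  # small, limited only by the grid spacing
```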
The results here apply when the function $\bm{x}\mapsto S(\bm{x},t)$ is known. What happens when only a finite set of values of this function are available?
## 4 Learning priors and the inverse problem for Hamilton–Jacobi Equations with incomplete information
In this section, we consider the inverse problem of learning the prior in the proximal operator Eq. 2 with incomplete information: given $t>0$ and a set of samples $\{\bm{x}_{k},S(\bm{x}_{k},t),\nabla_{\bm{x}}S(\bm{x}_{k},t)\}_{k=1}^{K}$ , estimate the prior $J$ that best recovers $\bm{x}\mapsto S(\bm{x},t)$ . Recall from Theorem 3.1 that when $\bm{x}\mapsto S(\bm{x},t)$ is uniformly Lipschitz continuous, it is reachable if and only if $\bm{x}\mapsto tS(\bm{x},t)$ is semiconcave. In this case, the function $\bm{x}\mapsto J_{\text{BVS}}(\bm{x})$ obtained from the HJ terminal value problem Eq. 14 provides a prior that recovers $(\bm{x},t)\mapsto S(\bm{x},t)$ exactly. Hence we will focus on studying how to approximate the prior $J_{\text{BVS}}$ from a set of samples.
Note that if the triplet $(\bm{x}_{k},S(\bm{x}_{k},t),\nabla_{\bm{x}}S(\bm{x}_{k},t))$ is known, then (i) the function $\bm{x}\mapsto S(\bm{x},t)$ is differentiable at $\bm{x}_{k}$ and (ii) the unique minimizer in the Lax–Oleinik formula Eq. 10 can be represented via Eq. 11:
$$
S(\bm{x}_{k},t)=\frac{1}{2t}\left\|{\bm{x}_{k}-\bm{y}_{k}}\right\|_{2}^{2}+J(\bm{y}_{k}),\,\text{with}\ \bm{y}_{k}=\bm{x}_{k}-t\nabla_{\bm{x}}S(\bm{x}_{k},t). \tag{17}
$$
Moreover, Theorem 3.5 and formula Eq. 6 imply $J(\bm{y}_{k})=J_{\text{BVS}}(\bm{y}_{k})$ and that $\bm{y}\mapsto tJ_{\text{BVS}}(\bm{y})+\frac{1}{2}\left\|{\bm{y}}\right\|_{2}^{2}$ is convex. Thus one possible approach for estimating $J_{\text{BVS}}$ is to approximate $\bm{y}\mapsto tJ_{\text{BVS}}(\bm{y})+\frac{1}{2}\left\|{\bm{y}}\right\|_{2}^{2}$ from below by piecewise minorants at the points $\left\{\bm{y}_{k}\right\}_{k=1}^{K}$ .
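In code, turning the samples $\{\bm{x}_{k},S(\bm{x}_{k},t),\nabla_{\bm{x}}S(\bm{x}_{k},t)\}_{k=1}^{K}$ into pairs $(\bm{y}_{k},J_{\text{BVS}}(\bm{y}_{k}))$ via Eq. 17 is straightforward; a minimal sketch (the synthetic data below, generated from the prior $J(\bm{y})=\|\bm{y}\|_{1}$, is our own illustration):

```python
import numpy as np

def samples_to_prior_values(X, S, gradS, t):
    """Eq. 17: y_k = x_k - t*grad S(x_k,t) and J_BVS(y_k) = S(x_k,t) - ||x_k - y_k||^2/(2t)."""
    Y = X - t * gradS
    J_vals = S - np.sum((X - Y) ** 2, axis=1) / (2.0 * t)
    return Y, J_vals

# Synthetic samples with J(y) = ||y||_1, so S is its Moreau envelope and grad S = (x - prox)/t.
rng = np.random.default_rng(0)
t, n, K = 0.5, 3, 10
X = rng.uniform(-4.0, 4.0, size=(K, n))
prox = np.sign(X) * np.maximum(np.abs(X) - t, 0.0)
S = np.sum((X - prox) ** 2, axis=1) / (2.0 * t) + np.sum(np.abs(prox), axis=1)
gradS = (X - prox) / t                                   # Eq. 11
Y, J_vals = samples_to_prior_values(X, S, gradS, t)
print(np.abs(J_vals - np.sum(np.abs(Y), axis=1)).max())  # ~0: recovers J at the sampled points
```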
We consider the problem of approximating $J_{\text{BVS}}$ piecewise from below and its implications in Section 4.1. This approximation problem turns out to be related closely to max-plus algebra theory for approximating solutions to HJ PDEs [akian2006max, fleming2000max, gaubert2011curse]; we discuss this in Section 4.2. We then consider in Section 4.3 the more general problem of learning a convex function to approximate $\bm{y}\mapsto J_{\text{BVS}}(\bm{y})+\frac{1}{2}\left\|{\bm{y}}\right\|_{2}^{2}$ directly, applying the discussions in Section 4.1 - Section 4.2.
### 4.1 Piecewise approximations
We consider here piecewise approximations of the prior $\bm{y}\mapsto J_{\text{BVS}}(\bm{y})$ using the samples $\{\bm{x}_{k},S(\bm{x}_{k},t),\nabla_{\bm{x}}S(\bm{x}_{k},t)\}_{k=1}^{K}$ and formula Eq. 17. We consider first using a piecewise affine minorant (PAM) approximation, and then, assuming some regularity on $J_{\text{BVS}}$ , using a piecewise quadratic minorant (PQM) approximation.
#### 4.1.1 Piecewise affine approximation
We first consider the PAM approximation of the convex function $\bm{y}\mapsto tJ_{\text{BVS}}(\bm{y})+\frac{1}{2}\left\|{\bm{y}}\right\|_{2}^{2}$ :
$$
tJ_{\text{PAM}}(\bm{y})+\frac{1}{2}\left\|{\bm{y}}\right\|_{2}^{2}\coloneqq\max_{k\in\{1,\dots,K\}}\left\{tJ_{\text{BVS}}(\bm{y}_{k})+\frac{1}{2}\left\|{\bm{y}_{k}}\right\|_{2}^{2}+\left\langle\bm{x}_{k},\bm{y}-\bm{y}_{k}\right\rangle\right\}. \tag{18}
$$
Then $J_{\text{PAM}}(\bm{y})\leqslant J_{\text{BVS}}(\bm{y})$ for every $\bm{y}\in\mathbb{R}^{n}$ , with $J_{\text{PAM}}(\bm{y}_{k})=J_{\text{BVS}}(\bm{y}_{k})$ at each $k\in\{1,\dots,K\}$ . A short calculation gives
$$
tJ_{\text{PAM}}(\bm{y})=\max_{k\in\{1,\dots,K\}}\left\{tJ_{\text{BVS}}(\bm{y}_{k})+\frac{1}{2}\left\|{\bm{x}_{k}-\bm{y}_{k}}\right\|_{2}^{2}-\frac{1}{2}\left\|{\bm{x}_{k}-\bm{y}}\right\|_{2}^{2}\right\}.
$$
How good is $J_{\text{PAM}}$ as initial condition for the HJ PDE Eq. 3? In light of Theorem 3.5, $J_{\text{PAM}}$ , unsurprisingly, cannot reconstruct $\bm{x}\mapsto S(\bm{x},t)$ . Indeed, a formal calculation yields
$$
\inf_{\bm{y}\in\mathbb{R}^{n}}\left\{\frac{1}{2t}\left\|{\bm{x}-\bm{y}}\right\|_{2}^{2}+J_{\text{PAM}}(\bm{y})\right\}=\begin{cases}S(\bm{x}_{k},t)&\ \text{if $\bm{x}=\bm{x}_{k}$, $k\in\{1,\dots,K\}$},\\
-\infty,&\ \text{otherwise}.\end{cases} \tag{19}
$$
See Section A.1 for details. Thus approximating $J_{\text{BVS}}$ via its PAM approximation recovers the samples $\{S(\bm{x}_{k},t)\}_{k=1}^{K}$ but nothing else.
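A direct implementation of the PAM from the samples illustrates this construction; the sketch below (our own illustrative code, in one dimension) evaluates the expression for $tJ_{\text{PAM}}$ above:

```python
import numpy as np

def t_J_pam(y, t, xk, yk, tJ_bvs_k):
    """t*J_PAM(y) = max_k { t*J_BVS(y_k) + (x_k-y_k)^2/2 - (x_k-y)^2/2 } (one-dimensional)."""
    terms = (tJ_bvs_k[None, :]
             + (xk - yk)[None, :] ** 2 / 2.0
             - (xk[None, :] - y[:, None]) ** 2 / 2.0)
    return terms.max(axis=1)

# Samples generated from J(y) = |y|, so that y_k = x_k - t*grad S(x_k,t) is the soft-thresholding of x_k.
t = 0.5
xk = np.array([-3.0, -1.0, 0.2, 1.5, 2.5])
yk = np.sign(xk) * np.maximum(np.abs(xk) - t, 0.0)
tJ_bvs_k = t * np.abs(yk)                   # J_BVS = J at these points
y = np.linspace(-4.0, 4.0, 9)
print(t_J_pam(y, t, xk, yk, tJ_bvs_k) / t)  # lower bounds |y|, with equality at the y_k
```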
#### 4.1.2 Piecewise quadratic approximation
Here, we assume $\bm{y}\mapsto tJ_{\text{BVS}}(\bm{y})$ is semiconvex with constant $1-\alpha$ for some $\alpha>0$ , so that $\bm{y}\mapsto tJ_{\text{BVS}}(\bm{y})+\frac{1}{2}\left\|{\bm{y}}\right\|_{2}^{2}$ is $\alpha$ -strongly convex. We can then approximate this strongly convex function via its PQMs:
$$
tJ_{\text{PQM}}(\bm{y})+\frac{1}{2}\left\|{\bm{y}}\right\|_{2}^{2}\coloneqq\max_{k\in\{1,\dots,K\}}\left\{tJ_{\text{BVS}}(\bm{y}_{k})+\frac{1}{2}\left\|{\bm{y}_{k}}\right\|_{2}^{2}+\left\langle\bm{x}_{k},\bm{y}-\bm{y}_{k}\right\rangle+\frac{\alpha}{2}\left\|{\bm{y}-\bm{y}_{k}}\right\|_{2}^{2}\right\}.
$$
Then, $J_{\text{PQM}}(\bm{y})\leqslant J_{\text{BVS}}(\bm{y})$ for every $\bm{y}\in\mathbb{R}^{n}$ , with $J_{\text{PQM}}(\bm{y}_{k})=J_{\text{BVS}}(\bm{y}_{k})$ at each $k\in\{1,\dots,K\}$ . Moreover, a short calculation gives
$$
tJ_{\text{PQM}}(\bm{y})=\max_{k\in\{1,\dots,K\}}\left\{tJ_{\text{BVS}}(\bm{y}_{k})+\frac{1}{2}\left\|{\bm{x}_{k}-\bm{y}_{k}}\right\|_{2}^{2}-\frac{1}{2}\left\|{\bm{x}_{k}-\bm{y}}\right\|_{2}^{2}+\frac{\alpha}{2}\left\|{\bm{y}-\bm{y}_{k}}\right\|_{2}^{2}\right\}. \tag{20}
$$
How good is $J_{\text{PQM}}$ as an initial condition for the HJ PDE Eq. 3? Again, in light of Theorem 3.5, $J_{\text{PQM}}$ cannot reconstruct $\bm{x}\mapsto S(\bm{x},t)$ . Nonetheless, a formal calculation yields
$$
\inf_{\bm{y}\in\mathbb{R}^{n}}\left\{\frac{1}{2t}\left\|{\bm{x}-\bm{y}}\right\|_{2}^{2}+J_{\text{PQM}}(\bm{y})\right\}=\frac{1}{2t}\left\|{\bm{x}-\bm{y}_{k}}\right\|_{2}^{2}-\frac{1}{2t\alpha}\left\|{\bm{x}-\bm{x}_{k}}\right\|_{2}^{2}+J_{\text{BVS}}(\bm{y}_{k}) \tag{21}
$$
for some $k\in\{1,\dots,K\}$ . See Section A.2 for more details. Hence $J_{\text{PQM}}$ leads to an approximation of $(\bm{x},t)\mapsto S(\bm{x},t)$ that is finite everywhere. In the next section, we describe how max-plus algebra theory [akian2006max, fleming2000max, gaubert2011curse] can be used to quantify the approximation errors more precisely.
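Given a valid $\alpha$, the PQM of Eq. 20 is obtained from the PAM by adding the quadratic term $\frac{\alpha}{2}\left\|{\bm{y}-\bm{y}_{k}}\right\|_{2}^{2}$ to each piece; a minimal one-dimensional sketch (our own illustrative code, with the same synthetic samples as in the PAM sketch above):

```python
import numpy as np

def t_J_pqm(y, t, xk, yk, tJ_bvs_k, alpha):
    """t*J_PQM(y): the PAM pieces of Eq. 18 plus (alpha/2)*(y - y_k)^2, as in Eq. 20 (one-dimensional)."""
    terms = (tJ_bvs_k[None, :]
             + (xk - yk)[None, :] ** 2 / 2.0
             - (xk[None, :] - y[:, None]) ** 2 / 2.0
             + alpha / 2.0 * (y[:, None] - yk[None, :]) ** 2)
    return terms.max(axis=1)

t, alpha = 0.5, 0.1
xk = np.array([-3.0, -1.0, 0.2, 1.5, 2.5])
yk = np.sign(xk) * np.maximum(np.abs(xk) - t, 0.0)
tJ_bvs_k = t * np.abs(yk)
y = np.linspace(-4.0, 4.0, 9)
print(t_J_pqm(y, t, xk, yk, tJ_bvs_k, alpha) / t)   # pointwise >= the PAM and still <= |y|
```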
### 4.2 Max-plus algebra theory for Hamilton–Jacobi PDEs and approximation results
We consider here max-plus algebra techniques for approximating solutions to certain HJ PDEs. Let $\alpha>0$ and let $\Psi\colon\mathbb{R}^{n}\to\mathbb{R}$ denote a $(1-\alpha)$ -semiconvex function. Following [gaubert2011curse, Section III], we approximate $\Psi$ using $K$ vectors $\{\bm{p}_{k}\}_{k=1}^{K}\subset\mathbb{R}^{n}$ with $K$ semiconvex functions $\bm{y}\mapsto\langle\bm{p}_{k},\bm{y}\rangle-\frac{1}{2}\left\|{\bm{y}}\right\|_{2}^{2}$ and a function $a\colon\mathbb{R}^{n}\to\mathbb{R}\cup\{+\infty\}$ :
$$
\Psi_{\text{MP}}(\bm{y})\coloneqq\max_{k\in\{1,\dots,K\}}\left\{\langle\bm{p}_{k},\bm{y}\rangle-\frac{1}{2}\left\|{\bm{y}}\right\|_{2}^{2}-a(\bm{p}_{k})\right\}. \tag{22}
$$
Here, we suppose the vectors $\{\bm{p}_{k}\}_{k=1}^{K}$ and $\bm{p}\mapsto a(\bm{p})$ are selected so that $\Psi_{\text{MP}}(\bm{y})\leqslant\Psi(\bm{y})$ . As discussed in Section 4.1, such a selection is possible via affine minorants of the $\alpha$ -strongly convex function $\bm{y}\mapsto\Psi(\bm{y})+\frac{1}{2}\left\|{\bm{y}}\right\|_{2}^{2}$ . Let $\mathcal{Y}$ denote a full-dimensional compact, convex subset of $\mathbb{R}^{n}$ and consider the $L_{\infty}$ error
$$
\epsilon_{\infty}(\Psi,K,\mathcal{Y},\Psi_{\text{MP}})\coloneqq\sup_{\bm{y}\in\mathcal{Y}}|\Psi(\bm{y})-\Psi_{\text{MP}}(\bm{y})|.
$$
Furthermore, we define the corresponding minimal $L_{\infty}$ error as
$$
\delta_{\infty}(\Psi,K,\mathcal{Y})=\inf_{\Psi_{\text{MP}}\leqslant\Psi}\epsilon_{\infty}(\Psi,K,\mathcal{Y},\Psi_{\text{MP}}).
$$
The following result from max-plus algebra theory, proven in [gaubert2011curse], stipulates that whatever vectors $\{\bm{p}_{k}\}_{k=1}^{K}$ and function $\bm{p}\mapsto a(\bm{p})$ are used to approximate $\Psi$ , the minimal $L_{\infty}$ error scales as an inverse power law in $K$ and the dimension $n$ in the limit $K\to+\infty$ .
**Theorem 4.1 (Gaubert et al. (2011))**
*Let $\alpha>0$ , and let $\mathcal{Y}$ denote a full-dimensional compact, convex subset of $\mathbb{R}^{n}$ . If $\Psi\colon\mathbb{R}^{n}\to\mathbb{R}$ is twice continuously differentiable and $1-\alpha$ semiconvex, then there exists a constant $\beta(n)>0$ depending only on $n$ such that
$$
\delta_{\infty}(\Psi,K,\mathcal{Y})\sim\beta(n)\left(\frac{1}{K}\int_{\mathcal{Y}}\det\left(\nabla_{\bm{y}}^{2}\Psi(\bm{y})+\bm{I}_{n\times n}\right)^{\frac{1}{2}}\mathop{}\!d\bm{y}\right)^{2/n} \tag{23}
$$
as $K\to+\infty$ .*
Thus the minimal $L_{\infty}$ error is $\Omega(1/K^{2/n})$ as $K\to+\infty$ , though the error is smaller the closer the Hessian matrix $\nabla_{\bm{y}}^{2}\Psi(\bm{y})$ is to $-\bm{I}_{n\times n}$ (i.e., the smaller $\det(\nabla_{\bm{y}}^{2}\Psi(\bm{y})+\bm{I}_{n\times n})$ is).
### 4.3 Applications to the inverse problem for Hamilton–Jacobi Equations
We consider here the problem of quantifying approximations of the prior function $\bm{y}\mapsto J_{\text{BVS}}(\bm{y})$ when the latter is sufficiently regularized and when we have access to the values $\{\bm{x}_{k},S(\bm{x}_{k},t),\nabla_{\bm{x}}S(\bm{x}_{k},t)\}_{k=1}^{K}$ . Max-plus algebra theory provides us with a first approximation result:
**Corollary 4.2**
*Let $t>0$ and assume $tJ_{\text{BVS}}$ is twice continuously differentiable and $(1-\alpha)$ -semiconvex with $\alpha>0$ . Let $\mathcal{Y}$ denote a full-dimensional compact, convex set of $\mathbb{R}^{n}$ . Then there exists a constant $\beta(n)$ depending only on $n$ such that
$$
\delta_{\infty}(tJ_{\text{BVS}},K,\mathcal{Y})\sim\beta(n)\left(\frac{1}{K}\int_{\mathcal{Y}}\det\left(t\nabla_{\bm{y}}^{2}J_{\text{BVS}}(\bm{y})+\bm{I}_{n\times n}\right)^{\frac{1}{2}}\mathop{}\!d\bm{y}\right)^{2/n} \tag{24}
$$
as $K\to+\infty$ .*
**Proof 4.3**
*Immediate from Theorem 4.1 because $J_{\text{BVS}}$ satisfies all its assumptions.*
Corollary 4.2 provides a lower bound for the approximation error of $J_{\text{BVS}}$ relative to $J_{\text{PQM}}$ . Indeed, Theorem 4.1 and Corollary 4.2 and the fact that $J_{\text{PQM}}(\bm{y})\leqslant J_{\text{BVS}}(\bm{y})$ for every $\bm{y}\in\mathbb{R}^{n}$ imply
$$
\delta_{\infty}(tJ_{\text{BVS}},K,\mathcal{Y})\leqslant t\sup_{\bm{y}\in\mathcal{Y}}|J_{\text{BVS}}(\bm{y})-J_{\text{PQM}}(\bm{y})|. \tag{25}
$$
Thus in this case $J_{\text{PQM}}$ approximates $J_{\text{BVS}}$ from below in $\Omega(1/K^{2/n})$ as $K\to+\infty$ . We show below that a similar upper bound holds using any reachable function $\tilde{J}\in I_{t}(S)$ .
**Theorem 4.4**
*Let $t>0$ and assume $tJ_{\text{BVS}}$ is twice continuously differentiable and $(1-\alpha)$ -semiconvex with $\alpha>0$ . Let $\mathcal{Y}$ denote a full-dimensional compact, convex set of $\mathbb{R}^{n}$ and let $\tilde{J}\in I_{t}(S)$ denote a function that can reach $\bm{x}\mapsto S(\bm{x},t)$ . Then
$$
\delta_{\infty}(tJ_{\text{BVS}},K,\mathcal{Y})\leqslant t\sup_{\bm{y}\in\mathcal{Y}}|\tilde{J}(\bm{y})-J_{\text{PQM}}(\bm{y})|. \tag{26}
$$*
**Proof 4.5**
*First, note Theorem 3.5 implies $\tilde{J}(\bm{y})\geqslant J_{\text{BVS}}(\bm{y})$ for every $\bm{y}\in\mathbb{R}^{n}$ , with equality for every $\bm{y}\in\mathbb{R}^{n}$ for which $\bm{y}=\bm{x}-t\nabla_{\bm{x}}S(\bm{x},t)$ for some $\bm{x}\in\mathbb{R}^{n}$ . Thus
$$
t\tilde{J}(\bm{y})-tJ_{\text{BVS}}(\bm{y})=(t\tilde{J}(\bm{y})-tJ_{\text{PQM}}(\bm{y}))+(tJ_{\text{PQM}}(\bm{y})-tJ_{\text{BVS}}(\bm{y}))\geqslant 0,
$$
which we rearrange to get
$$
tJ_{\text{BVS}}(\bm{y})-tJ_{\text{PQM}}(\bm{y})\leqslant t\tilde{J}(\bm{y})-tJ_{\text{PQM}}(\bm{y}).
$$
Since the set $\mathcal{Y}$ is a compact and convex set, $\sup_{\bm{y}\in\mathcal{Y}}|tJ_{\text{BVS}}(\bm{y})-tJ_{\text{PQM}}(\bm{y})|$ is finite and attained in $\mathcal{Y}$ , say at $\bm{y}^{*}$ . Combining this with the inequality above yields
$$
t\sup_{\bm{y}\in\mathcal{Y}}|J_{\text{BVS}}(\bm{y})-J_{\text{PQM}}(\bm{y})|\leqslant t\tilde{J}(\bm{y}^{*})-tJ_{\text{PQM}}(\bm{y}^{*})\leqslant t\sup_{\bm{y}\in\mathcal{Y}}|\tilde{J}(\bm{y})-J_{\text{PQM}}(\bm{y})|.
$$
Finally, since $tJ_{\text{BVS}}$ is twice continuously differentiable and $(1-\alpha)$ -semiconvex with $\alpha>0$ , we can invoke Theorem 4.1 with $\Psi\equiv tJ_{\text{BVS}}$ to get
$$
\delta_{\infty}(tJ_{\text{BVS}},K,\mathcal{Y})\leqslant t\sup_{\bm{y}\in\mathcal{Y}}|\tilde{J}(\bm{y})-J_{\text{PQM}}(\bm{y})|,
$$
$$
that is, inequality Eq. 26 holds. This concludes the proof.*
Theorem 4.4 suggests it is possible to learn $J_{\text{BVS}}$ via a function $\tilde{J}$ that is twice continuously differentiable and $(1-\alpha)$ -semiconvex, and to assess the approximation error using the right-hand side of Eq. 26 as a proxy, in particular by driving $\sup_{\bm{y}\in\mathcal{Y}}|\tilde{J}(\bm{y})-J_{\text{PQM}}(\bm{y})|$ to zero with sufficiently many samples and appropriate training of $\tilde{J}$ .
In the next section, we consider the problem of learning this function using deep neural networks, specifically learned proximal networks [fang2024whats], to enforce the semiconvexity property required for $\tilde{J}$ .
## 5 Numerical results
We evaluate Learned Proximal Networks (LPNs) for approximating the proximal operators of nonconvex and concave priors. While LPNs [fang2024whats] are theoretically grounded in convex analysis (parameterizing the proximal operator as the gradient of a convex potential $\psi$ ), these experiments investigate their behavior when trained on data generated from fundamentally nonconvex and concave landscapes. All experiments utilize the official LPN implementation. The network is trained via supervised learning, minimizing the mean squared error (MSE) or L1 loss between the network output and the corresponding ground-truth values. We use an LPN with $2$ layers and $256$ hidden units with softplus activation ( $\beta=5$ ) to ensure $C^{2}$ smoothness. The model is trained using the Adam optimizer with an initial learning rate of $10^{-3}$ , decreased by a factor of $10$ every $10^{5}$ epochs, for a total of $5\times 10^{5}$ epochs.
The data generation process for all experiments is as follows: $N$ samples $y_{i}$ are drawn uniformly from the hypercube $[-a,a]^{d}$ , where $a=4$ and $d$ is the dimension, equal to $2,4,8,16,32$ , and $64$ . We use $N=3\times 10^{4}$ for $d=2,4$ , $N=3\times 10^{4}$ for $d=8,16$ , and $N=4\times 10^{4}$ for $d=32,64$ .
We also trained a second LPN to recover the prior at arbitrary points and compare its performance to the “invert” method (find $y$ such that $f_{\theta}(y)=x$ ) used in [fang2024whats] for recovering the prior from its proximal. Our second LPN is based on the relationship that the non-convex prior $J(x)$ can be approximated using the convex conjugate of the learned potential $\psi(y)$ . Specifically, we compute:
$$
J(x)\approx G(x)-\frac{1}{2}\|x\|^{2} \tag{27}
$$
where $G(x)=\psi^{*}(x)$ represents the convex conjugate of the potential $\psi_{\theta}(y)$ learned by the first LPN. We generate a new dataset $\{(x_{k},G_{k})\}$ using the trained first LPN $\psi_{\theta}$ : (i) the gradients of the first network evaluated at the original sample points $y_{k}$ ,
$$
x_{k}=\nabla_{y}\psi_{\theta}(y_{k}), \tag{28}
$$
and (ii) the values of the Legendre transform corresponding to each point,
$$
G_{k}=\langle x_{k},y_{k}\rangle-\psi_{\theta}(y_{k}). \tag{29}
$$
The network $\phi_{G}$ is trained to map the gradients $x_{k}$ to the conjugate values $G_{k}$ by minimizing the Mean Squared Error (MSE). The optimization is performed using the Adam optimizer with the same parameters as used in the first LPN. Once the second LPN is trained, the estimated non-convex prior $\hat{J}(x)$ is recovered via
$$
\hat{J}(x)=\phi_{G}(x)-\frac{1}{2}\|x\|^{2}. \tag{30}
$$
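The conjugate-dataset construction of Eqs. 28–29 and the recovery step Eq. 30 can be sketched in PyTorch as follows. This is illustrative code in our own notation, where the first trained LPN potential $\psi_{\theta}$ and the second network $\phi_{G}$ are both replaced by simple stand-ins.

```python
import torch

def make_conjugate_dataset(psi_theta, Y):
    """Eqs. 28-29: x_k = grad_y psi_theta(y_k) and G_k = <x_k, y_k> - psi_theta(y_k)."""
    Y = Y.clone().requires_grad_(True)
    vals = psi_theta(Y).squeeze(-1)                  # psi_theta(y_k)
    grads = torch.autograd.grad(vals.sum(), Y)[0]    # x_k = grad_y psi_theta(y_k)
    G = (grads * Y).sum(dim=1) - vals                # Legendre-transform values
    return grads.detach(), G.detach()

def recover_prior(phi_G, X):
    """Eq. 30: J_hat(x) = phi_G(x) - ||x||^2 / 2."""
    return phi_G(X).squeeze(-1) - 0.5 * (X ** 2).sum(dim=1)

# Toy quadratic potential standing in for the trained LPN (its convex conjugate is also quadratic).
psi_theta = lambda y: 0.5 * (y ** 2).sum(dim=1, keepdim=True)
Y = torch.rand(100, 8) * 8.0 - 4.0                   # samples in [-4, 4]^8
X, G = make_conjugate_dataset(psi_theta, Y)
# ... phi_G would be trained by MSE regression of G on X; here we use the exact conjugate instead:
phi_G = lambda x: 0.5 * (x ** 2).sum(dim=1, keepdim=True)
print(recover_prior(phi_G, X).abs().max())           # ~0, since the prior of this toy potential is 0
```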
### 5.1 Convex prior
We will benchmark our approach with the prior $J(\bm{x})=\left\|{\bm{x}}\right\|_{1}$ . For this example, the minimization defining the proximal operator separates across coordinates, and for each $j\in\{1,\dots,n\}$ we have
$$
\left[\operatorname*{arg\,min}_{\bm{y}\in\mathbb{R}^{n}}\left\{\frac{1}{2t}\left\|{\bm{x}-\bm{y}}\right\|_{2}^{2}+\left\|{\bm{y}}\right\|_{1}\right\}\right]_{j}=\operatorname*{arg\,min}_{y_{j}\in\mathbb{R}}\left\{\frac{1}{2t}(x_{j}-y_{j})^{2}+|y_{j}|\right\},
$$
that is, the componentwise soft-thresholding operator. With this, we can evaluate $S(\bm{x},t)$ and the reference convex potential $\bm{x}\mapsto\psi(\bm{x},t)\coloneqq\frac{1}{2}\left\|{\bm{x}}\right\|_{2}^{2}-tS(\bm{x},t)$ learned by the LPN.
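In code, the reference quantities for this benchmark follow directly from the soft-thresholding operator; a minimal sketch (the helper names are ours, and $t$ is passed as a parameter):

```python
import numpy as np

def reference_l1(x, t):
    """Reference S(x,t) (Moreau envelope of ||.||_1) and psi(x,t) = ||x||^2/2 - t*S(x,t)."""
    y = np.sign(x) * np.maximum(np.abs(x) - t, 0.0)  # componentwise soft-thresholding
    S = np.sum((x - y) ** 2, axis=-1) / (2.0 * t) + np.sum(np.abs(y), axis=-1)
    psi = 0.5 * np.sum(x ** 2, axis=-1) - t * S
    return S, psi

x = np.random.uniform(-4.0, 4.0, size=(5, 8))        # e.g. dimension d = 8
S_ref, psi_ref = reference_l1(x, t=1.0)
print(S_ref.shape, psi_ref.shape)                    # (5,) (5,)
```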
Table 1: Mean square errors of LPN $\psi$ and prior $J$ with 2 layers and 256 neurons in the convex L1 prior example.

| Dimension | LPN ($\psi$) | Prior ($J$) |
| --- | --- | --- |
| 2D | $1.04E-5$ | $3.33E-5$ |
| 4D | $2.97E-5$ | $2.17E-4$ |
| 8D | $1.05E-4$ | $7.25E-4$ |
| 16D | $5.27E-3$ | $2.11E-3$ |
| 32D | $1.6E-1$ | $4.03E-2$ |
| 64D | $2.89E-6$ | $2.69E-3$ |
![Cross sections (x1, 0) and (0, x2, 0) of the convex function, Dim 8: LPN vs. Ref](exp_L1_prior_8D_LPN.png)

![Cross sections (x1, 0) and (0, x2, 0) of the prior function, Dim 8: invert-LPN vs. Ref](exp_L1_prior_8D_Pr1.png)

![Cross sections of J(x1, 0, ...) and J(0, x2, 0, ...), Dim 8: LPN 2 vs. Ref J(x) = ||x||1](exp_L1_prior_8D_Pr2.png)
Figure 1: The cross sections of the convex function $\psi(x)$ for dimension $8$ (top). The bottom row compares the cross sections of the prior function from “invert LPN” (left) and our trained second LPN method (right).
### 5.2 Non-convex prior
#### Min-plus algebra example
For this example, the prior is
$$
J(\bm{x})=\min\left(\frac{1}{2\sigma_{1}}\left\|{\bm{x}-\mu_{1}}\right\|_{2}^{2},\frac{1}{2\sigma_{2}}\left\|{\bm{x}-\mu_{2}}\right\|_{2}^{2}\right).
$$
We use $\mu_{1}=(1,0,\dots,0)$ , $\mu_{2}=\bm{1}/\sqrt{n}$ , and $\sigma_{1}=\sigma_{2}=1.0$ .
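As a sanity check for this setup, the sketch below (not the paper's code; the helper names `minplus_prior` and `value_function` are ours) evaluates the min-plus prior and approximates the value function $S(\bm{x},t)$ in Eq. 1 by direct numerical minimization with a few restarts.

```python
# A minimal sanity-check sketch (not the paper's code): evaluate the min-plus
# prior J and approximate S(x, t) = min_y { ||x - y||^2/(2t) + J(y) } numerically.
import numpy as np
from scipy.optimize import minimize

def minplus_prior(y, mu1, mu2, sigma1=1.0, sigma2=1.0):
    """J(y) = min( ||y - mu1||^2/(2*sigma1), ||y - mu2||^2/(2*sigma2) )."""
    return min(np.sum((y - mu1) ** 2) / (2.0 * sigma1),
               np.sum((y - mu2) ** 2) / (2.0 * sigma2))

def value_function(x, t, mu1, mu2):
    """Approximate S(x, t); restart from x, mu1, mu2 since the prior is non-convex."""
    objective = lambda y: np.sum((x - y) ** 2) / (2.0 * t) + minplus_prior(y, mu1, mu2)
    return min(minimize(objective, y0).fun for y0 in (x, mu1, mu2))

n = 8
mu1 = np.zeros(n); mu1[0] = 1.0       # mu_1 = (1, 0, ..., 0)
mu2 = np.ones(n) / np.sqrt(n)         # mu_2 = 1/sqrt(n)
x = np.random.randn(n)
print(value_function(x, t=0.5, mu1=mu1, mu2=mu2))
```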
Table 2: Mean square errors of LPN $\psi$ and prior $J$ with 2 layers and 256 neurons in the min-plus example.
| | Dimension | LPN ( $\psi$ ) | Prior ( $J$ ) |
| --- | --- | --- | --- |
| Mean Square Errors | 2D | $3.33E-6$ | $5.73E-7$ |
| | 4D | $7.64E-6$ | $4.92E-6$ |
| | 8D | $3.64E-5$ | $1.20E-4$ |
| | 16D | $1.99E-4$ | $3.44E-4$ |
| | 32D | $1.16E-3$ | $1.33E-3$ |
| | 64D | $2.32E-9$ | $5.21E-5$ |
<details>
<summary>exp_1_minplus_8D_LPN.png Details</summary>

[Image: cross sections $(x_1,0,\dots)$ and $(0,x_2,0,\dots)$ of the convex function in dimension 8, LPN (solid) vs. reference (dashed).]
</details>
<details>
<summary>exp_1_minplus_8D_Pr1.png Details</summary>

[Image: cross sections $(x_1,0,\dots)$ and $(0,x_2,0,\dots)$ of the prior function in dimension 8, LPN vs. reference.]
</details>
<details>
<summary>exp_1_minplus_8D_pr2.png Details</summary>

[Image: cross sections of $J(x_1,0,\dots)$ and $J(0,x_2,0,\dots)$ in dimension 8, LPN 2 vs. the reference prior.]
</details>
Figure 2: The cross sections of the convex function $\psi(x)$ for dimension $8$ (top). The bottom row compares the cross sections of the prior function from “invert LPN” (left) and our trained second LPN method (right).
<details>
<summary>exp_1_minplus_32D_LPN.png Details</summary>

[Image: cross sections $(x_1,0,\dots)$ and $(0,x_2,0,\dots)$ of the convex function in dimension 32, LPN (solid) vs. reference (dashed).]
</details>
<details>
<summary>exp_1_minplus_32D_Pr1.png Details</summary>

[Image: cross sections $(x_1,0,\dots)$ and $(0,x_2,0,\dots)$ of the prior function in dimension 32, LPN vs. reference.]
</details>
<details>
<summary>exp_1_minplus_32D_pr2.png Details</summary>

[Image: cross sections of $J(x_1,0,\dots)$ and $J(0,x_2,0,\dots)$ in dimension 32, LPN 2 vs. the reference prior.]
</details>
Figure 3: The cross sections of the convex function $\psi(x)$ for dimension $32$ (top). The bottom row compares the cross sections of the prior function from “invert LPN” (left) and our trained second LPN method (right).
### 5.3 Concave prior
For this example, we use
$$
J(\bm{x})=-\left\|{\bm{x}}\right\|_{2}^{2}/4.
$$
This is a challenging example because, technically, $J$ is not uniformly Lipschitz continuous. (Numerically, we can get around this by “Huberizing” the prior.) We use this prior because an exact solution is available for this problem. It is also somewhat challenging for a convex LPN network: according to the theory of HJ PDEs, the function $J+\frac{1}{4}\left\|{\bm{x}}\right\|_{2}^{2}$ is convex and $J+\frac{1}{2}\left\|{\bm{x}}\right\|_{2}^{2}$ is strongly convex, but an LPN (which is not inherently strongly convex) may not capture this strong convexity, which makes the function more challenging to learn.
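For reference, a direct computation (a standard calculation included here as a sanity check, assuming $0<t<2$ so that the objective in Eq. 1 remains strongly convex for this prior) gives the exact value function and minimizer
$$
S(\bm{x},t)=\min_{\bm{y}\in\mathbb{R}^{n}}\left\{\frac{1}{2t}\left\|{\bm{x}-\bm{y}}\right\|_{2}^{2}-\frac{1}{4}\left\|{\bm{y}}\right\|_{2}^{2}\right\}=-\frac{\left\|{\bm{x}}\right\|_{2}^{2}}{2(2-t)},\qquad\bm{y}^{\star}=\frac{2\bm{x}}{2-t},
$$
which recovers $J(\bm{x})=-\left\|{\bm{x}}\right\|_{2}^{2}/4$ in the limit $t\to 0$.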
Table 3: Mean square errors of LPN $\psi$ and prior $J$ with 2 layers and 256 neurons in the concave prior example.
| | Dimension | LPN ( $\psi$ ) | Prior ( $J$ ) |
| --- | --- | --- | --- |
| Mean Square Errors | 2D | $7.00E-7$ | $1.57E-6$ |
| | 4D | $2.74E-5$ | $7.70E-5$ |
| | 8D | $5.58E-4$ | $7.91E-4$ |
| | 16D | $3.69E-3$ | $3.28E-3$ |
| | 32D | $8.70E-2$ | $3.01E-2$ |
| | 64D | $6.23E-6$ | $1.87E-3$ |
<details>
<summary>exp_quadratic_concave_prior_8D_PR1.png Details</summary>

[Image: cross sections of the prior $J(x_1,0,\dots)$ and $J(0,x_2,0,\dots)$ in dimension 8, learned LPN prior vs. true $J(\bm{x})=-\frac{1}{4}\left\|{\bm{x}}\right\|_{2}^{2}$.]
</details>
<details>
<summary>exp_quadratic_concave_prior_8D_PR2.png Details</summary>

[Image: cross sections of the prior $J(x_1,0,\dots)$ and $J(0,x_2,0,\dots)$ in dimension 8, LPN 2 vs. true $J(\bm{x})=-\frac{1}{4}\left\|{\bm{x}}\right\|_{2}^{2}$.]
</details>
Figure 4: The cross sections of the prior function for dimension $8$ from “invert LPN” (left) and our trained second LPN method (right).
### 5.4 Negative $\ell_{1}$ norm
For this example, we consider $J(\bm{y})=-\left\|{\bm{y}}\right\|_{1}$; Eq. 32 below gives the minimum value and Eq. 33 the corresponding proximal points. Let $n=1$ for simplicity and consider the one-dimensional problem
$$
S(x,t)=\min_{y\in\mathbb{R}}\left\{\frac{1}{2t}(x-y)^{2}-|y|\right\}. \tag{31}
$$
The function $y\mapsto\frac{1}{2t}(x-y)^{2}-|y|$ is differentiable everywhere except at $y=0$. A stationary point of this function satisfies
$$
0\in\frac{y-x}{t}-\partial|y|\iff\begin{cases}y=x-t,&\ \text{if $y<0$,}\\
x\in[-t,t],&\ \text{if $y=0$},\\
y=x+t,&\ \text{if $y>0$.}\end{cases}
$$
If $x>t$ , the only minimum is $x+t$ , in which case we have
$$
S(x,t)=\frac{1}{2t}(x+t-x)^{2}-|x+t|=\frac{t}{2}-(x+t)=-\frac{t}{2}-x.
$$
If $0<x\leqslant t$, the stationary points are $x+t$, $0$, and (when $x<t$) $x-t$, but the global minimum is still attained at $x+t$, again yielding $S(x,t)=-\frac{t}{2}-x$. If $x=0$, the stationary points are $-t$, $0$, and $t$, and the global minimum is attained at either $-t$ or $t$, yielding $S(0,t)=-\frac{t}{2}$. If $-t\leqslant x<0$, the global minimum is attained at $x-t$, yielding $S(x,t)=-\frac{t}{2}+x$. If $x<-t$, the only stationary point is $x-t$, in which case we have $S(x,t)=-\frac{t}{2}+x$. Hence
$$
S(x,t)=-\frac{t}{2}-|x|. \tag{32}
$$
In particular, its gradient in $x$ is given by
$$
\nabla_{x}S(x,t)\in\begin{cases}1&\ \text{if $x<0$},\\
[-1,1]&\ \text{if $x=0$},\\
-1&\ \text{if $x>0$}.\end{cases}
$$
Moreover,
$$
\operatorname*{arg\,min}_{y\in\mathbb{R}}\left\{\frac{1}{2t}(x-y)^{2}-|y|\right\}=\begin{cases}x-t&\ \text{if $x<0$},\\
\{-t,t\}&\ \text{if $x=0$},\\
x+t&\ \text{if $x>0$}.\end{cases} \tag{33}
$$
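As a quick numerical check of Eq. 32 and Eq. 33 (a minimal sketch, not part of the paper's experiments), one can compare the closed forms above with a brute-force grid minimization of Eq. 31:

```python
# A minimal numerical check (ours, not the paper's code) of Eq. 32 and Eq. 33
# for the one-dimensional negative-l1 example.
import numpy as np

def brute_force_S(x, t, half_width=20.0, n_grid=400001):
    """Evaluate S(x,t) = min_y {(x-y)^2/(2t) - |y|} by grid search."""
    y = np.linspace(-half_width, half_width, n_grid)
    obj = (x - y) ** 2 / (2.0 * t) - np.abs(y)
    k = obj.argmin()
    return obj[k], y[k]

t = 0.5
# x = 0 is skipped: there both -t and t attain the minimum (Eq. 33).
for x in (-1.3, 0.7, 2.0):
    S_num, y_num = brute_force_S(x, t)
    S_exact = -t / 2.0 - abs(x)           # Eq. 32
    y_exact = x - t if x < 0 else x + t   # Eq. 33 (x != 0)
    print(f"x={x:+.1f}: S_num={S_num:.4f} vs S_exact={S_exact:.4f}, "
          f"prox_num={y_num:.4f} vs prox_exact={y_exact:.4f}")
```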
Table 4: Mean square errors of LPN $\psi$ and prior $J$ with 2 layers and 256 neurons in the negative $\ell_{1}$ norm example.
| | Dimension | LPN ( $\psi$ ) | Prior ( $J$ ) |
| --- | --- | --- | --- |
| Mean Square Errors | 2D | $6.59E-5$ | $5.20E-6$ |
| | 4D | $3.15E-4$ | $3.17E-5$ |
| | 8D | $2.12E-3$ | $2.94E-4$ |
| | 16D | $8.01E-3$ | $4.49E-2$ |
| | 32D | $1.55E-1$ | $2.29E-2$ |
| | 64D | $6.42E-4$ | $4.49E-3$ |
<details>
<summary>exp_NegL1_prior_4D_LPN.png Details</summary>

[Image: cross sections $(x_1,0,\dots)$ and $(0,x_2,0,\dots)$ of the convex function in dimension 4, LPN (solid) vs. reference (dashed).]
</details>
<details>
<summary>exp_NegL1_prior_8D_LPN.png Details</summary>

[Image: cross sections $(x_1,0,\dots)$ and $(0,x_2,0,\dots)$ of the convex function in dimension 8, LPN (solid) vs. reference (dashed).]
</details>
Figure 5: The cross sections of the convex function $\psi(x)$ for dimension $4$ (left) and $8$ (right).
Tables 1, 2, 3, and 4 quantify the scalability of the LPN approach across dimensions ranging from $2$D to $64$D. The results indicate that the method is highly accurate in lower dimensions ($2$D through $8$D), achieving mean square errors in the range of $10^{-7}$ to $10^{-4}$. Performance degrades as the dimensionality increases, particularly at $16$D and $32$D, where the errors grow to the range of $10^{-3}$ to $10^{-1}$. The weaker performance in higher dimensions is likely because we did not train long enough and used a simple architecture (a more expressive architecture would be warranted for the higher-dimensional problems). While the error for the recovered prior $J$ is generally slightly higher than that of the LPN $\psi$, this is expected given the added complexity of recovering non-convex, non-smooth, or concave functions. Overall, the errors remain low, validating our method’s effectiveness even in high-dimensional settings.
The top rows of Figures 1, 2, 3 and Figure 5 demonstrate that the LPN accurately learns the cross sections of the convex function $\psi(x)$ for dimensions $4$, $8$, and $32$, closely matching the reference function, with the most significant deviation in Figure 3 (dimension $32$). The bottom rows of these figures compare the “invert LPN” method (left) and our trained second LPN method (right). In all cases, our direct method recovers the original prior: the sharp V-shaped $\ell_{1}$ prior in Figure 1, the non-convex prior in Figures 2 and 3, and the concave quadratic prior in Figure 4, despite the latter’s challenging nature.
Figure 5 shows the cross sections of the LPN value function against the ground truth for dimensions $4$ (left) and $8$ (right). In both instances, the LPN approximation fits the analytical reference tightly, closely capturing the characteristic V-shape and the non-smooth geometry of the underlying function. The errors for the $64$D cases are consistently lower than expected across all examples, yet the corresponding cross sections are not approximated well visually. We cannot yet explain this behaviour in the $64$D cases, but we suspect it is due to the way we sample the hypercube.
## 6 Discussion
In this work, we leveraged the theory of viscosity solutions of HJ PDEs to develop novel deep learning numerical methods to learn, from data, the underlying prior of the proximal operator Eq. 2 yielding $(\bm{x},t)\mapsto S(\bm{x},t)$ defined in Eq. 1. Our approach built on the existing connections between proximal operators and HJ PDEs, crucially the fact that $(\bm{x},t)\mapsto S(\bm{x},t)$ is obtained from the solution to an HJ PDE, and in particular on the theory for the inverse problem for HJ equations. As discussed in Section 3, this theory shows that while there may be infinitely many priors that can recover Eq. 1, there is a natural choice, obtained by reversing time in the HJ PDE Eq. 14 and using the value of the proximal operator $(\bm{x},t)\mapsto S(\bm{x},t)$ as initial condition. The resulting backward viscosity solution yields a prior $J_{\text{BVS}}$ that reconstructs $(\bm{x},t)\mapsto S(\bm{x},t)$ and is semiconvex. In Section 4, we considered the case where only samples of the proximal operator and its values are available, and used techniques from max-plus algebra to derive characterizations and error properties of $J_{\text{BVS}}$ with respect to convex functions approximating it from above. Finally, in Section 5 we proposed to learn the prior $J_{\text{BVS}}$ by training a convex neural network, specifically a learned proximal network, on a function of the form $\bm{y}\mapsto\tilde{J}(\bm{y})+\frac{1}{2}\left\|{\bm{y}}\right\|_{2}^{2}$ from data $\{\bm{x}_{k},S(\bm{x}_{k},t),\nabla_{\bm{x}}S(\bm{x}_{k},t)\}_{k=1}^{K}$ via Eq. 17. We presented several numerical results that demonstrate the efficiency of our proposed method in high dimensions.
While this work focused on proximal operators, we expect our approach can be extended to a broad class of Bregman divergences, as recent results in the theory of inverse problems for HJ equations suggest [esteve2020inverse]. Another potential direction, when the value of the proximal operator $(\bm{x},t)\mapsto S(\bm{x},t)$ is known, would be to learn the prior $J$ using Monte Carlo sampling strategies, as recently proposed in [park2025neural] for the forward problem of HJ equations (i.e., learning $(\bm{x},t)\mapsto S(\bm{x},t)$ from a known $J$). In the longer term, it would be interesting to devise similar deep learning methods for the inverse problem of HJ equations with possibly time- or state-dependent Hamiltonians, relevant to optimal control problems.
## Appendix A Calculations
### A.1 Formal calculation of Eq. 19
We have
$$
\inf_{\bm{y}\in\mathbb{R}^{n}}\left\{\frac{1}{2t}\left\|{\bm{x}-\bm{y}}\right\|_{2}^{2}+J_{\text{PAM}}(\bm{y})\right\}=\inf_{\bm{y}\in\mathbb{R}^{n}}\left\{\frac{1}{2t}\left\|{\bm{x}-\bm{y}}\right\|_{2}^{2}+\right.
$$
### A.2 Formal calculation of Eq. 21
Formally, we have
$$
t\nabla_{\bm{y}}J_{\text{PQM}}(\bm{y})=\alpha(\bm{y}-\bm{y}_{k})+\bm{x}_{k}-\bm{y}
$$
for some $k\in\{1,\dots,K\}$ . Similarly,
$$
\hat{\bm{y}}=\operatorname*{arg\,min}_{\bm{y}\in\mathbb{R}^{n}}\left\{\frac{1}{2t}\left\|{\bm{x}-\bm{y}}\right\|_{2}^{2}+J_{\text{PQM}}(\bm{y})\right\}\iff\bm{0}=\frac{\hat{\bm{y}}-\bm{x}}{t}+\frac{\alpha}{t}(\hat{\bm{y}}-\bm{y}_{k})+\frac{\bm{x}_{k}-\hat{\bm{y}}}{t},
$$
that is, $\hat{\bm{y}}=\bm{y}_{k}+\frac{1}{\alpha}(\bm{x}-\bm{x}_{k})$.
In addition,
$$
\bm{x}-\hat{\bm{y}}=\frac{\bm{x}_{k}-(1-\alpha)\bm{x}}{\alpha}-\bm{y}_{k}\implies\frac{1}{2t}\left\|{\bm{x}-\hat{\bm{y}}}\right\|_{2}^{2}=\frac{1}{2t\alpha^{2}}\left\|{\bm{x}_{k}-\bm{x}+\alpha(\bm{x}-\bm{y}_{k})}\right\|_{2}^{2},
$$
and
$$
J_{\text{PQM}}(\hat{\bm{y}})=J(\bm{y}_{k})+\frac{1}{2t}\left\|{\bm{x}_{k}-\bm{y}_{k}}\right\|_{2}^{2}-\frac{1}{2t}\left\|{\bm{x}_{k}-\hat{\bm{y}}}\right\|_{2}^{2}+\frac{\alpha}{2t}\left\|{\hat{\bm{y}}-\bm{y}_{k}}\right\|_{2}^{2}.
$$
From this, we deduce
$$
\inf_{\bm{y}\in\mathbb{R}^{n}}\left\{\frac{1}{2t}\left\|{\bm{x}-\bm{y}}\right\|_{2}^{2}+J_{\text{PQM}}(\bm{y})\right\}=\frac{1}{2t\alpha^{2}}\left\|{\bm{x}_{k}-\bm{x}+\alpha(\bm{x}-\bm{y}_{k})}\right\|_{2}^{2}
$$
for some $k\in\{1,\dots,K\}$ .
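The algebra above is easy to check numerically; the following short sketch (ours, with arbitrary test values) verifies that $\hat{\bm{y}}=\bm{y}_{k}+(\bm{x}-\bm{x}_{k})/\alpha$ satisfies the displayed optimality condition and the norm identity used in the derivation.

```python
# A small numerical consistency check (ours, not from the paper) of Appendix A.2:
# the stationary point y_hat and the identity for ||x - y_hat||^2 / (2t).
import numpy as np

rng = np.random.default_rng(0)
n, t, alpha = 5, 0.7, 0.3
x, x_k, y_k = rng.standard_normal(n), rng.standard_normal(n), rng.standard_normal(n)

y_hat = y_k + (x - x_k) / alpha   # solves 0 = (y-x)/t + (alpha/t)(y - y_k) + (x_k - y)/t
residual = (y_hat - x) / t + (alpha / t) * (y_hat - y_k) + (x_k - y_hat) / t
lhs = np.sum((x - y_hat) ** 2) / (2 * t)
rhs = np.sum((x_k - x + alpha * (x - y_k)) ** 2) / (2 * t * alpha ** 2)
print(np.max(np.abs(residual)), lhs - rhs)   # both should be ~ 0
```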