# The Causal Round Trip: Generating Authentic Counterfactuals by Eliminating Information Loss
**Authors**:
- Rui Wu, wurui22@mail.ustc.edu.cn (School of Management, University of Science and Technology of China)
- Lizheng Wang, lzwang@ustc.edu.cn (School of Management, University of Science and Technology of China)
- Yongjun Li, lionli@ustc.edu.cn (School of Management, University of Science and Technology of China)
> Corresponding author.
Abstract
Judea Pearl's vision of Structural Causal Models (SCMs) as engines for counterfactual reasoning hinges on faithful abduction: the precise inference of latent exogenous noise. For decades, operationalizing this step for complex, non-linear mechanisms has remained a significant computational challenge. The advent of diffusion models, powerful universal function approximators, offers a promising solution. However, we argue that their standard design, optimized for perceptual generation over logical inference, introduces a fundamental flaw for this classical problem: an inherent information loss we term the Structural Reconstruction Error (SRE). To address this challenge, we formalize the principle of Causal Information Conservation (CIC) as the necessary condition for faithful abduction. We then introduce BELM-MDCM, the first diffusion-based framework engineered to be causally sound by eliminating SRE by construction through an analytically invertible mechanism. To operationalize this framework, a Targeted Modeling strategy provides structural regularization, while a Hybrid Training Objective instills a strong causal inductive bias. Rigorous experiments demonstrate that our Zero-SRE framework not only achieves state-of-the-art accuracy but, more importantly, enables the high-fidelity, individual-level counterfactuals required for deep causal inquiries. Our work provides a foundational blueprint that reconciles the power of modern generative models with the rigor of classical causal theory, establishing a new and more rigorous standard for this emerging field.
Keywords: Causal Inference, Diffusion Models, Causal Information Conservation, Structural Causal Models, Counterfactual Generation, BELM, Structural Reconstruction Error
1 Introduction
The fundamental challenge of causal inference, as articulated by rubin1974estimating, is our inability to simultaneously observe an individual's potential outcomes. Generating authentic counterfactuals is thus the field's grand challenge. Structural Causal Models (SCMs), introduced by pearl2009causality, provide the formal language for this pursuit. An SCM posits that an outcome $V_{i}$ is generated by a function of its parents $\mathbf{Pa}_{i}$ and a unique exogenous noise variable $U_{i}$ . This noise, $U_{i}$ , represents the primordial causal information: the collection of unobserved factors unique to an individual. This concept aligns directly with the long-standing focus in econometrics on unobserved individual heterogeneity, a central challenge in structural modeling for decades (heckman2001micro). Pearl's framework for causal reasoning, the Abduction-Action-Prediction cycle, hinges on the fidelity of the first step: abduction. To answer any "what if" question, one must first perfectly infer this primordial information $U_{i}$ from an observed outcome $v_{i}$ . For decades, while this theoretical blueprint was clear, its practical realization for complex, non-linear mechanisms remained a major computational hurdle, often addressed in econometrics through strong parametric assumptions or linear approximations (angrist2008mostly).
The advent of deep generative models, particularly diffusion models (ho2020denoising), offers a powerful new hope for bridging this gap. As near-universal function approximators, they possess the expressive power to learn the complex, non-linear functions that have long challenged classical methods (chao2023interventional; sanchez2022dcms). However, this promise is shadowed by a critical, yet overlooked, "impedance mismatch." These models were engineered for perceptual tasks like image synthesis, where visual plausibility is paramount, not for the logical rigor demanded by causal abduction. We argue that their standard design, which relies on approximate inversion schemes like DDIM (song2021denoising), is fundamentally at odds with the strict requirements of this classical causal problem.
In this work, we diagnose and resolve this conflict. We begin by giving the classic requirement for faithful abduction a modern name: Causal Information Conservation (CIC), which we define operationally as the lossless, deterministic recovery of the exogenous noise variable $U$ . The novelty of CIC lies in its application as a design principle and diagnostic tool for the diffusion-model paradigm in causality, rather than as a formal information-theoretic quantity; connecting this operational principle to formal measures, such as mutual information, is a compelling avenue for future research. Our core contribution is the identification that standard diffusion models systematically violate this principle due to an inherent algorithmic flaw. We formalize this flaw as the Structural Reconstruction Error (SRE): a quantifiable information loss that imposes a hard theoretical ceiling on the fidelity of any counterfactual generated by such methods. The SRE is not an estimation error to be solved with more data, but a structural defect in the tool itself.
To solve the long-standing challenge of operationalizing faithful abduction, we introduce BELM-MDCM. It is not merely a new model, but the first diffusion-based framework re-engineered from first principles to be causally sound. Architected around an analytically invertible sampler (liu2024belm), it is the first Zero-SRE causal framework by construction. This design choice reconciles the expressive power of modern diffusion models with the logical rigor of Pearl's causal theory, ensuring the abduction step is lossless. Our primary contributions are therefore:
1. Diagnosing a Fundamental Barrier in a Classic Problem. We are the first to identify that standard diffusion models, when applied to the classic problem of SCM abduction, suffer from a structural flaw we term the Structural Reconstruction Error (SRE), which violates the foundational principle of Causal Information Conservation.
1. Proposing the First Causally-Sound Diffusion Framework. We introduce BELM-MDCM, the first framework to eliminate SRE by design. By leveraging an analytically invertible mechanism, it ensures that the power of diffusion models can be applied to causality without compromising the integrity of the abduction process.
1. Developing a Principled Methodology to Operationalize the Framework. To make our Zero-SRE framework practical and robust, we introduce two synergistic innovations: a Targeted Modeling strategy to manage complexity and a Hybrid Training Objective to provide a strong causal inductive bias, both supported by our theoretical analysis.
Through a comprehensive experimental evaluation, we demonstrate that BELM-MDCM not only sets a new state-of-the-art in estimation accuracy but, more critically, unlocks the generation of authentic individual-level counterfactuals for deep causal inquiries. By providing a foundational blueprint that resolves a core tension between modern machine learning and classical causal theory, our work establishes a new, more rigorous standard for this research direction.
1.1 The Inversion Challenge in Diffusion-Based Causality
Diffusion models (ho2020denoising) are powerful generative models that learn to reverse a fixed, gradual noising process. They train a neural network, $\epsilon_{\theta}(\mathbf{x}_{t},t)$ , to predict the noise component of a corrupted sample $\mathbf{x}_{t}$ by optimizing a simple mean-squared error objective:
$$
\begin{split}L_{\text{simple}}(\theta)=\mathbb{E}_{t,\mathbf{x}_{0},\boldsymbol{\epsilon}}\bigg[\Big\|\boldsymbol{\epsilon}-\epsilon_{\theta}\big(\sqrt{\bar{\alpha}_{t}}\mathbf{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}\boldsymbol{\epsilon},t\big)\Big\|^{2}\bigg]\end{split} \tag{1}
$$
where $\bar{\alpha}_{t}$ defines the noise schedule and $\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ . This trained network is then used to iteratively denoise a variable from pure noise back to a clean sample. A standard deterministic method for this generative process is the Denoising Diffusion Implicit Model (DDIM) (song2021denoising):
$$
\begin{split}\mathbf{x}_{t-1}=\sqrt{\bar{\alpha}_{t-1}}\left(\frac{\mathbf{x}_{t}-\sqrt{1-\bar{\alpha}_{t}}\epsilon_{\theta}(\mathbf{x}_{t},t)}{\sqrt{\bar{\alpha}_{t}}}\right)+\sqrt{1-\bar{\alpha}_{t-1}}\cdot\epsilon_{\theta}(\mathbf{x}_{t},t)\end{split} \tag{2}
$$
However, causal abduction requires the inverse operation: encoding an observed data point $\mathbf{x}_{0}$ into its latent noise code $\mathbf{x}_{T}$ . Standard frameworks (chao2023interventional) use the DDIM inversion, which only approximates this path:
$$
\begin{split}\mathbf{x}_{t+1}=\sqrt{\bar{\alpha}_{t+1}}\left(\frac{\mathbf{x}_{t}-\sqrt{1-\bar{\alpha}_{t}}\epsilon_{\theta}(\mathbf{x}_{t},t)}{\sqrt{\bar{\alpha}_{t}}}\right)+\sqrt{1-\bar{\alpha}_{t+1}}\cdot\epsilon_{\theta}(\mathbf{x}_{t},t)\end{split} \tag{3}
$$
This inversion is approximate because it relies on the noise prediction $\epsilon_{\theta}(\mathbf{x}_{t},t)$ remaining constant across the step, which introduces discretization errors that accumulate (liu2022pseudo). This structural flaw, which we term the Structural Reconstruction Error (SRE), systematically corrupts the inferred exogenous noise $U_{i}$ . The initial error in the abduction step then propagates through the entire Abduction-Action-Prediction cycle, compromising the fidelity of the final counterfactual.
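The information loss is easy to exhibit numerically. The sketch below implements single steps of Eqs. (2) and (3) with a toy, analytically specified noise predictor (`eps_theta` and the linear schedule `abar` are illustrative stand-ins, not a trained network) and measures the round-trip residual; because each direction freezes the noise prediction at a different state, the encode-decode cycle does not return to $\mathbf{x}_0$:

```python
import numpy as np

T = 50
abar = np.linspace(0.999, 0.01, T + 1)  # toy noise schedule (stand-in)

def eps_theta(x, t):
    # stand-in noise predictor; any x-dependent function exposes the asymmetry
    return np.tanh(x + 0.1 * t)

def ddim_invert_step(x, t):
    # Eq. (3): x_t -> x_{t+1}, with the prediction frozen at (x_t, t)
    e = eps_theta(x, t)
    x0_pred = (x - np.sqrt(1 - abar[t]) * e) / np.sqrt(abar[t])
    return np.sqrt(abar[t + 1]) * x0_pred + np.sqrt(1 - abar[t + 1]) * e

def ddim_generate_step(x, t):
    # Eq. (2): x_{t+1} -> x_t, but now the prediction is taken at (x_{t+1}, t+1)
    e = eps_theta(x, t + 1)
    x0_pred = (x - np.sqrt(1 - abar[t + 1]) * e) / np.sqrt(abar[t + 1])
    return np.sqrt(abar[t]) * x0_pred + np.sqrt(1 - abar[t]) * e

x0 = np.array([0.7])
x = x0.copy()
for t in range(T):            # abduction: encode x_0 into latent noise x_T
    x = ddim_invert_step(x, t)
for t in reversed(range(T)):  # generation: decode x_T back toward x_0
    x = ddim_generate_step(x, t)
sre = float(np.abs(x - x0))   # strictly positive: the round trip is lossy
```

With a trained network the picture is the same: the residual shrinks as the discretization is refined, but it is nonzero for any finite number of steps and propagates into the abducted noise $U_i$.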
1.2 Our Solution: A Zero-SRE Causal Framework
To eliminate SRE by construction, we build our framework upon an analytically invertible sampler: the Bidirectional Explicit Linear Multi-step (BELM) sampler (liu2024belm). BELM overcomes the "memoryless" limitation of single-step samplers like DDIM by using a history of noise predictions, a principle grounded in the classical theory of linear multi-step ODE solvers (hairer2006solving).
Specifically, we employ a second-order BELM. During decoding, it computes a more stable effective noise, $\boldsymbol{\epsilon}_{\text{eff}}$ , using predictions from the current and previous timesteps:
$$
\boldsymbol{\epsilon}_{\text{eff}}=\frac{3}{2}\epsilon_{\theta}(\mathbf{x}_{t},t)-\frac{1}{2}\epsilon_{\theta}(\mathbf{x}_{t+1},t+1) \tag{4}
$$
This improved estimate is then used in a DDIM-like update. The key innovation is that the corresponding encoding process is constructed to be the exact algebraic inverse of this decoding process, guaranteeing that the round-trip is lossless, i.e., $\mathbf{H}(\mathbf{T}(\mathbf{x}_{0}))=\mathbf{x}_{0}$ . While the original work on BELM focused on general generative tasks, we are the first to identify, leverage, and theoretically justify its analytical invertibility as the key to satisfying the principle of Causal Information Conservation for rigorous counterfactual generation. Our choice of a second-order BELM represents a deliberate trade-off, providing substantial accuracy gains over single-step methods while maintaining practical efficiency (liu2024belm), making it ideal for our causal framework.
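The structural property behind this guarantee can be sketched in a few lines: if each update is linear in the trajectory states and evaluates the noise network only at a state available to both directions, then the encoder is the closed-form algebraic inverse of the decoder. The two-step coefficients `b`, `c`, the seed state, and `eps_theta` below are illustrative stand-ins, not the schedule-derived BELM coefficients:

```python
import numpy as np

T = 50
b, c = 0.1, -0.05  # illustrative two-step coefficients (actual BELM
                   # coefficients are derived from the noise schedule)

def eps_theta(x, t):
    # stand-in noise predictor; the invertibility below is purely
    # algebraic, so it holds for any network
    return np.tanh(x + 0.1 * t)

def decode_step(x_t, x_tp1, t):
    # x_{t-1} = x_{t+1} + b*x_t + c*eps(x_t, t): the network is evaluated
    # only at x_t, a state known to both directions
    return x_tp1 + b * x_t + c * eps_theta(x_t, t)

def encode_step(x_tm1, x_t, t):
    # exact algebraic inverse: solve the decode update for x_{t+1}
    return x_tm1 - b * x_t - c * eps_theta(x_t, t)

x0 = np.array([0.7])
traj = [x0, x0 + 0.05]             # x_0 and a seed x_1 for the two-step scheme
for t in range(1, T):              # encode: build x_2 ... x_T
    traj.append(encode_step(traj[t - 1], traj[t], t))

x_hi, x_lo = traj[T], traj[T - 1]  # decode from the top two states back down
for t in reversed(range(1, T)):
    x_hi, x_lo = x_lo, decode_step(x_lo, x_hi, t)
err = float(np.abs(x_lo - x0))     # zero up to floating-point round-off
```

Here the round-trip residual is limited only by floating-point round-off, whereas a DDIM-style round trip, which evaluates the network at different states in each direction, leaves a systematic residual.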
1.3 Methodological Gaps in Applying Invertible SCMs
However, achieving high-fidelity causal inference requires more than a simple substitution of one sampler for another. The principle of analytical invertibility, while theoretically sound, exposes new challenges in practical SCM implementation that our framework is designed to address.
The Challenge of Model Specification: Targeted Modeling.
A key decision in SCM construction is assigning a causal mechanism to each node. Naively applying a complex, computationally expensive BELM-based diffusion model to every node in the causal graph is suboptimal. This motivates our Targeted Modeling strategy, where model complexity is treated as a resource to be allocated judiciously across the graph.
The Challenge of Downstream Tasks: Hybrid Training.
The second challenge arises from a fundamental mismatch in objectives. A diffusion model is trained on a generative objective, $L_{\text{diffusion}}(\theta)$ , while a downstream predictive task is optimized using a discriminative loss, $L_{\text{task}}(\phi)$ . These two objectives are not aligned. This "objective mismatch" motivates our Hybrid Training strategy, which seeks to unify these two goals.
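The mismatch can be made concrete with a deliberately tiny, hypothetical example: a single shared scalar parameter `theta`, a stand-in noise-prediction loss in the style of Eq. (1), and a stand-in discriminative loss. The weighting `lam` and the data are illustrative only; the point is that the task term reshapes which parameter the generative objective alone would settle on:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=256)
eps_true = rng.normal(size=256)   # noise targets for the generative loss
y = 2.0 * x                       # downstream predictive targets

def l_diffusion(theta):
    # stand-in generative objective: noise-prediction MSE (cf. Eq. (1))
    return np.mean((eps_true - theta * x) ** 2)

def l_task(theta):
    # stand-in discriminative loss for the downstream task
    return np.mean((y - theta * x) ** 2)

def l_total(theta, lam):
    # hybrid objective: L_total = L_diffusion + lambda * L_task
    return l_diffusion(theta) + lam * l_task(theta)

# Brute-force minimization of the 1-D "model" over a grid:
grid = np.linspace(-1.0, 3.0, 4001)
argmin = lambda lam: grid[np.argmin([l_total(th, lam) for th in grid])]
theta_gen = argmin(0.0)   # generative objective alone
theta_hyb = argmin(5.0)   # task term pulls the optimum toward theta = 2
```

The purely generative optimum ignores the downstream target, while the hybrid optimum is pulled toward it; Section 2.7 formalizes this effect as a weighted score-matching objective.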
2 Theoretical Analysis: An Operator-Theoretic Framework
To formalize our thesis that Causal Information Conservation is paramount and its violation via Structural Reconstruction Error is a fundamental barrier, we develop a rigorous operator-theoretic framework. This perspective is essential for analyzing the fidelity of the causal mapping process itself, moving beyond simple prediction errors. We present the first formal analysis that decomposes the counterfactual error in diffusion-based causal models to explicitly isolate the SRE, proving how our Zero-SRE design eliminates this critical structural limitation.
Our analysis first establishes the conditions for perfect counterfactual generation (§ 2.1-§ 2.3) and proves that standard methods produce a non-zero SRE, which our sampler eliminates by construction (Propositions 5 and 6; § 2.4). The centerpiece is a novel error decomposition theorem that isolates the SRE, motivating our Zero-SRE design (§ 2.5-§ 2.7). We conclude with learnability guarantees and a discussion of implications for advanced causal tasks like transportability (§ 2.8-§ 2.10).
2.1 Problem Formulation and Causal Operators
Let $(\Omega,\mathcal{F},P)$ be a probability space. We consider endogenous variables $\mathbf{V}$ as elements of the Hilbert space of square-integrable random variables, $\mathcal{X}:=L^{2}(\Omega,\mathbb{R}^{d})$ . Unless otherwise specified, all vector norms $\|·\|$ in the subsequent analysis refer to the standard Euclidean ( $L_{2}$ ) norm.
**Definition 1 (Functional SCM Operator)**
*A Structural Causal Model is defined by a set of unknown, true functional operators $\{\mathbf{F}_{i}\}_{i=1}^{d}$ , where each $\mathbf{F}_{i}:\mathcal{X}^{\text{pa}_{i}}\times\mathcal{U}_{i}\to\mathcal{X}_{i}$ is a map such that $V_{i}:=\mathbf{F}_{i}(\mathbf{Pa}_{i},U_{i})$ , with $\mathbf{Pa}_{i}$ being the set of parent random variables and $U_{i}$ an exogenous noise variable. We establish the convention that the corresponding lowercase bold letter, $\mathbf{pa}_{i}$ , denotes a specific vector of observed values for these parents.*
Our goal is to learn a model parameterized by $\theta$ that approximates this SCM. Our model consists of a pair of conditional operators for each variable $V_{i}$ :
1. A decoder (generative) operator $\mathbf{H}_{\theta}:\mathcal{U}\times\mathcal{X}^{p}\to\mathcal{X}$ , which aims to approximate $\mathbf{F}$ .
1. An encoder (inference) operator $\mathbf{T}_{\theta}:\mathcal{X}\times\mathcal{X}^{p}\to\mathcal{U}$ , which aims to perform abduction by inferring the latent noise.
These operators are realized by solving the probability flow ODE (Appendix A). The decoder $\mathbf{H}_{\theta}$ solves the ODE from $t=T$ to $t=0$ , while the encoder $\mathbf{T}_{\theta}$ solves it from $t=0$ to $t=T$ . Our BELM sampler is a high-fidelity numerical solver designed such that these forward and backward operations are exact algebraic inverses.
2.2 Identifiability and Exact Counterfactual Generation
We adapt principles from identifiable generative modeling (chao2023interventional) to formalize the conditions for exact counterfactuals. This requires assuming the SCM is invertible with respect to its noise term, a condition discussed in Section 2.11.
**Theorem 2 (Identifiability via Statistical Independence)**
*Given an SCM operator $X:=\mathbf{F}(\mathbf{Pa},U)$ where $U\perp\!\!\!\perp\mathbf{Pa}$ and $\mathbf{F}$ is invertible w.r.t. $U$ . If a learned encoder $\mathbf{T}_{\theta}$ (with sufficient capacity) yields a latent representation $Z=\mathbf{T}_{\theta}(X,\mathbf{Pa})$ that is statistically independent of the parents $\mathbf{Pa}$ , then $Z$ is an isomorphic representation of the exogenous noise $U$ .*
2.3 Geometric Inductive Bias for Identifiability
The score-matching objective's geometric inductive biases strengthen our identifiability argument. We leverage the principle of implicit regularization, where optimizers favor "simpler" functions (hochreiter1997flat; neyshabur2018pac). We adopt the principle of simplicity bias, a cornerstone of modern deep learning theory that, while empirically supported, remains an active and not yet universally proven area of research. Our conclusions are conditioned on its validity, as discussed further in Section 2.11. This suggests the model learns the most parsimonious geometric transformation required to explain the data.
Considering the local geometry of the data density $p(\mathbf{x})$ provides powerful intuition. In a local region $\mathcal{R}$ , if the data is isotropic (spherically symmetric), the simplest score function is a radial vector field, yielding a conformal map. If the structure is simply anisotropic (e.g., ellipsoidal), the model is biased towards learning a local affine map. This refines the notion of a purely conformal bias and leads to the following proposition.
**Proposition 3 (Implicit Bias towards Simple Geometric Maps)**
*Assume (A1) the true data density $p(\mathbf{x})$ is smooth ( $C^{2}$ ) and (A2) the optimization process has a simplicity bias (e.g., favoring low-complexity solutions, see Appendix H).
1. If there exists a local region $\mathcal{R}$ where $p(\mathbf{x})$ is isotropic, the optimal learned score function is a radial vector field, and the flow map it generates is a conformal map on $\mathcal{R}$ .
1. If we relax the condition to a local region $\mathcal{R}$ where $p(\mathbf{x})$ has an ellipsoidal structure, the optimal learned score function is normal to the ellipsoidal iso-contours, and the flow map it generates is a local affine transformation on $\mathcal{R}$ .*
The formal argument is detailed in Appendix H. This proposition is significant: it suggests that the model defaults to learning the most parsimonious, well-behaved, and locally invertible map that can explain the data's geometry. This bias is crucial for the abduction step, as it prevents the pathological distortions that would corrupt the inferred causal noise $U$ .
**Theorem 4 (Operator Isomorphism Guarantees Exact Counterfactuals)**
*Let the conditions of Theorem 2 hold. If the learned operator pair $(\mathbf{T}_{\theta},\mathbf{H}_{\theta})$ constitutes a conditional isomorphism (i.e., $\mathbf{H}_{\theta}(\mathbf{T}_{\theta}(\cdot,\mathbf{pa}),\mathbf{pa})=\mathbf{I}$ , the identity operator), then the model's prediction under an intervention $do(\mathbf{Pa}:=\boldsymbol{\alpha})$ is exact.*
Proof A full proof, covering cases for different dimensions of the exogenous noise variable, is provided in Appendix B.
2.4 Analysis of Inversion Fidelity
We now formally analyze the inversion error. We prove that standard approximate schemes produce a non-zero SRE (Proposition 5), whereas our chosen sampler eliminates it by construction (Proposition 6).
**Proposition 5 (Structural Error of Approximate Inversion)**
*Let $\mathbf{T}_{\text{DDIM}}$ be the operator for one step of DDIM inversion from $\mathbf{x}_{t}$ to $\mathbf{x}_{t+1}$ , and $\mathbf{H}_{\text{DDIM}}$ be the generative step operator from $\mathbf{x}_{t+1}$ to $\mathbf{x}_{t}$ . The single-step reconstruction error is non-zero and of second order in the time step $\Delta t$ :
$$
(\mathbf{H}_{\text{DDIM}}\circ\mathbf{T}_{\text{DDIM}})(\mathbf{x}_{t})-\mathbf{x}_{t}=\mathcal{O}((\Delta t)^{2})
$$
This error accumulates over the full trajectory, leading to a non-zero Structural Reconstruction Error.*
Proof See Appendix C for a rigorous proof.
**Proposition 6 (Analytical Invertibility of the Sampler)**
*Let $\mathbf{T}_{\text{BELM}}$ and $\mathbf{H}_{\text{BELM}}$ be the operators corresponding to the full-trajectory BELM sampler for inference and generation, respectively. For a fixed noise prediction network $\epsilon_{\theta}$ , the operators are exact algebraic inverses:
$$
\mathbf{H}_{\text{BELM}}\circ\mathbf{T}_{\text{BELM}}=\mathbf{I}
$$*
Proof The proof follows from the algebraic construction of the BELM update rules, as detailed in Appendix C.
2.5 Error Decomposition for Counterfactual Estimation
This brings us to our central theoretical result: an error decomposition theorem that rigorously partitions the total counterfactual error. This decomposition isolates the SRE and mathematically demonstrates why its elimination is critical.
**Definition 7 (Counterfactual Error Components)**
*We formally define the two primary sources of error in counterfactual estimation for the invertible case:
1. The Structural Reconstruction Error ( $E_{SR}$ ) measures the information loss from the modelâs abduction-action cycle on a given sample $X$ :
$$
E_{SR}(X):=\|(\mathbf{H}_{\theta}\circ\mathbf{T}_{\theta}-\mathbf{I})X\|^{2}
$$
1. The Latent Space Invariance Error ( $E_{LSI}$ ) measures the failure of the learned latent space to remain invariant under interventions on parent variables:
$$
E_{LSI}:=\|\mathbf{T}_{\theta}(X,\mathbf{Pa})-\mathbf{T}_{\theta}(X_{\boldsymbol{\alpha}}^{\text{true}},\boldsymbol{\alpha})\|^{2}
$$*
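Definition 7's first component can be instantiated directly. The sketch below uses two hypothetical scalar operator pairs (the maps `T_*`/`H_*` are illustrative toys, not learned models): one pair is an exact algebraic inverse, scaled by a power of two so the floating-point round trip is bit-exact, while the other is slightly mismatched, and only the latter incurs a positive $E_{SR}$:

```python
import numpy as np

v = np.linspace(-2.0, 2.0, 101)   # toy observed samples

# An exactly invertible pair (power-of-two scaling is bit-exact in floats):
T_exact = lambda x: x * 0.5                       # encoder
H_exact = lambda u: u * 2.0                       # its exact algebraic inverse

# An imperfect pair: the decoder is only an approximate inverse:
T_lossy = lambda x: x * 0.5
H_lossy = lambda u: u * 2.0 + 0.05 * u ** 2

def e_sr(H, T, x):
    # Definition 7: E_SR(X) = ||(H . T - I) X||^2, averaged over samples
    return float(np.mean((H(T(x)) - x) ** 2))

e_zero = e_sr(H_exact, T_exact, v)   # identically zero
e_pos = e_sr(H_lossy, T_lossy, v)    # strictly positive information loss
```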
**Theorem 8 (Counterfactual Error Bound)**
*Let a model be defined by $(\mathbf{T}_{\theta},\mathbf{H}_{\theta})$ and the true SCM by $\mathbf{F}$ . Assume the decoder $\mathbf{H}_{\theta}$ is $L_{\mathcal{H}}$ -Lipschitz. The expected squared error of the modelâs counterfactual prediction $\hat{X}_{\boldsymbol{\alpha}}$ is bounded by the expectation of the two error components:
$$
\mathbb{E}\left[\|\hat{X}_{\boldsymbol{\alpha}}-X_{\boldsymbol{\alpha}}^{\text{true}}\|^{2}\right]\leq 2\mathbb{E}\left[E_{SR}(X_{\boldsymbol{\alpha}}^{\text{true}})\right]+2L_{\mathcal{H}}^{2}\mathbb{E}\left[E_{LSI}\right]
$$*
Proof The proof is in Appendix D.
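The structure of the bound can be sketched by inserting the model's reconstruction of the true counterfactual, $\mathbf{H}_{\theta}(\mathbf{T}_{\theta}(X_{\boldsymbol{\alpha}}^{\text{true}},\boldsymbol{\alpha}),\boldsymbol{\alpha})$ , as a pivot term (a sketch only; the full argument is in Appendix D):

$$
\hat{X}_{\boldsymbol{\alpha}}-X_{\boldsymbol{\alpha}}^{\text{true}}=\underbrace{\mathbf{H}_{\theta}\big(\mathbf{T}_{\theta}(X,\mathbf{Pa}),\boldsymbol{\alpha}\big)-\mathbf{H}_{\theta}\big(\mathbf{T}_{\theta}(X_{\boldsymbol{\alpha}}^{\text{true}},\boldsymbol{\alpha}),\boldsymbol{\alpha}\big)}_{\|\cdot\|\leq L_{\mathcal{H}}\sqrt{E_{LSI}}}+\underbrace{\mathbf{H}_{\theta}\big(\mathbf{T}_{\theta}(X_{\boldsymbol{\alpha}}^{\text{true}},\boldsymbol{\alpha}),\boldsymbol{\alpha}\big)-X_{\boldsymbol{\alpha}}^{\text{true}}}_{\|\cdot\|=\sqrt{E_{SR}(X_{\boldsymbol{\alpha}}^{\text{true}})}}
$$

The first difference is controlled by the decoder's Lipschitz constant, the second is exactly the reconstruction residual, and applying $\|a+b\|^{2}\leq 2\|a\|^{2}+2\|b\|^{2}$ followed by taking expectations yields the stated bound.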
**Remark 9 (Elimination of Structural Error)**
*By Proposition 6, the Structural Reconstruction Error for BELM-MDCM is identically zero. This is the central theoretical advantage of our framework. It disentangles the error sources, allowing us to isolate the entire modeling challenge to learning a high-quality score function ( $\epsilon_{\theta}$ ) without the confounding factor of an imperfect inversion algorithm. Any remaining error is now purely a function of statistical estimation, not a structural bias of the model itself.*
**Proposition 10 (Bound on Latent Space Invariance Error)**
*We assume the learned score network, $\boldsymbol{\epsilon}_{\theta}$ , is Lipschitz continuous, ensuring the existence and uniqueness of the probability flow ODE solution via the Picard-Lindelöf theorem. Under standard integrability conditions (Fubini's theorem), the Latent Space Invariance Error is bounded by the expected score-matching loss:
$$
\mathbb{E}\left[E_{LSI}\right]\leq C^{\prime}\cdot\mathbb{E}\left[\|\boldsymbol{\epsilon}_{\theta}-\boldsymbol{\epsilon}^{*}\|^{2}\right]
$$
for some constant $C^{\prime}$ , where $\boldsymbol{\epsilon}^{*}$ is the true score function.*
Proof The proof is in Appendix D. This proposition formally establishes that by eliminating structural error, the causal fidelity of BELM-MDCM is directly and provably controlled by its ability to accurately learn the dataâs score function.
2.6 Decomposing Error: A Motivation for Empirical Validation
The error decomposition in Theorem 8 provides a clear strategy for empirical validation by isolating two distinct error sources: the Structural Reconstruction Error ( $E_{SR}$ ) and the Latent Space Invariance Error. While developing a single score combining these is future work, these components directly motivate our empirical investigations. Our ablation study (Section 5.4.2) is designed to measure the impact of a non-zero $E_{SR}$ , while our stress-test (Section 5.4.1) probes robustness when latent space invariance is challenged by a non-invertible SCM.
2.7 Theoretical Roles of Targeted Modeling and Hybrid Training
With algorithmic error eliminated by our Zero-SRE design, the challenge becomes minimizing the modeling error ( $E_{LSI}$ ). Our two methodological innovations, Targeted Modeling and Hybrid Training, are principled strategies for this purpose.
Targeted Modeling as Formal Complexity Control.
Our Targeted Modeling strategy acts as a form of structural regularization. The finite sample bound in Theorem 15 is governed by the Rademacher complexity $\mathfrak{R}_{n}(\mathcal{F}_{\Theta})$ of the entire SCM's hypothesis space. By assigning low-complexity models to a subset of nodes, we directly constrain the overall complexity.
**Remark 11 (Effect on Generalization Bound)**
*Our Targeted Modeling strategy is formally justified as a complexity control mechanism. The Rademacher complexity of a composite SCM is bounded by the sum of the complexities of its individual mechanisms (mohri2018foundations). By strategically substituting a high-complexity diffusion model $\mathcal{F}_{\text{diff}}$ with a lower-complexity alternative $\mathcal{F}_{\text{simple}}$ for non-critical nodes, Targeted Modeling directly minimizes this upper bound. This leads to a tighter generalization bound and improves the statistical efficiency of the overall SCM.*
Hybrid Training as a Weighted Score-Matching Objective.
The Hybrid Training Objective, $L_{\text{total}}=L_{\text{diffusion}}+\lambda\cdot L_{\text{task}}$ , imparts a crucial inductive bias for learning a causally salient score function. The task-specific loss acts as a conductor's baton, forcing the model to prioritize learning an accurate score function in regions of the data manifold most critical to the causal question. We formalize this by proposing that the auxiliary loss implicitly implements a weighted score-matching objective.
**Proposition 12 (Hybrid Objective as a Weighted Score-Matching Regularizer)**
*The auxiliary task loss $L_{\text{task}}$ provides a lower bound for the modelâs error, weighted by a function reflecting the causal salience of the data manifold. Minimizing the hybrid objective $L_{\text{total}}$ is thereby equivalent to solving a weighted score-matching problem that prioritizes accuracy in causally salient regions, leading to a smaller effective Latent Space Invariance Error. (A rigorous proof is provided in Appendix E.)*
This proposition formally grounds our hybrid training strategy, revealing that the task-specific loss intelligently forces the diffusion model to prioritize accuracy in regions of the data manifold most critical to the causal question. This reinforces the CIC principle by avoiding information loss where it matters most, effectively implementing the simplicity bias principle from Section 2.3.
We can deepen this insight by analyzing its information-theoretic implications.
**Proposition 13 (Disentanglement via Hybrid Objective)**
*Information-theoretically, the hybrid objective provides a strong inductive bias towards learning a disentangled latent representation. It encourages a "division of labor" where the task-specific component explains variance from the parents $\mathbf{Pa}$ , while the diffusion component's latent code $Z=\mathbf{T}_{\theta}(V,\mathbf{Pa})$ models the residual information. This implicitly pushes $Z$ towards being independent of $\mathbf{Pa}$ , a crucial step towards satisfying the identifiability conditions.*
Proof A detailed information-theoretic argument is provided in Appendix E.
2.8 BELM-MDCM as a Unifying Framework
The principle of Causal Information Conservation also unifies our framework with classical models. Simpler models like Additive Noise Models (ANMs) can be seen as special cases where this principle is met trivially, positioning our work as a generalization of established causal principles. For instance, in a classic ANM (hoyer2009nonlinear), $V_{i}=f_{i}(\mathbf{Pa}_{i})+U_{i}$ , the noise is recovered by a direct, lossless inversion: $U_{i}=V_{i}-f_{i}(\mathbf{Pa}_{i})$ . Our framework generalizes this principle to arbitrarily complex, non-additive mechanisms, offering a flexible, non-parametric extension to classical structural equation models (wooldridge2010econometric). The importance of noise distributions, particularly non-Gaussianity, for identifiability in linear models is also a well-established principle (shimizu2006linear).
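This special case can be sketched numerically (the mechanism `f`, the sample size, and the intervention value are illustrative): in an ANM, abduction is a subtraction, so the recovered noise and the resulting counterfactual are exact up to floating-point round-off:

```python
import numpy as np

rng = np.random.default_rng(0)
pa = rng.normal(size=1000)       # observed parent values
u = rng.normal(size=1000)        # exogenous noise (unobservable in practice)

def f(p):
    # an arbitrary nonlinear structural mechanism (illustrative)
    return np.sin(p) + 0.5 * p ** 2

v = f(pa) + u                    # ANM: V = f(Pa) + U

u_hat = v - f(pa)                # abduction: direct, lossless inversion
alpha = 1.5
v_cf = f(alpha) + u_hat          # action + prediction: do(Pa := alpha)

abd_err = float(np.max(np.abs(u_hat - u)))             # ~0 (round-off only)
cf_err = float(np.max(np.abs(v_cf - (f(alpha) + u))))  # exact counterfactual
```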
2.9 Learnability and Statistical Guarantees
We now provide finite-sample learnability guarantees for our SCM framework.
**Proposition 14 (Asymptotic Consistency)**
*Under standard regularity conditions, as the number of data samples $n\to\infty$ and model capacity $N\to\infty$ , the learned operators $(\hat{\mathbf{T}}_{n},\hat{\mathbf{H}}_{n})$ are consistent estimators of the ideal operators $(\mathbf{T}^{*},\mathbf{H}^{*})$ : $\hat{\mathbf{T}}_{n}\xrightarrow{p}\mathbf{T}^{*}$ and $\hat{\mathbf{H}}_{n}\xrightarrow{p}\mathbf{H}^{*}$ .*
**Theorem 15 (Finite Sample Bound for Causal Diffusion SCMs)**
*Let an SCM consist of $d$ endogenous nodes, with a causal graph having a maximum in-degree of $d_{in}^{max}$ . Assume each causal mechanism is implemented by a score network $\epsilon_{\theta}$ that is an $L$ -layer MLP with ReLU activations, and the spectral norm of each weight matrix is bounded by $B$ . Let the input space be appropriately normalized. Let the loss function be bounded by $M$ . Then, for the parameters $\hat{\theta}_{n}$ learned from $n$ samples, the excess risk is bounded with probability at least $1-\delta$ :
$$
R(\hat{\theta}_{n})-R(\theta^{*})\leq C\cdot\frac{d\cdot L\cdot B^{L}\cdot\sqrt{d_{in}^{max}+d_{embed}+1}}{\sqrt{n}}+M\sqrt{\frac{\log(1/\delta)}{2n}}
$$
where $C$ is a constant independent of the network architecture and sample size, and $d_{embed}$ is the dimension of the time embedding.*
Proof The proof, which combines the sub-additivity of Rademacher complexity over the SCM with standard bounds for deep neural networks (bartlett2017spectrally; neyshabur2018pac), is detailed in Appendix G.
**Remark 16 (Interpretation of the Bound)**
*This refined bound quantitatively links the generalization error to:
1. Causal Complexity ( $d\cdot\sqrt{d_{in}^{max}}$ ): The error scales with the number of causal mechanisms ( $d$ ) and the graph's complexity ( $d_{in}^{max}$ ), formalizing the intuition that more complex causal systems are harder to learn.
1. Network Complexity ( $L\cdot B^{L}$ ): The error scales with the depth and spectral norm of the score networks. This provides direct theoretical grounding for our Targeted Modeling strategy, as using simpler models tightens this generalization bound.*
2.10 Implications for Causal Transportability
Causal Information Conservation also provides a foundation for transportability: applying knowledge from a source domain $\mathcal{S}$ to a target domain $\mathcal{T}$ (pearl2014transportability). Transportability requires separating invariant causal knowledge from domain-specific mechanisms. By losslessly recovering the exogenous noise $U$ (the invariant "causal essence"), our framework achieves this separation by design; the decoders $\mathbf{H}_{\theta}$ represent the domain-specific mechanisms. This insight is formalized in the following theorem.
**Theorem 17 (Condition for Lossless Causal Transport)**
*Let a source domain $\mathcal{S}$ and a target domain $\mathcal{T}$ be described by SCMs $\mathcal{M}^{\mathcal{S}}$ and $\mathcal{M}^{\mathcal{T}}$ , respectively. Assume the following conditions hold:
1. Shared Structure: Both domains share the same causal graph $\mathcal{G}$ and the same exogenous noise distributions $\{p_{i}(U_{i})\}$ . The domains differ only in a subset of causal mechanisms $\mathcal{K}_{\text{changed}}$ .
2. Noise Independence: The exogenous noise variables $\{U_{i}\}_{i=1}^{d}$ are mutually independent.
3. Information Conservation: A model $(\mathbf{T}_{\theta},\mathbf{H}_{\theta})$ trained on data from $\mathcal{S}$ satisfies the Causal Information Conservation principle, achieving zero Structural Reconstruction Error.
Then, causal knowledge can be losslessly transported from $\mathcal{S}$ to $\mathcal{T}$ by re-learning only the operators $\{\mathbf{T}_{\theta_{k}},\mathbf{H}_{\theta_{k}}\}$ corresponding to the changed mechanisms $k\in\mathcal{K}_{\text{changed}}$ , while directly reusing all operators for invariant mechanisms.*
Proof The proof is provided in Appendix F.
2.11 Discussion of Assumptions
Our framework rests on several key assumptions, which we now critically examine.
Our geometric inductive bias argument (Proposition 3) rests on the principle of simplicity bias. While this principle is a cornerstone of modern deep learning theory with substantial empirical backing, it remains an active area of research and is not a universally proven theorem. Our conclusions are therefore conditioned on the validity of this powerful but conjectural assumption.
The cornerstone of our identifiability theory (Theorem 2) is the SCM's invertibility with respect to its noise term $U$ . This is a strong assumption; when violated (e.g., by a many-to-one function), the abduction task becomes ill-posed.
To address this foundational challenge, we provide an exhaustive theoretical treatment in Appendix C. There, we formalize the irreducible "representational error" and derive a tighter, more general error bound (Theorem 21). More importantly, we propose a concrete mitigation strategy: a novel prior-matching regularizer (Definition 23), theoretically shown to reduce the error by encouraging the learned encoder to approximate the ideal Maximum a Posteriori (MAP) solution (Proposition 24). This highlights a primary contribution: even in the challenging non-invertible case, BELM-MDCM's zero-SRE design eliminates the algorithmic error, thereby isolating the more fundamental representational challenge. Our stress test in Section 5.4.1 empirically confirms this advantage, and validating our regularizer provides a clear direction for future work.
Our identifiability proof is dimension-dependent, leveraging Liouville's theorem for $d\ge 3$ and requiring stronger assumptions like asymptotic linearity for the special case of $d=2$ . Other assumptions, such as Lipschitz continuity of the score network, are mild regularity conditions standard in deep generative model analysis and can be encouraged through architectural choices like spectral normalization.
3 Architectural Design and Training
The BELM-MDCM architecture embodies our core principles through a non-monolithic, theoretically-motivated design. Its central philosophy is Targeted Modeling: judiciously allocating the expressive power of our Zero-SRE CausalDiffusionModel to nodes of causal interest (e.g., Treatment T, Outcome Y), while using simpler, efficient mechanisms for confounders, as illustrated in Figure 1. This strategy provides practical complexity control, tightening the generalization bound as established in Theorem 15.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Causal Diagram: Targeted Modeling Principle
### Overview
The image presents a causal diagram illustrating the Targeted Modeling Principle. It shows relationships between variables W, X, T, and Y, with specific modeling approaches applied to different nodes. The diagram is accompanied by a text box explaining the principle.
### Components/Axes
* **Nodes:**
* W (top-left): Represented by a blue circle.
* X (bottom-left): Represented by a blue circle.
* T (center): Represented by a blue circle.
* Y (right): Represented by a blue circle.
* **Edges:** Arrows indicating causal relationships.
* W -> T
* X -> T
* T -> Y
* X -> Y
* **Annotations:**
* "Empirical Distribution": Associated with node W, enclosed in a rounded rectangle.
* "Additive Noise Model": Associated with node X, enclosed in a rounded rectangle.
* "CausalDiffusionModel (BELM-MDCM)": Associated with the edges X -> Y and T -> Y, enclosed in green rounded rectangles.
* **Text Box (bottom):** Contains the "Targeted Modeling Principle" description.
### Detailed Analysis or ### Content Details
**Diagram Structure:**
* Node W (Empirical Distribution) points to Node T.
* Node X (Additive Noise Model) points to Node T and Node Y.
* Node T points to Node Y.
* The edges T -> Y and X -> Y are annotated with "CausalDiffusionModel (BELM-MDCM)".
**Text Box Content:**
"Targeted Modeling Principle:
The expressive power of the CausalDiffusionModel is judiciously allocated to key causal nodes (Treatment T, Outcome Y) for high-fidelity counterfactual generation.
Simpler, efficient mechanisms (e.g., ANM, Empirical Distribution) are used for confounder nodes (W, X) to ensure stability and efficiency."
### Key Observations
* The CausalDiffusionModel (BELM-MDCM) is specifically applied to model the relationships leading to the outcome variable Y.
* Simpler models (Empirical Distribution, Additive Noise Model) are used for the confounder nodes W and X.
* The diagram visually represents the principle of allocating complex models to key causal nodes and simpler models to confounder nodes.
### Interpretation
The diagram and text describe a targeted modeling approach where the complexity of the model is tailored to the importance of the causal node. The CausalDiffusionModel, presumably a more complex and expressive model, is used to model the treatment (T) and outcome (Y) variables, which are crucial for counterfactual generation. Simpler models like Empirical Distribution and Additive Noise Model are used for the confounder variables (W and X) to ensure stability and efficiency. This suggests a strategy to balance model accuracy and computational cost by focusing the modeling effort on the most critical parts of the causal structure.
</details>
Figure 1: Illustration of the Targeted Modeling Principle. The expressive CausalDiffusionModel is judiciously allocated to key causal nodes (Treatment T, Outcome Y) for high-fidelity counterfactual generation. Simpler, efficient mechanisms (e.g., ANM, Empirical Distribution) are used for confounder nodes (W, X) to ensure stability and efficiency.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Diagram: BELM-MDCM Module Workflow
### Overview
The image is a diagram illustrating the workflow of the BELM-MDCM module. It shows the steps involved in processing numerical and categorical input data, embedding, training, post-processing, and ultimately, causal identification.
### Components/Axes
The diagram is structured horizontally, showing the flow of data from left to right through several stages:
* **Input Data:**
* `Xnum`: Numerical input data, with example values 50, 257, and -3.0.
* `Xcat1`: Categorical input 1, with example value "M".
* `Xcat2`: Categorical input 2, with example value "woman".
* `Xcat3`: Categorical input 3, with an example value of a blue triangle.
* **Pre-Processing:**
* `StandardScaler`: A yellow box representing the standardization of numerical data.
* `OneHotEncoder`: A peach-colored box representing the one-hot encoding of categorical data.
* **Embedding:**
* `Xnum ⊕ Xcat Connection`: A gray box representing the connection of numerical and categorical data.
* `Timestep embedding`: A green box representing timestep embedding.
* **Train:**
* `BELM-MDCM module`: A light blue box representing the core module.
* `Noisy Target Variable`: A light blue box representing the noisy target variable.
* **Post-Processing:**
* `Inverse Transformation`: A white box representing the inverse transformation.
* **Results:**
* `Causal Identification`: A pink box representing the final causal identification step.
Below each stage, there is a label indicating the process: "Pre-Processing", "Embedding", "Train", "Post-Processing", and "Results".
At the top, there is a label "Select" above the Pre-Processing stage.
### Detailed Analysis
* **Numerical Data (`Xnum`):** The numerical data consists of three example values: 50, 257, and -3.0. These values are fed into the `StandardScaler` for normalization.
* **Categorical Data (`Xcat1`, `Xcat2`, `Xcat3`):** The categorical data consists of three categories with example values: "M", "woman", and a blue triangle. These values are fed into the `OneHotEncoder` to convert them into a numerical format suitable for the model.
* **Data Connection (`Xnum ⊕ Xcat Connection`):** The standardized numerical data and the one-hot encoded categorical data are combined in the "Connection" step.
* **BELM-MDCM Module:** The combined data, along with the timestep embedding and noisy target variable, are fed into the `BELM-MDCM module` for training.
* **Inverse Transformation:** After training, an inverse transformation is applied.
* **Causal Identification:** Finally, the model performs causal identification to produce the results.
### Key Observations
* The diagram illustrates a clear flow of data from input to output.
* The `BELM-MDCM module` is the central component of the workflow.
* Pre-processing steps are crucial for preparing the data for the model.
### Interpretation
The diagram provides a high-level overview of the `BELM-MDCM module` workflow. It shows how numerical and categorical data are pre-processed, combined, and used to train the model for causal identification. The diagram highlights the importance of each stage in the process, from data input to final results. The use of `StandardScaler` and `OneHotEncoder` suggests that the model can handle both numerical and categorical data types. The inclusion of "Timestep embedding" indicates that the model is likely designed to handle time-series data or sequential information. The "Noisy Target Variable" suggests that the model is robust to noise in the target variable.
</details>
Figure 2: The detailed internal architecture of the CausalDiffusionModel. This diagram illustrates the end-to-end workflow of the causal mechanism designed for key nodes like Treatment T and Outcome Y, detailing the pre-processing, embedding, training, and post-processing stages.
The internal architecture of the CausalDiffusionModel itself, depicted in Figure 2, is engineered to learn the complex, non-linear mapping $v_{i}:=f_{i}(\mathbf{pa}_{i},u_{i})$ with high fidelity.
3.1 Mechanism for Exogenous Nodes
Exogenous nodes (without parents in the causal graph $\mathcal{G}$ ) are modeled non-parametrically via the Empirical Distribution of the observed data. This approach avoids distributional assumptions and provides a robust foundation for the Structural Causal Model (SCM).
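Concretely, this empirical-distribution mechanism amounts to bootstrap resampling of the node's observed values. A minimal sketch (the function name and interface are illustrative, not part of the framework's API):

```python
import numpy as np

def sample_exogenous(observed, n, rng=None):
    """Non-parametric mechanism for a parentless node: draw n samples
    from the empirical distribution of its observed values
    (i.e., resampling with replacement)."""
    rng = np.random.default_rng(rng)
    return rng.choice(np.asarray(observed), size=n, replace=True)
```

Because no parametric family is fitted, the mechanism can never place mass outside the observed support, which is the robustness property noted above.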
3.2 Mechanism for Endogenous Nodes: The CausalDiffusionModel
For endogenous nodes $V_{i}$ , particularly those central to the causal query (treatment, outcome, key mediators), we employ our bespoke CausalDiffusionModel to learn the functional mapping $v_{i}:=f_{i}(\mathbf{pa}_{i},u_{i})$ .
3.2.1 Conditioning via Parent Node Transformation
The denoising process is conditioned on the parent nodes $\mathbf{pa}_{i}$ , which are transformed into a fixed-dimensional conditioning vector $\mathbf{c}\in\mathbb{R}^{d_{c}}$ . A ColumnTransformer handles heterogeneous data types: continuous parents are standardized (StandardScaler) to unify scales, while categorical parents are one-hot encoded (OneHotEncoder) to prevent artificial ordinality. The resulting vectors are concatenated into $\mathbf{c}$ , which remains constant for a given sample's diffusion trajectory.
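The transformation above can be sketched as follows. For self-containment we mirror the behavior of scikit-learn's StandardScaler and OneHotEncoder in plain NumPy; the function name, the single-categorical-parent interface, and fitting on the same batch are illustrative simplifications:

```python
import numpy as np

def build_conditioning_vector(cont, cat, n_classes):
    """Build the fixed-dimensional conditioning vector c from parent values.

    cont: (n, p) array of continuous parents; cat: (n,) integer codes of a
    categorical parent; n_classes: its cardinality.
    """
    # Standardize continuous parents to zero mean, unit variance (StandardScaler).
    mu, sigma = cont.mean(axis=0), cont.std(axis=0)
    cont_std = (cont - mu) / np.where(sigma == 0, 1.0, sigma)
    # One-hot encode the categorical parent (OneHotEncoder) to avoid
    # imposing an artificial ordering on the categories.
    onehot = np.eye(n_classes)[cat]
    # Concatenate into c; c is held constant along a sample's diffusion trajectory.
    return np.concatenate([cont_std, onehot], axis=1)
```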
3.2.2 The Denoising Process
The core of the CausalDiffusionModel is a denoising network $\epsilon_{\theta}(v_{t},t,\mathbf{c})$ , implemented as a Residual MLP (he2016deep). It takes as input the noisy variable $v_{t}$ , a sinusoidal Time Embedding of timestep $t$ , and the conditioning vector $\mathbf{c}$ . Before the diffusion process, the target variable $V_{i}$ is also preprocessed (standardized for continuous values or label-encoded for categorical ones). The denoising process is driven by the BELM sampler, ensuring a mathematically exact and stable inversion path as established in Section 2.
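The sinusoidal Time Embedding can be sketched with the standard transformer-style construction; the embedding dimension and the `max_period` constant are conventional choices assumed here, not values specified in the text:

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of timestep t, fed to eps_theta(v_t, t, c).

    Half of the channels are sines and half cosines, at geometrically
    spaced frequencies from 1 down to 1/max_period.
    """
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = np.asarray(t, dtype=float)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=1)  # shape (n, dim)
```

The resulting vector is concatenated with $v_{t}$ and $\mathbf{c}$ at the input of the Residual MLP.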
3.2.3 Hybrid Training Objective
We introduce a Hybrid Training Objective to reconcile generative fidelity with predictive accuracy. As established in our theoretical analysis (Proposition 12), this is more than a standard multi-task learning scheme; it acts as a powerful inductive bias, creating a weighted score-matching objective that prioritizes accuracy in causally salient regions of the data manifold. The total loss is a linearly weighted combination:
$$
L_{\text{total}}=L_{\text{diffusion}}+\lambda\cdot L_{\text{task}} \tag{5}
$$
where $L_{\text{diffusion}}$ is the noise prediction error (Equation 1). The auxiliary loss $L_{\text{task}}$ is a Mean Squared Error for continuous nodes ( $L_{\text{regression}}$ ) and a Cross-Entropy loss for discrete nodes ( $L_{\text{classification}}$ ).
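A minimal sketch of Equation 5 for a continuous node, where $L_{\text{task}}$ is the regression MSE; the array-based interface and the default value of $\lambda$ are illustrative assumptions (the paper's tuned $\lambda$ is not assumed here):

```python
import numpy as np

def hybrid_loss(eps_pred, eps_true, y_pred, y_true, lam=0.5):
    """Hybrid objective of Equation 5: L_total = L_diffusion + lambda * L_task.

    L_diffusion is the noise-prediction MSE (Equation 1); L_task here is the
    regression variant used for continuous nodes (cross-entropy would
    replace it for discrete nodes).
    """
    l_diffusion = np.mean((eps_pred - eps_true) ** 2)
    l_task = np.mean((y_pred - y_true) ** 2)
    return l_diffusion + lam * l_task
```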
3.2.4 Decoding and Counterfactual Generation
For generation, the BELM sampler produces an output in the normalized space. This is then mapped back to the original data domain using the inverse transformations of the pre-fitted preprocessors (StandardScaler for continuous, LabelEncoder for categorical). For categorical outputs, the continuous value is rounded and clipped to the valid class range before the inverse mapping, ensuring that generated (counterfactual) data is interpretable and resides in the correct space.
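The categorical decoding step (round, clip, inverse label mapping) can be sketched as below; the interface is illustrative, with `classes` playing the role of a fitted LabelEncoder's `classes_` array:

```python
import numpy as np

def decode_categorical(v_norm, classes):
    """Map a continuous sampler output back to a valid category label.

    Round to the nearest integer code, clip to the valid class range,
    then apply the inverse label mapping.
    """
    idx = np.clip(np.rint(np.asarray(v_norm)).astype(int), 0, len(classes) - 1)
    return np.asarray(classes)[idx]
```

Clipping before the inverse mapping is what guarantees the generated counterfactual always lands on a legal category.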
4 New Evaluation Metrics for Generative Causal Models
The principle of Causal Information Conservation demands new evaluation dimensions that traditional metrics like ATE and PEHE cannot capture. An accurate ATE score, for instance, could arise from a model with high SRE where individual errors fortuitously cancel out at the population level. To move beyond mere outcome accuracy and directly assess a model's adherence to our foundational principle, we propose a new, theoretically-grounded evaluation framework.
4.1 Causal Information Conservation Score (CIC-Score)
The Causal Information Conservation Score (CIC-Score) is a direct empirical diagnostic for the Structural Reconstruction Error. It quantifies a frameworkâs adherence to the CIC principle by disentangling algorithmic information loss (from an imperfect inversion process) from modeling error (from the statistical challenge of learning the true causal mechanism). We define the score, bounded in $[0,1]$ , using an exponential formulation:
$$
\text{CIC-Score}=\exp\left(-\left(\delta_{U}+\delta_{\text{SRE}}\right)\right)
$$
The error components are designed to isolate distinct failure modes:
- $\delta_{U}$ , the Relative Noise Recovery Error, quantifies the modeling error. It measures how well the trained network approximates the true score function, reflected in the fidelity of the recovered noise $\hat{U}$ versus the ground-truth $U_{\text{true}}$ :
$$
\delta_{U}=\frac{\mathbb{E}[\|\hat{U}_{\text{scaled}}-U_{\text{true, scaled}}\|^{2}]}{\mathbb{E}[\|U_{\text{true, scaled}}\|^{2}]}
$$
This term captures all inaccuracies from finite data and imperfect optimization.
- $\delta_{\text{SRE}}$ , the Normalized Structural Error, exclusively quantifies the algorithmic error inherent to the inversion process itself. Its definition is model-dependent to allow for fair comparisons:
- For frameworks with analytical invertibility (e.g., our BELM-MDCM, ANMs), the algorithm introduces no information loss, so we set $\delta_{\text{SRE}}\equiv 0$ by construction. Any observed reconstruction error is a symptom of modeling error, already captured by $\delta_{U}$ .
- For frameworks relying on approximate inversion (e.g., DDIM), $\delta_{\text{SRE}}$ is empirically measured to quantify this inherent algorithmic flaw:
$$
\delta_{\text{SRE}}=\frac{\mathbb{E}[\|(\mathbf{H}_{\theta}\circ\mathbf{T}_{\theta}-\mathbf{I})X\|^{2}]}{\mathbb{E}[\|X\|^{2}]}
$$
This principled decomposition allows the CIC-Score to fairly assess different frameworks by isolating structural design advantages from the universal challenge of model training.
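Putting the two components together, a minimal sketch of the CIC-Score follows; the noise arrays are assumed pre-scaled as in the definition of $\delta_{U}$ , and the interface is illustrative:

```python
import numpy as np

def cic_score(u_hat, u_true, x_recon=None, x=None):
    """CIC-Score = exp(-(delta_U + delta_SRE)).

    delta_U: relative noise-recovery (modeling) error between recovered
    noise u_hat and ground truth u_true.
    delta_SRE: 0 by construction for analytically invertible frameworks;
    for approximate-inversion frameworks, pass the round-trip
    reconstruction x_recon = (H_theta o T_theta)(x) to measure it.
    """
    delta_u = np.mean((u_hat - u_true) ** 2) / np.mean(u_true ** 2)
    if x_recon is None:
        delta_sre = 0.0  # analytical invertibility: no algorithmic loss
    else:
        delta_sre = np.mean((x_recon - x) ** 2) / np.mean(x ** 2)
    return np.exp(-(delta_u + delta_sre))
```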
4.2 Causal Mechanism Fidelity Score (CMF-Score)
A generative causal model's core promise is to learn true causal mechanisms, not just outcomes. Naïve metrics like pairwise correlations fail to capture the non-linear, multi-variable, and directional nature of causality. We therefore propose the Causal Mechanism Fidelity (CMF) score, a hierarchical framework with two levels of increasing rigor.
4.2.1 Level 1 (Pragmatic): The Conditional Mutual Information Score (CMI-Score)
The Conditional Mutual Information (CMI), $I(V_{i};V_{j}|\mathbf{Pa}_{j}\setminus\{V_{i}\})$ , is a non-parametric, non-linear measure of the direct influence a parent $V_{i}$ has on its child $V_{j}$ after accounting for all other parents. The CMI-Score evaluates whether this influence is preserved. For a single mechanism $V_{j}$ , it is the average consistency across all parent-child edges:
$$
\text{CMI-Score}(V_{j})=\frac{1}{|\mathbf{Pa}_{j}|}\sum_{V_{i}\in\mathbf{Pa}_{j}}\left(1-\frac{\left|I_{\text{obs}}(V_{i};V_{j}|\cdot)-I_{\text{cf}}(V_{i};V^{\prime}_{j}|\cdot)\right|}{I_{\text{obs}}(V_{i};V_{j}|\cdot)+\epsilon}\right)
$$
where $I_{\text{obs}}$ and $I_{\text{cf}}$ are the CMI values from observational and counterfactual data, respectively. The final CMI-Score is the average over all SCM mechanisms.
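Given CMI estimates for each parent edge (the CMI estimator itself is a separate non-parametric component and is not sketched here), the per-mechanism score reduces to a short aggregation; the interface is illustrative:

```python
import numpy as np

def cmi_score(i_obs, i_cf, eps=1e-8):
    """CMI-Score for one mechanism V_j, averaged over its parent edges.

    i_obs, i_cf: arrays of conditional mutual information values
    I(V_i; V_j | Pa_j \\ {V_i}) estimated on observational and
    counterfactual data, one entry per parent V_i.
    """
    i_obs, i_cf = np.asarray(i_obs, float), np.asarray(i_cf, float)
    # Per-edge consistency: 1 minus the relative CMI discrepancy.
    return np.mean(1.0 - np.abs(i_obs - i_cf) / (i_obs + eps))
```

The final CMI-Score is then the average of this quantity over all SCM mechanisms.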
4.2.2 Level 2 (Gold Standard): The Kernelized Mechanism Discrepancy (KMD) Score
To rigorously compare entire conditional distributions, we use the Maximum Mean Discrepancy (MMD) (gretton2012kernel), a kernel-based statistical test for distributional equality. The KMD-Score applies this test to the conditional distributions $p(V_{j}|\mathbf{Pa}_{j})$ that define each causal mechanism, measuring the discrepancy between the learned and observed conditionals. The final score is mapped to a similarity measure in $[0,1]$ :
$$
\text{KMD-Score}=\exp(-\gamma\cdot\mathbb{E}_{\mathbf{pa}_{j}\sim p(\mathbf{Pa}_{j})}[\text{MMD}(p(V_{j}|\mathbf{pa}_{j}),p_{\theta}(V_{j}|\mathbf{pa}_{j}))])
$$
where $\gamma$ is a scaling parameter. A score of 1 indicates that the learned conditional mechanism is statistically indistinguishable from the observed one.
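A minimal sketch for one-dimensional conditional slices, using the biased V-statistic estimator of squared MMD with an RBF kernel; the bandwidth and $\gamma$ are illustrative free parameters:

```python
import numpy as np

def mmd_rbf(x, y, bandwidth=1.0):
    """Squared MMD (biased V-statistic) between 1-D samples x and y,
    with a Gaussian RBF kernel."""
    def k(a, b):
        d2 = (a[:, None] - b[None, :]) ** 2
        return np.exp(-d2 / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def kmd_score(mmd_values, gamma=1.0):
    """KMD-Score = exp(-gamma * mean MMD), where mmd_values holds one MMD
    per conditioning configuration pa_j sampled from p(Pa_j)."""
    return np.exp(-gamma * np.mean(mmd_values))
```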
Complementary and Validated Evaluation Metrics.
Our proposed metrics complement, rather than replace, traditional ones like ATE and PEHE. They evaluate distinct facets of performance: while ATE/PEHE measure outcome accuracy, the CMF-Score assesses mechanism fidelity. This distinction is critical, as a model can achieve a high ATE via fortuitous error cancellation despite failing to learn the true data-generating process. To ensure our metrics are practically reliable, we conducted a controlled micro-simulation study, detailed in Appendix J. The results provide strong empirical evidence for their validity and complementary roles: the CIC-Score acts as a high-sensitivity SRE detector; the CMI-Score robustly tracks the fidelity of causal associations; and the KMD-Score serves as a final arbiter of distributional similarity. This validation confirms that our evaluation framework offers a more complete, nuanced, and reliable assessment of generative causal models.
5 Experiments
Our empirical evaluation is designed as a comprehensive test of our central thesis: that eliminating SRE is a necessary condition for generating authentic counterfactuals and unlocks analytical capabilities beyond the reach of conventional methods. We structured the study as a four-act narrative to rigorously test our claims. Act I establishes our modelâs state-of-the-art predictive fidelity on standard benchmarks. Act II provides a deep diagnostic analysis, using our proposed metrics as empirical evidence for the destructive effect of SRE. Act III showcases the unique capabilities unlocked by an information-conserving framework. Finally, Act IV validates the frameworkâs robustness through a series of stress tests and a full ablation study.
Evaluation Protocol.
For a rigorous evaluation, we employ two complementary protocols. This distinction is crucial, as it separates the assessment of our methodology's peak performance from the diagnostic analysis of its components.
1. Ensemble Evaluation for SOTA Performance: To benchmark against state-of-the-art methods (specifically, ITE estimation in Section 5.1.3), we adopt the standard Deep Ensemble methodology. We train N=5 independent models and report the final metric (e.g., PEHE) on the ensembled prediction.
2. Individual Model Evaluation for Diagnostic Analysis: In all other experiments where the goal is a fair, apples-to-apples architectural comparison or stability assessment, we report the mean and standard deviation of metrics from individual model instances across N=5 runs. This isolates the effect of design choices from the gains of ensembling.
We estimate the Average Treatment Effect (ATE) throughout our experiments using a standard counterfactual imputation procedure, the pseudo-code for which is detailed in Algorithm 1 in Appendix K.
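For intuition, the counterfactual-imputation idea underlying such a procedure can be sketched as follows. This is a generic illustration, not a reproduction of Algorithm 1 in Appendix K; `outcome_fn` stands in for a fitted outcome mechanism queried under $do(T=t)$:

```python
import numpy as np

def ate_by_imputation(outcome_fn, X, T):
    """Generic counterfactual-imputation ATE estimate.

    For each unit, impute both potential outcomes by querying the fitted
    outcome mechanism under do(T=1) and do(T=0), then average their
    difference across the sample.
    """
    y1 = outcome_fn(X, np.ones_like(T))   # imputed Y under do(T=1)
    y0 = outcome_fn(X, np.zeros_like(T))  # imputed Y under do(T=0)
    return float(np.mean(y1 - y0))
```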
Baseline Estimators.
The Directed Acyclic Graphs (DAGs) for our experiments are shown in Figure 3. We benchmark BELM-MDCM against a suite of baselines from the DoWhy library (sharma2022dowhy), spanning classical statistical methods to state-of-the-art machine learning estimators to ensure a comprehensive comparison.
<details>
<summary>x3.png Details</summary>

### Visual Description
## Causal Diagram: Treatment Effect
### Overview
The image presents a causal diagram illustrating the relationship between covariates, a categorical confounder, treatment, and outcome. The diagram uses nodes to represent variables and arrows to indicate causal relationships.
### Components/Axes
* **Nodes:**
* W1 (Covariate): A blue rectangle at the top-left.
* W2 (Covariate): A blue rectangle at the top-right.
* C1 (Categorical Confounder): A purple rectangle in the middle-top.
* T (Treatment): An orange diamond in the middle.
* Y (Outcome): A green circle at the bottom.
* **Edges (Arrows):** Gray arrows indicate causal relationships, except for the arrow from T to Y, which is orange.
### Detailed Analysis
* **W1 (Covariate):**
* Position: Top-left.
* Shape: Rectangle.
* Color: Blue outline, white fill.
* Outgoing Arrows: One to C1, one to T, one to Y.
* **W2 (Covariate):**
* Position: Top-right.
* Shape: Rectangle.
* Color: Blue outline, white fill.
* Outgoing Arrows: One to C1, one to T, one to Y.
* **C1 (Categorical Confounder):**
* Position: Middle-top.
* Shape: Rectangle.
* Color: Purple outline, white fill.
* Outgoing Arrow: One to T.
* **T (Treatment):**
* Position: Middle.
* Shape: Diamond.
* Color: Orange outline, white fill.
* Outgoing Arrow: One to Y (orange).
* **Y (Outcome):**
* Position: Bottom.
* Shape: Circle.
* Color: Green outline, white fill.
* Incoming Arrows: From W1, W2, T, and C1.
### Key Observations
* Covariates W1 and W2 influence the categorical confounder C1, the treatment T, and the outcome Y.
* The categorical confounder C1 influences the treatment T.
* The treatment T directly influences the outcome Y.
* The arrow from T to Y is highlighted in orange, possibly indicating the primary causal path of interest.
### Interpretation
The diagram represents a causal model where the treatment (T) affects the outcome (Y). However, this relationship is confounded by a categorical variable (C1) and influenced by covariates (W1 and W2). The orange arrow from T to Y likely represents the causal effect of the treatment on the outcome, which the analysis aims to isolate and quantify, potentially adjusting for the confounder and covariates. The diagram suggests that W1 and W2 are potential confounders as they influence both the treatment and the outcome.
</details>
(a) PSM Failure Scenario
<details>
<summary>x4.png Details</summary>

### Visual Description
## Causal Diagram: Confounding and Mediation
### Overview
The image is a causal diagram illustrating relationships between variables, including confounders, an instrumental variable, treatment, a mediator, and an outcome. The diagram uses shapes and arrows to represent the variables and their relationships.
### Components/Axes
* **Shapes:**
* Rectangles: Represent confounders (X1, X2).
* Hexagon: Represents the instrumental variable (Z).
* Diamond: Represents the treatment (T).
* Oval: Represents the mediator (M).
* Double Circle: Represents the outcome (Y).
* **Arrows:** Indicate the direction of influence or causal relationship.
* Solid Arrows: Represent direct causal effects.
* Dashed Arrows: Represent a potential or weaker causal effect.
* **Variables:**
* X1: Confounder
* X2: Confounder
* Z: Instrumental Variable
* T: Treatment
* M: Mediator
* Y: Outcome
### Detailed Analysis or Content Details
* **X1 (Confounder):** A rectangle in the top-left corner. It has arrows pointing towards both T (Treatment) and Y (Outcome).
* **X2 (Confounder):** A rectangle in the top-right corner. It has arrows pointing towards both T (Treatment) and Y (Outcome).
* **Z (Instrumental Variable):** A hexagon at the top-center. It has a dashed arrow pointing towards T (Treatment).
* **T (Treatment):** A diamond shape in the center. It receives arrows from X1, X2, and Z. It has a solid arrow pointing towards M (Mediator).
* **M (Mediator):** An oval shape below T. It receives an arrow from T and has an arrow pointing towards Y (Outcome).
* **Y (Outcome):** A double circle at the bottom. It receives arrows from X1, X2, and M.
**Arrow Details:**
* X1 -> T: Solid gray arrow.
* X1 -> Y: Solid gray arrow.
* X2 -> T: Solid gray arrow.
* X2 -> Y: Solid gray arrow.
* Z -> T: Dashed teal arrow.
* T -> M: Solid red arrow.
* M -> Y: Solid red arrow.
### Key Observations
* Confounders X1 and X2 influence both the treatment T and the outcome Y.
* The instrumental variable Z influences the treatment T but is assumed to be independent of the outcome Y except through its effect on T.
* The treatment T influences the mediator M, which in turn influences the outcome Y.
### Interpretation
The diagram illustrates a causal model where the relationship between a treatment (T) and an outcome (Y) is influenced by confounders (X1, X2) and mediated by a variable M. The instrumental variable Z is used to estimate the causal effect of T on Y by exploiting its influence on T. The diagram highlights the importance of considering confounding and mediation when analyzing causal relationships. The dashed line from Z to T indicates a potential or weaker causal effect, possibly representing an indirect or probabilistic influence. The red arrows from T to M and M to Y suggest a direct and potentially strong influence of the treatment on the mediator and the mediator on the outcome. The gray arrows from X1 and X2 to T and Y indicate confounding effects, where these variables influence both the treatment and the outcome, potentially biasing the estimated effect of T on Y.
</details>
(b) Ablation Study Scenario
<details>
<summary>x5.png Details</summary>

### Visual Description
## Causal Diagram: Confounders, Treatment, and Outcome
### Overview
The image presents a causal diagram illustrating the relationships between confounders, treatment, and outcome. The diagram uses nodes to represent variables and arrows to indicate causal relationships.
### Components/Axes
* **Nodes:**
* **Confounders:** Represented by a blue rectangle at the top. The label reads "Confounders" with "(age, educ, re74, etc.)" below it.
* **Treatment:** Represented by an orange diamond in the middle. The label reads "Treatment" with "(treat)" below it.
* **Outcome:** Represented by a green circle at the bottom. The label reads "Outcome" with "(re78)" below it.
* **Arrows:**
* A blue arrow points from "Confounders" to "Treatment".
* A blue arrow curves from "Confounders" to "Outcome".
* A red arrow points from "Treatment" to "Outcome".
### Detailed Analysis
* **Confounders:** The blue rectangle at the top contains the text "Confounders" and lists examples such as "age, educ, re74, etc." This indicates that these factors influence both the treatment and the outcome.
* **Treatment:** The orange diamond in the middle is labeled "Treatment (treat)". This represents the intervention or exposure being studied.
* **Outcome:** The green circle at the bottom is labeled "Outcome (re78)". This represents the result or effect being measured.
* **Causal Paths:**
* The blue arrow from "Confounders" to "Treatment" indicates that confounders influence the treatment received.
* The red arrow from "Treatment" to "Outcome" indicates that the treatment has a direct effect on the outcome.
* The blue arrow from "Confounders" to "Outcome" indicates that confounders also directly influence the outcome, independent of the treatment.
### Key Observations
* The diagram highlights the potential for confounding, where the observed relationship between treatment and outcome may be influenced by other factors.
* The curved arrow from "Confounders" to "Outcome" suggests a non-linear or complex relationship.
### Interpretation
The causal diagram illustrates a common scenario in observational studies where confounders can distort the observed effect of a treatment on an outcome. The diagram suggests that to accurately estimate the treatment effect, it is necessary to account for the influence of confounders on both the treatment and the outcome. The diagram emphasizes the importance of considering potential confounders when analyzing the relationship between treatment and outcome.
</details>
(c) Lalonde Confounding Structure
Figure 3: Directed Acyclic Graphs (DAGs) for key experiments. (a) A structure designed to challenge propensity score methods. (b) A mediation structure used for the ablation study. (c) The standard confounding structure assumed for both Lalonde-based experiments.
5.1 Act I: Establishing State-of-the-Art Predictive Fidelity
We first establish that our principled design achieves superior predictive fidelity on standard causal inference benchmarks.
5.1.1 Robustness in Non-Linear Confounding Scenarios
We tested our model in a challenging synthetic scenario (Figure 3(a)) designed with highly non-linear confounding to cause propensity-based methods to fail. Table 1 shows the results. While Causal Forest is exceptionally accurate on this specific DGP, our BELM-MDCM framework secures its position as the second most accurate method, delivering a highly stable and competitive ATE estimate. Crucially, it significantly outperforms the entire suite of propensity-based methods and powerful estimators like DML in accuracy. The high standard deviation of DML highlights its unreliability in this context, validating our model as a robust estimator where traditional approaches are compromised.
Table 1: ATE Estimation on the PSM Failure Scenario (True ATE = 5000). We report the mean ATE and standard deviation across multiple runs.
| Method | ATE Estimate (Mean $±$ Std) | Absolute Error |
| --- | --- | --- |
| Causal Forest | 4895.77 $±$ 69.26 | 104.23 |
| Propensity Score Stratification | 5309.38 $±$ 185.36 | 309.38 |
| Linear Regression | 5348.82 $±$ 23.23 | 348.82 |
| Propensity Score Matching | 5353.93 $±$ 191.36 | 353.93 |
| Inverse Propensity Weighting | 5385.68 $±$ 52.03 | 385.68 |
| Double Machine Learning | 4285.63 $±$ 550.97 | 714.37 |
5.1.2 Accuracy and Robustness on Real-World Observational Data
We next evaluated our framework on the canonical Lalonde dataset (lalonde1986evaluating), a challenging real-world benchmark with a known RCT ground truth. Table 2 demonstrates the comprehensive superiority of our BELM-MDCM framework. It achieved a mean ATE estimate of 1567.36 $±$ 201.62, the lowest error among all methods that correctly identified the treatment effect's positive direction. More critically, the results highlight a stark contrast in reliability. Classical methods failed entirely, while the powerful Causal Forest baseline suffered from extreme instability (Std Dev of 785.59). In contrast, BELM-MDCM exhibited remarkable robustness, with a standard deviation approximately four times lower. This outstanding performance on a canonical benchmark validates that our framework delivers accurate estimates with the consistency essential for trustworthy causal inference.
Table 2: ATE Estimation Stability on the Lalonde Dataset (RCT Benchmark ATE $\approx 1794$). Results for all models are reported as Mean $±$ Standard Deviation across 5 independent runs.
| Method | ATE Estimate | Absolute Error |
| --- | --- | --- |
| BELM-MDCM | 1567.36 $±$ 201.62 | 226.64 |
| Causal Forest | 1085.30 $±$ 785.59 | 708.70 |
| Linear Regression | 46.33 $±$ 76.80 | 1747.67 |
| Propensity Score Matching | -3.96 $±$ 118.37 | 1797.96 |
| Propensity Score Stratification | -35.54 $±$ 81.44 | 1829.54 |
| Propensity Score Weighting | -122.55 $±$ 50.51 | 1916.55 |
| Double Machine Learning | nan $±$ nan | nan |
5.1.3 High-Fidelity ITE Estimation and Stability Analysis
Objective. We evaluate performance at the individual level using a semi-synthetic version of the Lalonde dataset. This experiment leverages real-world covariates and assumes the causal structure depicted in Figure 3(c). To rigorously assess both accuracy and reliability, we follow our Individual Model Evaluation protocol, reporting the mean and standard deviation of performance across 5 independent runs for each method.
Results. The PEHE results, presented in Table 3, confirm the exceptional fidelity and robustness of our framework. BELM-MDCM achieves the lowest average PEHE score of 537.84 and demonstrates remarkable stability with the lowest standard deviation of just 60.11. This performance is closely followed by Causal Forest. However, the results also highlight the instability of other meta-learners; X-Learner, in particular, exhibits extremely high variance, with a standard deviation more than three times larger than its competitors, rendering its single-run estimates unreliable. This highlights the dual advantage of our framework: superior accuracy combined with consistent, trustworthy performance. Figure 4 provides visual confirmation, showing the tight clustering of our model's ensembled ITE estimates around the ground truth.
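The PEHE metric used throughout this section admits a compact definition: the root-mean-squared error between estimated and true individual treatment effects. The sketch below is our own illustration (function names and toy data are not from the paper's code):

```python
import numpy as np

def pehe(tau_hat: np.ndarray, tau_true: np.ndarray) -> float:
    """Precision in Estimation of Heterogeneous Effects:
    sqrt(mean((tau_hat - tau_true)^2)) over individuals."""
    return float(np.sqrt(np.mean((tau_hat - tau_true) ** 2)))

# Toy check: ITE estimates that are off by a constant 3.0 give PEHE = 3.0.
tau_true = np.array([100.0, 200.0, 300.0])
tau_hat = tau_true + 3.0
print(pehe(tau_hat, tau_true))  # 3.0
```

Unlike the ATE error in Tables 1 and 2, PEHE penalizes per-individual deviations even when they cancel on average, which is why it is the definitive metric for individual-level fidelity.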
Figure 4: Accuracy of Individual Treatment Effect (ITE) Estimation on the semi-synthetic Lalonde dataset. The plot shows the ensembled estimated ITE from our model versus the true ITE. The tight clustering of our model's estimates (blue dots) around the perfect-match line (red dashed line) visually demonstrates its low PEHE score.
Table 3: ITE Estimation Accuracy (PEHE) on the Semi-Synthetic Lalonde Dataset. Results are reported as Mean $±$ Standard Deviation across 5 independent runs. Lower is better.
| Method | PEHE |
| --- | --- |
| BELM-MDCM | 537.84 $±$ 60.11 |
| Causal Forest | 563.90 $±$ 73.66 |
| S-Learner | 816.26 $±$ 79.17 |
| X-Learner | 1546.38 $±$ 679.09 |
Table 4: Causal Mechanism Fidelity (CMI-Score) on the Semi-Synthetic Lalonde Dataset. Results are reported as Mean $±$ Standard Deviation across 5 runs. Higher is better.
| Method | CMI-Score |
| --- | --- |
| S-Learner | 0.9905 $±$ 0.0062 |
| BELM-MDCM | 0.9824 $±$ 0.0092 |
| Causal Forest | 0.9786 $±$ 0.0099 |
| X-Learner | 0.9782 $±$ 0.0145 |
| T-Learner | 0.9555 $±$ 0.0113 |
5.2 Act II: Uncovering the Accuracy-Invertibility Trade-off
We now conduct the pivotal experiment of our study: a deep diagnostic analysis using our novel CIC-Score to reveal the trade-off between predictive accuracy and mechanism invertibility. This provides the core empirical evidence for our thesis by comparing three paradigms: our BELM-MDCM (Learned Invertibility), a DDIM variant (Flawed Invertibility), and a classic RF-ANM (Assumed Invertibility).
The results in Table 5 decisively validate our framework's principles. Our BELM-MDCM is the clear leader, achieving the lowest PEHE score (1071.95) with high stability. Critically, its CIC-Score of 0.3679 is more than double that of the NF-SCM baseline and orders of magnitude above the DDIM and RF-ANM variants, proving its unique ability to learn an invertible mapping that conserves causal information. In stark contrast, the DDIM-MDCM model exemplifies the failure predicted by our theory: its near-zero CIC-Score confirms a near-total collapse of causal information due to SRE, leading to unreliable predictions (high PEHE and variance). The classical RF-ANM, while structurally invertible, lacks the capacity to learn the true mechanism, resulting in a zero CIC-Score and poor accuracy. This "Golden Table" experiment underscores that both structural integrity and powerful modeling capacity are essential for high-fidelity causal inference.
Table 5: The "Ultimate Golden Table": A comparative analysis of model classes on predictive accuracy (PEHE) and structural integrity (CIC-Score). This table includes the NF-SCM baseline, which empirically validates the likelihood-fidelity dilemma. Results are reported as Mean $±$ Standard Deviation across 5 runs. Lower PEHE is better; higher CIC-Score is better.
| Model | PEHE | CIC-Score |
| --- | --- | --- |
| RF-ANM | 1533.18 $±$ 134.24 | 0.0000 $±$ 0.0000 |
| DDIM-MDCM | 2085.98 $±$ 788.12 | 0.0065 $±$ 0.0130 |
| NF-SCM | 442229.96 $±$ 66963.73 | 0.1572 $±$ 0.0232 |
| BELM-MDCM | 1071.95 $±$ 152.11 | 0.3679 $±$ 0.0000 |
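The SRE diagnostic underlying Table 5 can be illustrated with a minimal abduction round trip. The sketch below is our own stand-in (the paper's mechanisms are diffusion-based; here a simple invertible affine mechanism plays the role of the learned SCM, and a deliberately lossy inverse stands in for an SRE-prone sampler):

```python
import numpy as np

rng = np.random.default_rng(0)
x, u = rng.normal(size=1000), rng.normal(size=1000)

def f(x, u):             # toy invertible mechanism: y = 2x + u
    return 2.0 * x + u

def abduct_exact(x, y):  # exact analytical inverse: zero SRE by construction
    return y - 2.0 * x

def abduct_lossy(x, y):  # stand-in for an approximate inverter
    return np.round(y - 2.0 * x, 1)  # quantization destroys information

y = f(x, u)
for abduct in (abduct_exact, abduct_lossy):
    u_hat = abduct(x, y)
    sre = np.mean((f(x, u_hat) - y) ** 2)  # structural reconstruction error
    print(abduct.__name__, sre)
# abduct_exact yields SRE at machine precision; abduct_lossy does not.
```

The same round trip (abduct the noise, regenerate, compare) is what separates "Learned Invertibility" from "Flawed Invertibility" in the comparison above.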
Figure 5: The training loss curve for the Conditional Normalizing Flow (NF) baseline. The smooth, stable convergence to a low negative log-likelihood value indicates a successful statistical training run. However, this did not correspond to learning the true causal mechanism, as evidenced by its extremely high PEHE score.
The Likelihood-Fidelity Dilemma: Why Natively Invertible Models Can Fail.
To rigorously test the limits of models that are natively invertible, we conducted a comprehensive stability analysis on a Conditional Normalizing Flow (NF) baseline, a model class that satisfies the Causal Information Conservation principle by construction (SRE $\equiv 0$). Across five independent runs with different random seeds, the NF model consistently demonstrated successful statistical learning, with its negative log-likelihood loss converging stably to a low value (i.e., a high log-likelihood) in each instance (a representative example is shown in Figure 5).
However, this statistical success was starkly contrasted by a systematic and catastrophic failure in the causal task. The model yielded an average PEHE score of 442,229.96 $±$ 66,963.73, confirming that its generated counterfactuals were fundamentally incorrect. This consistent result provides decisive evidence for a critical challenge we term the likelihood-fidelity dilemma: a model can perfectly learn to replicate a data distribution while remaining completely ignorant of the underlying causal mechanism.
The root of this dilemma is the fundamental mismatch between the optimization objective and the causal goal. The maximum likelihood objective incentivizes the NF to find any invertible mapping that transforms the data to a simple base distribution. While an infinite number of such mappings may be statistically equivalent, only one corresponds to the true, unique causal data-generating process. Without a guiding signal, the NF is mathematically predisposed to learn a causally incorrect "statistical shortcut." This finding powerfully underscores the contribution of our Hybrid Training Objective. It acts as the crucial causal inductive bias that resolves this dilemma, compelling the model to learn the unique, causally salient structure and enabling valid causal inference where pure likelihood-based methods, even those with zero SRE, are destined to fail.
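The dilemma admits a two-line counterexample. In the toy SCM below (our own illustration, not one of the paper's benchmarks), two exactly invertible models assign identical likelihood to every observation yet produce opposite counterfactuals:

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.normal(size=5)
x = rng.integers(0, 2, size=5)       # binary cause

# True SCM: Y = U if X = 0, and Y = -U if X = 1.
y = np.where(x == 0, u, -u)
# Because U ~ N(0,1) is symmetric, p(y | x) = N(0,1) for BOTH values of x,
# so a "shortcut" flow with z = y (ignoring x) attains the same likelihood.

# Causally correct abduction and generation under do(X := 1 - x):
u_true = np.where(x == 0, y, -y)
y_cf_true = np.where(1 - x == 0, u_true, -u_true)

# Statistical shortcut: abduction z = y, so the counterfactual equals the factual.
y_cf_shortcut = y

print(y_cf_true)      # sign-flipped outcomes
print(y_cf_shortcut)  # unchanged outcomes, despite identical likelihood
```

Maximum likelihood alone cannot distinguish the two models; only an additional causal inductive bias, such as the Hybrid Training Objective, selects the correct one.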
5.3 Act III: Unlocking Deeper Causal Inquiry with a High-Fidelity Model
An information-conserving model serves as a trustworthy "world model" for deep causal inquiry. We showcase three applications uniquely enabled by our framework's high-fidelity counterfactuals.
Heterogeneity Analysis: Conditional ATE (CATE).
A reliable ITE model can act as a "causal microscope." We use it to explore treatment effect heterogeneity by estimating the Conditional Average Treatment Effect (CATE) for subpopulations. By averaging the results from our five independently trained models, we obtain stable and robust estimates. Our model's mean CATE estimates track the true CATE trends with high fidelity across different education levels, a capability crucial for policy-making. For instance, for individuals with an education level of 3.0, the estimated CATE was $2562.79 (true CATE: $2280.79). For levels 8.0, 12.0, and 16.0, the estimates were $2092.91 (true: $2118.61), $2253.76 (true: $2384.44), and $2434.89 (true: $2490.21), respectively, demonstrating a close correspondence to the ground truth.
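Operationally, these CATE estimates reduce to conditioning the model's ITE predictions on a covariate level and averaging within each subgroup. A minimal sketch (variable names and toy values are ours):

```python
import numpy as np

def cate_by_group(ite_hat: np.ndarray, covariate: np.ndarray) -> dict:
    """Average estimated ITEs within each level of a discrete covariate."""
    return {int(level): float(ite_hat[covariate == level].mean())
            for level in np.unique(covariate)}

# Toy data: per-individual ITE estimates grouped by education level.
educ = np.array([3, 3, 8, 8, 12, 12])
ite = np.array([2500.0, 2600.0, 2000.0, 2200.0, 2300.0, 2400.0])
print(cate_by_group(ite, educ))
# {3: 2550.0, 8: 2100.0, 12: 2350.0}
```

Averaging over the five independently trained models before grouping, as in the protocol above, further stabilizes each subgroup estimate.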
Figure 6: Causal Attribution Analysis. This chart shows the counterfactual outcome for the "Victim" if they had possessed the individual-specific exogenous factors of the "Top Responder".
Causal Attribution: Isolating the Effect of Exogenous Factors.
We conducted a causal attribution experiment via a counterfactual intervention of the form $do(U_{\text{victim}}:=u_{\text{responder}})$. Here, "luck" serves as an intuitive shorthand for the exogenous noise variable $U$: within the Structural Causal Model (SCM) framework, $U$ represents all unobserved, individual-specific factors (e.g., intrinsic ability, random chance, measurement errors) that, together with the observed parent variables ($\mathbf{Pa}$), determine the final outcome for an individual. Figure 6 shows that our framework can losslessly recover these factors, revealing that this exogenous "luck" had a massive and stable causal effect, averaging a +22,334.32 change in the "Victim's" outcome. This capacity for reliable attribution is a unique advantage of our information-conserving framework.
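Mechanically, the intervention $do(U_{\text{victim}}:=u_{\text{responder}})$ is an abduct-swap-regenerate loop. The sketch below uses a toy additive mechanism as a stand-in for the learned diffusion model (all names, coefficients, and values are ours, purely illustrative):

```python
import numpy as np

def mechanism(pa: np.ndarray, u: float) -> float:
    """Toy stand-in for the learned outcome mechanism f(Pa, U)."""
    return float(10_000.0 + 500.0 * pa.sum() + u)

def abduct(pa: np.ndarray, y: float) -> float:
    """Exact inverse of the toy mechanism (zero SRE)."""
    return y - 10_000.0 - 500.0 * pa.sum()

pa_victim, y_victim = np.array([1.0, 2.0]), 12_100.0
pa_resp, y_resp = np.array([0.0, 1.0]), 14_000.0

u_resp = abduct(pa_resp, y_resp)             # abduct the responder's "luck"
y_victim_cf = mechanism(pa_victim, u_resp)   # victim with responder's luck
print(y_victim_cf - y_victim)                # change attributable to U
```

With zero SRE, the swapped-in noise is recovered exactly, so the entire outcome change is attributable to the exogenous factor rather than to reconstruction error.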
(a) Fairness audit for attribute "black".
(b) Fairness audit for attribute "hisp".
Figure 7: Counterfactual fairness audits reveal significant outcome disparities based on sensitive attributes. The plots show the distribution of actual outcomes for each group versus the distribution of their counterfactual outcomes had their sensitive attribute been different.
Counterfactual Fairness Audit.
Finally, we applied our framework to a counterfactual fairness audit. Only a model that faithfully represents the data-generating process can reliably answer questions about fairness. Figure 7 reveals significant and stable disparities: our model estimates an average fairness gap of -1,678.16 for the "black" attribute and +1,862.12 for the "hisp" attribute, demonstrating its capacity as a powerful tool for ethical audits.
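Given a counterfactual generator, the audit procedure is simple: abduct each individual's noise, flip the sensitive attribute, regenerate the outcome, and average the gap. A sketch with a toy linear mechanism in place of the learned model (all names and coefficients are our own assumptions):

```python
import numpy as np

def outcome(a: np.ndarray, z: np.ndarray, u: np.ndarray) -> np.ndarray:
    """Toy mechanism: sensitive attribute a, covariate z, exogenous noise u."""
    return 5_000.0 - 1_500.0 * a + 800.0 * z + u

rng = np.random.default_rng(7)
n = 371
a = np.ones(n)                      # audited group: sensitive attribute = 1
z = rng.normal(size=n)
u = rng.normal(scale=500.0, size=n)

y_actual = outcome(a, z, u)
u_hat = y_actual - (5_000.0 - 1_500.0 * a + 800.0 * z)  # exact abduction
y_cf = outcome(1.0 - a, z, u_hat)   # counterfactual: attribute flipped to 0

print(np.mean(y_cf - y_actual))     # average fairness gap (+1500.0 exactly here)
```

Because abduction is exact, the toy gap equals the mechanism's attribute coefficient; with a learned zero-SRE model the same loop yields the empirical gaps reported in Figure 7.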
5.4 Act IV: Final Validation: Stress Tests and Ablation Study
We conclude by subjecting the framework to two final tests: a stress test on a non-invertible SCM and a comprehensive ablation study.
5.4.1 Stress Test on a Non-Invertible SCM
We tested our framework's robustness when the theoretical assumption of an invertible SCM is violated, using a DGP where $Y\propto U_{Y}^{2}$ . The results in Table 6 and Figure 8 decisively validate our hypothesis. On the definitive metric of individual-level fidelity (PEHE), our zero-SRE BELM framework achieves an error of 0.77, a 44% reduction compared to the SRE-prone DDIM model. This empirically proves that even when the true SCM is non-invertible, eliminating algorithmic SRE provides a substantial advantage. This result is fully consistent with our theoretical analysis in Appendix C (Theorem 21), where the total error is decomposed into algorithmic, modeling, and representational errors. By eliminating the algorithmic error ( $E_{SR}\equiv 0$ ), our framework's performance approaches the theoretical limit set by the other two components.
Interestingly, the DDIM sampler achieves a slightly higher KMD-Score. We hypothesize that its inherent inversion noise acts as a form of implicit regularization, making the marginal generated distribution appear closer to the truth, even while individual-level counterfactuals are less accurate. This highlights the important distinction between distributional fidelity and individual-level causal accuracy.
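The stress test's many-to-one structure is easy to reproduce: when $Y$ depends on $U$ only through $U^{2}$, the sign of $U$ is unrecoverable, so no abduction procedure can be exact. A minimal illustration (the paper's full DGP includes covariates and treatment; this strips it down to the noise term):

```python
import numpy as np

rng = np.random.default_rng(3)
u = rng.normal(size=100_000)
y = u ** 2                    # many-to-one: +u and -u collide on the same y

# Best possible abduction recovers |u|; the sign bit is destroyed by the DGP.
u_hat = np.sqrt(y)
mag_err = float(np.max(np.abs(u_hat - np.abs(u))))    # ~0: magnitude recovered
sign_flip = float(np.mean(u_hat * np.sign(u) < 0))    # ~0.5: sign is not
print(mag_err, sign_flip)
```

This residual ambiguity is the representational error of Theorem 21; eliminating algorithmic SRE removes every other source of inversion error, which is why BELM still halves PEHE here.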
Figure 8: Robustness comparison on the many-to-one SCM. Our zero-SRE framework (BELM) demonstrates significantly superior performance on PEHE.
Table 6: Performance on the non-invertible SCM ( $Y\propto U^{2}$ ). Results are averaged over 5 runs ( $±$ std). Lower PEHE is better.
| Model | PEHE | KMD-Score |
| --- | --- | --- |
| BELM (Zero SRE) | 0.77 $±$ 0.16 | 0.830 $±$ 0.009 |
| DDIM (with SRE) | 1.38 $±$ 0.19 | 0.907 $±$ 0.023 |
5.4.2 Ablation Study: Deconstructing the Framework's Success
We conducted a comprehensive ablation study on a challenging synthetic dataset (Figure 3(b)) to validate the contribution of each core component. The findings, presented in Table 7 and Figure 9, provide unequivocal evidence for our integrated design. The full BELM-MDCM model establishes the gold standard for both accuracy and stability. The study reveals three critical insights:
- Decisive Role of the Hybrid Objective: Removing it causes a catastrophic decline in performance (400%+ error increase), demonstrating it is a core driver of the causal inductive bias, not merely a fine-tuning mechanism.
- Critical Importance of Targeted Modeling: Removing it leads to a complete collapse in model stability (Std Dev explodes from 4.57 to 138.01), validating our theoretical analysis on complexity control. A judicious allocation of model complexity is paramount for reproducibility.
- Robust Advantage of Exact Invertibility: Replacing BELM with DDIM leads to a clear degradation in both accuracy and stability, confirming that SRE from approximate inversion systematically erodes the quality of the final causal estimate.
Figure 9: Visualization of the ablation study results. The top panel shows the mean estimated ATE relative to the true value (red dashed line), with error bars indicating $± 1$ standard deviation. The bottom panel highlights the mean absolute error for each configuration. The full BELM-MDCM model is demonstrably the most accurate and stable.
Table 7: Ablation study results on a challenging synthetic dataset (True ATE = 202.29), validating the necessity of each framework component. The mean and standard deviation are computed over 5 runs.
| Configuration | Estimated ATE (mean $\pm$ std) | Abs. Error |
| --- | --- | --- |
| w/o Exact Invertibility (DDIM) | 219.98 $\pm$ 7.48 | 17.68 |
| w/o Hybrid Objective | 137.77 $\pm$ 23.99 | 64.53 |
| w/o Targeted Modeling | 253.20 $\pm$ 138.01 | 50.90 |
Conclusion.
The ablation study confirms that our framework's three core design principles (analytical invertibility, a hybrid training objective, and targeted modeling) work in synergy. The removal of any single component creates a significant vulnerability, validating the integrity and effectiveness of our integrated architectural design.
6 Causal Information Conservation as a Unifying Principle
The principle of Causal Information Conservation extends beyond a foundation for our model; it offers a unifying lens, a new taxonomy, for analyzing the suitability of any generative model for individual-level causal inference. Applying this principle and its operational metric, SRE, allows us to situate our work and clarify its unique advantages within the broader landscape.
6.1 Normalizing Flows: Natively Information-Conserving
Normalizing Flows (NFs) (dinh2014nice; dinh2017density; kingma2018glow) are designed around an invertible mapping. By construction, their Structural Reconstruction Error is identically zero, making NFs a native implementation of the Causal Information Conservation principle. However, their strengths come with limitations: the requirement of a tractable Jacobian imposes heavy architectural constraints, which can limit expressive power and introduce strong topological assumptions on the data manifold.
6.2 VAEs and GANs: Architecturally High-SRE
Variational Autoencoders (VAEs) (kingma2014autoencoding) and Generative Adversarial Networks (GANs) (goodfellow2014generative) are fundamentally ill-suited for this task. Their Structural Reconstruction Error is large and non-zero due to a fundamental architectural mismatch, as their separate encoder and decoder networks lack any structural guarantee of being inverses. Furthermore, their sources of information loss are inherent to their design; optimization objectives like the ELBOâs KL-divergence term actively encourage lossy compression, theoretically impeding the recovery of precise causal information.
6.3 A Comparative Perspective on Diffusion Models
Viewed through our principle, diffusion models occupy a unique and compelling space in this taxonomy. Standard diffusion-based approaches, using samplers like DDIM, aspire to conserve information, but their reliance on approximate inversion results in a non-zero SRE; they are "aspirational" but ultimately lossy. In contrast, BELM-MDCM (this work) achieves zero SRE by integrating an analytically invertible sampler, matching the theoretical purity of Normalizing Flows but without their rigid architectural constraints. Furthermore, unlike NFs trained with a generic likelihood objective, our framework's Hybrid Training objective provides a causally-oriented inductive bias. BELM-MDCM thus uniquely combines the rigorous invertibility of NFs with the modeling flexibility and task-specific power of the diffusion paradigm, making it ideally suited for principled, high-fidelity causal inference.
7 Conclusion and Future Work
This paper introduced Causal Information Conservation as a guiding principle for the emerging field of diffusion-based causal inference. Our primary contribution is not the concept of invertibility itself, but framing it as a foundational design requirement and identifying the Structural Reconstruction Error (SRE) as the precise, quantifiable cost of its violation.
Our proposed framework, BELM-MDCM, serves as a constructive proof of this principle. By being architected around analytical invertibility, it is the first to achieve zero SRE by design, shifting the focus from mitigating numerical errors to upholding a fundamental causal principle. This work provides a foundational blueprint and a more rigorous standard for applying the power of diffusion models to the profound challenges of causal inference, reconciling their flexibility with the logical rigor demanded by classical theory.
7.1 Limitations and Future Work
Our work highlights several avenues for future research:
- Handling Non-Invertible SCMs: Our framework excels when the true SCM is invertible. While our stress test shows robustness when this is violated, developing models inherently resilient to such misspecifications is a key challenge. Empirically validating the proposed prior-matching regularizer (Appendix C) is a concrete next step.
- Robustness to Graph Misspecification: Like most SCM-based methods (peters2017elements), our framework assumes a correctly specified causal graph. Analyzing how structural errors (e.g., omitted confounders) propagate through our error decomposition framework is a significant future research direction.
- Formalizing CIC within Information Theory: Our work defines CIC as an operational principle. A compelling direction is to formalize it within a rigorous information-theoretic framework, for instance by proving that zero SRE maximizes the mutual information $I(U;\hat{U})$ between the true and recovered noise, connecting our work to rate-distortion theory.
- Scalability and Generalizability: While powerful, the BELM sampler is computationally intensive. Improving its efficiency for high-dimensional settings is an important practical challenge. Furthermore, extending the principles of CIC and zero-SRE modeling to other data modalities, such as time-series or images, represents an exciting frontier.
Acknowledgments and Disclosure of Funding
We thank the anonymous reviewers for their insightful feedback which significantly improved the clarity and rigor of this paper.
Appendix A Core Diffusion Model Equations
This appendix provides the essential equations for the diffusion models referenced in this work. Diffusion models learn a data distribution by training a neural network $\boldsymbol{\epsilon}_{\theta}$ to reverse a fixed, gradual noising process.
The model is trained by optimizing a simplified score-matching objective (ho2020denoising):
$$
\begin{split}L_{\text{simple}}(\theta)=\mathbb{E}_{t,\mathbf{x}_{0},\boldsymbol{\epsilon}}\bigg[\Big\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_{\theta}\big(\sqrt{\bar{\alpha}_{t}}\mathbf{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}\boldsymbol{\epsilon},t\big)\Big\|^{2}\bigg]\end{split}
$$
where $\bar{\alpha}_{t}$ is a predefined noise schedule. For deterministic generation and inversion, we use the Denoising Diffusion Implicit Model (DDIM) update step (song2021denoising):
$$
\begin{split}\mathbf{x}_{t-1}=\sqrt{\bar{\alpha}_{t-1}}\left(\frac{\mathbf{x}_{t}-\sqrt{1-\bar{\alpha}_{t}}\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{t},t)}{\sqrt{\bar{\alpha}_{t}}}\right)+\sqrt{1-\bar{\alpha}_{t-1}}\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{t},t)\end{split}
$$
This process can be viewed as a discretization of a continuous-time probability flow Ordinary Differential Equation (ODE) (song2021score):
$$
d\mathbf{x}=\left[\frac{1}{2}\frac{d\log\alpha(s)}{ds}\mathbf{x}(s)-\frac{1}{2}\frac{d\log(1-\alpha(s))}{ds}\frac{\sqrt{1-\alpha(s)}}{\sqrt{\alpha(s)}}\boldsymbol{\epsilon}_{\theta}(\mathbf{x}(s),s)\right]ds
$$
Within our theoretical framework, the encoder operator $\mathbf{T}_{\theta}$ corresponds to solving this ODE forward in time (from $s=0$ to $s=1$ ), while the decoder operator $\mathbf{H}_{\theta}$ corresponds to solving it backward in time (from $s=1$ to $s=0$ ).
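The DDIM update above can be sketched in a few lines of numpy. Here `eps_theta` is a toy stand-in for the trained noise-prediction network and `abar` is an illustrative $\bar{\alpha}_t$ schedule, not the paper's actual configuration; iterating the step backward in $t$ plays the role of the decoder $\mathbf{H}_{\theta}$, while the analogous forward iteration would give the encoder $\mathbf{T}_{\theta}$.

```python
import numpy as np

def eps_theta(x, t):
    """Toy stand-in for the trained noise-prediction network."""
    return np.tanh(x)

abar = np.linspace(0.99, 0.01, 50)  # illustrative \bar{alpha}_t schedule

def ddim_step(x_t, t):
    """Deterministic DDIM update x_t -> x_{t-1} (the decoder direction)."""
    e = eps_theta(x_t, t)
    x0_hat = (x_t - np.sqrt(1 - abar[t]) * e) / np.sqrt(abar[t])  # predicted x_0
    return np.sqrt(abar[t - 1]) * x0_hat + np.sqrt(1 - abar[t - 1]) * e

x = np.random.default_rng(0).normal(size=4)  # start from x_T ~ N(0, I)
for t in range(len(abar) - 1, 0, -1):        # integrate the probability-flow ODE backward
    x = ddim_step(x, t)
```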
Appendix B Detailed Proofs for Identifiability (Theorems 2 & 4)
This appendix provides detailed, dimension-specific proofs for the identifiability of the exogenous noise $U$ and the subsequent correctness of counterfactual generation. The core challenge lies in showing that the statistical independence of the latent code $Z$ from the parents $\mathbf{Pa}$ is a sufficient condition to establish an isomorphic relationship between $Z$ and $U$ . The mathematical tools required differ based on the dimensionality of $U$ .
B.1 The High-Dimensional Case ($d\geq 3$)
For cases where the exogenous noise $U$ has a dimensionality $d\geq 3$, the proof leverages Liouville's theorem on conformal mappings.
Proof [Proof of Theorems 2 and 4 for $d\geq 3$] We adapt the proof from identifiable generative modeling (chao2023interventional) to our conditional operator setting.
Let $\mathbf{Q}_{\mathbf{pa}}(U):=\mathbf{T}_{\theta}(\mathbf{F}(\mathbf{pa},U),\mathbf{pa})$ be the composite function mapping the noise $U$ to the latent code $Z$ for a given context $\mathbf{pa}$ . By assumption, $\mathbf{Q}_{\mathbf{pa}}$ is invertible and differentiable. The core assumption, $Z\perp\!\!\!\perp\mathbf{Pa}$ , implies that the conditional density $p_{Z}(z|\mathbf{pa})$ must equal a marginal density $p_{Z}(z)$ that is independent of $\mathbf{pa}$ .
Using the change of variables formula, we relate the density of $Z$ to that of $U$ :
$$
p_{Z}(z|\mathbf{pa})=\frac{p_{U}(\mathbf{Q}_{\mathbf{pa}}^{-1}(z))}{\left|\det J_{\mathbf{Q}_{\mathbf{pa}}}(\mathbf{Q}_{\mathbf{pa}}^{-1}(z))\right|}
$$
where $J_{\mathbf{Q}_{\mathbf{pa}}}$ is the Jacobian of $\mathbf{Q}_{\mathbf{pa}}$ . Since the left-hand side is independent of $\mathbf{pa}$ and $p_{U}(·)$ is a fixed distribution, this imposes a strong constraint on the Jacobian determinant. Under regularity conditions, this implies that the Jacobian $J_{\mathbf{Q}_{\mathbf{pa}}}(u)$ must be a scaled orthogonal matrix, making $\mathbf{Q}_{\mathbf{pa}}$ a conformal map.
By Liouville's theorem, for dimensions $d\geq 3$, any conformal map must be a Möbius transformation (a composition of translations, scalings, orthogonal transformations, and inversions). For the map to be well-behaved, it must exclude the inversion component, which would introduce singularities. This is consistent with the regularity of functions representable by neural networks, simplifying the map to an affine form:
$$
\mathbf{Q}_{\mathbf{pa}}(u)=\mathbf{A}_{\mathbf{pa}}u+\mathbf{d}_{\mathbf{pa}}
$$
where $\mathbf{A}_{\mathbf{pa}}$ is a scaled orthogonal matrix. The argument then uses the independence of the distribution's moments and support to show that $\mathbf{A}_{\mathbf{pa}}$ and $\mathbf{d}_{\mathbf{pa}}$ must be constant w.r.t. $\mathbf{pa}$. This leads to the isomorphic relationship $\mathbf{T}_{\theta}(\mathbf{F}(\mathbf{Pa},U),\mathbf{Pa})=\mathbf{A}U+\mathbf{d}=g(U)$.
The proof of counterfactual correctness follows directly, as detailed in chao2023interventional. The conditional isomorphism $\mathbf{H}_{\theta}(\mathbf{T}_{\theta}(·,\mathbf{pa}),\mathbf{pa})=\mathbf{I}$ combined with the identifiability result $\mathbf{T}_{\theta}(\mathbf{F}(\mathbf{pa},u),\mathbf{pa})=g(u)$ implies that $\mathbf{H}_{\theta}(g(u),\mathbf{pa})=\mathbf{F}(\mathbf{pa},u)$ . The decoder thus perfectly mimics the true causal mechanism, making an intervention exact: $\hat{X}_{\boldsymbol{\alpha}}=\mathbf{H}_{\theta}(g(u),\boldsymbol{\alpha})=\mathbf{F}(\boldsymbol{\alpha},u)=X_{\boldsymbol{\alpha}}^{\text{true}}$ .
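The punchline of the proof, $\mathbf{H}_{\theta}(g(u),\boldsymbol{\alpha})=\mathbf{F}(\boldsymbol{\alpha},u)$, can be checked on a toy example. The SCM `F`, the affine map `g`, and the intervention value below are illustrative choices, not part of the paper's model:

```python
import numpy as np

F = lambda pa, u: pa * u + pa        # toy invertible SCM: X = F(pa, U)
g = lambda u: 2.0 * u + 1.0          # identified encoder composite, g(U) = A U + d
g_inv = lambda z: (z - 1.0) / 2.0
H = lambda z, pa: F(pa, g_inv(z))    # decoder implied by the identifiability result

u, pa, alpha = 0.7, 1.5, -2.0
z = g(u)                                       # abduction: observation -> latent code
assert np.isclose(H(z, pa), F(pa, u))          # round trip reproduces the factual
assert np.isclose(H(z, alpha), F(alpha, u))    # intervention is exact
```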
B.2 The One-Dimensional Case ( $d=1$ )
For the one-dimensional case, where Liouville's theorem does not apply, this section provides a dedicated proof leveraging properties of 1D functions and a uniform noise assumption from (chao2023interventional) that does not sacrifice generality.
We first establish a helper lemma characterizing a specific class of 1D functions.
**Lemma 18**
*For $U,Z\in\mathbb{R}$, consider a family of invertible functions $q_{\mathbf{pa}}:U\to Z$ for $\mathbf{pa}\in\mathcal{X}_{\text{pa}}\subseteq\mathbb{R}^{d}$. The derivative expression $\frac{dq_{\mathbf{pa}}}{du}(q_{\mathbf{pa}}^{-1}(z))$ is a function of $z$ only, i.e., $c(z)$, if and only if $q_{\mathbf{pa}}(u)$ can be expressed as
$$
q_{\mathbf{pa}}(u)=q(u+r(\mathbf{pa}))
$$
for some function $r$ and invertible function $q$ .*
Proof The proof is provided in (chao2023interventional). The reverse direction follows from direct differentiation, while the forward direction uses the inverse function theorem to show that the inverses $s_{\mathbf{pa}}(z)=q_{\mathbf{pa}}^{-1}(z)$ must have the same derivative $1/c(z)$ , implying they can only differ by an additive constant, which yields the desired form after inversion.
This lemma enables the proof of the theorem for the 1D case.
**Theorem 19 (Identifiability for $d=1$)**
*Assume for $X\in\mathcal{X}\subseteq\mathbb{R}$ and exogenous noise $U\sim\text{Unif}[0,1]$, the SCM is $X:=f(\mathbf{Pa},U)$. Assume an encoder-decoder model with encoding function $g$ and decoding function $h$. Assume the following conditions:
1. The encoding is independent of the parents, $g(X,\mathbf{Pa})\perp\!\!\!\perp\mathbf{Pa}$ .
1. The structural equation $f$ is differentiable and strictly increasing w.r.t. $U$ .
1. The encoding $g$ is invertible and differentiable w.r.t. $X$ .
Then, $g(f(\mathbf{Pa},U),\mathbf{Pa})=\tilde{q}(U)$ for an invertible function $\tilde{q}$ .*
Proof Let $q_{\mathbf{pa}}(U):=g(f(\mathbf{pa},U),\mathbf{pa})$ . The conditions on $f$ and $g$ ensure $q_{\mathbf{pa}}$ is strictly monotonic and thus invertible. By the independence assumption, the conditional distribution of $Z=q_{\mathbf{pa}}(U)$ does not depend on $\mathbf{pa}$ . We assume $U\sim\text{Unif}[0,1]$ without loss of generality. For any SCM with a continuous noise $E$ and a strictly increasing CDF $F_{E}$ , $X=f(\mathbf{Pa},E)$ can be re-parameterized to an equivalent SCM $X=\tilde{f}(\mathbf{Pa},U)$ , where $U=F_{E}(E)\sim\text{Unif}[0,1]$ and $\tilde{f}(·,·)=f(·,F_{E}^{-1}(·))$ . The modeling task is then to learn the potentially more complex function $\tilde{f}$ . The change of density formula gives:
$$
p_{Z}(z)=\frac{p_{U}(q_{\mathbf{pa}}^{-1}(z))}{|\frac{dq_{\mathbf{pa}}}{du}(q_{\mathbf{pa}}^{-1}(z))|}
$$
Since $p_{U}(u)=1$ on its support and $q_{\mathbf{pa}}$ is increasing, the denominator must be independent of $\mathbf{pa}$ . This implies $\frac{dq_{\mathbf{pa}}}{du}(q_{\mathbf{pa}}^{-1}(z))=c(z)$ for some function $c$ . This meets the condition of Lemma 18, allowing us to express $q_{\mathbf{pa}}(u)=q(u+r(\mathbf{pa}))$ for some invertible $q$ . Since $Z\perp\!\!\!\perp\mathbf{Pa}$ , its support must also be independent of $\mathbf{pa}$ . The support is $q([0,1]+r(\mathbf{pa}))$ , which is constant only if the interval $[r(\mathbf{pa}),1+r(\mathbf{pa})]$ is constant. This requires $r(\mathbf{pa})$ to be a constant, $r$ . Thus, $q_{\mathbf{pa}}(u)=q(u+r)$ . Defining $\tilde{q}(u)=q(u+r)$ , we find that the mapping is solely a function of $U$ , which completes the proof.
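The re-parameterization step used in the proof (absorbing the noise CDF into the structural equation so that $U\sim\text{Unif}[0,1]$) can be verified numerically. The exponential noise and the toy structural equation `f` below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# original SCM with exponential noise E:  X = f(pa, E)
def f(pa, e):
    return pa + np.log1p(e)          # toy structural equation

F_E  = lambda e: -np.expm1(-e)       # CDF of Exp(1): 1 - exp(-e)
F_Ei = lambda u: -np.log1p(-u)       # its inverse

E  = rng.exponential(size=10_000)
U  = F_E(E)                          # U = F_E(E) ~ Unif[0,1]
pa = 2.0

# re-parameterized SCM: X = f~(pa, U) with f~(., .) = f(., F_E^{-1}(.))
f_tilde = lambda pa, u: f(pa, F_Ei(u))

assert np.allclose(f(pa, E), f_tilde(pa, U))   # the two SCMs generate identical data
assert abs(U.mean() - 0.5) < 0.02              # U is (approximately) uniform
```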
B.3 The Two-Dimensional Case ( $d=2$ )
The two-dimensional case is a well-known geometric exception, as the group of conformal maps is infinite-dimensional. Consequently, the proof strategy used for higher dimensions via Liouville's theorem does not directly apply. Here, we show that identifiability still holds under additional, plausible regularity assumptions aligned with our modeling framework.
**Assumption 1 (Asymptotic Linearity)**
*The composite mapping $Q_{\mathbf{pa}}(u):\mathbb{C}\to\mathbb{C}$ is an entire function (analytic on the whole complex plane) with at most linear growth. That is, there exist constants $A,B$ such that $|Q_{\mathbf{pa}}(u)|\leq A|u|+B$ for all $u\in\mathbb{C}$.*
This assumption reflects a fundamental inductive bias of neural network architectures. Standard activation functions (e.g., ReLU, Tanh) produce functions that cannot exhibit super-polynomial growth or essential singularities at infinity, aligning our analysis with the function classes our model can represent.
**Assumption 2 (Non-Rotationally Symmetric Base Noise)**
*The base distribution of the exogenous noise $U$ is not rotationally symmetric. This assumption is made without loss of generality. For any arbitrary continuous noise $E=(E_{1},E_{2})$ , one can define a new noise $U=(F_{E_{1}}(E_{1}),E_{2})$ , where $F_{E_{1}}$ is the CDF of the first component. The resulting distribution of $U$ is uniform along its first axis, thus breaking any rotational symmetry. The structural function $\mathbf{F}$ then absorbs this transformation.*
Proof The proof proceeds in three steps. First, as established previously, the statistical independence condition $Z\perp\!\!\!\perp\mathbf{Pa}$ implies that the learned mapping $Q_{\mathbf{pa}}(u)$ must be a conformal map, and thus an analytic function on $\mathbb{C}$. Second, by Assumption 1, $Q_{\mathbf{pa}}(u)$ is an entire function with at most linear growth. The generalized Liouville theorem states that an entire function whose growth is bounded by a polynomial of degree $k$ must itself be a polynomial of degree at most $k$. In our case, this implies $Q_{\mathbf{pa}}(u)$ must be a polynomial of degree at most one, giving it the affine form:
$$
Q_{\mathbf{pa}}(u)=a(\mathbf{pa})u+b(\mathbf{pa})
$$
where $a,b$ are complex coefficients that can depend on $\mathbf{pa}$. Third, we use the full statistical independence condition to show that the coefficients $a$ and $b$ must be constant. For the distribution of $Z=a(\mathbf{pa})U+b(\mathbf{pa})$ to be independent of $\mathbf{pa}$, all of its properties must be constant.
- Mean: The mean $E[Z|\mathbf{pa}]=a(\mathbf{pa})E[U]+b(\mathbf{pa})$ must be constant. Assuming $E[U]=0$ without loss of generality implies $b(\mathbf{pa})$ must be a constant, $b$.
- Covariance: By Assumption 2, the distribution of $U$ is not rotationally symmetric, so its covariance matrix is not proportional to the identity. Any rotation induced by the phase of $a(\mathbf{pa})$ would alter the covariance structure of $Z$. For the distribution of $Z$ to be invariant, the rotation angle (phase) of $a(\mathbf{pa})$ must be constant.
- Scale: The change of variables formula implies that the magnitude $|a(\mathbf{pa})|$ must also be constant.
Since both the magnitude and phase of $a(\mathbf{pa})$ must be constant, $a(\mathbf{pa})$ must be a constant complex number, $a$. Therefore, $Q_{\mathbf{pa}}(u)=au+b$, which is an isomorphic mapping of $u$. This completes the proof.
Appendix C Extended Analysis of Inversion Fidelity and Non-Invertible SCMs
This appendix provides a unified analysis of inversion errors. We first provide rigorous proofs for inversion fidelity (Propositions 5 and 6), then extend our error decomposition framework to the more challenging non-invertible SCM setting.
C.1 Proofs for Inversion Fidelity (Propositions 5 & 6)
Proof [Proof of Proposition 5] The proof proceeds by deriving the explicit one-step reconstruction error and then analyzing its order of magnitude.
1. Derivation of the One-Step Reconstruction Error.
Let the single-step DDIM inversion operator be $\mathbf{T}_{t}$ , mapping an observation $\mathbf{x}_{t}$ to $\mathbf{x}^{\prime}_{t+1}$ using the noise prediction $\boldsymbol{\epsilon}_{t}=\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{t},t)$ . The corresponding generative operator is $\mathbf{H}_{t}$ , which reconstructs $\mathbf{x}^{\prime}_{t}$ from $\mathbf{x}^{\prime}_{t+1}$ using a new prediction at the new state, $\boldsymbol{\epsilon}^{\prime}_{t+1}=\boldsymbol{\epsilon}_{\theta}(\mathbf{x}^{\prime}_{t+1},t+1)$ .
By substituting the formula for $\mathbf{x}^{\prime}_{t+1}$ into the update for $\mathbf{x}^{\prime}_{t}$ , the single-step reconstruction error $\mathbf{x}^{\prime}_{t}-\mathbf{x}_{t}$ is found to be:
$$
\mathbf{x}^{\prime}_{t}-\mathbf{x}_{t}=\left(\sqrt{1-\bar{\alpha}_{t}}-\frac{\sqrt{\bar{\alpha}_{t}}\sqrt{1-\bar{\alpha}_{t+1}}}{\sqrt{\bar{\alpha}_{t+1}}}\right)(\boldsymbol{\epsilon}^{\prime}_{t+1}-\boldsymbol{\epsilon}_{t})
$$
This error is non-zero if and only if the noise prediction changes after one inversion step, i.e., $\boldsymbol{\epsilon}^{\prime}_{t+1}\neq\boldsymbol{\epsilon}_{t}$.
2. Analysis of the Errorâs Order of Magnitude.
In the continuous-time limit with time step $\Delta s=1/T$ , a Taylor expansion shows that both the coefficient term and the difference in noise predictions $(\boldsymbol{\epsilon}^{\prime}_{t+1}-\boldsymbol{\epsilon}_{t})$ are of order $\mathcal{O}(\Delta s)$ . The one-step reconstruction error is therefore the product of these two terms:
$$
\text{Error}_{\text{step}}=\mathcal{O}(\Delta s)\times\mathcal{O}(\Delta s)=\mathcal{O}((\Delta s)^{2})=\mathcal{O}(1/T^{2})
$$
This local error accumulates over the $T$ steps of the trajectory, resulting in a total accumulated error of order $\mathcal{O}(1/T)$ . For any finite $T$ , this global error is non-zero, constituting the Structural Reconstruction Error (SRE) for DDIM.
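The one-step reconstruction error formula can be checked directly: implement one naive DDIM inversion step followed by one generation step and compare the residual against the closed form. The noise network `eps_theta` and the schedule values are toy assumptions:

```python
import numpy as np

def eps_theta(x, t):
    """Toy noise-prediction network (any smooth nonlinear map works)."""
    return np.tanh(x) + 0.1 * t

abar_t, abar_t1 = 0.9, 0.8           # illustrative \bar{alpha}_t, \bar{alpha}_{t+1}
t = 1
x_t = np.array([0.5, -1.2])

# inversion step T_t: reuses the prediction eps_t made at x_t
e_t = eps_theta(x_t, t)
x_t1 = np.sqrt(abar_t1) * (x_t - np.sqrt(1 - abar_t) * e_t) / np.sqrt(abar_t) \
       + np.sqrt(1 - abar_t1) * e_t

# generation step H_t: re-evaluates the network at the new state
e_t1 = eps_theta(x_t1, t + 1)
x_t_rec = np.sqrt(abar_t) * (x_t1 - np.sqrt(1 - abar_t1) * e_t1) / np.sqrt(abar_t1) \
          + np.sqrt(1 - abar_t) * e_t1

# closed-form one-step reconstruction error from the proposition
coef = np.sqrt(1 - abar_t) - np.sqrt(abar_t) * np.sqrt(1 - abar_t1) / np.sqrt(abar_t1)
residual = x_t_rec - x_t
assert np.allclose(residual, coef * (e_t1 - e_t))   # matches the derived formula
```

Because `eps_theta` is nonlinear, the prediction changes between steps and the residual is non-zero: the lossy round trip the proposition describes.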
Proof [Proof of Proposition 6] The proof is constructive, following from the exact algebraic invertibility of the BELM sampler (liu2024belm). For the second-order BELM used in this work, the one-step decoder is an affine transformation of the form:
$$
\mathbf{x}_{t-1}=A_{t}\mathbf{x}_{t}+B_{t}\boldsymbol{\epsilon}_{t}+C_{t}\boldsymbol{\epsilon}_{t+1}
$$
where $A_{t},B_{t},C_{t}$ are schedule-dependent coefficients, $\boldsymbol{\epsilon}_{t}=\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{t},t)$ , and $\boldsymbol{\epsilon}_{t+1}=\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{t+1},t+1)$ .
The full-trajectory decoder, $\mathbf{H}_{\text{BELM}}$ , is a composition of these one-step affine maps. The BELM encoder, $\mathbf{T}_{\text{BELM}}$ , is constructed using a symmetric update rule designed to be the exact algebraic inverse of the decoder. As rigorously shown by liu2024belm, this construction ensures that the composite operator $\mathbf{T}_{\text{BELM}}$ is the exact inverse of $\mathbf{H}_{\text{BELM}}$ , assuming the same sequence of noise function evaluations is used for both processes.
Therefore, by its algebraic construction, the BELM sampler guarantees a lossless round trip:
$$
\mathbf{H}_{\text{BELM}}\circ\mathbf{T}_{\text{BELM}}=\mathbf{I}
$$
The Structural Reconstruction Error is thus identically zero by construction.
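A minimal numpy sketch of this guarantee, with made-up coefficients $A_t,B_t,C_t$ and cached noise evaluations standing in for the actual BELM schedule and network:

```python
import numpy as np

A, B, C = 0.95, -0.2, 0.1             # hypothetical schedule coefficients (A != 0)
x_t  = np.array([1.0, -0.7])
e_t  = np.array([0.4, -0.1])          # cached eps_theta(x_t, t)
e_t1 = np.array([0.3,  0.3])          # cached eps_theta(x_{t+1}, t+1)

# one-step affine decoder H_t
x_prev = A * x_t + B * e_t + C * e_t1

# encoder T_t: the exact algebraic inverse of the affine map, reusing the SAME
# cached noise evaluations (the key assumption in the proposition)
x_t_rec = (x_prev - B * e_t - C * e_t1) / A

assert np.allclose(x_t_rec, x_t, atol=1e-12)   # lossless round trip
```

Because the inverse is solved algebraically rather than approximated, the round-trip error is zero up to floating-point precision, which is exactly the Zero-SRE property.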
C.2 Exhaustive Analysis for the Non-Invertible SCM Setting
This section rigorously extends our framework to the non-invertible case, providing a theoretical underpinning for the stress test results in § 5.4.1.
C.2.1 Assumptions and Definitions
We formalize the problem with the following.
**Assumption 3 (Well-Posed Abduction)**
*For any observed $(\mathbf{v},\mathbf{pa})$, the inverse image set $\mathcal{U}_{(\mathbf{v},\mathbf{pa})}=\{\mathbf{u}^{\prime}\in\mathcal{U}\mid\mathbf{F}(\mathbf{pa},\mathbf{u}^{\prime})=\mathbf{v}\}$ is non-empty, and the Maximum a Posteriori (MAP) solution over this set, given the prior $p(\mathbf{U})$, is unique. This solution defines the ideal amortized inverse operator, $\mathbf{T}^{*}(\mathbf{v},\mathbf{pa})=\arg\max_{\mathbf{u}^{\prime}\in\mathcal{U}_{(\mathbf{v},\mathbf{pa})}}p(\mathbf{u}^{\prime})$.*
**Definition 20 (Tripartite Error Sources)**
*In the non-invertible case, we refine the error sources into three distinct components:
1. Algorithmic Error (SRE): The error from an imperfect inversion algorithm, $E_{SR}:=\|(\mathbf{H}_{\theta}\circ\mathbf{T}_{\theta}-\mathbf{I})\mathbf{X}\|^{2}$ . For our framework, $E_{SR}\equiv 0$ .
1. Modeling Error: The error from imperfectly learning the ideal amortized inverse, $E_{\text{Modeling}}:=\|\mathbf{T}_{\theta}(\mathbf{V},\mathbf{Pa})-\mathbf{T}^{*}(\mathbf{V},\mathbf{Pa})\|^{2}$ .
1. Representational Error: The fundamental, irreducible error from the SCMâs non-invertibility, $E_{\text{Rep}}:=\|\mathbf{T}^{*}(\mathbf{V},\mathbf{Pa})-\mathbf{U}_{\text{true}}\|^{2}$ .*
C.2.2 A Tighter Error Decomposition
We now present a tighter error bound for the non-invertible case.
**Theorem 21 (Tighter Counterfactual Error Bound)**
*Let the conditions of Theorem 8 hold ( $H_{\theta}$ is $L_{H}$ -Lipschitz). The expected squared error of the counterfactual prediction is bounded by:
$$
\begin{split}\mathbb{E}\left[\|\hat{\mathbf{X}}_{\alpha}-\mathbf{X}_{\alpha}^{\text{true}}\|^{2}\right]\leq&2\mathbb{E}[E_{SR}]+4L_{H}^{2}\mathbb{E}[E_{\text{Modeling}}]+4L_{H}^{2}\mathbb{E}[E_{\text{Rep}}]\end{split}
$$*
Proof We decompose the total error $\|\hat{\mathbf{X}}_{\alpha}-\mathbf{X}_{\alpha}^{\text{true}}\|$ using the triangle inequality and the ideal amortized inverse $\mathbf{T}^{*}$ as an intermediate step. The result follows from applying the inequality $(a+b)^{2}\leq 2(a^{2}+b^{2})$ twice, which yields the coefficients $2$ and $4L_{H}^{2}$, bounding terms with the $L_{H}$-Lipschitz property of $\mathbf{H}_{\theta}$, and taking expectations.
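The constants in the bound come from the elementary vector inequality $\|a+b+c\|^{2}\leq 2\|a\|^{2}+4\|b\|^{2}+4\|c\|^{2}$, obtained by applying $\|x+y\|^{2}\leq 2\|x\|^{2}+2\|y\|^{2}$ twice; a quick random check confirms it:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two applications of ||x+y||^2 <= 2||x||^2 + 2||y||^2:
#   ||a+b+c||^2 <= 2||a||^2 + 2||b+c||^2 <= 2||a||^2 + 4||b||^2 + 4||c||^2
a, b, c = rng.normal(size=(3, 1000, 5))
lhs = np.sum((a + b + c) ** 2, axis=1)
rhs = 2 * np.sum(a ** 2, axis=1) + 4 * np.sum(b ** 2, axis=1) + 4 * np.sum(c ** 2, axis=1)
assert np.all(lhs <= rhs + 1e-9)   # the inequality holds for every random triple
```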
C.2.3 Information-Theoretic Interpretation of Representational Error
The Representational Error ( $E_{\text{Rep}}$ ) is deeply connected to information conservation. The non-invertibility of the SCM $\mathbf{F}$ means that observing $(\mathbf{V},\mathbf{Pa})$ is not sufficient to uniquely determine $\mathbf{U}$ . Information-theoretically, this implies the conditional entropy $H(\mathbf{U}|\mathbf{V},\mathbf{Pa})$ is greater than zero.
**Remark 22 (Representational Error as Information Loss)**
*The ideal amortized inverse $\mathbf{T}^{*}$ yields the mode of the posterior $p(\mathbf{U}|\mathbf{V},\mathbf{Pa})$ . The expected representational error, $\mathbb{E}[E_{\text{Rep}}]$ , can be seen as the expected squared error of this MAP estimator, which is related to the variance and shape of the posterior. Thus, $E_{\text{Rep}}$ is a direct consequence of the information about $\mathbf{U}$ that is fundamentally lost in the forward causal process, a loss captured by $H(\mathbf{U}|\mathbf{V},\mathbf{Pa})>0$ .*
C.2.4 Theoretical Guarantee for the Mitigation Strategy
We now provide a theoretical justification for the "Prior-Matching Regularizer" proposed in § 7.1.
**Definition 23 (Prior-Matching Regularizer)**
*The regularizer is defined as $R(\mathbf{T}_{\theta})=\mathbb{E}_{(\mathbf{v},\mathbf{pa})}\left[\|\mathbf{s}_{p}(\mathbf{T}_{\theta}(\mathbf{v},\mathbf{pa}))\|^{2}\right]$, where $\mathbf{s}_{p}(\mathbf{u})=\nabla_{\mathbf{u}}\log p(\mathbf{u})$ is the score function of the prior distribution $p(\mathbf{U})$.*
**Proposition 24 (Regularizer Induces Convergence to MAP Solution)**
*Minimizing the regularizer $R(\mathbf{T}_{\theta})$ provides an inductive bias that encourages the encoder output, $\hat{\mathbf{u}}=\mathbf{T}_{\theta}(\mathbf{v},\mathbf{pa})$ , to lie on a mode of the prior distribution $p(\mathbf{U})$ .*
Proof [Proof Sketch] The objective is a form of score matching on the prior. The score function $\mathbf{s}_{p}(\mathbf{u})$ is zero if and only if $\mathbf{u}$ is a stationary point of the log-prior. Minimizing the expected squared norm of the score at the encoderâs output penalizes the production of latent codes $\hat{\mathbf{u}}$ in low-probability regions of the prior. This incentivizes the encoder to map observations to the most probable latent code, thereby encouraging $\mathbf{T}_{\theta}$ to approximate the ideal MAP estimator $\mathbf{T}^{*}$ .
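For a standard-normal prior, where $\mathbf{s}_{p}(\mathbf{u})=-\mathbf{u}$, the regularizer reduces to penalizing the squared norm of the latent codes. A toy sketch (the encoder outputs below are made up):

```python
import numpy as np

score_p = lambda u: -u               # score of a standard-normal prior p(u)

# hypothetical encoder outputs u_hat = T_theta(v, pa) for three observations
u_hat = np.array([[0.1, -0.2],
                  [2.5,  3.0],       # far from the prior mode at the origin
                  [0.0,  0.05]])

per_sample = np.sum(score_p(u_hat) ** 2, axis=1)   # ||s_p(u_hat)||^2 per code
R = per_sample.mean()                              # the regularizer R(T_theta)
```

The code farthest from the prior mode dominates the penalty, so minimizing $R$ pushes the encoder toward high-probability latent codes, as the proof sketch argues.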
Appendix D Main Proofs for Theoretical Framework
This appendix provides the proofs for the main theoretical results presented in the text.
Proof [Proof of Theorem 8] This bound is a direct specialization of the more general bound for non-invertible SCMs derived in Theorem 21 (Appendix C).
For an invertible SCM, abduction is perfect, meaning the ideal amortized inverse $\mathbf{T}^{*}$ is the true inverse of the SCM function $\mathbf{F}$ . Consequently, the recovered noise is the true noise, $\mathbf{T}^{*}(\mathbf{V},\mathbf{Pa})=\mathbf{U}_{\text{true}}$ , which implies that the Representational Error is identically zero: $\mathbb{E}[E_{\text{Rep}}]=0$ .
In this context, the Latent Space Invariance Error, $E_{LSI}$, becomes equivalent to the Modeling Error, $E_{\text{Modeling}}$. Applying the inequality $(a+b)^{2}\leq 2a^{2}+2b^{2}$, the general bound from Theorem 21 reduces to the two terms presented in Theorem 8.
Proof [Proof of Proposition 10] This proof relies on the identifiability of the true SCM (Theorem 2) and the Lipschitz continuity of the score network $\boldsymbol{\epsilon}_{\theta}$, which guarantees unique ODE solutions via the Picard-Lindelöf theorem. We also assume standard integrability conditions (Fubini's theorem).
The encoder $\mathbf{T}_{\theta}$ maps an initial condition $\mathbf{x}(0)$ to the terminal state $\mathbf{x}(T)$ of the probability flow ODE (Eq. A.3). Let $\mathbf{x}_{\theta}(t;\mathbf{x}_{0})$ and $\mathbf{x}_{*}(t;\mathbf{x}_{0})$ denote the ODE solutions with the learned score $\boldsymbol{\epsilon}_{\theta}$ and the true score $\boldsymbol{\epsilon}^{*}$ , respectively. The Latent Space Invariance Error is $\mathbb{E}[E_{LSI}]=\mathbb{E}[\|\mathbf{x}_{\theta}(T;X)-\mathbf{x}_{\theta}(T;X_{\boldsymbol{\alpha}}^{\text{true}})\|^{2}]$ . By the triangle inequality:
$$
\|\mathbf{x}_{\theta}(T;X)-\mathbf{x}_{\theta}(T;X_{\boldsymbol{\alpha}}^{\text{true}})\|\leq\|\mathbf{x}_{\theta}(T;X)-\mathbf{x}_{*}(T;X)\|+\|\mathbf{x}_{*}(T;X)-\mathbf{x}_{*}(T;X_{\boldsymbol{\alpha}}^{\text{true}})\|+\|\mathbf{x}_{*}(T;X_{\boldsymbol{\alpha}}^{\text{true}})-\mathbf{x}_{\theta}(T;X_{\boldsymbol{\alpha}}^{\text{true}})\|
$$
The middle term is zero under ideal identifiability, as the true encoder $\mathbf{T}^{*}$ maps both an observation and its true counterfactual to the same underlying noise. We thus only need to bound terms of the form $\|\mathbf{x}_{\theta}(T;\mathbf{x}_{0})-\mathbf{x}_{*}(T;\mathbf{x}_{0})\|$ .
Let $\mathbf{z}(t)=\mathbf{x}_{\theta}(t;\mathbf{x}_{0})-\mathbf{x}_{*}(t;\mathbf{x}_{0})$. Since the ODE vector field is Lipschitz, applying Grönwall's inequality to the differential of $\mathbf{z}(t)$ yields:
$$
\|\mathbf{z}(T)\|\leq\int_{0}^{T}e^{L_{f}(T-t)}C_{f}\|\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{*}(t),t)-\boldsymbol{\epsilon}^{*}(\mathbf{x}_{*}(t),t)\|dt
$$
where $L_{f}$ and $C_{f}$ are constants from the ODE coefficients. Squaring, taking expectations, and applying Jensen's inequality leads to:
$$
\mathbb{E}[E_{LSI}]\leq C^{\prime}\cdot\mathbb{E}_{\mathbf{x},t}[\|\boldsymbol{\epsilon}_{\theta}-\boldsymbol{\epsilon}^{*}\|^{2}]
$$
where the final expectation is the score-matching loss.
Appendix E Proofs for Theoretical Roles of Hybrid Training
E.1 Proof of Proposition 12 (Weighted Score-Matching)
This section provides a rigorous proof that the auxiliary task loss $L_{\text{task}}$ provides a lower bound on a weighted score-matching objective.
E.1.1 Preliminaries and Setup
Probability Flow ODE.
The generative process is the solution to the reverse-time probability flow ODE from $t=T$ to $t=0$ :
$$
d\mathbf{x}_{t}=f(t,\mathbf{x}_{t},\boldsymbol{\epsilon}(t,\mathbf{x}_{t}))dt,\quad\mathbf{x}_{T}\sim\mathcal{N}(0,\mathbf{I})
$$
where the vector field $f$ is determined by the diffusion scheduler.
Core Objects.
- $\boldsymbol{\epsilon}_{\theta},\boldsymbol{\epsilon}^{*}$ : Learned and true score functions.
- $\mathbf{x}_{t}^{\theta},\mathbf{x}_{t}^{*}$ : ODE trajectories driven by $\boldsymbol{\epsilon}_{\theta}$ and $\boldsymbol{\epsilon}^{*}$ .
- $\mathbf{x}_{0}^{\theta},\mathbf{x}_{0}^{*}$ : Generated and true (counterfactual) data points at $t=0$ .
- $g:\mathbb{R}^{d}\to\mathbb{R}^{k}$ : Downstream prediction function.
- $Y_{\text{pred}}=g(\mathbf{x}_{0}^{\theta})$ , $Y_{\text{true}}=g(\mathbf{x}_{0}^{*})$ .
- $L_{\text{task}}=\mathbb{E}_{\mathbf{x}_{T}}[\|Y_{\text{pred}}-Y_{\text{true}}\|^{2}]$ .
Assumptions.
We assume standard regularity conditions for the proof:
1. The vector field $f(t,\mathbf{x},\boldsymbol{\epsilon})$ is Lipschitz continuous in $\mathbf{x}$ and $\boldsymbol{\epsilon}$ .
1. The downstream task function $g(\mathbf{x})$ is Lipschitz continuous and differentiable.
1. The learned score $\boldsymbol{\epsilon}_{\theta}$ and true score $\boldsymbol{\epsilon}^{*}$ are well-behaved.
E.1.2 Formal Proposition Statement
**Proposition 25 (Hybrid Objective as a Weighted Score-Matching Regularizer)**
*Under the regularity assumptions (A1-A3), the auxiliary task loss $L_{\text{task}}$ provides a lower bound on a weighted score-matching objective:
$$
L_{\text{task}}\geq C\cdot\mathbb{E}_{\mathbf{x}_{T},t}\left[w(\mathbf{x}_{t}^{*})\cdot\|\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{t}^{*})-\boldsymbol{\epsilon}^{*}(\mathbf{x}_{t}^{*})\|^{2}\right]
$$
where $C>0$ and the weight function $w(\mathbf{x}_{t}^{*})$ measures the sensitivity of the final prediction $Y$ to score perturbations along the ideal data generation trajectory.*
Proof The proof proceeds in three main steps.
Step 1: Bounding Sample Error by Score Error (Error Propagation).
Let $\mathbf{z}(t)=\mathbf{x}_{t}^{\theta}-\mathbf{x}_{t}^{*}$ be the error between the two ODE trajectories. By linearizing the vector field $f$ around the true trajectory, we can analyze the impact of the score perturbation $\Delta\boldsymbol{\epsilon}_{t}=\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{t}^{*})-\boldsymbol{\epsilon}^{*}(\mathbf{x}_{t}^{*})$. From Duhamel's principle, the final sample error $\Delta\mathbf{x}_{0}=\mathbf{z}(0)$ can be expressed as an integral over the perturbation:
$$
\Delta\mathbf{x}_{0}=\int_{T}^{0}\mathbf{K}(s)\Delta\boldsymbol{\epsilon}_{s}ds
$$
where the kernel $\mathbf{K}(s)$ captures the influence of a score perturbation at time $s$ on the final sample at time $0$ .
Step 2: Linearizing the Task Error.
The error in the downstream prediction is $\Delta Y=g(\mathbf{x}_{0}^{\theta})-g(\mathbf{x}_{0}^{*})$ . A first-order Taylor expansion gives:
$$
\Delta Y\approx\nabla_{\mathbf{x}}g(\mathbf{x}_{0}^{*})\cdot\Delta\mathbf{x}_{0}
$$
The task loss is the expected squared norm, $L_{\text{task}}=\mathbb{E}[\|\Delta Y\|^{2}]$ .
Step 3: Deriving the Weighted Relationship.
Substituting the integral form of $\Delta\mathbf{x}_{0}$ into the task error approximation and applying the Cauchy-Schwarz inequality for integrals yields a lower bound for the task loss:
$$
L_{\text{task}}\geq C\cdot\mathbb{E}_{\mathbf{x}_{T}}\left[\int_{0}^{T}\left\|\mathcal{W}(\mathbf{x}_{0}^{*},s)\right\|_{F}^{2}\cdot\|\Delta\boldsymbol{\epsilon}_{s}\|^{2}ds\right]
$$
where $\|\cdot\|_{F}$ is the Frobenius norm and the influence operator $\mathcal{W}(\mathbf{x}_{0}^{*},s)$ captures the end-to-end sensitivity from a score perturbation at time $s$ to the final prediction. Rewriting the expectation gives the final form:
$$
L_{\text{task}}\geq C\cdot\mathbb{E}_{t\sim U[0,T]}\mathbb{E}_{\mathbf{x}_{t}^{*}}\left[w(\mathbf{x}_{t}^{*})\cdot\|\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{t}^{*})-\boldsymbol{\epsilon}^{*}(\mathbf{x}_{t}^{*})\|^{2}\right]
$$
where the weight function is $w(\mathbf{x}_{t}^{*}):=T\cdot\mathbb{E}_{\mathbf{x}_{T}|\mathbf{x}_{t}^{*}}[\|\mathcal{W}(\mathbf{x}_{0}^{*},t)\|_{F}^{2}]$ .
Interpretation and Conclusion.
The weight function $w(\mathbf{x}_{t}^{*})$ is large when the gradient norm $\|\nabla_{\mathbf{x}}g(\mathbf{x}_{0}^{*})\|$ is large, i.e., in causally salient regions where the outcome is highly sensitive to the features. The inequality therefore shows that the weighted score-matching error is controlled by the task loss: minimizing $L_{\text{task}}$ drives down a score-matching error that is weighted to prioritize accuracy in these causally salient regions.
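The hybrid objective analyzed here simply combines the two losses with a weight $\lambda$. A minimal numpy sketch (the function name and toy inputs are ours, illustrative of the objective rather than the paper's implementation):

```python
import numpy as np

def hybrid_loss(eps_pred, eps_true, y_pred, y_true, lam=5.0):
    """Hybrid objective: epsilon-prediction (score-matching) term plus lambda-weighted task term."""
    l_diffusion = np.mean((eps_pred - eps_true) ** 2)  # standard diffusion loss
    l_task = np.mean((y_pred - y_true) ** 2)           # downstream prediction loss
    return l_diffusion + lam * l_task

# A perfect model incurs zero loss; errors in either term raise it.
rng = np.random.default_rng(0)
eps = rng.standard_normal(100)
y = rng.standard_normal(100)
print(hybrid_loss(eps, eps, y, y))        # 0.0
print(hybrid_loss(eps + 0.1, eps, y, y))  # ~0.01 (score error only)
```

Because the task term is multiplied by $\lambda$, even a small prediction error dominates the objective at large $\lambda$, which is exactly the inductive bias the proposition formalizes.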
E.2 Argument for Proposition 13 (Latent Space Disentanglement)
This section provides a qualitative, information-theoretic argument for Proposition 13, showing how the hybrid objective encourages a "division of labor" that promotes disentanglement. The SCM posits that an observation $V$ is determined by $(\mathbf{Pa},U)$ , and the encoder maps $(V,\mathbf{Pa})$ to a latent code $Z=\mathbf{T}_{\theta}(V,\mathbf{Pa})$ , where a perfect encoder would yield $Z=U$ . The process works as follows: the diffusion loss ( $L_{\text{diffusion}}$ ) maximizes the log-likelihood $\log p_{\theta}(V|\mathbf{Pa})$ , forcing the pair $(\mathbf{Pa},Z)$ to contain all information to reconstruct $V$ , thus maximizing the mutual information $I(V;(\mathbf{Pa},Z))$ . Simultaneously, the task loss ( $L_{\text{task}}$ ) learns a prediction from the parents, capturing all predictive information that $\mathbf{Pa}$ has about $V$ and thus maximizing $I(V;\mathbf{Pa})$ . The dual objective must satisfy both constraints. From the chain rule of mutual information, we know $I(V;(\mathbf{Pa},Z))=I(V;\mathbf{Pa})+I(V;Z|\mathbf{Pa})$ . Since $L_{\text{diffusion}}$ maximizes the left-hand side and $L_{\text{task}}$ captures the first term on the right, the optimization incentivizes the latent code $Z$ to model the remaining information, $I(V;Z|\mathbf{Pa})$ . This leads to a connection to disentanglement: the ideal exogenous noise $U$ is, by definition, independent of $\mathbf{Pa}$ . By forcing $Z$ to model the residual information, the optimization process actively encourages the learned representation $Z$ to be independent of $\mathbf{Pa}$ . In summary, the hybrid objective creates a division of labor: the task-specific head explains the variance from $\mathbf{Pa}$ , while the diffusion process's latent code $Z$ models the residual.
This residual is, by construction, the information in $V$ orthogonal to $\mathbf{Pa}$ , forcing $Z$ to be an empirical approximation of the true, disentangled exogenous noise $U$ , thereby serving the identifiability condition of Theorem 2.
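For jointly Gaussian variables the chain rule invoked above can be verified in closed form. The sketch below checks it on a toy linear-Gaussian SCM ( $V=2\,Pa+U$ , with $Z$ a noisy read-out of $U$ ); the construction is illustrative and not the paper's model:

```python
import numpy as np

def logdet(cov, idx):
    # log-determinant of the sub-covariance indexed by idx
    return np.linalg.slogdet(cov[np.ix_(idx, idx)])[1]

def gaussian_mi(cov, a, b):
    """I(A;B) in nats for jointly Gaussian blocks of a covariance matrix."""
    return 0.5 * (logdet(cov, a) + logdet(cov, b) - logdet(cov, a + b))

def gaussian_cmi(cov, a, b, c):
    """I(A;B|C) in nats."""
    return 0.5 * (logdet(cov, a + c) + logdet(cov, b + c)
                  - logdet(cov, c) - logdet(cov, a + b + c))

# Toy SCM: Pa ~ N(0,1), U ~ N(0,1), V = 2*Pa + U, Z = U + eta, eta ~ N(0, 0.5).
# Joint covariance of (V, Pa, Z):
cov = np.array([[5.0, 2.0, 1.0],
                [2.0, 1.0, 0.0],
                [1.0, 0.0, 1.5]])
V, Pa, Z = [0], [1], [2]
lhs = gaussian_mi(cov, V, Pa + Z)                              # I(V; (Pa, Z))
rhs = gaussian_mi(cov, V, Pa) + gaussian_cmi(cov, V, Z, Pa)    # chain rule
assert abs(lhs - rhs) < 1e-9
```

Both sides evaluate to $\tfrac{1}{2}\ln 15\approx 1.354$ nats here, decomposing into the parent term $\tfrac{1}{2}\ln 5$ and the residual term $\tfrac{1}{2}\ln 3$ that the latent code is incentivized to capture.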
Appendix F Proof of Theorem on Causal Transportability
This appendix provides the proof for the theorem on lossless causal transportability.
Proof [Proof of Theorem 17] This proof operates under an idealized setting, assuming a perfectly trained model (i.e., zero statistical error), to isolate the structural properties of transportability. We assume the learned decoder $\mathbf{H}_{\theta_{i}}$ is identical to the true mechanism $\mathbf{F}_{i}$ , and its encoder $\mathbf{T}_{\theta_{i}}$ is its perfect inverse.
Let the source and target SCMs be $\mathcal{M}^{\mathcal{S}}$ and $\mathcal{M}^{\mathcal{T}}$ , with mechanisms $\{\mathbf{F}_{i}\}$ and $\{\mathbf{F}^{\prime}_{i}\}$ respectively. The set of changed mechanisms is indexed by $\mathcal{K}_{\text{changed}}$ .
The proof relies on the modularity of the SCM, which is guaranteed by the mutually independent exogenous noises (Condition ii of the theorem). This independence ensures that a change in one mechanism $\mathbf{F}_{k}$ to $\mathbf{F}^{\prime}_{k}$ does not affect the conditional distributions of other nodes $V_{j}$ ( $j\neq k$ ), given their parents.
We analyze the transportability of operators for each mechanism:
1. For invariant mechanisms ( $j\notin\mathcal{K}_{\text{changed}}$ ): By definition, the true mechanism is unchanged, $\mathbf{F}^{\prime}_{j}=\mathbf{F}_{j}$ . Since the model operators $(\mathbf{T}_{\theta_{j}},\mathbf{H}_{\theta_{j}})$ perfectly learned $\mathbf{F}_{j}$ in the source domain, they remain valid for the target domain $\mathcal{T}$ and can be directly reused.
1. For changed mechanisms ( $k\in\mathcal{K}_{\text{changed}}$ ): The original operators $(\mathbf{T}_{\theta_{k}},\mathbf{H}_{\theta_{k}})$ are now invalid as they model $\mathbf{F}_{k}$ , not the new mechanism $\mathbf{F}^{\prime}_{k}$ . However, a new operator pair $(\mathbf{T}^{\prime}_{\theta_{k}},\mathbf{H}^{\prime}_{\theta_{k}})$ can be learned from target domain data. This training only requires samples $(v^{\prime}_{k},\mathbf{pa}^{\prime}_{k})$ from domain $\mathcal{T}$ and the shared noise distribution $p_{k}(U_{k})$ (Condition i). Because the noises are independent, this re-learning process for mechanism $k$ is modular and does not affect the other invariant mechanisms.
The procedure for adapting the model is therefore modular: freeze all invariant operators $\{(\mathbf{T}_{\theta_{j}},\mathbf{H}_{\theta_{j}})\}_{j\notin\mathcal{K}_{\text{changed}}}$ and re-train only those for the changed mechanisms $\{k\in\mathcal{K}_{\text{changed}}\}$ on target domain data. The resulting adapted model is valid for the target domain $\mathcal{T}$ and can perform abduction on a target individual by applying the correct (reused or re-trained) encoders, thereby losslessly recovering the full vector of exogenous noises. This fulfills the condition for lossless transport.
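The modular adaptation procedure can be sketched as follows. This is a hypothetical numpy sketch in which simple linear mechanisms stand in for the diffusion operators; `fit_mechanism`, `adapt`, and the data layout are ours, chosen only to illustrate freeze-and-retrain:

```python
import numpy as np

def fit_mechanism(pa, v):
    """Fit a stand-in invertible mechanism V = a*Pa + U by least squares."""
    a = np.polyfit(pa, v, 1)[0]
    return {"decode": lambda u, pa, a=a: a * pa + u,   # H_theta: (U, Pa) -> V
            "encode": lambda v, pa, a=a: v - a * pa}   # T_theta: (V, Pa) -> U (abduction)

def adapt(model, target_data, changed):
    """Freeze invariant operators; re-fit only mechanisms in `changed` on target data."""
    return {k: (fit_mechanism(*target_data[k]) if k in changed else op)
            for k, op in model.items()}

rng = np.random.default_rng(0)
pa = rng.standard_normal(500)
source = {"V1": fit_mechanism(pa, 2.0 * pa + rng.standard_normal(500)),
          "V2": fit_mechanism(pa, -1.0 * pa + rng.standard_normal(500))}
# In the target domain only V2's mechanism changed (coefficient -1 -> 3).
v2_target = 3.0 * pa + rng.standard_normal(500)
adapted = adapt(source, {"V2": (pa, v2_target)}, changed={"V2"})
assert adapted["V1"] is source["V1"]  # invariant operator reused unchanged
u_hat = adapted["V2"]["encode"](v2_target, pa)
assert np.allclose(adapted["V2"]["decode"](u_hat, pa), v2_target)  # lossless round trip
```

The final assertion mirrors the theorem's conclusion: after retraining only the changed mechanism, abduction followed by decoding reproduces the target observation without loss.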
Appendix G Proof of the Specific Finite Sample Bound (Theorem 15)
This derivation combines standard generalization bounds with known complexity bounds for deep neural networks. We first state a key lemma regarding the complexity of neural networks.
**Lemma 26 (Rademacher Complexity of Neural Networks)**
*Let $\mathcal{F}_{\epsilon}$ be the function class of an $L$ -layer MLP with ReLU activations, input dimension $p$ , and weight matrices $\{\mathbf{W}_{j}\}_{j=1}^{L}$ . Assume the input data $\mathbf{X}$ is contained within a ball of radius $R_{X}$ . If the spectral norm of each weight matrix is bounded, $\|\mathbf{W}_{j}\|_{2}\leq B_{j}$ , the Rademacher complexity of the function class is bounded by:
$$
\mathfrak{R}_{n}(\mathcal{F}_{\epsilon})\leq C_{net}\frac{R_{X}L\sqrt{p}}{\sqrt{n}}\left(\prod_{j=1}^{L}B_{j}\right)
$$
For simplicity and under normalization, we often consider $R_{X}=1$ . This result is a simplified form derived from (bartlett2017spectrally; neyshabur2018pac), where $C_{net}$ is a universal constant.*
Proof The proof proceeds by combining the standard excess risk bound with the two lemmas above. First, from standard learning theory, the excess risk is bounded by the Rademacher complexity of the total loss function class:
$$
R(\hat{\theta}_{n})-R(\theta^{*})\leq 4\mathfrak{R}_{n}(\mathcal{F}_{\mathcal{L}_{SCM}})+M\sqrt{\frac{\log(1/\delta)}{2n}}
$$
Next, by the sub-additivity property of Rademacher complexity, we decompose the SCM's complexity into the sum of its individual components:
$$
\mathfrak{R}_{n}(\mathcal{F}_{\mathcal{L}_{SCM}})\leq\sum_{i=1}^{d}\mathfrak{R}_{n}(\mathcal{F}_{\mathcal{L}_{i}})
$$
The next step is to relate the loss complexity to the network complexity. The score-matching loss for mechanism $i$ is $\mathcal{L}_{i}=\|\boldsymbol{\epsilon}-\epsilon_{\theta_{i}}(\cdot)\|^{2}$ . Since this loss is Lipschitz with respect to the output of $\epsilon_{\theta_{i}}$ , its Rademacher complexity is upper-bounded by a constant multiple of the complexity of the score network's function class $\mathcal{F}_{\epsilon_{i}}$ via Talagrand's contraction lemma, i.e., $\mathfrak{R}_{n}(\mathcal{F}_{\mathcal{L}_{i}})\leq C_{1}\cdot\mathfrak{R}_{n}(\mathcal{F}_{\epsilon_{i}})$ . We then apply Lemma 26 to bound the complexity of each score network. The input dimension $p_{i}$ for mechanism $i$ is $p_{i}=\text{dim}(\mathbf{x}_{t})+\text{dim}(\mathbf{pa}_{i})+\text{dim}(\text{time embedding})$ . Assuming univariate variables, this is $p_{i}=1+|\mathbf{Pa}_{i}|+d_{embed}$ . Let $d_{in}^{max}=\max_{i}|\mathbf{Pa}_{i}|$ , so the maximum input dimension is $p_{max}=1+d_{in}^{max}+d_{embed}$ . Assuming all networks have depth $L$ and a uniform spectral norm bound $B$ , we have:
$$
\mathfrak{R}_{n}(\mathcal{F}_{\epsilon_{i}})\leq C_{net}\frac{L\sqrt{p_{max}}}{\sqrt{n}}B^{L}=C_{net}\frac{L\sqrt{1+d_{in}^{max}+d_{embed}}}{\sqrt{n}}B^{L}
$$
Finally, we assemble the complete bound by substituting all components back into the initial inequality:
$$
R(\hat{\theta}_{n})-R(\theta^{*})\leq 4\sum_{i=1}^{d}\mathfrak{R}_{n}(\mathcal{F}_{\mathcal{L}_{i}})+M\sqrt{\frac{\log(1/\delta)}{2n}}
$$
By defining $C=4C_{1}C_{net}$ as a generic constant independent of the network architecture and sample size, we arrive at the final form stated in the theorem.
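As a numerical sanity check, the assembled bound can be evaluated directly. The sketch below sets the unspecified constants $C_{1}$ and $C_{net}$ to 1 for illustration; the values are arbitrary and only the scaling behavior matters:

```python
import math

def network_complexity(n, L, p_max, B, C_net=1.0):
    """Rademacher complexity bound of one score network (Lemma 26, with R_X = 1)."""
    return C_net * L * math.sqrt(p_max) * (B ** L) / math.sqrt(n)

def excess_risk_bound(n, d, L, p_max, B, M, delta, C1=1.0, C_net=1.0):
    """Excess risk <= 4 * d * C1 * R_n(F_eps) + M * sqrt(log(1/delta) / (2n))."""
    complexity = 4 * d * C1 * network_complexity(n, L, p_max, B, C_net)
    deviation = M * math.sqrt(math.log(1.0 / delta) / (2 * n))
    return complexity + deviation

# Both terms scale as O(1/sqrt(n)): quadrupling the sample size halves the bound.
b1 = excess_risk_bound(n=1_000, d=3, L=4, p_max=10, B=1.1, M=1.0, delta=0.05)
b4 = excess_risk_bound(n=4_000, d=3, L=4, p_max=10, B=1.1, M=1.0, delta=0.05)
assert abs(b4 - b1 / 2) < 1e-9
```

The exponential factor $B^{L}$ also makes visible why spectral norm control is essential: for $B>1$ the bound grows geometrically in depth.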
Appendix H Formal Analysis of the Geometric Inductive Bias
This section provides the formal argument for Proposition 3, demonstrating how the score-matching objective, under a simplicity bias, compels the learned generative map to adopt the local data geometry, yielding a parsimonious and well-behaved transformation.
H.1 Preliminaries and Definitions
Let $\mathcal{M}\subset\mathbb{R}^{d}$ be the data manifold with a smooth probability density $p(\mathbf{x})$ . The true score function is the vector field $\mathbf{s}^{*}(\mathbf{x}):=\nabla_{\mathbf{x}}\log p(\mathbf{x})$ . We learn a parameterized score network $\mathbf{s}_{\theta}(\mathbf{x})$ by minimizing the score-matching objective $L_{SM}(\theta)=\mathbb{E}_{\mathbf{x}\sim p(\mathbf{x})}[\|\mathbf{s}_{\theta}(\mathbf{x})-\mathbf{s}^{*}(\mathbf{x})\|^{2}]$ .
The generative process is described by the probability flow ODE, whose vector field $\mathbf{f}(\mathbf{x},t)$ is a function of the score. The map $H_{\theta}:\mathbb{R}^{d}\to\mathcal{M}$ is the flow map of this ODE integrated from $t=1$ to $t=0$ . A map is conformal if its Jacobian is a scaled orthogonal matrix, and affine if it is a linear transformation plus a translation.
H.2 Assumptions
**Assumption 4 (Smoothness of the Data Density)**
*The true data density $p(\mathbf{x})$ is at least twice continuously differentiable ( $C^{2}$ ) on $\mathcal{M}$ .*
**Assumption 5 (Adoption of the Simplicity Bias Principle)**
*Our analysis relies on the principle of implicit regularization: the conjecture that optimizers like SGD favor solutions with a simplicity bias. We formalize this as a preference for score functions $\mathbf{s}_{\theta}$ with lower complexity (e.g., lower Dirichlet Energy or smaller spectral norms) (hochreiter1997flat; neyshabur2018pac).*
H.3 Proof of Proposition 3
The proof proceeds in four steps:
Step 1: The Irrotational Property of the True Score Field. By definition, the true score $\mathbf{s}^{*}(\mathbf{x})$ is the gradient of the scalar potential $\log p(\mathbf{x})$ . By a fundamental theorem of vector calculus, the curl of any gradient field is zero. Thus, $\mathbf{s}^{*}$ is an irrotational (or conservative) vector field.
Step 2: The Variational Perspective of Score Matching. The simplicity bias (Assumption 5) reinforces the irrotational nature of the learned field $\mathbf{s}_{\theta}$ . This is understood via the Helmholtz decomposition, which splits any vector field into irrotational (curl-free) and solenoidal (divergence-free) components. The simplicity bias conjectures that the optimization preferentially minimizes the energy of the solenoidal component, driving the learned field $\mathbf{s}_{\theta}$ towards a purely irrotational solution ( $\nabla\times\mathbf{s}_{\theta}\to\mathbf{0}$ ) to match the conservative nature of the true score $\mathbf{s}^{*}$ .
Step 3: The Geometry of the Score Field under Local Structural Assumptions. We analyze the score field's structure under two local geometric cases for the density $p(\mathbf{x})$ in a region $\mathcal{R}$ . Case (i): Local Isotropy. If $p(\mathbf{x})$ is locally spherically symmetric around a center $\mathbf{c}$ , its value depends only on the radius $r=\|\mathbf{x}-\mathbf{c}\|$ . The gradient $\mathbf{s}^{*}=\nabla\log p(\mathbf{x})$ must point along the radial direction, making the true score a radial vector field. The learned field converges to this simple structure. Case (ii): Local Ellipsoidal Structure. If the iso-contours of $p(\mathbf{x})$ are concentric ellipsoids, the gradient $\mathbf{s}^{*}$ must be orthogonal to these ellipsoidal surfaces at every point. Such a field is still irrotational but no longer radial.
Step 4: The Geometry of the Flow Map. The ODE vector field $\mathbf{f}(\mathbf{x},t)$ is a linear combination of the score field $\mathbf{s}_{\theta^{*}}(\mathbf{x})$ and the position vector $\mathbf{x}$ . The geometry of the resulting flow map $H_{\theta^{*}}$ depends on this vector field's geometry. Result for Case (i): In the isotropic case, $\mathbf{f}(\mathbf{x},t)$ is also a radial vector field. A flow generated by a radial vector field is, by its rotational symmetry, necessarily a conformal map, as it preserves angles. Result for Case (ii): In the ellipsoidal case, the flow must transform the isotropic latent space into the anisotropic data space. The simplest such transformation is a local affine transformation, involving direction-dependent scaling and rotation. While this geometric argument is standard, a fully rigorous proof would require directly computing the Jacobian of the integrated flow map $H_{\theta^{*}}$ to verify it satisfies the required mathematical conditions. The model's inductive bias thus compels it to learn the most parsimonious geometric transformation necessary to explain the local data geometry, promoting the well-behaved, locally invertible mappings that are crucial for abduction.
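Step 1 above is easy to verify numerically: the score of any smooth density is a gradient field and therefore curl-free. The sketch below checks this with central finite differences for a 2-D Gaussian, an illustrative choice of $p(\mathbf{x})$ :

```python
import numpy as np

# Score of N(0, Sigma): s(x) = -inv(Sigma) @ x, i.e. the gradient of log p(x).
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
P = np.linalg.inv(Sigma)
score = lambda x: -P @ x

def curl_2d(field, x, h=1e-5):
    """2-D curl, d s2/dx1 - d s1/dx2, via central differences."""
    e1, e2 = np.array([h, 0.0]), np.array([0.0, h])
    ds2_dx1 = (field(x + e1)[1] - field(x - e1)[1]) / (2 * h)
    ds1_dx2 = (field(x + e2)[0] - field(x - e2)[0]) / (2 * h)
    return ds2_dx1 - ds1_dx2

rng = np.random.default_rng(0)
for _ in range(5):
    assert abs(curl_2d(score, rng.standard_normal(2))) < 1e-6  # gradient fields are curl-free
```

For this linear score the curl vanishes because the Jacobian $-\Sigma^{-1}$ is symmetric; the same symmetry-of-cross-partials argument underlies the general claim.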
Appendix I Experimental Details
This appendix provides comprehensive details for all experiments presented in Section 5, ensuring full transparency and reproducibility.
I.1 General Setup
Software Environment.
All experiments were conducted in a unified software environment to ensure full reproducibility. Key library versions used were: dowhy (0.12), econml (0.16.0), numpy (1.26.4), pandas (1.3.5), scikit-learn (1.6.1), torch (1.10.0), lightgbm (4.6.0), and networkx (3.2.1). All stochastic processes were controlled with a fixed global random seed (42), except for the ensemble runs, which used a set of distinct seeds for training.
Baseline Estimator Configurations.
All baseline estimators were implemented using the dowhy library. For machine learning-based methods like Causal Forest and DML, we utilized their robust implementations from the econml library. To ensure a fair and reproducible comparison against established benchmarks, we used their default hyperparameter settings, which are widely recognized and have been optimized by the library authors to provide strong performance across a broad range of tasks. Our proposed model, in turn, underwent a systematic grid search; the final parameters and a detailed analysis are provided in Appendix I.2.
Details for PSM Failure Scenario (Act I).
The Data Generation Process (DGP) for this experiment ( $N=5000$ , true ATE $\tau=5000$ ) follows the graph in Figure 3(a) and is defined by:
$$
W_{1},W_{2}\sim\mathcal{N}(0,1)
$$
with exogenous noises $U_{T}\sim\text{Logistic}(0,1)$ and $U_{Y}\sim\mathcal{N}(0,6000^{2})$ .
Details for Lalonde Benchmark (Act I).
To evaluate performance on real-world data, we used the canonical Lalonde dataset, analyzing the effect of the NSW job training program (treat) on 1978 real earnings (re78). The assumed causal structure is the standard confounding model (Figure 3(c)), a DAG structure that reflects the broad consensus in the causal inference community for this benchmark. In line with our Targeted Modeling principle, we applied the expressive CausalDiffusionModel only to the key treat and re78 nodes, modeling all confounders non-parametrically via their empirical distributions.
Details for Semi-Synthetic Analysis (Act II).
This dataset uses the real-world covariates from the Lalonde dataset as a foundation. The outcome $Y$ and the ground-truth Individual Treatment Effect ( $ITE_{\text{true}}$ ) are then synthetically generated according to the following structural equations:
$$
Y_{\text{base}}=2\cdot X_{\text{re74}}+1.5\cdot X_{\text{re75}}+100\cdot X_{\text{educ}}-50\cdot X_{\text{age}}
$$
where $U_{\text{base}}\sim\mathcal{N}(0,500^{2})$ and $\mu_{\text{re74}}$ is the mean of the re74 covariate.
Details for the Stress Test on Non-Invertible SCMs (Act IV).
This experiment's primary objective is to evaluate the framework's robustness when the core theoretical assumption of SCM invertibility is explicitly violated. The DGP ( $N=2000$ ) is defined as follows:
$$
W\sim\mathcal{U}(-2,2)
$$
The true ATE is exactly $\tau=5.0$ . The SCM was structured with W as an EmpiricalDistribution, and T and Y as CausalDiffusionModel.
Details for the Ablation Study (Act IV).
The ablation study was conducted on a challenging synthetic mediation dataset ( $N=4000$ ) designed to highlight the benefits of generative models. The causal graph is shown in Figure 3(b), with the following data generation process:
$$
X_{1}\sim\mathcal{N}(0,1),\quad X_{2}\sim\mathcal{U}(-2,2),\quad Z\sim\text{Bernoulli}(0.5)
$$
where $\sigma(\cdot)$ is the sigmoid function, $U_{T}\sim\text{Logistic}(0,0.3)$ , $U_{M}\sim\mathcal{N}(0,1.5^{2})$ , and $U_{Y}$ is drawn from a Gaussian Mixture Model conditioned on $Z$ . The true ATE for this DGP is approximately $202.29$ . The four ablation configurations are:
- BELM-MDCM (Full Model): The complete proposed framework. The nodes T, M, and Y are all modeled by a CausalDiffusionModel with sampler_type='belm' and a hybrid objective weight of $\lambda=5.0$ .
- w/o Analytical Invertibility: Identical to the full model, but the sampler_type for all diffusion models was set to 'ddim'.
- w/o Hybrid Objective: Identical to the full model, but the hybrid objective weight $\lambda$ for all diffusion models was set to $0.0$ .
- w/o Targeted Modeling: The key mediator node M was modeled with a simpler gcm.AdditiveNoiseModel (backed by an LGBMRegressor), while T and Y remained as CausalDiffusionModels with $\lambda=5.0$ and sampler_type='belm'.
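These four configurations can be summarized in a small dictionary. This is an illustrative Python sketch of the settings listed above; the key names and string tags are ours, not the dowhy/gcm API:

```python
# Illustrative summary of the four ablation variants (key names are ours).
ablation_configs = {
    "full_model": {
        "nodes": {"T": "diffusion", "M": "diffusion", "Y": "diffusion"},
        "sampler_type": "belm", "lambda": 5.0},
    "wo_analytical_invertibility": {  # swap the invertible BELM sampler for DDIM
        "nodes": {"T": "diffusion", "M": "diffusion", "Y": "diffusion"},
        "sampler_type": "ddim", "lambda": 5.0},
    "wo_hybrid_objective": {          # disable the task term of the hybrid loss
        "nodes": {"T": "diffusion", "M": "diffusion", "Y": "diffusion"},
        "sampler_type": "belm", "lambda": 0.0},
    "wo_targeted_modeling": {         # mediator M falls back to an additive-noise model
        "nodes": {"T": "diffusion", "M": "additive_noise", "Y": "diffusion"},
        "sampler_type": "belm", "lambda": 5.0},
}
# Each ablation changes exactly one ingredient relative to the full model.
```

Laying the variants out this way makes explicit that each ablation isolates a single design choice, which is what licenses attributing any performance drop to that component.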
I.2 Model Hyperparameter Justification
The hyperparameters for our BELM-MDCM model, reported in Table 9, were identified through a systematic grid search for each experiment. The variation in these parameters reflects principled adaptations to different data characteristics, guided by the trade-off between generative fidelity and discriminative accuracy, as well as the signal-to-noise ratio (SNR) of the underlying causal relationships. Below, we provide a holistic analysis for each scenario.
I.2.1 Hyperparameter Search Details
To ensure full reproducibility, we detail the hyperparameter search space used to arrive at the configurations in Table 9. The final values for each experiment were selected based on the best performance on a held-out validation set, typically comprising 20-30% of the training data. The primary selection criterion was the lowest PEHE score for experiments with a ground-truth ITE (e.g., Semi-Synthetic), or the best balance of low absolute ATE error and low estimation variance for observational data scenarios. For the Lalonde experiment in particular, we prioritized a configuration that demonstrated superior stability, as detailed in the analysis below.
Table 8: Hyperparameter Search Space and Selection Criteria.
| Hyperparameter | Search Space | Selection Criterion |
| --- | --- | --- |
| Hybrid Weight $\lambda$ | {0.0, 0.1, 0.3, 1.0, 2.0, 5.0, 10.0} | Best balance of low error and low variance on the validation set. |
| Guidance Weight $w$ | {0.0, 0.1, 0.2, 0.3, 1.0, 5.0, 10.0} | |
| Diffusion Timesteps $T$ | {50, 100, 200, 500} | |
| Learning Rate | {5e-5, 1e-4, 1.1e-4, 1.2e-4, 2e-4} | |
Table 9: Consolidated BELM-MDCM Hyperparameters for All Key Experiments. Column headers correspond to the following experiments: PSM (PSM Failure), Lalonde, Semi-Synth (Semi-Synthetic), Ablation (Ablation Study), and Stress Test (Act IV).
| Hyperparameter | PSM | Lalonde | Semi-Synth | Ablation | Stress Test |
| --- | --- | --- | --- | --- | --- |
| Number of Epochs | 1500 | 1000 | 1200 | 700 | 500 |
| Batch Size | 128 | 64 | 64 | 128 | 128 |
| Hidden Dimension | 512 | 512 | 768 | 768 | 256 |
| Learning Rate | $1\times 10^{-4}$ | $1\times 10^{-4}$ | $1.1\times 10^{-4}$ | $1\times 10^{-4}$ | $1\times 10^{-4}$ |
| Diffusion Timesteps ( $T$ ) | 200 | 200 | 50 | 200 | 200 |
| Hybrid Weight $\lambda$ | 0.1 | 2.0 | 2.0 | 5.0 | 0.5 |
| Guidance Weight ( $w$ ) | 0.0 | 1.0 | 0.1 | 0.2 | 0.0 |
Scenario 1: The âPerfectionist Learnerâ (PSM Failure Experiment)
Data Profile: This synthetic dataset features complex, non-linear functions (e.g., sin, cos) but is characterized by a pure, high-SNR signal. The main challenge is to perfectly learn this intricate generative process. Hyperparameter Strategy: The strategy prioritizes generative modeling. A low hybrid weight ( $\lambda=0.1$ ) directs the model to focus on the diffusion loss. Since the conditional signal is strong, classifier-free guidance is disabled ( $w=0.0$ ). A large model capacity and extended training are employed to capture the ground-truth functions.
Scenario 2: The âPragmatic Signal Extractorâ (Lalonde Experiment)
Data Profile: This real-world data is characterized by a weak, noisy signal and a small sample size. The primary challenge is to robustly extract the causal signal. Hyperparameter Strategy: The priority shifts from pure accuracy to a balance of accuracy and stability. A high hybrid weight ( $\lambda=2.0$ ) provides a strong inductive bias towards the predictive task. Crucially, our systematic grid search revealed that high guidance weights ( $w>1.0$ ) dramatically increased estimation variance, leading to unreliable results. We therefore selected a moderate guidance weight of $w=1.0$ . This configuration provides a stabilizing effect, guiding the model towards the causal signal without amplifying noise, thereby achieving the optimal balance between accuracy and the robustness required for reliable inference on real-world data.
Scenario 3: The âPrecision Artistâ (Semi-Synthetic Experiment)
Data Profile: This setup presents a hybrid-SNR environment, using noisy real-world covariates but a pure, synthetic outcome function. The task demands high precision. Hyperparameter Strategy: The diffusion timesteps are reduced ( $T=50$ ), plausibly because a shorter generative path better preserves the fine-grained details of the synthetic ITE function. The hybrid weight is high ( $\lambda=2.0$ ) to focus on the estimation task, while guidance is light ( $w=0.1$ ) as the ITE signal is cleaner than in the purely observational Lalonde case.
Scenario 4: The âComponent Analystâ (Ablation Study)
Data Profile: This is a complex, synthetic mediation structure with a clean, high-SNR signal, specifically designed to isolate the performance impact of individual framework components. Hyperparameter Strategy: The strategy is to create a strong, stable baseline. A high hybrid weight ( $\lambda=5.0$ ) heavily orients the model towards the predictive task, making any performance degradation from ablations more pronounced. A large model capacity (Hidden Dim=768) ensures the model can learn the complex functions, while light guidance ( $w=0.2$ ) provides a minor stabilizing effect.
Scenario 5: The âRobustness Testerâ (Stress Test)
Data Profile: A simple DGP featuring a non-invertible causal mechanism ( $Y\propto U^{2}$ ), designed to test the frameworkâs behavior when its core assumption is violated. Hyperparameter Strategy: The goal is stable learning of a simple function. A moderate hybrid weight ( $\lambda=0.5$ ) balances generative and predictive learning, while a smaller model (Hidden Dim=256) is sufficient for the simpler DGP and helps prevent overfitting. Guidance is turned off ( $w=0.0$ ) as the signal is clean.
Holistic Comparison
Table 10: Summary of Adaptive Hyperparameter Strategies.
| | PSM | Lalonde | Semi-Synth | Ablation | Stress Test |
| --- | --- | --- | --- | --- | --- |
| Data Signal | Pure & Complex | Noisy & Weak | Hybrid & Complex | Pure & Mediated | Pure & Non-Invertible |
| Guidance ( $w$ ) | 0.0 (Off) | 1.0 (Balanced Guidance) | 0.1 (Slight Nudge) | 0.2 (Light) | 0.0 (Off) |
| Timesteps ( $T$ ) | 200 (Standard) | 200 (Standard) | 50 (Short Path) | 200 (Standard) | 200 (Standard) |
| Hybrid ( $\lambda$ ) | 0.1 (Generative) | 2.0 (Predictive) | 2.0 (Predictive) | 5.0 (Strongly Predictive) | 0.5 (Balanced) |
| Core Strategy | Perfect Generation | Robust Prediction | Precision Estimation | Component Isolation | Stable Learning |
In conclusion, the variability in hyperparameters demonstrates the flexibility of our framework. It showcases the model's ability to deploy different tools, such as prioritizing the generative loss or amplifying weak signals with guidance, to optimally adapt to the specific challenges posed by diverse causal inference problems.
I.3 CMF Metric Implementation Details
To ensure the rigor and reproducibility of our evaluation, this section provides the implementation details for the CMF scores.
CMI-Score Estimation Method.
For Conditional Mutual Information (CMI) estimation, we adopt the widely-used k-Nearest Neighbors (k-NN) based estimator from Kraskov et al. (kraskov2004estimating). This method is chosen for its strong theoretical properties and practical robustness, particularly for continuous and high-dimensional data, as it avoids explicit density estimation. Its key advantages include:
- Non-parametric: It makes no assumptions about the underlying data distributions.
- Data-adaptive: The estimation is based on local distances, adapting to the data manifold's geometry.
- Robustness: It is more robust than methods that rely on fixed binning, which can be sensitive to bin size.
Following common practice, we set the number of neighbors to $k=5$ for all our experiments to ensure a stable and reliable estimation.
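A minimal version of the Kraskov (KSG) k-NN estimator can be sketched as follows, for unconditional mutual information (the CMI variant extends it by conditioning the neighbor counts on the conditioning set). This is an illustrative implementation, not the exact code used in our pipeline:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_mi(x, y, k=5):
    """Kraskov et al. (2004) k-NN mutual information estimator (in nats)."""
    x, y = x.reshape(len(x), -1), y.reshape(len(y), -1)
    n = len(x)
    joint = np.hstack([x, y])
    # Distance to the k-th nearest neighbor in the joint space (Chebyshev norm).
    eps = cKDTree(joint).query(joint, k=k + 1, p=np.inf)[0][:, -1]
    # Count strictly-closer neighbors in each marginal space (excluding the point itself).
    nx = np.array([len(s) - 1 for s in cKDTree(x).query_ball_point(x, eps - 1e-12, p=np.inf)])
    ny = np.array([len(s) - 1 for s in cKDTree(y).query_ball_point(y, eps - 1e-12, p=np.inf)])
    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))

rng = np.random.default_rng(0)
u = rng.standard_normal(2000)
mi_dep = ksg_mi(u, u + 0.3 * rng.standard_normal(2000))  # strongly dependent pair
mi_ind = ksg_mi(u, rng.standard_normal(2000))            # independent pair, MI near 0
```

On this toy pair the dependent estimate lands near the analytic Gaussian value of about 1.25 nats while the independent estimate stays near zero, illustrating the estimator's density-free behavior at $k=5$.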
KMD-Score Kernel Parameter Selection.
The performance of the MMD test is sensitive to kernel parameter choices. For the KMD-Score, we employed a standard Radial Basis Function (RBF) kernel, $k(x,y)=\exp(-\|x-y\|^{2}/(2\sigma^{2}))$ . The bandwidth parameter $\sigma$ is critical. Following best practices, we set the bandwidth using the median heuristic, a robust and common data-driven approach. For the analysis on the Lalonde dataset's re78 mechanism, this heuristic yielded a bandwidth of $\sigma=0.1$ , a value confirmed as effective in preliminary experiments. All KMD-Scores are reported using this configuration.
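The kernel and bandwidth choice can be sketched as follows. This is a minimal biased (V-statistic) estimator of the squared MMD with one common variant of the median heuristic (setting $\sigma$ to the median pairwise distance); it is illustrative rather than our exact evaluation code:

```python
import numpy as np

def rbf_mmd2(x, y, sigma=None):
    """Biased squared MMD with an RBF kernel; median-heuristic bandwidth if sigma is None."""
    z = np.concatenate([x, y]).reshape(-1, 1)
    d2 = (z - z.T) ** 2                         # pairwise squared distances
    if sigma is None:
        sigma = np.sqrt(np.median(d2[d2 > 0]))  # median heuristic (one common variant)
    k = np.exp(-d2 / (2 * sigma ** 2))
    n = len(x)
    kxx, kyy, kxy = k[:n, :n], k[n:, n:], k[:n, n:]
    return kxx.mean() + kyy.mean() - 2 * kxy.mean()

rng = np.random.default_rng(0)
a = rng.standard_normal(300)
b = rng.standard_normal(300)        # same distribution as a: MMD^2 near zero
c = rng.standard_normal(300) + 2.0  # shifted distribution: MMD^2 clearly positive
```

Here `rbf_mmd2(a, b)` stays near zero while `rbf_mmd2(a, c)` is clearly positive, which is the separation property the KMD-Score relies on.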
Appendix J Empirical Validation of Proposed Evaluation Metrics
To empirically validate the reliability, sensitivity, and complementary nature of our proposed evaluation metrics (CIC-Score, CMI-Score, and KMD-Score), we conducted a controlled micro-simulation study assessing how each metric responds to a spectrum of increasingly severe model errors.
Experimental Setup.
We designed a simple ground-truth Structural Causal Model (SCM) and created five simulated models (A through E) representing a clear "degradation gradient" of model quality:
- Model A (Oracle): Represents a theoretically perfect model with zero SRE and a perfectly learned causal mechanism. This serves as our gold standard, with expected scores of 1.0.
- Model B (Lossy Inverter): Simulates a model with a non-zero SRE. It uses the correct causal mechanism but introduces a systematic error during the reconstruction cycle, mimicking the core flaw of DDIM-based approaches.
- Model C (Wrong Mechanism): Represents a model that fails to learn the correct functional form of the causal mechanism (e.g., learning a linear instead of a sinusoidal relationship).
- Model D (Maximal Error): Simulates a more severe failure where both the causal mechanism and the inferred noise distribution are fundamentally incorrect.
- Model E (Total Mismatch): Represents the worst-case scenario where the model ignores the causal graph and inputs, generating outputs from an unrelated distribution.
The numerical scores underlying Figure 10 are:

| Model | CIC-Score | CMI-Score | KMD-Score |
| --- | --- | --- | --- |
| A: Oracle | 1.000 | 0.999 | 1.000 |
| B: Lossy Inverter (SRE Error) | 0.154 | 0.994 | 0.971 |
| C: Wrong Mechanism | 0.168 | 0.949 | 0.822 |
| D: Maximal Error | 0.157 | 0.836 | 0.790 |
| E: Total Mismatch | 0.136 | 0.758 | 0.570 |
Figure 10: Results of the micro-simulation study for metric validation. The three plots show the response of the CIC-Score, CMI-Score, and KMD-Score to five models (A-E) of progressively decreasing quality. The scores demonstrate a clear monotonic degradation, confirming their ability to reliably track model fidelity. Note the CIC-Score's sharp drop from Model A to B, highlighting its specific sensitivity to the Structural Reconstruction Error (SRE).
Analysis of Results.
The results (Figure 10) demonstrate the distinct and complementary roles of our proposed metrics:
1. The CIC-Score acts as a high-sensitivity "SRE detector." It exhibits a dramatic drop from a perfect 1.0 (Model A) the moment SRE is introduced (Model B), while showing less sensitivity to the specific form of mechanism error. This confirms its primary role as a diagnostic for adherence to the Causal Information Conservation principle.
2. The CMI-Score serves as a robust "mechanism association tracker." It degrades gracefully and monotonically as the learned causal mechanism deviates from the ground truth (from 0.99 to 0.76). This demonstrates its utility in quantifying the fidelity of learned parent-child conditional dependencies.
3. The KMD-Score functions as the "final arbiter" of distributional fidelity. Possessing the widest dynamic range, it is sensitive to all forms of error and provides a holistic judgment of the similarity between the generated and true counterfactual distributions. As the most rigorous metric, it correctly assigns the lowest score to the completely mismatched Model E.
This simulation thus validates our proposed metrics as a reliable and nuanced evaluation framework. They work in synergy to diagnose specific model failings (CIC-Score), assess relational accuracy (CMI-Score), and provide an overall quality judgment (KMD-Score), offering a more insightful assessment than traditional metrics alone.
Discussion on the Non-Zero Lower Bound of Scores.
Notably, even for the worst-performing model (Model E), the CMI and KMD scores do not fall to zero. This behavior is not a limitation but a desirable feature that reflects their ability to capture "residual statistical structure" in the evaluation setting.
- For the KMD-Score: The MMD compares the joint distributions $P_{\text{model}}(Y,W,T)$ and $P_{\text{oracle}}(Y,W,T)$. Crucially, since all models operate on the same observed parent data, the marginal parent distribution $P(W,T)$ is identical between the two; the difference lies only in the conditional $P(Y|W,T)$. Because the joint distributions share this substantial common subspace, their MMD remains finite, preventing the KMD-Score from reaching zero. Furthermore, applying a `StandardScaler` maps both distributions into a similar feature space, which is necessary for a fair, scale-invariant comparison.
- For the CMI-Score: This metric quantifies the conditional mutual information $I(Y;\text{parent}\mid\text{other parents})$. Even an incorrect mechanism (as in Model C or D) still generates an output $Y$ as a deterministic function of its parents, resulting in non-zero CMI. For Model E, where $Y$ is independent of its parents, the theoretically zero CMI is not observed in practice due to the inherent finite-sample variance of the non-parametric k-NN estimator.
This behavior is advantageous, ensuring the metrics provide a meaningful, continuous gradient of failure rather than a simplistic binary judgment. This enhances their diagnostic power, allowing for fine-grained distinctions between types and degrees of model imperfection.
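The shared-marginal argument for the KMD-Score's non-zero floor can be checked numerically. Below is a minimal sketch, assuming a biased RBF-kernel MMD estimator with a fixed bandwidth (`gamma=1.0`) and omitting the `StandardScaler` step; even a totally mismatched conditional yields a finite MMD because both joint samples share the same parent marginal $P(W)$.

```python
import numpy as np

def mmd2_rbf(X, Y, gamma=1.0):
    """Biased RBF-kernel estimate of MMD^2 between samples X and Y."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
n = 500
w = rng.uniform(-2, 2, n)          # shared parent marginal P(W)
u = rng.normal(0, 0.1, n)

# Joint samples (Y, W): same P(W), different conditionals P(Y|W).
oracle = np.column_stack([np.sin(w) + u, w])
mismatch = np.column_stack([rng.normal(0, 1, n), w])  # Y independent of W

# The shared P(W) component keeps the joints close, so MMD^2 is
# positive but bounded well away from its maximum.
print(f"MMD^2 (oracle vs total mismatch): {mmd2_rbf(oracle, mismatch):.4f}")
print(f"MMD^2 (oracle vs itself):        {mmd2_rbf(oracle, oracle):.4f}")
```

The same intuition explains the CMI floor: a k-NN mutual-information estimate on the `mismatch` sample fluctuates around, but rarely hits, exactly zero at finite $n$.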
Appendix K Algorithm for ATE Estimation
This appendix provides the detailed pseudo-code for the counterfactual imputation procedure used to estimate the Average Treatment Effect (ATE) in our experiments.
Input: Observational data $\mathcal{D}=\{v^{(j)}\}_{j=1}^{N}$ where $v^{(j)}\in\mathbb{R}^{d}$; Causal graph $\mathcal{G}$.
Output: Estimated Average Treatment Effect ($\hat{\text{ATE}}$).
Define Invertible SCM $\mathcal{M}_{\theta}$:
$\mathcal{M}_{\theta}=\{f_{i}(\cdot;\theta_{i})\}_{i=1}^{d}$ based on $\mathcal{G}$, where $v_{i}=f_{i}(\mathbf{pa}_{i},u_{i})$;
// 1. Train the invertible SCM on observational data
$\hat{\theta}\leftarrow\arg\min_{\theta}\sum_{j=1}^{N}\sum_{i=1}^{d}\mathcal{L}_{i}\left(f_{i}(\mathbf{pa}_{i}^{(j)};\theta_{i}),v_{i}^{(j)}\right)$;
// 2. Generate counterfactual outcomes for each individual
for $j\leftarrow 1$ to $N$ do
$t_{j}\leftarrow v_{T}^{(j)}$;
// Observed treatment
$\mathbf{u}^{(j)}\leftarrow\mathcal{M}_{\hat{\theta}}^{-1}(v^{(j)})$;
// Abduction via invertible BELM encoder
$y_{j}(1-t_{j})\leftarrow\operatorname{Predict}\left(\mathcal{M}_{\hat{\theta}},\mathbf{u}^{(j)},\operatorname{do}(T:=1-t_{j})\right)$;
// Action & Prediction
// Store factual and counterfactual outcomes
$\mathbf{Y}_{j}(t_{j})\leftarrow v_{Y}^{(j)}$;
$\mathbf{Y}_{j}(1-t_{j})\leftarrow y_{j}(1-t_{j})$;
end for
// 3. Compute the ATE from factual and counterfactual outcomes
$\hat{\text{ATE}}\leftarrow\frac{1}{N}\sum_{j=1}^{N}\left(\mathbf{Y}_{j}(1)-\mathbf{Y}_{j}(0)\right)$;
return $\hat{\text{ATE}}$;
Algorithm 1 ATE Estimation with Invertible SCMs via Counterfactual Imputation
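Algorithm 1 can be traced end-to-end on a toy additive-noise SCM in which abduction is an exact inversion (zero SRE). The linear mechanism, its coefficients, and the OLS training step below are illustrative assumptions standing in for the learned invertible SCM $\mathcal{M}_{\theta}$; the true ATE in this toy model is 2.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5000

# Toy SCM (assumed for illustration): W -> T, (W, T) -> Y.
# Ground truth: Y = 2*T + 0.5*W + U_Y, so the true ATE is 2.0.
w = rng.normal(size=N)
t = (rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-w))).astype(float)
u_y = rng.normal(scale=0.5, size=N)
y = 2.0 * t + 0.5 * w + u_y

# Step 1: "train" the outcome mechanism. OLS on [T, W, 1] stands in
# for fitting the invertible SCM M_theta.
X = np.column_stack([t, w, np.ones(N)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def f_y(t_, w_, u_):
    # Additive-noise mechanism: invertible in u_ by construction.
    return beta[0] * t_ + beta[1] * w_ + beta[2] + u_

# Step 2a: abduction -- exact inversion recovers U_Y with zero SRE.
u_hat = y - f_y(t, w, 0.0)

# Step 2b: action & prediction -- do(T := 1 - t), reusing abducted noise.
y_cf = f_y(1.0 - t, w, u_hat)

# Step 3: ATE from factual and imputed counterfactual outcomes.
y1 = np.where(t == 1, y, y_cf)
y0 = np.where(t == 0, y, y_cf)
ate_hat = (y1 - y0).mean()
print(f"Estimated ATE: {ate_hat:.3f} (ground truth: 2.000)")
```

Because the mechanism is additive in the noise, re-simulating with the abducted `u_hat` reproduces the factual outcomes exactly; any gap there would be precisely the Structural Reconstruction Error the paper aims to eliminate.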