# The Causal Round Trip: Generating Authentic Counterfactuals by Eliminating Information Loss
**Authors**:
- Rui Wu (wurui22@mail.ustc.edu.cn), School of Management, University of Science and Technology of China
- Lizheng Wang (lzwang@ustc.edu.cn), School of Management, University of Science and Technology of China
- Yongjun Li (lionli@ustc.edu.cn), School of Management, University of Science and Technology of China
> Corresponding author.
Abstract
Judea Pearl's vision of Structural Causal Models (SCMs) as engines for counterfactual reasoning hinges on faithful abduction: the precise inference of latent exogenous noise. For decades, operationalizing this step for complex, non-linear mechanisms has remained a significant computational challenge. The advent of diffusion models, powerful universal function approximators, offers a promising solution. However, we argue that their standard design, optimized for perceptual generation over logical inference, introduces a fundamental flaw for this classical problem: an inherent information loss we term the Structural Reconstruction Error (SRE). To address this challenge, we formalize the principle of Causal Information Conservation (CIC) as the necessary condition for faithful abduction. We then introduce BELM-MDCM, the first diffusion-based framework engineered to be causally sound by eliminating SRE by construction through an analytically invertible mechanism. To operationalize this framework, a Targeted Modeling strategy provides structural regularization, while a Hybrid Training Objective instills a strong causal inductive bias. Rigorous experiments demonstrate that our Zero-SRE framework not only achieves state-of-the-art accuracy but, more importantly, enables the high-fidelity, individual-level counterfactuals required for deep causal inquiries. Our work provides a foundational blueprint that reconciles the power of modern generative models with the rigor of classical causal theory, establishing a new and more rigorous standard for this emerging field.
Keywords: Causal Inference, Diffusion Models, Causal Information Conservation, Structural Causal Models, Counterfactual Generation, BELM, Structural Reconstruction Error
1 Introduction
The fundamental challenge of causal inference, as articulated by rubin1974estimating, is our inability to simultaneously observe an individual's potential outcomes. Generating authentic counterfactuals is thus the field's grand challenge. Structural Causal Models (SCMs), introduced by pearl2009causality, provide the formal language for this pursuit. An SCM posits that an outcome $V_{i}$ is generated by a function of its parents $\mathbf{Pa}_{i}$ and a unique exogenous noise variable $U_{i}$ . This noise, $U_{i}$ , represents the primordial causal information: the collection of unobserved factors unique to an individual. This concept aligns directly with the long-standing focus in econometrics on unobserved individual heterogeneity, a central challenge in structural modeling for decades (heckman2001micro). Pearl's framework for causal reasoning, the Abduction-Action-Prediction cycle, hinges on the fidelity of the first step: abduction. To answer any "what if" question, one must first perfectly infer this primordial information $U_{i}$ from an observed outcome $v_{i}$ . For decades, while this theoretical blueprint was clear, its practical realization for complex, non-linear mechanisms remained a major computational hurdle, often addressed in econometrics through strong parametric assumptions or linear approximations (angrist2008mostly).
The advent of deep generative models, particularly diffusion models (ho2020denoising), offers a powerful new hope for bridging this gap. As near-universal function approximators, they possess the expressive power to learn the complex, non-linear functions that have long challenged classical methods (chao2023interventional; sanchez2022dcms). However, this promise is shadowed by a critical, yet overlooked, "impedance mismatch." These models were engineered for perceptual tasks like image synthesis, where visual plausibility is paramount, not for the logical rigor demanded by causal abduction. We argue that their standard design, which relies on approximate inversion schemes like DDIM (song2021denoising), is fundamentally at odds with the strict requirements of this classical causal problem.
In this work, we diagnose and resolve this conflict. We begin by giving the classic requirement for faithful abduction a modern name: Causal Information Conservation (CIC). We define CIC operationally as the lossless, deterministic recovery of the exogenous noise variable $U$ . Its novelty lies in its application as a design principle and diagnostic tool for the diffusion-model paradigm in causality, rather than as a formal information-theoretic quantity; connecting this operational principle to formal measures, such as mutual information, is a compelling avenue for future research. Our core contribution is the identification that standard diffusion models systematically violate this principle due to an inherent algorithmic flaw. We formalize this flaw as the Structural Reconstruction Error (SRE): a quantifiable information loss that imposes a hard theoretical ceiling on the fidelity of any counterfactual generated by such methods. The SRE is not an estimation error to be solved with more data, but a structural defect in the tool itself.
To solve the long-standing challenge of operationalizing faithful abduction, we introduce BELM-MDCM. It is not merely a new model, but the first diffusion-based framework re-engineered from first principles to be causally sound. Architected around an analytically invertible sampler (liu2024belm), it is the first Zero-SRE causal framework by construction. This design choice reconciles the expressive power of modern diffusion models with the logical rigor of Pearlâs causal theory, ensuring the abduction step is lossless. Our primary contributions are therefore:
1. Diagnosing a Fundamental Barrier in a Classic Problem. We are the first to identify that standard diffusion models, when applied to the classic problem of SCM abduction, suffer from a structural flaw we term the Structural Reconstruction Error (SRE), which violates the foundational principle of Causal Information Conservation.
2. Proposing the First Causally-Sound Diffusion Framework. We introduce BELM-MDCM, the first framework to eliminate SRE by design. By leveraging an analytically invertible mechanism, it ensures that the power of diffusion models can be applied to causality without compromising the integrity of the abduction process.
3. Developing a Principled Methodology to Operationalize the Framework. To make our Zero-SRE framework practical and robust, we introduce two synergistic innovations: a Targeted Modeling strategy to manage complexity and a Hybrid Training Objective to provide a strong causal inductive bias, both supported by our theoretical analysis.
Through a comprehensive experimental evaluation, we demonstrate that BELM-MDCM not only sets a new state-of-the-art in estimation accuracy but, more critically, unlocks the generation of authentic individual-level counterfactuals for deep causal inquiries. By providing a foundational blueprint that resolves a core tension between modern machine learning and classical causal theory, our work establishes a new, more rigorous standard for this research direction.
1.1 The Inversion Challenge in Diffusion-Based Causality
Diffusion models (ho2020denoising) are powerful generative models that learn to reverse a fixed, gradual noising process. They train a neural network, $\epsilon_{\theta}(\mathbf{x}_{t},t)$ , to predict the noise component of a corrupted sample $\mathbf{x}_{t}$ by optimizing a simple mean-squared error objective:
$$
\begin{split}L_{\text{simple}}(\theta)=\mathbb{E}_{t,\mathbf{x}_{0},\boldsymbol{\epsilon}}\bigg[\Big\|\boldsymbol{\epsilon}-\epsilon_{\theta}\big(\sqrt{\bar{\alpha}_{t}}\mathbf{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}\boldsymbol{\epsilon},t\big)\Big\|^{2}\bigg]\end{split} \tag{1}
$$
where $\bar{\alpha}_{t}$ defines the noise schedule and $\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ . This trained network is then used to iteratively denoise a variable from pure noise back to a clean sample. A standard deterministic method for this generative process is the Denoising Diffusion Implicit Model (DDIM) (song2021denoising):
$$
\begin{split}\mathbf{x}_{t-1}=\sqrt{\bar{\alpha}_{t-1}}\left(\frac{\mathbf{x}_{t}-\sqrt{1-\bar{\alpha}_{t}}\epsilon_{\theta}(\mathbf{x}_{t},t)}{\sqrt{\bar{\alpha}_{t}}}\right)+\sqrt{1-\bar{\alpha}_{t-1}}\cdot\epsilon_{\theta}(\mathbf{x}_{t},t)\end{split} \tag{2}
$$
However, causal abduction requires the inverse operation: encoding an observed data point $\mathbf{x}_{0}$ into its latent noise code $\mathbf{x}_{T}$ . Standard frameworks (chao2023interventional) use the DDIM inversion, which only approximates this path:
$$
\begin{split}\mathbf{x}_{t+1}=\sqrt{\bar{\alpha}_{t+1}}\left(\frac{\mathbf{x}_{t}-\sqrt{1-\bar{\alpha}_{t}}\epsilon_{\theta}(\mathbf{x}_{t},t)}{\sqrt{\bar{\alpha}_{t}}}\right)+\sqrt{1-\bar{\alpha}_{t+1}}\cdot\epsilon_{\theta}(\mathbf{x}_{t},t)\end{split} \tag{3}
$$
This inversion is approximate because it relies on the noise prediction $\epsilon_{\theta}(\mathbf{x}_{t},t)$ remaining constant across the step, which introduces discretization errors that accumulate (liu2022pseudo). This structural flaw, which we term the Structural Reconstruction Error (SRE), systematically corrupts the inferred exogenous noise $U_{i}$ . The initial error in the abduction step then propagates through the entire Abduction-Action-Prediction cycle, compromising the fidelity of the final counterfactual.
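The mechanism behind this error can be seen in a few lines of code. The sketch below uses a toy noise schedule and two hypothetical noise predictors (both assumptions for illustration, not a trained network): when the prediction depends on the state, one inversion step followed by one generative step fails to return the original point, exactly because the two steps evaluate $\epsilon_{\theta}$ at different states; when the prediction is state-independent, the round trip is exact.

```python
import numpy as np

def alpha_bar(t):
    # Toy smooth noise schedule on t in [0, 1] (an assumption for illustration).
    return np.exp(-2.0 * t)

def ddim_step(x, t_from, t_to, eps):
    # One deterministic DDIM update (the shared form of Eqs. 2 and 3):
    # predict x_0 from the current state, then re-noise to the target time.
    a_from, a_to = alpha_bar(t_from), alpha_bar(t_to)
    e = eps(x, t_from)
    x0_pred = (x - np.sqrt(1.0 - a_from) * e) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * e

x = np.array([1.0, -0.5])
t, dt = 0.4, 0.1

eps_dep = lambda x, t: 0.3 * x                 # state-dependent prediction
eps_const = lambda x, t: np.full_like(x, 0.2)  # state-independent prediction

for eps, label in [(eps_dep, "state-dependent eps"), (eps_const, "constant eps")]:
    x_up = ddim_step(x, t, t + dt, eps)      # inversion step (Eq. 3)
    x_rec = ddim_step(x_up, t + dt, t, eps)  # generative step back (Eq. 2)
    print(label, "round-trip residual:", np.abs(x_rec - x).max())
```

The nonzero residual in the first case is the single-step contribution to the SRE; over a full trajectory these residuals accumulate.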
1.2 Our Solution: A Zero-SRE Causal Framework
To eliminate SRE by construction, we build our framework upon an analytically invertible sampler: the Bidirectional Explicit Linear Multi-step (BELM) sampler (liu2024belm). BELM overcomes the "memoryless" limitation of single-step samplers like DDIM by using a history of noise predictions, a principle grounded in classical theory for solving ODEs (hairer2006solving).
Specifically, we employ a second-order BELM. During decoding, it computes a more stable effective noise, $\boldsymbol{\epsilon}_{\text{eff}}$ , using predictions from the current and previous timesteps:
$$
\boldsymbol{\epsilon}_{\text{eff}}=\frac{3}{2}\epsilon_{\theta}(\mathbf{x}_{t},t)-\frac{1}{2}\epsilon_{\theta}(\mathbf{x}_{t+1},t+1) \tag{4}
$$
This improved estimate is then used in a DDIM-like update. The key innovation is that the corresponding encoding process is constructed to be the exact algebraic inverse of this decoding process, guaranteeing that the round-trip is lossless, i.e., $\mathbf{H}_{\theta}(\mathbf{T}_{\theta}(\mathbf{x}_{0}))=\mathbf{x}_{0}$ . While the original work on BELM focused on general generative tasks, we are the first to identify, leverage, and theoretically justify its analytical invertibility as the key to satisfying the principle of Causal Information Conservation for rigorous counterfactual generation. Our choice of a second-order BELM represents a deliberate trade-off, providing substantial accuracy gains over single-step methods while maintaining practical efficiency (liu2024belm), making it ideal for our causal framework.
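The algebraic structure behind this guarantee can be illustrated with a toy two-step recurrence. The coefficients `A`, `B`, `C` and the stand-in `eps_model` below are hypothetical (the real BELM coefficients are derived from the noise schedule in liu2024belm); the point is that because the noise predictor is only ever evaluated at states known to *both* directions, the encoding step can solve the decoding relation exactly, so the round trip is lossless up to floating-point precision.

```python
import numpy as np

def eps_model(x, k):
    # Stand-in noise-prediction "network"; invertibility below comes from
    # the update's algebra, not from any property of this function.
    return np.tanh(x) + 0.1 * k

# Hypothetical two-step coefficients (illustrative values only).
A, B, C = 0.9, 0.5, -0.2

def decode(s0, s1, n):
    # Generative direction: each new state is an explicit linear combination
    # of the two most recent states plus eps evaluated at a known state.
    traj = [s0, s1]
    for k in range(1, n):
        traj.append(A * traj[k] + B * traj[k - 1] + C * eps_model(traj[k], k))
    return traj

def encode(s_n, s_nm1, n):
    # Abduction direction: solve the same relation for the oldest state.
    # eps is re-evaluated at the same known states, so this is an exact
    # algebraic inverse of decode -- zero SRE by construction.
    traj = [None] * (n + 1)
    traj[n], traj[n - 1] = s_n, s_nm1
    for k in range(n - 1, 0, -1):
        traj[k - 1] = (traj[k + 1] - A * traj[k] - C * eps_model(traj[k], k)) / B
    return traj

fwd = decode(np.array([0.7]), np.array([0.4]), 15)
back = encode(fwd[-1], fwd[-2], 15)
assert all(np.allclose(a, b) for a, b in zip(fwd, back))  # lossless round trip
```

Contrast this with the DDIM inversion of Eq. 3, which must re-evaluate the noise predictor at a state the generative pass never visits and therefore cannot cancel it algebraically.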
1.3 Methodological Gaps in Applying Invertible SCMs
However, achieving high-fidelity causal inference requires more than a simple substitution of one sampler for another. The principle of analytical invertibility, while theoretically sound, exposes new challenges in practical SCM implementation that our framework is designed to address.
The Challenge of Model Specification: Targeted Modeling.
A key decision in SCM construction is assigning a causal mechanism to each node. Naively applying a complex, computationally expensive BELM-based diffusion model to every node in the causal graph is suboptimal. This motivates our Targeted Modeling strategy, where model complexity is treated as a resource to be allocated judiciously across the graph.
The Challenge of Downstream Tasks: Hybrid Training.
The second challenge arises from a fundamental mismatch in objectives. A diffusion model is trained on a generative objective, $L_{\text{diffusion}}(\theta)$ , while a downstream predictive task is optimized using a discriminative loss, $L_{\text{task}}(\phi)$ . These two objectives are not aligned. This "objective mismatch" motivates our Hybrid Training strategy, which seeks to unify these two goals.
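A minimal numerical sketch of such a combined objective is given below. The linear "score network", the least-squares task head, and the batch layout are all assumptions for illustration, not the paper's architecture; the sketch only shows how the generative term (mirroring Eq. 1) and a discriminative term are coupled into one scalar loss $L_{\text{total}}=L_{\text{diffusion}}+\lambda L_{\text{task}}$ .

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_loss(theta, x0, t, noise, a_bar):
    # Generative term in the style of Eq. 1, with a toy linear predictor
    # eps_theta(x_t, t) = theta[0] * x_t + theta[1] * t  (an assumption).
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise
    eps_pred = theta[0] * x_t + theta[1] * t
    return np.mean((noise - eps_pred) ** 2)

def task_loss(phi, features, targets):
    # Discriminative term: least-squares error of a downstream predictor.
    return np.mean((features @ phi - targets) ** 2)

def hybrid_loss(theta, phi, batch, lam=0.5):
    # A single scalar objective coupling the generative and predictive goals.
    x0, t, noise, a_bar, feats, y = batch
    return diffusion_loss(theta, x0, t, noise, a_bar) + lam * task_loss(phi, feats, y)

x0 = rng.normal(size=64)
batch = (x0, 0.3, rng.normal(size=64), 0.8,
         rng.normal(size=(64, 2)), rng.normal(size=64))
total = hybrid_loss(np.array([0.1, 0.0]), np.array([0.5, -0.2]), batch, lam=0.5)
```

In practice both terms would be minimized jointly by gradient descent; Section 2.7 analyzes why this coupling acts as a weighted score-matching objective.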
2 Theoretical Analysis: An Operator-Theoretic Framework
To formalize our thesis that Causal Information Conservation is paramount and its violation via Structural Reconstruction Error is a fundamental barrier, we develop a rigorous operator-theoretic framework. This perspective is essential for analyzing the fidelity of the causal mapping process itself, moving beyond simple prediction errors. We present the first formal analysis that decomposes the counterfactual error in diffusion-based causal models to explicitly isolate the SRE, proving how our Zero-SRE design eliminates this critical structural limitation.
Our analysis first establishes the conditions for perfect counterfactual generation (§ 2.1-§ 2.3) and proves that standard methods produce a non-zero SRE, which our sampler eliminates by construction (Propositions 5 and 6; § 2.4). The centerpiece is a novel error decomposition theorem that isolates the SRE, motivating our Zero-SRE design (§ 2.5-§ 2.7). We conclude with learnability guarantees and a discussion of implications for advanced causal tasks like transportability (§ 2.8-§ 2.10).
2.1 Problem Formulation and Causal Operators
Let $(\Omega,\mathcal{F},P)$ be a probability space. We consider endogenous variables $\mathbf{V}$ as elements of the Hilbert space of square-integrable random variables, $\mathcal{X}:=L^{2}(\Omega,\mathbb{R}^{d})$ . Unless otherwise specified, all vector norms $\|\cdot\|$ in the subsequent analysis refer to the standard Euclidean ( $L_{2}$ ) norm.
**Definition 1 (Functional SCM Operator)**
*A Structural Causal Model is defined by a set of unknown, true functional operators $\{\mathbf{F}_{i}\}_{i=1}^{d}$ , where each $\mathbf{F}_{i}:\mathcal{X}^{\text{pa}_{i}}\times\mathcal{U}_{i}\to\mathcal{X}_{i}$ is a map such that $V_{i}:=\mathbf{F}_{i}(\mathbf{Pa}_{i},U_{i})$ , with $\mathbf{Pa}_{i}$ being the set of parent random variables and $U_{i}$ an exogenous noise variable. We establish the convention that the corresponding lowercase bold letter, $\mathbf{pa}_{i}$ , denotes a specific vector of observed values for these parents.*
Our goal is to learn a model parameterized by $\theta$ that approximates this SCM. Our model consists of a pair of conditional operators for each variable $V_{i}$ :
1. A decoder (generative) operator $\mathbf{H}_{\theta}:\mathcal{U}\times\mathcal{X}^{p}\to\mathcal{X}$ , which aims to approximate $\mathbf{F}$ .
2. An encoder (inference) operator $\mathbf{T}_{\theta}:\mathcal{X}\times\mathcal{X}^{p}\to\mathcal{U}$ , which aims to perform abduction by inferring the latent noise.
These operators are realized by solving the probability flow ODE (Appendix A). The decoder $\mathbf{H}_{\theta}$ solves the ODE from $t=T$ to $t=0$ , while the encoder $\mathbf{T}_{\theta}$ solves it from $t=0$ to $t=T$ . Our BELM sampler is a high-fidelity numerical solver designed such that these forward and backward operations are exact algebraic inverses.
2.2 Identifiability and Exact Counterfactual Generation
We adapt principles from identifiable generative modeling (chao2023interventional) to formalize the conditions for exact counterfactuals. This requires assuming the SCM is invertible with respect to its noise term, a condition discussed in Section 2.11.
**Theorem 2 (Identifiability via Statistical Independence)**
*Consider an SCM operator $X:=\mathbf{F}(\mathbf{Pa},U)$ where $U\perp\!\!\!\perp\mathbf{Pa}$ and $\mathbf{F}$ is invertible w.r.t. $U$ . If a learned encoder $\mathbf{T}_{\theta}$ (with sufficient capacity) yields a latent representation $Z=\mathbf{T}_{\theta}(X,\mathbf{Pa})$ that is statistically independent of the parents $\mathbf{Pa}$ , then $Z$ is an isomorphic representation of the exogenous noise $U$ .*
2.3 Geometric Inductive Bias for Identifiability
The score-matching objective's geometric inductive biases strengthen our identifiability argument. We leverage the principle of implicit regularization, where optimizers favor "simpler" functions (hochreiter1997flat; neyshabur2018pac). We adopt the principle of simplicity bias, a cornerstone of modern deep learning theory that, while empirically supported, remains an active and not yet universally proven area of research. Our conclusions are conditioned on its validity, as discussed further in Section 2.11. This suggests the model learns the most parsimonious geometric transformation required to explain the data.
Considering the local geometry of the data density $p(\mathbf{x})$ provides powerful intuition. In a local region $\mathcal{R}$ , if the data is isotropic (spherically symmetric), the simplest score function is a radial vector field, yielding a conformal map. If the structure is simply anisotropic (e.g., ellipsoidal), the model is biased towards learning a local affine map. This refines the notion of a purely conformal bias and leads to the following proposition.
**Proposition 3 (Implicit Bias towards Simple Geometric Maps)**
*Assume (A1) the true data density $p(\mathbf{x})$ is smooth ( $C^{2}$ ) and (A2) the optimization process has a simplicity bias (e.g., favoring low-complexity solutions, see Appendix H).
1. If there exists a local region $\mathcal{R}$ where $p(\mathbf{x})$ is isotropic, the optimal learned score function is a radial vector field, and the flow map it generates is a conformal map on $\mathcal{R}$ .
2. If we relax the condition to a local region $\mathcal{R}$ where $p(\mathbf{x})$ has an ellipsoidal structure, the optimal learned score function is normal to the ellipsoidal iso-contours, and the flow map it generates is a local affine transformation on $\mathcal{R}$ .*
The formal argument is detailed in Appendix H. This proposition is significant: it suggests that the model defaults to learning the most parsimonious, well-behaved, and locally invertible map that can explain the dataâs geometry. This bias is crucial for the abduction step, as it prevents the pathological distortions that would corrupt the inferred causal noise $U$ .
**Theorem 4 (Operator Isomorphism Guarantees Exact Counterfactuals)**
*Let the conditions of Theorem 2 hold. If the learned operator pair $(\mathbf{T}_{\theta},\mathbf{H}_{\theta})$ constitutes a conditional isomorphism (i.e., $\mathbf{H}_{\theta}(\mathbf{T}_{\theta}(\cdot,\mathbf{pa}),\mathbf{pa})=\mathbf{I}$ , the identity operator), then the model's prediction under an intervention $do(\mathbf{Pa}:=\boldsymbol{\alpha})$ is exact.*
Proof A full proof, covering cases for different dimensions of the exogenous noise variable, is provided in Appendix B.
2.4 Analysis of Inversion Fidelity
We now formally analyze the inversion error. We prove that standard approximate schemes produce a non-zero SRE (Proposition 5), whereas our chosen sampler eliminates it by construction (Proposition 6).
**Proposition 5 (Structural Error of Approximate Inversion)**
*Let $\mathbf{T}_{\text{DDIM}}$ be the operator for one step of DDIM inversion from $\mathbf{x}_{t}$ to $\mathbf{x}_{t+1}$ , and $\mathbf{H}_{\text{DDIM}}$ be the generative step operator from $\mathbf{x}_{t+1}$ to $\mathbf{x}_{t}$ . The single-step reconstruction error is non-zero and of second order in the time step $\Delta t$ :
$$
(\mathbf{H}_{\text{DDIM}}\circ\mathbf{T}_{\text{DDIM}})(\mathbf{x}_{t})-\mathbf{x}_{t}=\mathcal{O}((\Delta t)^{2})
$$
This error accumulates over the full trajectory, leading to a non-zero Structural Reconstruction Error.*
Proof See Appendix C for a rigorous proof.
**Proposition 6 (Analytical Invertibility of the Sampler)**
*Let $\mathbf{T}_{\text{BELM}}$ and $\mathbf{H}_{\text{BELM}}$ be the operators corresponding to the full-trajectory BELM sampler for inference and generation, respectively. For a fixed noise prediction network $\epsilon_{\theta}$ , the operators are exact algebraic inverses:
$$
\mathbf{H}_{\text{BELM}}\circ\mathbf{T}_{\text{BELM}}=\mathbf{I}
$$*
Proof The proof follows from the algebraic construction of the BELM update rules, as detailed in Appendix C.
2.5 Error Decomposition for Counterfactual Estimation
This brings us to our central theoretical result: an error decomposition theorem that rigorously partitions the total counterfactual error. This decomposition isolates the SRE and mathematically demonstrates why its elimination is critical.
**Definition 7 (Counterfactual Error Components)**
*We formally define the two primary sources of error in counterfactual estimation for the invertible case:
1. The Structural Reconstruction Error ( $E_{SR}$ ) measures the information loss from the modelâs abduction-action cycle on a given sample $X$ :
$$
E_{SR}(X):=\|(\mathbf{H}_{\theta}\circ\mathbf{T}_{\theta}-\mathbf{I})X\|^{2}
$$
2. The Latent Space Invariance Error ( $E_{LSI}$ ) measures the failure of the learned latent space to remain invariant under interventions on parent variables:
$$
E_{LSI}:=\|\mathbf{T}_{\theta}(X,\mathbf{Pa})-\mathbf{T}_{\theta}(X_{\boldsymbol{\alpha}}^{\text{true}},\boldsymbol{\alpha})\|^{2}
$$*
**Theorem 8 (Counterfactual Error Bound)**
*Let a model be defined by $(\mathbf{T}_{\theta},\mathbf{H}_{\theta})$ and the true SCM by $\mathbf{F}$ . Assume the decoder $\mathbf{H}_{\theta}$ is $L_{\mathcal{H}}$ -Lipschitz. The expected squared error of the modelâs counterfactual prediction $\hat{X}_{\boldsymbol{\alpha}}$ is bounded by the expectation of the two error components:
$$
\mathbb{E}\left[\|\hat{X}_{\boldsymbol{\alpha}}-X_{\boldsymbol{\alpha}}^{\text{true}}\|^{2}\right]\leq 2\mathbb{E}\left[E_{SR}(X_{\boldsymbol{\alpha}}^{\text{true}})\right]+2L_{\mathcal{H}}^{2}\mathbb{E}\left[E_{LSI}\right]
$$*
Proof The proof is in Appendix D.
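The inequality of Theorem 8 can be checked pointwise on a toy instance. The scalar encoder/decoder pair, the parent and intervention values, and the "true" counterfactual below are all hypothetical placeholders (not the paper's trained operators); the sketch simply verifies that the squared counterfactual error is dominated by twice the reconstruction term plus twice the Lipschitz-weighted invariance term.

```python
import numpy as np

# Toy conditional encoder/decoder (hypothetical operators). The decoder H
# is Lipschitz in its latent argument with constant L_H <= W + 0.01.
W = 0.8
def T(x, pa): return (x - 0.5 * pa) / W                    # encoder T_theta
def H(z, pa): return W * z + 0.5 * pa + 0.01 * np.sin(z)   # imperfect decoder
L_H = W + 0.01

x, pa, alpha = 1.3, 0.4, 0.9     # observation, its parents, intervention value
x_cf_true = 1.1                  # hypothetical true counterfactual outcome

x_hat = H(T(x, pa), alpha)                                  # model counterfactual
E_SR  = (H(T(x_cf_true, alpha), alpha) - x_cf_true) ** 2    # reconstruction error
E_LSI = (T(x, pa) - T(x_cf_true, alpha)) ** 2               # invariance error

lhs = (x_hat - x_cf_true) ** 2
rhs = 2 * E_SR + 2 * L_H ** 2 * E_LSI
assert lhs <= rhs    # the Theorem 8 inequality holds on this instance
```

Setting the decoder's `0.01 * np.sin(z)` perturbation to zero makes `E_SR` vanish, which is precisely the Zero-SRE regime discussed in Remark 9: the entire bound then rests on the invariance term.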
**Remark 9 (Elimination of Structural Error)**
*By Proposition 6, the Structural Reconstruction Error for BELM-MDCM is identically zero. This is the central theoretical advantage of our framework. It disentangles the error sources, allowing us to isolate the entire modeling challenge to learning a high-quality score function ( $\epsilon_{\theta}$ ) without the confounding factor of an imperfect inversion algorithm. Any remaining error is now purely a function of statistical estimation, not a structural bias of the model itself.*
**Proposition 10 (Bound on Latent Space Invariance Error)**
*We assume the learned score network, $\boldsymbol{\epsilon}_{\theta}$ , is Lipschitz continuous, ensuring the existence and uniqueness of the probability flow ODE solution via the Picard-Lindelöf theorem. Under standard integrability conditions (Fubiniâs theorem), the Latent Space Invariance Error is bounded by the expected score-matching loss:
$$
\mathbb{E}\left[E_{LSI}\right]\leq C^{\prime}\cdot\mathbb{E}\left[\|\boldsymbol{\epsilon}_{\theta}-\boldsymbol{\epsilon}^{*}\|^{2}\right]
$$
for some constant $C^{\prime}$ , where $\boldsymbol{\epsilon}^{*}$ is the true score function.*
Proof The proof is in Appendix D. This proposition formally establishes that by eliminating structural error, the causal fidelity of BELM-MDCM is directly and provably controlled by its ability to accurately learn the dataâs score function.
2.6 Decomposing Error: A Motivation for Empirical Validation
The error decomposition in Theorem 8 provides a clear strategy for empirical validation by isolating two distinct error sources: the Structural Reconstruction Error ( $E_{SR}$ ) and the Latent Space Invariance Error. While developing a single score combining these is future work, these components directly motivate our empirical investigations. Our ablation study (Section 5.4.2) is designed to measure the impact of a non-zero $E_{SR}$ , while our stress-test (Section 5.4.1) probes robustness when latent space invariance is challenged by a non-invertible SCM.
2.7 Theoretical Roles of Targeted Modeling and Hybrid Training
With algorithmic error eliminated by our Zero-SRE design, the challenge becomes minimizing the modeling error ( $E_{LSI}$ ). Our two methodological innovations, Targeted Modeling and Hybrid Training, are principled strategies for this purpose.
Targeted Modeling as Formal Complexity Control.
Our Targeted Modeling strategy acts as a form of structural regularization. The finite sample bound in Theorem 15 is governed by the Rademacher complexity $\mathfrak{R}_{n}(\mathcal{F}_{\Theta})$ of the entire SCMâs hypothesis space. By assigning low-complexity models to a subset of nodes, we directly constrain the overall complexity.
**Remark 11 (Effect on Generalization Bound)**
*Our Targeted Modeling strategy is formally justified as a complexity control mechanism. The Rademacher complexity of a composite SCM is bounded by the sum of the complexities of its individual mechanisms (mohri2018foundations). By strategically substituting a high-complexity diffusion model $\mathcal{F}_{\text{diff}}$ with a lower-complexity alternative $\mathcal{F}_{\text{simple}}$ for non-critical nodes, Targeted Modeling directly minimizes this upper bound. This leads to a tighter generalization bound and improves the statistical efficiency of the overall SCM.*
Hybrid Training as a Weighted Score-Matching Objective.
The Hybrid Training Objective, $L_{\text{total}}=L_{\text{diffusion}}+\lambda\cdot L_{\text{task}}$ , imparts a crucial inductive bias for learning a causally salient score function. The task-specific loss steers the model, forcing it to prioritize learning an accurate score function in regions of the data manifold most critical to the causal question. We formalize this by proposing that the auxiliary loss implicitly implements a weighted score-matching objective.
**Proposition 12 (Hybrid Objective as a Weighted Score-Matching Regularizer)**
*The auxiliary task loss $L_{\text{task}}$ provides a lower bound for the modelâs error, weighted by a function reflecting the causal salience of the data manifold. Minimizing the hybrid objective $L_{\text{total}}$ is thereby equivalent to solving a weighted score-matching problem that prioritizes accuracy in causally salient regions, leading to a smaller effective Latent Space Invariance Error. (A rigorous proof is provided in Appendix E.)*
This proposition formally grounds our hybrid training strategy, revealing that the task-specific loss intelligently forces the diffusion model to prioritize accuracy in regions of the data manifold most critical to the causal question. This reinforces the CIC principle by avoiding information loss where it matters most, effectively implementing the simplicity bias principle from Section 2.3.
We can deepen this insight by analyzing its information-theoretic implications.
**Proposition 13 (Disentanglement via Hybrid Objective)**
*Information-theoretically, the hybrid objective provides a strong inductive bias towards learning a disentangled latent representation. It encourages a "division of labor" where the task-specific component explains variance from the parents $\mathbf{Pa}$ , while the diffusion component's latent code $Z=\mathbf{T}_{\theta}(V,\mathbf{Pa})$ models the residual information. This implicitly pushes $Z$ towards being independent of $\mathbf{Pa}$ , a crucial step towards satisfying the identifiability conditions.*
Proof A detailed information-theoretic argument is provided in Appendix E.
2.8 BELM-MDCM as a Unifying Framework
The principle of Causal Information Conservation also unifies our framework with classical models. Simpler models like Additive Noise Models (ANMs) can be seen as special cases where this principle is met trivially, positioning our work as a generalization of established causal principles. For instance, in a classic ANM (hoyer2009nonlinear), $V_{i}=f_{i}(\mathbf{Pa}_{i})+U_{i}$ , the noise is recovered by a direct, lossless inversion: $U_{i}=V_{i}-f_{i}(\mathbf{Pa}_{i})$ . Our framework generalizes this principle to arbitrarily complex, non-additive mechanisms, offering a flexible, non-parametric extension to classical structural equation models (wooldridge2010econometric). The importance of noise distributions, particularly non-Gaussianity, for identifiability in linear models is also a well-established principle (shimizu2006linear).
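The ANM special case makes CIC concrete in a few lines. The sketch below assumes a known nonlinear mechanism $f$ (for illustration; in practice $f$ would be estimated): abduction is an exact subtraction, and reusing each individual's recovered noise under an intervention yields exact counterfactuals.

```python
import numpy as np
rng = np.random.default_rng(0)

f = lambda pa: np.sin(pa) + pa ** 2      # nonlinear mechanism (assumed known)
pa = rng.normal(size=1000)               # observed parent values
u  = rng.normal(scale=0.3, size=1000)    # exogenous noise U_i
v  = f(pa) + u                           # observed outcomes V_i = f(Pa_i) + U_i

# Abduction is lossless by simple subtraction, so CIC holds trivially:
u_hat = v - f(pa)

# Action + prediction: do(Pa := 1.5), reusing each individual's noise.
v_cf = f(1.5) + u_hat                    # exact individual-level counterfactuals
```

Our framework plays the same game for mechanisms where no closed-form subtraction exists: the BELM encoder takes the place of `v - f(pa)`, and its analytical invertibility preserves the same lossless property.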
2.9 Learnability and Statistical Guarantees
We now provide finite-sample learnability guarantees for our SCM framework.
**Proposition 14 (Asymptotic Consistency)**
*Under standard regularity conditions, as the number of data samples $n\to\infty$ and model capacity $N\to\infty$ , the learned operators $(\hat{\mathbf{T}}_{n},\hat{\mathbf{H}}_{n})$ are consistent estimators of the ideal operators $(\mathbf{T}^{*},\mathbf{H}^{*})$ : $\hat{\mathbf{T}}_{n}\xrightarrow{p}\mathbf{T}^{*}$ and $\hat{\mathbf{H}}_{n}\xrightarrow{p}\mathbf{H}^{*}$ .*
**Theorem 15 (Finite Sample Bound for Causal Diffusion SCMs)**
*Let an SCM consist of $d$ endogenous nodes, with a causal graph having a maximum in-degree of $d_{in}^{max}$ . Assume each causal mechanism is implemented by a score network $\epsilon_{\theta}$ that is an $L$ -layer MLP with ReLU activations, and the spectral norm of each weight matrix is bounded by $B$ . Let the input space be appropriately normalized. Let the loss function be bounded by $M$ . Then, for the parameters $\hat{\theta}_{n}$ learned from $n$ samples, the excess risk is bounded with probability at least $1-\delta$ :
$$
R(\hat{\theta}_{n})-R(\theta^{*})\leq C\cdot\frac{d\cdot L\cdot B^{L}\cdot\sqrt{d_{in}^{max}+d_{embed}+1}}{\sqrt{n}}+M\sqrt{\frac{\log(1/\delta)}{2n}}
$$
where $C$ is a constant independent of the network architecture and sample size, and $d_{embed}$ is the dimension of the time embedding.*
Proof The proof, which combines the sub-additivity of Rademacher complexity over the SCM with standard bounds for deep neural networks (bartlett2017spectrally; neyshabur2018pac), is detailed in Appendix G.
**Remark 16 (Interpretation of the Bound)**
*This refined bound quantitatively links the generalization error to:
1. Causal Complexity ( $d\cdot\sqrt{d_{in}^{max}}$ ): The error scales with the number of causal mechanisms ( $d$ ) and the graph's complexity ( $d_{in}^{max}$ ), formalizing the intuition that more complex causal systems are harder to learn.
2. Network Complexity ( $L\cdot B^{L}$ ): The error scales with the depth and spectral norm of the score networks. This provides direct theoretical grounding for our Targeted Modeling strategy, as using simpler models tightens this generalization bound.*
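As a worked example of the bound in Theorem 15, the snippet below evaluates its two terms for a hypothetical SCM under illustrative placeholder constants ( $C$ , $M$ , $\delta$ , and all architecture sizes are assumptions, not fitted values), showing how shallower score networks tighten the bound, which is the quantitative content of the Targeted Modeling remark above.

```python
import numpy as np

def excess_risk_bound(d, L, B, d_in_max, d_embed, n, M=1.0, delta=0.05, C=1.0):
    # The two terms of Theorem 15; C, M, delta are illustrative placeholders.
    capacity   = C * d * L * B ** L * np.sqrt(d_in_max + d_embed + 1) / np.sqrt(n)
    confidence = M * np.sqrt(np.log(1.0 / delta) / (2.0 * n))
    return capacity + confidence

# Hypothetical SCM: 5 nodes, max in-degree 3, 16-dim time embedding, n = 10_000.
deep    = excess_risk_bound(d=5, L=6, B=1.5, d_in_max=3, d_embed=16, n=10_000)
shallow = excess_risk_bound(d=5, L=3, B=1.5, d_in_max=3, d_embed=16, n=10_000)
assert shallow < deep    # simpler mechanisms tighten the generalization bound
```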
2.10 Implications for Causal Transportability
Causal Information Conservation also provides a foundation for transportability: applying knowledge from a source domain $\mathcal{S}$ to a target domain $\mathcal{T}$ (pearl2014transportability). Transportability requires separating invariant causal knowledge from domain-specific mechanisms. By losslessly recovering the exogenous noise $U$ (the invariant "causal essence"), our framework achieves this separation by design; the decoders $\mathbf{H}_{\theta}$ represent the domain-specific mechanisms. This insight is formalized in the following theorem.
**Theorem 17 (Condition for Lossless Causal Transport)**
*Let a source domain $\mathcal{S}$ and a target domain $\mathcal{T}$ be described by SCMs $\mathcal{M}^{\mathcal{S}}$ and $\mathcal{M}^{\mathcal{T}}$ , respectively. Assume the following conditions hold:
1. Shared Structure: Both domains share the same causal graph $\mathcal{G}$ and the same exogenous noise distributions $\{p_{i}(U_{i})\}$ . The domains differ only in a subset of causal mechanisms $\mathcal{K}_{\text{changed}}$ .
2. Noise Independence: The exogenous noise variables $\{U_{i}\}_{i=1}^{d}$ are mutually independent.
3. Information Conservation: A model $(\mathbf{T}_{\theta},\mathbf{H}_{\theta})$ trained on data from $\mathcal{S}$ satisfies the Causal Information Conservation principle, achieving zero Structural Reconstruction Error.
Then, causal knowledge can be losslessly transported from $\mathcal{S}$ to $\mathcal{T}$ by re-learning only the operators $\{\mathbf{T}_{\theta_{k}},\mathbf{H}_{\theta_{k}}\}$ corresponding to the changed mechanisms $k\in\mathcal{K}_{\text{changed}}$ , while directly reusing all operators for invariant mechanisms.*
Proof The proof is provided in Appendix F.
2.11 Discussion of Assumptions
Our framework rests on several key assumptions, which we now critically examine.
Our geometric inductive bias argument (Proposition 3) rests on the principle of simplicity bias. While this principle is a cornerstone of modern deep learning theory with substantial empirical backing, it remains an active area of research and is not a universally proven theorem. Our conclusions are therefore conditioned on the validity of this powerful but conjectural assumption.
The cornerstone of our identifiability theory (Theorem 2) is the SCM's invertibility with respect to its noise term $U$. This is a strong assumption; when it is violated (e.g., by a many-to-one mechanism), the abduction task becomes ill-posed.
To address this foundational challenge, we provide an exhaustive theoretical treatment in Appendix C. There, we formalize the irreducible "representational error" and derive a tighter, more general error bound (Theorem 21). More importantly, we propose a concrete mitigation strategy: a novel prior-matching regularizer (Definition 23), theoretically shown to reduce the error by encouraging the learned encoder to approximate the ideal Maximum a Posteriori (MAP) solution (Proposition 24). This highlights a primary contribution: even in the challenging non-invertible case, BELM-MDCM's zero-SRE design eliminates the algorithmic error, thereby isolating the more fundamental representational challenge. Our stress test in Section 5.4.1 empirically confirms this advantage, while further validation of the proposed regularizer provides a clear direction for future work.
Our identifiability proof is dimension-dependent, leveraging Liouville's theorem for $d\geq 3$ and requiring stronger assumptions, such as asymptotic linearity, for the special case of $d=2$. Other assumptions, such as Lipschitz continuity of the score network, are mild regularity conditions standard in deep generative model analysis and can be encouraged through architectural choices like spectral normalization.
3 Architectural Design and Training
The BELM-MDCM architecture embodies our core principles through a non-monolithic, theoretically-motivated design. Its central philosophy is Targeted Modeling: judiciously allocating the expressive power of our Zero-SRE CausalDiffusionModel to nodes of causal interest (e.g., Treatment T, Outcome Y), while using simpler, efficient mechanisms for confounders, as illustrated in Figure 1. This strategy provides practical complexity control, tightening the generalization bound as established in Theorem 15.
Figure 1: Illustration of the Targeted Modeling Principle. The expressive CausalDiffusionModel is judiciously allocated to key causal nodes (Treatment T, Outcome Y) for high-fidelity counterfactual generation. Simpler, efficient mechanisms (e.g., ANM, Empirical Distribution) are used for confounder nodes (W, X) to ensure stability and efficiency.
Figure 2: The detailed internal architecture of the CausalDiffusionModel. This diagram illustrates the end-to-end workflow of the causal mechanism designed for key nodes like Treatment T and Outcome Y, detailing the pre-processing, embedding, training, and post-processing stages.
The internal architecture of the CausalDiffusionModel itself, depicted in Figure 2, is engineered to learn the complex, non-linear mapping $v_{i}:=f_{i}(\mathbf{pa}_{i},u_{i})$ with high fidelity.
3.1 Mechanism for Exogenous Nodes
Exogenous nodes (without parents in the causal graph $\mathcal{G}$ ) are modeled non-parametrically via the Empirical Distribution of the observed data. This approach avoids distributional assumptions and provides a robust foundation for the Structural Causal Model (SCM).
3.2 Mechanism for Endogenous Nodes: The CausalDiffusionModel
For endogenous nodes $V_{i}$ , particularly those central to the causal query (treatment, outcome, key mediators), we employ our bespoke CausalDiffusionModel to learn the functional mapping $v_{i}:=f_{i}(\mathbf{pa}_{i},u_{i})$ .
3.2.1 Conditioning via Parent Node Transformation
The denoising process is conditioned on the parent nodes $\mathbf{pa}_{i}$, which are transformed into a fixed-dimensional conditioning vector $\mathbf{c}\in\mathbb{R}^{d_{c}}$. A ColumnTransformer handles heterogeneous data types: continuous parents are standardized (StandardScaler) to unify scales, while categorical parents are one-hot encoded (OneHotEncoder) to prevent artificial ordinality. The resulting vectors are concatenated into $\mathbf{c}$, which remains constant for a given sample's diffusion trajectory.
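The transformation above can be sketched in plain NumPy (a minimal stand-in for the ColumnTransformer pipeline; the function name and the fixed class count are illustrative assumptions, not our actual implementation):

```python
import numpy as np

def make_conditioning_vectors(cont, cat, n_classes):
    """Build fixed-dimensional conditioning vectors c: standardize the
    continuous parents, one-hot encode the categorical ones, concatenate."""
    cont = np.asarray(cont, dtype=float)                 # shape (n, d_cont)
    z = (cont - cont.mean(axis=0)) / cont.std(axis=0)    # StandardScaler analogue
    onehot = np.eye(n_classes)[np.asarray(cat)]          # OneHotEncoder analogue
    return np.concatenate([z, onehot], axis=1)           # c in R^{d_c}
```

Because the preprocessors are fitted once, the resulting $\mathbf{c}$ is deterministic for a given sample and stays constant along its diffusion trajectory.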
3.2.2 The Denoising Process
The core of the CausalDiffusionModel is a denoising network $\epsilon_{\theta}(v_{t},t,\mathbf{c})$ , implemented as a Residual MLP (he2016deep). It takes as input the noisy variable $v_{t}$ , a sinusoidal Time Embedding of timestep $t$ , and the conditioning vector $\mathbf{c}$ . Before the diffusion process, the target variable $V_{i}$ is also preprocessed (standardized for continuous values or label-encoded for categorical ones). The denoising process is driven by the BELM sampler, ensuring a mathematically exact and stable inversion path as established in Section 2.
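For concreteness, the sinusoidal Time Embedding fed to $\epsilon_{\theta}(v_{t},t,\mathbf{c})$ can be sketched as follows (a generic transformer-style schedule; the exact frequency ladder in our implementation may differ):

```python
import numpy as np

def timestep_embedding(t, dim):
    """Sinusoidal embedding of diffusion timesteps t into R^dim:
    a geometric ladder of frequencies, paired as (sin, cos) features."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = np.asarray(t, dtype=float)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=1)  # (len(t), dim)
```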
3.2.3 Hybrid Training Objective
We introduce a Hybrid Training Objective to reconcile generative fidelity with predictive accuracy. As established in our theoretical analysis (Proposition 12), this is more than a standard multi-task learning scheme; it acts as a powerful inductive bias, creating a weighted score-matching objective that prioritizes accuracy in causally salient regions of the data manifold. The total loss is a linearly weighted combination:
$$
L_{\text{total}}=L_{\text{diffusion}}+\lambda\cdot L_{\text{task}} \tag{5}
$$
where $L_{\text{diffusion}}$ is the noise prediction error (Equation 1). The auxiliary loss $L_{\text{task}}$ is a Mean Squared Error for continuous nodes ( $L_{\text{regression}}$ ) and a Cross-Entropy loss for discrete nodes ( $L_{\text{classification}}$ ).
3.2.4 Decoding and Counterfactual Generation
For generation, the BELM sampler produces an output in the normalized space. This is then mapped back to the original data domain using the inverse transformations of the pre-fitted preprocessors (StandardScaler for continuous, LabelEncoder for categorical). For categorical outputs, the continuous value is rounded and clipped to the valid class range before the inverse mapping, ensuring that generated (counterfactual) data is interpretable and resides in the correct space.
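The categorical decoding step can be sketched as follows (a simple list of class labels stands in for the fitted LabelEncoder):

```python
import numpy as np

def decode_categorical(v, classes):
    """Map a continuous sampler output back to a categorical label:
    round, clip to the valid class range, then invert the label encoding."""
    idx = np.clip(np.rint(v).astype(int), 0, len(classes) - 1)
    return np.asarray(classes)[idx]
```

Clipping before the inverse mapping guarantees that every generated sample decodes to a label that actually exists in the training data.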
4 New Evaluation Metrics for Generative Causal Models
The principle of Causal Information Conservation demands new evaluation dimensions that traditional metrics like ATE and PEHE cannot capture. An accurate ATE score, for instance, could arise from a model with high SRE whose individual errors fortuitously cancel out at the population level. To move beyond mere outcome accuracy and directly assess a model's adherence to our foundational principle, we propose a new, theoretically grounded evaluation framework.
4.1 Causal Information Conservation Score (CIC-Score)
The Causal Information Conservation Score (CIC-Score) is a direct empirical diagnostic for the Structural Reconstruction Error. It quantifies a frameworkâs adherence to the CIC principle by disentangling algorithmic information loss (from an imperfect inversion process) from modeling error (from the statistical challenge of learning the true causal mechanism). We define the score, bounded in $[0,1]$ , using an exponential formulation:
$$
\text{CIC-Score}=\exp\left(-\left(\delta_{U}+\delta_{\text{SRE}}\right)\right)
$$
The error components are designed to isolate distinct failure modes:
- $\delta_{U}$ , the Relative Noise Recovery Error, quantifies the modeling error. It measures how well the trained network approximates the true score function, reflected in the fidelity of the recovered noise $\hat{U}$ versus the ground-truth $U_{\text{true}}$ :
$$
\delta_{U}=\frac{\mathbb{E}[\|\hat{U}_{\text{scaled}}-U_{\text{true, scaled}}\|^{2}]}{\mathbb{E}[\|U_{\text{true, scaled}}\|^{2}]}
$$
This term captures all inaccuracies from finite data and imperfect optimization.
- $\delta_{\text{SRE}}$ , the Normalized Structural Error, exclusively quantifies the algorithmic error inherent to the inversion process itself. Its definition is model-dependent to allow for fair comparisons:
- For frameworks with analytical invertibility (e.g., our BELM-MDCM, ANMs), the algorithm introduces no information loss, so we set $\delta_{\text{SRE}}\equiv 0$ by construction. Any observed reconstruction error is a symptom of modeling error, already captured by $\delta_{U}$ .
- For frameworks relying on approximate inversion (e.g., DDIM), $\delta_{\text{SRE}}$ is empirically measured to quantify this inherent algorithmic flaw:
$$
\delta_{\text{SRE}}=\frac{\mathbb{E}[\|(\mathbf{H}_{\theta}\circ\mathbf{T}_{\theta}-\mathbf{I})X\|^{2}]}{\mathbb{E}[\|X\|^{2}]}
$$
This principled decomposition allows the CIC-Score to fairly assess different frameworks by isolating structural design advantages from the universal challenge of model training.
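Given recovered and ground-truth noise (and, for approximate-inversion baselines, the original and reconstructed data), the score admits a direct empirical computation; the sketch below assumes the inputs are already scaled as in the definitions above:

```python
import numpy as np

def cic_score(u_hat, u_true, x=None, x_rec=None, analytic_inversion=True):
    """CIC-Score = exp(-(delta_U + delta_SRE)).
    delta_U: relative noise-recovery (modeling) error.
    delta_SRE: normalized structural (algorithmic) error; identically 0
    for analytically invertible frameworks, measured empirically otherwise."""
    delta_u = np.mean((u_hat - u_true) ** 2) / np.mean(u_true ** 2)
    if analytic_inversion:
        delta_sre = 0.0                # zero by construction (e.g., BELM-MDCM, ANM)
    else:
        delta_sre = np.mean((x_rec - x) ** 2) / np.mean(x ** 2)
    return np.exp(-(delta_u + delta_sre))
```

A score of 1 is attained only when both error components vanish, so any SRE in an approximate-inversion framework strictly lowers its score.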
4.2 Causal Mechanism Fidelity Score (CMF-Score)
A generative causal model's core promise is to learn true causal mechanisms, not just outcomes. Naïve metrics like pairwise correlations fail to capture the non-linear, multi-variable, and directional nature of causality. We therefore propose the Causal Mechanism Fidelity (CMF) score, a hierarchical framework with two levels of increasing rigor.
4.2.1 Level 1 (Pragmatic): The Conditional Mutual Information Score (CMI-Score)
The Conditional Mutual Information (CMI), $I(V_{i};V_{j}|\mathbf{Pa}_{j}\setminus\{V_{i}\})$ , is a non-parametric, non-linear measure of the direct influence a parent $V_{i}$ has on its child $V_{j}$ after accounting for all other parents. The CMI-Score evaluates whether this influence is preserved. For a single mechanism $V_{j}$ , it is the average consistency across all parent-child edges:
$$
\text{CMI-Score}(V_{j})=\frac{1}{|\mathbf{Pa}_{j}|}\sum_{V_{i}\in\mathbf{Pa}_{j}}\left(1-\frac{\left|I_{\text{obs}}(V_{i};V_{j}|\cdot)-I_{\text{cf}}(V_{i};V^{\prime}_{j}|\cdot)\right|}{I_{\text{obs}}(V_{i};V_{j}|\cdot)+\epsilon}\right)
$$
where $I_{\text{obs}}$ and $I_{\text{cf}}$ are the CMI values from observational and counterfactual data, respectively. The final CMI-Score is the average over all SCM mechanisms.
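Given per-edge CMI estimates from any external estimator (the estimates themselves are assumed precomputed; the function name is illustrative), the score for one mechanism is:

```python
import numpy as np

def cmi_score_mechanism(i_obs, i_cf, eps=1e-8):
    """CMI-Score for one mechanism V_j: average edge-wise consistency between
    observational and counterfactual CMI values I(V_i; V_j | Pa_j without V_i).
    i_obs, i_cf: arrays of per-parent-edge CMI estimates."""
    i_obs, i_cf = np.asarray(i_obs, float), np.asarray(i_cf, float)
    return np.mean(1.0 - np.abs(i_obs - i_cf) / (i_obs + eps))
```

The final CMI-Score then averages this quantity over all mechanisms in the SCM.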
4.2.2 Level 2 (Gold Standard): The Kernelized Mechanism Discrepancy (KMD) Score
To rigorously compare entire conditional distributions, we use the Maximum Mean Discrepancy (MMD) (gretton2012kernel), a kernel-based statistical test for distributional equality. The KMD-Score applies this test to the conditional distributions $p(V_{j}|\mathbf{Pa}_{j})$ that define each causal mechanism, measuring the discrepancy between the learned and observed conditionals. The final score is mapped to a similarity measure in $[0,1]$ :
$$
\text{KMD-Score}=\exp(-\gamma\cdot\mathbb{E}_{\mathbf{pa}_{j}\sim p(\mathbf{Pa}_{j})}[\text{MMD}(p(V_{j}|\mathbf{pa}_{j}),p_{\theta}(V_{j}|\mathbf{pa}_{j}))])
$$
where $\gamma$ is a scaling parameter. A score of 1 indicates that the learned conditional mechanism is statistically indistinguishable from the observed one.
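A self-contained sketch with a Gaussian (RBF) kernel follows; the bandwidth and $\gamma$ values are illustrative, and in practice one would set the bandwidth by, e.g., a median heuristic:

```python
import numpy as np

def mmd_rbf(x, y, bandwidth=1.0):
    """Biased MMD^2 estimate between 1-D samples x and y under an RBF kernel."""
    def k(a, b):
        return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def kmd_score(mmd_values, gamma=1.0):
    """KMD-Score: map conditional MMDs, averaged over parent configurations
    pa_j ~ p(Pa_j), to a similarity in [0, 1]."""
    return np.exp(-gamma * np.mean(mmd_values))
```

Identical conditionals give an MMD of zero at every parent configuration and hence a KMD-Score of exactly 1.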
Complementary and Validated Evaluation Metrics.
Our proposed metrics complement, rather than replace, traditional ones like ATE and PEHE. They evaluate distinct facets of performance: while ATE/PEHE measure outcome accuracy, the CMF-Score assesses mechanism fidelity. This distinction is critical, as a model can achieve a high ATE via fortuitous error cancellation despite failing to learn the true data-generating process. To ensure our metrics are practically reliable, we conducted a controlled micro-simulation study, detailed in Appendix J. The results provide strong empirical evidence for their validity and complementary roles: the CIC-Score acts as a high-sensitivity SRE detector; the CMI-Score robustly tracks the fidelity of causal associations; and the KMD-Score serves as a final arbiter of distributional similarity. This validation confirms that our evaluation framework offers a more complete, nuanced, and reliable assessment of generative causal models.
5 Experiments
Our empirical evaluation is designed as a comprehensive test of our central thesis: that eliminating SRE is a necessary condition for generating authentic counterfactuals and unlocks analytical capabilities beyond the reach of conventional methods. We structured the study as a four-act narrative to rigorously test our claims. Act I establishes our model's state-of-the-art predictive fidelity on standard benchmarks. Act II provides a deep diagnostic analysis, using our proposed metrics as empirical evidence for the destructive effect of SRE. Act III showcases the unique capabilities unlocked by an information-conserving framework. Finally, Act IV validates the framework's robustness through a series of stress tests and a full ablation study.
Evaluation Protocol.
For a rigorous evaluation, we employ two complementary protocols. This distinction is crucial, as it separates the assessment of our methodology's peak performance from the diagnostic analysis of its components.
1. Ensemble Evaluation for SOTA Performance: To benchmark against state-of-the-art methods (specifically, ITE estimation in Section 5.1.3), we adopt the standard Deep Ensemble methodology. We train N=5 independent models and report the final metric (e.g., PEHE) on the ensembled prediction.
2. Individual Model Evaluation for Diagnostic Analysis: In all other experiments, where the goal is a fair, apples-to-apples architectural comparison or stability assessment, we report the mean and standard deviation of metrics from individual model instances across N=5 runs. This isolates the effect of design choices from the gains of ensembling.
We estimate the Average Treatment Effect (ATE) throughout our experiments using a standard counterfactual imputation procedure, the pseudo-code for which is detailed in Algorithm 1 in Appendix K.
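The estimator reduces to the following generic sketch (the toy SCM and the `counterfactual` callable are illustrative stand-ins for the fitted model's abduction-action-prediction query; the full pseudo-code is Algorithm 1 in Appendix K):

```python
import numpy as np

def ate_by_imputation(data, counterfactual):
    """Counterfactual-imputation ATE: impute Y(1) and Y(0) for every unit
    via the model's abduction -> action -> prediction query, then average
    the individual differences."""
    effects = [counterfactual(u, 1) - counterfactual(u, 0) for u in data]
    return float(np.mean(effects))

# Toy linear SCM Y = 2*T + X: the true ATE is exactly 2.
units = [{"x": x} for x in np.linspace(-1.0, 1.0, 5)]
ate = ate_by_imputation(units, lambda u, t: 2 * t + u["x"])
```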
Baseline Estimators.
The Directed Acyclic Graphs (DAGs) for our experiments are shown in Figure 3. We benchmark BELM-MDCM against a suite of baselines from the DoWhy library (sharma2022dowhy), spanning classical statistical methods to state-of-the-art machine learning estimators to ensure a comprehensive comparison.
(Panel (a) DAG: covariates W1, W2 → categorical confounder C1; W1, W2, C1 → treatment T; and C1, W1, W2, T → outcome Y, with T → Y the effect of interest.)
(a) PSM Failure Scenario
(Panel (b) DAG: confounders X1, X2 → T and M; instrument Z → T; mediation path T → M → Y; and a direct edge T → Y.)
(b) Ablation Study Scenario
(Panel (c) DAG: confounders (age, educ, re74, etc.) → treatment (treat) and outcome (re78); treat → re78 is the effect of interest.)
(c) Lalonde Confounding Structure
Figure 3: Directed Acyclic Graphs (DAGs) for key experiments. (a) A structure designed to challenge propensity score methods. (b) A mediation structure used for the ablation study. (c) The standard confounding structure assumed for both Lalonde-based experiments.
5.1 Act I: Establishing State-of-the-Art Predictive Fidelity
We first establish that our principled design achieves superior predictive fidelity on standard causal inference benchmarks.
5.1.1 Robustness in Non-Linear Confounding Scenarios
We tested our model in a challenging synthetic scenario (Figure 3(a)) designed with highly non-linear confounding to cause propensity-based methods to fail. Table 1 shows the results. While Causal Forest is exceptionally accurate on this specific DGP, our BELM-MDCM framework secures its position as the second most accurate method, delivering a highly stable and competitive ATE estimate. Crucially, it significantly outperforms the entire suite of propensity-based methods and powerful estimators like DML in accuracy. The high standard deviation of DML highlights its unreliability in this context, validating our model as a robust estimator where traditional approaches are compromised.
Table 1: ATE Estimation on the PSM Failure Scenario (True ATE = 5000). We report the mean ATE and standard deviation across multiple runs.
| Method | ATE Estimate (Mean $±$ Std) | Absolute Error |
| --- | --- | --- |
| Causal Forest | 4895.77 $±$ 69.26 | 104.23 |
| Propensity Score Stratification | 5309.38 $±$ 185.36 | 309.38 |
| Linear Regression | 5348.82 $±$ 23.23 | 348.82 |
| Propensity Score Matching | 5353.93 $±$ 191.36 | 353.93 |
| Inverse Propensity Weighting | 5385.68 $±$ 52.03 | 385.68 |
| Double Machine Learning | 4285.63 $±$ 550.97 | 714.37 |
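The failure mode exploited here can be reproduced with a minimal synthetic DGP (an illustrative sketch, not the exact generator used in our experiments): a confounder enters both the treatment assignment and the outcome through a highly non-linear function, so a logistic propensity model fit on the raw covariate is misspecified and the naive comparison is badly biased.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
TRUE_ATE = 5000.0

# Confounder and a highly non-linear transform of it.
x = rng.uniform(-3, 3, size=n)
g = np.sin(3 * x) + 0.5 * x**2          # non-linear confounding signal

# Treatment probability depends on g, not linearly on x, so a
# logistic propensity model in x misestimates the assignment mechanism.
p = 1 / (1 + np.exp(-2 * g))
t = rng.binomial(1, p)

# The outcome shares the same non-linear confounding term.
y = 1000 * g + TRUE_ATE * t + rng.normal(0, 500, size=n)

# The naive difference in means is badly biased because g differs
# systematically between the treated and control groups.
naive_ate = y[t == 1].mean() - y[t == 0].mean()
print(round(naive_ate))  # far from the true ATE of 5000
```

Estimators that model the outcome surface flexibly (e.g., forests) can absorb `g`, while purely propensity-based corrections inherit the misspecified assignment model.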
5.1.2 Accuracy and Robustness on Real-World Observational Data
We next evaluated our framework on the canonical Lalonde dataset (lalonde1986evaluating), a challenging real-world benchmark with a known RCT ground truth. Table 2 demonstrates the comprehensive superiority of our BELM-MDCM framework. It achieved a mean ATE estimate of 1567.36 $±$ 201.62, the lowest error among all methods that correctly identified the treatment effect's positive direction. More critically, the results highlight a stark contrast in reliability. Classical methods failed entirely, while the powerful Causal Forest baseline suffered from extreme instability (Std. Dev. of 785.59). In contrast, BELM-MDCM exhibited remarkable robustness, with a standard deviation approximately four times lower. This outstanding performance on a canonical benchmark validates that our framework delivers accurate estimates with the consistency essential for trustworthy causal inference.
Table 2: ATE Estimation Stability on the Lalonde Dataset (RCT Benchmark ATE $\approx 1794$). Results for all models are reported as Mean $±$ Standard Deviation across 5 independent runs.
| Method | ATE Estimate (Mean ± Std) | Absolute Error |
| --- | --- | --- |
| Causal Forest | 1085.30 ± 785.59 | 708.70 |
| Linear Regression | 46.33 ± 76.80 | 1747.67 |
| Propensity Score Matching | -3.96 ± 118.37 | 1797.96 |
| Propensity Score Stratification | -35.54 ± 81.44 | 1829.54 |
| Propensity Score Weighting | -122.55 ± 50.51 | 1916.55 |
| Double Machine Learning | nan ± nan | nan |
5.1.3 High-Fidelity ITE Estimation and Stability Analysis
Objective. We evaluate performance at the individual level using a semi-synthetic version of the Lalonde dataset. This experiment leverages real-world covariates and assumes the causal structure depicted in Figure 3(c). To rigorously assess both accuracy and reliability, we follow our Individual Model Evaluation protocol, reporting the mean and standard deviation of performance across 5 independent runs for each method.
Results. The PEHE results, presented in Table 3, confirm the exceptional fidelity and robustness of our framework. BELM-MDCM achieves the lowest average PEHE score of 537.84 and demonstrates remarkable stability with the lowest standard deviation of just 60.11. This performance is closely followed by Causal Forest. However, the results also highlight the instability of other meta-learners; X-Learner, in particular, exhibits extremely high variance, with a standard deviation more than three times larger than its competitors, rendering its single-run estimates unreliable. This highlights the dual advantage of our framework: superior accuracy combined with consistent, trustworthy performance. Figure 4 provides visual confirmation, showing the tight clustering of our model's ensembled ITE estimates around the ground truth.
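For reference, PEHE (Precision in Estimation of Heterogeneous Effects) is the root-mean-squared error between estimated and true individual treatment effects; a minimal sketch:

```python
import numpy as np

def pehe(tau_hat: np.ndarray, tau_true: np.ndarray) -> float:
    """Root-mean-squared error between estimated and true ITEs."""
    return float(np.sqrt(np.mean((tau_hat - tau_true) ** 2)))

# Toy check: estimates off by a constant 100 give a PEHE of exactly 100.
tau_true = np.array([0.0, 500.0, 1000.0])
tau_hat = tau_true + 100.0
print(pehe(tau_hat, tau_true))  # 100.0
```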
<details>
<summary>x6.png Details</summary>

Scatter plot titled "Accuracy of Individual Treatment Effect (ITE) Estimation": True ITE (x-axis, 0 to 4000) versus Estimated ITE (Ensemble) (y-axis, 0 to 4000). Individual samples from the BELM-MDCM ensemble are shown as blue circles against a red dashed y = x "Perfect Match" reference line. The points follow a clear upward trend and cluster around the reference line, with wider scatter at low and high True ITE values.
</details>
Figure 4: Accuracy of Individual Treatment Effect (ITE) Estimation on the semi-synthetic Lalonde dataset. The plot shows the ensembled estimated ITE from our model versus the true ITE. The tight clustering of our model's estimates (blue dots) around the perfect-match line (red dashed line) visually demonstrates its low PEHE score.
Table 3: ITE Estimation Accuracy (PEHE) on the Semi-Synthetic Lalonde Dataset. Results are reported as Mean $±$ Standard Deviation across 5 independent runs. Lower is better.
| Method | PEHE (Mean ± Std) |
| --- | --- |
| BELM-MDCM | 537.84 ± 60.11 |
| Causal Forest | 563.90 ± 73.66 |
| S-Learner | 816.26 ± 79.17 |
| X-Learner | 1546.38 ± 679.09 |
Table 4: Causal Mechanism Fidelity (CMI-Score) on the Semi-Synthetic Lalonde Dataset. Results are reported as Mean $±$ Standard Deviation across 5 runs. Higher is better.
| Method | CMI-Score (Mean ± Std) |
| --- | --- |
| S-Learner | 0.9905 ± 0.0062 |
| BELM-MDCM | 0.9824 ± 0.0092 |
| Causal Forest | 0.9786 ± 0.0099 |
| X-Learner | 0.9782 ± 0.0145 |
| T-Learner | 0.9555 ± 0.0113 |
5.2 Act II: Uncovering the Accuracy-Invertibility Trade-off
We now conduct the pivotal experiment of our study: a deep diagnostic analysis using our novel CIC-Score to reveal the trade-off between predictive accuracy and mechanism invertibility. This provides the core empirical evidence for our thesis by comparing three paradigms: our BELM-MDCM (Learned Invertibility), a DDIM variant (Flawed Invertibility), and a classic RF-ANM (Assumed Invertibility).
The results in Table 5 decisively validate our framework's principles. Our BELM-MDCM is the clear leader, achieving the lowest PEHE score (1071.95) with high stability. Critically, its CIC-Score of 0.3679 is orders of magnitude higher than the alternatives, proving its unique ability to learn an invertible mapping that conserves causal information. In stark contrast, the DDIM-MDCM model exemplifies the failure predicted by our theory: its near-zero CIC-Score confirms a near-total collapse of causal information due to SRE, leading to unreliable predictions (high PEHE and variance). The classical RF-ANM, while structurally invertible, lacks the capacity to learn the true mechanism, resulting in a zero CIC-Score and poor accuracy. This "Golden Table" experiment underscores that both structural integrity and powerful modeling capacity are essential for high-fidelity causal inference.
Table 5: The "Ultimate Golden Table": A comparative analysis of model classes on predictive accuracy (PEHE) and structural integrity (CIC-Score). This table includes the NF-SCM baseline, which empirically validates the likelihood-fidelity dilemma. Results are reported as Mean $±$ Standard Deviation across 5 runs. Lower PEHE is better; higher CIC-Score is better.
| Method | PEHE (Mean ± Std) | CIC-Score (Mean ± Std) |
| --- | --- | --- |
| RF-ANM | 1533.18 ± 134.24 | 0.0000 ± 0.0000 |
| DDIM-MDCM | 2085.98 ± 788.12 | 0.0065 ± 0.0130 |
| NF-SCM | 442229.96 ± 66963.73 | 0.1572 ± 0.0232 |
| BELM-MDCM | 1071.95 ± 152.11 | 0.3679 ± 0.0000 |
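This diagnostic rests on a simple round-trip test: abduct the exogenous noise from an observation, regenerate the observation from that noise, and measure the discrepancy (the SRE). A schematic sketch with a hypothetical invertible mechanism (illustrative only, not our trained model), contrasting exact abduction with an SRE-prone approximate inverter:

```python
import numpy as np

# Hypothetical invertible mechanism y = f(pa, u) = pa + exp(u),
# with exact abduction u = log(y - pa).
def f(pa, u):
    return pa + np.exp(u)

def abduct_exact(pa, y):
    return np.log(y - pa)

def abduct_noisy(pa, y, rng, sigma=0.1):
    # Mimics an SRE-prone sampler whose inversion is only approximate.
    return np.log(y - pa) + rng.normal(0, sigma, size=np.shape(y))

rng = np.random.default_rng(0)
pa = rng.normal(size=1000)
u = rng.normal(size=1000)
y = f(pa, u)

# Structural Reconstruction Error: discrepancy of y -> u -> y round trip.
sre_exact = np.abs(y - f(pa, abduct_exact(pa, y))).mean()
sre_noisy = np.abs(y - f(pa, abduct_noisy(pa, y, rng))).mean()
print(sre_exact, sre_noisy)  # exact inversion: ~0; noisy inversion: > 0
```

An analytically invertible sampler drives the first quantity to machine precision by construction, which is the sense in which our framework is "Zero-SRE."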
<details>
<summary>x7.png Details</summary>

Line chart titled "Training Loss Curve": negative log-likelihood loss (y-axis, roughly 0 to 6) against epoch (x-axis, 0 to 200). The loss falls steeply from about 5.8 to below 1.0 within the first 25 epochs, then stabilizes around 0.1 by epoch 200 with only minor fluctuations, indicating smooth, stable convergence with no sign of overfitting.
</details>
Figure 5: The training loss curve for the Conditional Normalizing Flow (NF) baseline. The smooth, stable convergence to a low negative log-likelihood value indicates a successful statistical training run. However, this did not correspond to learning the true causal mechanism, as evidenced by its extremely high PEHE score.
The Likelihood-Fidelity Dilemma: Why Natively Invertible Models Can Fail.
To rigorously test the limits of models that are natively invertible, we conducted a comprehensive stability analysis on a Conditional Normalizing Flow (NF) baseline, a model class that satisfies the Causal Information Conservation principle by construction (SRE $\equiv 0$). Across five independent runs with different random seeds, the NF model consistently demonstrated successful statistical learning, with its training loss converging stably to a low value (i.e., a high log-likelihood) in each instance (a representative example is shown in Figure 5).
However, this statistical success was starkly contrasted by a systematic and catastrophic failure in the causal task. The model yielded an average PEHE score of 442,229.96 $±$ 66,963.73, confirming that its generated counterfactuals were fundamentally incorrect. This consistent result provides decisive evidence for a critical challenge we term the likelihood-fidelity dilemma: a model can perfectly learn to replicate a data distribution while remaining completely ignorant of the underlying causal mechanism.
The root of this dilemma is the fundamental mismatch between the optimization objective and the causal goal. The maximum likelihood objective incentivizes the NF to find any invertible mapping that transforms the data to a simple base distribution. While an infinite number of such mappings may be statistically equivalent, only one corresponds to the true, unique causal data-generating process. Without a guiding signal, the NF is mathematically predisposed to learn a causally incorrect "statistical shortcut." This finding powerfully underscores the contribution of our Hybrid Training Objective. It acts as the crucial causal inductive bias that resolves this dilemma, compelling the model to learn the unique, causally salient structure and enabling valid causal inference where pure likelihood-based methods, even those with zero SRE, are destined to fail.
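The structure of such a hybrid objective can be sketched schematically: a distributional-fit term plus a weighted causal term that penalizes structural reconstruction failure. This is a minimal illustration of the loss composition, not the paper's exact objective; the weighting `lam` is a hypothetical hyperparameter.

```python
import numpy as np

def hybrid_loss(nll: np.ndarray, recon_err: np.ndarray, lam: float = 1.0) -> float:
    """
    Hybrid objective = statistical fit + causal inductive bias.

    nll       : per-sample negative log-likelihood (distributional fit)
    recon_err : per-sample structural reconstruction error (round-trip)
    lam       : weight of the causal term (illustrative hyperparameter)
    """
    return float(nll.mean() + lam * recon_err.mean())

# A pure-likelihood "statistical shortcut" can have a low NLL yet a large
# reconstruction error; the hybrid loss makes that solution unattractive.
shortcut = hybrid_loss(np.array([0.1, 0.1]), np.array([5.0, 5.0]))
faithful = hybrid_loss(np.array([0.2, 0.2]), np.array([0.0, 0.0]))
print(shortcut > faithful)  # True
```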
5.3 Act III: Unlocking Deeper Causal Inquiry with a High-Fidelity Model
An information-conserving model serves as a trustworthy "world model" for deep causal inquiry. We showcase three applications uniquely enabled by our framework's high-fidelity counterfactuals.
Heterogeneity Analysis: Conditional ATE (CATE).
A reliable ITE model can act as a "causal microscope." We use it to explore treatment effect heterogeneity by estimating the Conditional Average Treatment Effect (CATE) for subpopulations. By averaging the results from our five independently trained models, we obtain stable and robust estimates. Our model's mean CATE estimates track the true CATE trends with high fidelity across different education levels, a capability crucial for policy-making. For instance, for individuals with an education level of 3.0, the estimated CATE was $2562.79 (true CATE: $2280.79). For levels 8.0, 12.0, and 16.0, the estimates were $2092.91 (true: $2118.61), $2253.76 (true: $2384.44), and $2434.89 (true: $2490.21), respectively, demonstrating a close correspondence to the ground truth.
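Given per-individual ITE estimates from the ensemble, subgroup CATEs reduce to conditional averages over the conditioning variable; a minimal sketch (variable names are illustrative):

```python
import numpy as np

def cate_by_group(ite: np.ndarray, group: np.ndarray) -> dict:
    """Average the per-individual ITEs within each level of a covariate."""
    return {int(g): float(ite[group == g].mean()) for g in np.unique(group)}

# Toy data: ITE estimates for individuals at two education levels.
educ = np.array([8, 8, 12, 12])
ite = np.array([2000.0, 2200.0, 2300.0, 2500.0])
print(cate_by_group(ite, educ))  # {8: 2100.0, 12: 2400.0}
```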
<details>
<summary>x8.png Details</summary>

Bar chart titled "Causal Attribution: The Impact of Exogenous 'Luck'": the Victim's (Index 201) actual outcome is $10,993.98 (red bar), while the counterfactual outcome with the Top Responder's (Index 281) "luck" is $33,328.30 (green bar); a dashed blue reference line marks "Actual Outcome ($4,296.47)" on the Outcome (Y) axis (0 to 35,000).
</details>
Figure 6: Causal Attribution Analysis. This chart shows the counterfactual outcome for the "Victim" if they had possessed the individual-specific exogenous factors of the "Top Responder".
Causal Attribution: Isolating the Effect of Exogenous Factors.
We conducted a causal attribution experiment via a counterfactual intervention of the form $do(U_{\text{victim}}:=u_{\text{responder}})$. Figure 6 shows that our framework can losslessly recover these factors, revealing that unobserved exogenous "luck" had a massive and stable causal effect, averaging a +22,334.32 change in the "Victim's" outcome. (Here, "luck" is intuitive shorthand for the exogenous noise variable $U$: within the Structural Causal Model (SCM) framework, $U$ represents all unobserved, individual-specific factors, such as intrinsic ability, random chance, or measurement error, that, together with the observed parent variables ($\mathbf{Pa}$), determine the final outcome for an individual.) This capacity for reliable attribution is a unique advantage of our information-conserving framework.
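Operationally, the attribution query abducts the noise for each individual, swaps the noise terms, and regenerates the outcome. A schematic sketch on a toy invertible SCM (the mechanism and values are illustrative, not the trained model):

```python
# Toy invertible mechanism: y = f(pa, u) = 2 * pa + u, so u = y - 2 * pa.
def f(pa, u):
    return 2 * pa + u

def abduct(pa, y):
    return y - 2 * pa

# Observed (parents, outcome) for two individuals.
pa_victim, y_victim = 1.0, 5.0    # abducted noise: 3.0
pa_resp, y_resp = 2.0, 14.0       # abducted noise: 10.0

u_resp = abduct(pa_resp, y_resp)

# Counterfactual: the victim keeps their own parents but receives the
# responder's exogenous "luck"  ->  do(U_victim := u_responder).
y_cf = f(pa_victim, u_resp)
attribution = y_cf - y_victim
print(y_cf, attribution)  # 12.0 7.0
```

The same three-step recipe (abduct, swap, regenerate) applies with the learned diffusion mechanism in place of the toy `f`.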
<details>
<summary>x9.png Details</summary>

Density plot titled "Counterfactual Fairness Audit for Attribute 'black'": Outcome (Y) on the x-axis versus density. For the audited group (black=1, N=371), the actual outcome distribution (red, mean $8,829.19) is compared with the counterfactual outcome distribution had black=0 (purple, mean $7,151.03), giving an average fairness gap of -$1,678.16. The distributions overlap substantially, with the actual distribution shifted slightly toward higher outcomes.
</details>
(a) Fairness audit for attribute "black".
<details>
<summary>x10.png Details</summary>

Density plot titled "Counterfactual Fairness Audit for Attribute 'hisp'": Outcome (Y) on the x-axis versus density. For the audited group (hisp=1, N=39), the actual outcome distribution (red, mean $5,145.18) is compared with the counterfactual outcome distribution had hisp=0 (purple, mean $7,007.30), giving an average fairness gap of $1,862.12. The counterfactual distribution is shifted toward higher outcomes; the small sample size (N=39) warrants caution in interpretation.
</details>
(b) Fairness audit for attribute "hisp".
Figure 7: Counterfactual fairness audits reveal significant outcome disparities based on sensitive attributes. The plots show the distribution of actual outcomes for each group versus the distribution of their counterfactual outcomes had their sensitive attribute been different.
Counterfactual Fairness Audit.
Finally, we applied our framework to a counterfactual fairness audit. Only a model that faithfully represents the data generating process can reliably answer questions about fairness. Figure 7 reveals significant and stable disparities: our model estimates an average fairness gap of -1,678.16 for the "black" attribute and +1,862.12 for the "hisp" attribute, demonstrating its capacity as a powerful tool for ethical audits.
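The fairness gap reported in Figure 7 is the mean difference between counterfactual and actual outcomes for the audited group, where the counterfactuals are generated by flipping the sensitive attribute through the full causal model. Once those counterfactual outcomes are in hand, the audit statistic itself is a one-liner; a minimal sketch (names illustrative):

```python
import numpy as np

def fairness_gap(y_actual: np.ndarray, y_counterfactual: np.ndarray) -> float:
    """Average gap: counterfactual mean minus actual mean, audited group."""
    return float(y_counterfactual.mean() - y_actual.mean())

# Toy audit: actual outcomes vs. outcomes had the attribute been flipped.
y_actual = np.array([8000.0, 9000.0, 10000.0])
y_cf = np.array([7000.0, 7500.0, 8000.0])
print(fairness_gap(y_actual, y_cf))  # -1500.0
```

Under this sign convention a negative gap means the counterfactual outcomes are, on average, lower than the actual ones, matching the -1,678.16 and +1,862.12 figures above.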
5.4 Act IV: Final Validation: Stress Tests and Ablation Study
We conclude by subjecting the framework to two final tests: a stress test on a non-invertible SCM and a comprehensive ablation study.
5.4.1 Stress Test on a Non-Invertible SCM
We tested our frameworkâs robustness when the theoretical assumption of an invertible SCM is violated, using a DGP where $Y\propto U_{Y}^{2}$ . The results in Table 6 and Figure 8 decisively validate our hypothesis. On the definitive metric of individual-level fidelity (PEHE), our zero-SRE BELM framework achieves an error of 0.77, a 44% reduction compared to the SRE-prone DDIM model. This empirically proves that even when the true SCM is non-invertible, eliminating algorithmic SRE provides a substantial advantage. This result is fully consistent with our theoretical analysis in Appendix C (Theorem 21), where the total error is decomposed into algorithmic, modeling, and representational errors. By eliminating the algorithmic error ( $E_{SR}\equiv 0$ ), our frameworkâs performance approaches the theoretical limit set by the other two components.
Interestingly, the DDIM sampler achieves a slightly higher KMD-Score. We hypothesize that its inherent inversion noise acts as a form of implicit regularization, making the marginal generated distribution appear closer to the truth, even while individual-level counterfactuals are less accurate. This highlights the important distinction between distributional fidelity and individual-level causal accuracy.
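This distinction can be made concrete with the stress-test DGP itself: when $Y \propto U^{2}$, abduction cannot distinguish $+u$ from $-u$, so even a perfect round-trip inverter reproduces the marginal distribution of $Y$ exactly while mislabeling individuals. An illustrative sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=10000)
y = u**2                      # many-to-one mechanism: +u and -u collide

# A sign-agnostic abduction always returns the positive root.
u_hat = np.sqrt(y)
y_regen = u_hat**2

# Distributional fidelity is perfect: the regenerated marginal matches.
assert np.allclose(np.sort(y), np.sort(y_regen))

# But individual-level recovery of u fails for every negative draw,
# which is what degrades counterfactual (PEHE-style) accuracy.
individual_err = np.abs(u_hat - u).mean()
print(individual_err)  # positive: every negative draw of u is mislabeled
```

This is the representational error component of the decomposition in Appendix C: it persists even when the algorithmic SRE is zero.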
[Figure 8 shows four bar charts (DDIM vs. BELM, error bars = ±1 std): Group-Level Accuracy (ATE Error, lower is better), DDIM ≈ 0.97 vs. BELM ≈ 0.74; Individual-Level Fidelity (PEHE, lower is better), DDIM ≈ 1.38 vs. BELM ≈ 0.77; Mechanism Fidelity (CMI-Score, higher is better), DDIM ≈ 0.98 vs. BELM ≈ 0.99; Distributional Fidelity (KMD-Score, higher is better), DDIM ≈ 0.91 vs. BELM ≈ 0.83.]
Figure 8: Robustness comparison on the many-to-one SCM. Our zero-SRE framework (BELM) demonstrates significantly superior performance on PEHE.
Table 6: Performance on the non-invertible SCM ($Y\propto U^{2}$). Results are averaged over 5 runs (± std). Lower PEHE is better.
| Method | PEHE (lower is better) | KMD-Score (higher is better) |
| --- | --- | --- |
| BELM (Zero SRE) | 0.77 ± 0.16 | 0.830 ± 0.009 |
| DDIM (with SRE) | 1.38 ± 0.19 | 0.907 ± 0.023 |
5.4.2 Ablation Study: Deconstructing the Framework's Success
We conducted a comprehensive ablation study on a challenging synthetic dataset (Figure 3(b)) to validate the contribution of each core component. The findings, presented in Table 7 and Figure 9, provide unequivocal evidence for our integrated design. The full BELM-MDCM model establishes the gold standard for both accuracy and stability. The study reveals three critical insights:
- Decisive Role of the Hybrid Objective: Removing it causes a catastrophic decline in performance (400%+ error increase), demonstrating it is a core driver of the causal inductive bias, not merely a fine-tuning mechanism.
- Critical Importance of Targeted Modeling: Removing it leads to a complete collapse in model stability (Std Dev explodes from 4.57 to 138.01), validating our theoretical analysis on complexity control. A judicious allocation of model complexity is paramount for reproducibility.
- Robust Advantage of Exact Invertibility: Replacing BELM with DDIM leads to a clear degradation in both accuracy and stability, confirming that SRE from approximate inversion systematically erodes the quality of the final causal estimate.
[Figure 9 shows two horizontal bar charts over 5 runs: (top) Mean Estimated ATE with ±1 std error bars against the true ATE = 202.29 (red dashed line); (bottom) Mean Absolute Error, lower is better. Configurations compared: BELM-MDCM (Full Model), w/o Exact Invertibility (DDIM), w/o Targeted Modeling, w/o Hybrid Objective.]
Figure 9: Visualization of the ablation study results. The top panel shows the mean estimated ATE relative to the true value (red dashed line), with error bars indicating $± 1$ standard deviation. The bottom panel highlights the mean absolute error for each configuration. The full BELM-MDCM model is demonstrably the most accurate and stable.
Table 7: Ablation study results on a challenging synthetic dataset (True ATE = 202.29), validating the necessity of each framework component. The mean and standard deviation are computed over 5 runs.
| Configuration | Estimated ATE (mean ± std) | Mean Abs. Error |
| --- | --- | --- |
| w/o Exact Invertibility (DDIM) | 219.98 ± 7.48 | 17.68 |
| w/o Hybrid Objective | 137.77 ± 23.99 | 64.53 |
| w/o Targeted Modeling | 253.20 ± 138.01 | 50.90 |
**Conclusion.** The ablation study confirms that our framework's three core design principles (analytical invertibility, a hybrid training objective, and targeted modeling) work in synergy. The removal of any single component creates a significant vulnerability, validating the integrity and effectiveness of our integrated architectural design.
6 Causal Information Conservation as a Unifying Principle
The principle of Causal Information Conservation extends beyond a foundation for our model; it offers a unifying lens, a new taxonomy, for analyzing the suitability of any generative model for individual-level causal inference. Applying this principle and its operational metric, SRE, allows us to situate our work and clarify its unique advantages within the broader landscape.
6.1 Normalizing Flows: Natively Information-Conserving
Normalizing Flows (NFs) (dinh2014nice; dinh2017density; kingma2018glow) are designed around an invertible mapping. By construction, their Structural Reconstruction Error is identically zero, making NFs a native implementation of the Causal Information Conservation principle. However, their strengths come with limitations: the requirement of a tractable Jacobian imposes heavy architectural constraints, which can limit expressive power and introduce strong topological assumptions on the data manifold.
6.2 VAEs and GANs: Architecturally High-SRE
Variational Autoencoders (VAEs) (kingma2014autoencoding) and Generative Adversarial Networks (GANs) (goodfellow2014generative) are fundamentally ill-suited for this task. Their Structural Reconstruction Error is large and non-zero due to a fundamental architectural mismatch, as their separate encoder and decoder networks lack any structural guarantee of being inverses. Furthermore, their sources of information loss are inherent to their design; optimization objectives like the ELBO's KL-divergence term actively encourage lossy compression, theoretically impeding the recovery of precise causal information.
6.3 A Comparative Perspective on Diffusion Models
Viewed through our principle, diffusion models occupy a unique and compelling space in this taxonomy. Standard diffusion-based approaches, using samplers like DDIM, aspire to conserve information, but their reliance on approximate inversion results in a non-zero SRE; they are "aspirational" but ultimately lossy. In contrast, BELM-MDCM (this work) achieves zero SRE by integrating an analytically invertible sampler, matching the theoretical purity of Normalizing Flows but without their rigid architectural constraints. Furthermore, unlike NFs trained with a generic likelihood objective, our framework's Hybrid Training objective provides a causally-oriented inductive bias. BELM-MDCM thus uniquely combines the rigorous invertibility of NFs with the modeling flexibility and task-specific power of the diffusion paradigm, making it ideally suited for principled, high-fidelity causal inference.
7 Conclusion and Future Work
This paper introduced Causal Information Conservation as a guiding principle for the emerging field of diffusion-based causal inference. Our primary contribution is not the concept of invertibility itself, but framing it as a foundational design requirement and identifying the Structural Reconstruction Error (SRE) as the precise, quantifiable cost of its violation.
Our proposed framework, BELM-MDCM, serves as a constructive proof of this principle. By being architected around analytical invertibility, it is the first to achieve zero SRE by design, shifting the focus from mitigating numerical errors to upholding a fundamental causal principle. This work provides a foundational blueprint and a more rigorous standard for applying the power of diffusion models to the profound challenges of causal inference, reconciling their flexibility with the logical rigor demanded by classical theory.
7.1 Limitations and Future Work
Our work highlights several avenues for future research:
- Handling Non-Invertible SCMs: Our framework excels when the true SCM is invertible. While our stress test shows robustness when this is violated, developing models inherently resilient to such misspecifications is a key challenge. Empirically validating the proposed prior-matching regularizer (Appendix C) is a concrete next step.
- Robustness to Graph Misspecification: Like most SCM-based methods (peters2017elements), our framework assumes a correctly specified causal graph. Analyzing how structural errors (e.g., omitted confounders) propagate through our error decomposition framework is a significant future research direction.
- Formalizing CIC within Information Theory: Our work defines CIC as an operational principle. A compelling direction is to formalize it within a rigorous information-theoretic framework, for instance by proving that zero SRE maximizes the mutual information $I(U;\hat{U})$ between the true and recovered noise, connecting our work to rate-distortion theory.
- Scalability and Generalizability: While powerful, the BELM sampler is computationally intensive. Improving its efficiency for high-dimensional settings is an important practical challenge. Furthermore, extending the principles of CIC and zero-SRE modeling to other data modalities, such as time-series or images, represents an exciting frontier.
Acknowledgments and Disclosure of Funding
We thank the anonymous reviewers for their insightful feedback, which significantly improved the clarity and rigor of this paper.
Appendix A Core Diffusion Model Equations
This appendix provides the essential equations for the diffusion models referenced in this work. Diffusion models learn a data distribution by training a neural network $\boldsymbol{\epsilon}_{\theta}$ to reverse a fixed, gradual noising process.
The model is trained by optimizing a simplified score-matching objective (ho2020denoising):
$$
\begin{split}L_{\text{simple}}(\theta)=\mathbb{E}_{t,\mathbf{x}_{0},\boldsymbol{\epsilon}}\bigg[\Big\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_{\theta}\big(\sqrt{\bar{\alpha}_{t}}\mathbf{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}\boldsymbol{\epsilon},t\big)\Big\|^{2}\bigg]\end{split}
$$
where $\bar{\alpha}_{t}$ is a predefined noise schedule. For deterministic generation and inversion, we use the Denoising Diffusion Implicit Model (DDIM) update step (song2021denoising):
$$
\begin{split}\mathbf{x}_{t-1}=\sqrt{\bar{\alpha}_{t-1}}\left(\frac{\mathbf{x}_{t}-\sqrt{1-\bar{\alpha}_{t}}\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{t},t)}{\sqrt{\bar{\alpha}_{t}}}\right)+\sqrt{1-\bar{\alpha}_{t-1}}\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{t},t)\end{split}
$$
This process can be viewed as a discretization of a continuous-time probability flow Ordinary Differential Equation (ODE) (song2021score):
$$
d\mathbf{x}=\left[\frac{1}{2}\frac{d\log\alpha(s)}{ds}\mathbf{x}(s)-\frac{1}{2}\frac{d\log(1-\alpha(s))}{ds}\frac{\sqrt{1-\alpha(s)}}{\sqrt{\alpha(s)}}\boldsymbol{\epsilon}_{\theta}(\mathbf{x}(s),s)\right]ds
$$
Within our theoretical framework, the encoder operator $\mathbf{T}_{\theta}$ corresponds to solving this ODE forward in time (from $s=0$ to $s=1$ ), while the decoder operator $\mathbf{H}_{\theta}$ corresponds to solving it backward in time (from $s=1$ to $s=0$ ).
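As a concrete illustration of the training objective above, a single Monte Carlo evaluation of $L_{\text{simple}}$ can be sketched as follows; the linear `eps_theta` and the linear $\bar{\alpha}$ schedule are toy stand-ins for a trained network and a real schedule.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 100
# Toy decreasing noise schedule: alpha_bar[0] = 1.0 down to alpha_bar[T] = 0.1.
alpha_bar = 1.0 - 0.9 * np.arange(T + 1) / T

def eps_theta(x_t, t):
    # Toy noise predictor (a stand-in for the trained network).
    return 0.5 * x_t

def l_simple(x0_batch):
    # Monte Carlo estimate of L_simple: sample t and eps, form the noised
    # state x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps, regress eps.
    n = x0_batch.shape[0]
    t = rng.integers(1, T + 1, n)
    eps = rng.normal(0.0, 1.0, n)
    a = alpha_bar[t]
    x_t = np.sqrt(a) * x0_batch + np.sqrt(1.0 - a) * eps
    return np.mean((eps - eps_theta(x_t, t)) ** 2)

loss = l_simple(rng.normal(0.0, 1.0, 10_000))
print(loss)
```

In practice the expectation is estimated per minibatch exactly as above, with `eps_theta` replaced by the network being trained.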
Appendix B Detailed Proofs for Identifiability (Theorems 2 & 4)
This appendix provides detailed, dimension-specific proofs for the identifiability of the exogenous noise $U$ and the subsequent correctness of counterfactual generation. The core challenge lies in showing that the statistical independence of the latent code $Z$ from the parents $\mathbf{Pa}$ is a sufficient condition to establish an isomorphic relationship between $Z$ and $U$ . The mathematical tools required differ based on the dimensionality of $U$ .
B.1 The High-Dimensional Case ($d\geq 3$)
For cases where the exogenous noise $U$ has a dimensionality $d\geq 3$, the proof leverages Liouville's theorem on conformal mappings.
Proof [Proof of Theorems 2 and 4 for $d\geq 3$] We adapt the proof from identifiable generative modeling (chao2023interventional) to our conditional operator setting.
Let $\mathbf{Q}_{\mathbf{pa}}(U):=\mathbf{T}_{\theta}(\mathbf{F}(\mathbf{pa},U),\mathbf{pa})$ be the composite function mapping the noise $U$ to the latent code $Z$ for a given context $\mathbf{pa}$ . By assumption, $\mathbf{Q}_{\mathbf{pa}}$ is invertible and differentiable. The core assumption, $Z\perp\!\!\!\perp\mathbf{Pa}$ , implies that the conditional density $p_{Z}(z|\mathbf{pa})$ must equal a marginal density $p_{Z}(z)$ that is independent of $\mathbf{pa}$ .
Using the change of variables formula, we relate the density of $Z$ to that of $U$ :
$$
p_{Z}(z|\mathbf{pa})=\frac{p_{U}(\mathbf{Q}_{\mathbf{pa}}^{-1}(z))}{\left|\det J_{\mathbf{Q}_{\mathbf{pa}}}(\mathbf{Q}_{\mathbf{pa}}^{-1}(z))\right|}
$$
where $J_{\mathbf{Q}_{\mathbf{pa}}}$ is the Jacobian of $\mathbf{Q}_{\mathbf{pa}}$ . Since the left-hand side is independent of $\mathbf{pa}$ and $p_{U}(·)$ is a fixed distribution, this imposes a strong constraint on the Jacobian determinant. Under regularity conditions, this implies that the Jacobian $J_{\mathbf{Q}_{\mathbf{pa}}}(u)$ must be a scaled orthogonal matrix, making $\mathbf{Q}_{\mathbf{pa}}$ a conformal map.
By Liouville's theorem, for dimensions $d\geq 3$, any conformal map must be a Möbius transformation (a composition of translations, scalings, orthogonal transformations, and inversions). For the map to be well-behaved, it must exclude the inversion component, which would introduce singularities. This is consistent with the regularity of functions representable by neural networks, simplifying the map to an affine form:
$$
\mathbf{Q}_{\mathbf{pa}}(u)=\mathbf{A}_{\mathbf{pa}}u+\mathbf{d}_{\mathbf{pa}}
$$
where $\mathbf{A}_{\mathbf{pa}}$ is a scaled orthogonal matrix. The argument then uses the independence of the distributionâs moments and support to show that $\mathbf{A}_{\mathbf{pa}}$ and $\mathbf{d}_{\mathbf{pa}}$ must be constant w.r.t. $\mathbf{pa}$ . This leads to the isomorphic relationship $\mathbf{T}_{\theta}(\mathbf{F}(\mathbf{Pa},U),\mathbf{Pa})=\mathbf{A}U+\mathbf{d}=g(U)$ .
The proof of counterfactual correctness follows directly, as detailed in chao2023interventional. The conditional isomorphism $\mathbf{H}_{\theta}(\mathbf{T}_{\theta}(·,\mathbf{pa}),\mathbf{pa})=\mathbf{I}$ combined with the identifiability result $\mathbf{T}_{\theta}(\mathbf{F}(\mathbf{pa},u),\mathbf{pa})=g(u)$ implies that $\mathbf{H}_{\theta}(g(u),\mathbf{pa})=\mathbf{F}(\mathbf{pa},u)$ . The decoder thus perfectly mimics the true causal mechanism, making an intervention exact: $\hat{X}_{\boldsymbol{\alpha}}=\mathbf{H}_{\theta}(g(u),\boldsymbol{\alpha})=\mathbf{F}(\boldsymbol{\alpha},u)=X_{\boldsymbol{\alpha}}^{\text{true}}$ .
B.2 The One-Dimensional Case ( $d=1$ )
For the one-dimensional case, where Liouville's theorem does not apply, this section provides a dedicated proof leveraging properties of one-dimensional functions and a uniform-noise assumption from (chao2023interventional) that is made without loss of generality.
We first establish a helper lemma characterizing a specific class of 1D functions.
**Lemma 18**
*For $U,Z\in\mathbb{R}$, consider a family of invertible functions $q_{\mathbf{pa}}:U\to Z$ for $\mathbf{pa}\in\mathcal{X}_{\text{pa}}\subseteq\mathbb{R}^{d}$. The derivative expression $\frac{dq_{\mathbf{pa}}}{du}(q_{\mathbf{pa}}^{-1}(z))$ is a function of $z$ only, i.e., $c(z)$, if and only if $q_{\mathbf{pa}}(u)$ can be expressed as
$$
q_{\mathbf{pa}}(u)=q(u+r(\mathbf{pa}))
$$
for some function $r$ and invertible function $q$ .*
Proof The proof is provided in (chao2023interventional). The reverse direction follows from direct differentiation, while the forward direction uses the inverse function theorem to show that the inverses $s_{\mathbf{pa}}(z)=q_{\mathbf{pa}}^{-1}(z)$ must have the same derivative $1/c(z)$ , implying they can only differ by an additive constant, which yields the desired form after inversion.
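The "if" direction of Lemma 18 is easy to verify numerically: for $q_{\mathbf{pa}}(u)=q(u+r(\mathbf{pa}))$, the derivative of $q_{\mathbf{pa}}$ evaluated at $q_{\mathbf{pa}}^{-1}(z)$ collapses to $q^{\prime}(q^{-1}(z))$, a function of $z$ alone. The cubic $q$ and sinusoidal $r$ below are arbitrary illustrative choices.

```python
import numpy as np

def q(x):
    return x ** 3 + x  # strictly increasing, hence invertible on R

def q_prime(x):
    return 3.0 * x ** 2 + 1.0

def q_inv(z, lo=-10.0, hi=10.0, iters=80):
    # Bisection inverse of the monotone map q.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if q(mid) < z:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def r(pa):
    return np.sin(pa)  # arbitrary parent-dependent shift

def c_of_z(z, pa):
    # d q_pa / du evaluated at q_pa^{-1}(z), with q_pa(u) = q(u + r(pa)).
    u = q_inv(z) - r(pa)        # q_pa^{-1}(z) = q^{-1}(z) - r(pa)
    return q_prime(u + r(pa))   # chain rule: q'(u + r(pa)) = q'(q^{-1}(z))

z = 2.5
values = [c_of_z(z, pa) for pa in (-1.0, 0.0, 0.7, 3.0)]
spread = max(values) - min(values)
print(values, spread)
```

For a fixed $z$, the derivative expression is identical across very different parent values, exactly as the lemma asserts.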
This lemma enables the proof of the theorem for the 1D case.
**Theorem 19 (Identifiability for $d=1$)**
*Assume for $X\in\mathcal{X}\subseteq\mathbb{R}$ and exogenous noise $U\sim\text{Unif}[0,1]$, the SCM is $X:=f(\mathbf{Pa},U)$. Assume an encoder-decoder model with encoding function $g$ and decoding function $h$. Assume the following conditions:
1. The encoding is independent of the parents, $g(X,\mathbf{Pa})\perp\!\!\!\perp\mathbf{Pa}$ .
1. The structural equation $f$ is differentiable and strictly increasing w.r.t. $U$ .
1. The encoding $g$ is invertible and differentiable w.r.t. $X$ .
Then, $g(f(\mathbf{Pa},U),\mathbf{Pa})=\tilde{q}(U)$ for an invertible function $\tilde{q}$ .*
Proof Let $q_{\mathbf{pa}}(U):=g(f(\mathbf{pa},U),\mathbf{pa})$ . The conditions on $f$ and $g$ ensure $q_{\mathbf{pa}}$ is strictly monotonic and thus invertible. By the independence assumption, the conditional distribution of $Z=q_{\mathbf{pa}}(U)$ does not depend on $\mathbf{pa}$ . We assume $U\sim\text{Unif}[0,1]$ without loss of generality. For any SCM with a continuous noise $E$ and a strictly increasing CDF $F_{E}$ , $X=f(\mathbf{Pa},E)$ can be re-parameterized to an equivalent SCM $X=\tilde{f}(\mathbf{Pa},U)$ , where $U=F_{E}(E)\sim\text{Unif}[0,1]$ and $\tilde{f}(·,·)=f(·,F_{E}^{-1}(·))$ . The modeling task is then to learn the potentially more complex function $\tilde{f}$ . The change of density formula gives:
$$
p_{Z}(z)=\frac{p_{U}(q_{\mathbf{pa}}^{-1}(z))}{|\frac{dq_{\mathbf{pa}}}{du}(q_{\mathbf{pa}}^{-1}(z))|}
$$
Since $p_{U}(u)=1$ on its support and $p_{Z}(z)$ does not depend on $\mathbf{pa}$, the denominator must likewise be independent of $\mathbf{pa}$; because $q_{\mathbf{pa}}$ is increasing, the absolute value can be dropped. This implies $\frac{dq_{\mathbf{pa}}}{du}(q_{\mathbf{pa}}^{-1}(z))=c(z)$ for some function $c$. This meets the condition of Lemma 18, allowing us to express $q_{\mathbf{pa}}(u)=q(u+r(\mathbf{pa}))$ for some invertible $q$. Since $Z\perp\!\!\!\perp\mathbf{Pa}$, its support must also be independent of $\mathbf{pa}$. The support is $q([0,1]+r(\mathbf{pa}))$, which is constant only if the interval $[r(\mathbf{pa}),1+r(\mathbf{pa})]$ is constant. This requires $r(\mathbf{pa})$ to be a constant, $r$. Thus, $q_{\mathbf{pa}}(u)=q(u+r)$. Defining $\tilde{q}(u)=q(u+r)$, we find that the mapping is solely a function of $U$, which completes the proof.
B.3 The Two-Dimensional Case ( $d=2$ )
The two-dimensional case is a well-known geometric exception, as the group of conformal maps is infinite-dimensional. Consequently, the proof strategy used for higher dimensions via Liouvilleâs theorem does not directly apply. Here, we show that identifiability still holds under additional, plausible regularity assumptions aligned with our modeling framework.
**Assumption 1 (Asymptotic Linearity)**
*The composite mapping $Q_{\mathbf{pa}}(u):\mathbb{C}\to\mathbb{C}$ is an entire function (analytic on the whole complex plane) with at most linear growth. That is, there exist constants $A,B$ such that $|Q_{\mathbf{pa}}(u)|\leq A|u|+B$ for all $u\in\mathbb{C}$.*
This assumption reflects a fundamental inductive bias of neural network architectures. Standard activation functions (e.g., ReLU, Tanh) produce functions that cannot exhibit super-polynomial growth or essential singularities at infinity, aligning our analysis with the function classes our model can represent.
**Assumption 2 (Non-Rotationally Symmetric Base Noise)**
*The base distribution of the exogenous noise $U$ is not rotationally symmetric. This assumption is made without loss of generality. For any arbitrary continuous noise $E=(E_{1},E_{2})$ , one can define a new noise $U=(F_{E_{1}}(E_{1}),E_{2})$ , where $F_{E_{1}}$ is the CDF of the first component. The resulting distribution of $U$ is uniform along its first axis, thus breaking any rotational symmetry. The structural function $\mathbf{F}$ then absorbs this transformation.*
Proof The proof proceeds in three steps. First, as established previously, the statistical independence condition $Z\perp\!\!\!\perp\mathbf{Pa}$ implies that the learned mapping $Q_{\mathbf{pa}}(u)$ must be a conformal map, and thus an analytic function on $\mathbb{C}$. Second, by Assumption 1, $Q_{\mathbf{pa}}(u)$ is an entire function with at most linear growth. The generalized Liouville theorem states that an entire function whose growth is bounded by a polynomial of degree $k$ must itself be a polynomial of degree at most $k$. In our case, this implies $Q_{\mathbf{pa}}(u)$ must be a polynomial of degree at most one, giving it the affine form:
$$
Q_{\mathbf{pa}}(u)=a(\mathbf{pa})u+b(\mathbf{pa})
$$
where $a,b$ are complex coefficients that may depend on $\mathbf{pa}$. Third, we use the full statistical independence condition to show that the coefficients $a$ and $b$ must be constant. For the distribution of $Z=a(\mathbf{pa})U+b(\mathbf{pa})$ to be independent of $\mathbf{pa}$, all of its properties must be constant:
- Mean: the mean $E[Z|\mathbf{pa}]=a(\mathbf{pa})E[U]+b(\mathbf{pa})$ must be constant. Assuming $E[U]=0$ without loss of generality implies $b(\mathbf{pa})$ must be a constant, $b$.
- Covariance: by Assumption 2, the distribution of $U$ is not rotationally symmetric, so its covariance matrix is not proportional to the identity. Any rotation induced by the phase of $a(\mathbf{pa})$ would alter the covariance structure of $Z$. For the distribution of $Z$ to be invariant, the rotation angle (phase) of $a(\mathbf{pa})$ must be constant.
- Scale: the change of variables formula implies that the magnitude $|a(\mathbf{pa})|$ must also be constant.

Since both the magnitude and phase of $a(\mathbf{pa})$ must be constant, $a(\mathbf{pa})$ must be a constant complex number, $a$. Therefore, $Q_{\mathbf{pa}}(u)=au+b$, which is an isomorphic mapping of $u$. This completes the proof.
Appendix C Extended Analysis of Inversion Fidelity and Non-Invertible SCMs
This appendix provides a unified analysis of inversion errors. We first provide rigorous proofs for inversion fidelity (Propositions 5 and 6), then extend our error decomposition framework to the more challenging non-invertible SCM setting.
C.1 Proofs for Inversion Fidelity (Propositions 5 & 6)
Proof [Proof of Proposition 5] The proof proceeds by deriving the explicit one-step reconstruction error and then analyzing its order of magnitude.
1. Derivation of the One-Step Reconstruction Error.
Let the single-step DDIM inversion operator be $\mathbf{T}_{t}$ , mapping an observation $\mathbf{x}_{t}$ to $\mathbf{x}^{\prime}_{t+1}$ using the noise prediction $\boldsymbol{\epsilon}_{t}=\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{t},t)$ . The corresponding generative operator is $\mathbf{H}_{t}$ , which reconstructs $\mathbf{x}^{\prime}_{t}$ from $\mathbf{x}^{\prime}_{t+1}$ using a new prediction at the new state, $\boldsymbol{\epsilon}^{\prime}_{t+1}=\boldsymbol{\epsilon}_{\theta}(\mathbf{x}^{\prime}_{t+1},t+1)$ .
By substituting the formula for $\mathbf{x}^{\prime}_{t+1}$ into the update for $\mathbf{x}^{\prime}_{t}$ , the single-step reconstruction error $\mathbf{x}^{\prime}_{t}-\mathbf{x}_{t}$ is found to be:
$$
\mathbf{x}^{\prime}_{t}-\mathbf{x}_{t}=\left(\sqrt{1-\bar{\alpha}_{t}}-\frac{\sqrt{\bar{\alpha}_{t}}\sqrt{1-\bar{\alpha}_{t+1}}}{\sqrt{\bar{\alpha}_{t+1}}}\right)(\boldsymbol{\epsilon}^{\prime}_{t+1}-\boldsymbol{\epsilon}_{t})
$$
This error is non-zero if and only if the noise prediction changes after one inversion step, i.e., $\boldsymbol{\epsilon}^{\prime}_{t+1}\neq\boldsymbol{\epsilon}_{t}$.
2. Analysis of the Error's Order of Magnitude.
In the continuous-time limit with time step $\Delta s=1/T$ , a Taylor expansion shows that both the coefficient term and the difference in noise predictions $(\boldsymbol{\epsilon}^{\prime}_{t+1}-\boldsymbol{\epsilon}_{t})$ are of order $\mathcal{O}(\Delta s)$ . The one-step reconstruction error is therefore the product of these two terms:
$$
\text{Error}_{\text{step}}=\mathcal{O}(\Delta s)\times\mathcal{O}(\Delta s)=\mathcal{O}((\Delta s)^{2})=\mathcal{O}(1/T^{2})
$$
This local error accumulates over the $T$ steps of the trajectory, resulting in a total accumulated error of order $\mathcal{O}(1/T)$ . For any finite $T$ , this global error is non-zero, constituting the Structural Reconstruction Error (SRE) for DDIM.
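The $\mathcal{O}(1/T)$ behavior can be reproduced numerically with a toy, smooth noise predictor; the `eps_theta`, the linear $\bar{\alpha}$ schedule, and the step counts below are illustrative stand-ins, not our trained model.

```python
import numpy as np

def eps_theta(x, s):
    # Smooth toy noise predictor; s = t/T is continuous time in [0, 1].
    return np.tanh(x) * (1.0 - 0.5 * s)

def round_trip_error(x0, T):
    abar = 1.0 - 0.9 * np.arange(T + 1) / T  # alpha_bar: 1.0 down to 0.1

    # Encoder T_theta: approximate DDIM inversion, t = 0 ... T-1.
    # The approximation: eps is evaluated at the *current* state x_t,
    # not at the (unknown) next state x_{t+1} that exact inversion needs.
    x = x0
    for t in range(T):
        e = eps_theta(x, t / T)
        x = (np.sqrt(abar[t + 1]) * (x - np.sqrt(1 - abar[t]) * e) / np.sqrt(abar[t])
             + np.sqrt(1 - abar[t + 1]) * e)

    # Decoder H_theta: standard DDIM generation, t = T ... 1.
    for t in range(T, 0, -1):
        e = eps_theta(x, t / T)
        x = (np.sqrt(abar[t - 1]) * (x - np.sqrt(1 - abar[t]) * e) / np.sqrt(abar[t])
             + np.sqrt(1 - abar[t - 1]) * e)

    return abs(x - x0)

err_coarse = round_trip_error(0.8, T=25)
err_fine = round_trip_error(0.8, T=400)
print(err_coarse, err_fine)
```

Refining the grid shrinks the round-trip error roughly in proportion to $1/T$, but for no finite $T$ does it vanish: this residual is exactly the SRE of the DDIM sampler.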
Proof [Proof of Proposition 6] The proof is constructive, following from the exact algebraic invertibility of the BELM sampler (liu2024belm). For the second-order BELM used in this work, the one-step decoder is an affine transformation of the form:
$$
\mathbf{x}_{t-1}=A_{t}\mathbf{x}_{t}+B_{t}\boldsymbol{\epsilon}_{t}+C_{t}\boldsymbol{\epsilon}_{t+1}
$$
where $A_{t},B_{t},C_{t}$ are schedule-dependent coefficients, $\boldsymbol{\epsilon}_{t}=\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{t},t)$ , and $\boldsymbol{\epsilon}_{t+1}=\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{t+1},t+1)$ .
The full-trajectory decoder, $\mathbf{H}_{\text{BELM}}$ , is a composition of these one-step affine maps. The BELM encoder, $\mathbf{T}_{\text{BELM}}$ , is constructed using a symmetric update rule designed to be the exact algebraic inverse of the decoder. As rigorously shown by liu2024belm, this construction ensures that the composite operator $\mathbf{T}_{\text{BELM}}$ is the exact inverse of $\mathbf{H}_{\text{BELM}}$ , assuming the same sequence of noise function evaluations is used for both processes.
Therefore, by its algebraic construction, the BELM sampler guarantees a lossless round trip:
$$
\mathbf{H}_{\text{BELM}}\circ\mathbf{T}_{\text{BELM}}=\mathbf{I}
$$
The Structural Reconstruction Error is thus identically zero by construction.
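A miniature scheme in the BELM spirit illustrates why the round trip is lossless: each step evaluates the network only at an interior state shared by both directions, so the encoder can solve the decoder's affine relation exactly. The coefficients `a`, `b`, `c` below are schematic placeholders, not the schedule-derived BELM coefficients of liu2024belm, and we carry the two-state multistep window explicitly for simplicity.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 50

# Schematic coefficients of the affine relation
# x_{t-1} = a_t * x_{t+1} + b_t * x_t + c_t * eps(x_t, t).
a = 0.9 + 0.1 * rng.random(T)
b = 0.05 + 0.05 * rng.random(T)
c = 0.05 + 0.05 * rng.random(T)

def eps_theta(x, t):
    return np.tanh(x + 0.01 * t)  # arbitrary nonlinear "network"

def encode(x0, x1):
    # Inversion: solve the affine relation for x_{t+1}, exactly.
    # eps is only ever evaluated at x_t, which is already known.
    xs = [x0, x1]
    for t in range(1, T):
        x_next = (xs[t - 1] - b[t] * xs[t] - c[t] * eps_theta(xs[t], t)) / a[t]
        xs.append(x_next)
    return xs

def decode(x_T, x_Tm1):
    # Generation: apply the same affine relation in the reverse direction,
    # evaluating eps at the identical interior states.
    xs = [None] * (T + 1)
    xs[T], xs[T - 1] = x_T, x_Tm1
    for t in range(T - 1, 0, -1):
        xs[t - 1] = a[t] * xs[t + 1] + b[t] * xs[t] + c[t] * eps_theta(xs[t], t)
    return xs

x0, x1 = 0.7, 0.9
traj = encode(x0, x1)
rec = decode(traj[T], traj[T - 1])
sre = abs(rec[0] - x0)
print(sre)
```

The reconstruction error sits at floating-point precision regardless of the network, which is the Zero-SRE property; the DDIM analogue analyzed in Proposition 5 is $\mathcal{O}(1/T)$ instead.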
C.2 Exhaustive Analysis for the Non-Invertible SCM Setting
This section rigorously extends our framework to the non-invertible case, providing a theoretical underpinning for the stress test results in § 5.4.1.
C.2.1 Assumptions and Definitions
We formalize the problem with the following assumption and definition.
**Assumption 3 (Well-Posed Abduction)**
*For any observed $(\mathbf{v},\mathbf{pa})$, the inverse image set $\mathcal{U}_{(\mathbf{v},\mathbf{pa})}=\{\mathbf{u}^{\prime}\in\mathcal{U}\mid\mathbf{F}(\mathbf{pa},\mathbf{u}^{\prime})=\mathbf{v}\}$ is non-empty, and the Maximum a Posteriori (MAP) solution over this set, given the prior $p(\mathbf{U})$, is unique. This solution defines the ideal amortized inverse operator, $\mathbf{T}^{*}(\mathbf{v},\mathbf{pa})=\arg\max_{\mathbf{u}^{\prime}\in\mathcal{U}_{(\mathbf{v},\mathbf{pa})}}p(\mathbf{u}^{\prime})$.*
**Definition 20 (Tripartite Error Sources)**
*In the non-invertible case, we refine the error sources into three distinct components:
1. Algorithmic Error (SRE): The error from an imperfect inversion algorithm, $E_{SR}:=\|(\mathbf{H}_{\theta}\circ\mathbf{T}_{\theta}-\mathbf{I})\mathbf{X}\|^{2}$ . For our framework, $E_{SR}\equiv 0$ .
2. Modeling Error: The error from imperfectly learning the ideal amortized inverse, $E_{\text{Modeling}}:=\|\mathbf{T}_{\theta}(\mathbf{V},\mathbf{Pa})-\mathbf{T}^{*}(\mathbf{V},\mathbf{Pa})\|^{2}$ .
3. Representational Error: The fundamental, irreducible error from the SCM's non-invertibility, $E_{\text{Rep}}:=\|\mathbf{T}^{*}(\mathbf{V},\mathbf{Pa})-\mathbf{U}_{\text{true}}\|^{2}$ .*
C.2.2 A Tighter Error Decomposition
We now present a tighter error bound for the non-invertible case.
**Theorem 21 (Tighter Counterfactual Error Bound)**
*Let the conditions of Theorem 8 hold ( $H_{\theta}$ is $L_{H}$ -Lipschitz). The expected squared error of the counterfactual prediction is bounded by:
$$
\mathbb{E}\left[\|\hat{\mathbf{X}}_{\alpha}-\mathbf{X}_{\alpha}^{\text{true}}\|^{2}\right]\leq 2\,\mathbb{E}[E_{SR}]+4L_{H}^{2}\,\mathbb{E}[E_{\text{Modeling}}]+4L_{H}^{2}\,\mathbb{E}[E_{\text{Rep}}]
$$*
Proof We decompose the total error $\|\hat{\mathbf{X}}_{\alpha}-\mathbf{X}_{\alpha}^{\text{true}}\|$ using the triangle inequality, with the ideal amortized inverse $\mathbf{T}^{*}$ as an intermediate step. The result follows from applying the nested inequality $(a+b+c)^{2}\leq 2a^{2}+2(b+c)^{2}\leq 2a^{2}+4b^{2}+4c^{2}$ , which produces the coefficients $2$ and $4$ , bounding the last two terms with the $L_{H}$ -Lipschitz property of $\mathbf{H}_{\theta}$ , and taking expectations.
C.2.3 Information-Theoretic Interpretation of Representational Error
The Representational Error ( $E_{\text{Rep}}$ ) is deeply connected to information conservation. The non-invertibility of the SCM $\mathbf{F}$ means that observing $(\mathbf{V},\mathbf{Pa})$ is not sufficient to uniquely determine $\mathbf{U}$ . Information-theoretically, this implies the conditional entropy $H(\mathbf{U}|\mathbf{V},\mathbf{Pa})$ is greater than zero.
**Remark 22 (Representational Error as Information Loss)**
*The ideal amortized inverse $\mathbf{T}^{*}$ yields the mode of the posterior $p(\mathbf{U}|\mathbf{V},\mathbf{Pa})$ . The expected representational error, $\mathbb{E}[E_{\text{Rep}}]$ , can be seen as the expected squared error of this MAP estimator, which is related to the variance and shape of the posterior. Thus, $E_{\text{Rep}}$ is a direct consequence of the information about $\mathbf{U}$ that is fundamentally lost in the forward causal process, a loss captured by $H(\mathbf{U}|\mathbf{V},\mathbf{Pa})>0$ .*
C.2.4 Theoretical Guarantee for the Mitigation Strategy
We now provide a theoretical justification for the "Prior-Matching Regularizer" proposed in § 7.1.
**Definition 23 (Prior-Matching Regularizer)**
*The regularizer is defined as $R(\mathbf{T}_{\theta})=\mathbb{E}_{(\mathbf{v},\mathbf{pa})}\left[\|\mathbf{s}_{p}(\mathbf{T}_{\theta}(\mathbf{v},\mathbf{pa}))\|^{2}\right]$ , where $\mathbf{s}_{p}(\mathbf{u})=\nabla_{\mathbf{u}}\log p(\mathbf{u})$ is the score function of the prior distribution $p(\mathbf{U})$ .*
**Proposition 24 (Regularizer Induces Convergence to MAP Solution)**
*Minimizing the regularizer $R(\mathbf{T}_{\theta})$ provides an inductive bias that encourages the encoder output, $\hat{\mathbf{u}}=\mathbf{T}_{\theta}(\mathbf{v},\mathbf{pa})$ , to lie on a mode of the prior distribution $p(\mathbf{U})$ .*
Proof [Proof Sketch] The objective is a form of score matching on the prior. The score function $\mathbf{s}_{p}(\mathbf{u})$ is zero if and only if $\mathbf{u}$ is a stationary point of the log-prior. Minimizing the expected squared norm of the score at the encoder's output penalizes the production of latent codes $\hat{\mathbf{u}}$ in low-probability regions of the prior. This incentivizes the encoder to map observations to the most probable latent code, thereby encouraging $\mathbf{T}_{\theta}$ to approximate the ideal MAP estimator $\mathbf{T}^{*}$ .
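As a concrete special case (a sketch, assuming a standard Gaussian prior $p(\mathbf{U})=\mathcal{N}(\mathbf{0},\mathbf{I})$, whose score is $\mathbf{s}_{p}(\mathbf{u})=-\mathbf{u}$), the regularizer reduces to the mean squared norm of the encoder outputs:

```python
import numpy as np

# Sketch of the Prior-Matching Regularizer for a standard Gaussian prior
# p(U) = N(0, I), whose score is s_p(u) = grad_u log p(u) = -u. In this case
# R(T_theta) reduces to the mean squared norm of the encoder outputs.
def prior_score(u):
    return -u  # score of N(0, I); swap in another prior's score as needed

def prior_matching_regularizer(u_hat):
    # u_hat: (batch, dim) array of encoder outputs T_theta(v, pa)
    return float(np.mean(np.sum(prior_score(u_hat) ** 2, axis=-1)))

rng = np.random.default_rng(0)
near_mode = 0.1 * rng.standard_normal((512, 4))  # codes near the prior mode
far_tail = 5.0 + rng.standard_normal((512, 4))   # codes in a low-density region
print(prior_matching_regularizer(near_mode) < prior_matching_regularizer(far_tail))  # True
```

Codes in low-probability regions of the prior incur a large penalty, matching the incentive structure described in the proof sketch.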
Appendix D Main Proofs for Theoretical Framework
This appendix provides the proofs for the main theoretical results presented in the text.
Proof [Proof of Theorem 8] This bound is a direct specialization of the more general bound for non-invertible SCMs derived in Theorem 21 (Appendix C).
For an invertible SCM, abduction is perfect, meaning the ideal amortized inverse $\mathbf{T}^{*}$ is the true inverse of the SCM function $\mathbf{F}$ . Consequently, the recovered noise is the true noise, $\mathbf{T}^{*}(\mathbf{V},\mathbf{Pa})=\mathbf{U}_{\text{true}}$ , which implies that the Representational Error is identically zero: $\mathbb{E}[E_{\text{Rep}}]=0$ .
In this context, the Latent Space Invariance Error, $E_{LSI}$ , becomes equivalent to the Modeling Error, $E_{\text{Modeling}}$ . Applying the inequality $(a+b)^{2}\leq 2a^{2}+2b^{2}$ , the general bound from Theorem 21 reduces to the two terms presented in Theorem 8.
Proof [Proof of Proposition 10] This proof relies on the identifiability of the true SCM (Theorem 2) and the Lipschitz continuity of the score network $\boldsymbol{\epsilon}_{\theta}$ , which guarantees unique ODE solutions via the Picard-Lindelöf theorem. We also assume standard integrability conditions (Fubini's theorem).
The encoder $\mathbf{T}_{\theta}$ maps an initial condition $\mathbf{x}(0)$ to the terminal state $\mathbf{x}(T)$ of the probability flow ODE (Eq. A.3). Let $\mathbf{x}_{\theta}(t;\mathbf{x}_{0})$ and $\mathbf{x}_{*}(t;\mathbf{x}_{0})$ denote the ODE solutions with the learned score $\boldsymbol{\epsilon}_{\theta}$ and the true score $\boldsymbol{\epsilon}^{*}$ , respectively. The Latent Space Invariance Error is $\mathbb{E}[E_{LSI}]=\mathbb{E}[\|\mathbf{x}_{\theta}(T;X)-\mathbf{x}_{\theta}(T;X_{\boldsymbol{\alpha}}^{\text{true}})\|^{2}]$ . By the triangle inequality:
$$
\|\mathbf{x}_{\theta}(T;X)-\mathbf{x}_{\theta}(T;X_{\boldsymbol{\alpha}}^{\text{true}})\|\leq\|\mathbf{x}_{\theta}(T;X)-\mathbf{x}_{*}(T;X)\|+\|\mathbf{x}_{*}(T;X)-\mathbf{x}_{*}(T;X_{\boldsymbol{\alpha}}^{\text{true}})\|+\|\mathbf{x}_{*}(T;X_{\boldsymbol{\alpha}}^{\text{true}})-\mathbf{x}_{\theta}(T;X_{\boldsymbol{\alpha}}^{\text{true}})\|
$$
The middle term is zero under ideal identifiability, as the true encoder $\mathbf{T}^{*}$ maps both an observation and its true counterfactual to the same underlying noise. We thus only need to bound terms of the form $\|\mathbf{x}_{\theta}(T;\mathbf{x}_{0})-\mathbf{x}_{*}(T;\mathbf{x}_{0})\|$ .
Let $\mathbf{z}(t)=\mathbf{x}_{\theta}(t;\mathbf{x}_{0})-\mathbf{x}_{*}(t;\mathbf{x}_{0})$ . Since the ODE vector field is Lipschitz, applying Grönwall's inequality to the differential of $\mathbf{z}(t)$ yields:
$$
\|\mathbf{z}(T)\|\leq\int_{0}^{T}e^{L_{f}(T-t)}C_{f}\|\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{*}(t),t)-\boldsymbol{\epsilon}^{*}(\mathbf{x}_{*}(t),t)\|dt
$$
where $L_{f}$ and $C_{f}$ are constants from the ODE coefficients. Squaring, taking expectations, and applying Jensenâs inequality leads to:
$$
\mathbb{E}[E_{LSI}]\leq C^{\prime}\cdot\mathbb{E}_{\mathbf{x},t}[\|\boldsymbol{\epsilon}_{\theta}-\boldsymbol{\epsilon}^{*}\|^{2}]
$$
where the final expectation is the score-matching loss.
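The Grönwall-type propagation above can be checked numerically on a toy one-dimensional flow; the vector fields and constants below are illustrative stand-ins for the true and learned score-driven ODEs, with the uniform perturbation `delta` playing the role of the score error:

```python
import numpy as np

# Toy check of the Gronwall bound: two Euler-integrated flows, one driven by a
# "true" field and one by a uniformly perturbed copy. f(x) = sin(x) is
# 1-Lipschitz, so L_f = 1, and the terminal gap must obey
#   |z(T)| <= int_0^T e^{L_f (T-t)} * delta dt = delta * (e^{L_f T} - 1) / L_f.
L_f, delta, T, n = 1.0, 1e-3, 2.0, 4000
dt = T / n
x_true = x_learn = 0.3
for k in range(n):
    x_true = x_true + dt * np.sin(x_true)
    x_learn = x_learn + dt * (np.sin(x_learn) + delta)
gap = abs(x_learn - x_true)
bound = delta * (np.exp(L_f * T) - 1) / L_f  # closed form of the Gronwall integral
print(gap <= bound)  # True
```

The realized gap sits well below the worst-case exponential envelope, since the bound assumes maximal error amplification along the whole trajectory.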
Appendix E Proofs for Theoretical Roles of Hybrid Training
E.1 Proof of Proposition 12 (Weighted Score-Matching)
This section provides a rigorous proof that the auxiliary task loss $L_{\text{task}}$ provides a lower bound on a weighted score-matching objective.
E.1.1 Preliminaries and Setup
Probability Flow ODE.
The generative process is the solution to the reverse-time probability flow ODE from $t=T$ to $t=0$ :
$$
d\mathbf{x}_{t}=f(t,\mathbf{x}_{t},\boldsymbol{\epsilon}(t,\mathbf{x}_{t}))dt,\quad\mathbf{x}_{T}\sim\mathcal{N}(0,\mathbf{I})
$$
where the vector field $f$ is determined by the diffusion scheduler.
Core Objects.
- $\boldsymbol{\epsilon}_{\theta},\boldsymbol{\epsilon}^{*}$ : Learned and true score functions.
- $\mathbf{x}_{t}^{\theta},\mathbf{x}_{t}^{*}$ : ODE trajectories driven by $\boldsymbol{\epsilon}_{\theta}$ and $\boldsymbol{\epsilon}^{*}$ .
- $\mathbf{x}_{0}^{\theta},\mathbf{x}_{0}^{*}$ : Generated and true (counterfactual) data points at $t=0$ .
- $g:\mathbb{R}^{d}\to\mathbb{R}^{k}$ : Downstream prediction function.
- $Y_{\text{pred}}=g(\mathbf{x}_{0}^{\theta})$ , $Y_{\text{true}}=g(\mathbf{x}_{0}^{*})$ .
- $L_{\text{task}}=\mathbb{E}_{\mathbf{x}_{T}}[\|Y_{\text{pred}}-Y_{\text{true}}\|^{2}]$ .
Assumptions.
We assume standard regularity conditions for the proof:
1. The vector field $f(t,\mathbf{x},\boldsymbol{\epsilon})$ is Lipschitz continuous in $\mathbf{x}$ and $\boldsymbol{\epsilon}$ .
2. The downstream task function $g(\mathbf{x})$ is Lipschitz continuous and differentiable.
3. The learned score $\boldsymbol{\epsilon}_{\theta}$ and true score $\boldsymbol{\epsilon}^{*}$ are well-behaved.
E.1.2 Formal Proposition Statement
**Proposition 25 (Hybrid Objective as a Weighted Score-Matching Regularizer)**
*Under the regularity assumptions (A1-A3), the auxiliary task loss $L_{\text{task}}$ provides a lower bound on a weighted score-matching objective:
$$
L_{\text{task}}\geq C\cdot\mathbb{E}_{\mathbf{x}_{T},t}\left[w(\mathbf{x}_{t}^{*})\cdot\|\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{t}^{*})-\boldsymbol{\epsilon}^{*}(\mathbf{x}_{t}^{*})\|^{2}\right]
$$
where $C>0$ and the weight function $w(\mathbf{x}_{t}^{*})$ measures the sensitivity of the final prediction $Y$ to score perturbations along the ideal data generation trajectory.*
Proof The proof proceeds in three main steps.
Step 1: Bounding Sample Error by Score Error (Error Propagation).
Let $\mathbf{z}(t)=\mathbf{x}_{t}^{\theta}-\mathbf{x}_{t}^{*}$ be the error between the two ODE trajectories. By linearizing the vector field $f$ around the true trajectory, we can analyze the impact of the score perturbation $\Delta\boldsymbol{\epsilon}_{t}=\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{t}^{*})-\boldsymbol{\epsilon}^{*}(\mathbf{x}_{t}^{*})$ . From Duhamel's Principle, the final sample error $\Delta\mathbf{x}_{0}=\mathbf{z}(0)$ can be expressed as an integral over the perturbation:
$$
\Delta\mathbf{x}_{0}=\int_{T}^{0}\mathbf{K}(s)\Delta\boldsymbol{\epsilon}_{s}ds
$$
where the kernel $\mathbf{K}(s)$ captures the influence of a score perturbation at time $s$ on the final sample at time $0$ .
Step 2: Linearizing the Task Error.
The error in the downstream prediction is $\Delta Y=g(\mathbf{x}_{0}^{\theta})-g(\mathbf{x}_{0}^{*})$ . A first-order Taylor expansion gives:
$$
\Delta Y\approx\nabla_{\mathbf{x}}g(\mathbf{x}_{0}^{*})\cdot\Delta\mathbf{x}_{0}
$$
The task loss is the expected squared norm, $L_{\text{task}}=\mathbb{E}[\|\Delta Y\|^{2}]$ .
Step 3: Deriving the Weighted Relationship.
Substituting the integral form of $\Delta\mathbf{x}_{0}$ into the task error approximation and applying the Cauchy-Schwarz inequality for integrals yields a lower bound for the task loss:
$$
L_{\text{task}}\geq C\cdot\mathbb{E}_{\mathbf{x}_{T}}\left[\int_{0}^{T}\left\|\mathcal{W}(\mathbf{x}_{0}^{*},s)\right\|_{F}^{2}\cdot\|\Delta\boldsymbol{\epsilon}_{s}\|^{2}ds\right]
$$
where $\|\cdot\|_{F}$ is the Frobenius norm and the influence operator $\mathcal{W}(\mathbf{x}_{0}^{*},s)$ captures the end-to-end sensitivity from a score perturbation at time $s$ to the final prediction. Rewriting the expectation gives the final form:
$$
L_{\text{task}}\geq C\cdot\mathbb{E}_{t\sim U[0,T]}\mathbb{E}_{\mathbf{x}_{t}^{*}}\left[w(\mathbf{x}_{t}^{*})\cdot\|\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{t}^{*})-\boldsymbol{\epsilon}^{*}(\mathbf{x}_{t}^{*})\|^{2}\right]
$$
where the weight function is $w(\mathbf{x}_{t}^{*}):=T\cdot\mathbb{E}_{\mathbf{x}_{T}|\mathbf{x}_{t}^{*}}[\|\mathcal{W}(\mathbf{x}_{0}^{*},t)\|_{F}^{2}]$ .
Interpretation and Conclusion.
The weight function $w(\mathbf{x}_{t}^{*})$ is large when the gradient norm $\|\nabla_{\mathbf{x}}g(\mathbf{x}_{0}^{*})\|$ is large, i.e., in causally salient regions where the outcome is highly sensitive to the features. The inequality therefore shows that minimizing $L_{\text{task}}$ provides a lower bound on a score-matching error that is weighted to prioritize accuracy in these causally salient regions.
E.2 Argument for Proposition 13 (Latent Space Disentanglement)
This section provides a qualitative, information-theoretic argument for Proposition 13, showing how the hybrid objective encourages a "division of labor" that promotes disentanglement. The SCM posits that an observation $V$ is determined by $(\mathbf{Pa},U)$ , and the encoder maps $(V,\mathbf{Pa})$ to a latent code $Z=\mathbf{T}_{\theta}(V,\mathbf{Pa})$ , where a perfect encoder would yield $Z=U$ . The process works as follows: the diffusion loss ( $L_{\text{diffusion}}$ ) maximizes the log-likelihood $\log p_{\theta}(V|\mathbf{Pa})$ , forcing the pair $(\mathbf{Pa},Z)$ to contain all information to reconstruct $V$ , thus maximizing the mutual information $I(V;(\mathbf{Pa},Z))$ . Simultaneously, the task loss ( $L_{\text{task}}$ ) learns a prediction from the parents, capturing all predictive information that $\mathbf{Pa}$ has about $V$ and thus maximizing $I(V;\mathbf{Pa})$ . The dual objective must satisfy both constraints. From the chain rule of mutual information, we know $I(V;(\mathbf{Pa},Z))=I(V;\mathbf{Pa})+I(V;Z|\mathbf{Pa})$ . Since $L_{\text{diffusion}}$ maximizes the left-hand side and $L_{\text{task}}$ captures the first term on the right, the optimization incentivizes the latent code $Z$ to model the remaining information, $I(V;Z|\mathbf{Pa})$ . This leads to a connection to disentanglement: the ideal exogenous noise $U$ is, by definition, independent of $\mathbf{Pa}$ . By forcing $Z$ to model the residual information, the optimization process actively encourages the learned representation $Z$ to be independent of $\mathbf{Pa}$ . In summary, the hybrid objective creates a division of labor: the task-specific head explains the variance from $\mathbf{Pa}$ , while the diffusion process's latent code $Z$ models the residual.
This residual is, by construction, the information in $V$ orthogonal to $\mathbf{Pa}$ , forcing $Z$ to be an empirical approximation of the true, disentangled exogenous noise $U$ , thereby serving the identifiability condition of Theorem 2.
Appendix F Proof of Theorem on Causal Transportability
This appendix provides the proof for the theorem on lossless causal transportability.
Proof [Proof of Theorem 17] This proof operates under an idealized setting, assuming a perfectly trained model (i.e., zero statistical error), to isolate the structural properties of transportability. We assume the learned decoder $\mathbf{H}_{\theta_{i}}$ is identical to the true mechanism $\mathbf{F}_{i}$ , and its encoder $\mathbf{T}_{\theta_{i}}$ is its perfect inverse.
Let the source and target SCMs be $\mathcal{M}^{\mathcal{S}}$ and $\mathcal{M}^{\mathcal{T}}$ , with mechanisms $\{\mathbf{F}_{i}\}$ and $\{\mathbf{F}^{\prime}_{i}\}$ respectively. The set of changed mechanisms is indexed by $\mathcal{K}_{\text{changed}}$ .
The proof relies on the modularity of the SCM, which is guaranteed by the mutually independent exogenous noises (Condition ii of the theorem). This independence ensures that a change in one mechanism $\mathbf{F}_{k}$ to $\mathbf{F}^{\prime}_{k}$ does not affect the conditional distributions of other nodes $V_{j}$ ( $j\neq k$ ), given their parents.
We analyze the transportability of operators for each mechanism:
1. For invariant mechanisms ( $j\notin\mathcal{K}_{\text{changed}}$ ): By definition, the true mechanism is unchanged, $\mathbf{F}^{\prime}_{j}=\mathbf{F}_{j}$ . Since the model operators $(\mathbf{T}_{\theta_{j}},\mathbf{H}_{\theta_{j}})$ perfectly learned $\mathbf{F}_{j}$ in the source domain, they remain valid for the target domain $\mathcal{T}$ and can be directly reused.
2. For changed mechanisms ( $k\in\mathcal{K}_{\text{changed}}$ ): The original operators $(\mathbf{T}_{\theta_{k}},\mathbf{H}_{\theta_{k}})$ are now invalid as they model $\mathbf{F}_{k}$ , not the new mechanism $\mathbf{F}^{\prime}_{k}$ . However, a new operator pair $(\mathbf{T}^{\prime}_{\theta_{k}},\mathbf{H}^{\prime}_{\theta_{k}})$ can be learned from target domain data. This training only requires samples $(v^{\prime}_{k},\mathbf{pa}^{\prime}_{k})$ from domain $\mathcal{T}$ and the shared noise distribution $p_{k}(U_{k})$ (Condition i). Because the noises are independent, this re-learning process for mechanism $k$ is modular and does not affect the other invariant mechanisms.
The procedure for adapting the model is therefore modular: freeze all invariant operators $\{(\mathbf{T}_{\theta_{j}},\mathbf{H}_{\theta_{j}})\}_{j\notin\mathcal{K}_{\text{changed}}}$ and re-train only those for the changed mechanisms $\{k\in\mathcal{K}_{\text{changed}}\}$ on target domain data. The resulting adapted model is valid for the target domain $\mathcal{T}$ and can perform abduction on a target individual by applying the correct (reused or re-trained) encoders, thereby losslessly recovering the full vector of exogenous noises. This fulfills the condition for lossless transport.
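The freeze-and-retrain procedure can be sketched in a few lines; the model registry and `fit_mechanism` routine below are hypothetical stand-ins for training one $(\mathbf{T}_{\theta_{k}},\mathbf{H}_{\theta_{k}})$ operator pair on target-domain data:

```python
# Minimal sketch of the modular adaptation procedure: reuse every invariant
# operator pair and re-fit only the mechanisms in K_changed on target data.
# `fit_mechanism` is a hypothetical per-mechanism training routine.
def adapt_to_target(source_models, changed_mechanisms, fit_mechanism, target_data):
    target_models = dict(source_models)       # freeze all invariant operators
    for k in changed_mechanisms:              # re-learn only F'_k, k in K_changed
        target_models[k] = fit_mechanism(k, target_data)
    return target_models

# Usage with placeholder operator labels standing in for (T_theta, H_theta) pairs:
source = {"X1": "ops_X1", "X2": "ops_X2", "Y": "ops_Y"}
adapted = adapt_to_target(source, {"Y"}, lambda k, data: f"ops_{k}_retrained", None)
```

Only the changed mechanism is touched; the source model is left intact, reflecting the modularity guaranteed by the independent exogenous noises.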
Appendix G Proof of the Specific Finite Sample Bound (Theorem 15)
This derivation combines standard generalization bounds with known complexity bounds for deep neural networks. We first state a key lemma regarding the complexity of neural networks.
**Lemma 26 (Rademacher Complexity of Neural Networks)**
*Let $\mathcal{F}_{\epsilon}$ be the function class of an $L$ -layer MLP with ReLU activations, input dimension $p$ , and weight matrices $\{\mathbf{W}_{j}\}_{j=1}^{L}$ . Assume the input data $\mathbf{X}$ is contained within a ball of radius $R_{X}$ . If the spectral norm of each weight matrix is bounded, $\|\mathbf{W}_{j}\|_{2}\leq B_{j}$ , the Rademacher complexity of the function class is bounded by:
$$
\mathfrak{R}_{n}(\mathcal{F}_{\epsilon})\leq C_{net}\frac{R_{X}L\sqrt{p}}{\sqrt{n}}\left(\prod_{j=1}^{L}B_{j}\right)
$$
For simplicity and under normalization, we often consider $R_{X}=1$ . This result is a simplified form derived from (bartlett2017spectrally; neyshabur2018pac), where $C_{net}$ is a universal constant.*
Proof The proof proceeds by combining the standard excess risk bound with the two lemmas above. First, from standard learning theory, the excess risk is bounded by the Rademacher complexity of the total loss function class:
$$
R(\hat{\theta}_{n})-R(\theta^{*})\leq 4\mathfrak{R}_{n}(\mathcal{F}_{\mathcal{L}_{SCM}})+M\sqrt{\frac{\log(1/\delta)}{2n}}
$$
Next, by the sub-additivity property of Rademacher complexity, we decompose the SCM's complexity into the sum of its individual components:
$$
\mathfrak{R}_{n}(\mathcal{F}_{\mathcal{L}_{SCM}})\leq\sum_{i=1}^{d}\mathfrak{R}_{n}(\mathcal{F}_{\mathcal{L}_{i}})
$$
The next step is to relate the loss complexity to the network complexity. The score-matching loss for mechanism $i$ is $\mathcal{L}_{i}=\|\boldsymbol{\epsilon}-\epsilon_{\theta_{i}}(\cdot)\|^{2}$ . Since this loss is Lipschitz with respect to the output of $\epsilon_{\theta_{i}}$ , its Rademacher complexity is upper-bounded by a constant multiple of the complexity of the score network's function class $\mathcal{F}_{\epsilon_{i}}$ via Talagrand's contraction lemma, i.e., $\mathfrak{R}_{n}(\mathcal{F}_{\mathcal{L}_{i}})\leq C_{1}\cdot\mathfrak{R}_{n}(\mathcal{F}_{\epsilon_{i}})$ . We then apply Lemma 26 to bound the complexity of each score network. The input dimension $p_{i}$ for mechanism $i$ is $p_{i}=\text{dim}(\mathbf{x}_{t})+\text{dim}(\mathbf{pa}_{i})+\text{dim}(\text{time embedding})$ . Assuming univariate variables, this is $p_{i}=1+|\mathbf{Pa}_{i}|+d_{embed}$ . Let $d_{in}^{max}=\max_{i}|\mathbf{Pa}_{i}|$ , so the maximum input dimension is $p_{max}=1+d_{in}^{max}+d_{embed}$ . Assuming all networks have depth $L$ and a uniform spectral norm bound $B$ , we have:
$$
\mathfrak{R}_{n}(\mathcal{F}_{\epsilon_{i}})\leq C_{net}\frac{L\sqrt{p_{max}}}{\sqrt{n}}B^{L}=C_{net}\frac{L\sqrt{1+d_{in}^{max}+d_{embed}}}{\sqrt{n}}B^{L}
$$
Finally, we assemble the complete bound by substituting all components back into the initial inequality:
$$
R(\hat{\theta}_{n})-R(\theta^{*})\leq 4\sum_{i=1}^{d}\mathfrak{R}_{n}(\mathcal{F}_{\mathcal{L}_{i}})+M\sqrt{\frac{\log(1/\delta)}{2n}}\leq 4C_{1}C_{net}\,d\,\frac{L\sqrt{1+d_{in}^{max}+d_{embed}}}{\sqrt{n}}B^{L}+M\sqrt{\frac{\log(1/\delta)}{2n}}
$$
By defining $C=4C_{1}C_{net}$ as a generic constant independent of the network architecture and sample size, we arrive at the final form stated in the theorem.
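To make the scaling behavior of this bound concrete, the sketch below evaluates it for illustrative placeholder constants (none of the values are fitted quantities); since both the complexity and confidence terms decay as $O(1/\sqrt{n})$, quadrupling the sample size halves the bound:

```python
import math

# Numerical sketch of the final excess-risk bound of Theorem 15:
#   C * d * L * sqrt(p_max) * B**L / sqrt(n) + M * sqrt(log(1/delta) / (2n)).
# All constants below (C, M, d, L, p_max, B, delta) are illustrative placeholders.
def excess_risk_bound(n, d=5, L=3, p_max=34, B=1.2, M=1.0, delta=0.05, C=1.0):
    complexity = C * d * L * math.sqrt(p_max) * (B ** L) / math.sqrt(n)
    confidence = M * math.sqrt(math.log(1 / delta) / (2 * n))
    return complexity + confidence

# Both terms scale as O(1/sqrt(n)): quadrupling n halves the whole bound.
print(math.isclose(excess_risk_bound(4000), excess_risk_bound(1000) / 2))  # True
```

This also makes visible how the bound deteriorates exponentially in depth through $B^{L}$ when $B>1$, while growing only linearly in the number of mechanisms $d$.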
Appendix H Formal Analysis of the Geometric Inductive Bias
This section provides the formal argument for Proposition 3, demonstrating how the score-matching objective, under a simplicity bias, compels the learned generative map to adopt the local data geometry, yielding a parsimonious and well-behaved transformation.
H.1 Preliminaries and Definitions
Let $\mathcal{M}\subset\mathbb{R}^{d}$ be the data manifold with a smooth probability density $p(\mathbf{x})$ . The true score function is the vector field $\mathbf{s}^{*}(\mathbf{x}):=\nabla_{\mathbf{x}}\log p(\mathbf{x})$ . We learn a parameterized score network $\mathbf{s}_{\theta}(\mathbf{x})$ by minimizing the score-matching objective $L_{SM}(\theta)=\mathbb{E}_{\mathbf{x}\sim p(\mathbf{x})}[\|\mathbf{s}_{\theta}(\mathbf{x})-\mathbf{s}^{*}(\mathbf{x})\|^{2}]$ .
The generative process is described by the probability flow ODE, whose vector field $\mathbf{f}(\mathbf{x},t)$ is a function of the score. The map $H_{\theta}:\mathbb{R}^{d}\to\mathcal{M}$ is the flow map of this ODE integrated from $t=1$ to $t=0$ . A map is conformal if its Jacobian is a scaled orthogonal matrix, and affine if it is a linear transformation plus a translation.
H.2 Assumptions
**Assumption 4 (Smoothness of the Data Density)**
*The true data density $p(\mathbf{x})$ is at least twice continuously differentiable ( $C^{2}$ ) on $\mathcal{M}$ .*
**Assumption 5 (Adoption of the Simplicity Bias Principle)**
*Our analysis relies on the principle of implicit regularization: the conjecture that optimizers like SGD favor solutions with a simplicity bias. We formalize this as a preference for score functions $\mathbf{s}_{\theta}$ with lower complexity (e.g., lower Dirichlet Energy or smaller spectral norms) (hochreiter1997flat; neyshabur2018pac).*
H.3 Proof of Proposition 3
The proof proceeds in four steps:
Step 1: The Irrotational Property of the True Score Field. By definition, the true score $\mathbf{s}^{*}(\mathbf{x})$ is the gradient of the scalar potential $\log p(\mathbf{x})$ . By a fundamental theorem of vector calculus, the curl of any gradient field is zero. Thus, $\mathbf{s}^{*}$ is an irrotational (or conservative) vector field.
Step 2: The Variational Perspective of Score Matching. The simplicity bias (Assumption 5) reinforces the irrotational nature of the learned field $\mathbf{s}_{\theta}$ . This is understood via the Helmholtz decomposition, which splits any vector field into irrotational (curl-free) and solenoidal (divergence-free) components. The simplicity bias conjectures that the optimization preferentially minimizes the energy of the solenoidal component, driving the learned field $\mathbf{s}_{\theta}$ towards a purely irrotational solution ( $\nabla\times\mathbf{s}_{\theta}\approx\mathbf{0}$ ) to match the conservative nature of the true score $\mathbf{s}^{*}$ .
Step 3: The Geometry of the Score Field under Local Structural Assumptions. We analyze the score field's structure under two local geometric cases for the density $p(\mathbf{x})$ in a region $\mathcal{R}$ . Case (i): Local Isotropy. If $p(\mathbf{x})$ is locally spherically symmetric around a center $\mathbf{c}$ , its value depends only on the radius $r=\|\mathbf{x}-\mathbf{c}\|$ . The gradient $\mathbf{s}^{*}=\nabla\log p(\mathbf{x})$ must point along the radial direction, making the true score a radial vector field. The learned field converges to this simple structure. Case (ii): Local Ellipsoidal Structure. If the iso-contours of $p(\mathbf{x})$ are concentric ellipsoids, the gradient $\mathbf{s}^{*}$ must be orthogonal to these ellipsoidal surfaces at every point. Such a field is still irrotational but no longer radial.
Step 4: The Geometry of the Flow Map. The ODE vector field $\mathbf{f}(\mathbf{x},t)$ is a linear combination of the score field $\mathbf{s}_{\theta^{*}}(\mathbf{x})$ and the position vector $\mathbf{x}$ . The geometry of the resulting flow map $H_{\theta^{*}}$ depends on this vector field's geometry. Result for Case (i): In the isotropic case, $\mathbf{f}(\mathbf{x},t)$ is also a radial vector field. A flow generated by a radial vector field is, by its rotational symmetry, necessarily a conformal map, as it preserves angles. Result for Case (ii): In the ellipsoidal case, the flow must transform the isotropic latent space into the anisotropic data space. The simplest such transformation is a local affine transformation, involving direction-dependent scaling and rotation. While this geometric argument is standard, a fully rigorous proof would require directly computing the Jacobian of the integrated flow map $H_{\theta^{*}}$ to verify it satisfies the required mathematical conditions. The model's inductive bias thus compels it to learn the most parsimonious geometric transformation necessary to explain the local data geometry, promoting the well-behaved, locally invertible mappings that are crucial for abduction.
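The conformality claim in Case (i) can be sanity-checked numerically. In the toy sketch below (an illustrative stand-in, not the learned model), the density is a standard Gaussian, so the score field is the radial field $\mathbf{s}(\mathbf{x})=-\mathbf{x}$, and the finite-difference Jacobian of the integrated flow comes out as a scaled identity:

```python
import numpy as np

# Toy check of the isotropic case: for a standard Gaussian density the score
# field is radial, s(x) = -x. Integrating dx/dt = s(x) for time tau gives the
# scaling map x -> exp(-tau) * x, whose Jacobian is a scaled identity matrix
# (a conformal map: it preserves angles).
def flow(x0, tau=0.5, n=2000):
    x, dt = x0.astype(float).copy(), tau / n
    for _ in range(n):
        x = x + dt * (-x)  # Euler step of the radial score field
    return x

x0 = np.array([1.0, -2.0, 0.5])
h = 1e-5
# central finite-difference Jacobian of the flow map at x0
J = np.stack([(flow(x0 + h * e) - flow(x0 - h * e)) / (2 * h) for e in np.eye(3)],
             axis=1)
print(np.allclose(J, np.exp(-0.5) * np.eye(3), atol=1e-3))  # True
```

An anisotropic (ellipsoidal) density would instead yield a diagonal Jacobian with distinct entries, i.e., the local affine map of Case (ii).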
Appendix I Experimental Details
This appendix provides comprehensive details for all experiments presented in Section 5, ensuring full transparency and reproducibility.
I.1 General Setup
Software Environment.
All experiments were conducted in a unified software environment to ensure full reproducibility. Key library versions used were: dowhy (0.12), econml (0.16.0), numpy (1.26.4), pandas (1.3.5), scikit-learn (1.6.1), torch (1.10.0), lightgbm (4.6.0), and networkx (3.2.1). All stochastic processes were controlled with a fixed global random seed (42), except for the ensemble runs, which used a set of distinct seeds for training.
Baseline Estimator Configurations.
All baseline estimators were implemented using the dowhy library. For machine learning-based methods like Causal Forest and DML, we utilized their robust implementations from the econml library. To ensure a fair and reproducible comparison against established benchmarks, we used their default hyperparameter settings, which are widely recognized and have been optimized by the library authors to provide strong performance across a broad range of tasks. Our proposed model, in turn, underwent a systematic grid search; the final parameters and a detailed analysis are provided in Appendix I.2.
Details for PSM Failure Scenario (Act I).
The Data Generation Process (DGP) for this experiment ( $N=5000$ , true ATE $\tau=5000$ ) follows the graph in Figure 3(a) and is defined by:
$$
W_{1},W_{2}\sim\mathcal{N}(0,1)
$$
with exogenous noises $U_{T}\sim\text{Logistic}(0,1)$ and $U_{Y}\sim\mathcal{N}(0,6000^{2})$ .
Details for Lalonde Benchmark (Act I).
To evaluate performance on real-world data, we used the canonical Lalonde dataset, analyzing the effect of the NSW job training program (treat) on 1978 real earnings (re78). The assumed causal structure is the standard confounding model (Figure 3(c)), a DAG structure that reflects the broad consensus in the causal inference community for this benchmark. In line with our Targeted Modeling principle, we applied the expressive CausalDiffusionModel only to the key treat and re78 nodes, modeling all confounders non-parametrically via their empirical distributions.
Details for Semi-Synthetic Analysis (Act II).
This dataset uses the real-world covariates from the Lalonde dataset as a foundation. The outcome $Y$ and the ground-truth Individual Treatment Effect ( $ITE_{\text{true}}$ ) are then synthetically generated according to the following structural equations:
$$
Y_{\text{base}}=2\cdot X_{\text{re74}}+1.5\cdot X_{\text{re75}}+100\cdot X_{\text{educ}}-50\cdot X_{\text{age}}
$$
where $U_{\text{base}}\sim\mathcal{N}(0,500^{2})$ and $\mu_{\text{re74}}$ is the mean of the re74 covariate.
Details for the Stress Test on Non-Invertible SCMs (Act IV).
This experiment's primary objective is to evaluate the framework's robustness when the core theoretical assumption of SCM invertibility is explicitly violated. The DGP ( $N=2000$ ) is defined as follows:
$$
W\sim\mathcal{U}(-2,2)
$$
The true ATE is exactly $\tau=5.0$ . The SCM was structured with W as an EmpiricalDistribution, and T and Y as CausalDiffusionModel.
Details for the Ablation Study (Act IV).
The ablation study was conducted on a challenging synthetic mediation dataset ( $N=4000$ ) designed to highlight the benefits of generative models. The causal graph is shown in Figure 3(b), with the following data generation process:
$$
X_{1}\sim\mathcal{N}(0,1),\quad X_{2}\sim\mathcal{U}(-2,2),\quad Z\sim\text{Bernoulli}(0.5)
$$
where $\sigma(\cdot)$ is the sigmoid function, $U_{T}\sim\text{Logistic}(0,0.3)$ , $U_{M}\sim\mathcal{N}(0,1.5^{2})$ , and $U_{Y}$ is drawn from a Gaussian Mixture Model conditioned on $Z$ . The true ATE for this DGP is approximately $202.29$ . The four configurations compared in the ablation are:
- BELM-MDCM (Full Model): The complete proposed framework. The nodes T, M, and Y are all modeled by a CausalDiffusionModel with sampler_type='belm' and a hybrid objective weight of $\lambda=5.0$ .
- w/o Analytical Invertibility: Identical to the full model, but the sampler_type for all diffusion models was set to 'ddim'.
- w/o Hybrid Objective: Identical to the full model, but the hybrid objective weight $\lambda$ for all diffusion models was set to $0.0$ .
- w/o Targeted Modeling: The key mediator node M was modeled with a simpler gcm.AdditiveNoiseModel (backed by an LGBMRegressor), while T and Y remained CausalDiffusionModels with $\lambda=5.0$ and sampler_type='belm'.
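For reference, these four configurations can be summarized as plain config dicts (a hedged sketch: the keyword names are illustrative shorthand, not the actual CausalDiffusionModel constructor API):

```python
# Hedged sketch: the four ablation variants as plain config dicts. The keys
# "sampler_type" and "hybrid_weight" are illustrative shorthand for the
# sampler choice and the hybrid objective weight lambda, not real API names.
FULL = {"model": "CausalDiffusionModel", "sampler_type": "belm", "hybrid_weight": 5.0}

def variant(**overrides):
    # apply the same per-node config to the three diffusion-modeled nodes
    return {node: {**FULL, **overrides} for node in ("T", "M", "Y")}

ABLATIONS = {
    "full_model": variant(),
    "wo_analytical_invertibility": variant(sampler_type="ddim"),
    "wo_hybrid_objective": variant(hybrid_weight=0.0),
    "wo_targeted_modeling": {**variant(),
                             "M": {"model": "AdditiveNoiseModel",
                                   "regressor": "LGBMRegressor"}},
}
```

Each variant differs from the full model in exactly one component, which is what lets the ablation attribute performance changes to that component alone.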
I.2 Model Hyperparameter Justification
The hyperparameters for our BELM-MDCM model, reported in Table 9, were identified through a systematic grid search for each experiment. The variation in these parameters reflects principled adaptations to different data characteristics, guided by the trade-off between generative fidelity and discriminative accuracy, as well as the signal-to-noise ratio (SNR) of the underlying causal relationships. Below, we provide a holistic analysis for each scenario.
I.2.1 Hyperparameter Search Details
To ensure full reproducibility, we detail the hyperparameter search space used to arrive at the configurations in Table 9. The final values for each experiment were selected based on the best performance on a held-out validation set, typically comprising 20-30% of the training data. The primary selection criterion was the lowest PEHE score for experiments with a ground-truth ITE (e.g., Semi-Synthetic), or the best balance of low absolute ATE error and low estimation variance for observational data scenarios. For the Lalonde experiment in particular, we prioritized a configuration that demonstrated superior stability, as detailed in the analysis below.
Table 8: Hyperparameter Search Space and Selection Criteria.
| Hyperparameter | Search Space | Selection Criterion |
| --- | --- | --- |
| Hybrid Weight $\lambda$ | {0.0, 0.1, 0.3, 1.0, 2.0, 5.0, 10.0} | Best balance of low error and low variance on the validation set. |
| Guidance Weight $w$ | {0.0, 0.1, 0.2, 0.3, 1.0, 5.0, 10.0} | |
| Diffusion Timesteps $T$ | {50, 100, 200, 500} | |
| Learning Rate | {5e-5, 1e-4, 1.1e-4, 1.2e-4, 2e-4} | |
Table 9: Consolidated BELM-MDCM Hyperparameters for All Key Experiments. Column headers correspond to the following experiments: PSM (PSM Failure), Lalonde, Semi-Synth (Semi-Synthetic), Ablation (Ablation Study), and Stress Test (Act IV).
| Hyperparameter | PSM | Lalonde | Semi-Synth | Ablation | Stress Test |
| --- | --- | --- | --- | --- | --- |
| Number of Epochs | 1500 | 1000 | 1200 | 700 | 500 |
| Batch Size | 128 | 64 | 64 | 128 | 128 |
| Hidden Dimension | 512 | 512 | 768 | 768 | 256 |
| Learning Rate | $1\times 10^{-4}$ | $1\times 10^{-4}$ | $1.1\times 10^{-4}$ | $1\times 10^{-4}$ | $1\times 10^{-4}$ |
| Diffusion Timesteps ( $T$ ) | 200 | 200 | 50 | 200 | 200 |
| Hybrid Weight $\lambda$ | 0.1 | 2.0 | 2.0 | 5.0 | 0.5 |
| Guidance Weight ( $w$ ) | 0.0 | 1.0 | 0.1 | 0.2 | 0.0 |
Scenario 1: The "Perfectionist Learner" (PSM Failure Experiment)
Data Profile: This synthetic dataset features complex, non-linear functions (e.g., sin, cos) but is characterized by a pure, high-SNR signal. The main challenge is to perfectly learn this intricate generative process.
Hyperparameter Strategy: The strategy prioritizes generative modeling. A low hybrid weight ( $\lambda=0.1$ ) directs the model to focus on the diffusion loss. Since the conditional signal is strong, classifier-free guidance is disabled ( $w=0.0$ ). A large model capacity and extended training are employed to capture the ground-truth functions.
Scenario 2: The "Pragmatic Signal Extractor" (Lalonde Experiment)
Data Profile: This real-world data is characterized by a weak, noisy signal and a small sample size. The primary challenge is to robustly extract the causal signal.
Hyperparameter Strategy: The priority shifts from pure accuracy to a balance of accuracy and stability. A high hybrid weight ($\lambda=2.0$) provides a strong inductive bias towards the predictive task. Crucially, our systematic grid search revealed that high guidance weights ($w>1.0$) dramatically increased estimation variance, leading to unreliable results. We therefore selected a moderate guidance weight of $w=1.0$. This configuration provides a stabilizing effect, guiding the model towards the causal signal without amplifying noise, thereby achieving the optimal balance between accuracy and the robustness required for reliable inference on real-world data.
Scenario 3: The "Precision Artist" (Semi-Synthetic Experiment)
Data Profile: This setup presents a hybrid-SNR environment, using noisy real-world covariates but a pure, synthetic outcome function. The task demands high precision.
Hyperparameter Strategy: The diffusion timesteps are reduced ($T=50$), plausibly because a shorter generative path better preserves the fine-grained details of the synthetic ITE function. The hybrid weight is high ($\lambda=2.0$) to focus on the estimation task, while guidance is light ($w=0.1$) as the ITE signal is cleaner than in the purely observational Lalonde case.
Scenario 4: The "Component Analyst" (Ablation Study)
Data Profile: This is a complex, synthetic mediation structure with a clean, high-SNR signal, specifically designed to isolate the performance impact of individual framework components.
Hyperparameter Strategy: The strategy is to create a strong, stable baseline. A high hybrid weight ($\lambda=5.0$) heavily orients the model towards the predictive task, making any performance degradation from ablations more pronounced. A large model capacity (Hidden Dim=768) ensures the model can learn the complex functions, while light guidance ($w=0.2$) provides a minor stabilizing effect.
Scenario 5: The "Robustness Tester" (Stress Test)
Data Profile: A simple DGP featuring a non-invertible causal mechanism ($Y\propto U^{2}$), designed to test the framework's behavior when its core assumption is violated.
Hyperparameter Strategy: The goal is stable learning of a simple function. A moderate hybrid weight ($\lambda=0.5$) balances generative and predictive learning, while a smaller model (Hidden Dim=256) is sufficient for the simpler DGP and helps prevent overfitting. Guidance is turned off ($w=0.0$) as the signal is clean.
Holistic Comparison
Table 10: Summary of Adaptive Hyperparameter Strategies.
| Data Signal | Pure & Complex | Noisy & Weak | Hybrid & Complex | Pure & Mediated | Pure & Non-Invertible |
| --- | --- | --- | --- | --- | --- |
| Guidance ($w$) | 0.0 (Off) | 1.0 (Balanced Guidance) | 0.1 (Slight Nudge) | 0.2 (Light) | 0.0 (Off) |
| Timesteps ($T$) | 200 (Standard) | 200 (Standard) | 50 (Short Path) | 200 (Standard) | 200 (Standard) |
| Hybrid ($\lambda$) | 0.1 (Generative) | 2.0 (Predictive) | 2.0 (Predictive) | 5.0 (Strongly Predictive) | 0.5 (Balanced) |
| Core Strategy | Perfect Generation | Robust Prediction | Precision Estimation | Component Isolation | Stable Learning |
In conclusion, the variability in hyperparameters across experiments demonstrates the flexibility of our framework. The model can deploy different tools, such as prioritizing the generative loss or amplifying weak signals with guidance, to adapt to the specific challenges posed by diverse causal inference problems.
I.3 CMF Metric Implementation Details
To ensure the rigor and reproducibility of our evaluation, this section provides the implementation details for the CMF scores.
CMI-Score Estimation Method.
For Conditional Mutual Information (CMI) estimation, we adopt the widely used k-nearest-neighbors (k-NN) estimator of Kraskov et al. (kraskov2004estimating). This method is chosen for its strong theoretical properties and practical robustness, particularly for continuous and high-dimensional data, as it avoids explicit density estimation. Its key advantages include:
- Non-parametric: It makes no assumptions about the underlying data distributions.
- Data-adaptive: The estimation is based on local distances, adapting to the geometry of the data manifold.
- Robustness: It is more robust than methods that rely on fixed binning, which can be sensitive to bin size.
Following common practice, we set the number of neighbors to $k=5$ for all our experiments to ensure a stable and reliable estimation.
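As an illustrative sketch, the conditional extension of the Kraskov estimator (the Frenzel-Pompe variant, which counts neighbors under the max-norm in the joint and marginal subspaces) can be implemented as follows; the function name `knn_cmi` and the default $k=5$ follow the text, but this is a minimal reference implementation, not our evaluation code:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def knn_cmi(x, y, z, k=5):
    """k-NN estimate of I(X; Y | Z) in nats (Frenzel-Pompe / KSG-style).

    Uses the Chebyshev (max-norm) metric: psi(k) is corrected by the
    average digamma of neighbor counts in the (X,Z), (Y,Z) and Z subspaces.
    """
    x = np.asarray(x, float).reshape(len(x), -1)
    y = np.asarray(y, float).reshape(len(y), -1)
    z = np.asarray(z, float).reshape(len(z), -1)
    xyz = np.hstack([x, y, z])
    # Distance to the k-th nearest neighbor in the full joint space
    # (column 0 of the result is the point itself, hence k + 1).
    d = cKDTree(xyz).query(xyz, k=k + 1, p=np.inf)[0][:, -1]
    eps = np.nextafter(d, 0)  # shrink slightly: count strict neighbors

    def count(pts):
        tree = cKDTree(pts)
        # Number of neighbors within eps in the subspace, excluding self.
        return np.fromiter(
            (len(nb) - 1 for nb in tree.query_ball_point(pts, eps, p=np.inf)),
            dtype=int, count=len(pts))

    n_xz = count(np.hstack([x, z]))
    n_yz = count(np.hstack([y, z]))
    n_z = count(z)
    return digamma(k) + np.mean(
        digamma(n_z + 1) - digamma(n_xz + 1) - digamma(n_yz + 1))
```

On conditionally independent inputs the estimate fluctuates around zero due to finite-sample variance, which is the behavior discussed for Model E below.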
KMD-Score Kernel Parameter Selection.
The performance of the MMD test is sensitive to kernel parameter choices. For the KMD-Score, we employed a standard Radial Basis Function (RBF) kernel, $k(x,y)=\exp(-\|x-y\|^{2}/(2\sigma^{2}))$. The bandwidth parameter $\sigma$ is critical. Following best practices, we set the bandwidth using the median heuristic, a robust and common data-driven approach. For the analysis on the Lalonde dataset's re78 mechanism, this heuristic yielded a bandwidth of $\sigma=0.1$, a value confirmed as effective in preliminary experiments. All KMD-Scores are reported using this configuration.
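For concreteness, the median heuristic and the (biased) squared MMD under the RBF kernel above can be sketched as follows; the function names are illustrative, not the paper's implementation:

```python
import numpy as np
from scipy.spatial.distance import cdist

def median_bandwidth(x, y):
    """Median heuristic: sigma = median pairwise Euclidean distance
    over the pooled sample."""
    pooled = np.vstack([x, y])
    d = cdist(pooled, pooled)
    return np.median(d[np.triu_indices_from(d, k=1)])

def mmd2_rbf(x, y, sigma=None):
    """Biased estimate of squared MMD with k(a,b)=exp(-||a-b||^2/(2 sigma^2))."""
    if sigma is None:
        sigma = median_bandwidth(x, y)
    gamma = 1.0 / (2.0 * sigma ** 2)
    kxx = np.exp(-gamma * cdist(x, x, "sqeuclidean"))
    kyy = np.exp(-gamma * cdist(y, y, "sqeuclidean"))
    kxy = np.exp(-gamma * cdist(x, y, "sqeuclidean"))
    return kxx.mean() + kyy.mean() - 2.0 * kxy.mean()
```

Two samples from the same distribution yield a squared MMD near zero, while a distributional shift inflates it, which is the discriminative behavior the KMD-Score relies on.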
Appendix J Empirical Validation of Proposed Evaluation Metrics
To empirically validate the reliability, sensitivity, and complementary nature of our proposed evaluation metrics (CIC-Score, CMI-Score, and KMD-Score), we conducted a controlled micro-simulation study assessing how each metric responds to a spectrum of increasingly severe model errors.
Experimental Setup.
We designed a simple ground-truth Structural Causal Model (SCM) and created five simulated models (A through E) representing a clear "degradation gradient" of model quality:
- Model A (Oracle): Represents a theoretically perfect model with zero SRE and a perfectly learned causal mechanism. This serves as our gold standard, with expected scores of 1.0.
- Model B (Lossy Inverter): Simulates a model with a non-zero SRE. It uses the correct causal mechanism but introduces a systematic error during the reconstruction cycle, mimicking the core flaw of DDIM-based approaches.
- Model C (Wrong Mechanism): Represents a model that fails to learn the correct functional form of the causal mechanism (e.g., learning a linear instead of a sinusoidal relationship).
- Model D (Maximal Error): Simulates a more severe failure where both the causal mechanism and the inferred noise distribution are fundamentally incorrect.
- Model E (Total Mismatch): Represents the worst-case scenario where the model ignores the causal graph and inputs, generating outputs from an unrelated distribution.
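The first three rungs of this degradation gradient can be sketched on a toy SCM; the mechanism $Y=\sin(X)+U$, the corruption constants, and the intervention $X \to X+1$ are illustrative assumptions for exposition, not the exact simulation used in the study:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.uniform(-2, 2, n)
u = rng.normal(0, 0.3, n)
y = np.sin(x) + u                 # ground-truth mechanism
x_cf = x + 1.0                    # intervene on the parent
y_true_cf = np.sin(x_cf) + u     # oracle counterfactual

# Model A (Oracle): exact mechanism, exact abduction -> zero error.
y_a = np.sin(x_cf) + (y - np.sin(x))

# Model B (Lossy Inverter): correct mechanism, but the reconstruction
# cycle systematically corrupts the abducted noise (non-zero SRE).
u_b = (y - np.sin(x)) * 0.7 + 0.1
y_b = np.sin(x_cf) + u_b

# Model C (Wrong Mechanism): a linear fit replaces the sinusoid.
coef = np.polyfit(x, y, 1)
y_c = np.polyval(coef, x_cf) + (y - np.polyval(coef, x))

for name, pred in [("A", y_a), ("B", y_b), ("C", y_c)]:
    print(name, np.abs(pred - y_true_cf).mean())
```

Only Model A reproduces the oracle counterfactual exactly; Models B and C incur nonzero per-individual error, which is what the three metrics are designed to detect and differentiate.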
Figure 10: Results of the micro-simulation study for metric validation. The three plots show the response of the CIC-Score, CMI-Score, and KMD-Score to five models (A-E) of progressively decreasing quality. The scores demonstrate a clear monotonic degradation, confirming their ability to reliably track model fidelity. Note the CIC-Score's sharp drop from Model A to B, highlighting its specific sensitivity to the Structural Reconstruction Error (SRE).
Analysis of Results.
The results (Figure 10) demonstrate the distinct and complementary roles of our proposed metrics:
1. The CIC-Score acts as a high-sensitivity "SRE detector." It exhibits a dramatic drop from a perfect 1.0 (Model A) to approximately 0.23 (Model B) the moment SRE is introduced, while showing less sensitivity to the specific form of mechanism error. This confirms its primary role as a diagnostic for adherence to the Causal Information Conservation principle.
2. The CMI-Score serves as a robust "mechanism association tracker." It degrades gracefully and monotonically as the learned causal mechanism deviates from the ground truth (from 0.99 to 0.76). This demonstrates its utility in quantifying the fidelity of learned parent-child conditional dependencies.
3. The KMD-Score functions as the "final arbiter" of distributional fidelity. Possessing the widest dynamic range, it is sensitive to all forms of error and provides a holistic judgment of the similarity between the generated and true counterfactual distributions. As the most rigorous metric, it correctly assigns the lowest score to the completely mismatched Model E.
This simulation thus validates our proposed metrics as a reliable and nuanced evaluation framework. They work in synergy to diagnose specific model failings (CIC-Score), assess relational accuracy (CMI-Score), and provide an overall quality judgment (KMD-Score), offering a more insightful assessment than traditional metrics alone.
Discussion on the Non-Zero Lower Bound of Scores.
Notably, even for the worst-performing model (Model E), the CMI and KMD scores do not fall to zero. This behavior is not a limitation but a desirable feature that reflects their ability to capture "residual statistical structure" in the evaluation setting.
- For the KMD-Score: The MMD compares the joint distributions $P_{\text{model}}(Y,W,T)$ and $P_{\text{oracle}}(Y,W,T)$. Crucially, as all models operate on the same observed parent data, the marginal parent distribution $P(W,T)$ is identical between the two. The difference lies only in the conditional $P(Y|W,T)$. Because the joint distributions share a substantial common subspace, their MMD will be finite, preventing the KMD-Score from reaching zero. Furthermore, using a `StandardScaler` maps both distributions to a similar feature space, which is necessary for a fair, scale-invariant comparison.
- For the CMI-Score: This metric quantifies $I(Y;\text{parent}\mid\text{other parents})$. Even an incorrect mechanism (as in Model C or D) still generates an output $Y$ as a deterministic function of its parents, resulting in a non-zero CMI. For Model E, where $Y$ is independent of its parents, the theoretically zero CMI is not observed due to the inherent finite-sample variance of the non-parametric k-NN estimator.
This behavior is advantageous, ensuring the metrics provide a meaningful, continuous gradient of failure rather than a simplistic binary judgment. This enhances their diagnostic power, allowing for fine-grained distinctions between types and degrees of model imperfection.
Appendix K Algorithm for ATE Estimation
This appendix provides the detailed pseudo-code for the counterfactual imputation procedure used to estimate the Average Treatment Effect (ATE) in our experiments.
Input: Observational data $\mathcal{D}=\{v^{(j)}\}_{j=1}^{N}$ where $v^{(j)}\in\mathbb{R}^{d}$; Causal graph $\mathcal{G}$.
Output: Estimated Average Treatment Effect ($\hat{\text{ATE}}$).
Define Invertible SCM $\mathcal{M}_{\theta}$:
$\mathcal{M}_{\theta}=\{f_{i}(\cdot;\theta_{i})\}_{i=1}^{d}$ based on $\mathcal{G}$, where $v_{i}=f_{i}(\mathbf{pa}_{i},u_{i})$;
// 1. Train the invertible SCM on observational data
$\hat{\theta}\leftarrow\arg\min_{\theta}\sum_{j=1}^{N}\sum_{i=1}^{d}\mathcal{L}_{i}\left(f_{i}(\mathbf{pa}_{i}^{(j)};\theta_{i}),v_{i}^{(j)}\right)$;
// 2. Generate counterfactual outcomes for each individual
for $j\leftarrow 1$ to $N$ do
$t_{j}\leftarrow v_{T}^{(j)}$;
// Observed treatment
$\mathbf{u}^{(j)}\leftarrow\mathcal{M}_{\hat{\theta}}^{-1}(v^{(j)})$;
// Abduction via invertible BELM encoder
$y_{j}(1-t_{j})\leftarrow\operatorname{Predict}\left(\mathcal{M}_{\hat{\theta}},\mathbf{u}^{(j)},\operatorname{do}(T:=1-t_{j})\right)$;
// Action & Prediction
// Store factual and counterfactual outcomes
$\mathbf{Y}_{j}(t_{j})\leftarrow v_{Y}^{(j)}$;
$\mathbf{Y}_{j}(1-t_{j})\leftarrow y_{j}(1-t_{j})$;
end for
// 3. Compute the ATE from factual and counterfactual outcomes
$\hat{\text{ATE}}\leftarrow\frac{1}{N}\sum_{j=1}^{N}\left(\mathbf{Y}_{j}(1)-\mathbf{Y}_{j}(0)\right)$;
return $\hat{\text{ATE}}$;
Algorithm 1 ATE Estimation with Invertible SCMs via Counterfactual Imputation
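As a runnable illustration of Algorithm 1, the sketch below substitutes a toy, analytically invertible linear SCM ($y=\theta_{t}t+\theta_{x}x+u$, fitted by least squares) for the BELM-MDCM mechanism; because the mechanism is additive in $u$, the abduction step recovers the noise exactly, mirroring the zero-SRE property the algorithm assumes:

```python
import numpy as np

def estimate_ate(x, t, y):
    """Counterfactual imputation (Algorithm 1) with a toy invertible SCM.

    Steps: (1) fit the mechanism, (2) abduct u per individual, flip the
    treatment under do(T := 1 - t) and predict, (3) average Y(1) - Y(0).
    """
    n = len(y)
    design = np.column_stack([t, x, np.ones(n)])
    theta, *_ = np.linalg.lstsq(design, y, rcond=None)
    u = y - design @ theta                                   # abduction
    y_cf = np.column_stack([1 - t, x, np.ones(n)]) @ theta + u  # action + prediction
    # Store factual and counterfactual outcomes per individual.
    y1 = np.where(t == 1, y, y_cf)
    y0 = np.where(t == 0, y, y_cf)
    return np.mean(y1 - y0)
```

With data simulated from a linear DGP with treatment effect 2.0, the imputed ATE converges to 2.0; a non-invertible mechanism (as in the Stress Test) would break the exact abduction step this sketch relies on.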