2511.05236
# The Causal Round Trip: Generating Authentic Counterfactuals by Eliminating Information Loss
**Authors**:
- Rui Wu wurui22@mail.ustc.edu.cn (School of Management, University of Science and Technology of China)
- Lizheng Wang lzwang@ustc.edu.cn (School of Management, University of Science and Technology of China)
- Yongjun Li lionli@ustc.edu.cn (School of Management, University of Science and Technology of China)
> Corresponding author.
## Abstract
Judea Pearl's vision of Structural Causal Models (SCMs) as engines for counterfactual reasoning hinges on faithful abduction: the precise inference of latent exogenous noise. For decades, operationalizing this step for complex, non-linear mechanisms has remained a significant computational challenge. The advent of diffusion models, powerful universal function approximators, offers a promising solution. However, we argue that their standard design, optimized for perceptual generation over logical inference, introduces a fundamental flaw for this classical problem: an inherent information loss we term the Structural Reconstruction Error (SRE). To address this challenge, we formalize the principle of Causal Information Conservation (CIC) as the necessary condition for faithful abduction. We then introduce BELM-MDCM, the first diffusion-based framework engineered to be causally sound by eliminating SRE by construction through an analytically invertible mechanism. To operationalize this framework, a Targeted Modeling strategy provides structural regularization, while a Hybrid Training Objective instills a strong causal inductive bias. Rigorous experiments demonstrate that our Zero-SRE framework not only achieves state-of-the-art accuracy but, more importantly, enables the high-fidelity, individual-level counterfactuals required for deep causal inquiries. Our work provides a foundational blueprint that reconciles the power of modern generative models with the rigor of classical causal theory, establishing a new and more rigorous standard for this emerging field.
Keywords: Causal Inference, Diffusion Models, Causal Information Conservation, Structural Causal Models, Counterfactual Generation, BELM, Structural Reconstruction Error
## 1 Introduction
The fundamental challenge of causal inference, as articulated by rubin1974estimating, is our inability to simultaneously observe an individual's potential outcomes. Generating authentic counterfactuals is thus the field's grand challenge. Structural Causal Models (SCMs), introduced by pearl2009causality, provide the formal language for this pursuit. An SCM posits that an outcome $V_i$ is generated by a function of its parents $Pa_i$ and a unique exogenous noise variable $U_i$. This noise, $U_i$, represents the primordial causal information: the collection of unobserved factors unique to an individual. This concept aligns directly with the long-standing focus in econometrics on unobserved individual heterogeneity, a central challenge in structural modeling for decades (heckman2001micro). Pearl's framework for causal reasoning, the Abduction-Action-Prediction cycle, hinges on the fidelity of the first step: abduction. To answer any "what if" question, one must first perfectly infer this primordial information $U_i$ from an observed outcome $v_i$. For decades, while this theoretical blueprint was clear, its practical realization for complex, non-linear mechanisms remained a major computational hurdle, often addressed in econometrics through strong parametric assumptions or linear approximations (angrist2008mostly).
The advent of deep generative models, particularly diffusion models (ho2020denoising), offers a powerful new hope for bridging this gap. As near-universal function approximators, they possess the expressive power to learn the complex, non-linear functions that have long challenged classical methods (chao2023interventional; sanchez2022dcms). However, this promise is shadowed by a critical, yet overlooked, "impedance mismatch." These models were engineered for perceptual tasks like image synthesis, where visual plausibility is paramount, not for the logical rigor demanded by causal abduction. We argue that their standard design, which relies on approximate inversion schemes like DDIM (song2021denoising), is fundamentally at odds with the strict requirements of this classical causal problem.
In this work, we diagnose and resolve this conflict. We begin by giving the classic requirement for faithful abduction a modern name: Causal Information Conservation (CIC). Here, CIC is defined operationally as the lossless, deterministic recovery of the exogenous noise variable $U$. Its novelty lies in its application as a design principle and diagnostic tool for the diffusion model paradigm in causality, rather than as a formal information-theoretic quantity; connecting this operational principle to formal measures, such as mutual information, is a compelling avenue for future research. Our core contribution is the identification that standard diffusion models systematically violate this principle due to an inherent algorithmic flaw. We formalize this flaw as the Structural Reconstruction Error (SRE): a quantifiable information loss that imposes a hard theoretical ceiling on the fidelity of any counterfactual generated by such methods. The SRE is not an estimation error to be solved with more data, but a structural defect in the tool itself.
To solve the long-standing challenge of operationalizing faithful abduction, we introduce BELM-MDCM. It is not merely a new model, but the first diffusion-based framework re-engineered from first principles to be causally sound. Architected around an analytically invertible sampler (liu2024belm), it is the first Zero-SRE causal framework by construction. This design choice reconciles the expressive power of modern diffusion models with the logical rigor of Pearl's causal theory, ensuring the abduction step is lossless. Our primary contributions are therefore:
1. Diagnosing a Fundamental Barrier in a Classic Problem. We are the first to identify that standard diffusion models, when applied to the classic problem of SCM abduction, suffer from a structural flaw we term the Structural Reconstruction Error (SRE), which violates the foundational principle of Causal Information Conservation.
1. Proposing the First Causally-Sound Diffusion Framework. We introduce BELM-MDCM, the first framework to eliminate SRE by design. By leveraging an analytically invertible mechanism, it ensures that the power of diffusion models can be applied to causality without compromising the integrity of the abduction process.
1. Developing a Principled Methodology to Operationalize the Framework. To make our Zero-SRE framework practical and robust, we introduce two synergistic innovations: a Targeted Modeling strategy to manage complexity and a Hybrid Training Objective to provide a strong causal inductive bias, both supported by our theoretical analysis.
Through a comprehensive experimental evaluation, we demonstrate that BELM-MDCM not only sets a new state-of-the-art in estimation accuracy but, more critically, unlocks the generation of authentic individual-level counterfactuals for deep causal inquiries. By providing a foundational blueprint that resolves a core tension between modern machine learning and classical causal theory, our work establishes a new, more rigorous standard for this research direction.
### 1.1 The Inversion Challenge in Diffusion-Based Causality
Diffusion models (ho2020denoising) are powerful generative models that learn to reverse a fixed, gradual noising process. They train a neural network, $\epsilon_\theta(x_t,t)$, to predict the noise component of a corrupted sample $x_t$ by optimizing a simple mean-squared error objective:
$$
L_{\text{simple}}(\theta)=\mathbb{E}_{t,x_0,\boldsymbol{\epsilon}}\Big[\big\|\boldsymbol{\epsilon}-\epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\,x_0+\sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},\,t\big)\big\|^2\Big] \tag{1}
$$
where $\bar{\alpha}_t$ defines the noise schedule and $\boldsymbol{\epsilon}\sim\mathcal{N}(0,I)$. This trained network is then used to iteratively denoise a variable from pure noise back to a clean sample. A standard deterministic method for this generative process is the Denoising Diffusion Implicit Model (DDIM) (song2021denoising):
$$
x_{t-1}=\sqrt{\bar{\alpha}_{t-1}}\left(\frac{x_t-\sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t,t)}{\sqrt{\bar{\alpha}_t}}\right)+\sqrt{1-\bar{\alpha}_{t-1}}\,\epsilon_\theta(x_t,t) \tag{2}
$$
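As a concrete reference, one deterministic DDIM generation step of this form can be sketched in a few lines of numpy. The noise predictor `eps_theta` and the linear noise schedule below are toy stand-ins (assumptions for illustration), not a trained network or a schedule from the paper.

```python
import numpy as np

def eps_theta(x, t):
    """Toy stand-in for a trained noise-prediction network (hypothetical)."""
    return np.tanh(x) * (1.0 + 0.1 * t)

def ddim_step(x_t, t, abar):
    """One deterministic DDIM generation step: x_t -> x_{t-1}."""
    eps = eps_theta(x_t, t)
    # Predict the clean sample implied by the current noise estimate...
    x0_pred = (x_t - np.sqrt(1 - abar[t]) * eps) / np.sqrt(abar[t])
    # ...then re-noise it to the level of the previous timestep.
    return np.sqrt(abar[t - 1]) * x0_pred + np.sqrt(1 - abar[t - 1]) * eps

abar = np.linspace(0.9999, 0.05, 11)          # toy schedule: abar[0] ~ 1
x = np.random.default_rng(0).normal(size=3)   # start from pure noise x_T
for t in range(10, 0, -1):                    # iterate x_T -> x_0
    x = ddim_step(x, t, abar)
```

Note that the single network evaluation $\epsilon_\theta(x_t,t)$ appears in both terms of the update, which is the property the inversion scheme below exploits.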
However, causal abduction requires the inverse operation: encoding an observed data point $x_0$ into its latent noise code $x_T$ . Standard frameworks (chao2023interventional) use the DDIM inversion, which only approximates this path:
$$
x_{t+1}=\sqrt{\bar{\alpha}_{t+1}}\left(\frac{x_t-\sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t,t)}{\sqrt{\bar{\alpha}_t}}\right)+\sqrt{1-\bar{\alpha}_{t+1}}\,\epsilon_\theta(x_t,t) \tag{3}
$$
This inversion is approximate because it relies on the noise prediction $\epsilon_\theta(x_t,t)$ remaining constant across the step, which introduces discretization errors that accumulate (liu2022pseudo). This structural flaw, which we term the Structural Reconstruction Error (SRE), systematically corrupts the inferred exogenous noise $U_i$. The initial error in the abduction step then propagates through the entire Abduction-Action-Prediction cycle, compromising the fidelity of the final counterfactual.
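The accumulation can be made visible numerically: encoding a point with the approximate inversion and then decoding it back does not return the original point, because the two directions evaluate the noise predictor at different states. The predictor and schedule below are toy assumptions used purely to exhibit a nonzero round trip.

```python
import numpy as np

def eps_theta(x, t):
    return np.tanh(x)  # toy nonlinear noise predictor (hypothetical)

def step(x, t_from, t_to, abar):
    """Shared DDIM-style update between adjacent timesteps.
    eps is always evaluated at the starting state x (time t_from)."""
    eps = eps_theta(x, t_from)
    x0_pred = (x - np.sqrt(1 - abar[t_from]) * eps) / np.sqrt(abar[t_from])
    return np.sqrt(abar[t_to]) * x0_pred + np.sqrt(1 - abar[t_to]) * eps

T = 20
abar = np.linspace(0.9999, 0.1, T + 1)
x0 = np.array([0.5, -1.2, 0.3])

x = x0.copy()
for t in range(T):            # abduction: encode x_0 -> x_T
    x = step(x, t, t + 1, abar)
for t in range(T, 0, -1):     # generation: decode x_T -> x_0
    x = step(x, t, t - 1, abar)

# The encode step t -> t+1 used eps(x_t, t), but the decode step t+1 -> t
# uses eps(x_{t+1}, t+1): the mismatch leaves a nonzero round-trip error.
sre = np.linalg.norm(x - x0) ** 2
```

With a linear predictor the two evaluations coincide and the error vanishes; any nonlinearity makes `sre` strictly positive, illustrating why the SRE is structural rather than statistical.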
### 1.2 Our Solution: A Zero-SRE Causal Framework
To eliminate SRE by construction, we build our framework upon an analytically invertible sampler: the Bidirectional Explicit Linear Multi-step (BELM) sampler (liu2024belm). BELM overcomes the "memoryless" limitation of single-step samplers like DDIM by using a history of noise predictions, a principle grounded in classical theory for solving ODEs (hairer2006solving).
Specifically, we employ a second-order BELM. During decoding, it computes a more stable effective noise, $\boldsymbol{\epsilon}_{\text{eff}}$, using predictions from the current and previous timesteps:
$$
\boldsymbol{\epsilon}_{\text{eff}}=\frac{3}{2}\,\epsilon_\theta(x_t,t)-\frac{1}{2}\,\epsilon_\theta(x_{t+1},t+1) \tag{4}
$$
This improved estimate is then used in a DDIM-like update. The key innovation is that the corresponding encoding process is constructed to be the exact algebraic inverse of this decoding process, guaranteeing that the round-trip is lossless, i.e., $H(T(x_0))=x_0$ . While the original work on BELM focused on general generative tasks, we are the first to identify, leverage, and theoretically justify its analytical invertibility as the key to satisfying the principle of Causal Information Conservation for rigorous counterfactual generation. Our choice of a second-order BELM represents a deliberate trade-off, providing substantial accuracy gains over single-step methods while maintaining practical efficiency (liu2024belm), making it ideal for our causal framework.
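The mechanism behind this exactness can be illustrated with a deliberately simplified toy: when each step is an explicit linear relation among trajectory states, with the noise network evaluated only at states shared by both directions, the same relation can be solved algebraically either way. The coefficients `a, b, c`, the predictor, and the boundary handling below are placeholders for exposition, not the published BELM coefficients (which liu2024belm derive from the noise schedule).

```python
import numpy as np

def eps_theta(x, t):
    return np.tanh(x) * (1 + 0.05 * t)  # toy noise predictor (hypothetical)

# Placeholder multi-step coefficients (the real BELM derives these per step).
a, b, c = 1.1, -0.3, 0.2

def encode(x0, x1, T):
    """Solve the linear relation forward: given (x_{i-1}, x_i), get x_{i+1}."""
    traj = [x0, x1]
    for i in range(1, T):
        traj.append((traj[i - 1] - b * traj[i] - c * eps_theta(traj[i], i)) / a)
    return traj

def decode(xT, xTm1, T):
    """The same relation solved backward: x_{i-1} = a x_{i+1} + b x_i + c eps(x_i)."""
    traj = {T: xT, T - 1: xTm1}
    for i in range(T - 1, 0, -1):
        traj[i - 1] = a * traj[i + 1] + b * traj[i] + c * eps_theta(traj[i], i)
    return traj[0]

T = 15
x0 = np.array([0.7, -0.4])
x1 = np.array([0.6, -0.3])          # stand-in for an invertible boundary step
traj = encode(x0, x1, T)
x0_rec = decode(traj[T], traj[T - 1], T)
round_trip_err = np.linalg.norm(x0_rec - x0)  # vanishes up to floating point
```

Because $\epsilon_\theta$ is only ever evaluated at states that both directions agree on, encoding and decoding are exact algebraic inverses; contrast this with DDIM inversion, where the two directions evaluate the network at different states.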
### 1.3 Methodological Gaps in Applying Invertible SCMs
However, achieving high-fidelity causal inference requires more than a simple substitution of one sampler for another. The principle of analytical invertibility, while theoretically sound, exposes new challenges in practical SCM implementation that our framework is designed to address.
The Challenge of Model Specification: Targeted Modeling.
A key decision in SCM construction is assigning a causal mechanism to each node. Naively applying a complex, computationally expensive BELM-based diffusion model to every node in the causal graph is suboptimal. This motivates our Targeted Modeling strategy, where model complexity is treated as a resource to be allocated judiciously across the graph.
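Concretely, such an allocation might be expressed as a simple configuration, mirroring the assignment shown in Figure 1; the node names, model labels, and cost numbers below are illustrative assumptions, not an API of our implementation.

```python
# Hypothetical per-node mechanism allocation mirroring Figure 1: expressive
# Zero-SRE diffusion mechanisms only for the nodes of causal interest
# (treatment T, outcome Y); cheap mechanisms for confounders/covariates.
mechanism_allocation = {
    "W": "EmpiricalDistribution",   # exogenous confounder, no parents
    "X": "AdditiveNoiseModel",      # covariates: simple invertible mechanism
    "T": "CausalDiffusionModel",    # treatment: full BELM-based mechanism
    "Y": "CausalDiffusionModel",    # outcome: full BELM-based mechanism
}

# Crude relative capacity costs (illustrative numbers only).
COST = {"EmpiricalDistribution": 0, "AdditiveNoiseModel": 1,
        "CausalDiffusionModel": 10}

def total_capacity(allocation):
    """Complexity 'budget' spent across the SCM; Targeted Modeling keeps it low."""
    return sum(COST[m] for m in allocation.values())
```

Treating complexity as an explicit budget like this is the intuition later formalized via Rademacher complexity in Theorem 15.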
The Challenge of Downstream Tasks: Hybrid Training.
The second challenge arises from a fundamental mismatch in objectives. A diffusion model is trained on a generative objective, $L_{\text{diffusion}}(\theta)$, while a downstream predictive task is optimized using a discriminative loss, $L_{\text{task}}(\phi)$. These two objectives are not aligned. This "objective mismatch" motivates our Hybrid Training strategy, which seeks to unify these two goals.
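As a minimal sketch of the combined objective (with tensors replaced by toy numpy arrays and the weight $\lambda$ chosen arbitrarily), the hybrid loss simply sums a generative noise-prediction term and a discriminative task term:

```python
import numpy as np

rng = np.random.default_rng(0)
eps_true = rng.normal(size=(32, 2))                    # noise used to corrupt a batch
eps_pred = eps_true + 0.1 * rng.normal(size=(32, 2))   # imperfect eps_theta output
y_true = rng.normal(size=32)
y_pred = y_true + 0.2 * rng.normal(size=32)            # downstream task-head output

# Generative objective: Eq. (1)-style noise-prediction MSE.
l_diffusion = np.mean(np.sum((eps_pred - eps_true) ** 2, axis=1))
# Discriminative objective for the downstream predictive task.
l_task = np.mean((y_pred - y_true) ** 2)

lam = 0.5                                  # arbitrary weighting (hypothetical)
l_total = l_diffusion + lam * l_task       # hybrid training objective
```

In actual training both terms would share parameters through the score network, which is what lets the task loss steer where the score function is learned accurately.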
## 2 Theoretical Analysis: An Operator-Theoretic Framework
To formalize our thesis that Causal Information Conservation is paramount and its violation via Structural Reconstruction Error is a fundamental barrier, we develop a rigorous operator-theoretic framework. This perspective is essential for analyzing the fidelity of the causal mapping process itself, moving beyond simple prediction errors. We present the first formal analysis that decomposes the counterfactual error in diffusion-based causal models to explicitly isolate the SRE, proving how our Zero-SRE design eliminates this critical structural limitation.
Our analysis first establishes the conditions for perfect counterfactual generation (§ 2.1–2.3) and proves that standard methods produce a non-zero SRE, which our sampler eliminates by construction (Propositions 5–6; § 2.4). The centerpiece is a novel error decomposition theorem that isolates the SRE, motivating our Zero-SRE design (§ 2.5–2.7). We conclude with learnability guarantees and a discussion of implications for advanced causal tasks like transportability (§ 2.8–2.10).
### 2.1 Problem Formulation and Causal Operators
Let $(\Omega,\mathcal{F},P)$ be a probability space. We consider endogenous variables $V$ as elements of the Hilbert space of square-integrable random variables, $\mathcal{X}:=L^2(\Omega,\mathbb{R}^d)$. Unless otherwise specified, all vector norms $\|\cdot\|$ in the subsequent analysis refer to the standard Euclidean ($L_2$) norm.
**Definition 1 (Functional SCM Operator)**
*A Structural Causal Model is defined by a set of unknown, true functional operators $\{F_i\}_{i=1}^d$, where each $F_i:\mathcal{X}^{pa_i}\times\mathcal{U}_i\to\mathcal{X}_i$ is a map such that $V_i:=F_i(Pa_i,U_i)$, with $Pa_i$ being the set of parent random variables and $U_i$ an exogenous noise variable. We establish the convention that the corresponding lowercase bold letter, $\boldsymbol{pa}_i$, denotes a specific vector of observed values for these parents.*
Our goal is to learn a model parameterized by $\theta$ that approximates this SCM. Our model consists of a pair of conditional operators for each variable $V_i$:
1. A decoder (generative) operator $H_\theta:\mathcal{U}\times\mathcal{X}^p\to\mathcal{X}$, which aims to approximate $F$.
1. An encoder (inference) operator $T_\theta:\mathcal{X}\times\mathcal{X}^p\to\mathcal{U}$, which aims to perform abduction by inferring the latent noise.
These operators are realized by solving the probability flow ODE (Appendix A). The decoder $H_\theta$ solves the ODE from $t=T$ to $t=0$, while the encoder $T_\theta$ solves it from $t=0$ to $t=T$. Our BELM sampler is a high-fidelity numerical solver designed such that these forward and backward operations are exact algebraic inverses.
### 2.2 Identifiability and Exact Counterfactual Generation
We adapt principles from identifiable generative modeling (chao2023interventional) to formalize the conditions for exact counterfactuals. This requires assuming the SCM is invertible with respect to its noise term, a condition discussed in Section 2.11.
**Theorem 2 (Identifiability via Statistical Independence)**
*Given an SCM operator $X:=F(Pa,U)$ where $U\perp\!\!\!\perp Pa$ and $F$ is invertible w.r.t. $U$. If a learned encoder $T_\theta$ (with sufficient capacity) yields a latent representation $Z=T_\theta(X,Pa)$ that is statistically independent of the parents $Pa$, then $Z$ is an isomorphic representation of the exogenous noise $U$.*
### 2.3 Geometric Inductive Bias for Identifiability
The score-matching objective's geometric inductive biases strengthen our identifiability argument. We leverage the principle of implicit regularization, where optimizers favor "simpler" functions (hochreiter1997flat; neyshabur2018pac). We adopt the principle of simplicity bias, a cornerstone of modern deep learning theory that, while empirically supported, remains an active and not yet universally proven area of research. Our conclusions are conditioned on its validity, as discussed further in Section 2.11. This suggests the model learns the most parsimonious geometric transformation required to explain the data.
Considering the local geometry of the data density $p(x)$ provides powerful intuition. In a local region $R$ , if the data is isotropic (spherically symmetric), the simplest score function is a radial vector field, yielding a conformal map. If the structure is simply anisotropic (e.g., ellipsoidal), the model is biased towards learning a local affine map. This refines the notion of a purely conformal bias and leads to the following proposition.
**Proposition 3 (Implicit Bias towards Simple Geometric Maps)**
*Assume (A1) the true data density $p(x)$ is smooth ( $C^2$ ) and (A2) the optimization process has a simplicity bias (e.g., favoring low-complexity solutions, see Appendix H).
1. If there exists a local region $R$ where $p(x)$ is isotropic, the optimal learned score function is a radial vector field, and the flow map it generates is a conformal map on $R$ .
1. If we relax the condition to a local region $R$ where $p(x)$ has an ellipsoidal structure, the optimal learned score function is normal to the ellipsoidal iso-contours, and the flow map it generates is a local affine transformation on $R$ .*
The formal argument is detailed in Appendix H. This proposition is significant: it suggests that the model defaults to learning the most parsimonious, well-behaved, and locally invertible map that can explain the data's geometry. This bias is crucial for the abduction step, as it prevents the pathological distortions that would corrupt the inferred causal noise $U$.
**Theorem 4 (Operator Isomorphism Guarantees Exact Counterfactuals)**
*Let the conditions of Theorem 2 hold. If the learned operator pair $(T_\theta,H_\theta)$ constitutes a conditional isomorphism (i.e., $H_\theta(T_\theta(\cdot,\boldsymbol{pa}),\boldsymbol{pa})=I$, the identity operator), then the model's prediction under an intervention $do(Pa:=\boldsymbol{\alpha})$ is exact.*
Proof A full proof, covering cases for different dimensions of the exogenous noise variable, is provided in Appendix B.
### 2.4 Analysis of Inversion Fidelity
We now formally analyze the inversion error. We prove that standard approximate schemes produce a non-zero SRE (Proposition 5), whereas our chosen sampler eliminates it by construction (Proposition 6).
**Proposition 5 (Structural Error of Approximate Inversion)**
*Let $T_{\text{DDIM}}$ be the operator for one step of DDIM inversion from $x_t$ to $x_{t+1}$, and $H_{\text{DDIM}}$ be the generative step operator from $x_{t+1}$ to $x_t$. The single-step reconstruction error is non-zero and of second order in the time step $\Delta t$:
$$
(H_{\text{DDIM}}\circ T_{\text{DDIM}})(x_t)-x_t=O((\Delta t)^2)
$$
This error accumulates over the full trajectory, leading to a non-zero Structural Reconstruction Error.*
Proof See Appendix C for a rigorous proof.
**Proposition 6 (Analytical Invertibility of the Sampler)**
*Let $T_{\text{BELM}}$ and $H_{\text{BELM}}$ be the operators corresponding to the full-trajectory BELM sampler for inference and generation, respectively. For a fixed noise prediction network $\epsilon_\theta$, the operators are exact algebraic inverses:
$$
H_{\text{BELM}}\circ T_{\text{BELM}}=I
$$*
Proof The proof follows from the algebraic construction of the BELM update rules, as detailed in Appendix C.
### 2.5 Error Decomposition for Counterfactual Estimation
This brings us to our central theoretical result: an error decomposition theorem that rigorously partitions the total counterfactual error. This decomposition isolates the SRE and mathematically demonstrates why its elimination is critical.
**Definition 7 (Counterfactual Error Components)**
*We formally define the two primary sources of error in counterfactual estimation for the invertible case:
1. The Structural Reconstruction Error ($E_{\text{SR}}$) measures the information loss from the model's abduction-action cycle on a given sample $X$:
$$
E_{\text{SR}}(X):=\|(H_\theta\circ T_\theta-I)X\|^2
$$
1. The Latent Space Invariance Error ($E_{\text{LSI}}$) measures the failure of the learned latent space to remain invariant under interventions on parent variables:
$$
E_{\text{LSI}}:=\|T_\theta(X,Pa)-T_\theta(X_{\boldsymbol{\alpha}}^{\text{true}},\boldsymbol{\alpha})\|^2
$$*
**Theorem 8 (Counterfactual Error Bound)**
*Let a model be defined by $(T_\theta,H_\theta)$ and the true SCM by $F$. Assume the decoder $H_\theta$ is $L_H$-Lipschitz. The expected squared error of the model's counterfactual prediction $\hat{X}_{\boldsymbol{\alpha}}$ is bounded by the expectation of the two error components:
$$
\mathbb{E}\left[\|\hat{X}_{\boldsymbol{\alpha}}-X_{\boldsymbol{\alpha}}^{\text{true}}\|^2\right]\le 2\,\mathbb{E}\left[E_{\text{SR}}(X_{\boldsymbol{\alpha}}^{\text{true}})\right]+2L_H^2\,\mathbb{E}\left[E_{\text{LSI}}\right]
$$*
Proof The proof is in Appendix D.
**Remark 9 (Elimination of Structural Error)**
*By Proposition 6, the Structural Reconstruction Error for BELM-MDCM is identically zero. This is the central theoretical advantage of our framework. It disentangles the error sources, allowing us to isolate the entire modeling challenge to learning a high-quality score function ($\epsilon_\theta$) without the confounding factor of an imperfect inversion algorithm. Any remaining error is now purely a function of statistical estimation, not a structural bias of the model itself.*
**Proposition 10 (Bound on Latent Space Invariance Error)**
*We assume the learned score network, $\boldsymbol{\epsilon}_\theta$, is Lipschitz continuous, ensuring the existence and uniqueness of the probability flow ODE solution via the Picard-Lindelöf theorem. Under standard integrability conditions (Fubini's theorem), the Latent Space Invariance Error is bounded by the expected score-matching loss:
$$
\mathbb{E}\left[E_{\text{LSI}}\right]\le C'\cdot\mathbb{E}\left[\|\boldsymbol{\epsilon}_\theta-\boldsymbol{\epsilon}^*\|^2\right]
$$
for some constant $C'$, where $\boldsymbol{\epsilon}^*$ is the true score function.*
Proof The proof is in Appendix D. This proposition formally establishes that by eliminating structural error, the causal fidelity of BELM-MDCM is directly and provably controlled by its ability to accurately learn the data's score function.
### 2.6 Decomposing Error: A Motivation for Empirical Validation
The error decomposition in Theorem 8 provides a clear strategy for empirical validation by isolating two distinct error sources: the Structural Reconstruction Error ($E_{\text{SR}}$) and the Latent Space Invariance Error ($E_{\text{LSI}}$). While developing a single score combining these is future work, these components directly motivate our empirical investigations. Our ablation study (Section 5.4.2) is designed to measure the impact of a non-zero $E_{\text{SR}}$, while our stress-test (Section 5.4.1) probes robustness when latent space invariance is challenged by a non-invertible SCM.
### 2.7 Theoretical Roles of Targeted Modeling and Hybrid Training
With algorithmic error eliminated by our Zero-SRE design, the challenge becomes minimizing the modeling error ($E_{\text{LSI}}$). Our two methodological innovations, Targeted Modeling and Hybrid Training, are principled strategies for this purpose.
Targeted Modeling as Formal Complexity Control.
Our Targeted Modeling strategy acts as a form of structural regularization. The finite sample bound in Theorem 15 is governed by the Rademacher complexity $\mathfrak{R}_n(\mathcal{F}_\Theta)$ of the entire SCM's hypothesis space. By assigning low-complexity models to a subset of nodes, we directly constrain the overall complexity.
**Remark 11 (Effect on Generalization Bound)**
*Our Targeted Modeling strategy is formally justified as a complexity control mechanism. The Rademacher complexity of a composite SCM is bounded by the sum of the complexities of its individual mechanisms (mohri2018foundations). By strategically substituting a high-complexity diffusion model $F_{\text{diff}}$ with a lower-complexity alternative $F_{\text{simple}}$ for non-critical nodes, Targeted Modeling directly minimizes this upper bound. This leads to a tighter generalization bound and improves the statistical efficiency of the overall SCM.*
Hybrid Training as a Weighted Score-Matching Objective.
The Hybrid Training Objective, $L_{\text{total}}=L_{\text{diffusion}}+\lambda\cdot L_{\text{task}}$, imparts a crucial inductive bias for learning a causally salient score function. The task-specific loss acts as a conductor's baton, forcing the model to prioritize learning an accurate score function in regions of the data manifold most critical to the causal question. We formalize this by proposing that the auxiliary loss implicitly implements a weighted score-matching objective.
**Proposition 12 (Hybrid Objective as a Weighted Score-Matching Regularizer)**
*The auxiliary task loss $L_{\text{task}}$ provides a lower bound for the model's error, weighted by a function reflecting the causal salience of the data manifold. Minimizing the hybrid objective $L_{\text{total}}$ is thereby equivalent to solving a weighted score-matching problem that prioritizes accuracy in causally salient regions, leading to a smaller effective Latent Space Invariance Error. (A rigorous proof is provided in Appendix E.)*
This proposition formally grounds our hybrid training strategy, revealing that the task-specific loss intelligently forces the diffusion model to prioritize accuracy in regions of the data manifold most critical to the causal question. This reinforces the CIC principle by avoiding information loss where it matters most, effectively implementing the simplicity bias principle from Section 2.3.
We can deepen this insight by analyzing its information-theoretic implications.
**Proposition 13 (Disentanglement via Hybrid Objective)**
*Information-theoretically, the hybrid objective provides a strong inductive bias towards learning a disentangled latent representation. It encourages a "division of labor" where the task-specific component explains variance from the parents $Pa$, while the diffusion component's latent code $Z=T_\theta(V,Pa)$ models the residual information. This implicitly pushes $Z$ towards being independent of $Pa$, a crucial step towards satisfying the identifiability conditions.*
Proof A detailed information-theoretic argument is provided in Appendix E.
### 2.8 BELM-MDCM as a Unifying Framework
The principle of Causal Information Conservation also unifies our framework with classical models. Simpler models like Additive Noise Models (ANMs) can be seen as special cases where this principle is met trivially, positioning our work as a generalization of established causal principles. For instance, in a classic ANM (hoyer2009nonlinear), $V_i=f_i(Pa_i)+U_i$ , the noise is recovered by a direct, lossless inversion: $U_i=V_i-f_i(Pa_i)$ . Our framework generalizes this principle to arbitrarily complex, non-additive mechanisms, offering a flexible, non-parametric extension to classical structural equation models (wooldridge2010econometric). The importance of noise distributions, particularly non-Gaussianity, for identifiability in linear models is also a well-established principle (shimizu2006linear).
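The ANM case makes the full Abduction-Action-Prediction cycle explicit in a few lines; the mechanism $f$ and noise scale below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda pa: np.sin(pa) + 0.5 * pa      # nonlinear but additive mechanism (toy)
pa = rng.normal(size=500)
u = 0.3 * rng.normal(size=500)            # exogenous noise
v = f(pa) + u                             # observed outcomes V = f(Pa) + U

u_hat = v - f(pa)                         # abduction: lossless by construction
alpha = 1.5
v_cf = f(alpha) + u_hat                   # action + prediction under do(Pa := alpha)
```

Here Causal Information Conservation holds trivially: subtraction recovers $U$ exactly, so the counterfactual carries over each individual's idiosyncratic noise. The diffusion framework extends the same lossless round trip to mechanisms where noise enters non-additively.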
### 2.9 Learnability and Statistical Guarantees
We now provide finite-sample learnability guarantees for our SCM framework.
**Proposition 14 (Asymptotic Consistency)**
*Under standard regularity conditions, as the number of data samples $n\to\infty$ and model capacity $N\to\infty$, the learned operators $(\hat{T}_n,\hat{H}_n)$ are consistent estimators of the ideal operators $(T^*,H^*)$: $\hat{T}_n\xrightarrow{p}T^*$ and $\hat{H}_n\xrightarrow{p}H^*$.*
**Theorem 15 (Finite Sample Bound for Causal Diffusion SCMs)**
*Let an SCM consist of $d$ endogenous nodes, with a causal graph having a maximum in-degree of $d_{\text{in}}^{\max}$. Assume each causal mechanism is implemented by a score network $\epsilon_\theta$ that is an $L$-layer MLP with ReLU activations, and the spectral norm of each weight matrix is bounded by $B$. Let the input space be appropriately normalized. Let the loss function be bounded by $M$. Then, for the parameters $\hat{\theta}_n$ learned from $n$ samples, the excess risk is bounded with probability at least $1-\delta$:
$$
R(\hat{\theta}_n)-R(\theta^*)\le C\cdot\frac{d\cdot L\cdot B^L\cdot\sqrt{d_{\text{in}}^{\max}+d_{\text{embed}}+1}}{\sqrt{n}}+M\sqrt{\frac{\log(1/\delta)}{2n}}
$$
where $C$ is a constant independent of the network architecture and sample size, and $d_{\text{embed}}$ is the dimension of the time embedding.*
Proof The proof, which combines the sub-additivity of Rademacher complexity over the SCM with standard bounds for deep neural networks (bartlett2017spectrally; neyshabur2018pac), is detailed in Appendix G.
**Remark 16 (Interpretation of the Bound)**
*This refined bound quantitatively links the generalization error to:
1. Causal Complexity ($d\cdot\sqrt{d_{\text{in}}^{\max}}$): The error scales with the number of causal mechanisms ($d$) and the graph's complexity ($d_{\text{in}}^{\max}$), formalizing the intuition that more complex causal systems are harder to learn.
1. Network Complexity ($L\cdot B^L$): The error scales with the depth and spectral norm of the score networks. This provides direct theoretical grounding for our Targeted Modeling strategy, as using simpler models tightens this generalization bound.*
### 2.10 Implications for Causal Transportability
Causal Information Conservation also provides a foundation for transportability: applying knowledge from a source domain $S$ to a target domain $T$ (pearl2014transportability). Transportability requires separating invariant causal knowledge from domain-specific mechanisms. By losslessly recovering the exogenous noise $U$ (the invariant "causal essence"), our framework achieves this separation by design; the decoders $H_\theta$ represent the domain-specific mechanisms. This insight is formalized in the following theorem.
**Theorem 17 (Condition for Lossless Causal Transport)**
*Let a source domain $S$ and a target domain $T$ be described by SCMs $M^S$ and $M^T$, respectively. Assume the following conditions hold:
1. Shared Structure: Both domains share the same causal graph $G$ and the same exogenous noise distributions $\{p_i(U_i)\}$. The domains differ only in a subset of causal mechanisms $K_{\text{changed}}$.
1. Noise Independence: The exogenous noise variables $\{U_i\}_{i=1}^d$ are mutually independent.
1. Information Conservation: A model $(T_\theta,H_\theta)$ trained on data from $S$ satisfies the Causal Information Conservation principle, achieving zero Structural Reconstruction Error.
Then, causal knowledge can be losslessly transported from $S$ to $T$ by re-learning only the operators $\{T_{\theta_k},H_{\theta_k}\}$ corresponding to the changed mechanisms $k\in K_{\text{changed}}$, while directly reusing all operators for invariant mechanisms.*
Proof The proof is provided in Appendix F.
### 2.11 Discussion of Assumptions
Our framework rests on several key assumptions, which we now critically examine.
Our geometric inductive bias argument (Proposition 3) rests on the principle of simplicity bias. While this principle is a cornerstone of modern deep learning theory with substantial empirical backing, it remains an active area of research and is not a universally proven theorem. Our conclusions are therefore conditioned on the validity of this powerful but conjectural assumption.
The cornerstone of our identifiability theory (Theorem 2) is the SCM's invertibility with respect to its noise term $U$. This is a strong assumption; when violated (e.g., by a many-to-one function), the abduction task becomes ill-posed.
To address this foundational challenge, we provide an exhaustive theoretical treatment in Appendix C. There, we formalize the irreducible "representational error" and derive a tighter, more general error bound (Theorem 21). More importantly, we propose a concrete mitigation strategy: a novel prior-matching regularizer (Definition 23), theoretically shown to reduce the error by encouraging the learned encoder to approximate the ideal Maximum a Posteriori (MAP) solution (Proposition 24). This highlights a primary contribution: even in the challenging non-invertible case, BELM-MDCM's zero-SRE design eliminates the algorithmic error, thereby isolating the more fundamental representational challenge. Our stress-test in Section 5.4.1 empirically confirms this advantage, while validating our regularizer provides a clear direction for future work.
Our identifiability proof is dimension-dependent, leveraging Liouville's theorem for $d\ge 3$ and requiring stronger assumptions like asymptotic linearity for the special case of $d=2$. Other assumptions, such as Lipschitz continuity of the score network, are mild regularity conditions standard in deep generative model analysis and can be encouraged through architectural choices like spectral normalization.
## 3 Architectural Design and Training
The BELM-MDCM architecture embodies our core principles through a non-monolithic, theoretically-motivated design. Its central philosophy is Targeted Modeling: judiciously allocating the expressive power of our Zero-SRE CausalDiffusionModel to nodes of causal interest (e.g., Treatment T, Outcome Y), while using simpler, efficient mechanisms for confounders, as illustrated in Figure 1. This strategy provides practical complexity control, tightening the generalization bound as established in Theorem 15.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Causal Modeling Framework with Targeted Model Allocation
### Overview
The image displays a technical diagram illustrating a causal inference modeling framework. It depicts a causal graph with four nodes (W, X, T, Y) and specifies the modeling techniques applied to the relationships between them. A separate text box explains the "Targeted Modeling Principle" that guides this allocation of models.
### Components/Axes
**Nodes (Causal Variables):**
* **W**: Located at the top-left of the diagram.
* **X**: Located below node W on the left side.
* **T**: Located in the center of the diagram.
* **Y**: Located on the right side of the diagram.
**Directed Edges (Causal Relationships) & Associated Models:**
1. **W → T**: An arrow from W to T. The associated model annotation, placed to the left of the arrow, is a white box labeled: **"Empirical Distribution"**.
2. **X → T**: An arrow from X to T. The associated model annotation, placed along the arrow, is a green box labeled: **"CausalDiffusionModel (BELM-MDCM)"**.
3. **T → Y**: An arrow from T to Y. The associated model annotation, placed along the arrow, is a green box labeled: **"CausalDiffusionModel (BELM-MDCM)"**.
4. **X → Y**: A direct arrow from X to Y. No specific model annotation is attached to this edge.
**Additional Annotation:**
* A white box labeled **"Additive Noise Model"** is positioned below node X. Its placement suggests it may be a general model type or associated with the confounder nodes, but it is not directly linked to a specific arrow in this diagram.
**Text Box (Principle Explanation):**
Located at the bottom of the image, a bordered text box contains the following text:
> **Targeted Modeling Principle:**
>
> The expressive power of the **CausalDiffusionModel** is judiciously allocated to key causal nodes (Treatment **T**, Outcome **Y**) for high-fidelity counterfactual generation.
>
> Simpler, efficient mechanisms (e.g., ANM, Empirical Distribution) are used for confounder nodes (**W**, **X**) to ensure stability and efficiency.
### Detailed Analysis
The diagram presents a structured approach to causal modeling:
* **Key Causal Pathway (T → Y):** The relationship from Treatment (T) to Outcome (Y) is modeled using the most complex and expressive model, the **CausalDiffusionModel (BELM-MDCM)**.
* **Treatment Assignment Mechanism (W → T, X → T):** The influences on the Treatment node are modeled differently. The influence from confounder **W** is modeled with a simple **Empirical Distribution**. The influence from confounder **X** is modeled with the complex **CausalDiffusionModel (BELM-MDCM)**.
* **Direct Confounder Effect (X → Y):** A direct causal path from confounder X to Outcome Y is shown, but no specific model is assigned to it in this diagram.
* **Model Types:** Two primary model types are named:
1. **CausalDiffusionModel (BELM-MDCM):** A complex, expressive model.
2. **Simpler Mechanisms:** Explicitly mentioned are **ANM** (Additive Noise Model) and **Empirical Distribution**.
### Key Observations
1. **Asymmetric Model Allocation:** The framework does not apply the most powerful model uniformly. It strategically allocates the **CausalDiffusionModel** to the edges **X → T** and **T → Y**, which are deemed critical for high-fidelity generation of counterfactuals involving the Treatment and Outcome.
2. **Confounder Handling:** Confounder nodes (W, X) are generally associated with simpler models (Empirical Distribution, ANM) to prioritize stability and computational efficiency, with the notable exception of the path from X to T.
3. **Visual Coding:** Green boxes are used exclusively for the **CausalDiffusionModel**, creating a clear visual distinction from the white boxes used for simpler models and node labels.
4. **Principle Justification:** The text box provides the explicit rationale for the design, linking model complexity to the importance of the causal relationship for the end goal of counterfactual generation.
### Interpretation
This diagram represents a **pragmatic and resource-aware strategy for causal inference**. The core insight is that not all parts of a causal graph require the same level of modeling sophistication. By applying a high-fidelity, likely computationally intensive model (**CausalDiffusionModel**) only to the most critical pathways (those directly defining the treatment effect and its assignment from key confounders), the framework aims to achieve accurate counterfactual predictions. Meanwhile, using simpler, well-understood models for other relationships (like the empirical distribution of W) reduces overall complexity, improves stability, and conserves computational resources. This "targeted" approach reflects a common engineering trade-off between accuracy and efficiency in complex system modeling. The diagram serves as a blueprint for implementing such a hybrid causal model.
</details>
Figure 1: Illustration of the Targeted Modeling Principle. The expressive CausalDiffusionModel is judiciously allocated to key causal nodes (Treatment T, Outcome Y) for high-fidelity counterfactual generation. Simpler, efficient mechanisms (e.g., ANM, Empirical Distribution) are used for confounder nodes (W, X) to ensure stability and efficiency.
<details>
<summary>x2.png Details</summary>

### Visual Description
## System Architecture Diagram: Causal Identification Pipeline
### Overview
The image displays a technical flowchart illustrating a multi-stage data processing and machine learning pipeline designed for causal identification. The diagram flows from left to right, starting with raw input data, moving through pre-processing, embedding, training, and post-processing stages, and culminating in a causal identification result. The architecture is modular, with distinct colored blocks representing different functional units.
### Components/Flow
The diagram is organized into five horizontal phases, labeled at the bottom: **Pre-Processing**, **Embedding**, **Train**, **Post-Processing**, and **Results**.
**1. Input Data (Far Left):**
* **`x_num`**: A numerical input vector. The example values shown are `50`, `257`, and `-3.0`.
* **`x_cat1`**: A categorical input with example value `M`.
* **`x_cat2`**: A categorical input with example value `woman`.
* **`x_cat3`**: A categorical input represented by a blue triangle icon.
**2. Pre-Processing Stage:**
* **`StandardScaler`** (Yellow block): Receives the `x_num` input. A "Select" arrow points from the input line to this block, indicating a selection or routing mechanism.
* **`OneHotEncoder`** (Peach block): Receives all three categorical inputs (`x_cat1`, `x_cat2`, `x_cat3`).
**3. Embedding Stage:**
* **`Connection`** (Grey block): Performs a concatenation operation, denoted by the symbol `⊕`. It combines the processed numerical data (`x_num`) and the processed categorical data (`x_cat`) into a single representation.
* **`Timestep embedding`** (Green block): An independent module that feeds into the next stage.
**4. Train Stage:**
* **`BELM-MDCM module`** (Light blue, tall vertical block): This is the central processing unit. It receives two inputs:
1. The concatenated data from the `Connection` block.
2. The output from the `Timestep embedding` block.
* **`Noisy Target Variable`** (Light cyan block): This input also feeds directly into the `BELM-MDCM module`.
**5. Post-Processing & Results Stage:**
* **`Inverse Transformation`** (White block): Processes the output from the `BELM-MDCM module`.
* **`Causal Identification`** (Pink, tall vertical block): The final output stage of the pipeline, receiving the transformed data.
**Flow Direction:** Arrows clearly indicate a unidirectional data flow from the inputs on the left, through the sequential processing blocks, to the final output on the right. The `Timestep embedding` and `Noisy Target Variable` provide auxiliary inputs to the central training module.
### Detailed Analysis
* **Data Transformation Path:** Numerical data (`x_num`) is scaled via `StandardScaler`. Categorical data (`x_cat`) is one-hot encoded. These parallel streams are then fused (`Connection`).
* **Core Model:** The `BELM-MDCM module` is the primary computational engine, integrating the fused data features, temporal information (`Timestep embedding`), and the target variable (`Noisy Target Variable`).
* **Output Processing:** The model's output undergoes an `Inverse Transformation` before being used for the final task of `Causal Identification`.
* **Visual Coding:** Colors are used to group related functions: yellow/peach for pre-processing, green for embedding, blue for the core model, and pink for the final result.
### Key Observations
1. **Hybrid Data Handling:** The pipeline explicitly separates and then recombines numerical and categorical data streams, a common practice in tabular data modeling.
2. **Temporal Component:** The inclusion of a `Timestep embedding` suggests the model is designed to handle sequential or time-series data.
3. **Noise in Target:** The `Noisy Target Variable` input implies the model is robust to or specifically designed for learning from imperfect or noisy labels.
4. **Modular Design:** Each major step (scaling, encoding, embedding, core modeling, transformation) is encapsulated in its own block, suggesting a flexible and interpretable architecture.
### Interpretation
This diagram represents a sophisticated machine learning pipeline for **causal inference** from structured, likely temporal, data. The process begins by standardizing and encoding raw features, then projects them into a learned embedding space alongside temporal information. The core `BELM-MDCM module` (the acronym is not defined in the image) presumably performs the main causal discovery or estimation task, potentially using a method robust to target noise. The final `Inverse Transformation` likely maps the model's internal representations back into an interpretable causal effect or graph.
The pipeline's structure suggests an application where understanding the cause-and-effect relationships between variables (e.g., in economics, healthcare, or social science) is critical, and the data contains both static features and a time dimension. The explicit handling of a noisy target indicates a practical consideration for real-world data where ground truth may be uncertain.
</details>
Figure 2: The detailed internal architecture of the CausalDiffusionModel. This diagram illustrates the end-to-end workflow of the causal mechanism designed for key nodes like Treatment T and Outcome Y, detailing the pre-processing, embedding, training, and post-processing stages.
The internal architecture of the CausalDiffusionModel itself, depicted in Figure 2, is engineered to learn the complex, non-linear mapping $v_i:=f_i(pa_i,u_i)$ with high fidelity.
### 3.1 Mechanism for Exogenous Nodes
Exogenous nodes (without parents in the causal graph $G$ ) are modeled non-parametrically via the Empirical Distribution of the observed data. This approach avoids distributional assumptions and provides a robust foundation for the Structural Causal Model (SCM).
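As a minimal sketch (function name is illustrative), this non-parametric mechanism amounts to resampling the observed values with replacement:

```python
import numpy as np

def sample_exogenous(observed: np.ndarray, n_samples: int, rng=None) -> np.ndarray:
    """Sample an exogenous node from the empirical distribution of the
    observed data, i.e., resample observed values with replacement."""
    rng = np.random.default_rng(rng)
    idx = rng.integers(0, len(observed), size=n_samples)
    return observed[idx]
```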
### 3.2 Mechanism for Endogenous Nodes: The CausalDiffusionModel
For endogenous nodes $V_i$ , particularly those central to the causal query (treatment, outcome, key mediators), we employ our bespoke CausalDiffusionModel to learn the functional mapping $v_i:=f_i(pa_i,u_i)$ .
#### 3.2.1 Conditioning via Parent Node Transformation
The denoising process is conditioned on the parent nodes $pa_i$, which are transformed into a fixed-dimensional conditioning vector $c \in \mathbb{R}^{d_c}$. A ColumnTransformer handles heterogeneous data types: continuous parents are standardized (StandardScaler) to unify scales, while categorical parents are one-hot encoded (OneHotEncoder) to prevent artificial ordinality. The resulting vectors are concatenated into $c$, which remains constant for a given sample's diffusion trajectory.
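The transformation can be sketched in plain numpy (a stand-in for the scikit-learn ColumnTransformer pipeline described above; the function name and shapes are illustrative):

```python
import numpy as np

def build_conditioning(pa_num: np.ndarray, pa_cat: np.ndarray, n_classes: int) -> np.ndarray:
    """Standardize continuous parents, one-hot encode categorical parents,
    and concatenate into one conditioning vector c per sample (row)."""
    z = (pa_num - pa_num.mean(axis=0)) / pa_num.std(axis=0)  # StandardScaler equivalent
    onehot = np.eye(n_classes)[pa_cat]                       # OneHotEncoder equivalent
    return np.concatenate([z, onehot], axis=1)               # fixed-dimensional c
```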
#### 3.2.2 The Denoising Process
The core of the CausalDiffusionModel is a denoising network $\epsilon_\theta(v_t, t, c)$, implemented as a Residual MLP (he2016deep). It takes as input the noisy variable $v_t$, a sinusoidal Time Embedding of timestep $t$, and the conditioning vector $c$. Before the diffusion process, the target variable $V_i$ is also preprocessed (standardized for continuous values or label-encoded for categorical ones). The denoising process is driven by the BELM sampler, ensuring a mathematically exact and stable inversion path as established in Section 2.
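The sinusoidal Time Embedding can be sketched as follows; this is the standard transformer-style formulation, and the exact frequency schedule used here is an assumption:

```python
import numpy as np

def timestep_embedding(t: int, dim: int, max_period: float = 10000.0) -> np.ndarray:
    """Sinusoidal embedding of a diffusion timestep t into a dim-vector
    (dim assumed even): geometrically spaced frequencies, cos/sin halves."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.cos(args), np.sin(args)])
```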
#### 3.2.3 Hybrid Training Objective
We introduce a Hybrid Training Objective to reconcile generative fidelity with predictive accuracy. As established in our theoretical analysis (Proposition 12), this is more than a standard multi-task learning scheme; it acts as a powerful inductive bias, creating a weighted score-matching objective that prioritizes accuracy in causally salient regions of the data manifold. The total loss is a linearly weighted combination:
$$
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{diffusion}} + \lambda \cdot \mathcal{L}_{\text{task}} \tag{5}
$$
where $\mathcal{L}_{\text{diffusion}}$ is the noise prediction error (Equation 1). The auxiliary loss $\mathcal{L}_{\text{task}}$ is a Mean Squared Error for continuous nodes ($\mathcal{L}_{\text{regression}}$) and a Cross-Entropy loss for discrete nodes ($\mathcal{L}_{\text{classification}}$).
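For a continuous node, Equation (5) reduces to a weighted sum of two mean-squared errors; a minimal sketch (the default weight value is illustrative):

```python
import numpy as np

def hybrid_loss(eps_pred, eps_true, task_pred, task_true, lam=0.1):
    """Equation (5): L_total = L_diffusion + lambda * L_task.
    L_diffusion is the noise-prediction MSE; L_task is the regression MSE
    used for continuous nodes (discrete nodes would use cross-entropy)."""
    l_diffusion = np.mean((eps_pred - eps_true) ** 2)
    l_task = np.mean((task_pred - task_true) ** 2)
    return l_diffusion + lam * l_task
```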
#### 3.2.4 Decoding and Counterfactual Generation
For generation, the BELM sampler produces an output in the normalized space. This is then mapped back to the original data domain using the inverse transformations of the pre-fitted preprocessors (StandardScaler for continuous, LabelEncoder for categorical). For categorical outputs, the continuous value is rounded and clipped to the valid class range before the inverse mapping, ensuring that generated (counterfactual) data is interpretable and resides in the correct space.
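The categorical decoding step can be sketched as below; the LabelEncoder's inverse mapping is represented here by indexing into an array of class labels:

```python
import numpy as np

def decode_categorical(raw: np.ndarray, classes: np.ndarray) -> np.ndarray:
    """Map the sampler's continuous output back to valid category labels:
    round to the nearest class index, clip to the valid range, then apply
    the inverse label mapping (indexing into `classes`)."""
    idx = np.clip(np.rint(raw).astype(int), 0, len(classes) - 1)
    return classes[idx]
```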
## 4 New Evaluation Metrics for Generative Causal Models
The principle of Causal Information Conservation demands new evaluation dimensions that traditional metrics like ATE and PEHE cannot capture. An accurate ATE score, for instance, could arise from a model with high SRE where individual errors fortuitously cancel out at the population level. To move beyond mere outcome accuracy and directly assess a modelâs adherence to our foundational principle, we propose a new, theoretically-grounded evaluation framework.
### 4.1 Causal Information Conservation Score (CIC-Score)
The Causal Information Conservation Score (CIC-Score) is a direct empirical diagnostic for the Structural Reconstruction Error. It quantifies a framework's adherence to the CIC principle by disentangling algorithmic information loss (from an imperfect inversion process) from modeling error (from the statistical challenge of learning the true causal mechanism). We define the score, bounded in $[0,1]$, using an exponential formulation:
$$
\text{CIC-Score} = \exp\left(-\left(\delta_U + \delta_{SRE}\right)\right)
$$
The error components are designed to isolate distinct failure modes:
- $\delta_U$, the Relative Noise Recovery Error, quantifies the modeling error. It measures how well the trained network approximates the true score function, reflected in the fidelity of the recovered noise $\hat{U}$ versus the ground-truth $U_{\text{true}}$:
$$
\delta_U = \frac{\mathbb{E}\left[\|\hat{U}_{\text{scaled}} - U_{\text{true,scaled}}\|^2\right]}{\mathbb{E}\left[\|U_{\text{true,scaled}}\|^2\right]}
$$
This term captures all inaccuracies from finite data and imperfect optimization.
- $\delta_{SRE}$, the Normalized Structural Error, exclusively quantifies the algorithmic error inherent to the inversion process itself. Its definition is model-dependent to allow for fair comparisons:
- For frameworks with analytical invertibility (e.g., our BELM-MDCM, ANMs), the algorithm introduces no information loss, so we set $\delta_{SRE} \equiv 0$ by construction. Any observed reconstruction error is a symptom of modeling error, already captured by $\delta_U$.
- For frameworks relying on approximate inversion (e.g., DDIM), $\delta_{SRE}$ is empirically measured to quantify this inherent algorithmic flaw:
$$
\delta_{SRE} = \frac{\mathbb{E}\left[\|(H_\theta \circ T_\theta - I)X\|^2\right]}{\mathbb{E}\left[\|X\|^2\right]}
$$
This principled decomposition allows the CIC-Score to fairly assess different frameworks by isolating structural design advantages from the universal challenge of model training.
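A minimal sketch of the score for an analytically invertible framework, where $\delta_{SRE} = 0$ by construction (names are illustrative; inputs are assumed pre-scaled):

```python
import numpy as np

def cic_score(u_hat, u_true, delta_sre=0.0):
    """CIC-Score = exp(-(delta_U + delta_SRE)).
    delta_U is the relative noise-recovery error between recovered and
    ground-truth noise; delta_SRE defaults to 0, as for analytically
    invertible samplers, and is measured empirically otherwise."""
    delta_u = np.mean((u_hat - u_true) ** 2) / np.mean(u_true ** 2)
    return np.exp(-(delta_u + delta_sre))
```

Perfect noise recovery with an exact inversion yields a score of 1; any modeling or algorithmic error pushes the score below 1.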
### 4.2 Causal Mechanism Fidelity Score (CMF-Score)
A generative causal modelâs core promise is to learn true causal mechanisms, not just outcomes. NaĂŻve metrics like pairwise correlations fail to capture the non-linear, multi-variable, and directional nature of causality. We therefore propose the Causal Mechanism Fidelity (CMF) score, a hierarchical framework with two levels of increasing rigor.
#### 4.2.1 Level 1 (Pragmatic): The Conditional Mutual Information Score (CMI-Score)
The Conditional Mutual Information (CMI), $I(V_i; V_j \mid Pa_j \setminus \{V_i\})$, is a non-parametric, non-linear measure of the direct influence a parent $V_i$ has on its child $V_j$ after accounting for all other parents. The CMI-Score evaluates whether this influence is preserved. For a single mechanism $V_j$, it is the average consistency across all parent-child edges:
$$
\text{CMI-Score}(V_j) = \frac{1}{|Pa_j|} \sum_{V_i \in Pa_j} \left(1 - \frac{\left|I_{\text{obs}}(V_i; V_j \mid \cdot) - I_{\text{cf}}(V_i; V'_j \mid \cdot)\right|}{I_{\text{obs}}(V_i; V_j \mid \cdot) + \epsilon}\right)
$$
where $I_{\text{obs}}$ and $I_{\text{cf}}$ are the CMI values from observational and counterfactual data, respectively. The final CMI-Score is the average over all SCM mechanisms.
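Given per-edge CMI estimates from an external non-parametric estimator (the estimator itself is not specified here), the aggregation for one mechanism can be sketched as:

```python
import numpy as np

def cmi_score(cmi_obs: dict, cmi_cf: dict, eps: float = 1e-8) -> float:
    """Average edge-wise consistency of conditional mutual information
    between observational and counterfactual data, for one mechanism V_j.
    Keys of the dicts index parent-child edges; values are CMI estimates."""
    terms = [
        1.0 - abs(cmi_obs[e] - cmi_cf[e]) / (cmi_obs[e] + eps)
        for e in cmi_obs
    ]
    return float(np.mean(terms))
```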
#### 4.2.2 Level 2 (Gold Standard): The Kernelized Mechanism Discrepancy (KMD) Score
To rigorously compare entire conditional distributions, we use the Maximum Mean Discrepancy (MMD) (gretton2012kernel), a kernel-based statistical test for distributional equality. The KMD-Score applies this test to the conditional distributions $p(V_j|Pa_j)$ that define each causal mechanism, measuring the discrepancy between the learned and observed conditionals. The final score is mapped to a similarity measure in $[0,1]$ :
$$
\text{KMD-Score} = \exp\left(-\gamma \cdot \mathbb{E}_{pa_j \sim p(Pa_j)}\left[\text{MMD}\left(p(V_j \mid pa_j),\, p_\theta(V_j \mid pa_j)\right)\right]\right)
$$
where $\gamma$ is a scaling parameter. A score of 1 indicates that the learned conditional mechanism is statistically indistinguishable from the observed one.
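A minimal sketch using a biased RBF-kernel MMD$^2$ estimator for one-dimensional samples (kernel bandwidth and $\gamma$ values are illustrative, not the paper's settings):

```python
import numpy as np

def rbf_mmd2(x: np.ndarray, y: np.ndarray, sigma: float = 1.0) -> float:
    """Biased estimate of squared MMD between 1-D samples x and y
    under a Gaussian (RBF) kernel with bandwidth sigma."""
    def k(a, b):
        d2 = (a[:, None] - b[None, :]) ** 2
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def kmd_score(mmd_values: np.ndarray, gamma: float = 1.0) -> float:
    """KMD-Score = exp(-gamma * E[MMD]), averaging MMD over the
    conditioning values pa_j at which the conditionals are compared."""
    return float(np.exp(-gamma * np.mean(mmd_values)))
```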
**Complementary and Validated Evaluation Metrics.**
Our proposed metrics complement, rather than replace, traditional ones like ATE and PEHE. They evaluate distinct facets of performance: while ATE/PEHE measure outcome accuracy, the CMF-Score assesses mechanism fidelity. This distinction is critical, as a model can achieve an accurate ATE estimate via fortuitous error cancellation despite failing to learn the true data-generating process. To ensure our metrics are practically reliable, we conducted a controlled micro-simulation study, detailed in Appendix J. The results provide strong empirical evidence for their validity and complementary roles: the CIC-Score acts as a high-sensitivity SRE detector; the CMI-Score robustly tracks the fidelity of causal associations; and the KMD-Score serves as a final arbiter of distributional similarity. This validation confirms that our evaluation framework offers a more complete, nuanced, and reliable assessment of generative causal models.
## 5 Experiments
Our empirical evaluation is designed as a comprehensive test of our central thesis: that eliminating SRE is a necessary condition for generating authentic counterfactuals and unlocks analytical capabilities beyond the reach of conventional methods. We structured the study as a four-act narrative to rigorously test our claims. Act I establishes our modelâs state-of-the-art predictive fidelity on standard benchmarks. Act II provides a deep diagnostic analysis, using our proposed metrics as empirical evidence for the destructive effect of SRE. Act III showcases the unique capabilities unlocked by an information-conserving framework. Finally, Act IV validates the frameworkâs robustness through a series of stress tests and a full ablation study.
**Evaluation Protocol.**
For a rigorous evaluation, we employ two complementary protocols. This distinction is crucial, as it separates the assessment of our methodologyâs peak performance from the diagnostic analysis of its components.
1. Ensemble Evaluation for SOTA Performance: To benchmark against state-of-the-art methods (specifically, ITE estimation in Section 5.1.3), we adopt the standard Deep Ensemble methodology. We train N=5 independent models and report the final metric (e.g., PEHE) on the ensembled prediction.
2. Individual Model Evaluation for Diagnostic Analysis: In all other experiments where the goal is a fair, apples-to-apples architectural comparison or stability assessment, we report the mean and standard deviation of metrics from individual model instances across N=5 runs. This isolates the effect of design choices from the gains of ensembling.
We estimate the Average Treatment Effect (ATE) throughout our experiments using a standard counterfactual imputation procedure, the pseudo-code for which is detailed in Algorithm 1 in Appendix K.
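A hedged sketch of counterfactual-imputation ATE estimation; `predict_cf` is a hypothetical interface standing in for the fitted model's counterfactual prediction for a unit under a given treatment (Algorithm 1 in Appendix K is the authoritative specification):

```python
import numpy as np

def ate_by_imputation(t, y, predict_cf):
    """For each unit, keep the factual outcome and impute the unobserved
    potential outcome with `predict_cf(i, t_new)` (hypothetical model
    interface), then average the unit-level differences Y(1) - Y(0)."""
    y1 = np.where(t == 1, y, [predict_cf(i, 1) for i in range(len(t))])
    y0 = np.where(t == 0, y, [predict_cf(i, 0) for i in range(len(t))])
    return float(np.mean(y1 - y0))
```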
**Baseline Estimators.**
The Directed Acyclic Graphs (DAGs) for our experiments are shown in Figure 3. We benchmark BELM-MDCM against a suite of baselines from the DoWhy library (sharma2022dowhy), spanning classical statistical methods to state-of-the-art machine learning estimators to ensure a comprehensive comparison.
<details>
<summary>x3.png Details</summary>

### Visual Description
## Directed Acyclic Graph (DAG): Causal Model with Covariates and Confounder
### Overview
The image displays a directed acyclic graph (DAG), a type of causal diagram used in statistics and epidemiology to represent hypothesized causal relationships between variables. The diagram illustrates a model where two covariates (W1, W2) influence a categorical confounder (C1) and the treatment (T), which in turn affects the outcome (Y). The confounder also directly influences both the treatment and the outcome.
### Components/Axes
The diagram consists of five nodes connected by directed arrows (edges). There are no numerical axes, as this is a conceptual model.
**Nodes (Variables):**
1. **W1 (Covariate)**: Located in the top-left corner. Represented by a light blue rectangle with a blue border.
2. **W2 (Covariate)**: Located in the top-right corner. Represented by a light blue rectangle with a blue border.
3. **C1 (Categorical Confounder)**: Located in the center, below W1 and W2. Represented by a light purple rectangle with a purple border.
4. **T (Treatment)**: Located in the center, below C1. Represented by an orange diamond shape.
5. **Y (Outcome)**: Located at the bottom center. Represented by a light green circle with a double green border.
**Edges (Directed Relationships):**
The arrows indicate the direction of hypothesized causal influence. All arrows are grey except for one.
* From **W1** to **C1** (grey arrow).
* From **W1** to **T** (grey arrow).
* From **W1** to **Y** (grey arrow).
* From **W2** to **C1** (grey arrow).
* From **W2** to **T** (grey arrow).
* From **W2** to **Y** (grey arrow).
* From **C1** to **T** (grey arrow).
* From **C1** to **Y** (grey arrow).
* From **T** to **Y** (a thicker, **orange** arrow, visually emphasizing the primary causal path of interest).
### Detailed Analysis
The diagram explicitly maps the following causal pathways:
1. **Direct Effects of Covariates**: Both covariates (W1, W2) have direct causal paths to all other variables in the model: the confounder (C1), the treatment (T), and the outcome (Y).
2. **Role of the Confounder**: The categorical confounder (C1) is influenced by the covariates and, in turn, exerts a direct causal influence on both the treatment (T) and the outcome (Y). This creates a "backdoor path" from T to Y via C1, which must be controlled for to estimate the true effect of T on Y.
3. **Primary Causal Path**: The central relationship of interest is the direct effect of the Treatment (T) on the Outcome (Y), indicated by the distinct, thicker orange arrow.
4. **Common Causes**: W1 and W2 act as common causes (or sources of variation) for both the confounder and the treatment, and they also directly affect the outcome.
### Key Observations
* **Visual Emphasis**: The arrow from **T to Y** is the only one colored orange and is thicker than the others. This design choice highlights it as the primary causal relationship the model is designed to investigate.
* **Node Coding**: Each variable type is encoded with a distinct shape and color: rectangles for covariates/confounder, a diamond for treatment, and a circle for outcome. This follows common conventions in causal diagrams.
* **Complex Confounding Structure**: The model depicts a scenario where simple adjustment for C1 may not be sufficient to block all confounding, as the covariates W1 and W2 also create additional paths between T and Y.
### Interpretation
This DAG represents a causal inference model for estimating the effect of a treatment (T) on an outcome (Y) in the presence of a categorical confounder (C1) and two covariates (W1, W2).
The diagram suggests that to isolate the true causal effect of T on Y (the orange path), an analyst must account for all common causes of T and Y. In this model, those are:
1. The direct confounder **C1**.
2. The covariates **W1** and **W2**, which affect both T and Y directly and also indirectly through C1.
The model implies that a valid analysis (e.g., using regression adjustment, stratification, or propensity score methods) would need to condition on W1, W2, and C1 to block the "backdoor" non-causal associations between T and Y. The diagram serves as a visual blueprint for specifying a statistical model or identifying the necessary variables for a causal analysis. It underscores that the relationship between T and Y is not isolated but is part of a network of influences.
</details>
(a) PSM Failure Scenario
<details>
<summary>x4.png Details</summary>

### Visual Description
## Causal Diagram: Instrumental Variable with Mediation
### Overview
The image displays a directed acyclic graph (DAG) illustrating a causal model. It depicts the relationships between several variables: two confounders (X1, X2), an instrumental variable (Z), a treatment (T), a mediator (M), and an outcome (Y). The diagram uses distinct shapes and colors to denote different variable types and employs arrows (solid and dashed) to represent hypothesized causal pathways.
### Components/Axes
The diagram is composed of six labeled nodes connected by directional arrows. There are no traditional axes, as this is a conceptual model, not a data chart.
**Nodes (Variables):**
1. **X1 (Confounder):** Located in the top-left corner. Represented by a light blue rectangle.
2. **X2 (Confounder):** Located in the top-right corner. Represented by a light blue rectangle.
3. **Z (Instrumental Variable):** Located in the top-center. Represented by a teal hexagon.
4. **T (Treatment):** Located in the center. Represented by an orange diamond.
5. **M (Mediator):** Located directly below T. Represented by a teal oval.
6. **Y (Outcome):** Located at the bottom-center. Represented by a light green circle with a double outline.
**Arrows (Causal Pathways):**
* **Solid Grey Arrows:**
* From X1 to T.
* From X1 to Y.
* From X2 to T.
* From X2 to Y.
* From X2 to M.
* **Dashed Teal Arrow:**
* From Z to T.
* **Solid Red Arrows:**
* From T to M.
* From M to Y.
### Detailed Analysis
The diagram explicitly defines the role of each variable and maps the proposed causal flows.
* **Confounders (X1, X2):** These variables are shown to influence both the Treatment (T) and the Outcome (Y) directly, which is the classic definition of a confounder. X2 also has a direct path to the Mediator (M).
* **Instrumental Variable (Z):** This variable is connected *only* to the Treatment (T) via a dashed arrow. The dashed line may indicate a specific assumption, such as a weaker or conditional relationship, or simply differentiate it visually. Crucially, Z has no direct path to the Outcome (Y) or the Mediator (M), satisfying the core exclusion restriction assumption for an instrument.
* **Treatment (T):** This is the central intervention or exposure variable. It is influenced by the confounders (X1, X2) and the instrumental variable (Z). It then influences the Outcome (Y) through two pathways: a direct path (not shown) and an indirect path via the Mediator (M).
* **Mediator (M):** This variable sits on the causal pathway between Treatment (T) and Outcome (Y). The red arrows from T to M and M to Y highlight this specific mediated effect.
* **Outcome (Y):** The final variable of interest. It is influenced directly by the confounders (X1, X2) and the mediator (M). The double circle around Y emphasizes its status as the primary endpoint.
### Key Observations
1. **Variable Typing by Shape:** The diagram uses a consistent visual language: rectangles for confounders, a hexagon for the instrument, a diamond for the treatment, an oval for the mediator, and a circle for the outcome.
2. **Pathway Highlighting:** The causal path from Treatment (T) to Outcome (Y) via the Mediator (M) is emphasized with red arrows, drawing attention to the mediation analysis component of the model.
3. **Instrumental Variable Isolation:** The instrumental variable Z is visually and structurally isolated, connecting only to T. This is a critical feature for its valid use in causal inference.
4. **Complex Confounding:** The model accounts for confounding from two separate sources (X1 and X2), with X2 having a more complex role as it also affects the mediator.
### Interpretation
This diagram represents a sophisticated causal inference framework that combines **instrumental variable (IV) analysis** with **mediation analysis**.
* **Purpose:** The model is designed to estimate the causal effect of a Treatment (T) on an Outcome (Y) in the presence of unmeasured confounding, using Z as an instrument. Furthermore, it seeks to decompose this total effect into a direct effect and an indirect effect that flows through the Mediator (M).
* **Relationships:** The structure suggests that while Z can be used to isolate variation in T that is not confounded by X1 or X2, the effect of T on Y is not entirely direct. A portion of T's influence is transmitted through changing M, which in turn affects Y.
* **Notable Implications:**
* The direct arrow from X2 to M indicates that the mediator itself is confounded. This is an important consideration for mediation analysis, as it violates the "no mediator-outcome confounder" assumption unless X2 is measured and controlled.
* The absence of a direct arrow from Z to Y or M is the key assumption that makes Z a valid instrument. If this assumption is violated (e.g., Z affects Y through a path other than T), the IV estimates would be biased.
* This model would be used in fields like epidemiology, economics, or social sciences to answer questions such as: "What is the effect of a job training program (T) on future earnings (Y), and how much of that effect works through increasing skills (M), while accounting for pre-existing ability (X1, X2) using an instrument like random assignment to the program (Z)?"
</details>
(b) Ablation Study Scenario
<details>
<summary>x5.png Details</summary>

### Visual Description
## Causal Diagram: Confounding Structure in Treatment-Outcome Analysis
### Overview
The image is a directed acyclic graph (DAG) illustrating a classic confounding structure in causal inference. It depicts the relationships between a set of confounding variables, a treatment variable, and an outcome variable. The diagram is designed to show how external factors can influence both the treatment assignment and the observed outcome, thereby complicating the estimation of the true causal effect of the treatment.
### Components/Axes
The diagram consists of three primary nodes connected by directional arrows (edges).
1. **Top Node (Confounders):**
* **Shape & Color:** A blue rectangle with a solid border.
* **Label Text:** "Confounders (age, educ, re74, etc.)"
* **Position:** Centered at the top of the diagram.
* **Function:** Represents a set of pre-treatment variables that are potential common causes of both the treatment and the outcome. The listed examples are "age," "educ" (likely education), and "re74" (likely real earnings in 1974).
2. **Middle-Left Node (Treatment):**
* **Shape & Color:** An orange diamond (rhombus) with a solid border.
* **Label Text:** "Treatment (treat)"
* **Position:** Located below and to the left of the Confounders node.
* **Function:** Represents the intervention or exposure variable of interest, labeled "treat."
3. **Bottom-Right Node (Outcome):**
* **Shape & Color:** A light green circle with a double-line border.
* **Label Text:** "Outcome (re78)"
* **Position:** Located below and to the right of the Treatment node, and directly below the Confounders node.
* **Function:** Represents the measured result variable, labeled "re78" (likely real earnings in 1978).
**Arrows (Edges):**
* **From Confounders to Treatment:** A solid gray arrow points from the bottom of the Confounders rectangle to the top corner of the Treatment diamond. This indicates that the confounding variables influence the assignment of the treatment.
* **From Confounders to Outcome:** A solid gray arrow points from the bottom of the Confounders rectangle to the top of the Outcome circle. This indicates that the confounding variables also directly influence the outcome.
* **From Treatment to Outcome:** A solid **red** arrow points from the bottom corner of the Treatment diamond to the left side of the Outcome circle. This represents the direct causal path of interestâthe effect of the treatment on the outcome.
### Detailed Analysis
The diagram explicitly maps the flow of influence:
1. **Confounders → Treatment:** The gray arrow establishes that variables like age, education, and prior earnings affect who receives the treatment. This is a source of selection bias.
2. **Confounders → Outcome:** The second gray arrow shows these same variables also affect the outcome (e.g., earnings in 1978) independently of the treatment.
3. **Treatment → Outcome:** The red arrow highlights the primary causal relationship under study. However, its effect is "confounded" by the two gray paths originating from the Confounders node.
The use of distinct shapes (rectangle, diamond, circle) and colors (blue, orange, green) visually separates the three types of variables. The red color of the Treatment→Outcome arrow emphasizes it as the focal relationship.
### Key Observations
* **Classic Backdoor Path:** The diagram visually defines a "backdoor path" from Treatment to Outcome via the Confounders: Treatment ← Confounders → Outcome. This open path creates a non-causal association between treatment and outcome.
* **Visual Emphasis:** The red arrow draws the viewer's eye to the direct causal effect, while the gray arrows represent the confounding bias that must be controlled for in analysis.
* **Variable Naming:** The labels "re74" and "re78" strongly suggest this diagram is based on a well-known econometric dataset (likely the Lalonde dataset) studying the effect of a job training program ("treat") on subsequent earnings ("re78"), with prior earnings ("re74") and demographics as confounders.
### Interpretation
This diagram is a foundational tool in causal inference, particularly for observational studies. It argues that a naive comparison of outcomes between treated and untreated groups would be biased because the groups differ systematically in their confounding characteristics (age, education, prior earnings).
The diagram dictates the analytical strategy: to estimate the true causal effect of "treat" on "re78," one must "block" the backdoor path. This is achieved by statistically **controlling for** (adjusting for) the variables in the "Confounders" set. Methods like regression adjustment, matching, or stratification aim to create a comparison where the confounders are balanced across treatment groups, effectively severing the gray arrows and isolating the red causal path.
In essence, the image is not just a flowchart but a **causal model** that encodes assumptions about the data-generating process. It visually justifies the need for specific statistical controls and warns against interpreting raw associations as causal effects.
</details>
(c) Lalonde Confounding Structure
Figure 3: Directed Acyclic Graphs (DAGs) for key experiments. (a) A structure designed to challenge propensity score methods. (b) A mediation structure used for the ablation study. (c) The standard confounding structure assumed for both Lalonde-based experiments.
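The confounding structure in panel (c) dictates a backdoor-adjustment strategy. As a minimal sketch of why this matters, the snippet below simulates a toy version of that structure (a single hypothetical confounder standing in for the age/educ/re74 set, with invented coefficients, not the paper's data) and shows that a naive treated-vs-control contrast is biased while adjusting for the confounder recovers the true effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# One hypothetical confounder standing in for the Fig. 3(c) set
# (age, educ, re74); all coefficients are invented for illustration.
x = rng.normal(size=n)
p_treat = 1.0 / (1.0 + np.exp(-1.5 * x))      # Confounders -> Treatment
t = rng.binomial(1, p_treat)
tau = 2.0                                      # true effect of T on Y
y = tau * t + 3.0 * x + rng.normal(size=n)     # Confounders -> Outcome

# Naive contrast leaves the backdoor path T <- X -> Y open, so it is biased.
naive = y[t == 1].mean() - y[t == 0].mean()

# Regression adjustment on X blocks the backdoor path and recovers tau.
design = np.column_stack([np.ones(n), t, x])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
adjusted = float(beta[1])
```

Here `naive` overshoots `tau` by the open backdoor contribution, while `adjusted` lands close to the true value of 2.0; matching, stratification, or weighting on the confounders play the same blocking role.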
### 5.1 Act I: Establishing State-of-the-Art Predictive Fidelity
We first establish that our principled design achieves superior predictive fidelity on standard causal inference benchmarks.
#### 5.1.1 Robustness in Non-Linear Confounding Scenarios
We tested our model in a challenging synthetic scenario (Figure 3(a)) designed with highly non-linear confounding to cause propensity-based methods to fail. Table 1 shows the results. While Causal Forest is exceptionally accurate on this specific DGP, our BELM-MDCM framework secures its position as the second most accurate method, delivering a highly stable and competitive ATE estimate. Crucially, it significantly outperforms the entire suite of propensity-based methods and powerful estimators like DML in accuracy. The high standard deviation of DML highlights its unreliability in this context, validating our model as a robust estimator where traditional approaches are compromised.
Table 1: ATE Estimation on the PSM Failure Scenario (True ATE = 5000). We report the mean ATE and standard deviation across multiple runs.
| Method | ATE Estimate (Mean $±$ Std) | Absolute Error |
| --- | --- | --- |
| Causal Forest | 4895.77 $±$ 69.26 | 104.23 |
| Propensity Score Stratification | 5309.38 $±$ 185.36 | 309.38 |
| Linear Regression | 5348.82 $±$ 23.23 | 348.82 |
| Propensity Score Matching | 5353.93 $±$ 191.36 | 353.93 |
| Inverse Propensity Weighting | 5385.68 $±$ 52.03 | 385.68 |
| Double Machine Learning | 4285.63 $±$ 550.97 | 714.37 |
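The failure mode this scenario targets can be reproduced in miniature. In the hypothetical DGP below (illustrative only, not the paper's actual benchmark), confounding flows through $x^2$; an IPW estimator built on a misspecified linear-in-$x$ propensity model (fit by a crude gradient-descent logistic regression for self-containment) stays badly biased, while an outcome model that includes the non-linear term recovers the true ATE:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical DGP in the spirit of Fig. 3(a): confounding through x**2,
# which a linear-in-x propensity model cannot represent.
x = rng.normal(size=n)
p_true = 1.0 / (1.0 + np.exp(-(x**2 - 1.0)))   # true propensity, non-linear in x
t = rng.binomial(1, p_true)
ate = 5.0
y = ate * t + 4.0 * x**2 + rng.normal(size=n)

# Misspecified propensity model: logistic in x only (crude gradient fit).
feats = np.column_stack([np.ones(n), x])
w = np.zeros(2)
for _ in range(200):
    pred = 1.0 / (1.0 + np.exp(-feats @ w))
    w -= 0.1 * feats.T @ (pred - t) / n
ps = np.clip(1.0 / (1.0 + np.exp(-feats @ w)), 0.01, 0.99)
ipw_bad = np.mean(t * y / ps) - np.mean((1 - t) * y / (1 - ps))

# Regression adjustment that models the non-linearity recovers the ATE.
design = np.column_stack([np.ones(n), t, x**2])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
```

Because the true propensity is symmetric in $x$, the logistic-in-$x$ fit degenerates to a near-constant score, so `ipw_bad` is essentially the biased naive contrast, several units away from the true ATE of 5, whereas `beta[1]` is close to 5.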
#### 5.1.2 Accuracy and Robustness on Real-World Observational Data
We next evaluated our framework on the canonical Lalonde dataset (lalonde1986evaluating), a challenging real-world benchmark with a known RCT ground truth. Table 2 demonstrates the comprehensive superiority of our BELM-MDCM framework. It achieved a mean ATE estimate of 1567.36 $±$ 201.62, the lowest error among all methods that correctly identified the treatment effect's positive direction. More critically, the results highlight a stark contrast in reliability. Classical methods failed entirely, while the powerful Causal Forest baseline suffered from extreme instability (Std Dev of 785.59). In contrast, BELM-MDCM exhibited remarkable robustness, with a standard deviation approximately four times lower. This outstanding performance on a canonical benchmark validates that our framework delivers accurate estimates with the consistency essential for trustworthy causal inference.
Table 2: ATE Estimation Stability on the Lalonde Dataset (RCT Benchmark ATE ≈ 1794). Results for all models are reported as Mean $±$ Standard Deviation across 5 independent runs.
| Method | ATE Estimate (Mean $±$ Std) | Absolute Error |
| --- | --- | --- |
| BELM-MDCM | 1567.36 $±$ 201.62 | 226.64 |
| Causal Forest | 1085.30 $±$ 785.59 | 708.70 |
| Linear Regression | 46.33 $±$ 76.80 | 1747.67 |
| Propensity Score Matching | -3.96 $±$ 118.37 | 1797.96 |
| Propensity Score Stratification | -35.54 $±$ 81.44 | 1829.54 |
| Propensity Score Weighting | -122.55 $±$ 50.51 | 1916.55 |
| Double Machine Learning | NaN $±$ NaN | NaN |
#### 5.1.3 High-Fidelity ITE Estimation and Stability Analysis
Objective. We evaluate performance at the individual level using a semi-synthetic version of the Lalonde dataset. This experiment leverages real-world covariates and assumes the causal structure depicted in Figure 3(c). To rigorously assess both accuracy and reliability, we follow our Individual Model Evaluation protocol, reporting the mean and standard deviation of performance across 5 independent runs for each method.
Results. The PEHE results, presented in Table 3, confirm the exceptional fidelity and robustness of our framework. BELM-MDCM achieves the lowest average PEHE score of 537.84 and demonstrates remarkable stability with the lowest standard deviation of just 60.11. This performance is closely followed by Causal Forest. However, the results also highlight the instability of other meta-learners; X-Learner, in particular, exhibits extremely high variance, with a standard deviation more than eight times larger than its closest competitor's, rendering its single-run estimates unreliable. This highlights the dual advantage of our framework: superior accuracy combined with consistent, trustworthy performance. Figure 4 provides visual confirmation, showing the tight clustering of our model's ensembled ITE estimates around the ground truth.
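For reference, PEHE as reported here is, in its standard square-root form, the root mean squared error between true and estimated individual treatment effects; a minimal sketch:

```python
import numpy as np

def pehe(tau_true, tau_hat):
    """Precision in Estimation of Heterogeneous Effects (PEHE), in its
    commonly reported root-mean-squared-error form."""
    tau_true = np.asarray(tau_true, dtype=float)
    tau_hat = np.asarray(tau_hat, dtype=float)
    return float(np.sqrt(np.mean((tau_hat - tau_true) ** 2)))
```

The mean $±$ standard deviation entries in Table 3 are then simply the mean and standard deviation of five such scores, one per independent run.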
<details>
<summary>x6.png Details</summary>

### Visual Description
## Scatter Plot: Accuracy of Individual Treatment Effect (ITE) Estimation
### Overview
The image is a scatter plot evaluating the performance of an ensemble model (BELM-MDCM Ensemble) in estimating Individual Treatment Effects (ITE). It compares the model's estimated ITE values against the known true ITE values for a set of individual samples. The plot includes a reference line representing perfect prediction.
### Components/Axes
* **Chart Title:** "Accuracy of Individual Treatment Effect (ITE) Estimation" (centered at the top).
* **X-Axis:**
* **Label:** "True ITE"
* **Scale:** Linear, ranging from approximately -500 to 4500.
* **Major Tick Marks:** 0, 1000, 2000, 3000, 4000.
* **Y-Axis:**
* **Label:** "Estimated ITE (Ensemble)"
* **Scale:** Linear, ranging from approximately -500 to 4500.
* **Major Tick Marks:** 0, 1000, 2000, 3000, 4000.
* **Legend:** Located in the top-left corner of the plot area.
* **Item 1:** A blue dot labeled "Individual Samples (BELM-MDCM Ensemble)".
* **Item 2:** A red dashed line labeled "Perfect Match (y=x)".
* **Grid:** A light gray grid is present, aligned with the major tick marks on both axes.
### Detailed Analysis
* **Data Series (Individual Samples):** The plot contains several hundred light blue, semi-transparent circular data points. Each point represents a single sample, plotting its True ITE (x-coordinate) against its Estimated ITE from the ensemble model (y-coordinate).
* **Reference Line (Perfect Match):** A red dashed diagonal line runs from the bottom-left to the top-right of the plot. This line represents the ideal scenario where the estimated value perfectly equals the true value (y = x).
* **Data Distribution & Trend:**
* **Visual Trend:** The cloud of blue data points shows a strong, positive linear trend. The points are generally clustered along the red "Perfect Match" line.
* **Spread:** The points are not perfectly on the line, indicating estimation error. The spread (vertical distance from the line) appears relatively consistent across the range of True ITE values, though there may be slightly more dispersion at the higher end (True ITE > 3000).
* **Range:** The data spans a wide range of ITE values. True ITE values extend from just below 0 to approximately 4500. Estimated ITE values show a similar range.
* **Density:** The highest density of points appears in the central region, roughly between True ITE values of 1500 and 3000.
### Key Observations
1. **Strong Positive Correlation:** There is a clear and strong positive correlation between the True ITE and the Estimated ITE. This indicates the ensemble model's estimates are generally well-calibrated and move in the correct direction relative to the true values.
2. **Model Bias:** The data points are distributed fairly evenly on both sides of the perfect match line across the entire range. There is no obvious systematic over-estimation (points consistently above the line) or under-estimation (points consistently below the line).
3. **Estimation Variance:** While the correlation is strong, there is visible variance. For any given True ITE value, the corresponding estimates show a range of values. For example, at a True ITE of ~2000, estimates range from approximately 1500 to 2500.
4. **Outliers:** A few points deviate more significantly from the trend line. For instance, there are points near True ITE ≈ 500 with Estimated ITE near 1500, and points near True ITE ≈ 3000 with Estimated ITE near 2000. These represent samples where the model's estimation was less accurate.
### Interpretation
This scatter plot serves as a visual validation metric for the BELM-MDCM Ensemble model's performance on an ITE estimation task. The tight clustering of points around the y=x line demonstrates that the model is effective at predicting individual-level treatment effects. The lack of systematic bias suggests the model is well-calibrated across the spectrum of treatment effect magnitudes.
The presence of variance and outliers is expected in any predictive model and indicates the inherent difficulty of the estimation problem or potential noise in the data for specific samples. The plot does not reveal any catastrophic failures (e.g., points clustered along the x- or y-axis), which would indicate a complete model breakdown.
**In summary, the image provides strong visual evidence that the ensemble model produces accurate and reliable ITE estimates, with errors that appear random rather than systematic.** To fully quantify the performance, one would need to supplement this visual analysis with numerical metrics like Mean Squared Error (MSE) or R-squared, which are not provided in the image.
</details>
Figure 4: Accuracy of Individual Treatment Effect (ITE) Estimation on the semi-synthetic Lalonde dataset. The plot shows the ensembled estimated ITE from our model versus the true ITE. The tight clustering of our model's estimates (blue dots) around the perfect-match line (red dash) visually demonstrates its low PEHE score.
Table 3: ITE Estimation Accuracy (PEHE) on the Semi-Synthetic Lalonde Dataset. Results are reported as Mean $±$ Standard Deviation across 5 independent runs. Lower is better.
| Method | PEHE (Mean $±$ Std) |
| --- | --- |
| BELM-MDCM | 537.84 $±$ 60.11 |
| Causal Forest | 563.90 $±$ 73.66 |
| S-Learner | 816.26 $±$ 79.17 |
| X-Learner | 1546.38 $±$ 679.09 |
Table 4: Causal Mechanism Fidelity (CMI-Score) on the Semi-Synthetic Lalonde Dataset. Results are reported as Mean $±$ Standard Deviation across 5 runs. Higher is better.
| Method | CMI-Score (Mean $±$ Std) |
| --- | --- |
| S-Learner | 0.9905 $±$ 0.0062 |
| BELM-MDCM | 0.9824 $±$ 0.0092 |
| Causal Forest | 0.9786 $±$ 0.0099 |
| X-Learner | 0.9782 $±$ 0.0145 |
| T-Learner | 0.9555 $±$ 0.0113 |
### 5.2 Act II: Uncovering the Accuracy-Invertibility Trade-off
We now conduct the pivotal experiment of our study: a deep diagnostic analysis using our novel CIC-Score to reveal the trade-off between predictive accuracy and mechanism invertibility. This provides the core empirical evidence for our thesis by comparing three paradigms: our BELM-MDCM (Learned Invertibility), a DDIM variant (Flawed Invertibility), and a classic RF-ANM (Assumed Invertibility).
The results in Table 5 decisively validate our framework's principles. Our BELM-MDCM is the clear leader, achieving the lowest PEHE score (1071.95) with high stability. Critically, its CIC-Score of 0.3679 is orders of magnitude higher than the alternatives, proving its unique ability to learn an invertible mapping that conserves causal information. In stark contrast, the DDIM-MDCM model exemplifies the failure predicted by our theory: its near-zero CIC-Score confirms a near-total collapse of causal information due to SRE, leading to unreliable predictions (high PEHE and variance). The classical RF-ANM, while structurally invertible, lacks the capacity to learn the true mechanism, resulting in a zero CIC-Score and poor accuracy. This "Golden Table" experiment underscores that both structural integrity and powerful modeling capacity are essential for high-fidelity causal inference.
Table 5: The "Ultimate Golden Table": A comparative analysis of model classes on predictive accuracy (PEHE) and structural integrity (CIC-Score). This table includes the NF-SCM baseline, which empirically validates the likelihood-fidelity dilemma. Results are reported as Mean $±$ Standard Deviation across 5 runs. Lower PEHE is better; higher CIC-Score is better.
| Model | PEHE (Mean $±$ Std) | CIC-Score (Mean $±$ Std) |
| --- | --- | --- |
| BELM-MDCM | 1071.95 $±$ 152.11 | 0.3679 $±$ 0.0000 |
| RF-ANM | 1533.18 $±$ 134.24 | 0.0000 $±$ 0.0000 |
| DDIM-MDCM | 2085.98 $±$ 788.12 | 0.0065 $±$ 0.0130 |
| NF-SCM | 442229.96 $±$ 66963.73 | 0.1572 $±$ 0.0232 |
<details>
<summary>x7.png Details</summary>

### Visual Description
## Training Loss Curve: Negative Log-Likelihood Loss vs. Epoch
### Overview
The image displays a line chart titled "Training Loss Curve," plotting the Negative Log-Likelihood Loss of a machine learning model against the number of training epochs. The chart shows a classic convergence pattern, with a rapid initial decrease in loss followed by a gradual stabilization.
### Components/Axes
* **Chart Title:** "Training Loss Curve" (centered at the top).
* **X-Axis:**
* **Label:** "Epoch"
* **Scale:** Linear, ranging from 0 to 200.
* **Major Tick Marks:** At intervals of 25 (0, 25, 50, 75, 100, 125, 150, 175, 200).
* **Y-Axis:**
* **Label:** "Negative Log-Likelihood Loss"
* **Scale:** Linear, ranging from approximately -2 to 6.
* **Major Tick Marks:** At intervals of 2 (-2, 0, 2, 4, 6).
* **Data Series:** A single, solid blue line representing the loss value at each epoch.
* **Grid:** A light gray grid is present, with lines aligned to the major tick marks on both axes.
* **Legend:** None present (single data series).
### Detailed Analysis
The data series exhibits two distinct phases:
1. **Phase 1 - Rapid Descent (Epochs 0 to ~10):**
* **Trend:** The line slopes steeply downward.
* **Data Points (Approximate):**
* Epoch 0: Loss ≈ 6.0
* Epoch 5: Loss ≈ 0.0
* Epoch 10: Loss ≈ -1.0
2. **Phase 2 - Gradual Convergence & Plateau (Epochs ~10 to 200):**
* **Trend:** The line continues to slope downward but at a much shallower angle, eventually flattening into a noisy plateau.
* **Data Points (Approximate):**
* Epoch 25: Loss ≈ -1.8
* Epoch 50: Loss ≈ -2.0
* Epoch 100: Loss ≈ -2.2
* Epoch 150: Loss ≈ -2.4
* Epoch 200: Loss ≈ -2.5
* **Noise:** From approximately epoch 25 onward, the line exhibits consistent, small-scale fluctuations (noise) around the general downward trend. The amplitude of this noise appears relatively constant.
### Key Observations
* **Convergence:** The model's loss converges to a stable value, indicating successful training.
* **Learning Rate:** The extremely steep initial drop suggests a high initial learning rate or a model that quickly learns the most salient features of the data.
* **Plateau Value:** The loss stabilizes at a negative value (≈ -2.5). This is mathematically valid for negative log-likelihood over continuous variables: probability *densities* can exceed 1, making the log-likelihood positive and the NLL negative.
* **Noise:** The persistent noise in the later epochs is typical of stochastic gradient descent (SGD) or its variants, reflecting updates from mini-batches of data.
### Interpretation
This curve demonstrates a healthy and typical training progression for a probabilistic model (e.g., a classifier using cross-entropy loss). The data suggests:
1. **Effective Learning:** The model rapidly absorbed the primary patterns in the training data within the first 10-20 epochs.
2. **Fine-Tuning:** The subsequent 180 epochs were spent on fine-tuning, where the model made smaller adjustments to its parameters, leading to a slower but continued improvement in fit.
3. **Stability:** The plateau indicates the model has likely reached a local minimum in the loss landscape for the given hyperparameters (learning rate, optimizer). Further training beyond 200 epochs is unlikely to yield significant improvement.
4. **Potential for Optimization:** The presence of noise suggests the learning rate might be slightly high for the later stages of training. A learning rate scheduler that reduces the rate after the initial drop could potentially lead to a smoother convergence to a slightly lower loss value.
**In summary, the chart provides clear visual evidence of a model that has successfully learned from its training data, transitioning from a phase of rapid acquisition of knowledge to one of refinement and stabilization.**
</details>
Figure 5: The training loss curve for the Conditional Normalizing Flow (NF) baseline. The smooth, stable convergence to a low negative log-likelihood value indicates a successful statistical training run. However, this did not correspond to learning the true causal mechanism, as evidenced by its extremely high PEHE score.
The Likelihood-Fidelity Dilemma: Why Natively Invertible Models Can Fail.
To rigorously test the limits of models that are natively invertible, we conducted a comprehensive stability analysis on a Conditional Normalizing Flow (NF) baseline, a model class that satisfies the Causal Information Conservation principle by construction (SRE $= 0$). Across five independent runs with different random seeds, the NF model consistently demonstrated successful statistical learning, with its training loss (the negative log-likelihood) stably converging to a low value, i.e., a high log-likelihood, in each instance (a representative example is shown in Figure 5).
However, this statistical success was starkly contrasted by a systematic and catastrophic failure in the causal task. The model yielded an average PEHE score of 442,229.96 $±$ 66,963.73, confirming that its generated counterfactuals were fundamentally incorrect. This consistent result provides decisive evidence for a critical challenge we term the likelihood-fidelity dilemma: a model can perfectly learn to replicate a data distribution while remaining completely ignorant of the underlying causal mechanism.
The root of this dilemma is the fundamental mismatch between the optimization objective and the causal goal. The maximum likelihood objective incentivizes the NF to find any invertible mapping that transforms the data to a simple base distribution. While an infinite number of such mappings may be statistically equivalent, only one corresponds to the true, unique causal data-generating process. Without a guiding signal, the NF is mathematically predisposed to learn a causally-incorrect "statistical shortcut." This finding powerfully underscores the contribution of our Hybrid Training Objective. It acts as the crucial causal inductive bias that resolves this dilemma, compelling the model to learn the unique, causally salient structure and enabling valid causal inference where pure likelihood-based methods, even those with zero SRE, are destined to fail.
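The dilemma admits a two-line demonstration. If the true SCM is $X := U$ with $U \sim \mathcal{N}(0,1)$, the maps $u = x$ and $u = -x$ are both invertible, both reach the same base distribution, and assign identical likelihood to every sample, yet they abduct opposite noise and therefore answer noise-transplant queries in opposite ways (the example below is our own illustration, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=5000)        # data from the true SCM:  X := U,  U ~ N(0, 1)

def log_density(z):              # standard-normal log-density (the base measure)
    return -0.5 * z**2 - 0.5 * np.log(2.0 * np.pi)

# Two invertible maps to the same base distribution (|det J| = 1 for both).
abduct_true = lambda v: v        # the true mechanism's abduction
abduct_flip = lambda v: -v       # a statistically equivalent "shortcut"

ll_true = log_density(abduct_true(x)).mean()
ll_flip = log_density(abduct_flip(x)).mean()   # identical average likelihood

# Yet a noise-transplant query do(U := u_star) gets opposite answers,
# because the two flows disagree about which u produced each x.
u_star = 2.0
cf_true = u_star                 # generative direction of the true map: x = u
cf_flip = -u_star                # generative direction of the flip map: x = -u
```

Likelihood alone cannot separate the two maps; only an additional causal inductive bias, the role our Hybrid Training Objective plays, can select the causally correct one.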
### 5.3 Act III: Unlocking Deeper Causal Inquiry with a High-Fidelity Model
An information-conserving model serves as a trustworthy "world model" for deep causal inquiry. We showcase three applications uniquely enabled by our framework's high-fidelity counterfactuals.
Heterogeneity Analysis: Conditional ATE (CATE).
A reliable ITE model can act as a "causal microscope." We use it to explore treatment effect heterogeneity by estimating the Conditional Average Treatment Effect (CATE) for subpopulations. By averaging the results from our five independently trained models, we obtain stable and robust estimates. Our model's mean CATE estimates track the true CATE trends with high fidelity across different education levels, a capability crucial for policy-making. For instance, for individuals with an education level of 3.0, the estimated CATE was \$2562.79 (true CATE: \$2280.79). For levels 8.0, 12.0, and 16.0, the estimates were \$2092.91 (true: \$2118.61), \$2253.76 (true: \$2384.44), and \$2434.89 (true: \$2490.21), respectively, demonstrating a close correspondence to the ground truth.
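Given per-individual ITE estimates, the subgroup CATEs quoted above reduce to conditional averages over levels of a covariate; a minimal helper (names and values illustrative):

```python
import numpy as np

def cate_by_group(ite, group):
    """Subgroup CATE: average the individual effects within each level
    of a discrete covariate (e.g., education)."""
    ite = np.asarray(ite, dtype=float)
    group = np.asarray(group)
    return {int(g): float(ite[group == g].mean()) for g in np.unique(group)}

# Toy check with known subgroup means (two education levels).
ite = [100.0, 300.0, 50.0, 150.0]
educ = [8, 8, 12, 12]
```

Averaging the per-run CATE dictionaries across the five trained models then yields the ensembled estimates reported in the text.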
<details>
<summary>x8.png Details</summary>

### Visual Description
## Bar Chart: Causal Attribution: The Impact of Exogenous "Luck"
### Overview
This is a vertical bar chart comparing two financial outcomes for a single subject ("Victim (Index 201)") under different conditions. The chart visually demonstrates the significant positive impact of applying an exogenous factor labeled "Top Responder's 'Luck'" to the victim's actual outcome. A reference line shows the actual outcome of a separate "Top Responder" for comparison.
### Components/Axes
* **Chart Title:** "Causal Attribution: The Impact of Exogenous 'Luck'"
* **Y-Axis:**
* **Label:** "Outcome (¥)" (the "¥" symbol indicates the currency is likely Japanese Yen or Chinese Yuan).
* **Scale:** Linear scale from 0 to over 30,000.
* **Major Tick Marks:** 0, 5000, 10000, 15000, 20000, 25000, 30000.
* **X-Axis (Categories):**
1. **Left Bar (Red):** "Victim (Index 201) Actual Outcome"
2. **Right Bar (Green):** "Victim (Index 201) with Top Responder's 'Luck'"
* **Legend (Top-Left Corner):**
* A blue dashed line icon labeled: "Top Responder (Index 281) Actual Outcome ($4,296.47)"
* *Note: The legend uses a dollar sign ($), while the y-axis uses a yen/yuan sign (¥). This is a notable inconsistency in the source data.*
* **Data Labels (On top of bars):**
* Red Bar: **$10,993.98**
* Green Bar: **$33,328.30**
* **Reference Line:** A horizontal blue dashed line runs across the chart at the y-value corresponding to $4,296.47, visually anchored to the legend description.
### Detailed Analysis
1. **Data Series & Values:**
* **Victim's Actual Outcome (Red Bar):** The outcome is **$10,993.98**. The bar extends from the x-axis to just above the 10,000 mark on the y-axis.
* **Victim's Outcome with Luck (Green Bar):** The outcome is **$33,328.30**. The bar extends from the x-axis to a point between the 30,000 and 35,000 (implied) marks on the y-axis.
* **Top Responder's Actual Outcome (Blue Dashed Line):** The outcome is **$4,296.47**. This line is positioned below the 5,000 mark on the y-axis.
2. **Trend & Comparison:**
* The visual trend is a dramatic increase. The green bar ("with Luck") is approximately **3.03 times taller** than the red bar ("Actual Outcome").
* Both victim outcomes are substantially higher than the Top Responder's actual outcome. The victim's actual outcome is ~2.55 times the Top Responder's, and the outcome with luck is ~7.76 times the Top Responder's.
### Key Observations
* **Magnitude of Impact:** The application of "Top Responder's 'Luck'" results in an absolute increase of **$22,334.32** ($33,328.30 - $10,993.98) for the victim.
* **Currency Symbol Discrepancy:** The y-axis is labeled in ¥ (Yen/Yuan), but all data labels and the legend use $ (Dollar). This suggests a potential error in chart creation or that the values are being presented in a different currency than the axis label implies.
* **Relative Performance:** The "Top Responder" (Index 281) has a lower actual outcome than the "Victim" (Index 201) in both scenarios presented. The chart's narrative focuses on the victim's potential gain from the responder's "luck," not the responder's own performance.
### Interpretation
The chart is designed to argue for a strong causal effect of an external, luck-based factor. It visually separates the victim's inherent outcome (red bar) from the outcome achievable when augmented by an external positive shock (green bar). The large disparity between the bars is the central message: "luck" is portrayed as a transformative element.
The inclusion of the Top Responder's much lower actual outcome (blue line) serves a dual purpose:
1. It establishes a baseline for the "luck" source, showing that the responder's own typical result is modest.
2. It highlights that the victim, even without the luck factor, outperforms the responder. This frames the "luck" as an additive bonus on top of an already superior base performance, rather than a corrective equalizer.
The currency symbol inconsistency is a critical flaw that undermines the data's credibility. A technical reader must question whether the values are directly comparable or if a conversion is missing. Assuming the numerical values are correct in their own context, the chart effectively communicates a narrative of substantial, externally-driven gain.
</details>
Figure 6: Causal Attribution Analysis. This chart shows the counterfactual outcome for the "Victim" if they had possessed the individual-specific exogenous factors of the "Top Responder".
Causal Attribution: Isolating the Effect of Exogenous Factors.
We conducted a causal attribution experiment via a counterfactual intervention of the form $do(U_{\text{victim}} := u_{\text{responder}})$. In this context, "luck" serves as intuitive shorthand for the exogenous noise variable $U$: within the Structural Causal Model (SCM) framework, $U$ represents all unobserved, individual-specific factors (e.g., intrinsic ability, random chance, measurement errors) that, together with the observed parent variables ($Pa$), determine the final outcome for an individual. Figure 6 shows that our framework can losslessly recover these factors, revealing that this unobserved exogenous "luck" had a massive and stable causal effect, averaging a +22,334.32 change in the "Victim's" outcome. This capacity for reliable attribution is a unique advantage of our information-conserving framework.
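A noise-transplant intervention of this form can be sketched in the additive-noise special case; the mechanism `g` and every number below are hypothetical, chosen only to show the mechanics of $do(U_{\text{victim}} := u_{\text{responder}})$, not to reproduce Figure 6:

```python
def g(pa):
    """Hypothetical deterministic part of an additive-noise outcome
    mechanism x = g(pa) + u (NOT the paper's diffusion mechanism)."""
    return 2000.0 * pa

def abduct(x, pa):
    """Pearl's abduction step; exact here because g is known and additive."""
    return x - g(pa)

# Illustrative individuals (all values invented for the sketch).
pa_victim, x_victim = 3.0, 7000.0          # victim's parents and outcome
pa_responder, x_responder = 1.0, 4500.0    # responder's parents and outcome

# Step 1 (abduction): recover the responder's individual-specific noise.
u_responder = abduct(x_responder, pa_responder)

# Steps 2-3 (action + prediction): keep the victim's observed parents
# but transplant the responder's "luck".
x_counterfactual = g(pa_victim) + u_responder
attribution = x_counterfactual - x_victim
```

Because abduction is exact, the attributed change isolates precisely the contribution of the swapped exogenous factor; with a lossy abduction step, this difference would conflate the causal effect with reconstruction error.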
<details>
<summary>x9.png Details</summary>

### Visual Description
## Density Plot: Counterfactual Fairness Audit for Attribute: "black"
### Overview
This image is a statistical density plot visualizing a counterfactual fairness audit. It compares the distribution of an outcome variable (Y) for a specific group (where the attribute "black" equals 1) against a simulated counterfactual distribution (what the outcomes would be if the attribute "black" were 0). The chart aims to quantify a potential fairness gap.
### Components/Axes
* **Chart Title:** "Counterfactual Fairness Audit for Attribute: 'black'"
* **X-Axis:** Labeled "Outcome (Y)". The scale runs from 0 to 100,000, with major tick marks at 0, 20000, 40000, 60000, 80000, and 100000.
* **Y-Axis:** Labeled "Density". The scale is in scientific notation, ranging from 0 to 7e-5 (0.00007), with major tick marks at 0, 1, 2, 3, 4, 5, 6, and 7 (all multiplied by 1e-5).
* **Legend (Top-Right Corner):**
* **Red Shaded Area:** "Actual Outcome Distribution (black=1)"
* **Red Dashed Vertical Line:** "Mean Actual: $8,829.19"
* **Blue Shaded Area:** "Counterfactual Outcome Distribution (if black=0)"
* **Blue Dashed Vertical Line:** "Mean Counterfactual: $7,151.03"
* **Annotation Box (Top-Left Corner):**
* "Audited Group: black=1 (N=371)"
* "Average Fairness Gap: $-1,678.16"
### Detailed Analysis
* **Data Series & Trends:**
* **Actual Distribution (Red):** This density curve is right-skewed, with its peak (mode) occurring at a low outcome value, approximately between 0 and 5,000. The curve has a long tail extending towards higher outcome values up to 100,000. The mean is marked by the red dashed line at **$8,829.19**.
* **Counterfactual Distribution (Blue):** This curve is also right-skewed and peaks at a similar low outcome range as the red curve. Visually, its peak appears slightly higher and narrower than the red curve's peak. Its mean is marked by the blue dashed line at **$7,151.03**.
* **Comparison of Means:** The red dashed line (Actual Mean) is positioned to the right of the blue dashed line (Counterfactual Mean) on the x-axis. This visually confirms that the average outcome for the actual group (black=1) is higher than the average outcome in the counterfactual scenario (if black=0).
* **Fairness Gap:** The annotation box calculates the "Average Fairness Gap" as **-$1,678.16**. This value is derived from the difference: Mean Counterfactual ($7,151.03) - Mean Actual ($8,829.19) = -$1,678.16. The negative sign indicates the actual mean is higher than the counterfactual mean.
* **Sample Size:** The analysis is based on a group of **N=371** individuals where the attribute "black" is 1.
### Key Observations
1. **Skewed Distributions:** Both the actual and counterfactual outcome distributions are heavily right-skewed. This indicates that for this group, most observed outcomes are clustered at the lower end of the scale (below ~$20,000), with a smaller number of individuals achieving much higher outcomes.
2. **Shift in Distribution:** While both distributions have a similar shape, the actual distribution (red) appears to have slightly more density in the mid-to-high outcome range (roughly $10,000 to $40,000) compared to the counterfactual distribution (blue). This shift contributes to the higher actual mean.
3. **Overlap:** There is substantial overlap between the two density curves, especially in the lower outcome range. This suggests that for many individuals in this group, the counterfactual change (setting black=0) would not dramatically alter their predicted outcome.
### Interpretation
This chart presents the results of a counterfactual fairness analysis. The core question it addresses is: "For individuals in the group where the attribute 'black' is 1, what would their outcomes look like if that attribute were instead 0?"
The data suggests that, on average, the **actual outcomes for the "black=1" group are higher** than the outcomes predicted in the counterfactual "black=0" scenario. The average fairness gap of -$1,678.16 quantifies this difference.
**Important Contextual Note:** The interpretation of "fairness" here is technical and depends entirely on the model's purpose. A negative gap (actual > counterfactual) could be interpreted in multiple ways:
* If the outcome Y is something desirable (e.g., loan amount, salary), this result might suggest the model is *not* exhibiting a traditional disparate impact against the "black=1" group in this specific audit, as their actual outcomes are higher than the counterfactual.
* However, without knowing the broader context (such as whether the "black=0" group's actual outcomes are even higher, or what the true ground-truth values should be), this single chart cannot determine if the overall system is fair. It only shows the modeled counterfactual effect for one subgroup.
The significant skew in both distributions is a critical finding. It implies that the outcome variable (Y) is not normally distributed, and any analysis or policy based on this data must account for this inequality, where a small proportion of cases account for a large proportion of the total outcome value.
</details>
(a) Fairness audit for attribute 'black'.
<details>
<summary>x10.png Details</summary>

### Visual Description
## Counterfactual Fairness Audit Density Plot
### Overview
This image is a density plot visualizing a counterfactual fairness audit for the attribute "hisp" (likely an abbreviation for Hispanic). It compares the distribution of an outcome variable (Y) for an actual group (hisp=1) against a counterfactual distribution (if hisp=0). The chart includes summary statistics and a legend to differentiate the two distributions and their respective means.
### Components/Axes
* **Chart Title:** "Counterfactual Fairness Audit for Attribute: 'hisp'"
* **X-Axis:** Labeled "Outcome (Y)". The scale runs from approximately -20,000 to 60,000, with major tick marks at intervals of 10,000.
* **Y-Axis:** Labeled "Density". The scale runs from 0 to 5e-5 (0.00005), with major tick marks at intervals of 1e-5.
* **Legend (Top-Right Corner):**
* Red filled area: "Actual Outcome Distribution (hisp=1)"
* Red dashed vertical line: "Mean Actual: $5,145.18"
* Blue filled area: "Counterfactual Outcome Distribution (if hisp=0)"
* Blue dashed vertical line: "Mean Counterfactual: $7,007.30"
* **Annotation Box (Top-Left Corner):**
* "Audited Group: hisp=1 (N=39)"
* "Average Fairness Gap: $1,862.12"
### Detailed Analysis
* **Actual Outcome Distribution (hisp=1 - Red):**
* **Trend:** The distribution is right-skewed, with a sharp peak near an outcome value of 0 and a long tail extending towards higher positive values.
* **Key Points:** The density peaks at approximately 5.2e-5 on the y-axis. The mean is marked by a red dashed vertical line at $5,145.18 on the x-axis.
* **Counterfactual Outcome Distribution (if hisp=0 - Blue):**
* **Trend:** This distribution is also right-skewed but appears slightly shifted to the right compared to the red distribution. Its peak is slightly lower and broader.
* **Key Points:** The density peaks at approximately 5.0e-5 on the y-axis. The mean is marked by a blue dashed vertical line at $7,007.30 on the x-axis.
* **Comparison & Gap:**
* The two distributions overlap significantly, especially in the range from approximately -10,000 to 20,000.
* The blue (counterfactual) distribution has a visibly higher density for outcome values above approximately 5,000.
* The "Average Fairness Gap" is explicitly stated as $1,862.12, which is the difference between the two means ($7,007.30 - $5,145.18).
### Key Observations
1. **Right Skew:** Both outcome distributions are heavily right-skewed, indicating that most observations cluster around lower values, with a few high-value outliers pulling the mean to the right.
2. **Mean Displacement:** The counterfactual mean (if hisp=0) is higher than the actual mean (hisp=1) by $1,862.12. This is the central quantitative finding of the audit.
3. **Distribution Shape Difference:** While both are skewed, the counterfactual (blue) distribution appears to have a slightly heavier tail on the right side, suggesting a higher probability of very large outcomes in the counterfactual scenario.
4. **Small Sample Size:** The annotation notes the audited group (hisp=1) has a sample size of N=39, which is relatively small and should be considered when interpreting the robustness of the distributions.
### Interpretation
This chart demonstrates a **counterfactual fairness analysis**. It attempts to answer: "What would the outcome distribution look like for the same group of individuals if the sensitive attribute 'hisp' were changed from 1 to 0?"
* **What the Data Suggests:** The analysis suggests a potential disparity. For the same set of individuals (N=39), the model or system being audited produces a lower average outcome ($5,145.18) when the attribute is "hisp=1" compared to the hypothetical scenario where it is "hisp=0" ($7,007.30). The $1,862.12 gap quantifies this average disparity.
* **Relationship Between Elements:** The overlapping density plots show that the disparity is not uniform across all individuals; for many, the outcomes might be similar regardless of the attribute. However, the shift in the mean and the heavier right tail of the counterfactual distribution indicate that, on average and particularly for higher outcomes, the "hisp=0" scenario is associated with better results.
* **Notable Implications:** This type of audit is used to detect algorithmic bias. The finding implies that the attribute "hisp" is correlated with a lower outcome in the model's predictions or decisions, holding other factors constant (as implied by the counterfactual). The small sample size (N=39) is a limitation, suggesting this result may be specific to this subgroup and requires validation with larger data. The right-skewed nature of the outcomes (which could represent income, loan amounts, etc.) means the fairness gap has a more significant absolute impact on the higher end of the scale.
</details>
(b) Fairness audit for attribute 'hisp'.
Figure 7: Counterfactual fairness audits reveal significant outcome disparities based on sensitive attributes. The plots show the distribution of actual outcomes for each group versus the distribution of their counterfactual outcomes had their sensitive attribute been different.
Counterfactual Fairness Audit.
Finally, we applied our framework to a counterfactual fairness audit. Only a model that faithfully represents the data-generating process can reliably answer questions about fairness. Figure 7 reveals significant and stable disparities: our model estimates an average fairness gap of -$1,678.16 for the 'black' attribute and +$1,862.12 for the 'hisp' attribute, demonstrating its capacity as a powerful tool for ethical audits.
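The audit statistic itself is straightforward to reproduce. A minimal sketch, where `y_actual` and `y_counterfactual` are hypothetical stand-ins for the observed outcomes and the model's generated counterfactual outcomes for the audited group:

```python
import numpy as np

def average_fairness_gap(y_actual, y_counterfactual):
    # Gap = mean counterfactual outcome - mean actual outcome for the
    # audited group; a negative value means the group's actual outcomes
    # are, on average, higher than in the counterfactual scenario.
    y_actual = np.asarray(y_actual, dtype=float)
    y_counterfactual = np.asarray(y_counterfactual, dtype=float)
    return float(y_counterfactual.mean() - y_actual.mean())
```

Applied to the 'black' audit above, the gap is negative because the counterfactual mean ($7,151.03) falls below the actual mean ($8,829.19).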
### 5.4 Act IV: Final Validation: Stress Tests and Ablation Study
We conclude by subjecting the framework to two final tests: a stress test on a non-invertible SCM and a comprehensive ablation study.
#### 5.4.1 Stress Test on a Non-Invertible SCM
We tested our framework's robustness when the theoretical assumption of an invertible SCM is violated, using a DGP where $Y \leftarrow U_Y^2$. The results in Table 6 and Figure 8 decisively validate our hypothesis. On the definitive metric of individual-level fidelity (PEHE), our zero-SRE BELM framework achieves an error of 0.77, a 44% reduction compared to the SRE-prone DDIM model. This empirically demonstrates that even when the true SCM is non-invertible, eliminating algorithmic SRE provides a substantial advantage. This result is fully consistent with our theoretical analysis in Appendix C (Theorem 21), where the total error is decomposed into algorithmic, modeling, and representational errors. By eliminating the algorithmic error ($E_{\mathrm{SRE}} \equiv 0$), our framework's performance approaches the theoretical limit set by the other two components.
Interestingly, the DDIM sampler achieves a slightly higher KMD-Score. We hypothesize that its inherent inversion noise acts as a form of implicit regularization, making the marginal generated distribution appear closer to the truth, even while individual-level counterfactuals are less accurate. This highlights the important distinction between distributional fidelity and individual-level causal accuracy.
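The distinction between the two levels of accuracy is easy to see in code: individual errors that cancel in the mean leave group-level metrics untouched while inflating PEHE. A minimal sketch of the two metric definitions, with hypothetical per-individual effects `tau_true` and `tau_hat`:

```python
import numpy as np

def ate_error(tau_true, tau_hat):
    # Group-level accuracy: absolute error of the Average Treatment Effect.
    return float(abs(np.mean(tau_hat) - np.mean(tau_true)))

def pehe(tau_true, tau_hat):
    # Individual-level fidelity: root-mean-squared error of the
    # per-individual treatment effects.
    diff = np.asarray(tau_hat, dtype=float) - np.asarray(tau_true, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Two wrong individual effects that cancel at the group level:
# the ATE error is 0 even though every individual estimate is off by 2.
print(ate_error([1.0, -1.0], [-1.0, 1.0]))  # 0.0
print(pehe([1.0, -1.0], [-1.0, 1.0]))       # 2.0
```

This is why a sampler can look acceptable on aggregate or distributional scores while failing the individual-level counterfactual test.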
<details>
<summary>x11.png Details</summary>

### Visual Description
## Bar Chart Comparison: Robustness to Many-to-One SCM: BELM vs. DDIM
### Overview
The image displays a set of four bar charts comparing the performance of two methods, **DDIM** and **BELM**, across four different evaluation metrics. The overall title is "Robustness to Many-to-One SCM: BELM vs. DDIM". Each chart is a separate panel with its own title, y-axis scale, and two bars representing the scores for DDIM (dark blue) and BELM (green). Error bars are present on all bars, indicating variability or confidence intervals.
### Components/Axes
* **Main Title:** "Robustness to Many-to-One SCM: BELM vs. DDIM"
* **Layout:** Four charts arranged in a 2x2 grid.
* **Common Elements:**
* **X-axis (all charts):** Two categories: "DDIM" (left bar) and "BELM" (right bar).
* **Y-axis Label (all charts):** "Score".
* **Legend/Color Key:** Implicit from the x-axis labels. DDIM is represented by a dark blue bar, BELM by a green bar.
* **Panel-Specific Titles & Instructions:**
1. **Top-Left Panel:** "Group-Level Accuracy (ATE Error)" with the subtitle "Lower is Better".
2. **Top-Right Panel:** "Individual-Level Fidelity (PEHE)" with the subtitle "Lower is Better".
3. **Bottom-Left Panel:** "Mechanism Fidelity (CMI-Score)" with the subtitle "Higher is Better".
4. **Bottom-Right Panel:** "Distributional Fidelity (KMD-Score)" with the subtitle "Higher is Better".
### Detailed Analysis
**1. Group-Level Accuracy (ATE Error) - Top-Left**
* **Trend:** The DDIM bar is taller than the BELM bar. Since "Lower is Better", this indicates BELM has a better (lower) score.
* **Data Points:**
* DDIM: Score = **0.973**. Error bar extends approximately from 0.88 to 1.06.
* BELM: Score = **0.740**. Error bar extends approximately from 0.64 to 0.84.
**2. Individual-Level Fidelity (PEHE) - Top-Right**
* **Trend:** The DDIM bar is significantly taller than the BELM bar. Since "Lower is Better", BELM performs substantially better.
* **Data Points:**
* DDIM: Score = **1.376**. Error bar extends approximately from 1.25 to 1.50.
* BELM: Score = **0.766**. Error bar extends approximately from 0.68 to 0.85.
**3. Mechanism Fidelity (CMI-Score) - Bottom-Left**
* **Trend:** The bars are nearly equal in height, with BELM being marginally taller. Since "Higher is Better", BELM has a very slight advantage.
* **Data Points:**
* DDIM: Score = **0.980**. Error bar is very small, approximately ±0.01.
* BELM: Score = **0.994**. Error bar is very small, approximately ±0.01.
**4. Distributional Fidelity (KMD-Score) - Bottom-Right**
* **Trend:** The DDIM bar is taller than the BELM bar. Since "Higher is Better", DDIM performs better on this metric.
* **Data Points:**
* DDIM: Score = **0.907**. Error bar extends approximately from 0.88 to 0.93.
* BELM: Score = **0.830**. Error bar extends approximately from 0.81 to 0.85.
### Key Observations
1. **Performance Dichotomy:** BELM outperforms DDIM on three of the four metrics (ATE Error, PEHE, CMI-Score), while DDIM outperforms BELM on one (KMD-Score).
2. **Magnitude of Difference:** The most dramatic performance gap is in **Individual-Level Fidelity (PEHE)**, where BELM's score (0.766) is nearly half that of DDIM's (1.376), a significant improvement given "Lower is Better".
3. **Similar Performance:** The scores for **Mechanism Fidelity (CMI-Score)** are extremely close (0.980 vs. 0.994), with minimal error bars, suggesting both methods are highly effective and nearly equivalent on this measure.
4. **Error Bar Consistency:** Error bars are generally larger for the "Lower is Better" metrics (ATE, PEHE) and smaller for the "Higher is Better" metrics (CMI, KMD), indicating potentially more variance in the error-based measurements.
### Interpretation
This set of charts provides a multi-faceted evaluation of two methods (BELM and DDIM) in the context of "Many-to-One SCM" (likely Structural Causal Models). The data suggests that **BELM is generally more robust and accurate** for this task, particularly in minimizing errors at both the group (ATE) and individual (PEHE) levels, and in preserving the underlying causal mechanism (CMI). Its primary weakness, relative to DDIM, is in distributional fidelity (KMD), where it scores slightly lower.
The choice between methods would depend on the specific priority of the application. If minimizing prediction error (ATE, PEHE) and ensuring mechanism accuracy are paramount, BELM is the superior choice. If matching the overall data distribution (KMD) is the critical requirement, DDIM holds a slight edge. The near-parity on mechanism fidelity suggests both methods are reliable for understanding the causal structure, but BELM translates that understanding into more accurate individual and group-level outcomes.
</details>
Figure 8: Robustness comparison on the many-to-one SCM. Our zero-SRE framework (BELM) demonstrates significantly superior performance on PEHE.
Table 6: Performance on the non-invertible SCM ($Y \leftarrow U^2$). Results are averaged over 5 runs ($±$ std). Lower PEHE is better.
| Method | PEHE ($\downarrow$) | KMD-Score ($\uparrow$) |
| --- | --- | --- |
| BELM (Zero SRE) | 0.77 $±$ 0.16 | 0.830 $±$ 0.009 |
| DDIM (with SRE) | 1.38 $±$ 0.19 | 0.907 $±$ 0.023 |
#### 5.4.2 Ablation Study: Deconstructing the Framework's Success
We conducted a comprehensive ablation study on a challenging synthetic dataset (Figure 3(b)) to validate the contribution of each core component. The findings, presented in Table 7 and Figure 9, provide unequivocal evidence for our integrated design. The full BELM-MDCM model establishes the gold standard for both accuracy and stability. The study reveals three critical insights:
- Decisive Role of the Hybrid Objective: Removing it causes a catastrophic decline in performance (400%+ error increase), demonstrating it is a core driver of the causal inductive bias, not merely a fine-tuning mechanism.
- Critical Importance of Targeted Modeling: Removing it leads to a complete collapse in model stability (Std Dev explodes from 4.57 to 138.01), validating our theoretical analysis on complexity control. A judicious allocation of model complexity is paramount for reproducibility.
- Robust Advantage of Exact Invertibility: Replacing BELM with DDIM leads to a clear degradation in both accuracy and stability, confirming that SRE from approximate inversion systematically erodes the quality of the final causal estimate.
<details>
<summary>x12.png Details</summary>

### Visual Description
## Ablation Study Results (5 Runs) on Challenge Dataset
### Overview
The image displays two horizontal bar charts presenting the results of an ablation study conducted over 5 runs on a "Challenge Dataset." The study evaluates the performance of a full model ("BELM-MDCM") against three variants where specific components have been removed ("w/o" meaning "without"). The top chart assesses the accuracy of Average Treatment Effect (ATE) estimation, while the bottom chart quantifies the impact of each ablation on the mean absolute error.
### Components/Axes
**Main Title:** "Ablation Study Results (5 Runs) on Challenge Dataset"
**Top Chart: ATE Estimation Accuracy**
* **Chart Type:** Horizontal bar chart with error bars.
* **Y-axis (Categories):** Four model variants, listed from top to bottom:
1. `w/o Hybrid Objective` (Dark purple bar)
2. `w/o Targeted Modeling` (Teal/blue-green bar)
3. `w/o Exact Invertibility (DDIM)` (Medium green bar)
4. `BELM-MDCM (Full Model)` (Bright green bar)
* **X-axis:** "Mean Estimated ATE (Error bars show ±1 Std Dev)". The scale runs from approximately 110 to 290, with major ticks at 125, 150, 175, 200, 225, 250, 275.
* **Legend/Reference Line:** A vertical, dashed red line is positioned at `x = 202.29`. The legend in the top-right corner labels this as `True ATE = 202.29`.
**Bottom Chart: Impact of Ablation on Absolute Error**
* **Chart Type:** Horizontal bar chart with error bars.
* **Y-axis (Categories):** The same four model variants as the top chart, in the same order and color scheme.
* **X-axis:** "Mean Absolute Error (Lower is Better)". The scale runs from -100 to 200, with major ticks at -100, -50, 0, 50, 100, 150, 200.
### Detailed Analysis
**Top Chart - ATE Estimation Accuracy:**
* **Trend Verification:** The "Full Model" bar is closest to the "True ATE" line. Removing components generally shifts the estimated ATE further away from the true value, with the exception of the "w/o Exact Invertibility" variant, which is closer than "w/o Targeted Modeling."
* **Data Points (Approximate Mean ± Std Dev):**
* `w/o Hybrid Objective`: Mean ≈ 135, Std Dev range ≈ 115 to 155.
* `w/o Targeted Modeling`: Mean ≈ 255, Std Dev range ≈ 115 to 295 (very large spread).
* `w/o Exact Invertibility (DDIM)`: Mean ≈ 215, Std Dev range ≈ 205 to 225.
* `BELM-MDCM (Full Model)`: Mean ≈ 190, Std Dev range ≈ 180 to 200.
**Bottom Chart - Impact of Ablation on Absolute Error:**
* **Trend Verification:** The "Full Model" has the shortest bar, indicating the lowest error. All ablated variants show increased error. The "w/o Targeted Modeling" variant has an exceptionally large error bar.
* **Data Points (Approximate Mean Absolute Error ± Std Dev):**
* `w/o Hybrid Objective`: Mean ≈ 65, Std Dev range ≈ 45 to 85.
* `w/o Targeted Modeling`: Mean ≈ 50, Std Dev range ≈ -90 to 190 (extremely large spread, indicating high instability).
* `w/o Exact Invertibility (DDIM)`: Mean ≈ 15, Std Dev range ≈ 5 to 25.
* `BELM-MDCM (Full Model)`: Mean ≈ 10, Std Dev range ≈ 5 to 15.
### Key Observations
1. **Full Model Superiority:** The `BELM-MDCM (Full Model)` consistently performs best, achieving an estimated ATE closest to the true value (202.29) and the lowest mean absolute error.
2. **Critical Component:** Removing "Targeted Modeling" (`w/o Targeted Modeling`) causes the most severe degradation in performance. It results in the largest overestimation of the ATE (mean ~255) and exhibits extremely high variance (very long error bars in both charts), suggesting this component is crucial for both accuracy and stability.
3. **Stability vs. Accuracy:** The `w/o Exact Invertibility (DDIM)` variant shows relatively good accuracy (mean ATE ~215) and low variance, but its absolute error is still higher than the full model. This suggests the DDIM component contributes to fine-tuning accuracy.
4. **Hybrid Objective Role:** Removing the "Hybrid Objective" leads to a significant underestimation of the ATE (mean ~135) and a moderate increase in error, indicating its importance for correct central tendency estimation.
### Interpretation
This ablation study demonstrates the additive value of each component in the BELM-MDCM model for estimating the Average Treatment Effect on the Challenge Dataset.
* **What the data suggests:** The full model architecture is necessary for optimal performance. Each ablated component leads to a specific type of failure: loss of the Hybrid Objective causes systematic underestimation, loss of Targeted Modeling causes catastrophic overestimation and instability, and loss of Exact Invertibility (DDIM) reduces precision.
* **How elements relate:** The two charts are complementary. The top chart shows *directional bias* (how far the estimate is from the truth), while the bottom chart shows *magnitude of error*. The "w/o Targeted Modeling" variant is particularly interesting: while its mean error (~50) isn't the highest, its enormous standard deviation indicates that in some runs, it can be wildly inaccurate (error up to ~190), making it unreliable.
* **Notable anomaly:** The error bar for `w/o Targeted Modeling` in the bottom chart extends into negative values (to ~-90). Since absolute error cannot be negative, this likely represents the lower bound of the standard deviation calculation around a positive mean, visually emphasizing the extreme variance rather than a literal negative error.
* **Conclusion:** The "Targeted Modeling" component appears to be the most critical for stabilizing the estimation process and preventing large deviations. The "Hybrid Objective" is essential for centering the estimate correctly, and "Exact Invertibility (DDIM)" provides a final layer of refinement. The full model synergistically combines these elements to achieve accurate and precise ATE estimation.
</details>
Figure 9: Visualization of the ablation study results. The top panel shows the mean estimated ATE relative to the true value (red dashed line), with error bars indicating $± 1$ standard deviation. The bottom panel highlights the mean absolute error for each configuration. The full BELM-MDCM model is demonstrably the most accurate and stable.
Table 7: Ablation study results on a challenging synthetic dataset (True ATE = 202.29), validating the necessity of each framework component. The mean and standard deviation are computed over 5 runs.
| Configuration | Estimated ATE | Mean Abs. Error |
| --- | --- | --- |
| w/o Exact Invertibility (DDIM) | 219.98 $±$ 7.48 | 17.68 |
| w/o Hybrid Objective | 137.77 $±$ 23.99 | 64.53 |
| w/o Targeted Modeling | 253.20 $±$ 138.01 | 50.90 |
Conclusion.
The ablation study confirms that our framework's three core design principles (analytical invertibility, a hybrid training objective, and targeted modeling) work in synergy. The removal of any single component creates a significant vulnerability, validating the integrity and effectiveness of our integrated architectural design.
## 6 Causal Information Conservation as a Unifying Principle
The principle of Causal Information Conservation extends beyond a foundation for our model; it offers a unifying lens, a new taxonomy, for analyzing the suitability of any generative model for individual-level causal inference. Applying this principle and its operational metric, SRE, allows us to situate our work and clarify its unique advantages within the broader landscape.
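As an operational metric, SRE can be estimated directly as the round-trip reconstruction error of any encoder/decoder pair. A minimal sketch, where the `encode`/`decode` callables are placeholders for a model's abduction and generation maps:

```python
import numpy as np

def empirical_sre(x, pa, encode, decode):
    # Mean squared drift of the encode -> decode round trip from the
    # identity map; zero SRE is the operational signature of Causal
    # Information Conservation.
    x = np.asarray(x, dtype=float)
    x_rec = decode(encode(x, pa), pa)
    return float(np.mean((x - x_rec) ** 2))

# An exactly invertible pair conserves information (SRE = 0) ...
encode = lambda x, pa: x - pa
decode = lambda z, pa: z + pa
print(empirical_sre(np.array([0.5]), np.array([0.25]), encode, decode))  # 0.0

# ... while a lossy decoder does not.
lossy = lambda z, pa: np.round(z) + pa
print(empirical_sre(np.array([0.5]), np.array([0.25]), encode, lossy))   # 0.0625
```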
### 6.1 Normalizing Flows: Natively Information-Conserving
Normalizing Flows (NFs) (dinh2014nice; dinh2017density; kingma2018glow) are designed around an invertible mapping. By construction, their Structural Reconstruction Error is identically zero, making NFs a native implementation of the Causal Information Conservation principle. However, their strengths come with limitations: the requirement of a tractable Jacobian imposes heavy architectural constraints, which can limit expressive power and introduce strong topological assumptions on the data manifold.
### 6.2 VAEs and GANs: Architecturally High-SRE
Variational Autoencoders (VAEs) (kingma2014autoencoding) and Generative Adversarial Networks (GANs) (goodfellow2014generative) are fundamentally ill-suited for this task. Their Structural Reconstruction Error is large and non-zero due to a fundamental architectural mismatch, as their separate encoder and decoder networks lack any structural guarantee of being inverses. Furthermore, their sources of information loss are inherent to their design; optimization objectives like the ELBO's KL-divergence term actively encourage lossy compression, theoretically impeding the recovery of precise causal information.
### 6.3 A Comparative Perspective on Diffusion Models
Viewed through our principle, diffusion models occupy a unique and compelling space in this taxonomy. Standard diffusion-based approaches, using samplers like DDIM, aspire to conserve information, but their reliance on approximate inversion results in a non-zero SRE; they are "aspirational" but ultimately lossy. In contrast, BELM-MDCM (this work) achieves zero SRE by integrating an analytically invertible sampler, matching the theoretical purity of Normalizing Flows but without their rigid architectural constraints. Furthermore, unlike NFs trained with a generic likelihood objective, our framework's Hybrid Training objective provides a causally oriented inductive bias. BELM-MDCM thus uniquely combines the rigorous invertibility of NFs with the modeling flexibility and task-specific power of the diffusion paradigm, making it ideally suited for principled, high-fidelity causal inference.
## 7 Conclusion and Future Work
This paper introduced Causal Information Conservation as a guiding principle for the emerging field of diffusion-based causal inference. Our primary contribution is not the concept of invertibility itself, but framing it as a foundational design requirement and identifying the Structural Reconstruction Error (SRE) as the precise, quantifiable cost of its violation.
Our proposed framework, BELM-MDCM, serves as a constructive proof of this principle. By being architected around analytical invertibility, it is the first to achieve zero SRE by design, shifting the focus from mitigating numerical errors to upholding a fundamental causal principle. This work provides a foundational blueprint and a more rigorous standard for applying the power of diffusion models to the profound challenges of causal inference, reconciling their flexibility with the logical rigor demanded by classical theory.
### 7.1 Limitations and Future Work
Our work highlights several avenues for future research:
- Handling Non-Invertible SCMs: Our framework excels when the true SCM is invertible. While our stress test shows robustness when this is violated, developing models inherently resilient to such misspecifications is a key challenge. Empirically validating the proposed prior-matching regularizer (Appendix C) is a concrete next step.
- Robustness to Graph Misspecification: Like most SCM-based methods (peters2017elements), our framework assumes a correctly specified causal graph. Analyzing how structural errors (e.g., omitted confounders) propagate through our error decomposition framework is a significant future research direction.
- Formalizing CIC within Information Theory: Our work defines CIC as an operational principle. A compelling direction is to formalize it within a rigorous information-theoretic framework, for instance by proving that zero SRE maximizes the mutual information $I(U;\hat{U})$ between the true and recovered noise, connecting our work to rate-distortion theory.
- Scalability and Generalizability: While powerful, the BELM sampler is computationally intensive. Improving its efficiency for high-dimensional settings is an important practical challenge. Furthermore, extending the principles of CIC and zero-SRE modeling to other data modalities, such as time-series or images, represents an exciting frontier.
Acknowledgments and Disclosure of Funding
We thank the anonymous reviewers for their insightful feedback which significantly improved the clarity and rigor of this paper.
## Appendix A Core Diffusion Model Equations
This appendix provides the essential equations for the diffusion models referenced in this work. Diffusion models learn a data distribution by training a neural network $\boldsymbol{\epsilon}_\theta$ to reverse a fixed, gradual noising process.
The model is trained by optimizing a simplified score-matching objective (ho2020denoising):
$$
\begin{split}L_{\mathrm{simple}}(\theta)=\mathbb{E}_{t,x_0,\boldsymbol{\epsilon}}\bigg[\Big\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_\theta\big(\sqrt{\bar{\alpha}_t}x_0+\sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon},t\big)\Big\|^2\bigg]\end{split}
$$
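A one-sample Monte Carlo estimate of this objective can be sketched as follows. This is illustrative only: `eps_model` and `alpha_bar` are hypothetical stand-ins for the trained noise-prediction network and the precomputed cumulative schedule $\bar{\alpha}_t$.

```python
import numpy as np

def simple_loss(eps_model, x0, t, alpha_bar, rng):
    # Draw the target noise, form the noised sample x_t from the
    # closed-form forward process, and score the model's prediction.
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return float(np.mean((eps - eps_model(x_t, t)) ** 2))
```

An oracle that exactly inverts the forward noising attains zero loss, which is the sense in which $\boldsymbol{\epsilon}_\theta$ learns to recover the injected noise.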
where $\bar{\alpha}_t$ is a predefined noise schedule. For deterministic generation and inversion, we use the Denoising Diffusion Implicit Model (DDIM) update step (song2021denoising):
$$
\begin{split}x_{t-1}=\sqrt{\bar{\alpha}_{t-1}}\left(\frac{x_t-\sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_\theta(x_t,t)}{\sqrt{\bar{\alpha}_t}}\right)+\sqrt{1-\bar{\alpha}_{t-1}}\,\boldsymbol{\epsilon}_\theta(x_t,t)\end{split}
$$
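Note that this update is algebraically invertible in $x_t$ only when the *same* noise prediction is reused; standard DDIM inversion instead re-evaluates $\boldsymbol{\epsilon}_\theta$ at the neighboring point, and that approximation is the source of the non-zero SRE discussed in the main text. A minimal sketch of the step and its exact inverse for a fixed `eps`:

```python
import numpy as np

def ddim_step(x_t, eps, a_t, a_prev):
    # Deterministic DDIM update x_t -> x_{t-1}, given eps = eps_theta(x_t, t)
    # and the schedule values a_t = alpha_bar_t, a_prev = alpha_bar_{t-1}.
    x0_hat = (x_t - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
    return np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps

def ddim_invert_step(x_prev, eps, a_t, a_prev):
    # Exact algebraic inverse of ddim_step *for the same eps*; substituting
    # a noise prediction evaluated at x_prev instead is what makes standard
    # DDIM inversion approximate.
    x0_hat = (x_prev - np.sqrt(1.0 - a_prev) * eps) / np.sqrt(a_prev)
    return np.sqrt(a_t) * x0_hat + np.sqrt(1.0 - a_t) * eps
```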
This process can be viewed as a discretization of a continuous-time probability flow Ordinary Differential Equation (ODE) (song2021score):
$$
dx=\left[\frac{1}{2}\frac{d\log\alpha(s)}{ds}x(s)-\frac{1}{2}\frac{d\log(1-\alpha(s))}{ds}\frac{\sqrt{1-\alpha(s)}}{\sqrt{\alpha(s)}}\boldsymbol{\epsilon}_\theta(x(s),s)\right]ds
$$
Within our theoretical framework, the encoder operator $T_\theta$ corresponds to solving this ODE forward in time (from $s=0$ to $s=1$), while the decoder operator $H_\theta$ corresponds to solving it backward in time (from $s=1$ to $s=0$).
## Appendix B Detailed Proofs for Identifiability (Theorems 2 & 4 )
This appendix provides detailed, dimension-specific proofs for the identifiability of the exogenous noise $U$ and the subsequent correctness of counterfactual generation. The core challenge lies in showing that the statistical independence of the latent code $Z$ from the parents $Pa$ is a sufficient condition to establish an isomorphic relationship between $Z$ and $U$ . The mathematical tools required differ based on the dimensionality of $U$ .
### B.1 The High-Dimensional Case ($d \ge 3$)
For cases where the exogenous noise $U$ has a dimensionality $d \ge 3$, the proof leverages Liouville's theorem on conformal mappings.
Proof [Proof of Theorems 2 and 4 for $d \ge 3$] We adapt the proof from identifiable generative modeling (chao2023interventional) to our conditional operator setting.
Let $Q_{pa}(U):=T_\theta(F(pa,U),pa)$ be the composite function mapping the noise $U$ to the latent code $Z$ for a given context $pa$. By assumption, $Q_{pa}$ is invertible and differentiable. The core assumption, $Z \perp\!\!\!\perp Pa$, implies that the conditional density $p_Z(z|pa)$ must equal a marginal density $p_Z(z)$ that is independent of $pa$.
Using the change of variables formula, we relate the density of $Z$ to that of $U$ :
$$
p_Z(z\mid pa)=\frac{p_U\!\left(Q_{pa}^{-1}(z)\right)}{\left|\det J_{Q_{pa}}\!\left(Q_{pa}^{-1}(z)\right)\right|}
$$
where $J_{Q_{pa}}$ is the Jacobian of $Q_{pa}$. Since the left-hand side is independent of $pa$ and $p_U(\cdot)$ is a fixed distribution, this imposes a strong constraint on the Jacobian determinant. Under regularity conditions, this implies that the Jacobian $J_{Q_{pa}}(u)$ must be a scaled orthogonal matrix, making $Q_{pa}$ a conformal map.
By Liouville's theorem, for dimensions $d \ge 3$, any conformal map must be a Möbius transformation (a composition of translations, scalings, orthogonal transformations, and inversions). For the map to be well-behaved, it must exclude the inversion component, which would introduce singularities. This is consistent with the regularity of functions representable by neural networks, simplifying the map to an affine form:
$$
Q_{pa}(u)=A_{pa}u+d_{pa}
$$
where $A_{pa}$ is a scaled orthogonal matrix. The argument then uses the independence of the distribution's moments and support to show that $A_{pa}$ and $d_{pa}$ must be constant w.r.t. $pa$. This leads to the isomorphic relationship $T_\theta(F(Pa,U),Pa)=AU+d=g(U)$.
The proof of counterfactual correctness follows directly, as detailed in chao2023interventional. The conditional isomorphism $H_\theta(T_\theta(\cdot,pa),pa)=I$ combined with the identifiability result $T_\theta(F(pa,u),pa)=g(u)$ implies that $H_\theta(g(u),pa)=F(pa,u)$. The decoder thus perfectly mimics the true causal mechanism, making an intervention exact: $\hat{X}_{\boldsymbol{\alpha}}=H_\theta(g(u),\boldsymbol{\alpha})=F(\boldsymbol{\alpha},u)=X_{\boldsymbol{\alpha}}^{\mathrm{true}}$.
### B.2 The One-Dimensional Case ($d=1$)
For the one-dimensional case, where Liouville's theorem does not apply, this section provides a dedicated proof leveraging properties of 1D functions and a uniform noise assumption from (chao2023interventional) that does not sacrifice generality.
We first establish a helper lemma characterizing a specific class of 1D functions.
**Lemma 18**
*For $U, Z \in \mathbb{R}$, consider a family of invertible functions $q_{\mathbf{pa}}: \mathcal{U} \to \mathcal{Z}$ for $\mathbf{pa} \in \mathcal{X}_{pa} \subseteq \mathbb{R}^d$. The derivative expression $\frac{dq_{\mathbf{pa}}}{du}(q_{\mathbf{pa}}^{-1}(z))$ is a function of $z$ only, i.e., $c(z)$, if and only if $q_{\mathbf{pa}}(u)$ can be expressed as
$$
q_{\mathbf{pa}}(u) = q(u + r(\mathbf{pa}))
$$
for some function $r$ and invertible function $q$.*
Proof The proof is provided in (chao2023interventional). The reverse direction follows from direct differentiation, while the forward direction uses the inverse function theorem to show that the inverses $s_{\mathbf{pa}}(z) = q_{\mathbf{pa}}^{-1}(z)$ must all have the same derivative $1/c(z)$, implying they can differ only by an additive constant, which yields the desired form after inversion.
This lemma enables the proof of the theorem for the 1D case.
**Theorem 19 (Identifiability for $d=1$)**
*Assume for $X \in \mathcal{X} \subseteq \mathbb{R}$ and exogenous noise $U \sim \mathrm{Unif}[0,1]$, the SCM is $X := f(Pa,U)$. Assume an encoder-decoder model with encoding function $g$ and decoding function $h$. Assume the following conditions:
1. The encoding is independent of the parents, $g(X,Pa) \perp\!\!\!\perp Pa$.
1. The structural equation $f$ is differentiable and strictly increasing w.r.t. $U$.
1. The encoding $g$ is invertible and differentiable w.r.t. $X$.
Then, $g(f(Pa,U),Pa) = \tilde{q}(U)$ for an invertible function $\tilde{q}$.*
Proof Let $q_{\mathbf{pa}}(U) := g(f(\mathbf{pa},U),\mathbf{pa})$. The conditions on $f$ and $g$ ensure $q_{\mathbf{pa}}$ is strictly monotonic and thus invertible. By the independence assumption, the conditional distribution of $Z = q_{\mathbf{pa}}(U)$ does not depend on $\mathbf{pa}$. We assume $U \sim \mathrm{Unif}[0,1]$ without loss of generality: for any SCM with a continuous noise $E$ and a strictly increasing CDF $F_E$, $X = f(Pa,E)$ can be re-parameterized to an equivalent SCM $X = \tilde{f}(Pa,U)$, where $U = F_E(E) \sim \mathrm{Unif}[0,1]$ and $\tilde{f}(\cdot,\cdot) = f(\cdot, F_E^{-1}(\cdot))$. The modeling task is then to learn the potentially more complex function $\tilde{f}$. The change of density formula gives:
$$
p_Z(z) = \frac{p_U\!\left(q_{\mathbf{pa}}^{-1}(z)\right)}{\left|\frac{dq_{\mathbf{pa}}}{du}\!\left(q_{\mathbf{pa}}^{-1}(z)\right)\right|}
$$
Since $p_U(u) = 1$ on its support and $q_{\mathbf{pa}}$ is increasing, the denominator must be independent of $\mathbf{pa}$. This implies $\frac{dq_{\mathbf{pa}}}{du}(q_{\mathbf{pa}}^{-1}(z)) = c(z)$ for some function $c$. This meets the condition of Lemma 18, allowing us to express $q_{\mathbf{pa}}(u) = q(u + r(\mathbf{pa}))$ for some invertible $q$. Since $Z \perp\!\!\!\perp Pa$, its support must also be independent of $\mathbf{pa}$. The support is $q([0,1] + r(\mathbf{pa}))$, which is constant only if the interval $[r(\mathbf{pa}), 1 + r(\mathbf{pa})]$ is constant. This requires $r(\mathbf{pa})$ to be a constant, $r$. Thus, $q_{\mathbf{pa}}(u) = q(u + r)$. Defining $\tilde{q}(u) = q(u + r)$, we find that the mapping is solely a function of $U$, which completes the proof.
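The re-parameterization step used in this proof admits a direct numerical check. The sketch below is illustrative only (the exponential noise and logarithmic mechanism are arbitrary choices, not taken from our experiments): it verifies that $U = F_E(E)$ is uniform on $[0,1]$ and that $\tilde{f}(\mathbf{pa}, u) = f(\mathbf{pa}, F_E^{-1}(u))$ reproduces the original SCM sample-by-sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Original SCM: X = f(pa, E) with E ~ Exp(1), strictly increasing in E.
def f(pa, e):
    return pa + np.log1p(e)  # toy mechanism, increasing in e

# CDF of Exp(1) and its quantile function.
F_E = lambda e: 1.0 - np.exp(-e)
F_E_inv = lambda u: -np.log1p(-u)

E = rng.exponential(1.0, size=200_000)
U = F_E(E)                       # re-parameterized noise

# U should be Unif[0,1]: check its first two moments.
assert abs(U.mean() - 0.5) < 5e-3
assert abs(U.var() - 1 / 12) < 5e-3

# The equivalent SCM X = f~(pa, U) with f~(pa, u) = f(pa, F_E^{-1}(u))
# produces exactly the same X for each individual.
pa = 2.0
X_orig = f(pa, E)
X_reparam = f(pa, F_E_inv(U))
assert np.allclose(X_orig, X_reparam)
```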
### B.3 The Two-Dimensional Case ($d=2$)
The two-dimensional case is a well-known geometric exception: the group of conformal maps in 2D is infinite-dimensional. Consequently, the proof strategy used for higher dimensions via Liouville's theorem does not directly apply. Here, we show that identifiability still holds under additional, plausible regularity assumptions aligned with our modeling framework.
**Assumption 1 (Asymptotic Linearity)**
*The composite mapping $Q_{\mathbf{pa}}(u): \mathbb{C} \to \mathbb{C}$ is an entire function (analytic on the whole complex plane) with at most linear growth. That is, there exist constants $A, B$ such that $|Q_{\mathbf{pa}}(u)| \le A|u| + B$ for all $u \in \mathbb{C}$.*
This assumption reflects a fundamental inductive bias of neural network architectures. Standard activation functions (e.g., ReLU, Tanh) produce functions that cannot exhibit super-polynomial growth or essential singularities at infinity, aligning our analysis with the function classes our model can represent.
**Assumption 2 (Non-Rotationally Symmetric Base Noise)**
*The base distribution of the exogenous noise $U$ is not rotationally symmetric. This assumption is made without loss of generality. For any arbitrary continuous noise $E = (E_1, E_2)$, one can define a new noise $U = (F_{E_1}(E_1), E_2)$, where $F_{E_1}$ is the CDF of the first component. The resulting distribution of $U$ is uniform along its first axis, thus breaking any rotational symmetry. The structural function $F$ then absorbs this transformation.*
Proof The proof proceeds in three steps. First, as established previously, the statistical independence condition $Z \perp\!\!\!\perp Pa$ implies that the learned mapping $Q_{\mathbf{pa}}(u)$ must be a conformal map, and thus an analytic function on $\mathbb{C}$. Second, by Assumption 1, $Q_{\mathbf{pa}}(u)$ is an entire function with at most linear growth. The generalized Liouville theorem states that an entire function whose growth is bounded by a polynomial of degree $k$ must itself be a polynomial of degree at most $k$. In our case, this implies $Q_{\mathbf{pa}}(u)$ must be a polynomial of degree at most one, giving it the affine form:
$$
Q_{\mathbf{pa}}(u) = a(\mathbf{pa})\,u + b(\mathbf{pa})
$$
where $a, b$ are complex coefficients that can depend on $\mathbf{pa}$. Third, we use the full statistical independence condition to show that the coefficients $a$ and $b$ must be constant. For the distribution of $Z = a(\mathbf{pa})U + b(\mathbf{pa})$ to be independent of $\mathbf{pa}$, all of its properties must be constant. Mean: the mean $E[Z \mid \mathbf{pa}] = a(\mathbf{pa})E[U] + b(\mathbf{pa})$ must be constant; assuming $E[U] = 0$ without loss of generality, $b(\mathbf{pa})$ must be a constant, $b$. Covariance: by Assumption 2, the distribution of $U$ is not rotationally symmetric, so its covariance matrix is not proportional to the identity; any rotation induced by the phase of $a(\mathbf{pa})$ would alter the covariance structure of $Z$, so for the distribution of $Z$ to be invariant, the phase of $a(\mathbf{pa})$ must be constant. Scale: the change of variables formula implies that the magnitude $|a(\mathbf{pa})|$ must also be constant. Since both the magnitude and phase of $a(\mathbf{pa})$ are constant, $a(\mathbf{pa})$ must be a constant complex number, $a$. Therefore, $Q_{\mathbf{pa}}(u) = au + b$, which is an isomorphic mapping of $u$. This completes the proof.
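The "Covariance" step above can be illustrated numerically: for a non-rotationally-symmetric noise (Assumption 2), any $\mathbf{pa}$-dependent rotation of $Z$ visibly changes its covariance and would therefore violate $Z \perp\!\!\!\perp Pa$, which is what forces the phase of $a(\mathbf{pa})$ to be constant. A small sketch with an arbitrary diagonal covariance and rotation angle (both purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Non-rotationally-symmetric noise: covariance diag(1, 0.25).
U = rng.normal(size=(200_000, 2)) * np.array([1.0, 0.5])

def rotate(x, theta):
    c, s = np.cos(theta), np.sin(theta)
    return x @ np.array([[c, -s], [s, c]]).T

cov0 = np.cov(U.T)
cov_rot = np.cov(rotate(U, 0.7).T)

# The un-rotated covariance matches diag(1, 0.25) up to sampling noise ...
assert np.allclose(cov0, np.diag([1.0, 0.25]), atol=0.02)
# ... while a rotation (a pa-dependent phase of a(pa)) changes it detectably.
assert not np.allclose(cov0, cov_rot, atol=0.01)
```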
## Appendix C Extended Analysis of Inversion Fidelity and Non-Invertible SCMs
This appendix provides a unified analysis of inversion errors. We first provide rigorous proofs for inversion fidelity (Propositions 5 and 6), then extend our error decomposition framework to the more challenging non-invertible SCM setting.
### C.1 Proofs for Inversion Fidelity (Propositions 5 & 6)
Proof [Proof of Proposition 5] The proof proceeds by deriving the explicit one-step reconstruction error and then analyzing its order of magnitude.
1. Derivation of the One-Step Reconstruction Error.
Let the single-step DDIM inversion operator be $T_t$, mapping an observation $x_t$ to $x^\prime_{t+1}$ using the noise prediction $\boldsymbol{\epsilon}_t = \boldsymbol{\epsilon}_\theta(x_t, t)$. The corresponding generative operator is $H_t$, which reconstructs $x^\prime_t$ from $x^\prime_{t+1}$ using a new prediction at the new state, $\boldsymbol{\epsilon}^\prime_{t+1} = \boldsymbol{\epsilon}_\theta(x^\prime_{t+1}, t+1)$.
By substituting the formula for $x^\prime_{t+1}$ into the update for $x^\prime_t$, the single-step reconstruction error $x^\prime_t - x_t$ is found to be:
$$
x^\prime_t - x_t = \left(\sqrt{1-\bar{\alpha}_t} - \frac{\sqrt{\bar{\alpha}_t}\,\sqrt{1-\bar{\alpha}_{t+1}}}{\sqrt{\bar{\alpha}_{t+1}}}\right)\left(\boldsymbol{\epsilon}^\prime_{t+1} - \boldsymbol{\epsilon}_t\right)
$$
This error is non-zero if and only if the noise prediction changes after one inversion step, i.e., $\boldsymbol{\epsilon}^\prime_{t+1} \neq \boldsymbol{\epsilon}_t$.
2. Analysis of the Error's Order of Magnitude.
In the continuous-time limit with time step $\Delta s = 1/T$, a Taylor expansion shows that both the coefficient term and the difference in noise predictions $(\boldsymbol{\epsilon}^\prime_{t+1} - \boldsymbol{\epsilon}_t)$ are of order $O(\Delta s)$. The one-step reconstruction error is therefore the product of these two terms:
$$
\mathrm{Error}_{step} = O(\Delta s) \times O(\Delta s) = O((\Delta s)^2) = O(1/T^2)
$$
This local error accumulates over the $T$ steps of the trajectory, resulting in a total accumulated error of order $O(1/T)$ . For any finite $T$ , this global error is non-zero, constituting the Structural Reconstruction Error (SRE) for DDIM.
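The $O(1/T)$ accumulation can be reproduced on a toy case without any training. For $x_0 \sim \mathcal{N}(0,1)$ under a variance-preserving schedule, the exact noise predictor is linear, $\boldsymbol{\epsilon}^*(x,t) = \sqrt{1-\bar{\alpha}_t}\,x$, so the DDIM inversion-generation round trip can be simulated exactly. The sketch below (the cosine schedule and constants are illustrative choices) shows the reconstruction error is non-zero for every finite $T$ and shrinks roughly tenfold when $T$ grows tenfold:

```python
import numpy as np

# Noise schedule in "angle" space: sqrt(abar_t) = cos(phi_t), phi_0 = 0.
def alpha_bar(T):
    phis = np.linspace(0.0, np.pi / 2 * 0.98, T + 1)
    return np.cos(phis) ** 2

# For x0 ~ N(0,1) under a VP schedule, the exact noise predictor is linear.
def eps_star(x, abar_t):
    return np.sqrt(1.0 - abar_t) * x

def ddim_step(x, abar_from, abar_to):
    """One DDIM update from noise level abar_from to abar_to (either direction)."""
    eps = eps_star(x, abar_from)
    x0_hat = (x - np.sqrt(1.0 - abar_from) * eps) / np.sqrt(abar_from)
    return np.sqrt(abar_to) * x0_hat + np.sqrt(1.0 - abar_to) * eps

def round_trip_error(T, x0=1.0):
    ab = alpha_bar(T)
    x = x0
    for t in range(T):               # inversion: t -> t+1
        x = ddim_step(x, ab[t], ab[t + 1])
    for t in range(T, 0, -1):        # generation: t -> t-1 (re-evaluated eps)
        x = ddim_step(x, ab[t], ab[t - 1])
    return abs(x - x0)

e20, e200 = round_trip_error(20), round_trip_error(200)
assert e20 > 0                # the SRE is non-zero for finite T
assert e200 < e20 / 5         # 10x more steps -> roughly 10x smaller error
```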
Proof [Proof of Proposition 6] The proof is constructive, following from the exact algebraic invertibility of the BELM sampler (liu2024belm). For the second-order BELM used in this work, the one-step decoder is an affine transformation of the form:
$$
x_{t-1} = A_t x_t + B_t \boldsymbol{\epsilon}_t + C_t \boldsymbol{\epsilon}_{t+1}
$$
where $A_t, B_t, C_t$ are schedule-dependent coefficients, $\boldsymbol{\epsilon}_t = \boldsymbol{\epsilon}_\theta(x_t, t)$, and $\boldsymbol{\epsilon}_{t+1} = \boldsymbol{\epsilon}_\theta(x_{t+1}, t+1)$.
The full-trajectory decoder, $H_{\mathrm{BELM}}$, is a composition of these one-step affine maps. The BELM encoder, $T_{\mathrm{BELM}}$, is constructed using a symmetric update rule designed to be the exact algebraic inverse of the decoder. As rigorously shown by liu2024belm, this construction ensures that the composite operator $T_{\mathrm{BELM}}$ is the exact inverse of $H_{\mathrm{BELM}}$, provided the same sequence of noise function evaluations is used for both processes.
Therefore, by its algebraic construction, the BELM sampler guarantees a lossless round trip:
$$
H_{\mathrm{BELM}} \circ T_{\mathrm{BELM}} = I
$$
The Structural Reconstruction Error is thus identically zero by construction.
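The algebraic inverse can be made concrete with a minimal sketch. The coefficients $A, B, C$ below are placeholders rather than actual BELM schedule values, and a fixed smooth map stands in for the trained network; the point is only that an affine one-step update is exactly invertible, to floating-point precision, when the same cached noise evaluations are reused in both directions:

```python
import numpy as np

rng = np.random.default_rng(0)

def eps_net(x, t):
    # Stand-in for a trained noise-prediction network (arbitrary smooth map).
    return np.tanh(x + 0.1 * t)

# Hypothetical coefficients; real BELM values are derived from the schedule.
A, B, C = 0.95, 0.20, -0.05

def decode_step(x_t, eps_t, eps_t1):
    # Affine one-step decoder: x_{t-1} = A x_t + B eps_t + C eps_{t+1}
    return A * x_t + B * eps_t + C * eps_t1

def encode_step(x_tm1, eps_t, eps_t1):
    # Exact algebraic inverse: solve the affine map for x_t.
    return (x_tm1 - B * eps_t - C * eps_t1) / A

x_t = rng.normal(size=4)
eps_t, eps_t1 = eps_net(x_t, 5), eps_net(x_t, 6)   # cached evaluations

x_tm1 = decode_step(x_t, eps_t, eps_t1)
x_rec = encode_step(x_tm1, eps_t, eps_t1)          # reuse the same evaluations

assert np.allclose(x_rec, x_t, atol=1e-12)         # lossless round trip
```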
### C.2 Exhaustive Analysis for the Non-Invertible SCM Setting
This section rigorously extends our framework to the non-invertible case, providing a theoretical underpinning for the stress test results in § 5.4.1.
#### C.2.1 Assumptions and Definitions
We formalize the problem with the following assumption and definitions.
**Assumption 3 (Well-Posed Abduction)**
*For any observed $(v, \mathbf{pa})$, the inverse image set $\mathcal{U}_{(v,\mathbf{pa})} = \{u^\prime \in \mathcal{U} \mid F(\mathbf{pa}, u^\prime) = v\}$ is non-empty, and the Maximum a Posteriori (MAP) solution over this set, given the prior $p(U)$, is unique. This solution defines the ideal amortized inverse operator, $T^*(v, \mathbf{pa}) = \arg\max_{u^\prime \in \mathcal{U}_{(v,\mathbf{pa})}} p(u^\prime)$.*
**Definition 20 (Tripartite Error Sources)**
*In the non-invertible case, we refine the error sources into three distinct components:
1. Algorithmic Error (SRE): The error from an imperfect inversion algorithm, $E_{SR} := \|(H_\theta \circ T_\theta - I)X\|^2$. For our framework, $E_{SR} \equiv 0$.
1. Modeling Error: The error from imperfectly learning the ideal amortized inverse, $E_{Modeling} := \|T_\theta(V, Pa) - T^*(V, Pa)\|^2$.
1. Representational Error: The fundamental, irreducible error from the SCM's non-invertibility, $E_{Rep} := \|T^*(V, Pa) - U_{true}\|^2$.*
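A toy non-invertible SCM makes the decomposition concrete. With $F(\mathbf{pa}, u) = \mathbf{pa} + u^2$ and an asymmetric prior (our choice of $U \sim \mathcal{N}(0.5, 1)$ is purely illustrative), the MAP inverse $T^*$ reconstructs the observation perfectly yet incurs a strictly positive Representational Error, because the sign of $U$ is lost in the forward pass:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 0.5                                  # assumed asymmetric prior: N(mu, 1)

F = lambda pa, u: pa + u**2               # toy non-invertible mechanism
T_star = lambda v, pa: np.sqrt(v - pa)    # MAP inverse: the root with higher prior density

pa = rng.normal(size=100_000)
U = rng.normal(mu, 1.0, size=100_000)     # true exogenous noise
V = F(pa, U)

U_hat = T_star(V, pa)                     # equals |U|

# The MAP inverse reconstructs the observation perfectly ...
assert np.allclose(F(pa, U_hat), V)

# ... yet E_Rep > 0: the sign of U is information lost in the forward
# pass, i.e., H(U | V, Pa) > 0.
E_rep = np.mean((U_hat - U) ** 2)
assert E_rep > 0.1
```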
#### C.2.2 A Tighter Error Decomposition
We now present a tighter error bound for the non-invertible case.
**Theorem 21 (Tighter Counterfactual Error Bound)**
*Let the conditions of Theorem 8 hold ($H_\theta$ is $L_H$-Lipschitz). The expected squared error of the counterfactual prediction is bounded by:
$$
E\left[\|\hat{X}_\alpha - X_\alpha^{true}\|^2\right] \le 2\,E[E_{SR}] + 4L_H^2\,E[E_{Modeling}] + 4L_H^2\,E[E_{Rep}]
$$*
Proof We decompose the total error $\|\hat{X}_\alpha - X_\alpha^{true}\|$ using the triangle inequality, with the ideal amortized inverse $T^*$ as an intermediate step. The result follows from two nested applications of the inequality $(a+b)^2 \le 2a^2 + 2b^2$, which give $(a+b+c)^2 \le 2a^2 + 4b^2 + 4c^2$, bounding the latter two terms with the $L_H$-Lipschitz property of $H_\theta$, and taking expectations.
#### C.2.3 Information-Theoretic Interpretation of Representational Error
The Representational Error ($E_{Rep}$) is deeply connected to information conservation. The non-invertibility of the SCM $F$ means that observing $(V, Pa)$ is not sufficient to uniquely determine $U$. Information-theoretically, this implies the conditional entropy $H(U \mid V, Pa)$ is strictly positive.
**Remark 22 (Representational Error as Information Loss)**
*The ideal amortized inverse $T^*$ yields the mode of the posterior $p(U \mid V, Pa)$. The expected representational error, $E[E_{Rep}]$, can be seen as the expected squared error of this MAP estimator, which is governed by the variance and shape of the posterior. Thus, $E_{Rep}$ is a direct consequence of the information about $U$ that is fundamentally lost in the forward causal process, a loss captured by $H(U \mid V, Pa) > 0$.*
#### C.2.4 Theoretical Guarantee for the Mitigation Strategy
We now provide a theoretical justification for the "Prior-Matching Regularizer" proposed in § 7.1.
**Definition 23 (Prior-Matching Regularizer)**
*The regularizer is defined as $R(T_\theta) = E_{(v,\mathbf{pa})}\left[\|s_p(T_\theta(v, \mathbf{pa}))\|^2\right]$, where $s_p(u) = \nabla_u \log p(u)$ is the score function of the prior distribution $p(U)$.*
**Proposition 24 (Regularizer Induces Convergence to MAP Solution)**
*Minimizing the regularizer $R(T_\theta)$ provides an inductive bias that encourages the encoder output, $\hat{u} = T_\theta(v, \mathbf{pa})$, to lie on a mode of the prior distribution $p(U)$.*
Proof [Proof Sketch] The objective is a form of score matching on the prior. The score function $s_p(u)$ is zero if and only if $u$ is a stationary point of the log-prior. Minimizing the expected squared norm of the score at the encoder's output penalizes latent codes $\hat{u}$ in low-probability regions of the prior. This incentivizes the encoder to map observations to the most probable latent code, thereby encouraging $T_\theta$ to approximate the ideal MAP estimator $T^*$.
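For a standard normal prior (an assumption made here only for illustration), $s_p(u) = -u$, and gradient descent on the regularizer alone drives any candidate latent code to the prior's unique mode:

```python
import numpy as np

# Assumed prior: standard normal, log p(u) = -||u||^2/2 + const.
score = lambda u: -u               # s_p(u) = grad_u log p(u)

# Penalty on a candidate latent code: R = ||s_p(u_hat)||^2 = ||u_hat||^2.
# Its gradient is dR/du = 2 * J_s(u)^T s_p(u) = 2u, so descent on R alone
# drives u_hat toward the stationary point of log p (the mode at 0).
u_hat = np.array([3.0, -2.0])
for _ in range(200):
    u_hat -= 0.05 * 2 * u_hat

assert np.linalg.norm(u_hat) < 1e-3
assert np.allclose(score(u_hat), 0, atol=1e-3)   # on a mode of the prior
```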
## Appendix D Main Proofs for Theoretical Framework
This appendix provides the proofs for the main theoretical results presented in the text.
Proof [Proof of Theorem 8] This bound is a direct specialization of the more general bound for non-invertible SCMs derived in Theorem 21 (Appendix C).
For an invertible SCM, abduction is perfect: the ideal amortized inverse $T^*$ is the true inverse of the SCM function $F$. Consequently, the recovered noise is the true noise, $T^*(V, Pa) = U_{true}$, which implies that the Representational Error is identically zero: $E[E_{Rep}] = 0$.
In this context, the Latent Space Invariance Error, $E_{LSI}$, becomes equivalent to the Modeling Error, $E_{Modeling}$. Applying the inequality $(a+b)^2 \le 2a^2 + 2b^2$, the general bound from Theorem 21 reduces to the two terms presented in Theorem 8.
Proof [Proof of Proposition 10] This proof relies on the identifiability of the true SCM (Theorem 2) and the Lipschitz continuity of the score network $\boldsymbol{\epsilon}_\theta$, which guarantees unique ODE solutions via the Picard-Lindelöf theorem. We also assume standard integrability conditions (Fubini's theorem).
The encoder $T_\theta$ maps an initial condition $x(0)$ to the terminal state $x(T)$ of the probability flow ODE (Eq. A.3). Let $x_\theta(t; x_0)$ and $x_*(t; x_0)$ denote the ODE solutions with the learned score $\boldsymbol{\epsilon}_\theta$ and the true score $\boldsymbol{\epsilon}^*$, respectively. The Latent Space Invariance Error is $E[E_{LSI}] = E[\|x_\theta(T; X) - x_\theta(T; X_{\boldsymbol{\alpha}}^{true})\|^2]$. By the triangle inequality:
$$
\|x_\theta(T; X) - x_\theta(T; X_{\boldsymbol{\alpha}}^{true})\| \le \|x_\theta(T; X) - x_*(T; X)\| + \|x_*(T; X) - x_*(T; X_{\boldsymbol{\alpha}}^{true})\| + \|x_*(T; X_{\boldsymbol{\alpha}}^{true}) - x_\theta(T; X_{\boldsymbol{\alpha}}^{true})\|
$$
The middle term is zero under ideal identifiability, as the true encoder $T^*$ maps both an observation and its true counterfactual to the same underlying noise. We thus only need to bound terms of the form $\|x_\theta(T; x_0) - x_*(T; x_0)\|$.
Let $z(t) = x_\theta(t; x_0) - x_*(t; x_0)$. Since the ODE vector field is Lipschitz, applying Grönwall's inequality to the differential of $z(t)$ yields:
$$
\|z(T)\| \le \int_0^T e^{L_f (T-t)}\, C_f\, \left\|\boldsymbol{\epsilon}_\theta(x_*(t), t) - \boldsymbol{\epsilon}^*(x_*(t), t)\right\| dt
$$
where $L_f$ and $C_f$ are constants from the ODE coefficients. Squaring, taking expectations, and applying Jensen's inequality leads to:
$$
E[E_{LSI}] \le C^\prime \cdot E_{x,t}\left[\|\boldsymbol{\epsilon}_\theta - \boldsymbol{\epsilon}^*\|^2\right]
$$
where the final expectation is the score-matching loss.
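The linear dependence of the terminal error on the score perturbation can be checked on a one-dimensional toy flow (the vector fields below are illustrative stand-ins, not our trained model): doubling the perturbation roughly doubles the terminal discrepancy, as the Grönwall bound predicts to first order.

```python
import numpy as np

# Two probability-flow-style ODEs dx/dt = f(x) that differ only in the
# "score" term: f*(x) = -x versus f_theta(x) = -(1 + delta) * x.
def integrate(delta, x0=1.0, T=1.0, n=1000):
    x, dt = x0, T / n
    for _ in range(n):
        x += dt * (-(1.0 + delta) * x)   # explicit Euler
    return x

x_true = integrate(0.0)
errs = [abs(integrate(d) - x_true) for d in (0.01, 0.02, 0.04)]

# Terminal-state error grows (to first order) linearly in the perturbation,
# consistent with ||z(T)|| <= C * sup ||eps_theta - eps*||.
assert 1.8 < errs[1] / errs[0] < 2.2
assert 1.8 < errs[2] / errs[1] < 2.2
```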
## Appendix E Proofs for Theoretical Roles of Hybrid Training
### E.1 Proof of Proposition 12 (Weighted Score-Matching)
This section provides a rigorous proof that the auxiliary task loss $L_{task}$ provides a lower bound on a weighted score-matching objective.
#### E.1.1 Preliminaries and Setup
Probability Flow ODE.
The generative process is the solution to the reverse-time probability flow ODE from $t = T$ to $t = 0$:
$$
dx_t = f(t, x_t, \boldsymbol{\epsilon}(t, x_t))\,dt, \qquad x_T \sim \mathcal{N}(0, I)
$$
where the vector field $f$ is determined by the diffusion scheduler.
Core Objects.
- $\boldsymbol{\epsilon}_\theta, \boldsymbol{\epsilon}^*$: Learned and true score functions.
- $x_t^\theta, x_t^*$: ODE trajectories driven by $\boldsymbol{\epsilon}_\theta$ and $\boldsymbol{\epsilon}^*$.
- $x_0^\theta, x_0^*$: Generated and true (counterfactual) data points at $t = 0$.
- $g: \mathbb{R}^d \to \mathbb{R}^k$: Downstream prediction function.
- $Y_{pred} = g(x_0^\theta)$, $Y_{true} = g(x_0^*)$.
- $L_{task} = E_{x_T}[\|Y_{pred} - Y_{true}\|^2]$.
Assumptions.
We assume standard regularity conditions for the proof:
1. The vector field $f(t, x, \boldsymbol{\epsilon})$ is Lipschitz continuous in $x$ and $\boldsymbol{\epsilon}$.
1. The downstream task function $g(x)$ is Lipschitz continuous and differentiable.
1. The learned score $\boldsymbol{\epsilon}_\theta$ and true score $\boldsymbol{\epsilon}^*$ are well-behaved.
#### E.1.2 Formal Proposition Statement
**Proposition 25 (Hybrid Objective as a Weighted Score-Matching Regularizer)**
*Under the regularity assumptions (A1-A3), the auxiliary task loss $L_{task}$ provides a lower bound on a weighted score-matching objective:
$$
L_{task} \ge C \cdot E_{x_T, t}\left[w(x_t^*) \cdot \|\boldsymbol{\epsilon}_\theta(x_t^*) - \boldsymbol{\epsilon}^*(x_t^*)\|^2\right]
$$
where $C > 0$ and the weight function $w(x_t^*)$ measures the sensitivity of the final prediction $Y$ to score perturbations along the ideal data-generation trajectory.*
Proof The proof proceeds in three main steps.
Step 1: Bounding Sample Error by Score Error (Error Propagation).
Let $z(t) = x_t^\theta - x_t^*$ be the error between the two ODE trajectories. By linearizing the vector field $f$ around the true trajectory, we can analyze the impact of the score perturbation $\Delta\boldsymbol{\epsilon}_t = \boldsymbol{\epsilon}_\theta(x_t^*) - \boldsymbol{\epsilon}^*(x_t^*)$. By Duhamel's principle, the final sample error $\Delta x_0 = z(0)$ can be expressed as an integral over the perturbation:
$$
\Delta x_0 = \int_T^0 K(s)\,\Delta\boldsymbol{\epsilon}_s\,ds
$$
where the kernel $K(s)$ captures the influence of a score perturbation at time $s$ on the final sample at time $0$.
Step 2: Linearizing the Task Error.
The error in the downstream prediction is $\Delta Y = g(x_0^\theta) - g(x_0^*)$. A first-order Taylor expansion gives:
$$
\Delta Y \approx \nabla_x g(x_0^*) \cdot \Delta x_0
$$
The task loss is the expected squared norm, $L_{task} = E[\|\Delta Y\|^2]$.
Step 3: Deriving the Weighted Relationship.
Substituting the integral form of $\Delta x_0$ into the task error approximation and applying the Cauchy-Schwarz inequality for integrals yields a bound relating the task loss to the score error:
$$
L_{task} \ge C \cdot E_{x_T}\left[\int_0^T \left\|W(x_0^*, s)\right\|_F^2 \cdot \|\Delta\boldsymbol{\epsilon}_s\|^2\,ds\right]
$$
where $\|\cdot\|_F$ is the Frobenius norm and the influence operator $W(x_0^*, s)$ captures the end-to-end sensitivity from a score perturbation at time $s$ to the final prediction. Rewriting the expectation gives the final form:
$$
L_{task} \ge C \cdot E_{t \sim U[0,T]}\,E_{x_t^*}\left[w(x_t^*) \cdot \|\boldsymbol{\epsilon}_\theta(x_t^*) - \boldsymbol{\epsilon}^*(x_t^*)\|^2\right]
$$
where the weight function is $w(x_t^*) := T \cdot E_{x_T \mid x_t^*}\!\left[\|W(x_0^*, t)\|_F^2\right]$.
Interpretation and Conclusion.
The weight function $w(x_t^*)$ is large when the gradient norm $\|\nabla_x g(x_0^*)\|$ is large, i.e., in causally salient regions where the outcome is highly sensitive to the features. The inequality therefore shows that minimizing $L_{task}$ also suppresses a score-matching error that is weighted to prioritize accuracy in these causally salient regions.
### E.2 Argument for Proposition 13 (Latent Space Disentanglement)
This section provides a qualitative, information-theoretic argument for Proposition 13, showing how the hybrid objective encourages a "division of labor" that promotes disentanglement. The SCM posits that an observation $V$ is determined by $(Pa, U)$, and the encoder maps $(V, Pa)$ to a latent code $Z = T_\theta(V, Pa)$, where a perfect encoder would yield $Z = U$. The process works as follows: the diffusion loss ($L_{diffusion}$) maximizes the log-likelihood $\log p_\theta(V \mid Pa)$, forcing the pair $(Pa, Z)$ to contain all the information needed to reconstruct $V$, thus maximizing the mutual information $I(V; (Pa, Z))$. Simultaneously, the task loss ($L_{task}$) learns a prediction from the parents, capturing all predictive information that $Pa$ has about $V$ and thus maximizing $I(V; Pa)$. The dual objective must satisfy both constraints. From the chain rule of mutual information, $I(V; (Pa, Z)) = I(V; Pa) + I(V; Z \mid Pa)$. Since $L_{diffusion}$ maximizes the left-hand side and $L_{task}$ captures the first term on the right, the optimization incentivizes the latent code $Z$ to model the remaining information, $I(V; Z \mid Pa)$. This leads to a connection to disentanglement: the ideal exogenous noise $U$ is, by definition, independent of $Pa$. By forcing $Z$ to model the residual information, the optimization process actively encourages the learned representation $Z$ to be independent of $Pa$. In summary, the hybrid objective creates a division of labor: the task-specific head explains the variance attributable to $Pa$, while the diffusion process's latent code $Z$ models the residual. This residual is, by construction, the information in $V$ orthogonal to $Pa$, forcing $Z$ to be an empirical approximation of the true, disentangled exogenous noise $U$, thereby serving the identifiability condition of Theorem 2.
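The chain-rule identity at the heart of this argument can be verified exactly on a minimal discrete SCM. With $V = Pa \oplus Z$ for independent fair bits (a toy stand-in in which $Z$ plays the role of $U$), $Pa$ alone carries no information about $V$, and $Z$ carries exactly the residual bit:

```python
import numpy as np
from itertools import product

# Exact joint over (Pa, Z, V) for the toy SCM V = Pa XOR Z,
# with Pa and Z independent fair bits.
p = {}
for pa, z in product([0, 1], [0, 1]):
    p[(pa, z, pa ^ z)] = 0.25

def H(var_idx):
    """Entropy (bits) of the marginal over the given coordinate indices."""
    marg = {}
    for k, pr in p.items():
        key = tuple(k[i] for i in var_idx)
        marg[key] = marg.get(key, 0.0) + pr
    return -sum(pr * np.log2(pr) for pr in marg.values() if pr > 0)

PA, Z, V = 0, 1, 2
I_V_PaZ = H([V]) + H([PA, Z]) - H([PA, Z, V])     # I(V; (Pa, Z))
I_V_Pa = H([V]) + H([PA]) - H([PA, V])            # I(V; Pa)
I_V_Z_given_Pa = I_V_PaZ - I_V_Pa                 # chain-rule residual

assert abs(I_V_PaZ - 1.0) < 1e-9        # (Pa, Z) fully determines V: 1 bit
assert abs(I_V_Pa - 0.0) < 1e-9         # Pa alone says nothing about V
assert abs(I_V_Z_given_Pa - 1.0) < 1e-9 # Z carries exactly the residual bit
```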
## Appendix F Proof of Theorem on Causal Transportability
This appendix provides the proof for the theorem on lossless causal transportability.
Proof [Proof of Theorem 17] This proof operates under an idealized setting, assuming a perfectly trained model (i.e., zero statistical error), to isolate the structural properties of transportability. We assume each learned decoder $H_{\theta_i}$ is identical to the true mechanism $F_i$, and its encoder $T_{\theta_i}$ is its perfect inverse.
Let the source and target SCMs be $M^S$ and $M^T$, with mechanisms $\{F_i\}$ and $\{F^\prime_i\}$ respectively. The set of changed mechanisms is indexed by $K_{changed}$.
The proof relies on the modularity of the SCM, which is guaranteed by the mutually independent exogenous noises (Condition ii of the theorem). This independence ensures that a change in one mechanism $F_k$ to $F^\prime_k$ does not affect the conditional distributions of other nodes $V_j$ ($j \neq k$), given their parents.
We analyze the transportability of operators for each mechanism:
1. For invariant mechanisms ($j \notin K_{changed}$): By definition, the true mechanism is unchanged, $F^\prime_j = F_j$. Since the model operators $(T_{\theta_j}, H_{\theta_j})$ perfectly learned $F_j$ in the source domain, they remain valid for the target domain $T$ and can be directly reused.
1. For changed mechanisms ($k \in K_{changed}$): The original operators $(T_{\theta_k}, H_{\theta_k})$ are now invalid, as they model $F_k$ rather than the new mechanism $F^\prime_k$. However, a new operator pair $(T^\prime_{\theta_k}, H^\prime_{\theta_k})$ can be learned from target domain data. This training only requires samples $(v^\prime_k, \mathbf{pa}^\prime_k)$ from domain $T$ and the shared noise distribution $p_k(U_k)$ (Condition i). Because the noises are independent, this re-learning process for mechanism $k$ is modular and does not affect the other, invariant mechanisms.
The procedure for adapting the model is therefore modular: freeze all invariant operators $\{(T_{\theta_j}, H_{\theta_j})\}_{j \notin K_{changed}}$ and re-train only those for the changed mechanisms $\{k \in K_{changed}\}$ on target domain data. The resulting adapted model is valid for the target domain $T$ and can perform abduction on a target individual by applying the correct (reused or re-trained) encoders, thereby losslessly recovering the full vector of exogenous noises. This fulfills the condition for lossless transport.
## Appendix G Proof of the Specific Finite Sample Bound (Theorem 15)
This derivation combines standard generalization bounds with known complexity bounds for deep neural networks. We first state a key lemma regarding the complexity of neural networks.
**Lemma 26 (Rademacher Complexity of Neural Networks)**
*Let $F_\epsilon$ be the function class of an $L$-layer MLP with ReLU activations, input dimension $p$, and weight matrices $\{W_j\}_{j=1}^L$. Assume the input data $X$ is contained within a ball of radius $R_X$. If the spectral norm of each weight matrix is bounded, $\|W_j\|_2 \le B_j$, the Rademacher complexity of the function class is bounded by:
$$
\mathfrak{R}_n(F_\epsilon) \le C_{net}\,\frac{R_X L \sqrt{p}}{\sqrt{n}}\left(\prod_{j=1}^L B_j\right)
$$
For simplicity and under normalization, we often take $R_X = 1$. This result is a simplified form derived from (bartlett2017spectrally; neyshabur2018pac), where $C_{net}$ is a universal constant.*
Proof The proof proceeds by combining the standard excess risk bound with Lemma 26 and Talagrand's contraction lemma. First, from standard learning theory, the excess risk is bounded by the Rademacher complexity of the total loss function class:
$$
R(\hat{\theta}_n) - R(\theta^*) \le 4\,\mathfrak{R}_n(F_{L_{SCM}}) + M\sqrt{\frac{\log(1/\delta)}{2n}}
$$
Next, by the sub-additivity property of Rademacher complexity, we decompose the SCM's complexity into the sum of its individual components:
$$
\mathfrak{R}_n(F_{L_{SCM}}) \le \sum_{i=1}^d \mathfrak{R}_n(F_{L_i})
$$
The next step is to relate the loss complexity to the network complexity. The score-matching loss for mechanism $i$ is $L_i = \|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_{\theta_i}(\cdot)\|^2$. Since this loss is Lipschitz with respect to the output of $\boldsymbol{\epsilon}_{\theta_i}$, its Rademacher complexity is upper-bounded by a constant multiple of the complexity of the score network's function class $F_{\epsilon_i}$ via Talagrand's contraction lemma, i.e., $\mathfrak{R}_n(F_{L_i}) \le C_1 \cdot \mathfrak{R}_n(F_{\epsilon_i})$. We then apply Lemma 26 to bound the complexity of each score network. The input dimension $p_i$ for mechanism $i$ is $p_i = \dim(x_t) + \dim(\mathbf{pa}_i) + \dim(\text{time embedding})$. Assuming univariate variables, this is $p_i = 1 + |Pa_i| + d_{embed}$. Let $d_{in}^{max} = \max_i |Pa_i|$, so the maximum input dimension is $p_{max} = 1 + d_{in}^{max} + d_{embed}$. Assuming all networks have depth $L$ and a uniform spectral norm bound $B$, we have:
$$
\mathfrak{R}_n(F_{\epsilon_i}) \le C_{net}\,\frac{L\sqrt{p_{max}}}{\sqrt{n}}\,B^L = C_{net}\,\frac{L\sqrt{1 + d_{in}^{max} + d_{embed}}}{\sqrt{n}}\,B^L
$$
Finally, we assemble the complete bound by substituting all components back into the initial inequality:
$$
R(\hat{\theta}_n) - R(\theta^*) \le 4\sum_{i=1}^d \mathfrak{R}_n(F_{L_i}) + M\sqrt{\frac{\log(1/\delta)}{2n}} \le 4d\,C_1 C_{net}\,\frac{L\sqrt{1 + d_{in}^{max} + d_{embed}}}{\sqrt{n}}\,B^L + M\sqrt{\frac{\log(1/\delta)}{2n}}
$$
By defining $C = 4C_1 C_{net}$ as a generic constant independent of the network architecture and sample size, we arrive at the final form stated in the theorem.
## Appendix H Formal Analysis of the Geometric Inductive Bias
This section provides the formal argument for Proposition 3, demonstrating how the score-matching objective, under a simplicity bias, compels the learned generative map to adopt the local data geometry, yielding a parsimonious and well-behaved transformation.
### H.1 Preliminaries and Definitions
Let $M \subseteq \mathbb{R}^d$ be the data manifold with a smooth probability density $p(x)$. The true score function is the vector field $s^*(x) := \nabla_x \log p(x)$. We learn a parameterized score network $s_\theta(x)$ by minimizing the score-matching objective $L_{SM}(\theta) = E_{x \sim p(x)}[\|s_\theta(x) - s^*(x)\|^2]$.
The generative process is described by the probability flow ODE, whose vector field $f(x, t)$ is a function of the score. The map $H_\theta: \mathbb{R}^d \to M$ is the flow map of this ODE integrated from $t = 1$ to $t = 0$. A map is conformal if its Jacobian is everywhere a scaled orthogonal matrix, and affine if it is a linear transformation plus a translation.
### H.2 Assumptions
**Assumption 4 (Smoothness of the Data Density)**
*The true data density $p(x)$ is at least twice continuously differentiable ( $C^2$ ) on $M$ .*
**Assumption 5 (Adoption of the Simplicity Bias Principle)**
*Our analysis relies on the principle of implicit regularization: the conjecture that optimizers like SGD favor solutions with a simplicity bias. We formalize this as a preference for score functions $s_\theta$ with lower complexity (e.g., lower Dirichlet energy or smaller spectral norms) (hochreiter1997flat; neyshabur2018pac).*
### H.3 Proof of Proposition 3
The proof proceeds in four steps:
Step 1: The Irrotational Property of the True Score Field. By definition, the true score $s^*(x)$ is the gradient of the scalar potential $\log p(x)$. By a fundamental theorem of vector calculus, the curl of any gradient field is zero. Thus, $s^*$ is an irrotational (conservative) vector field.
Step 2: The Variational Perspective of Score Matching. The simplicity bias (Assumption 5) reinforces the irrotational nature of the learned field $s_\theta$. This is understood via the Helmholtz decomposition, which splits any vector field into irrotational (curl-free) and solenoidal (divergence-free) components. The simplicity bias conjectures that the optimization preferentially minimizes the energy of the solenoidal component, driving the learned field $s_\theta$ towards a purely irrotational solution ($\nabla \times s_\theta \to 0$) that matches the conservative nature of the true score $s^*$.
Step 3: The Geometry of the Score Field under Local Structural Assumptions. We analyze the score field's structure under two local geometric cases for the density $p(x)$ in a region $R$. Case (i): Local Isotropy. If $p(x)$ is locally spherically symmetric around a center $c$, its value depends only on the radius $r = \|x - c\|$. The gradient $s^* = \nabla \log p(x)$ must point along the radial direction, making the true score a radial vector field. The learned field converges to this simple structure. Case (ii): Local Ellipsoidal Structure. If the iso-contours of $p(x)$ are concentric ellipsoids, the gradient $s^*$ must be orthogonal to these ellipsoidal surfaces at every point. Such a field is still irrotational but no longer radial.
Step 4: The Geometry of the Flow Map. The ODE vector field $f(x, t)$ is a linear combination of the score field $s_\theta^*(x)$ and the position vector $x$. The geometry of the resulting flow map $H_\theta^*$ depends on this vector field's geometry. Result for Case (i): In the isotropic case, $f(x, t)$ is also a radial vector field. A flow generated by a radial vector field is, by its rotational symmetry, necessarily a conformal map, as it preserves angles. Result for Case (ii): In the ellipsoidal case, the flow must transform the isotropic latent space into the anisotropic data space. The simplest such transformation is a local affine transformation, involving direction-dependent scaling and rotation. While this geometric argument is standard, a fully rigorous proof would require directly computing the Jacobian of the integrated flow map $H_\theta^*$ to verify that it satisfies the required conditions. The model's inductive bias thus compels it to learn the most parsimonious geometric transformation necessary to explain the local data geometry, promoting the well-behaved, locally invertible mappings that are crucial for abduction.
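Step 1 is easy to verify numerically: the score of any smooth density is a gradient field, so its curl vanishes identically. A sketch with a Gaussian density (the covariance is chosen arbitrarily for illustration):

```python
import numpy as np

# Score of a Gaussian density: s*(x) = -Sigma^{-1} (x - mu), a gradient field.
Sigma_inv = np.array([[2.0, 0.5], [0.5, 1.0]])
score = lambda x: -Sigma_inv @ x          # mu = 0 for simplicity

# 2D curl: d s_y / dx - d s_x / dy, estimated by central differences.
def curl(x, h=1e-5):
    dsy_dx = (score(x + [h, 0])[1] - score(x - [h, 0])[1]) / (2 * h)
    dsx_dy = (score(x + [0, h])[0] - score(x - [0, h])[0]) / (2 * h)
    return dsy_dx - dsx_dy

# The curl of a gradient field vanishes everywhere (irrotational).
for pt in ([0.3, -1.2], [2.0, 0.7], [-1.0, -1.0]):
    assert abs(curl(np.array(pt))) < 1e-6
```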
## Appendix I Experimental Details
This appendix provides comprehensive details for all experiments presented in Section 5, ensuring full transparency and reproducibility.
### I.1 General Setup
Software Environment.
All experiments were conducted in a unified software environment to ensure full reproducibility. Key library versions used were: dowhy (0.12), econml (0.16.0), numpy (1.26.4), pandas (1.3.5), scikit-learn (1.6.1), torch (1.10.0), lightgbm (4.6.0), and networkx (3.2.1). All stochastic processes were controlled with a fixed global random seed (42), except for the ensemble runs, which used a set of distinct seeds for training.
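For concreteness, the global seeding pattern can be sketched as follows (standard library and numpy only; in the actual pipeline torch would be seeded the same way via `torch.manual_seed` and `torch.cuda.manual_seed_all`):

```python
import random
import numpy as np

SEED = 42  # fixed global seed (ensemble runs use distinct seeds instead)

def set_global_seed(seed: int) -> None:
    """Seed every stochastic source used in the experiments."""
    random.seed(seed)
    np.random.seed(seed)
    # The full pipeline also seeds torch here.

set_global_seed(SEED)
a = np.random.normal(size=3)
set_global_seed(SEED)
b = np.random.normal(size=3)
assert np.allclose(a, b)   # identical draws after re-seeding
```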
Baseline Estimator Configurations.
All baseline estimators were implemented using the dowhy library. For machine learning-based methods like Causal Forest and DML, we utilized their robust implementations from the econml library. To ensure a fair and reproducible comparison against established benchmarks, we used their default hyperparameter settings, which are widely recognized and have been optimized by the library authors to provide strong performance across a broad range of tasks. Our proposed model, in turn, underwent a systematic grid search; the final parameters and a detailed analysis are provided in Appendix I.2.
Details for PSM Failure Scenario (Act I).
The Data Generation Process (DGP) for this experiment ($N=5000$, true ATE $\tau=5000$) follows the graph in Figure 3(a) and is defined by:

$$W_1, W_2 \sim N(0,1)$$

with exogenous noises $U_T \sim \mathrm{Logistic}(0,1)$ and $U_Y \sim N(0,6000^2)$.
Details for Lalonde Benchmark (Act I).
To evaluate performance on real-world data, we used the canonical Lalonde dataset, analyzing the effect of the NSW job training program (treat) on 1978 real earnings (re78). The assumed causal structure is the standard confounding model (Figure 3(c)), a DAG structure that reflects the broad consensus in the causal inference community for this benchmark. In line with our Targeted Modeling principle, we applied the expressive CausalDiffusionModel only to the key treat and re78 nodes, modeling all confounders non-parametrically via their empirical distributions.
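The assumed DAG can be sketched as follows; the column names follow the standard Lalonde data, and the commented mechanism assignments illustrate the Targeted Modeling step (the `CausalDiffusionModel` constructor shown is an assumption for illustration, not the exact API):

```python
import networkx as nx

# Lalonde confounding DAG (Figure 3(c)): every covariate confounds
# both treatment and outcome.
covariates = ["age", "educ", "black", "hisp", "married", "nodegr", "re74", "re75"]
edges = ([(c, "treat") for c in covariates]
         + [(c, "re78") for c in covariates]
         + [("treat", "re78")])
graph = nx.DiGraph(edges)

# Targeted Modeling with dowhy's gcm API (sketch):
#
#   from dowhy import gcm
#   scm = gcm.StructuralCausalModel(graph)
#   for c in covariates:                       # confounders: empirical only
#       scm.set_causal_mechanism(c, gcm.EmpiricalDistribution())
#   scm.set_causal_mechanism("treat", CausalDiffusionModel(sampler_type="belm"))
#   scm.set_causal_mechanism("re78", CausalDiffusionModel(sampler_type="belm"))
#   gcm.fit(scm, lalonde_df)
```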
Details for Semi-Synthetic Analysis (Act II).
This dataset uses the real-world covariates from the Lalonde dataset as a foundation. The outcome $Y$ and the ground-truth Individual Treatment Effect ($\mathrm{ITE}_{\mathrm{true}}$) are then synthetically generated according to the following structural equations:

$$Y_{\mathrm{base}} = 2\cdot X_{\mathrm{re74}} + 1.5\cdot X_{\mathrm{re75}} + 100\cdot X_{\mathrm{educ}} - 50\cdot X_{\mathrm{age}}$$

where $U_{\mathrm{base}} \sim N(0,500^2)$ and $\mu_{\mathrm{re74}}$ is the mean of the re74 covariate.
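A sketch of the baseline-outcome step, using hypothetical stand-in covariates in place of the real Lalonde columns (only the $Y_{\mathrm{base}}$ equation shown above is implemented; the treatment-effect terms are not reproduced here):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical stand-ins for the Lalonde covariates; real runs load the dataset.
n = 1000
df = pd.DataFrame({
    "re74": rng.exponential(4000.0, n),
    "re75": rng.exponential(4000.0, n),
    "educ": rng.integers(3, 17, n),
    "age": rng.integers(17, 56, n),
})

# Baseline outcome from the structural equation in the text.
u_base = rng.normal(0.0, 500.0, n)
y_base = (2.0 * df["re74"] + 1.5 * df["re75"]
          + 100.0 * df["educ"] - 50.0 * df["age"] + u_base)
```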
Details for the Stress Test on Non-Invertible SCMs (Act IV).
This experiment's primary objective is to evaluate the framework's robustness when the core theoretical assumption of SCM invertibility is explicitly violated. The DGP ($N=2000$) is defined as follows:

$$W \sim U(-2,2)$$

The true ATE is exactly $\tau=5.0$. The SCM was structured with W as an EmpiricalDistribution, and T and Y as CausalDiffusionModel nodes.
Details for the Ablation Study (Act IV).
The ablation study was conducted on a challenging synthetic mediation dataset ( $N=4000$ ) designed to highlight the benefits of generative models. The causal graph is shown in Figure 3(b), with the following data generation process:
$$X_1 \sim N(0,1),\quad X_2 \sim U(-2,2),\quad Z \sim \mathrm{Bernoulli}(0.5)$$

where $\sigma(\cdot)$ is the sigmoid function, $U_T \sim \mathrm{Logistic}(0,0.3)$, $U_M \sim N(0,1.5^2)$, and $U_Y$ is drawn from a Gaussian Mixture Model conditioned on $Z$. The true ATE for this DGP is approximately $202.29$. The four ablation arms are:
- BELM-MDCM (Full Model): The complete proposed framework. The nodes T, M, and Y are all modeled by a CausalDiffusionModel with sampler_type='belm' and a hybrid objective weight of $λ=5.0$.
- w/o Analytical Invertibility: Identical to the full model, but the sampler_type for all diffusion models was set to 'ddim'.
- w/o Hybrid Objective: Identical to the full model, but the hybrid objective weight $λ$ for all diffusion models was set to $0.0$.
- w/o Targeted Modeling: The key mediator node M was modeled with a simpler gcm.AdditiveNoiseModel (backed by an LGBMRegressor), while T and Y remained as CausalDiffusionModels with $λ=5.0$ and sampler_type='belm'.
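The arms differ only in a handful of switches, which can be captured in a small configuration table (a sketch; the keys `sampler_type` and `hybrid_weight` mirror the settings named above, while the per-node mechanism labels are illustrative):

```python
# Full-model settings shared by all arms unless overridden.
FULL = {
    "T": "diffusion", "M": "diffusion", "Y": "diffusion",
    "sampler_type": "belm",   # analytically invertible BELM sampler
    "hybrid_weight": 5.0,     # lambda in the hybrid training objective
}

ABLATIONS = {
    "full": dict(FULL),
    # Replace the invertible sampler with DDIM (introduces SRE).
    "wo_analytical_invertibility": {**FULL, "sampler_type": "ddim"},
    # Drop the predictive term from the hybrid objective.
    "wo_hybrid_objective": {**FULL, "hybrid_weight": 0.0},
    # Model the mediator with a simpler additive-noise mechanism.
    "wo_targeted_modeling": {**FULL, "M": "additive_noise_lgbm"},
}
```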
### I.2 Model Hyperparameter Justification
The hyperparameters for our BELM-MDCM model, reported in Table 9, were identified through a systematic grid search for each experiment. The variation in these parameters reflects principled adaptations to different data characteristics, guided by the trade-off between generative fidelity and discriminative accuracy, as well as the signal-to-noise ratio (SNR) of the underlying causal relationships. Below, we provide a holistic analysis for each scenario.
#### I.2.1 Hyperparameter Search Details
To ensure full reproducibility, we detail the hyperparameter search space used to arrive at the configurations in Table 9. The final values for each experiment were selected based on the best performance on a held-out validation set, typically comprising 20-30% of the training data. The primary selection criterion was the lowest PEHE score for experiments with a ground-truth ITE (e.g., Semi-Synthetic), or the best balance of low absolute ATE error and low estimation variance for observational data scenarios. For the Lalonde experiment in particular, we prioritized a configuration that demonstrated superior stability, as detailed in the analysis below.
Table 8: Hyperparameter Search Space and Selection Criteria.
| Hyperparameter | Search Space | Selection Criterion |
| --- | --- | --- |
| Hybrid Weight $λ$ | {0.0, 0.1, 0.3, 1.0, 2.0, 5.0, 10.0} | Best balance of low error and low variance on the validation set. |
| Guidance Weight $w$ | {0.0, 0.1, 0.2, 0.3, 1.0, 5.0, 10.0} | |
| Diffusion Timesteps $T$ | {50, 100, 200, 500} | |
| Learning Rate | {5e-5, 1e-4, 1.1e-4, 1.2e-4, 2e-4} | |
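The selection procedure can be sketched as an exhaustive search over Table 8's grid, scoring each candidate by validation PEHE; the `train_and_score` callback is hypothetical and stands in for training BELM-MDCM on one configuration:

```python
import itertools

import numpy as np

def pehe(ite_hat: np.ndarray, ite_true: np.ndarray) -> float:
    """Root of the Precision in Estimation of Heterogeneous Effects."""
    return float(np.sqrt(np.mean((ite_hat - ite_true) ** 2)))

# Search space from Table 8.
GRID = {
    "hybrid_weight": [0.0, 0.1, 0.3, 1.0, 2.0, 5.0, 10.0],
    "guidance_weight": [0.0, 0.1, 0.2, 0.3, 1.0, 5.0, 10.0],
    "timesteps": [50, 100, 200, 500],
    "lr": [5e-5, 1e-4, 1.1e-4, 1.2e-4, 2e-4],
}

def select_config(train_and_score, grid=GRID):
    """train_and_score(cfg) -> validation score (lower is better)."""
    keys = list(grid)
    best_cfg, best_score = None, float("inf")
    for values in itertools.product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = train_and_score(cfg)
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

For observational scenarios without a ground-truth ITE, the same loop applies with the scoring callback returning a combination of absolute ATE error and estimation variance instead of PEHE.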
Table 9: Consolidated BELM-MDCM Hyperparameters for All Key Experiments. Column headers correspond to the following experiments: PSM (PSM Failure), Lalonde, Semi-Synth (Semi-Synthetic), Ablation (Ablation Study), and Stress Test (Act IV).
| Hyperparameter | PSM | Lalonde | Semi-Synth | Ablation | Stress Test |
| --- | --- | --- | --- | --- | --- |
| Number of Epochs | 1500 | 1000 | 1200 | 700 | 500 |
| Batch Size | 128 | 64 | 64 | 128 | 128 |
| Hidden Dimension | 512 | 512 | 768 | 768 | 256 |
| Learning Rate | $1\times 10^{-4}$ | $1\times 10^{-4}$ | $1.1\times 10^{-4}$ | $1\times 10^{-4}$ | $1\times 10^{-4}$ |
| Diffusion Timesteps ( $T$ ) | 200 | 200 | 50 | 200 | 200 |
| Hybrid Weight $λ$ | 0.1 | 2.0 | 2.0 | 5.0 | 0.5 |
| Guidance Weight ( $w$ ) | 0.0 | 1.0 | 0.1 | 0.2 | 0.0 |
Scenario 1: The "Perfectionist Learner" (PSM Failure Experiment)
Data Profile: This synthetic dataset features complex, non-linear functions (e.g., sin, cos) but is characterized by a pure, high-SNR signal. The main challenge is to perfectly learn this intricate generative process. Hyperparameter Strategy: The strategy prioritizes generative modeling. A low hybrid weight ( $λ=0.1$ ) directs the model to focus on the diffusion loss. Since the conditional signal is strong, classifier-free guidance is disabled ( $w=0.0$ ). A large model capacity and extended training are employed to capture the ground-truth functions.
Scenario 2: The "Pragmatic Signal Extractor" (Lalonde Experiment)
Data Profile: This real-world data is characterized by a weak, noisy signal and a small sample size. The primary challenge is to robustly extract the causal signal. Hyperparameter Strategy: The priority shifts from pure accuracy to a balance of accuracy and stability. A high hybrid weight ( $λ=2.0$ ) provides a strong inductive bias towards the predictive task. Crucially, our systematic grid search revealed that high guidance weights ( $w>1.0$ ) dramatically increased estimation variance, leading to unreliable results. We therefore selected a moderate guidance weight of $w=1.0$ . This configuration provides a stabilizing effect, guiding the model towards the causal signal without amplifying noise, thereby achieving the optimal balance between accuracy and the robustness required for reliable inference on real-world data.
Scenario 3: The "Precision Artist" (Semi-Synthetic Experiment)
Data Profile: This setup presents a hybrid-SNR environment, using noisy real-world covariates but a pure, synthetic outcome function. The task demands high precision. Hyperparameter Strategy: The diffusion timesteps are reduced ( $T=50$ ), plausibly because a shorter generative path better preserves the fine-grained details of the synthetic ITE function. The hybrid weight is high ( $λ=2.0$ ) to focus on the estimation task, while guidance is light ( $w=0.1$ ) as the ITE signal is cleaner than in the purely observational Lalonde case.
Scenario 4: The "Component Analyst" (Ablation Study)
Data Profile: This is a complex, synthetic mediation structure with a clean, high-SNR signal, specifically designed to isolate the performance impact of individual framework components. Hyperparameter Strategy: The strategy is to create a strong, stable baseline. A high hybrid weight ( $λ=5.0$ ) heavily orients the model towards the predictive task, making any performance degradation from ablations more pronounced. A large model capacity (Hidden Dim=768) ensures the model can learn the complex functions, while light guidance ( $w=0.2$ ) provides a minor stabilizing effect.
Scenario 5: The "Robustness Tester" (Stress Test)
Data Profile: A simple DGP featuring a non-invertible causal mechanism ($Y \propto U^2$), designed to test the framework's behavior when its core assumption is violated. Hyperparameter Strategy: The goal is stable learning of a simple function. A moderate hybrid weight ($λ=0.5$) balances generative and predictive learning, while a smaller model (Hidden Dim=256) is sufficient for the simpler DGP and helps prevent overfitting. Guidance is turned off ($w=0.0$) as the signal is clean.
Holistic Comparison
Table 10: Summary of Adaptive Hyperparameter Strategies.
| | PSM | Lalonde | Semi-Synth | Ablation | Stress Test |
| --- | --- | --- | --- | --- | --- |
| Data Signal | Pure & Complex | Noisy & Weak | Hybrid & Complex | Pure & Mediated | Pure & Non-Invertible |
| Guidance ($w$) | 0.0 (Off) | 1.0 (Balanced Guidance) | 0.1 (Slight Nudge) | 0.2 (Light) | 0.0 (Off) |
| Timesteps ($T$) | 200 (Standard) | 200 (Standard) | 50 (Short Path) | 200 (Standard) | 200 (Standard) |
| Hybrid ( $λ$ ) | 0.1 (Generative) | 2.0 (Predictive) | 2.0 (Predictive) | 5.0 (Strongly Predictive) | 0.5 (Balanced) |
| Core Strategy | Perfect Generation | Robust Prediction | Precision Estimation | Component Isolation | Stable Learning |
In conclusion, the variability in hyperparameters demonstrates the flexibility of our framework. It showcases the model's ability to deploy different tools, such as prioritizing the generative loss or amplifying weak signals with guidance, to optimally adapt to the specific challenges posed by diverse causal inference problems.
### I.3 CMF Metric Implementation Details
To ensure the rigor and reproducibility of our evaluation, this section provides the implementation details for the CMF scores.
CMI-Score Estimation Method.
For Conditional Mutual Information (CMI) estimation, we adopt the widely-used k-Nearest Neighbors (k-NN) based estimator from Kraskov et al. (kraskov2004estimating). This method is chosen for its strong theoretical properties and practical robustness, particularly for continuous and high-dimensional data, as it avoids explicit density estimation. Its key advantages include:
- Non-parametric: It makes no assumptions about the underlying data distributions.
- Data-adaptive: The estimation is based on local distances, adapting to the geometry of the data manifold.
- Robustness: It is more robust than methods that rely on fixed binning, which can be sensitive to bin size.
Following common practice, we set the number of neighbors to $k=5$ for all our experiments to ensure a stable and reliable estimation.
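A compact sketch of this estimator (the Frenzel-Pompe conditional variant of the Kraskov et al. k-NN approach, using the Chebyshev metric) is given below; it is an illustrative implementation, not the exact code used in the experiments:

```python
import numpy as np
from scipy.special import digamma
from sklearn.neighbors import NearestNeighbors

def cmi_knn(x, y, z, k=5):
    """k-NN estimate of I(X; Y | Z) in nats (Frenzel-Pompe form).

    x, y, z: arrays of shape (n,) or (n, d). Uses the Chebyshev (max)
    metric, as in Kraskov et al. (2004).
    """
    x, y, z = (np.atleast_2d(np.asarray(a, dtype=float).T).T for a in (x, y, z))
    joint = np.hstack([x, y, z])
    # Distance from each point to its k-th neighbour in the full joint space.
    nn = NearestNeighbors(metric="chebyshev", n_neighbors=k).fit(joint)
    eps = nn.kneighbors()[0][:, -1]

    def counts(a):
        # Number of other points strictly within eps_i in subspace a.
        nn_a = NearestNeighbors(metric="chebyshev").fit(a)
        return np.array([
            len(nn_a.radius_neighbors([p], r - 1e-12)[0][0]) - 1
            for p, r in zip(a, eps)
        ])

    n_xz = counts(np.hstack([x, z]))
    n_yz = counts(np.hstack([y, z]))
    n_z = counts(z)
    return float(digamma(k) - np.mean(
        digamma(n_xz + 1) + digamma(n_yz + 1) - digamma(n_z + 1)))
```

With $k=5$, the estimate is near zero when $X$ and $Y$ are conditionally independent given $Z$, and grows with the strength of the residual dependence.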
KMD-Score Kernel Parameter Selection.
The performance of the MMD test is sensitive to kernel parameter choices. For the KMD-Score, we employed a standard Radial Basis Function (RBF) kernel, $k(x,y)=\exp(-\|x-y\|^2/(2\sigma^2))$. The bandwidth parameter $\sigma$ is critical. Following best practices, we set the bandwidth using the median heuristic, a robust and common data-driven approach. For the analysis on the Lalonde dataset's re78 mechanism, this heuristic yielded a bandwidth of $\sigma=0.1$, a value confirmed as effective in preliminary experiments. All KMD-Scores are reported using this configuration.
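The two ingredients, the median-heuristic bandwidth and the RBF-kernel MMD, can be sketched as follows (an illustrative implementation; a biased V-statistic estimate of squared MMD is used for brevity):

```python
import numpy as np

def median_heuristic_bandwidth(x: np.ndarray, y: np.ndarray) -> float:
    """Set the RBF bandwidth sigma to the median pairwise distance."""
    pts = np.vstack([x, y])
    d = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1))
    return float(np.median(d[np.triu_indices_from(d, k=1)]))

def mmd2_rbf(x: np.ndarray, y: np.ndarray, sigma: float) -> float:
    """Biased estimate of MMD^2 with k(a,b) = exp(-||a-b||^2 / (2 sigma^2))."""
    def gram(a, b):
        sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2.0 * sigma ** 2))
    return float(gram(x, x).mean() + gram(y, y).mean() - 2.0 * gram(x, y).mean())
```

On standardized samples from the model and the oracle, a value near zero indicates distributional agreement; larger values indicate a mismatch in the generated counterfactual distribution.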
## Appendix J Empirical Validation of Proposed Evaluation Metrics
To empirically validate the reliability, sensitivity, and complementary nature of our proposed evaluation metrics (CIC-Score, CMI-Score, and KMD-Score), we conducted a controlled micro-simulation study assessing how each metric responds to a spectrum of increasingly severe model errors.
Experimental Setup.
We designed a simple ground-truth Structural Causal Model (SCM) and created five simulated models (A through E) representing a clear "degradation gradient" of model quality:
- Model A (Oracle): Represents a theoretically perfect model with zero SRE and a perfectly learned causal mechanism. This serves as our gold standard, with expected scores of 1.0.
- Model B (Lossy Inverter): Simulates a model with a non-zero SRE. It uses the correct causal mechanism but introduces a systematic error during the reconstruction cycle, mimicking the core flaw of DDIM-based approaches.
- Model C (Wrong Mechanism): Represents a model that fails to learn the correct functional form of the causal mechanism (e.g., learning a linear instead of a sinusoidal relationship).
- Model D (Maximal Error): Simulates a more severe failure where both the causal mechanism and the inferred noise distribution are fundamentally incorrect.
- Model E (Total Mismatch): Represents the worst-case scenario where the model ignores the causal graph and inputs, generating outputs from an unrelated distribution.
[Figure 10 data, recovered from the plots, for Models A-E respectively: CIC-Score = 1.000, 0.154, 0.158, 0.157, 0.136; CMI-Score = 0.999, 0.994, 0.949, 0.836, 0.758; KMD-Score = 1.000, 0.971, 0.822, 0.790, 0.570.]
Figure 10: Results of the micro-simulation study for metric validation. The three plots show the response of the CIC-Score, CMI-Score, and KMD-Score to five models (A-E) of progressively decreasing quality. The scores demonstrate a clear monotonic degradation, confirming their ability to reliably track model fidelity. Note the CIC-Score's sharp drop from Model A to B, highlighting its specific sensitivity to the Structural Reconstruction Error (SRE).
Analysis of Results.
The results (Figure 10) demonstrate the distinct and complementary roles of our proposed metrics:
1. The CIC-Score acts as a high-sensitivity "SRE detector." It exhibits a dramatic drop from a perfect 1.0 (Model A) to approximately 0.23 (Model B) the moment SRE is introduced, while showing less sensitivity to the specific form of mechanism error. This confirms its primary role as a diagnostic for adherence to the Causal Information Conservation principle.
2. The CMI-Score serves as a robust "mechanism association tracker." It degrades gracefully and monotonically as the learned causal mechanism deviates from the ground truth (from 0.99 to 0.76). This demonstrates its utility in quantifying the fidelity of learned parent-child conditional dependencies.
3. The KMD-Score functions as the "final arbiter" of distributional fidelity. Possessing the widest dynamic range, it is sensitive to all forms of error and provides a holistic judgment of the similarity between the generated and true counterfactual distributions. As the most rigorous metric, it correctly assigns the lowest score to the completely mismatched Model E.
This simulation thus validates our proposed metrics as a reliable and nuanced evaluation framework. They work in synergy to diagnose specific model failings (CIC-Score), assess relational accuracy (CMI-Score), and provide an overall quality judgment (KMD-Score), offering a more insightful assessment than traditional metrics alone.
Discussion on the Non-Zero Lower Bound of Scores.
Notably, even for the worst-performing model (Model E), the CMI and KMD scores do not fall to zero. This behavior is not a limitation but a desirable feature that reflects their ability to capture "residual statistical structure" in the evaluation setting.
- For the KMD-Score: The MMD compares the joint distributions $P_{\mathrm{model}}(Y,W,T)$ and $P_{\mathrm{oracle}}(Y,W,T)$. Crucially, as all models operate on the same observed parent data, the marginal parent distribution $P(W,T)$ is identical between the two. The difference lies only in the conditional $P(Y|W,T)$. Because the joint distributions share a substantial common subspace, their MMD will be finite, preventing the KMD-Score from reaching zero. Furthermore, using a StandardScaler maps both distributions to a similar feature space, which is necessary for a fair, scale-invariant comparison.
- For the CMI-Score: This metric quantifies $I(Y;\mathrm{parent}\mid\mathrm{other\ parents})$. Even an incorrect mechanism (as in Model C or D) still generates an output $Y$ as a deterministic function of its parents, resulting in a non-zero CMI. For Model E, where $Y$ is independent of its parents, the theoretically zero CMI is not observed due to the inherent finite-sample variance of the non-parametric k-NN estimator.
This behavior is advantageous, ensuring the metrics provide a meaningful, continuous gradient of failure rather than a simplistic binary judgment. This enhances their diagnostic power, allowing for fine-grained distinctions between types and degrees of model imperfection.
## Appendix K Algorithm for ATE Estimation
This appendix provides the detailed pseudo-code for the counterfactual imputation procedure used to estimate the Average Treatment Effect (ATE) in our experiments.
Input: Observational data $D=\{v^{(j)}\}_{j=1}^{N}$ where $v^{(j)} \in \mathbb{R}^{d}$; Causal graph $G$.
Output: Estimated Average Treatment Effect ($\widehat{\mathrm{ATE}}$).
Define Invertible SCM $M_\theta$:
$M_\theta=\{f_i(\cdot;\theta_i)\}_{i=1}^{d}$ based on $G$, where $v_i=f_i(pa_i,u_i)$;
// 1. Train the invertible SCM on observational data
$\hat{\theta} \leftarrow \arg\min_{\theta} \sum_{j=1}^{N}\sum_{i=1}^{d} L_i\left(f_i(pa_i^{(j)};\theta_i),\, v_i^{(j)}\right)$;
// 2. Generate counterfactual outcomes for each individual
for $j \leftarrow 1$ to $N$ do
$t_j \leftarrow v_T^{(j)}$;
// Observed treatment
$u^{(j)} \leftarrow M_{\hat{\theta}}^{-1}(v^{(j)})$;
// Abduction via invertible BELM encoder
$y_j(1-t_j) \leftarrow M_{\hat{\theta}}\left(u^{(j)};\, \mathrm{do}(T:=1-t_j)\right)$;
// Action & Prediction
// Store factual and counterfactual outcomes
$Y_j(t_j) \leftarrow v_Y^{(j)}$;
$Y_j(1-t_j) \leftarrow y_j(1-t_j)$;
end for
// 3. Compute the ATE from factual and counterfactual outcomes
$\widehat{\mathrm{ATE}} \leftarrow \frac{1}{N}\sum_{j=1}^{N}\left(Y_j(1)-Y_j(0)\right)$;
return $\widehat{\mathrm{ATE}}$;
Algorithm 1 ATE Estimation with Invertible SCMs via Counterfactual Imputation
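A Python sketch of Algorithm 1's imputation loop is shown below; the `abduct`/`predict` interface of the fitted invertible SCM is hypothetical and stands in for the BELM encoder/decoder pair:

```python
import numpy as np

def estimate_ate(model, data, treat_col="T", outcome_col="Y"):
    """Counterfactual-imputation ATE, following Algorithm 1.

    `model` is any fitted invertible SCM exposing two hypothetical methods:
    abduct(row) -> exogenous noise u for that individual, and
    predict(u, do={treat_col: t}) -> outcome under the intervention.
    """
    y1, y0 = [], []
    for _, row in data.iterrows():
        t = row[treat_col]
        u = model.abduct(row)                            # abduction
        y_cf = model.predict(u, do={treat_col: 1 - t})   # action & prediction
        y_f = row[outcome_col]                           # factual outcome
        if t == 1:
            y1.append(y_f); y0.append(y_cf)
        else:
            y1.append(y_cf); y0.append(y_f)
    return float(np.mean(y1) - np.mean(y0))
```

With a toy linear SCM $Y = 5T + U$ and an oracle `abduct`/`predict` pair, the loop recovers the true effect of 5 exactly, since each individual's factual and counterfactual outcomes share the same $u$.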