# Lifted Causal Inference in Relational Domains
Malte Luttermann (malte.luttermann@dfki.de), German Research Center for Artificial Intelligence (DFKI), LĂŒbeck, Germany; Mattis Hartwig (mattis.hartwig@dfki.de), German Research Center for Artificial Intelligence (DFKI), LĂŒbeck, Germany, and singularIT GmbH, Leipzig, Germany; Tanya Braun (tanya.braun@uni-muenster.de), Data Science Group, University of MĂŒnster, Germany; Ralf Möller (moeller@ifis.uni-luebeck.de), Institute of Information Systems, University of LĂŒbeck, Germany, and German Research Center for Artificial Intelligence (DFKI), LĂŒbeck, Germany; Marcel Gehrke (gehrke@ifis.uni-luebeck.de), Institute of Information Systems, University of LĂŒbeck, Germany
## Abstract
Lifted inference exploits symmetries in probabilistic graphical models by using a representative for indistinguishable objects, thereby speeding up query answering while maintaining exact answers. Even though lifting is a well-established technique for the task of probabilistic inference in relational domains, it has not yet been applied to the task of causal inference. In this paper, we show how lifting can be applied to efficiently compute causal effects in relational domains. More specifically, we introduce parametric causal factor graphs as an extension of parametric factor graphs incorporating causal knowledge and give a formal semantics of interventions therein. We further present the lifted causal inference algorithm to compute causal effects on a lifted level, thereby drastically speeding up causal inference compared to propositional inference, e.g., in causal Bayesian networks. In our empirical evaluation, we demonstrate the effectiveness of our approach.
keywords: Causal graphical models, lifted probabilistic inference, interventional distributions
## 1 Introduction
A fundamental problem for an intelligent agent in the research field of artificial intelligence is to plan and act rationally in a relational domain. To compute the best possible action in a perceived state, the agent considers the available actions and chooses the one with the maximum expected utility. When computing the expected utility of an action that acts on a specific variable, it is crucial to deploy the semantics of an intervention instead of a typical conditioning on that variable (Pearl, 2009, Chapter 4). When calculating the effect of an intervention, the variable in question is set to a fixed value and all incoming probabilistic causal influences of this variable are ignored for the specific query. Only by deploying the semantics of an intervention instead of the typical conditioning can the effect of an action be determined correctly.
Over the last years, causal graphical models have become a widely used formalism to answer questions concerning the causal impact of a treatment variable on an outcome variable. These models combine probabilistic modeling with causal knowledge, which enables computing the effect of an action that intervenes on a particular variable. As our world is inherently relational (i.e., it consists of objects and relations between those objects), it is particularly important to have models that represent the relational structure between objects in addition to capturing causal knowledge. However, commonly applied causal graphical models focus on propositional representations while at the same time relational models lack the ability to efficiently apply causal knowledge for inference. Therefore, we aim to combine the best of both worlds to allow for efficient causal inference in relational domains. In particular, this paper deals with the problem of efficiently computing causal effects in models representing objects and their causal relationships to each other.
#### Previous work.
There has been a considerable amount of work on causal effect estimation in causal graphical models, most of which focuses on models of propositional data (Spirtes et al., 2000; Pearl, 2009). Some works extend propositional factor graphs by adding edge directions to enable the computation of the effect of interventions (Frey, 2003; Winn, 2012). Maier et al. (2010) show that propositional models are insufficient to represent causal relationships within relational domains as required by real-world applications. To express causal dependencies within relational domains, Maier et al. (2013) introduce so-called relational causal models (RCMs) but focus on learning RCMs from observed data. Further related work covering RCMs also focuses on causal discovery and reasoning about conditional independence (e.g., Lee and Honavar, 2015, 2016, 2019). RCMs have also been extended to cover cyclic dependency structures (Ahsan et al., 2022, 2023); however, both acyclic and cyclic RCMs allow only for reasoning about conditional independence on a lifted level and do not allow for lifted causal inference. Prior work dealing with the estimation of causal effects in relational domains applies propositional probabilistic inference (Arbour et al., 2016; Salimi et al., 2020) and thus does not scale for large graphs. Consequently, there is a lack of efficient algorithms to compute causal effects in relational domains. In probabilistic inference, lifting exploits symmetries in a relational model, allowing query answering to be carried out more efficiently while maintaining exact answers (Niepert and Van den Broeck, 2014).
First introduced by Poole (2003), parametric factor graphs (PFGs) and lifted variable elimination (LVE) allow for lifted probabilistic inference, i.e., the exploitation of symmetries in a probabilistic graphical model, resulting in significant speed-ups for probabilistic query answering in relational domains. Over time, LVE has been refined by many researchers to reach its current form (De Salvo Braz et al., 2005, 2006; Milch et al., 2008; KisyƄski and Poole, 2009; Taghipour et al., 2013a; Braun and Möller, 2018). PFGs have been studied for many years and have been developed further to efficiently perform probabilistic inference not only for single queries but also for sets of queries (Braun and Möller, 2016), to incorporate probabilistic inference over time (Gehrke et al., 2018, 2020), and, among other extensions, to allow for decision making by following the maximum expected utility principle (Gehrke et al., 2019a, 2019b; Braun and Gehrke, 2022). Markov logic networks are another lifted representation and have been extended to incorporate maximum expected utility as well (Apsel and Brafman, 2012). Nevertheless, when a decision-making agent plans for the best action to take, previous works improperly apply conditioning (i.e., actions are treated as evidence), as also suggested by Russell and Norvig (2020), instead of the notion of an intervention. Treating actions as evidence, however, is incorrect, as noted by Pearl (2009). To correctly handle the semantics of an action, the notion of an intervention (Pearl et al., 2016) has to be applied.
Therefore, in this paper, we close the gap between PFGs and causal inference in relational domains by introducing parametric causal factor graphs (PCFGs) as an extension of parametric factor graphs incorporating causal knowledge to allow for lifted causal inference, thereby enabling efficient decision making using the notion of an intervention.
#### Our contributions.
PFGs are well-established models that come with LVE as a mature lifted inference algorithm, allowing for tractable probabilistic inference with respect to domain sizes in relational domains. We extend PFGs by incorporating causal knowledge, resulting in parametric causal factor graphs for which we define a formal semantics of interventions. Having defined a formal semantics of interventions in PCFGs, we show how causal effects can be efficiently computed, even for multiple simultaneous interventions. We further introduce the lifted causal inference (LCI) algorithm that operates on a lifted level to allow for lifted causal inference in relational domains. Apart from the theoretical investigation of PCFGs and the LCI algorithm, we provide an empirical evaluation confirming the efficiency of LCI.
#### Structure of this paper.
In Section 2, we introduce both FGs and PFGs as undirected probabilistic graphical models. Thereafter, in Section 3, we define PCFGs as an extension of PFGs incorporating causal knowledge and provide a formal semantics of interventions in PCFGs. In Section 4, we introduce the LCI algorithm operating on a PCFG and show how LCI computes causal effects on a lifted level to avoid grounding the PCFG as much as possible. Afterwards, in our empirical evaluation in Section 5, we investigate the speed-up of LCI compared to performing propositional causal inference both in causal Bayesian networks and in directed FGs before we conclude in Section 6.
## 2 Preliminaries
We begin by introducing FGs as propositional probabilistic models and afterwards define PFGs, which combine probabilistic models and first-order logic to allow for tractable probabilistic inference with respect to domain sizes in relational domains. An FG is an undirected graphical model that compactly encodes a full joint probability distribution over random variables (Frey et al., 1997; Kschischang et al., 2001). Similar to a Bayesian network (Pearl, 1988), an FG factorises a full joint probability distribution into a product of factors.
**Definition 2.1 (Factor Graph)**
*An FG $G=(\boldsymbol V,\boldsymbol E)$ is a bipartite graph with node set $\boldsymbol V=\boldsymbol R\cup\boldsymbol\Phi$ where $\boldsymbol R=\{R_{1},\ldots,R_{n}\}$ is a set of variable nodes (randvars) and $\boldsymbol\Phi=\{\phi_{1},\ldots,\phi_{m}\}$ is a set of factor nodes (functions). There is an edge between a variable node $R_{i}$ and a factor node $\phi_{j}$ in $\boldsymbol E\subseteq\boldsymbol R\times\boldsymbol\Phi$ if $R_{i}$ appears in the argument list of $\phi_{j}$ . A factor is a function that maps its arguments to a positive real number, called potential. The semantics of $G$ is given by
$$
P_{G}=\frac{1}{Z}\prod_{j=1}^{m}\phi_{j}(\mathcal{A}_{j})
$$
with $Z$ being the normalisation constant and $\mathcal{A}_{j}$ denoting the randvars appearing in $\phi_{j}$ .*
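The factorisation in Definition 2.1 can be sketched by brute-force enumeration. The following minimal example is our own illustration (the randvar names and potentials are not taken from Fig. 1): two Boolean randvars $A$ and $B$ with factors $\phi_1(A)$ and $\phi_2(A,B)$, where $P_G$ is the normalised product of factor potentials.

```python
import itertools

# A minimal FG sketch (randvar names and potentials are illustrative):
# Boolean randvars A and B with factors phi1(A) and phi2(A, B).
ranges = {"A": [False, True], "B": [False, True]}
factors = [
    (("A",), {(False,): 2.0, (True,): 1.0}),                 # phi1
    (("A", "B"), {(False, False): 3.0, (False, True): 1.0,
                  (True, False): 1.0, (True, True): 3.0}),   # phi2
]

def joint(factors, ranges):
    """Compute P_G by enumerating all assignments and normalising by Z."""
    names = list(ranges)
    unnorm = {}
    for values in itertools.product(*(ranges[n] for n in names)):
        assignment = dict(zip(names, values))
        pot = 1.0
        for args, phi in factors:
            pot *= phi[tuple(assignment[a] for a in args)]
        unnorm[values] = pot
    Z = sum(unnorm.values())
    return {vals: pot / Z for vals, pot in unnorm.items()}

P = joint(factors, ranges)
assert abs(P[(False, False)] - 6.0 / 12.0) < 1e-9  # 2 * 3 / Z with Z = 12
```

Enumeration is exponential in the number of randvars, which is exactly the cost that lifted inference avoids for models with symmetries.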
Figure 1: A toy example of an FG modelling the interplay of a company's revenue and its employees' competences, which, in turn, can be improved by training employees with a specific training program. We omit the input-output pairs of the factors for brevity.
**Example 2.2**
*Figure 1 shows a toy example of an FG modelling the relationships between a company's revenue, the competences of its employees, and the training of its employees. More specifically, there are randvars $Qual.t_{i}$ indicating the quality of a training program $t_{i}$ , randvars $Comp.e_{j}$ describing the competence of an employee $e_{j}$ , randvars $Train.e_{j}.t_{i}$ specifying whether employee $e_{j}$ has been trained with training program $t_{i}$ , and a randvar $Rev$ denoting the revenue of the company. In this particular example, there is a single company with four employees $alice$ , $bob$ , $dave$ , and $eve$ and there are two training programs $t_{1}$ and $t_{2}$ each employee can be trained with. The randvars $Qual.t_{i}$ , $Comp.e_{j}$ , and $Rev$ can take one of the values $\{low,medium,high\}$ and the randvars $Train.e_{j}.t_{i}$ are Boolean. Factors $\phi_{1}^{i}$ encode the prior probability distribution for the quality of a training program, factors $\phi_{2}^{i}$ encode the relationship between the quality of a training program and an employee being trained with that program, factors $\phi_{3}^{i}$ encode the relationship between an employee being trained by a specific training program and the competence of that employee, and factors $\phi_{4}^{i}$ encode the relationship between the competence of an employee and the revenue of the company. The input-output pairs of the factors are omitted for brevity.*
We continue to define PFGs, first introduced by Poole (2003), which combine probabilistic models and first-order logic. In particular, PFGs use logical variables as parameters in randvars to represent sets of indistinguishable randvars. Each set of indistinguishable randvars is represented by a so-called parameterised randvar (PRV), defined as follows.
**Definition 2.3 (Parameterised Random Variable)**
*Let $\boldsymbol{R}$ be a set of randvar names, $\boldsymbol{L}$ a set of logvar names, and $\boldsymbol{D}$ a set of constants. All sets are finite. Each logvar $L$ has a domain $\mathcal{D}(L)\subseteq\boldsymbol{D}$ . A constraint is a tuple $(\mathcal{X},C_{\mathcal{X}})$ of a sequence of logvars $\mathcal{X}=(X_{1},\ldots,X_{n})$ and a set $C_{\mathcal{X}}\subseteq\times_{i=1}^{n}\mathcal{D}(X_{i})$ . The symbol $\top$ for $C$ marks that no restrictions apply, i.e., $C_{\mathcal{X}}=\times_{i=1}^{n}\mathcal{D}(X_{i})$ . A PRV $R(L_{1},\ldots,L_{n})$ , $n\geq 0$ , is a syntactical construct of a randvar $R\in\boldsymbol{R}$ possibly combined with logvars $L_{1},\ldots,L_{n}\in\boldsymbol{L}$ to represent a set of randvars. If $n=0$ , the PRV is parameterless and forms a propositional randvar. A PRV $A$ (or logvar $L$ ) under constraint $C$ is given by $A_{|C}$ ( $L_{|C}$ ), respectively. We may omit $|\top$ in $A_{|\top}$ or $L_{|\top}$ . The term $\mathcal{R}(A)$ denotes the possible values (range) of a PRV $A$ . An event $A=a$ denotes the occurrence of PRV $A$ with range value $a\in\mathcal{R}(A)$ .*
**Example 2.4**
*Consider $\boldsymbol{R}=\{Qual,Train,Comp,Rev\}$ for quality, training, competence, and revenue, respectively, $\boldsymbol{L}=\{E,T\}$ with $\mathcal{D}(E)=\{alice,bob,dave,eve\}$ (employees) and $\mathcal{D}(T)=\{t_{1},t_{2}\}$ (training programs), combined into PRVs $Qual(T)$ , $Train(E,T)$ , $Comp(E)$ , and $Rev$ .*
A parametric factor (parfactor) describes a function, mapping argument values to positive real numbers, of which at least one is non-zero.
**Definition 2.5 (Parfactor)**
*Let $\Phi$ denote a set of factor names. We denote a parfactor $g$ by $\phi(\mathcal{A})_{|C}$ with $\mathcal{A}=(A_{1},\ldots,A_{n})$ being a sequence of PRVs, $\phi$ $:$ $\times_{i=1}^{n}\mathcal{R}(A_{i})\mapsto\mathbb{R}^{+}$ being a function with name $\phi\in\Phi$ mapping argument values to a positive real number called potential, and $C$ being a constraint on the logvars of $\mathcal{A}$ . We may omit $|\top$ in $\phi(\mathcal{A})_{|\top}$ . The term $lv(Y)$ refers to the logvars in some element $Y$ , a PRV, a parfactor, or sets thereof. The term $gr(Y_{|C})$ denotes the set of all instances (groundings) of $Y$ with respect to constraint $C$ .*
**Example 2.6**
*Take a look at the parfactor $g_{3}=\phi_{3}(Train(E,T),Comp(E))_{|\top}$ . Assuming the same ranges of the PRVs and the same domains of the logvars as in Examples 2.2 and 2.4, $g_{3}$ specifies $2\cdot 3=6$ input-output pairs $\phi_{3}(true,low)=\varphi_{1}$ , $\phi_{3}(true,medium)=\varphi_{2}$ , $\phi_{3}(true,high)=\varphi_{3}$ , and so on with $\varphi_{i}\in\mathbb{R}^{+}$ . Further, $lv(g_{3})=\{E,T\}$ and $gr(g_{3|\top})=\{\phi_{3}(Train(alice,t_{1}),Comp(alice)),\ldots,\phi_{3}(Train(eve,t_{2}),Comp(eve))\}$ . Thus, in this specific example, the parfactor $g_{3}$ is able to represent a set of eight factors.*
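The grounding operator $gr(\cdot)$ can be sketched as follows. The helper and its data layout are our own illustrative assumptions: a parfactor is a list of (randvar name, logvar list) pairs, and a constraint is a set of admissible tuples over the parfactor's (sorted) logvars, with `None` standing for $\top$.

```python
import itertools

# Domains from the running example.
domains = {"E": ["alice", "bob", "dave", "eve"], "T": ["t1", "t2"]}

def ground(name, prvs, constraint=None):
    """Ground a parfactor: one factor per admissible logvar assignment.

    prvs: list of (randvar name, logvar list) pairs; constraint: set of
    admissible tuples over the parfactor's sorted logvars, or None for top.
    """
    logvars = sorted({lv for _, lvs in prvs for lv in lvs})
    tuples = constraint if constraint is not None else \
        set(itertools.product(*(domains[lv] for lv in logvars)))
    groundings = []
    for tup in sorted(tuples):
        binding = dict(zip(logvars, tup))
        args = tuple(f"{r}({','.join(binding[lv] for lv in lvs)})"
                     for r, lvs in prvs)
        groundings.append((name, args))
    return groundings

# g3 = phi3(Train(E,T), Comp(E)) yields |D(E)| * |D(T)| = 4 * 2 = 8 factors:
g3 = ground("phi3", [("Train", ["E", "T"]), ("Comp", ["E"])])
assert len(g3) == 8
```

Note that a single parfactor stands for all of its groundings at once, which is what makes query answering tractable with respect to domain sizes.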
A PFG is then built from a set of parfactors $\{g_{1},\dots,g_{m}\}$ .
**Definition 2.7 (Parametric Factor Graph)**
*A PFG $G=(\boldsymbol V,\boldsymbol E)$ is a bipartite graph with node set $\boldsymbol V=\boldsymbol A\cup\boldsymbol G$ where $\boldsymbol A=\{A_{1},\ldots,A_{n}\}$ is a set of PRVs and $\boldsymbol G=\{g_{1},\ldots,g_{m}\}$ is a set of parfactors. A PRV $A_{i}$ and a parfactor $g_{j}$ are connected via an edge in $G$ (i.e., $\{A_{i},g_{j}\}\in\boldsymbol E$ ) if $A_{i}$ appears in the argument list of $g_{j}$ . The semantics of $G$ is given by grounding and building a full joint distribution. With $Z$ as the normalisation constant and $\mathcal{A}_{k}$ denoting the randvars connected to $\phi_{k}$ , $G$ represents the full joint distribution
$$
P_{G}=\frac{1}{Z}\prod_{g_{j}\in\boldsymbol G}\prod_{\phi_{k}\in gr(g_{j})}
\phi_{k}(\mathcal{A}_{k}).
$$*
Figure 2: A visualisation of a PFG entailing equivalent semantics as the FG shown in Fig. 1. Each parfactor represents a group of factors and each PRV represents a group of randvars.
**Example 2.8**
*Figure 2 depicts a PFG $G$ consisting of four parfactors $g_{1}=\phi_{1}(Qual(T))$ , $g_{2}=\phi_{2}(Qual(T),Train(E,T))$ , $g_{3}=\phi_{3}(Train(E,T),Comp(E))$ , and $g_{4}=\phi_{4}(Comp(E),Rev)$ . Assuming that both the ranges of the PRVs and the domains of the logvars follow Examples 2.2, 2.4 and 2.6, $G$ is a lifted representation entailing equivalent semantics as the FG shown in Fig. 1. Each parfactor $g_{1},\ldots,g_{4}$ represents a group of factors $\phi_{1}^{i},\ldots,\phi_{4}^{i}$ , respectively, and each PRV $Qual$ , $Train$ , $Comp$ , and $Rev$ represents a group of randvars.*
The underlying assumption here is that there are indistinguishable objects, in this specific example employees, which can be represented by a representative. In particular, the assumption is that the competence of every employee has the same influence on the company's revenue, i.e., all factors $\phi_{4}^{i}$ encode the same mappings (and the same holds for the factors $\phi_{1}^{i}$ , $\phi_{2}^{i}$ , and $\phi_{3}^{i}$ , meaning training programs are indistinguishable as well). In other words, it is relevant for the company how many employees are competent but it does not matter which exact employees are competent. Note that the definition of PFGs also includes FGs, as every FG is a PFG containing only parameterless randvars.
In the following, we extend PFGs to incorporate causal knowledge, represented by directed edges defining cause-effect relationships between PRVs.
## 3 Parametric Causal Factor Graphs
PFGs are well-established models for which lifted inference algorithms exist to allow for tractable probabilistic inference with respect to domain sizes. Even though Frey (2003) introduces directed FGs to allow for representing causal knowledge on a ground level, PFGs have not yet been extended to incorporate causal knowledge on a lifted level.
Therefore, we now introduce PCFGs as an extension of PFGs incorporating causal knowledge and give a formal semantics of interventions therein. A PCFG extends a PFG by incorporating causal knowledge in the form of directed edges, that is, all edges between two PRVs (via a parfactor) are directed and describe a cause-effect relationship. For example, an edge $A_{1}\to A_{2}$ indicates that $A_{1}$ is a cause of $A_{2}$ and, consequently, $A_{2}$ is an effect of $A_{1}$ . In particular, in a PCFG, each parfactor is connected to a single child and zero or more parents, matching the definition of directed FGs in the ground case given by Frey (2003). Further, as commonly required in directed graphical models such as causal Bayesian networks, we restrict a PCFG to be acyclic.
**Definition 3.1 (Parametric Causal Factor Graph)**
*A PCFG is a fully directed graph $G=(\boldsymbol V,\boldsymbol E)$ with node set $\boldsymbol V=\boldsymbol A\cup\boldsymbol G$ where $\boldsymbol A=\{A_{1},\ldots,A_{n}\}$ is a set of PRVs and $\boldsymbol G=\{g_{1},\ldots,g_{m}\}$ is a set of directed parfactors. A directed parfactor $g=\phi(\mathcal{A})_{|C}^{\rightarrow A_{i}}$ with $\mathcal{A}=(A_{1},\ldots,A_{k})$ being a sequence of PRVs, $\phi$ $:$ $\times_{i=1}^{k}\mathcal{R}(A_{i})\mapsto\mathbb{R}^{+}$ being a function, and $C$ being a constraint on the logvars of $\mathcal{A}$ , maps its argument values to a positive real number (potential). Again, we may omit $|\top$ in $\phi(\mathcal{A})_{|\top}^{\rightarrow A_{i}}$ . $A_{i}\in\mathcal{A}$ denotes the child of $\phi(\mathcal{A})^{\rightarrow A_{i}}$ whereas all $A_{j}\in\mathcal{A}$ with $j\neq i$ are the parents of $\phi(\mathcal{A})^{\rightarrow A_{i}}$ . For each directed parfactor $g$ , there are edges $(g,A_{i})\in\boldsymbol E$ and $(A_{j},g)\in\boldsymbol E$ (for all $A_{j}\neq A_{i}$ ). A PCFG is an acyclic graph, that is, there is no sequence of edges $(g_{1},A_{1}),(A_{1},g_{2}),\dots,(g_{\ell},A_{\ell}),(A_{\ell},g_{1})$ in $\boldsymbol E$ . The semantics of $G$ is given by grounding and building a full joint distribution, identical to the semantics of a PFG, i.e., with $Z$ as the normalisation constant, $\mathcal{A}_{k}$ denoting the randvars connected to $\phi_{k}$ , and $A_{i}^{k}\in\mathcal{A}_{k}$ specifying the child of $\phi_{k}$ , $G$ represents
$$
P_{G}=\frac{1}{Z}\prod_{g_{j}\in\boldsymbol G}\prod_{\phi_{k}\in gr(g_{j})}
\phi_{k}(\mathcal{A}_{k})^{\rightarrow A_{i}^{k}}.
$$*
**Example 3.2**
*Consider the PCFG $G$ depicted in Fig. 3. $G$ represents the same full joint probability distribution as the PFG shown in Fig. 2. In particular, both models are identical except for the fact that $G$ contains directed edges instead of undirected edges between parfactors and PRVs. Each parfactor represents a group of directed factors and thus, grounding $G$ results in a directed FG. Following previous examples by assuming $\mathcal{D}(T)=\{t_{1},t_{2}\}$ , for example $g_{1}=\phi_{1}(Qual(T))^{\rightarrow Qual(T)}$ represents $gr(g_{1})=\{\phi_{1}(Qual(t_{1}))^{\rightarrow Qual(t_{1})},\allowbreak\phi_{1 }(Qual(t_{2}))^{\rightarrow Qual(t_{2})}\}$ .*
Figure 3: An illustration of a PCFG encoding the same full joint probability distribution as the PFG given in Fig. 2. The only difference between the PCFG and the PFG is that the PCFG contains directed edges instead of undirected edges between PRVs and parfactors.
In the following, we denote the parents of a PRV $A$ by $\mathrm{Pa}_{G}(A)=\{\phi^{\rightarrow A_{i}}\mid A=A_{i}\}$ and the child of a parfactor $\phi$ by $\mathrm{Ch}_{G}(\phi(\mathcal{A})^{\rightarrow A_{i}})=A_{i}$ in a PCFG $G$ . If the context is clear, we omit the subscript $G$ . Before we define the semantics of an intervention in a PCFG, we briefly revisit the notion of $d$ -separation in directed acyclic graphs and afterwards apply it to PCFGs.
### 3.1 $d$-Separation in Parametric Causal Factor Graphs
The notion of $d$ -separation (Pearl, 1986) provides a graphical criterion to test for conditional independence in directed acyclic graphs. Frey (2003) translates the notion of $d$ -separation to directed FGs. We build on the definition of $d$ -separation in directed FGs provided by Frey (2003) to define $d$ -separation in PCFGs on a ground level.
**Definition 3.3 ($d$-Separation)**
*Let $G=(\boldsymbol A\cup\boldsymbol G,\boldsymbol E)$ be a PCFG. Given three disjoint sets of randvars $\boldsymbol X$ , $\boldsymbol Y$ , and $\boldsymbol Z$ (subsets of $\bigcup_{A\in\boldsymbol A}gr(A)$ ), we say that $\boldsymbol X$ and $\boldsymbol Y$ are conditionally independent given $\boldsymbol Z$ , written as $\boldsymbol X\perp\!\!\!\perp\boldsymbol Y\mid\boldsymbol Z$ , if the nodes in $\boldsymbol Z$ block all paths from the nodes in $\boldsymbol X$ to the nodes in $\boldsymbol Y$ in the directed FG obtained by grounding $G$ . A path is a connected sequence of edges $(R_{1},g_{1})$ , âŠ, $(R_{\ell},g_{\ell})$ and is not restricted to follow the arrow directions of the edges. Note that it is therefore also possible for a path to pass from a parent of a factor to another parent of the factor. A path is blocked by the nodes in $\boldsymbol Z$ if
1. the path contains a variable from $\boldsymbol Z$ , or
2. the path passes from a parent of a directed factor $\phi$ to another parent of $\phi$ , and neither the child of $\phi$ nor any of its descendants are in $\boldsymbol Z$ .*
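Definition 3.3 can be operationalised by enumerating the paths in the ground directed FG and applying the two blocking rules. The following is a minimal, unoptimised sketch under our own illustrative assumptions: a small ground graph with a collider ($A \to C \leftarrow B$, $C \to D$), directed factors stored as (parents, child) pairs, and factor names that are hypothetical.

```python
# Ground directed FG with a collider: A -> C <- B and C -> D (illustrative).
ground_factors = {
    "f1": ([], "A"),
    "f2": ([], "B"),
    "f3": (["A", "B"], "C"),
    "f4": (["C"], "D"),
}

def descendants(var, factors):
    """All variables reachable from var by following the factor arrows."""
    out, frontier = set(), [var]
    while frontier:
        v = frontier.pop()
        for parents, child in factors.values():
            if v in parents and child not in out:
                out.add(child)
                frontier.append(child)
    return out

def d_separated(x, y, z, factors):
    """True iff every path from x to y is blocked given z (Definition 3.3)."""
    adj = {}  # undirected variable-factor adjacency
    for name, (parents, child) in factors.items():
        for v in parents + [child]:
            adj.setdefault(v, set()).add(name)
            adj.setdefault(name, set()).add(v)

    def paths(cur, visited):
        if cur == y:
            yield [cur]
            return
        for nxt in adj.get(cur, ()):
            if nxt not in visited:
                for rest in paths(nxt, visited | {nxt}):
                    yield [cur] + rest

    for path in paths(x, {x}):
        blocked = False
        for i in range(1, len(path) - 1):
            node = path[i]
            if node in factors:  # factor node on the path
                parents, child = factors[node]
                # rule 2: parent-to-parent passage, child and its
                # descendants unobserved
                if (path[i - 1] in parents and path[i + 1] in parents
                        and child not in z
                        and not (descendants(child, factors) & z)):
                    blocked = True
                    break
            elif node in z:  # rule 1: observed variable on the path
                blocked = True
                break
        if not blocked:
            return False
    return True

# The collider blocks A - f3 - B unless C or its descendant D is observed:
assert d_separated("A", "B", set(), ground_factors)
assert not d_separated("A", "B", {"C"}, ground_factors)
```

Path enumeration is exponential in general; the sketch only illustrates the semantics of the two rules on a ground graph.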
The semantics of $d$ -separation in PCFGs is defined on a ground level. However, it is possible to check for $d$ -separation on a lifted level without having to ground the PCFG. We give a short motivational example to illustrate this idea and leave the algorithmic details for future work.
**Example 3.4**
*Consider again the PCFG depicted in Fig. 3 and assume we want to check whether $Qual(t_{1})\perp\!\!\!\perp Comp(bob)\mid Train(bob,t_{1})$ holds. In this case, we have assigned $T=t_{1}$ and $E=bob$ , so we only need to examine paths involving this particular assignment of logvars. More specifically, all PRVs on the path with overlapping logvars, i.e., all PRVs having $T$ or $E$ as a logvar, are bound to the same assignment. Therefore, all paths from $Qual(t_{1})$ to $Comp(bob)$ pass through $Train(bob,t_{1})$ , meaning the conditional independence statement in question holds. Note that if there were other paths involving PRVs with non-overlapping logvars, i.e., logvars not involved in the sets $\boldsymbol X$ and $\boldsymbol Y$ , all randvars represented by those PRVs need to be in $\boldsymbol Z$ to block those paths.*
We close this subsection with a final remark on $d$ -separation in PCFGs. In contrast to Bayesian networks, the conditional independence statements induced by a PCFG $G$ are not guaranteed to be compatible with arbitrary potential mappings of the directed parfactors in $G$ (i.e., it is not possible to specify the edge directions and arbitrary potential mappings independent of each other). In other words, the conditional independence statements induced by the graph structure of $G$ must be encoded in the potential mappings of the directed parfactors in $G$ . For example, if the graph structure of $G$ induces a conditional independence statement $X\perp\!\!\!\perp Y\mid Z$ for the randvars $X$ , $Y$ , and $Z$ , the potential mappings of the directed parfactors in $G$ must be specified such that $P(X,Y\mid Z)=P(X\mid Z)\cdot P(Y\mid Z)$ holds. (In a Bayesian network, the numbers in all conditional probability tables can be specified arbitrarily, except for the restriction that each row must sum to one, because the structure of the conditional probability tables enforces that all conditional independence statements induced by the Bayesian network's structure are also encoded in the numbers, i.e., $P(X,Y\mid Z)=P(X\mid Z)\cdot P(Y\mid Z)$ holds if and only if $X\perp\!\!\!\perp Y\mid Z$ .) To ensure that the conditional independence statements induced by the graph structure of $G$ are compatible with the potential mappings of the directed parfactors in $G$ , it is sufficient to specify the potential mappings for each directed parfactor $\phi(A_{1},\dots,A_{k})^{\to A_{i}}$ such that, for every fixed assignment of the parents, the potentials of all assignments $(a_{1},\dots,a_{k})$ which differ only at position $i$ sum up to one (which is equivalent to each row in a conditional probability table in a Bayesian network summing up to one).
Specifying the potential mappings in this way ensures that each directed parfactor $\phi(A_{1},\dots,A_{k})^{\to A_{i}}$ encodes a conditional probability distribution that corresponds to exactly one conditional probability distribution in an equivalent Bayesian network. Consequently, the conditional independence statements induced by the graph structure of $G$ are compatible with the potential mappings of the directed parfactors in $G$ because there exists a Bayesian network that induces the same conditional independence statements as $G$ and encodes the same underlying full joint probability distribution as $G$ . Note that this issue arises only when specifying the model by hand, i.e., in case the model is learned from data, the conditional independence statements induced by the data are automatically reflected both in the graph structure and the potential values.
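This normalisation condition can be checked numerically: if every directed factor is normalised over its child, the product of potentials is already a probability distribution ($Z=1$), and the independences induced by the graph hold in it. A small sketch for a ground chain $X \to Y \to Z$ with our own illustrative potentials:

```python
import itertools

# Chain X -> Y -> Z; each directed factor is normalised over its child
# (illustrative potentials): for every parent value, child values sum to 1.
phi_x = {(0,): 0.3, (1,): 0.7}                                  # phi(X) -> X
phi_y = {(0, 0): 0.8, (0, 1): 0.2, (1, 0): 0.4, (1, 1): 0.6}    # phi(X,Y) -> Y
phi_z = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.5, (1, 1): 0.5}    # phi(Y,Z) -> Z

Pc = {}
for x, y, z in itertools.product((0, 1), repeat=3):
    Pc[(x, y, z)] = phi_x[(x,)] * phi_y[(x, y)] * phi_z[(y, z)]
# Because every factor acts as a conditional distribution, the
# normalisation constant is already 1:
assert abs(sum(Pc.values()) - 1.0) < 1e-9

# The graph structure induces X independent of Z given Y; verify for Y = 1:
p_y1 = sum(p for (x, y, z), p in Pc.items() if y == 1)
for x, z in itertools.product((0, 1), repeat=2):
    lhs = Pc[(x, 1, z)] / p_y1
    p_x_given_y = sum(Pc[(x, 1, zz)] for zz in (0, 1)) / p_y1
    p_z_given_y = sum(Pc[(xx, 1, z)] for xx in (0, 1)) / p_y1
    assert abs(lhs - p_x_given_y * p_z_given_y) < 1e-9
```

With unnormalised potentials, the same graph structure can easily encode a distribution in which this independence fails, which is exactly the incompatibility discussed above.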
The concept of $d$ -separation is important for the computation of the effect of an intervention in the sense that all non-causal paths, so-called backdoor paths, need to be blocked to remove spurious effects. We next show how these backdoor paths are blocked when performing an intervention and give a formal semantics of an intervention in a PCFG.
### 3.2 Semantics of Interventions in Parametric Causal Factor Graphs
To correctly handle the semantics of an action, for example in the setting of a decision-making agent planning for the best action to take, we have to differentiate between seeing (conditioning) and doing (intervention). Let us take a look at the PCFG shown in Fig. 3 again. If we observe (see) an event $Train(bob,t_{1})=true$ , our belief about the probability distribution of $Qual(t_{1})$ might change. More specifically, the probability of $t_{1}$ having a high quality might be higher when observing $Train(bob,t_{1})=true$ than without the observation under the assumption that the probability of training an employee increases if the quality of a training program is high. However, if we are interested in the effect an action setting $Train(bob,t_{1})$ to $true$ , denoted as $do(Train(bob,t_{1})=true)$ , has on the remaining PRVs, we have to ensure that the belief about the probability of $t_{1}$ having a high quality remains unchanged as the action itself has no influence on the probability distribution of $Qual(t_{1})$ . Therefore, it is crucial to avoid the propagation of information against the edge directions whenever we are interested in the effect of an action. That is, if we are interested in the effect a specific randvar $R^{\prime}$ has on another randvar $R$ , all so-called backdoor paths from $R$ to $R^{\prime}$ must be blocked. A backdoor path is a non-causal path, i.e., a backdoor path from $R$ to $R^{\prime}$ is a path that remains after removing all outgoing edges of $R$ .
To account for backdoor paths and correctly handle the semantics of an action, we employ the notion of an intervention. We first define the notion of an intervention on randvars and later extend it to allow for $do$ -expressions on PRVs. An intervention on a randvar $R$ , denoted as $do(R=r)$ with $r\in\mathcal{R}(R)$ , changes the structure of a PCFG by removing all parent edges of $R$ and setting $R$ to the value $r$ . By removing the parent edges, all backdoor paths are removed. Formally, the semantics of an intervention in a PCFG is defined as below, following the definition of an intervention in Bayesian networks provided by Pearl et al. (2016).
**Definition 3.5 (Intervention)**
*Let $\boldsymbol R=\{R_{1},\dots,R_{n}\}$ be the set of randvars obtained by grounding a PCFG $G=(\boldsymbol A\cup\boldsymbol G,\boldsymbol E)$ , i.e., $\boldsymbol R=\bigcup_{A\in\boldsymbol A}gr(A)$ . An intervention $do(R_{1}=r_{1},\dots,R_{k}=r_{k})$ changes the underlying probability distribution such that each factor $\phi(R^{\prime}_{1},\dots,R^{\prime}_{i},\dots,R^{\prime}_{\ell})^{\rightarrow R^{\prime}_{i}}$ with $R^{\prime}_{i}\in\{R_{1},\dots,R_{k}\}$ is replaced by a factor $\phi^{\prime}(R^{\prime}_{1},\dots,R^{\prime}_{i},\dots,R^{\prime}_{\ell})^{\rightarrow R^{\prime}_{i}}$ with
$$
\phi^{\prime}(R^{\prime}_{1}=r^{\prime}_{1},\dots,R^{\prime}_{i}=r^{\prime}_{i},\dots,R^{\prime}_{\ell}=r^{\prime}_{\ell})^{\rightarrow R^{\prime}_{i}}=
\begin{cases}
1 & \text{if } r_{i}=r^{\prime}_{i},\\
0 & \text{if } r_{i}\neq r^{\prime}_{i}.
\end{cases}
$$
All $\phi(R^{\prime}_{1},\dots,R^{\prime}_{i},\dots,R^{\prime}_{\ell})^{\rightarrow R ^{\prime}_{i}}$ with $R^{\prime}_{i}\notin\{R_{1},\dots,R_{k}\}$ remain unchanged.*
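As a small illustration of Def. 3.5, the following Python sketch (all names and data structures are illustrative, not taken from an actual implementation) replaces every factor directed into an intervened randvar by an indicator on the intervened value:

```python
from itertools import product

# A factor directed into a child randvar is stored as (child, scope, table),
# where `table` maps assignment tuples (ordered as in `scope`) to potentials.

def intervene(factors, targets):
    """Apply do(R = r) for every (R, r) in `targets` by replacing each
    factor directed into R with an indicator on R = r (Def. 3.5)."""
    result = []
    for child, scope, table in factors:
        if child in targets:
            r = targets[child]
            i = scope.index(child)
            # Indicator factor: 1 if the assignment sets child = r, else 0.
            table = {a: (1.0 if a[i] == r else 0.0) for a in table}
        result.append((child, scope, table))
    return result

# Boolean example with a hypothetical potential phi(Qual, Train) -> Train,
# followed by the intervention do(Train = true).
dom = [True, False]
phi = {(q, t): 0.8 if (q and t) else 0.2 for q, t in product(dom, dom)}
factors = [("Train", ("Qual", "Train"), phi)]
modified = intervene(factors, {"Train": True})
```

Only factors whose child randvar is intervened on are touched; all other factors, including those in which the intervened randvar appears as a parent of some other randvar, keep their potentials, so the causal effect on descendants is preserved.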
By fixing the values of all parent factors, all parent influences are (virtually) removed from the model, and hence the initial backdoor paths are (virtually) removed as well. Having defined the semantics of an intervention in a PCFG, we are now interested in efficiently computing interventional distributions, i.e., the results of queries that contain $do$ -expressions. Note that in a PCFG, all causal effects are identifiable by definition and thus, we do not have to rewrite a query containing $do$ -expressions according to the $do$ -calculus (Pearl, 1995) to obtain an equivalent query free of $do$ -expressions. In particular, we do not estimate causal effects from observed data but instead compute them in a fully specified model: every PCFG encodes a full joint probability distribution, which we can modify according to the definition of an intervention and afterwards query to answer any query containing $do$ -expressions.
We next introduce the LCI algorithm, which handles interventions in PCFGs efficiently by directly applying the semantics of an intervention on a lifted level.
## 4 Efficient Causal Effect Computation in Parametric Causal Factor Graphs
Now that we have introduced PCFGs, we study the problem of efficiently computing the effect of interventions in PCFGs. A major advantage of using PCFGs instead of propositional models such as causal Bayesian networks is that we mostly do not have to fully ground the model to compute the effect of interventions. Consider again the PCFG illustrated in Fig. 3 and assume we want to compute the interventional distribution $P(Rev\mid do(Train(bob,t_{1})=true))$ in $G$ . Note that when intervening on a randvar, we have to treat it differently than other randvars in the same group on which we do not intervene. An intervention $do(Train(bob,t_{1})=true)$ sets the value of $Train(bob,t_{1})$ to $true$ and thus, we have to treat $bob$ differently from $alice$ , $dave$ , and $eve$ ; in other words, not all employees are indistinguishable anymore. Nevertheless, and this is the crucial point, we can still treat $alice$ , $dave$ , and $eve$ as indistinguishable when computing the interventional distribution.
**Algorithm 1: Lifted Causal Inference**
*Input:* A PCFG $G=(\boldsymbol A\cup\boldsymbol G,\boldsymbol E)$ and a query $P(R_{1},\dots,R_{\ell}\mid do(R^{\prime}_{1}=r^{\prime}_{1},\dots,R^{\prime}_{k}=r^{\prime}_{k}))$ with $\{R_{1},\dots,R_{\ell}\}\subseteq\bigcup_{A\in\boldsymbol A}gr(A)$ and $\{R^{\prime}_{1},\dots,R^{\prime}_{k}\}\subseteq\bigcup_{A\in\boldsymbol A}gr(A)$ .
*Output:* The interventional distribution $P(R_{1},\dots,R_{\ell}\mid do(R^{\prime}_{1}=r^{\prime}_{1},\dots,R^{\prime}_{k}=r^{\prime}_{k}))$ .
1. $G^{\prime}\leftarrow$ split the parfactors in $G$ based on each $R^{\prime}_{i}\in\{R^{\prime}_{1},\dots,R^{\prime}_{k}\}$ .
2. For each $R^{\prime}_{i}\in\{R^{\prime}_{1},\dots,R^{\prime}_{k}\}$ , for each $\phi(A_{1},\dots,A_{z})^{\rightarrow R^{\prime}_{i}}\in\mathrm{Pa}_{G^{\prime}}(R^{\prime}_{i})$ , and for each assignment $(a_{1},\dots,a_{z})\in\mathcal{R}(A_{1})\times\dots\times\mathcal{R}(A_{z})$ : set $\phi(a_{1},\dots,a_{z})=1$ if $(a_{1},\dots,a_{z})$ assigns $R^{\prime}_{i}=r^{\prime}_{i}$ and $\phi(a_{1},\dots,a_{z})=0$ if $(a_{1},\dots,a_{z})$ assigns $R^{\prime}_{i}\neq r^{\prime}_{i}$ .
3. $P\leftarrow$ call LVE to compute $P(R_{1},\dots,R_{\ell})$ in $G^{\prime}$ .
4. Return $P$ .
### 4.1 The Lifted Causal Inference Algorithm
We now introduce the LCI algorithm to compute the interventional distribution $P(R_{1},\dots,R_{\ell}\mid do(R^{\prime}_{1}=r^{\prime}_{1},\dots,R^{\prime}_{k}=r^{\prime}_{k}))$ in a PCFG $G$ . The entire LCI algorithm is shown in Alg. 1.
First, LCI splits the parfactors in $G$ based on the intervention variables $R^{\prime}_{i}\in\{R^{\prime}_{1},\dots,R^{\prime}_{k}\}$ . In particular, splitting parfactors in $G$ results in a modified PCFG $G^{\prime}$ entailing equivalent semantics as $G$ (De Salvo Braz et al., 2005). The procedure of splitting a parfactor works as follows. Recall that $R^{\prime}_{i}=A(L_{1}=l_{1},\dots,L_{j}=l_{j})$ , $l_{1}\in\mathcal{D}(L_{1}),\dots,l_{j}\in\mathcal{D}(L_{j})$ , is a particular instance of a PRV $A(L_{1},\dots,L_{j})$ , that is, it holds that $R^{\prime}_{i}\in gr(A)$ . The idea behind the splitting procedure is to separate $gr(A)$ into two sets $gr(A)\setminus\{R^{\prime}_{i}\}$ and $\{R^{\prime}_{i}\}$ , as $R^{\prime}_{i}$ has to be treated differently than the remaining instances of $A$ . Therefore, every parfactor $g$ for which there is an instance $\phi\in gr(g)$ such that $R^{\prime}_{i}$ appears in the argument list of $\phi$ is split. Formally, splitting a parfactor $g$ replaces $g$ by two parfactors $g^{\prime}_{|C^{\prime}}$ and $g^{\prime\prime}_{|C^{\prime\prime}}$ and adapts their constraints. The constraints $C^{\prime}$ and $C^{\prime\prime}$ are altered such that the inputs of $g^{\prime}_{|C^{\prime}}$ are restricted to all sequences that contain $R^{\prime}_{i}$ and the inputs of $g^{\prime\prime}_{|C^{\prime\prime}}$ are restricted to the remaining input sequences. After the splitting procedure, the semantics of the model remains unchanged as the groundings of $G^{\prime}$ are still the same as the groundings of the initial model $G$ ; they are just arranged differently across the sets of ground instances.
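The splitting step can be pictured on constraints directly. In the following hypothetical sketch (names are illustrative, not from an actual implementation), a constraint is simply a set of logvar-assignment tuples, and splitting separates the single intervened instance from the rest while the potentials stay untouched:

```python
# A hedged sketch of splitting a parfactor's constraint on an intervened
# instance: the constraint is modelled as a set of logvar-assignment
# tuples, e.g. (employee, training_program).

def split_constraint(constraint, instance):
    """Split a constraint into the remainder and the part holding only
    the intervened instance; the parfactor's potentials are unchanged."""
    hit = {t for t in constraint if t == instance}
    rest = constraint - hit
    return rest, hit

# Hypothetical constraint over (E, T) and the intervened instance (bob, t1).
C = {("alice", "t1"), ("alice", "t2"), ("bob", "t1"), ("bob", "t2")}
C2, C2_prime = split_constraint(C, ("bob", "t1"))
# C2 keeps every instance except (bob, t1); C2' holds only (bob, t1).
```

The union of both parts equals the original constraint, mirroring the fact that the groundings of $G^{\prime}$ and $G$ coincide.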
Having completed the split of all respective parfactors, LCI next modifies the parents of $R^{\prime}_{i}$ , i.e., the underlying probability distribution encoded by $G^{\prime}$ is modified according to the semantics of the intervention $do(R^{\prime}_{i}=r^{\prime}_{i})$ with $r^{\prime}_{i}\in\mathcal{R}(R^{\prime}_{i})$ . More specifically, as $R^{\prime}_{i}$ is fixed on $r^{\prime}_{i}$ , all parents $\phi\in\mathrm{Pa}_{G^{\prime}}(R^{\prime}_{i})$ of $R^{\prime}_{i}$ are altered such that all input sequences assigning $R^{\prime}_{i}=r^{\prime}_{i}$ map to the potential value one while all other input sequences map to zero. Finally, LCI computes the result for $P(R_{1},\dots,R_{\ell})$ in the modified model $G^{\prime}$ , which is equivalent to the result for $P(R_{1},\dots,R_{\ell}\mid do(R^{\prime}_{1}=r^{\prime}_{1},\dots,R^{\prime}_{ k}=r^{\prime}_{k}))$ in the original model $G$ . To perform query answering in $G^{\prime}$ , LVE can be applied to $G^{\prime}$ by simply ignoring the edge directions as the semantics of a PFG and a PCFG are defined identically.
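Putting the steps together, the following hedged end-to-end sketch mimics LCI on a tiny ground model; the brute-force summation stands in for LVE, and all variable names and potentials are hypothetical:

```python
from itertools import product

# Factors are stored as (child, scope, table): `table` maps tuples of
# boolean values (ordered as in `scope`) to potentials and is directed
# into `child`.

def interventional_marginal(variables, factors, query, do):
    """Compute P(query | do(...)) by (1) replacing the parent factors of
    every intervened variable with indicators and (2) summing out all
    other variables by brute force (a propositional stand-in for LVE)."""
    fixed = []
    for child, scope, table in factors:
        if child in do:  # modify the parents of the intervened variable
            i = scope.index(child)
            table = {a: (1.0 if a[i] == do[child] else 0.0) for a in table}
        fixed.append((scope, table))
    dist = {True: 0.0, False: 0.0}
    for world in product([True, False], repeat=len(variables)):
        w = dict(zip(variables, world))
        weight = 1.0
        for scope, table in fixed:
            weight *= table[tuple(w[v] for v in scope)]
        dist[w[query]] += weight
    z = sum(dist.values())
    return {value: mass / z for value, mass in dist.items()}

# Tiny chain Qual -> Train -> Comp with hypothetical potentials.
factors = [
    ("Qual", ("Qual",), {(True,): 0.4, (False,): 0.6}),
    ("Train", ("Qual", "Train"),
     {(True, True): 0.9, (True, False): 0.1,
      (False, True): 0.3, (False, False): 0.7}),
    ("Comp", ("Train", "Comp"),
     {(True, True): 0.8, (True, False): 0.2,
      (False, True): 0.5, (False, False): 0.5}),
]
p = interventional_marginal(["Qual", "Train", "Comp"], factors,
                            "Comp", {"Train": True})
# Under do(Train = true), Comp depends only on the fixed value of Train.
```

Note that conditioning on $Train=true$ instead would let information flow back to $Qual$ , whereas the intervention leaves the prior over $Qual$ untouched.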
Before we continue to examine the correctness of Alg. 1, we take a look at an example.
Figure 4: A visualisation of the modified PCFG obtained after altering the PCFG shown in Fig. 3 by splitting $g_{2}$ and $g_{3}$ to separate $Train(bob,t_{1})$ from $Train(E,T)$ . Here, the constraints $C_{2}$ as well as $C_{3}$ include all instances of $Train(E,T)$ except for $Train(bob,t_{1})$ and $C^{\prime}_{2}$ as well as $C^{\prime}_{3}$ are restricted to the single instance $Train(bob,t_{1})$ of $Train(E,T)$ . Note that the graph size remains significantly smaller than for the fully grounded model.
**Example 4.1**
*Consider again the PCFG $G$ shown in Fig. 3 and assume we would like to compute $P(Rev\mid do(Train(bob,t_{1})=true))$ . As $Train(bob,t_{1})$ is a particular instance of $Train(E,T)$ , we have to split the parfactors $g_{2}$ and $g_{3}$ while $g_{1}$ as well as $g_{4}$ keep $\top$ as their constraint. Figure 4 shows the modified PCFG $G^{\prime}$ obtained after splitting $g_{2}$ and $g_{3}$ based on the intervention on $Train(bob,t_{1})$ . In $G^{\prime}$ , $Train(bob,t_{1})$ is now a separate node in the graph, connected to two newly introduced parfactors. In particular, $g_{2}$ has been replaced by two parfactors $g_{2_{|C_{2}}}$ and $g^{\prime}_{2_{|C^{\prime}_{2}}}$ with constraints $C_{2}=((E,T),\{(alice,t_{1}),(alice,t_{2}),(bob,t_{2}),(dave,t_{1}),(dave,t_{2}),(eve,t_{1}),(eve,t_{2})\})$ and $C^{\prime}_{2}=((E,T),\{(bob,t_{1})\})$ . In other words, $g_{2_{|C_{2}}}$ is restricted to all instances of $Train(E,T)$ except for $Train(bob,t_{1})$ and $g^{\prime}_{2_{|C^{\prime}_{2}}}$ is restricted to the instance $Train(bob,t_{1})$ of $Train(E,T)$ . Analogously, $g_{3}$ has been replaced by two parfactors $g_{3_{|C_{3}}}$ and $g^{\prime}_{3_{|C^{\prime}_{3}}}$ . To incorporate the semantics of $do(Train(bob,t_{1})=true)$ , LCI next modifies the parents of $Train(bob,t_{1})$ , i.e., LCI modifies $g^{\prime}_{2_{|C^{\prime}_{2}}}$ in this example. More specifically, $g^{\prime}_{2_{|C^{\prime}_{2}}}(Qual(t_{1})=q,Train(bob,t_{1})=true)$ is set to one and $g^{\prime}_{2_{|C^{\prime}_{2}}}(Qual(t_{1})=q,Train(bob,t_{1})=false)$ is set to zero for all $q\in\mathcal{R}(Qual(t_{1}))$ . Finally, LVE can be run to compute $P(Rev)$ in $G^{\prime}$ , which is equivalent to the interventional distribution $P(Rev\mid do(Train(bob,t_{1})=true))$ in the original model $G$ .*
Due to the splitting of parfactors, it might be the case that there are PRVs in $G^{\prime}$ having more parents than they previously had in the original model $G$ , as with $Comp(E)$ in Fig. 4. The semantics of the model, however, remains unchanged because $\bigcup_{g\in\boldsymbol G}gr(g)=\bigcup_{g\in\boldsymbol G^{\prime}}gr(g)$ . Given the way we specified the semantics of an intervention in a PCFG, it immediately follows that LCI correctly computes the effect of interventions. In particular, as LCI directly applies Def. 3.5 by setting the parent factors of all variables we intervene on accordingly, the semantics of the modified model $G^{\prime}$ is equivalent to the semantics of interventions from Def. 3.5 and thus, LCI is correct.
**Corollary 4.2**
*Alg. 1 computes the interventional distribution according to Def. 3.5.*
Moreover, directly applying Def. 3.5 allows LCI to exploit the established LVE algorithm. By deploying LVE, LCI is able to perform tractable inference with respect to domain sizes for all PCFGs in the class of domain-liftable models (Taghipour et al., 2013b) if the $do$ -expressions in the query do not ground a domain. The class of domain-liftable models includes all PCFGs containing only parfactors with at most two logvars and all PCFGs containing only PRVs having at most one logvar.
**Corollary 4.3**
*Alg. 1 performs tractable probabilistic inference with respect to domain sizes for the class of domain-liftable models if the $do$ -expressions in the query do not ground a domain.*
A more fine-grained complexity analysis of LVE is provided by Taghipour (2013). Note that the loops in Alg. 1 do not influence the overall run time complexity of LCI as they iterate over potential mappings that must be considered anyway during inference (both by LVE and its propositional counterpart, variable elimination).
To summarise, LCI is a simple, yet effective algorithm to perform lifted causal inference, even for interventions on large groups of randvars, as we investigate next.
### 4.2 Handling Interventions on Groups of Random Variables
LCI is able to handle both interventions on a single (ground) randvar as well as interventions on a conjunction of multiple randvars efficiently. In particular, when intervening on multiple randvars at the same time, LCI is able to treat those randvars as a group. For example, recall the employee example and assume we want to train multiple employees simultaneously as a training program is mostly offered not only for a single employee but for a group of employees. Then, it is not necessary to split all trained employees into separate groups; it is sufficient to differentiate between trained employees and all remaining employees. Formally, the interventions $do(R^{\prime}_{1}=r^{\prime}_{1},\dots,R^{\prime}_{k}=r^{\prime}_{k})$ on an arbitrary set of randvars $\{R^{\prime}_{1},\dots,R^{\prime}_{k}\}$ can thus efficiently be handled by splitting the parfactors in $G$ such that all $R^{\prime}_{i}$ that are represented by the same PRV $A$ and set to the same value $r^{\prime}_{i}$ remain grouped, equal to splitting on constraints in LVE. More specifically, LCI needs just a single split per group and thus avoids manipulating the parents of each individual randvar separately. Furthermore, it is also possible to intervene on a PRV (instead of intervening on a randvar). The semantics of an intervention on a PRV $A$ is given by $do(A=a)=do(R_{1}=a,\dots,R_{k}=a)$ with $gr(A)=\{R_{1},\dots,R_{k}\}$ . Again, LCI is able to treat all randvars represented by $A$ as a group and therefore is not required to split the group. In contrast, in a propositional model, every object has to be treated individually and therefore the parents for each randvar need to be manipulated separately.
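This grouping can be sketched as follows (a hedged illustration with hypothetical names, not the paper's implementation): instances intervened to the same value stay in one group, so a single constraint split per group suffices:

```python
from collections import defaultdict

def group_interventions(interventions):
    """Map each (PRV name, intervened value) to its set of instances,
    so that identically intervened randvars stay grouped."""
    groups = defaultdict(set)
    for (prv, instance), value in interventions.items():
        groups[(prv, value)].add(instance)
    return dict(groups)

def split_by_group(constraint, group):
    """One split separates the whole intervened group from the rest."""
    hit = {t for t in constraint if t in group}
    return constraint - hit, hit

# Hypothetical simultaneous interventions: train three employees at once.
do = {("Train", ("alice", "t1")): True,
      ("Train", ("bob", "t1")): True,
      ("Train", ("dave", "t1")): True}
groups = group_interventions(do)
C = {(e, "t1") for e in ["alice", "bob", "dave", "eve"]}
rest, trained = split_by_group(C, groups[("Train", True)])
# All three trained employees remain a single group; only eve is separate.
```

With one split per group, the trained employees can still be handled by a single representative afterwards, whereas a propositional model would manipulate the parents of each trained employee individually.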
Next, we investigate the practical performance of PCFGs and, in particular, the LCI algorithm for the computation of interventional distributions.
## 5 Experiments
In this section, we evaluate the run times needed to compute the result of interventional queries in Bayesian networks, directed FGs, and PCFGs. For our experiments, we use a slightly modified version of the PCFG given in Fig. 3 which can directly be translated into a Bayesian network without having to combine multiple parent factors into a single conditional probability table. More specifically, to obtain the corresponding directed FG, we simply ground the PCFG, and to obtain the equivalent Bayesian network, we use the transformation from directed FG to Bayesian network given by Frey (2003). Note that the PCFG used in our experiments to demonstrate the practical efficiency of lifted causal inference is rather small, with four parfactors and four PRVs, and the gain we obtain from lifted inference further increases for models consisting of more PRVs.
Figure 5: A comparison of the run times required to compute interventional distributions on different graphical models encoding equivalent full joint probability distributions.
We test the required run time for each of the three graphical models on different graph sizes by setting the domain size of the employees to $d=8,16,32,\dots,4096$ and having a single training program for each choice of $d$ (i.e., $\lvert\mathcal{D}(E)\rvert=d$ and $\lvert\mathcal{D}(T)\rvert=1$ ). Figure 5 shows the run times needed to compute an interventional distribution for a single intervention in the modified graph when running variable elimination on the directed FG, variable elimination on the Bayesian network, and LVE on the PCFG. The variable elimination algorithm is the propositional counterpart of LVE and operates on a propositional (ground) model, such as a Bayesian network or an FG. Consequently, variable elimination considers every object represented by a randvar (e.g., every employee) individually for computations, independent of whether there are objects behaving exactly the same. In contrast, LVE treats identically behaving objects as a group by using a representative for computations instead of considering each of those objects separately. The results emphasise that the LCI algorithm, which internally exploits LVE, overcomes scalability issues for large domain sizes as the run time of LVE, in contrast to the run times of variable elimination on the Bayesian network and the directed FG, does not exponentially increase with $d$ (y-axis is log-scaled). To conclude, PCFGs not only provide us with expressive probabilistic graphical models for relational domains but also enable us to drastically speed up causal inference by reasoning over sets of indistinguishable objects.
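The scaling behaviour has a simple arithmetic core. The following illustrative sketch (not the benchmark code used here) contrasts ground elimination, which performs one pass per individual, with lifted elimination, which sums over a single representative and exponentiates by the group size:

```python
# A hedged illustration of why LVE scales: summing out d indistinguishable
# randvars that all share one potential collapses a product of d identical
# sums into a single exponentiation.

def eliminate_ground(phi, d):
    """Propositional variable elimination: one summation per randvar,
    i.e., O(d) work in the domain size d."""
    result = 1.0
    for _ in range(d):              # one pass per individual employee
        result *= sum(phi.values())
    return result

def eliminate_lifted(phi, d):
    """Lifted elimination: sum once over the representative, then
    exponentiate by the group size (constant in d)."""
    return sum(phi.values()) ** d

phi = {True: 0.6, False: 0.5}       # hypothetical boolean potential
ground = eliminate_ground(phi, 64)
lifted = eliminate_lifted(phi, 64)  # agrees with the ground result
```

Both computations yield the same value, but only the lifted one avoids iterating over all individuals, which is the effect visible in Fig. 5.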
## 6 Conclusion
We introduce PCFGs to combine lifted probabilistic inference in relational domains with causal inference, thereby allowing for lifted causal inference. To leverage the power of lifted inference for causal effect computation, we present the LCI algorithm, which operates on a lifted level and thus allows us to drastically speed up causal inference. LCI is a simple, yet effective algorithm to compute the effect of (multiple simultaneous) interventions, and builds on the well-founded LVE algorithm, thereby allowing LCI to be plugged into parameterised decision models (Gehrke et al., 2019b) to compute the maximum expected utility in accordance with Pearl (2009).
PCFGs open up interesting directions for future work. A basic problem is to learn a PCFG directly from a relational database. Following up on learning PCFGs from data, another constitutive problem for future research is to relax the assumption of having a fully directed PCFG at hand, i.e., to allow PCFGs to contain both directed and undirected edges at the same time and investigate the implications for answering causal queries.
This work is partially funded by the BMBF project AnoMed 16KISA057 and 16KISA050K, and is also partially supported by the Medical Cause and Effects Analysis (MCEA) project and partially funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy, EXC 2176 "Understanding Written Artefacts: Material, Interaction and Transmission in Manuscript Cultures", project no. 390893796.
## References
- [Ahsan et al.(2022)Ahsan, Arbour, and Zheleva] Ragib Ahsan, David Arbour, and Elena Zheleva. Relational Causal Models with Cycles: Representation and Reasoning. In Proceedings of the First Conference on Causal Learning and Reasoning (CLeaR-22), pages 1–18. PMLR, 2022.
- [Ahsan et al.(2023)Ahsan, Arbour, and Zheleva] Ragib Ahsan, David Arbour, and Elena Zheleva. Learning Relational Causal Models with Cycles through Relational Acyclification. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI-23), pages 12164–12171. AAAI Press, 2023.
- [Apsel and Brafman(2012)] Udi Apsel and Ronen I. Brafman. Lifted MEU by Weighted Model Counting. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI-12), pages 1861–1867. AAAI Press, 2012.
- [Arbour et al.(2016)Arbour, Garant, and Jensen] David Arbour, Dan Garant, and David Jensen. Inferring Network Effects from Observational Data. In Proceedings of the Twenty-Second ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-16), pages 715–724. Association for Computing Machinery, 2016.
- [Braun and Gehrke(2022)] Tanya Braun and Marcel Gehrke. Explainable and Explorable Decision Support. In Proceedings of the Twenty-Seventh International Conference on Conceptual Structures (ICCS-22), pages 99–114. Springer, 2022.
- [Braun and Möller(2016)] Tanya Braun and Ralf Möller. Lifted Junction Tree Algorithm. In Proceedings of KI 2016: Advances in Artificial Intelligence (KI-16), pages 30–42. Springer, 2016.
- [Braun and Möller(2018)] Tanya Braun and Ralf Möller. Parameterised Queries and Lifted Query Answering. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18), pages 4980–4986. IJCAI Organization, 2018.
- [De Salvo Braz et al.(2005)De Salvo Braz, Amir, and Roth] Rodrigo De Salvo Braz, Eyal Amir, and Dan Roth. Lifted First-Order Probabilistic Inference. In Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence (IJCAI-05), pages 1319–1325. Morgan Kaufmann Publishers Inc., 2005.
- [De Salvo Braz et al.(2006)De Salvo Braz, Amir, and Roth] Rodrigo De Salvo Braz, Eyal Amir, and Dan Roth. MPE and Partial Inversion in Lifted Probabilistic Variable Elimination. In Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-06), pages 1123–1130. AAAI Press, 2006.
- [Frey(2003)] Brendan J. Frey. Extending Factor Graphs so as to Unify Directed and Undirected Graphical Models. In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI-03), pages 257–264. Morgan Kaufmann Publishers Inc., 2003.
- [Frey et al.(1997)Frey, Kschischang, Loeliger, and Wiberg] Brendan J. Frey, Frank R. Kschischang, Hans-Andrea Loeliger, and Niclas Wiberg. Factor Graphs and Algorithms. In Proceedings of the Thirty-Fifth Annual Allerton Conference on Communication, Control, and Computing, pages 666–680. Allerton House, 1997.
- [Gehrke et al.(2018)Gehrke, Braun, and Möller] Marcel Gehrke, Tanya Braun, and Ralf Möller. Lifted Dynamic Junction Tree Algorithm. In Proceedings of the Twenty-Third International Conference on Conceptual Structures (ICCS-2018), pages 55–69. Springer, 2018.
- [Gehrke et al.(2019a)Gehrke, Braun, and Möller] Marcel Gehrke, Tanya Braun, and Ralf Möller. Lifted Temporal Maximum Expected Utility. In Proceedings of the Thirty-Second Canadian Conference on Artificial Intelligence (CANAI-19), pages 380–386. Springer, 2019a.
- [Gehrke et al.(2019b)Gehrke, Braun, Möller, Waschkau, Strumann, and Steinhäuser] Marcel Gehrke, Tanya Braun, Ralf Möller, Alexander Waschkau, Christoph Strumann, and Jost Steinhäuser. Lifted Maximum Expected Utility. In Proceedings of the First International Workshop on Artificial Intelligence in Health (AIH-18), pages 131–141. Springer, 2019b.
- [Gehrke et al.(2020)Gehrke, Möller, and Braun] Marcel Gehrke, Ralf Möller, and Tanya Braun. Taming Reasoning in Temporal Probabilistic Relational Models. In Proceedings of the Twenty-Fourth European Conference on Artificial Intelligence (ECAI-20), pages 2592–2599. IOS Press, 2020.
- [Kisyński and Poole(2009)] Jacek Kisyński and David Poole. Constraint Processing in Lifted Probabilistic Inference. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI-09), pages 293–302. AUAI Press, 2009.
- [Kschischang et al.(2001)Kschischang, Frey, and Loeliger] Frank R. Kschischang, Brendan J. Frey, and Hans-Andrea Loeliger. Factor Graphs and the Sum-Product Algorithm. IEEE Transactions on Information Theory, 47:498–519, 2001.
- [Lee and Honavar(2015)] Sanghack Lee and Vasant Honavar. Lifted Representation of Relational Causal Models Revisited: Implications for Reasoning and Structure Learning. In Proceedings of the UAI 2015 Conference on Advances in Causal Inference, pages 56–65. CEUR, 2015.
- [Lee and Honavar(2016)] Sanghack Lee and Vasant Honavar. On Learning Causal Models from Relational Data. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16), pages 3263–3270. AAAI Press, 2016.
- [Lee and Honavar(2019)] Sanghack Lee and Vasant Honavar. Towards Robust Relational Causal Discovery. In Proceedings of The Thirty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI-19), pages 345–355. PMLR, 2019.
- [Maier et al.(2010)Maier, Taylor, Oktay, and Jensen] Marc Maier, Brian Taylor, Huseyin Oktay, and David Jensen. Learning Causal Models of Relational Domains. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10), pages 531–538. AAAI Press, 2010.
- [Maier et al.(2013)Maier, Marazopoulou, Arbour, and Jensen] Marc Maier, Katerina Marazopoulou, David Arbour, and David Jensen. A Sound and Complete Algorithm for Learning Causal Models from Relational Data. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI-13), pages 371–380. AUAI Press, 2013.
- [Milch et al.(2008)Milch, Zettlemoyer, Kersting, Haimes, and Kaelbling] Brian Milch, Luke S. Zettlemoyer, Kristian Kersting, Michael Haimes, and Leslie Pack Kaelbling. Lifted Probabilistic Inference with Counting Formulas. In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (AAAI-08), pages 1062–1068. AAAI Press, 2008.
- [Niepert and Van den Broeck(2014)] Mathias Niepert and Guy Van den Broeck. Tractability through Exchangeability: A New Perspective on Efficient Probabilistic Inference. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence (AAAI-14), pages 2467–2475. AAAI Press, 2014.
- [Pearl(1986)] Judea Pearl. Fusion, Propagation, and Structuring in Belief Networks. Artificial Intelligence, 29:241–288, 1986.
- [Pearl(1988)] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
- [Pearl(1995)] Judea Pearl. Causal Diagrams for Empirical Research. Biometrika, 82:669–688, 1995.
- [Pearl(2009)] Judea Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2nd edition, 2009.
- [Pearl et al.(2016)Pearl, Glymour, and Jewell] Judea Pearl, Madelyn Glymour, and Nicholas P. Jewell. Causal Inference in Statistics: A Primer. Wiley, 1st edition, 2016.
- [Poole(2003)] David Poole. First-Order Probabilistic Inference. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI-03), pages 985–991. Morgan Kaufmann Publishers Inc., 2003.
- [Russell and Norvig(2020)] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Pearson, 4th edition, 2020.
- [Salimi et al.(2020)Salimi, Parikh, Kayali, Getoor, Roy, and Suciu] Babak Salimi, Harsh Parikh, Moe Kayali, Lise Getoor, Sudeepa Roy, and Dan Suciu. Causal Relational Learning. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pages 241–256. Association for Computing Machinery, 2020.
- [Spirtes et al.(2000)Spirtes, Glymour, and Scheines] Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. MIT Press, 2nd edition, 2000.
- [Taghipour(2013)] Nima Taghipour. Lifted Probabilistic Inference by Variable Elimination. PhD thesis, KU Leuven, 2013.
- [Taghipour et al.(2013a)Taghipour, Fierens, Davis, and Blockeel] Nima Taghipour, Daan Fierens, Jesse Davis, and Hendrik Blockeel. Lifted Variable Elimination: Decoupling the Operators from the Constraint Language. Journal of Artificial Intelligence Research, 47:393–439, 2013a.
- [Taghipour et al.(2013b)Taghipour, Fierens, Van den Broeck, Davis, and Blockeel] Nima Taghipour, Daan Fierens, Guy Van den Broeck, Jesse Davis, and Hendrik Blockeel. Completeness Results for Lifted Variable Elimination. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics (AISTATS-13), pages 572–580. PMLR, 2013b.
- [Winn(2012)] John Winn. Causality with Gates. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS-12), pages 1314–1322. PMLR, 2012.