# Neuro-Symbolic Inductive Logic Programming with Logical Neural Networks
**Authors**: Prithviraj Sen, Breno W. S. R. de Carvalho, Ryan Riegel, Alexander Gray
IBM Research
## Abstract
Recent work on neuro-symbolic inductive logic programming has led to promising approaches that can learn explanatory rules from noisy, real-world data. While some proposals approximate logical operators with differentiable operators from fuzzy or real-valued logic that are parameter-free, thus diminishing their capacity to fit the data, other approaches are only loosely based on logic, making it difficult to interpret the learned 'rules'. In this paper, we propose learning rules with the recently proposed logical neural networks (LNN). Compared to others, LNNs offer a strong connection to classical Boolean logic, thus allowing for precise interpretation of learned rules, while harboring parameters that can be trained with gradient-based optimization to effectively fit the data. We extend LNNs to induce rules in first-order logic. Our experiments on standard benchmarking tasks confirm that LNN rules are highly interpretable and can achieve comparable or higher accuracy due to their flexible parameterization.
## 1 Introduction
Inductive logic programming (ILP) (Muggleton 1996) is a task of long-standing interest whose goal is to learn logical rules from labeled data. Since rules are explicitly symbolic, they provide certain advantages over black-box models. For instance, learned rules can be inspected, understood, and verified, forming a convenient means of storing learned knowledge. Consequently, a number of approaches have been proposed to address ILP including, but not limited to, statistical relational learning (Getoor and Taskar 2007) and, more recently, neuro-symbolic methods.
Since logical operators such as conjunction and disjunction are not differentiable, one issue that most neuro-symbolic ILP techniques must address is how to learn rules using gradient-based optimization. A popular solution is to employ extensions from fuzzy or real-valued logic that are either differentiable or have subgradients available. For instance, NeuralLP (Yang, Yang, and Cohen 2017) substitutes logical conjunction with the product t-norm (x ∧ y ≡ xy) and logic tensor networks (Donadello, Serafini, and d'Avila Garcez 2017) with the Łukasiewicz t-norm (x ∧ y ≡ max(0, x + y − 1)). In an interesting experiment, Evans and Grefenstette (2018) show that among the various options, the product t-norm seems to lead to the best ILP results. This indicates that the learning approach, rather than the user, should be in charge of substituting logical connectives besides learning the rules themselves. Neural logic machines (NLM) (Dong et al. 2019) achieve this, but at the cost of interpretability. More precisely, NLM models propositional formulae (consisting of conjunctions, disjunctions and/or negations) with multi-layer perceptrons (MLP). Once trained, an NLM may not be interpretable as rules since there exists no standard translation from MLPs to logic. What is needed is an extension of classical logic with ties strong enough to be amenable to interpretation that can learn not only rules but also the logical connectives using gradient-based optimization.
In this paper, we propose ILP with the recently proposed logical neural networks (LNN) (Riegel et al. 2020). Instead of forcing the user to choose a function that mimics a logical connective, LNNs employ constraints to ensure that neurons behave like conjunctions or disjunctions. By decoupling neuron activation from the mechanism that ensures it behaves like a logical connective, LNNs offer tremendous flexibility in how to parameterize neurons, thus ensuring that they fit the data better while maintaining close connections with classical Boolean logic which, in turn, facilitates principled interpretation. We propose first-order extensions of LNNs that can tackle ILP. Since vanilla backpropagation is insufficient for constrained optimization, we propose flexible learning algorithms capable of handling a variety of (linear) inequality and equality constraints. We experiment with diverse benchmarks for ILP including gridworld and knowledge base completion (KBC) that call for learning different kinds of rules, and show how our approach can tackle both effectively. In fact, our KBC results represent a 4-16% relative improvement (in terms of mean reciprocal rank) upon the current best rule-based KBC results on popular KBC benchmarks. Additionally, we show that joint learning of rules and logical connectives leads to LNN rules that are easier to interpret vs. other neuro-symbolic ILP approaches.
## 2 Related Work
∂ILP (Evans and Grefenstette 2018) is another neuro-symbolic ILP technique whose main parameter is a tensor with one cell per candidate logic program. Since the number of candidates is exponential in both the number of available predicates and the number of constituent rules in the program, ∂ILP's complexity is exponential, making it impractical for all but the smallest learning tasks.
To rein in the complexity, ∂ILP asks the user to specify the ILP task using a template consisting of two rules, each containing a maximum of two predicates. In reality, most neuro-symbolic ILP approaches ask the user to specify a template. NeuralLP's (Yang, Yang, and Cohen 2017) template is meant for link prediction in incomplete knowledge bases (sometimes called open path or chain rule), only includes binary predicates, and is of the form T(X_1, X_n) ← R_1(X_1, X_2) ∧ R_2(X_2, X_3) ∧ … ∧ R_{n-1}(X_{n-1}, X_n), positing that the head predicate T(X_1, X_n) can be modeled as a path from X_1 to X_n. NLM (Dong et al. 2019) is restricted to learning rules where the head and body predicates each contain the same set of variables. For instance, to model the rule learned by NeuralLP, NLM would first need to add arguments to R_1 so that the new predicate contains all variables including X_3, …, X_n. In contrast, our approach can use more flexible templates that can express programs beyond two rules, allow the use of n-ary predicates (n possibly > 2), and allow the head to have fewer variables than the body, thus going beyond all of the above-mentioned approaches.
Neural theorem provers (NTP) (Rocktäschel and Riedel 2017) generalize the notion of unification by embedding logical constants into a high-dimensional latent space. NTPs can achieve ILP by learning the embedding for the unknown predicate which forms part of the rule, and subsequently comparing with embeddings of known predicates. NTPs can have difficulty scaling to real-world tasks since the decision to unify two constants is no longer Boolean-valued, leading to an explosion of proof paths that need to be explored. To improve scalability, recent proposals either greedily choose (GNTP (Minervini et al. 2020a)) or learn to choose (CTP (Minervini et al. 2020b)) the most promising proof paths. We compare against CTP in Section 5.
Lifted relational neural networks (LRNN) (Sourek et al. 2017) model conjunctions using a (generalized) sigmoid but fix certain parameters to ensure that it behaves similarly to the Łukasiewicz t-norm. This limits how well LRNNs can model the data, which is contrary to our goal as stated in the previous section. While other combinations of logic and neural networks exist, e.g., logic tensor networks (Donadello, Serafini, and d'Avila Garcez 2017), RelNN (Kazemi and Poole 2018), and DeepProbLog (Manhaeve et al. 2018), to the best of our knowledge, none of these learn rules to address ILP.
## 3 Generalized Propositional Logic with Logical Neural Networks
Logical neural networks (LNN) (Riegel et al. 2020) allow the use of almost any parameterized function as a logical connective. We illustrate how LNNs generalize conjunction (∧). Let 0 denote false and 1 denote true. Let a, b ∈ {0, 1} and x, y ∈ [0, 1] denote Boolean-valued and continuous-valued variables, respectively. While Boolean logic defines the output of ∧ when x, y attain the extremities of their permissible domains (shown in Figure 1 (a)), to fully define real-valued logic's ∧ we need to also extend its definition to x, y ∈ (0, 1). Intuitively, the characteristic shape of ∧ is to produce 1) a low output when either input is low, and 2) a high output when both inputs are high. A simple way to capture low vs. high is
Figure 1: (a) Truth table for ∧ in Boolean logic, and (b) Shape of ∧ extended to real-valued logic.
| a | b | a ∧ b |
|---|---|-------|
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |

(a)

| x | y | x ∧ y |
|------------|------------|------------|
| [0, 1 − α] | [0, 1 − α] | [0, 1 − α] |
| [0, 1 − α] | (1 − α, 1] | [0, 1 − α] |
| (1 − α, 1] | [0, 1 − α] | [0, 1 − α] |
| [α, 1] | [α, 1] | [α, 1] |

(b)
via a user-defined hyperparameter α ∈ (1/2, 1]: x ∈ [0, 1 − α] constitutes low and x ∈ [α, 1] constitutes high. Figure 1 (b) expresses ∧ for real-valued logic in terms of α.
LNNs propose constraints to enforce the shape of ∧. Let f : [0, 1] × [0, 1] → [0, 1] denote a function that is monotonically increasing in both inputs, i.e., f(x, y′) ≥ f(x, y) ∀ y′ ≥ y and f(x′, y) ≥ f(x, y) ∀ x′ ≥ x. In accordance with Figure 1 (b), LNNs enforce the following constraints:
$$f(x, y) \leq 1 - \alpha, \quad \forall x, y \in [0, 1 - \alpha]$$
$$f(x, y) \leq 1 - \alpha, \quad \forall x \in [0, 1 - \alpha],\ \forall y \in (1 - \alpha, 1] \ \text{(and symmetrically in $x$ and $y$)}$$
$$f(x, y) \geq \alpha, \quad \forall x, y \in [\alpha, 1]$$
Since f is monotonically increasing, we can move all constraints to their corresponding extremities and eliminate the first constraint since it is redundant given the second and third.
$$f(1 - \alpha, 1) \leq 1 - \alpha, \quad f(1, 1 - \alpha) \leq 1 - \alpha, \quad f(\alpha, \alpha) \geq \alpha$$
Further simplifications may be obtained for specific choices of f. For inspiration, we look towards the triangular norm (t-norm) (Esteva and Godo 2001), which is defined as a symmetric, associative, and non-decreasing function T : [0, 1]² → [0, 1] satisfying the boundary condition T(1, x) = x, ∀ x ∈ [0, 1]. Popular t-norms include the product t-norm, xy, and the Łukasiewicz t-norm, max(0, x + y − 1). We extend the latter to define LNN-∧ (other t-norms may be extended similarly):
$$\text{LNN-}{\wedge}(x, y; \beta, w_1, w_2) = \begin{cases} 0 & \text{if } \beta - w_1(1 - x) - w_2(1 - y) < 0 \\ 1 & \text{if } \beta - w_1(1 - x) - w_2(1 - y) > 1 \\ \beta - w_1(1 - x) - w_2(1 - y) & \text{otherwise} \end{cases}$$
where β, w_1, w_2 denote learnable parameters subject to the following constraints translated from above¹:

$$\beta - \alpha w_1 \leq 1 - \alpha, \quad \beta - \alpha w_2 \leq 1 - \alpha, \quad \beta - (1 - \alpha)(w_1 + w_2) \geq \alpha$$
To ensure that LNN-∧ is monotonically increasing, we also need to enforce non-negativity of w 1 , w 2 . It is easy to extend LNN-∧ to an n -ary conjunction ( n ≥ 2 ):
$$\text{LNN-}{\wedge}(\mathbf{x}; \beta, \mathbf{w}) = \mathrm{relu1}\big(\beta - \mathbf{w}^T(\mathbf{1} - \mathbf{x})\big), \qquad \beta - \alpha \mathbf{w} \leq (1 - \alpha)\mathbf{1}, \quad \beta - (1 - \alpha)\mathbf{1}^T\mathbf{w} \geq \alpha \quad (1)$$
1 We remove the upper and lower clamps since they do not apply in the active region of the constraints.
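As an illustrative sketch (not the paper's implementation), the n-ary LNN-∧ and its Equation 1 constraints can be written in a few lines of NumPy; the parameter values below are hypothetical choices used only for demonstration:

```python
import numpy as np

def relu1(z):
    """Clamp to [0, 1]: max(0, min(1, z))."""
    return np.clip(z, 0.0, 1.0)

def lnn_and(x, beta, w):
    """n-ary LNN conjunction: relu1(beta - w^T (1 - x)).
    With beta = 1 and w = (1, 1) this recovers the Lukasiewicz t-norm."""
    x, w = np.asarray(x, float), np.asarray(w, float)
    return float(relu1(beta - w @ (1.0 - x)))

def check_and_constraints(beta, w, alpha):
    """Check the Equation 1 constraints: beta - alpha*w_i <= 1 - alpha for
    each input i, beta - (1 - alpha) * sum(w) >= alpha, and w >= 0."""
    w = np.asarray(w, float)
    return bool(np.all(w >= 0)
                and np.all(beta - alpha * w <= 1 - alpha)
                and beta - (1 - alpha) * w.sum() >= alpha)
```

Note that the Łukasiewicz setting β = 1, w = (1, 1) violates the high-output constraint at α = 0.7 (it stays low until x + y > 1), which is exactly the inflexibility the learnable parameters address.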
Figure 2: (a) Łukasiewicz t-norm vs. (b) product t-norm vs. LNN-∧ with (c) α = 0.7, (d) α = 0.9. (e) LNN-∨ (α = 0.7).
where relu1(x) denotes max(0, min(1, x)) (Krizhevsky 2010), and x, w, and 1 denote vectors of continuous-valued inputs, weights, and 1s, respectively.
Note that Figure 1 (b) does not enforce constraints on 1 − α < x, y < α. Essentially, α acts as a tunable knob that controls the size of this unconstrained region, where we can learn LNN operators unimpeded, which is arguably a better approach than choosing a parameter-less t-norm that arbitrarily interpolates from Boolean logic's ∧ to real-valued logic's ∧. Figure 2 illustrates how the Łukasiewicz (Figure 2 (a)) and product (Figure 2 (b)) t-norms differ from LNN-∧ learned by fitting to the four rows in Figure 1 (a)'s truth table. Even pictorially, LNN-∧ looks distinctly conjunction-like, i.e., when either x or y is low it produces a value close to 0 while rising quickly to 1 when both x, y are high. When α = 0.9 (Figure 2 (d)), the region devoid of constraints is larger than at α = 0.7 (Figure 2 (c)) (∵ [0.1, 0.9] ⊃ [0.3, 0.7]), so the curve can rise later to provide a better fit. In contrast, the Łukasiewicz t-norm remains at 0 until the x + y = 1 line, after which it increases linearly. The product t-norm is similar, adding a slight upward curve.
Other propositional logic operators include negation ( ¬ ) and disjunction ( ∨ ). LNN negation is given by 1 -x and LNN disjunction, LNN-∨ , is defined in terms of LNN-∧ :
$$\text{LNN-}{\vee}(\mathbf{x}; \beta, \mathbf{w}) = 1 - \text{LNN-}{\wedge}(\mathbf{1} - \mathbf{x}; \beta, \mathbf{w})$$
where constraints defined in Equation 1 apply. Figure 2 (e) pictorially depicts LNN-∨ (with α = 0 . 7 ). In contrast to Figure 2 (c), it clearly shows how maximum output is achieved for smaller values of x, y , as a disjunction operator should.
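Since LNN-∧(1 − x; β, w) = relu1(β − wᵀx), the disjunction simplifies to 1 − relu1(β − wᵀx). A minimal sketch, with function names of our own choosing:

```python
import numpy as np

def lnn_or(x, beta, w):
    """LNN disjunction via De Morgan: 1 - LNN-AND(1 - x; beta, w),
    which simplifies to 1 - relu1(beta - w^T x).
    With beta = 1 and w = (1, 1) this recovers min(1, x + y)."""
    x, w = np.asarray(x, float), np.asarray(w, float)
    return float(1.0 - np.clip(beta - w @ x, 0.0, 1.0))
```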
## 4 Learning First-Order LNNs
Following previous work (Yang, Yang, and Cohen 2017; Evans and Grefenstette 2018; Dong et al. 2019), we also utilize program templates, expressed in higher-order logic and provided by the user, to guide the learning in the right direction. Our definition of a program template draws inspiration from meta-interpretive learning (Muggleton et al. 2014). In contrast to previous work on neuro-symbolic AI, however, our definition of a program template is more general and includes as special cases the templates utilized by Evans and Grefenstette (considers only rules whose body contains up to 2 predicates), Yang, Yang, and Cohen (considers only binary predicates), and Dong et al. (considers only rules whose head includes all variables contained in the body). After introducing our logic program template, we then describe how to combine it with data to construct a neural network that may then be trained to learn the logic program of interest.
Let pred ( X 1 , . . . X n ) denote an n -ary predicate which returns true ( 1 ) or false ( 0 ) for every possible joint assignment of X 1 , . . . X n to constants in the knowledge base. The main construct in first-order logic (FOL) is a rule or clause :
$$h \leftarrow b_1 \wedge \ldots \wedge b_m$$
where b_1, …, b_m denote predicates in its body and h denotes the head predicate. If the conjunction of all predicates in the body is true then the head is also true. The head h in a clause may contain fewer logical variables than the body b_1, …, b_m, which means that there must exist an assignment to the missing variables for the head to hold. More precisely, if B = b_1 ∧ … ∧ b_m denotes the body and is defined over logical variables Y, then h(X), such that X ⊆ Y, is true if ∃ Y \ X : B(Y) is true. Assignments that lead to a predicate being true are also called facts.
Figure 3 (a) introduces a toy knowledge base (KB) which will serve as a running example. The KB contains three binary predicates, each containing their respective facts along with a unique identifier for easy reference. Thus, A ( X,Y ) 's facts are A (1 , 2) (denoted a 1 ) and A (1 , 5) (denoted a 2 ). Figure 3 (b) shows a template for learning R ( X,Z ) with three crucial pieces of information: 1) the first predicate in its body P is binary and while we do not know the identity of this predicate we know its domain Dom ( P ) is { A,B } , 2) to keep the example simple, the second predicate in the body Q has a singleton domain ( { C } ), and lastly 3) the second argument in the first predicate should be equal to the first argument in the second predicate indicated by repeated use of Y . Figure 3 (b) (bottom) expresses the same template as a tree where P ( X,Y ) and Q ( Y, Z ) are annotated with their respective domains forming children of R ( X,Z ) whose label ∧ indicates that the predicates in its body are to be conjuncted together.
Figure 3 (c) shows a more complex template that disjuncts R ( X,Z ) with O ( X,Z ) , such that Dom ( O ) = { A,B } , to produce S ( X,Z ) . Figure 3 (d) shows generated facts (with unknown truth-values) that can be possibly produced when this template is combined with the KB from Figure 3 (a). P, Q and O contain the union of all facts included in the predicates in their respective domains. Since p 1 and q 1 are the only two facts that agree on value of Y , PQ , the predicate produced by the body of R , contains only one generated fact. R is obtained by dropping Y , which is then unioned with O to produce S . By comparing generated facts in S with labels in the training data, it is possible to learn a full program which in this case constitutes learning: 1) which predicates to replace P, Q and O with, and 2) the logical connectives, LNN-∧ and LNN-∨ , used to model R and S with, respectively. We next state a more formal problem definition.
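The fact generation just described can be sketched symbolically for the toy KB of Figure 3, using plain Python sets and Boolean semantics only (the learnable LNN parameters are deliberately omitted here):

```python
# Toy KB from Figure 3 (a): each predicate is a set of tuples (its facts).
A = {(1, 2), (1, 5)}
B = {(1, 2)}
C = {(2, 5)}

# Leaves take the union of facts over their domains:
# Dom(P) = {A, B}, Dom(Q) = {C}, Dom(O) = {A, B}.
P = A | B
Q = C
O = A | B

# Body of R: join P(X, Y) and Q(Y, Z) on the shared variable Y.
PQ = {(x, y, z) for (x, y) in P for (y2, z) in Q if y == y2}

# Head R(X, Z): drop the existentially quantified variable Y.
R = {(x, z) for (x, _, z) in PQ}

# S(X, Z) = R(X, Z) OR O(X, Z): union of generated facts.
S = R | O
```

Running this reproduces the generated facts of Figure 3 (d): PQ contains only pq1 = (1, 2, 5), R contains r1 = (1, 5), and S contains s1 = (1, 5) and s2 = (1, 2).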
Figure 3: (a) A toy KB. (b) An example rule template (top) and its tree form (bottom). (c) A more complex program template. (d) Generated facts for our running example.
Let T = ( V , E , L ) denote a tree-structured program template where V denotes the set of nodes, E denotes the set of edges and L denotes a mapping from V to node labels. L maps T 's leaves to the corresponding domain of predicates in the KB. In the worst case, the domain can be the subset of predicates in the KB that agrees with the arity of the leaf. L maps internal nodes to a logical operator {∧ , ∨ , ¬} . The ILP task can then be stated as, given T , a knowledge base KB, and truth-values corresponding to generated facts in the root of T , to learn all logical connectives involved along with selecting predicates for each leaf in T .
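As a sketch, a tree-structured template such as Figure 3 (c)'s could be represented with a simple recursive data structure; the names below are illustrative, not from our implementation:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TemplateNode:
    """A node in a tree-structured program template T = (V, E, L).
    Leaves carry a domain of candidate KB predicate names; internal
    nodes carry a logical connective ('and', 'or', 'not')."""
    variables: tuple                                  # e.g. ('X', 'Y')
    connective: Optional[str] = None                  # None for leaves
    domain: List[str] = field(default_factory=list)   # candidate predicates (leaves only)
    children: List["TemplateNode"] = field(default_factory=list)

# Template from Figure 3 (c): S(X,Z) = (P(X,Y) AND Q(Y,Z)) OR O(X,Z),
# with Dom(P) = {A, B}, Dom(Q) = {C}, Dom(O) = {A, B}.
template = TemplateNode(
    variables=('X', 'Z'), connective='or',
    children=[
        TemplateNode(variables=('X', 'Z'), connective='and',
                     children=[TemplateNode(('X', 'Y'), domain=['A', 'B']),
                               TemplateNode(('Y', 'Z'), domain=['C'])]),
        TemplateNode(('X', 'Z'), domain=['A', 'B']),
    ])
```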
In the remainder of this section, we describe how to achieve the above ILP task given ground truth values for generated facts belonging to the root of the template. Let ψ ( v ) denote the truth value associated with (generated) fact v . Our strategy is to build a neural network that connects the truth values of generated facts from the root of the template to other (generated) facts in its lineage right down to the facts in the base KB whose truth values are defined to be 1 . The whole neural network can then be trained end-to-end using backpropagation. Let V ( X ) denote any node in T whose predicate is defined over variables X and whose children in T is denoted by N ( V ) . Also, let V ( x ) denote a fact obtained by the substitution X = x and F ( V ) denote all facts of V .
## 4.1 Combining Base Facts
Let V(X) denote a leaf in T with domain L(V); then F(V) is given by ⋃_{P ∈ L(V)} F(P). Computing ψ(V(X = x)) for a substitution X = x requires the truth values of all facts corresponding to the same substitution. We provide two options that associate parameters with predicates in L(V): 1) attention (Yang, Yang, and Cohen 2017) and 2) our proposed LNNpred operator:
$$1)\ \psi(V(\mathbf{x})) = \sum_{P \in \mathcal{L}(V)} w_P\, \psi(P(\mathbf{x})) \qquad 2)\ \psi(V(\mathbf{x})) = 1 - \mathrm{relu1}\Big(\beta - \sum_{P \in \mathcal{L}(V)} w_P\, \psi(P(\mathbf{x}))\Big)$$
where β, w_P denote learnable parameters. One issue with attention is that it may lack sparsity, assigning non-zero weight to a majority of predicates in L(V) and thus hampering interpretability. To address this, we propose an alternate parameterization, LNNpred, a simpler version of LNN-∨ that lacks all constraints except non-negativity of w_P (since we do not require disjunctive semantics). As an example, the lower left corner of Figure 4 shows how to compute ψ(p_1) from ψ(a_1), ψ(b_1) since a_1, b_1 form the lineage of p_1 (Figure 3). Here, β_1, w_1 denote LNNpred's parameters.
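The two options can be sketched as follows; function names are illustrative, and we assume softmax-normalized attention weights (the σ in Figure 4):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def attention_combine(truths, logits):
    """Option 1: attention -- a softmax-weighted sum of the candidate
    predicates' truth values for the same substitution."""
    return float(softmax(np.asarray(logits, float)) @ np.asarray(truths, float))

def lnnpred_combine(truths, beta, w):
    """Option 2: LNNpred -- a simplified LNN-OR, 1 - relu1(beta - w^T psi),
    constrained only by non-negativity of w."""
    w = np.asarray(w, float)
    assert np.all(w >= 0), "LNNpred requires non-negative weights"
    return float(1.0 - np.clip(beta - w @ np.asarray(truths, float), 0.0, 1.0))
```

Unlike attention, LNNpred can drive irrelevant weights to exactly zero, which is what yields the sparser, more interpretable predicate selection.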
## 4.2 Combining Facts with Conjunction
We handle conjunctions in two steps: 1) first construct the result of the body of the clause, and then 2) construct the head, dropping variables if needed. Let V ( Y ) in T be such that L ( V ) = ∧ and N ( V ) denote its children. Also, let I ( X ) denote the intermediate predicate produced from the body of V ( Y ) potentially containing additional variables such that X ⊇ Y . We use LNN-∧ to compute ψ of I ( x ) :
$$\psi(I(\mathbf{x})) = \mathrm{relu1}\Big(\beta - \sum_{P \in N(V)} w_P \big(1 - \psi(P(\mathbf{x}_{var(P)}))\big)\Big)$$

where var(P) denotes predicate P's arguments and x_{var(P)} denotes the substitution x restricted to the variables in var(P). When X ⊃ Y, multiple facts from F(I) may combine to produce a fact in V(Y) and we use maxout for this:
$$\psi(V(\mathbf{y})) = \mathrm{maxout}_{\mathbf{x} \supseteq \mathbf{y}}\ \psi(I(\mathbf{x}))$$
where maxout ( { x 1 , . . . } ) (Goodfellow et al. 2013) returns the maximum of the set. Figure 4 shows how ψ ( pq 1 ) is computed from ψ ( p 1 ) , ψ ( q 1 ) where β 2 , w 2 denotes LNN-∧ 's parameters. Since pq 1 is the only intermediate fact leading to r 1 , we do not need maxout in this case. However, if that was not the case, Figure 4 shows where maxout would appear.
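The projection-with-maxout step can be sketched as follows; the representation of substitutions (tuples of variable/value pairs) is an illustrative choice:

```python
from collections import defaultdict

def project_with_maxout(intermediate, keep):
    """Project intermediate facts I(x) down to head facts V(y) by dropping
    variables not in `keep`, taking the maximum truth value (maxout) over
    all intermediate facts that map to the same head fact.
    `intermediate` maps substitutions, encoded as tuples of (variable,
    value) pairs, to truth values in [0, 1]."""
    grouped = defaultdict(list)
    for subst, psi in intermediate.items():
        head = tuple((v, val) for v, val in subst if v in keep)
        grouped[head].append(psi)
    return {head: max(vals) for head, vals in grouped.items()}
```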
## 4.3 Combining Facts with Disjunction
Given V ( X ) in T such that L ( V ) = ∨ , ψ ( V ( x )) can be computed using LNN-∨ :
$$\psi(V(\mathbf{x})) = 1 - \mathrm{relu1}\Big(\beta - \sum_{P \in N(V)} w_P\, \psi(P(\mathbf{x}))\Big)$$
Figure 4 shows how ψ(s_1) is computed from ψ(r_1) and ψ(o_2), where β_4, w_4 denote LNN-∨'s parameters.
## 4.4 Other Operations and Extensions
Implementing negation is more involved since it requires that we consider assignments that may lead to facts in V but are
Figure 4: Neural network constructed for s 1 ∈ F ( S ) . σ denotes softmax .
not present in its child predicate P. Given a universe of all possible assignments U, we express ψ(V(x)) as:

$$\psi(V(\mathbf{x})) = 1 - \psi(P(\mathbf{x})), \quad \forall \mathbf{x} \in U$$

where ψ(P(x)) is defined as 0 if P(x) ∉ F(P).
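A minimal sketch of this negation over a universe U of assignments, with illustrative names and a dict-based encoding of known truth values:

```python
def negate(universe, known_truths):
    """Negation over a universe U of possible substitutions:
    psi(NOT P(x)) = 1 - psi(P(x)), where psi(P(x)) defaults to 0
    for assignments not present among P's facts."""
    return {x: 1.0 - known_truths.get(x, 0.0) for x in universe}
```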
Note that any LNN operator introduced for V(X) is shared across all V(x) ∈ F(V) since the result of ILP should be agnostic to individual facts. Thus, even though the (sub)network constructed for V(x) may differ from V(x′)'s, e.g., s_2's neural network (not shown) is simpler than s_1's, gradient updates flow to the same LNN parameters. For simplicity, we only discussed templates comprising a tree of Horn clauses, but the ideas presented here easily extend to DAG-structured templates and to conditions beyond equality in the body of a clause, e.g., R(X, Z) ← P(X, Y) ∧ Q(Y′, Z) ∧ Y > Y′.
## 4.5 Learning Constrained Activations
So far, we have shown how to construct a neural network, comprising parameters of LNN operators, from a KB and a template, but we have not addressed how to enforce constraints on said parameters. More precisely, βi, wi, ∀i = 1,...,4 in Figure 4 need to satisfy the constraints associated with the LNN operators they parameterize (as described in Section 3). Since backpropagation does not handle constraints, we 'fold' the system of linear constraints into the neural network as additional layers, following Frerix, Nießner, and Cremers (2020). We note that Riegel et al. (2020) also devise a training algorithm for learning LNN operators, but it is tightly coupled to a specific kind of LNN operator called the tailored activation. For the small systems of inequality constraints introduced by LNN-∧, LNN-∨ and LNN-pred, the approach presented here conveniently allows learning all (constrained) parameters of LNNs using vanilla backpropagation alone.
Frerix, Nießner, and Cremers (2020) recently showed how to handle a system of linear inequality constraints of the form Az ≤ b, where A denotes a matrix containing the coefficients of the constraints, z denotes the parameters (in our case, some concatenation of β and w), and b denotes the constants in the constraints. We begin with the Minkowski-Weyl theorem:
Theorem 4.1. A set C = { z | Az ≤ b } is a convex polyhedron if and only if:
$$C = \{\, \Upsilon\mu + \Gamma\lambda \mid \mu \geq 0,\ \lambda \geq 0,\ \mathbf{1}^{\top}\lambda = 1 \,\}$$
where Υ and Γ contain a finite number of rays and vertices, respectively.
which states that there exists a translation from A, b to Υ, Γ obtained via the double-description method (Motzkin et al. 1953). Assuming we can generate non-negative vectors µ, λ and additionally ensure that λ sums to 1, one can access a point z = [β, w] from the feasible set C by computing Υµ + Γλ. Sampling vectors µ, λ can be achieved, for instance, by using relu (Nair and Hinton 2010) and softmax:
$$\mu = \mathrm{relu}(\tilde{\mu}), \qquad \lambda = \mathrm{softmax}(\tilde{\lambda})$$
where $\tilde{\mu}$ and $\tilde{\lambda}$ are unconstrained.
Additionally, these operations can be easily included into any neural network as additional layers. For instance, Figure 4 contains in dashed boxes the above set of layers needed to generate wi, βi, ∀i = 1,...,4. The resulting neural network is self-contained and can be trained end-to-end by vanilla backpropagation, thus ensuring that the learned LNN parameters satisfy their respective constraints.
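As a concrete illustration, the reparameterization above can be sketched in a few lines of numpy. This is a minimal sketch under our own naming, not the authors' implementation; `constrained_params` and the toy one-dimensional polyhedron are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def constrained_params(mu_raw, lam_raw, rays, vertices):
    """Map unconstrained parameters to a feasible point of the polyhedron
    C = {z | Az <= b} via z = Upsilon*mu + Gamma*lambda, where mu >= 0
    (relu) and lambda is non-negative and sums to 1 (softmax)."""
    mu = np.maximum(mu_raw, 0.0)   # relu -> non-negative ray weights
    lam = softmax(lam_raw)         # convex combination of vertices
    return rays @ mu + vertices @ lam

# Toy polyhedron: the 1-D set C = {z | z >= 0} = cone({1}) + conv({0}).
rays = np.array([[1.0]])       # Upsilon: one ray
vertices = np.array([[0.0]])   # Gamma: one vertex
z = constrained_params(np.array([-2.0]), np.array([0.0]), rays, vertices)
# z satisfies z >= 0 regardless of the raw (unconstrained) inputs.
```

Because any output is of the form Υµ + Γλ with feasible µ, λ, the constraints hold by construction, which is exactly what allows plain backpropagation on the raw parameters.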
## 5 Experiments
Our experiments compare ILP with LNN against other neuro-symbolic ILP approaches on standard benchmarks. We evaluate rules in terms of application-specific goodness metrics and interpretability.
## 5.1 Gridworld
The goal in Gridworld is to learn rules that can help an agent move across an N × N regular grid. Some of the cells on the grid are deemed obstacles that the agent cannot step onto, and the agent's goal is to arrive at the cell which has been deemed the destination.
Predicates, Template and Rewards: We include two kinds of base predicates to describe the grid: 1) HasObstacle-dir(X,Y) is true if the cell next to (X,Y) in direction dir contains an obstacle; 2) HasTarget-dir(X,Y) is true if the destination lies in direction dir from cell (X,Y). There are four directions, North, South, East, and West, and including their negated counterparts, ¬HasObstacle-dir(X,Y) and ¬HasTarget-dir(X,Y), brings the total number of base predicates to 8. The template is S(X,Y) ← P(X,Y) ∧ Q(X,Y), where P's domain includes all HasObstacle-dir predicates and their negated counterparts, and Q's domain includes all HasTarget-dir predicates and their negated counterparts. We set α = 0.8 and use a simple reward mechanism: +1 for moving towards the destination, -1 for moving
Figure 5: Gridworld: (a) Avg. Rewards vs. Training Grids. (b) LNN Rule and (c) LNN-∧ for GoSouth(X,Y).
away, and -2 for stepping on an obstacle. The learning objective is to maximize rewards on randomly generated 5 × 5 grids with 3 obstacles sampled uniformly at random. We test the learned rules on grids with 12 obstacles.
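To make the setup concrete, the base predicates can be sketched as follows. This is a hedged illustration with hypothetical function names; treating off-grid cells as obstacles and taking row 0 as the northern edge are our own assumptions, not specified above.

```python
import numpy as np

# Direction -> (row delta, col delta), assuming row 0 is the northern edge.
DIRS = {"North": (-1, 0), "South": (1, 0), "East": (0, 1), "West": (0, -1)}

def has_obstacle(grid, x, y, d):
    """HasObstacle-dir(X,Y): the cell adjacent to (x, y) in direction d
    contains an obstacle (cells off the grid count as obstacles here)."""
    dx, dy = DIRS[d]
    nx, ny = x + dx, y + dy
    if not (0 <= nx < grid.shape[0] and 0 <= ny < grid.shape[1]):
        return True
    return bool(grid[nx, ny])

def has_target(target, x, y, d):
    """HasTarget-dir(X,Y): the destination lies in direction d of (x, y)."""
    tx, ty = target
    if d == "North": return tx < x
    if d == "South": return tx > x
    if d == "East":  return ty > y
    return ty < y  # West

grid = np.zeros((5, 5), dtype=bool)
grid[2, 1] = True  # one obstacle
# From cell (1, 1): the obstacle lies to the South; target (4, 4) is South/East.
```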
Results: We compare our approach, based on LNN-pred and LNN-∧, against NeuralLP, which uses attention and the product t-norm. Figure 5 (a) shows mean rewards produced by the learned rules, averaged across all cells of 50 test grids, on the y-axis vs. the number of training grids observed on the x-axis. NeuralLP requires many more grids before it can learn the desired rules, whereas we learn almost perfect rules after observing as few as 20 training grids. Essentially, the extra learnable parameters in LNN-∧, compared to the parameter-free product t-norm, let us learn from fewer training grids. Figure 5 (b) shows the weights for the LNN rule for GoSouth(X,Y) on the edges and biases in parentheses. This rule allows the agent to go South if 1) the target is in that direction and 2) there is no obstacle to the immediate South of the current cell. Despite the leaves of the template containing 8 predicates each in their respective domains, the learned LNN-preds are quite sparse and thus highly interpretable. The left LNN-pred assigns 1 non-zero weight to ¬HasObstacleSouth out of all HasObstacle-dir predicates and their negated counterparts, and the right LNN-pred assigns 1 non-zero weight to HasTargetSouth out of all HasTarget-dir predicates and their negated counterparts. This may be due to the use of relu1 in LNN-pred, whose sparsity-inducing properties have been noted before (Krizhevsky 2010). In Figure 5 (c), we also plot GoSouth(X,Y)'s learned LNN-∧.
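For reference, one common presentation of the weighted real-valued conjunction behind LNN-∧ (Riegel et al. 2020) can be sketched as below; the weights used in the example are illustrative, not the values learned for GoSouth.

```python
def lnn_and(inputs, weights, beta):
    """Weighted real-valued conjunction in the style of LNN-AND
    (Riegel et al. 2020): beta - sum_i w_i * (1 - x_i), clamped to [0, 1].
    With beta = w_i = 1 this reduces to classical conjunction on {0, 1}."""
    s = beta - sum(w * (1.0 - x) for w, x in zip(weights, inputs))
    return max(0.0, min(1.0, s))

# Classical corner cases with beta = w1 = w2 = 1:
lnn_and([1.0, 1.0], [1.0, 1.0], 1.0)  # -> 1.0
lnn_and([0.0, 1.0], [1.0, 1.0], 1.0)  # -> 0.0
```

Because β and the wᵢ are trainable (subject to the constraints of Section 3), the surface in Figure 5 (c) can deviate from a fixed t-norm while remaining interpretable as a conjunction.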
## 5.2 Knowledge Base Completion
Knowledge base completion (KBC) is a standard benchmark for ILP. We experiment with publicly available KBC datasets Kinship, UMLS (Kok and Domingos 2007), WN18RR (Dettmers et al. 2018), and FB15K-237 (Toutanova and Chen 2015)² (see Table 1 for statistics). We compare against a host of rule-based KBC approaches: NeuralLP (Yang, Yang, and Cohen 2017), DRUM (Sadeghian et al. 2019), CTP (Minervini et al. 2020b), which is an improvement on neural theorem provers (Rocktäschel and Riedel 2017), and the recently proposed RNNLogic³ (Qu et al. 2021).
2 All available at github.com/shehzaadzd/MINERVA
3 We compare with rules-only RNNLogic ('w/o embd.'), since using embeddings is out of scope of this work.
Table 1: Statistics of KBC datasets.
| Name | Vertices | Predicates | Facts | Queries |
|-----------|------------|--------------|---------|-----------|
| UMLS | 135 | 49 | 5216 | 661 |
| Kinship | 104 | 26 | 10686 | 1074 |
| WN18RR | 40945 | 11 | 86835 | 3134 |
| FB15K-237 | 14505 | 237 | 272115 | 20466 |
Task Description and Template: A popular abstraction of the KBC task is to complete edges or triples missing from the knowledge graph (KG). More precisely, given a query 〈h, r, ?〉, where h denotes a source vertex and r denotes a relation from the KG, most KBC approaches provide a ranked list of destination vertices. Most of the aforementioned rule-based KBC approaches exclusively focus on learning chain FOL rules as discussed in Section 2. There are at least two prevalent approaches to learning chain rules for KBC. The first approach (Yang, Yang, and Cohen 2017; Sadeghian et al. 2019) represents each predicate in the body as a mixture of relations present in the KG, and subsequently combines these via a conjunction operator. Figure 3 (c)'s template captures this: the subtree rooted at R defines chain rules of length 2 and O captures length 1, thus enabling learning of chain rules capturing multi-hop paths of length up to 2. The only change we need to make to this template is to include all relations in the KG into the domains of the leaves. It is also easy to extend the template to learn longer chain rules. A different approach, pioneered by MINERVA (Das et al. 2018) and RNNLogic (Qu et al. 2021), is to define the chain rule as a mixture over all possible paths that can exist in the KG. This latter approach leads to more effective KBC, and thus we report results by expressing it in our LNN framework as follows: 1) express each possible multi-hop path as a base relation, and 2) use one LNN-pred operator to express a mixture over all multi-hop paths.
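The latter, path-mixture approach can be illustrated with a small numpy sketch. This is our own simplification: Boolean path-reachability matrices and a softmax mixture stand in for the LNN-pred over multi-hop paths, and the toy KG and names are hypothetical.

```python
import numpy as np
from itertools import product

def path_matrix(adj, path):
    """Boolean reachability matrix for a relation sequence: entry (h, t)
    is 1 if some multi-hop path h -r1-> ... -rk-> t exists in the KG."""
    m = np.eye(adj[path[0]].shape[0])
    for r in path:
        m = (m @ adj[r] > 0).astype(float)
    return m

def score(adj, h, weights):
    """Score all destinations for source h as a mixture (softmax weights)
    over every relation sequence of length <= 2, loosely mimicking one
    LNN-pred whose domain is the set of multi-hop paths."""
    rels = sorted(adj)
    paths = [(r,) for r in rels] + list(product(rels, repeat=2))
    w = np.exp(weights) / np.exp(weights).sum()
    return sum(wi * path_matrix(adj, p)[h] for wi, p in zip(w, paths))

# Toy KG with 3 vertices and 2 relations given as adjacency matrices.
adj = {"r1": np.array([[0, 1, 0], [0, 0, 0], [0, 0, 0]], float),
       "r2": np.array([[0, 0, 0], [0, 0, 1], [0, 0, 0]], float)}
# Candidate destinations from vertex 0 under a uniform mixture of 6 paths:
s = score(adj, 0, np.zeros(6))
```

Vertex 2 is only reachable from vertex 0 via the two-hop path r1·r2, so it receives mass from exactly that path, mirroring how a learned mixture would surface multi-hop evidence.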
Metrics and Methodology: We learn a chain rule per relation present in the KG. Following previous work (Yang, Yang, and Cohen 2017), we also add inverse relations to the KG, which switch the source and destination vertices. We also include inverse triples in our test set. We compute filtered ranks for destinations (Bordes et al. 2013), which removes all true triples ranked above, and compute the following metrics based on Sun et al. (2020)'s suggestions. Let n denote the number of destinations that have a score strictly greater than
Table 2: KBC Results: Bold font denotes best in row. CTP does not scale to WN18RR, FB15K-237. * indicates results copied from original paper.
| | | NeuralLP | DRUM | CTP | RNNLogic | Ours |
|-----------|---------|----------|------|------|----------|----------|
| Kinship | Hits@10 | 89.1 | 86.1 | 93.9 | 91.1 | **98.4** |
| | Hits@3 | 63 | 48.2 | 79.7 | 72.9 | **89.3** |
| | MRR | 48.8 | 40 | 70.3 | 64.5 | **81.9** |
| UMLS | Hits@10 | 93 | 97.9 | 97.0 | 91.1 | **99.4** |
| | Hits@3 | 75.4 | 91.2 | 91.0 | 82.1 | **98.3** |
| | MRR | 55.3 | 61.8 | 80.1 | 71.0 | **90** |
| WN18RR | Hits@10 | 50.2 | 52.1 | - | 53.1* | **55.5** |
| | Hits@3 | 43.3 | 44.6 | - | 47.5* | **49.7** |
| | MRR | 33.7 | 34.8 | - | 45.5* | **47.3** |
| FB15K-237 | Hits@10 | 32.8 | 33.1 | - | 44.5* | **47** |
| | Hits@3 | 19.8 | 20 | - | 31.5* | **34.2** |
| | MRR | 17.4 | 17.5 | - | 28.8* | **30.7** |
destination t's, and let the number of destinations assigned the same score as t be denoted by m (including t itself); t may then occupy any of the ranks n+1,...,n+m, so we compute t's mean reciprocal rank (MRR) and Hits@K as:
$$\mathrm{MRR}(t) = \frac{1}{m}\sum_{j=n+1}^{n+m}\frac{1}{j}, \qquad \mathrm{Hits@}K(t) = \frac{1}{m}\sum_{j=n+1}^{n+m}\delta(j \leq K)$$
where δ() denotes the Dirac delta function. For each method, we report averages across all test set triples. We learn chain rules containing up to 3 predicates in the body of the rule for Kinship and UMLS, 4 for FB15K-237, and 5 for WN18RR. We provide additional details, including the training algorithm used and hyperparameter tuning, in Appendix A.
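Under the tie-aware reading above, where t occupies each of the ranks n+1, ..., n+m with equal weight, the metrics can be sketched as follows (a hypothetical helper, not the authors' evaluation code):

```python
def mrr_and_hits(n, m, k):
    """Tie-aware metrics for a true destination t: n destinations score
    strictly higher than t, and m destinations (t included) share t's
    score, so t occupies ranks n+1 .. n+m with equal probability."""
    ranks = range(n + 1, n + m + 1)
    mrr = sum(1.0 / j for j in ranks) / m
    hits = sum(1.0 for j in ranks if j <= k) / m
    return mrr, hits

# One destination scores above t and one ties with it (n=1, m=2):
mrr, hits = mrr_and_hits(1, 2, 3)  # averages over ranks 2 and 3
```

Averaging over the tied ranks avoids the over-optimistic scores that Sun et al. (2020) observed when ties are always broken in the model's favor.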
Results: Table 2 reports results for all methods. While CTP improves upon the efficiency of neural theorem provers (Rocktäschel and Riedel 2017), it still does not scale beyond Kinship and UMLS (indicated by -). Also, we copy previously published results for RNNLogic on WN18RR and FB15K-237⁴ (indicated by *). On the smaller datasets, CTP is the best baseline, but our results are significantly better, producing 16.5% and 12.4% relative improvements in MRR on Kinship and UMLS, respectively. On the larger datasets, RNNLogic is the current state-of-the-art within rule-based KBC, and we outperform it, producing 4% and 6.6% relative improvements in MRR on WN18RR and FB15K-237, respectively. Despite both approaches learning a mixture over relation sequences appearing on KG paths, one reason for our relative success could be that RNNLogic uses an inexact training algorithm (Qu et al. 2021), relying on expectation-maximization and an ELBO bound, whereas we employ no such approximations.

Learned Rules for FB15K-237: Table 3 presents a few rules learned from FB15K-237. Rule 1 in Table 3 infers the language a person speaks by exploiting knowledge of the language spoken in her/his country of nationality. In terms of a multi-hop path, this looks like: P(person) -nationality-> N(nation) -spoken in-> L(language). Similarly, Rule 2 uses the film country relation instead of nationality to infer the language used in a film. Besides spoken in, FB15K-237 contains
4 Despite exchanging multiple emails with its authors, we were unable to run RNNLogic code on the larger KBC datasets.
Table 3: Learned rules from FB15K-237.
| 1) | person language(P,L) ← | nationality(P,N) ∧ spoken in(L,N) |
|----|----|----|
| 2) | film language(F,L) ← | film country(F,C) ∧ spoken in(L,C) |
| 3) | tv program language(P,L) ← | country of tv program(P,N) ∧ official language(N,L) |
| 4) | burial place(P,L) ← | nationality(P,N) ∧ located in(L,N) |
| 5) | tv program country(P,N) ← | tv program actor(P,A) ∧ born in(A,L) ∧ located in(L,N) |
| 6) | film release region(F,R) ← | film crew(F,P) ∧ marriage location(P,L) ∧ located in(L,R) |
| 7) | marriage location(P,L) ← | celebrity friends(P,F) ∧ marriage location(F,L′) ∧ location adjoins(L′,L) |
other relations that can be utilized to infer language such as the official language spoken in a country. Rule 3 uses this relation to infer the language spoken in a TV program by first exploiting knowledge of its country of origin. Rules 5, 6 and 7 are longer rules containing 3 relations each in their body. Rule 5 infers a TV program's country by first exploiting knowledge of one of its actor's birth place and then determining which country the birth place belongs to. Rule 6 is similar but uses a film crew member's marriage location instead to infer the region where the film was released. Lastly, Rule 7 infers the marriage location of a celebrity by exploiting knowledge of where their friends got married.
Additional KBC Results : Due to space constraints, in Appendix B we report results on the Countries dataset (Bouchard, Singh, and Trouillon 2015) for which ground truth rules are known. On Countries, our KBC accuracy is comparable to other approaches and the learned LNN rules form a close match with the ground truth rules specified in Nickel, Rosasco, and Poggio (2016).
## 6 Conclusion
Our experiments show that learning rules and logical connectives jointly is not only possible but leads to more accurate rules than other neuro-symbolic ILP approaches. Templates provide a flexible way to express a wide range of ILP tasks. The templates used for Gridworld and KBC are distinct, yet we outperformed baselines in both cases. LNN rules use weights sparingly and are eminently interpretable while LNN operators' constraint formulation ensures close ties to classical logic's precise semantics compared to other approaches (e.g., NLM). While our neural network requires grounding the KB, our approach is still scalable enough to tackle the larger KBC benchmarks whereas others are not (e.g., CTP). In terms of future work, we aim to combine the ideas presented here with embedding of predicates and constants in a high-dimensional latent space to hopefully further improve performance. We would also like to extend our approach to learn more general Prolog-style rules.
## References
Bordes, A.; Usunier, N.; Garcia-Duran, A.; Weston, J.; and Yakhnenko, O. 2013. Translating embeddings for modeling multi-relational data. In NeurIPS .
Bouchard, G.; Singh, S.; and Trouillon, T. 2015. On Approx- imate Reasoning Capabilities of Low-Rank Vector Spaces. In AAAI .
Das, R.; Dhuliawala, S.; Zaheer, M.; Vilnis, L.; Durugkar, I.; Krishnamurthy, A.; Smola, A.; and McCallum, A. 2018. Go for a Walk and Arrive at the Answer: Reasoning Over Paths in Knowledge Bases using Reinforcement Learning. In ICLR .
Dettmers, T.; Minervini, P.; Stenetorp, P.; and Riedel, S. 2018. Convolutional 2D knowledge graph embeddings. In AAAI .
Donadello, I.; Serafini, L.; and d'Avila Garcez, A. S. 2017. Logic Tensor Networks for Semantic Image Interpretation. In IJCAI .
Dong, H.; Mao, J.; Lin, T.; Wang, C.; Li, L.; and Zhou, D. 2019. Neural Logic Machines. In ICLR .
Esteva, F.; and Godo, L. 2001. Monoidal t-norm based logic: Towards a logic for left-continuous t-norms. Fuzzy Sets and Systems .
Evans, R.; and Grefenstette, E. 2018. Learning Explanatory Rules from Noisy Data. JAIR .
Frerix, T.; Nießner, M.; and Cremers, D. 2020. Homogeneous Linear Inequality Constraints for Neural Network Activations. In CVPR Workshops .
Getoor, L.; and Taskar, B. 2007. Introduction to Statistical Relational Learning (Adaptive Computation and Machine Learning) . The MIT Press.
Goodfellow, I.; Warde-Farley, D.; Mirza, M.; Courville, A.; and Bengio, Y. 2013. Maxout Networks. In ICML .
Kazemi, S. M.; and Poole, D. 2018. RelNN: A Deep Neural Model for Relational Learning. In AAAI .
Kok, S.; and Domingos, P. 2007. Statistical predicate invention. In ICML .
Krizhevsky, A. 2010. Convolutional deep belief networks on CIFAR-10. Unpublished Manuscript.
Manhaeve, R.; Dumancic, S.; Kimmig, A.; Demeester, T.; and Raedt, L. D. 2018. DeepProbLog: Neural Probabilistic Logic Programming. CoRR .
Minervini, P.; Bosnjak, M.; Rocktäschel, T.; Riedel, S.; and Grefenstette, E. 2020a. Differentiable reasoning on large knowledge bases and natural language. In AAAI .
Minervini, P.; Riedel, S.; Stenetorp, P.; Grefenstette, E.; and Rocktäschel, T. 2020b. Learning Reasoning Strategies in End-to-End Differentiable Proving. In ICML .
Motzkin, T. S.; Raiffa, H.; Thompson, G. L.; and Thrall, R. M. 1953. The double description method. Contributions to the Theory of Games .
Muggleton, S. 1996. Learning from positive data. In Workshop on ILP .
Muggleton, S. H.; Lin, D.; Pahlavi, N.; and Tamaddoni-Nezhad, A. 2014. Meta-interpretive Learning: Application to Grammatical Inference. Machine Learning .
Nair, V.; and Hinton, G. 2010. Rectified Linear Units Improve Restricted Boltzmann Machines. In ICML .
Nickel, M.; Rosasco, L.; and Poggio, T. 2016. Holographic Embeddings of Knowledge Graphs. In AAAI .
Qu, M.; Chen, J.; Xhonneux, L.-P.; Bengio, Y.; and Tang, J. 2021. RNNLogic: Learning Logic Rules for Reasoning on Knowledge Graphs. In ICLR .
Riegel, R.; Gray, A.; Luus, F.; Khan, N.; Makondo, N.; Akhalwaya, I. Y.; Qian, H.; Fagin, R.; Barahona, F.; Sharma, U.; Ikbal, S.; Karanam, H.; Neelam, S.; Likhyani, A.; and Srivastava, S. 2020. Logical Neural Networks. CoRR .
Rocktäschel, T.; and Riedel, S. 2017. End-to-End Differentiable Proving. In NeurIPS .
Sadeghian, A.; Armandpour, M.; Ding, P.; and Wang, D. Z. 2019. DRUM: End-To-End Differentiable Rule Mining On Knowledge Graphs. In NeurIPS .
Sourek, G.; Svatos, M.; Zelezny, F.; Schockaert, S.; and Kuzelka, O. 2017. Stacked Structure Learning for Lifted Relational Neural Networks. In International Conference on Inductive Logic Programming .
Sun, Z.; Vashishth, S.; Sanyal, S.; Talukdar, P.; and Yang, Y. 2020. A Re-evaluation of Knowledge Graph Completion Methods. In ACL .
Toutanova, K.; and Chen, D. 2015. Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality .
Yang, F.; Yang, Z.; and Cohen, W. W. 2017. Differentiable Learning of Logical Rules for Knowledge Base Reasoning. In NeurIPS .
## A Implementation Details for KBC Experiments
Given a training knowledge graph G = 〈V, R, E〉, whose vertices are given by V, edges by E, and relations by R, we learn a chain FOL rule for each r ∈ R independently. Each edge in E is given by a triple 〈h, r, t〉, where h, t ∈ V denote source and destination vertices and r ∈ R denotes a relation. Following previous work (Yang, Yang, and Cohen 2017), we also add inverse relations. In other words, for each r ∈ R we introduce a new relation r⁻¹ by adding, for each 〈h, r, t〉 ∈ E, a new triple 〈t, r⁻¹, h〉 to G. Learning to predict destinations for a given relation r in the augmented set of relations is essentially a binary classification task where 〈h, r, t〉 ∈ E and 〈h, r, t〉 ∉ E, ∀h, t ∈ V, denote positive and negative examples, respectively.
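The inverse-relation augmentation described above can be sketched as follows; the helper name and toy triple are hypothetical, and writing the inverse relation as an `^-1` string suffix is purely for illustration.

```python
def add_inverse_relations(triples):
    """Augment a list of (h, r, t) triples with inverse triples
    (t, r^-1, h), as in prior KBC work (Yang, Yang, and Cohen 2017)."""
    inv = [(t, r + "^-1", h) for h, r, t in triples]
    return triples + inv

kg = [("paris", "capital_of", "france")]
aug = add_inverse_relations(kg)
# aug now also contains ("france", "capital_of^-1", "paris")
```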
Training Algorithm & Hyperparameter Settings: In each iteration, we sample uniformly at random a mini-batch of positive triples B⁺ from 〈h, r, t〉 ∈ E and negative triples B⁻ from 〈h, r, t〉 ∉ E, ∀h, t ∈ V, such that |B⁺| = |B⁻|, to minimize the following loss:
$$\mathcal{L} = \sum_{x^{+} \in B^{+}} \sum_{x^{-} \in B^{-}} \max\bigl(0,\ \gamma - \psi(x^{+}) + \psi(x^{-})\bigr)$$
where γ denotes the margin hyperparameter. We use Adagrad with step size ∈ {0.1, 1.0}, margin γ ∈ {0.1, 0.5, 1.0, 2.0}, and batch size |B⁺| = |B⁻| = 8. We use the validation set to perform hyperparameter tuning and
Figure 6: LNN-rule (left) vs. NeuralLP's (5) rules (top right) vs. CTP's rule (bottom right) for Countries-S3.
Table 4: AUC-PR Results on Countries
| | NLM | NTP-λ | CTP | NeuralLP | Ours |
|----|-------------|--------------|------------|------------|----------|
| S1 | 58.06 ± 2.4 | 100 ± 0 | 99.6 ± 0.5 | 100 ± 0 | 100 ± 0 |
| S2 | 40.57 ± 5.3 | 93.04 ± 0.4 | 92.4 ± 1.7 | 75.1 ± 0.3 | 92.3 ± 0 |
| S3 | 53.37 ± 2.8 | 77.26 ± 17.0 | 55.4 ± 3.5 | 92.2 ± 0.2 | 91.3 ± 0 |
learn rules of length up to 4 for FB15K-237, 5 for WN18RR, and 3 for Kinship, UMLS.
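The margin-based objective described in the training-algorithm paragraph above (positive scores should exceed negative scores by at least γ) can be sketched in numpy; this is a minimal illustration with hypothetical names, not the authors' implementation.

```python
import numpy as np

def margin_loss(pos_scores, neg_scores, gamma):
    """Hinge-style margin ranking loss: penalize every positive/negative
    pair whose positive score does not beat the negative by at least gamma."""
    pos = np.asarray(pos_scores)[:, None]   # shape (|B+|, 1)
    neg = np.asarray(neg_scores)[None, :]   # shape (1, |B-|)
    return np.maximum(0.0, gamma - pos + neg).sum()

# Positives already separated from negatives by more than gamma -> zero loss.
margin_loss([0.9, 0.8], [0.1, 0.2], 0.5)  # -> 0.0
```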
## B Additional KBC Experiments
Besides the experiments presented in Section 5, we also experiment with the Countries dataset (Bouchard, Singh, and Trouillon 2015), which contains 272 vertices, 1158 facts, and 2 predicates: locatedIn(X,Y) (abbreviated locIn) and neighborOf(X,Y) (abbreviated ngbrOf). Following previous work (Das et al. 2018), we report area under the precision-recall curve (AUC-PR).
Nickel, Rosasco, and Poggio (2016) permute the facts in this dataset to pose 3 learning tasks {S1, S2, S3}, each corresponding to learning a different rule. S1's and S2's rules contain 2 predicates in their bodies, whereas S3's body contains 3. We use S(X,Z) ← P(X,Y) ∧ Q(Y,Z) as the template for S1 and S2, and S(X,Z) ← P(X,W) ∧ Q(W,Y) ∧ O(Y,Z) for S3, where Dom(P) = Dom(Q) = Dom(O) = {locIn, ngbrOf}. We compare against NeuralLP⁵ (Yang, Yang, and Cohen 2017), neural theorem provers⁶ (NTP-λ) (Rocktäschel and Riedel 2017), conditional theorem provers⁷ (CTP) (Minervini et al. 2020b), and neural logic machines⁸ (NLM) (Dong et al. 2019).
Table 4 reports averages across 3 runs. The NLM implementation we report results with is quite pessimistic, almost never predicting a link. All other approaches achieve perfect AUC-PR on S1. We outperform NeuralLP on S2, and NTP shows high variance on S3⁹, perhaps due to its excessive parameterization (NTP learns embeddings for predicates and vertices). CTP performs well on the first two tasks but fails on S3. To find out why, we take a close look at the learned rules next.
5 github.com/fanyangxyz/Neural-LP
6 github.com/uclnlp/ntp
7 github.com/uclnlp/ctp
8 github.com/google/neural-logic-machines
9 NTP's high variance on S2 is also noted in Das et al. (2018).
The goal in S2 and S3 (Countries) is to learn the following rules (Nickel, Rosasco, and Poggio 2016):
locIn(X,Z) ← ngbrOf(X,Y) ∧ locIn(Y,Z) (S2)
locIn(X,Z) ← ngbrOf(X,W) ∧ ngbrOf(W,Y) ∧ locIn(Y,Z) (S3)
We compare the rules learned by our approach, NeuralLP, and CTP for Countries-S3. NeuralLP produces a weighted list of 5 rules (see Figure 6 top right, weights omitted), none of which capture the correct rule shown above. CTP learns that ngbrOf is a symmetric relation (Minervini et al. 2020b), which makes sense (Figure 6 bottom right). Unfortunately, however, all facts that need to be proved in the test set of Countries-S3 pertain to the locIn predicate, which explains why CTP's AUC-PR for Countries-S3 is so poor (Table 4). In contrast, the learned LNN rule shown in Figure 6 (left) is a near perfect translation of the correct logical rule into real-valued logic. Note that the leftmost and middle LNN-preds place a large weight on ngbrOf vs. a small weight on locIn, while the rightmost LNN-pred does the opposite, which lines up perfectly with the rule to be learned.
We also compared the rules learned for S2. Just as for S3, the learned LNN rule for S2 (shown in Figure 7) has its left LNNpred place a large weight on ngbrOf and a small weight on locIn, whereas the right LNNpred does the opposite, closely matching the correct FOL rule shown above. The list of rules learned by NeuralLP (weights depicted in parentheses) is shown below:
none of which matches the correct rule to be learned for S2. CTP learns the following rule (Minervini et al. 2020b):
Figure 7: LNN rule learned from Countries-S2
[Figure detail: the learned rule's head locIn(X,Z) is an LNN-∧ (1.056) over two LNNpred nodes; the left LNNpred weights ngbrOf(X,Y) at 1.228 against locIn(X,Y) at 0.002, and the right weights locIn(Y,Z) at 1.186 against ngbrOf(Y,Z) at 0.04.]
Not only does this rule fail to match the correct rule to be learned, it is also of little use for proving the facts in the test set, which pertain exclusively to the locIn predicate.
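For contrast, the correct chain rule for S2 does prove locIn test facts. A toy forward-chaining pass makes this explicit (the entity names below are illustrative, not drawn from the dataset):

```python
# Rule: locIn(X,Z) <- ngbrOf(X,Y) ^ locIn(Y,Z), applied once by
# forward chaining over a toy fact base.
ngbrOf = {("portugal", "spain")}
locIn = {("spain", "europe")}

derived = {(x, z)
           for (x, y) in ngbrOf
           for (y2, z) in locIn
           if y == y2}
locIn |= derived
assert ("portugal", "europe") in locIn   # a new locIn fact is proved
```

A rule whose head is not locIn (such as the symmetry rule above) can never add facts to locIn this way, which is exactly why it scores poorly on a test set of locIn queries.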