## Probabilistic Reasoning via Deep Learning: Neural Association Models
Quan Liu†, Hui Jiang‡, Andrew Evdokimov‡, Zhen-Hua Ling†, Xiaodan Zhu, Si Wei§, Yu Hu†§

† National Engineering Laboratory for Speech and Language Information Processing,
University of Science and Technology of China, Hefei, Anhui, China
‡ Department of Electrical Engineering and Computer Science, York University, Canada
National Research Council Canada, Ottawa, Canada
§ iFLYTEK Research, Hefei, China

Emails: quanliu@mail.ustc.edu.cn, hj@cse.yorku.ca, ae2718@cse.yorku.ca, zhling@ustc.edu.cn, xiaodan@cse.yorku.ca, siwei@iflytek.com, yuhu@iflytek.com
## Abstract
In this paper, we propose a new deep learning approach, called neural association model (NAM), for probabilistic reasoning in artificial intelligence. We propose to use neural networks to model association between any two events in a domain. Neural networks take one event as input and compute a conditional probability of the other event to model how likely these two events are to be associated. The actual meaning of the conditional probabilities varies between applications and depends on how the models are trained. In this work, as two case studies, we have investigated two NAM structures, namely deep neural networks (DNN) and relation-modulated neural nets (RMNN), on several probabilistic reasoning tasks in AI, including recognizing textual entailment, triple classification in multi-relational knowledge bases and commonsense reasoning. Experimental results on several popular datasets derived from WordNet, FreeBase and ConceptNet have all demonstrated that both DNNs and RMNNs perform equally well and they can significantly outperform the conventional methods available for these reasoning tasks. Moreover, compared with DNNs, RMNNs are superior in knowledge transfer, where a pre-trained model can be quickly extended to an unseen relation after observing only a few training samples. To further prove the effectiveness of the proposed models, in this work, we have applied NAMs to solving challenging Winograd Schema (WS) problems. Experiments conducted on a set of WS problems prove that the proposed models have the potential for commonsense reasoning.
## Introduction
Reasoning is an important topic in artificial intelligence (AI) that has attracted considerable attention and research effort in the past few decades (McCarthy 1986; Minsky 1988; Mueller 2014). Besides traditional logic reasoning, probabilistic reasoning has been studied as another typical genre for handling knowledge uncertainty in reasoning, based on probability theory (Pearl 1988; Neapolitan 2012). Probabilistic reasoning can be used to predict the conditional probability Pr( E 2 | E 1 ) of one event E 2 given another event E 1 . State-of-the-art methods for probabilistic reasoning include Bayesian networks (Jensen 1996), Markov logic networks (Richardson and Domingos 2006) and other graphical models (Koller and Friedman 2009). Taking Bayesian networks as an example, the conditional
probabilities between two associated events are calculated as posterior probabilities according to Bayes theorem, with all possible events being modeled by a pre-defined graph structure. However, these methods quickly become intractable for most practical tasks where the number of all possible events is usually very large.
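As a concrete illustration of the posterior computation mentioned above, the following sketch applies Bayes' theorem to two binary events; all probability values are toy numbers chosen for illustration, not figures from this paper:

```python
# Toy Bayesian-network-style posterior for two binary events E1 and E2,
# assuming illustrative probabilities (not from the paper).
p_e2 = 0.3           # prior Pr(E2)
p_e1_given_e2 = 0.8  # likelihood Pr(E1 | E2)
p_e1 = 0.5           # evidence Pr(E1)

# Bayes' theorem: Pr(E2 | E1) = Pr(E1 | E2) * Pr(E2) / Pr(E1)
p_e2_given_e1 = p_e1_given_e2 * p_e2 / p_e1
print(p_e2_given_e1)  # 0.48
```

The difficulty the text points to is not this arithmetic but scale: a graph over all possible events, with a table or structure entry per dependency, quickly becomes intractable.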
In recent years, distributed representations that map discrete language units into continuous vector space have gained significant popularity along with the development of neural networks (Bengio et al. 2003; Collobert et al. 2011; Mikolov et al. 2013). The main benefit of embedding in continuous space is its smoothness property, which helps to capture the semantic relatedness between discrete events and potentially generalizes to unseen events. Similar ideas, such as knowledge graph embedding, have been proposed to represent knowledge bases (KB) in low-dimensional continuous space (Bordes et al. 2013; Socher et al. 2013; Wang et al. 2014; Nickel et al. 2015). Using the smoothed KB representation, it is possible to reason over the relations among various entities. However, human-like reasoning remains an extremely challenging problem, partially because it requires the effective encoding of world knowledge using powerful models. Most of the existing KBs are quite sparse, and even recently created large-scale KBs, such as YAGO, NELL and Freebase, can only capture a fraction of world knowledge. In order to take advantage of these sparse knowledge bases, the state-of-the-art approaches for knowledge graph embedding usually adopt simple linear models, such as RESCAL (Nickel, Tresp, and Kriegel 2012), TransE (Bordes et al. 2013) and Neural Tensor Networks (Socher et al. 2013; Bowman 2013).
Although deep learning techniques have achieved great progress in many domains, e.g., speech and image processing (LeCun, Bengio, and Hinton 2015), progress in commonsense reasoning has been slow. In this paper, we propose to use deep neural networks, called neural association models (NAM), for commonsense reasoning. Different from the existing linear models, the proposed NAM model uses multi-layer nonlinear activations in deep neural nets to model the conditional probabilities of association between any two possible events. In the proposed NAM framework, all symbolic events are represented in low-dimensional continuous space, and there is no need to explicitly specify any dependency structure among events as required in Bayesian networks.
Deep neural networks are used to model the association between any two events, taking one event as input to compute a conditional probability of another event. The computed conditional probability for association may be generalized to model various reasoning problems, such as entailment inference, relational learning, causation modelling and so on. In this work, we study two model structures for NAM. The first model is a standard deep neural network (DNN) and the second uses a special structure called relation-modulated neural nets (RMNN). Experiments on several probabilistic reasoning tasks, including recognizing textual entailment, triple classification in multi-relational KBs and commonsense reasoning, have demonstrated that both DNNs and RMNNs can outperform other conventional methods. Moreover, the RMNN model is shown to be effective in knowledge transfer learning, where a pre-trained model can be quickly extended to a new relation after observing only a few training samples.
Furthermore, we also apply the proposed NAM models to more challenging commonsense reasoning problems, namely the recently proposed Winograd Schemas (WS) (Levesque, Davis, and Morgenstern 2011). The WS problems have been viewed as an alternative to the Turing Test (Turing 1950). To support model training for NAMs, we propose a straightforward method to collect associated cause-effect pairs from large unstructured texts. The pair extraction procedure starts from constructing a vocabulary with thousands of common verbs and adjectives. Based on the extracted pairs, this paper extends the NAM models to solve Winograd Schema problems and achieves 61% accuracy on a set of cause-effect examples. Undoubtedly, to realize commonsense reasoning, there is still much work to be done and many problems to be solved. Detailed discussions are given at the end of this paper.
## Motivation: Association between Events
This paper aims to model the association relationships between events using neural network methods. To clarify our main work, we first describe the characteristics of events and the possible association relationships between them. Based on this analysis of event association, we present the motivation for the proposed neural association models. In commonsense reasoning, the main characteristics of events are the following:
- Massive : In most natural situations, the number of events is massive, which means that the association space we will model is very large.
- Sparse : The events that occur in our daily lives are very sparse. It is a very challenging task to adequately capture the similarities between all those different events.
At the same time, association between events appears everywhere. Consider the single event play basketball, for example, shown in Figure 1. This event is associated with many other events. A person who plays basketball may win a game. Meanwhile, he may be injured in some cases. The person could also make money by playing basketball. Moreover, we know that a person who plays basketball should be coached during a regular game. These are all typical associations between events. However, we need to recognize that the task of modeling event association is not identical to performing classification . In classification, we typically map an event from its feature space into one of a pre-defined finite set of categories or classes. In event association, we need to compute the association probability between two arbitrary events, each of which may be a sample from a possibly infinite set. The mapping relationships in event association are many-to-many ; e.g., not only could playing basketball help us make money, but stock trading could make money as well. More specifically, the association relationships between events include cause-effect, spatial, temporal and so on. This paper treats them all as a general relation, considering the sparseness of useful KBs.
Figure 1: Example of association between events.
In this paper, we believe that modeling the association relationships between events is fundamental to commonsense reasoning. If we could model event associations well, we would have the ability to solve many commonsense reasoning problems. Considering the main characteristics of discrete events and event association , we give two reasons describing our motivation:
- The advantage of distributed representation methods: representing discrete events into continuous vector space provides a good way to capture the similarities between discrete events.
- The advantage of neural network methods: neural networks could perform universal approximation while linear models cannot easily do this (Hornik, Stinchcombe, and White 1990).
At the same time, this paper takes into account that both distributed representation and neural network methods are data-hungry. In artificial intelligence (AI) research, mining large amounts of useful data (or knowledge) for model learning is always challenging. In the following sections, this paper presents preliminary work on data collection and the corresponding experiments we have conducted for solving commonsense reasoning problems.
## Neural Association Models (NAM)
In this paper, we propose to use a nonlinear model, namely neural association model, for probabilistic reasoning. Our main goal is to use neural nets to model the association probability for any two events E 1 and E 2 in a domain, i.e., Pr( E 2 | E 1 ) of E 2 conditioning on E 1 . All possible events in the domain are projected into continuous space without specifying any explicit dependency structure among them. In the following, we first introduce neural association models (NAM) as a general modeling framework for probabilistic reasoning. Next, we describe two particular NAM structures for modeling the typical multi-relational data.
## NAM in general
Figure 2: The NAM framework in general.
Figure 2 shows the general framework of NAM for associating two events, E 1 and E 2 . In the general NAM framework, the events are first projected into a low-dimensional continuous space. Deep neural networks with multi-layer nonlinearity are used to model how likely these two events are to be associated. Neural networks take the embedding of one event E 1 (antecedent) as input and compute a conditional probability Pr( E 2 | E 1 ) of the other event E 2 (consequent). If the event E 2 is binary (true or false), the NAM models may use a sigmoid node to compute Pr( E 2 | E 1 ). If E 2 takes multiple mutually exclusive values, we use a few softmax nodes for Pr( E 2 | E 1 ), where it may be necessary to use multiple embeddings for E 2 (one per value). NAMs do not explicitly specify how different events E 2 are actually related; they may be mutually exclusive, contained, or intersecting. NAMs are only used to separately compute the conditional probability Pr( E 2 | E 1 ) for each pair of events E 1 and E 2 in a task. The actual physical meaning of the conditional probabilities Pr( E 2 | E 1 ) varies between applications and depends on how the models are trained. Table 1 lists a few possible applications.
Table 1: Some applications for NAMs.
| Application                     | E 1           | E 2    |
|---------------------------------|---------------|--------|
| language modeling               | h             | w ∈ W  |
| causal reasoning                | cause         | effect |
| knowledge triple classification | { e i , r k } | e j    |
| lexical entailment              | W 1           | W 2    |
| textual entailment              | D 1           | D 2    |
In language modeling, the antecedent event is the representation of the historical context, h , and the consequent event is the next word w , which takes one out of K values. In causal reasoning, E 1 and E 2 represent cause and effect respectively. For example, we may have E 1 = 'eating cheesy cakes' and E 2 = 'being happy' , where Pr( E 2 | E 1 ) indicates how likely it is that E 1 may cause the binary (true or false) event E 2 . In the same model, we may add more nodes to model different effects from the same E 1 , e.g., E′ 2 = 'growing fat' . Moreover, we may add 5 softmax nodes to model a multi-valued event, e.g., E″ 2 = 'happiness' (scale from 1 to 5). Similarly, for knowledge triple classification of multi-relational data, given one triple ( e i , r k , e j ), E 1 consists of the head entity ( subject ) e i and the relation ( predicate ) r k , and E 2 is a binary event indicating whether the tail entity ( object ) e j is true or false. Finally, in the applications of recognizing lexical or textual entailment, E 1 and E 2 may be defined as premise and hypothesis . More generally, NAMs can be used to model an infinite number of events E 2 , where each point in a continuous space represents a possible event. In this work, for simplicity, we only consider NAMs for a finite number of binary events E 2 , but the formulation can be easily extended to more general cases.
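The two output choices described above (a sigmoid node for a binary E 2 , and softmax nodes with one embedding per value for a multi-valued E 2 ) can be sketched as follows; all dimensions, random values, and variable names are illustrative assumptions, not the paper's trained parameters:

```python
import numpy as np

# Sketch of NAM output nodes, assuming an 8-dim hidden representation of E1
# and randomly initialized (untrained) embeddings for E2.
rng = np.random.default_rng(0)
h = rng.standard_normal(8)      # network representation of antecedent event E1

# Binary consequent E2: a single sigmoid node gives Pr(E2 = true | E1).
v_e2 = rng.standard_normal(8)   # embedding of the binary event E2
p_binary = 1.0 / (1.0 + np.exp(-h @ v_e2))

# Multi-valued E2 with K mutually exclusive values: softmax over K scores,
# one embedding per value (e.g., 'happiness' on a 1-5 scale, so K = 5).
K = 5
V_e2 = rng.standard_normal((K, 8))   # one embedding row per value of E2
scores = V_e2 @ h
p_multi = np.exp(scores - scores.max())
p_multi /= p_multi.sum()             # Pr(E2 = k | E1) for k = 1..K
```

The sigmoid output lies in (0, 1) and the softmax outputs sum to one, matching the two cases in the text.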
Compared with traditional methods, like Bayesian networks, NAMs employ neural nets as a universal approximator to directly model individual pairwise event association probabilities without relying on explicit dependency structure. Therefore, NAMs can be end-to-end learned purely from training samples without strong human prior knowledge, and are potentially more scalable to real-world tasks.
**Learning NAMs** Assume we have a set D of N_d observed examples (event pairs { E 1 , E 2 }), each of which is denoted as x_n. This training set normally includes both positive and negative samples. We denote all positive samples ( E 2 = true) as D+ and all negative samples ( E 2 = false) as D−. Under the same independence assumption as in statistical relational learning (SRL) (Getoor 2007; Nickel et al. 2015), the log likelihood function of a NAM model can be expressed as follows:
$$\mathcal { L } ( \Theta ) = \sum _ { x _ { n } ^ { + } \in \mathcal { D } ^ { + } } \ln f ( x _ { n } ^ { + } ; \Theta ) + \sum _ { x _ { n } ^ { - } \in \mathcal { D } ^ { - } } \ln ( 1 - f ( x _ { n } ^ { - } ; \Theta ) ) \quad ( 1 )$$
where f(x_n; Θ) denotes a logistic score function derived by the NAM for each x_n, which numerically computes the conditional probability Pr( E 2 | E 1 ). More details on f(·) will be given later in the paper. Stochastic gradient descent (SGD) methods may be used to maximize the above likelihood function, leading to a maximum likelihood estimation (MLE) for NAMs.
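A minimal sketch of this likelihood, assuming stand-in logistic scores f(x; Θ) rather than an actual trained network (the score values below are illustrative):

```python
import numpy as np

# Log likelihood of eq. (1): sum of ln f over positive samples D+ plus
# sum of ln(1 - f) over negative samples D-. Here the scores are given
# directly; in a real NAM they come from the network's sigmoid output.
def log_likelihood(scores_pos, scores_neg):
    """scores_pos, scores_neg: arrays of f(x; Theta) values in (0, 1)."""
    return np.sum(np.log(scores_pos)) + np.sum(np.log(1.0 - scores_neg))

f_pos = np.array([0.9, 0.8])   # f(x+) for two positive samples in D+
f_neg = np.array([0.2, 0.1])   # f(x-) for two negative samples in D-
ll = log_likelihood(f_pos, f_neg)
```

Maximizing this quantity by SGD pushes f toward 1 on positive pairs and toward 0 on negative pairs, which is exactly the MLE behavior the text describes.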
In the following, as two case studies, we consider two NAM structures with a finite number of output nodes to model Pr( E 2 | E 1 ) for any pair of events, where we have only a finite number of E 2 and each E 2 is binary. The first model is a typical DNN that associates antecedent event ( E 1 ) at input and consequent event ( E 2 ) at output. We then present another model structure, called relation-modulated neural nets, which is more suitable for multi-relational data.
## DNN for NAMs
The first NAM structure is a traditional DNN, as shown in Figure 3. Here we use multi-relational data in a KB for illustration. Given a KB triple x_n = ( e i , r k , e j ) and its corresponding label y_n (true or false), we cast E 1 = ( e i , r k ) and E 2 = e j to compute Pr( E 2 | E 1 ) as follows.
Figure 3: The DNN structure for NAMs.
First, we represent the head entity e i and the tail entity e j by two embedding vectors v_i^(1) (∈ V^(1)) and v_j^(2) (∈ V^(2)). Similarly, relation r k is also represented by a low-dimensional vector c_k ∈ C, which we call a relation code hereafter. Second, we concatenate the embeddings of the head entity e i and the relation r k and feed them into an ( L +1)-layer DNN as input. The DNN consists of L rectified linear (ReLU) hidden layers (Nair and Hinton 2010). The input is z^(0) = [ v_i^(1), c_k ]. During the feedforward process, we have
$$a ^ { ( \ell ) } = W ^ { ( \ell ) } z ^ { ( \ell - 1 ) } + b ^ { ( \ell ) } \quad ( \ell = 1 , \cdots , L ) \quad ( 2 )$$
$$z ^ { ( \ell ) } = h \left ( a ^ { ( \ell ) } \right ) = \max \left ( 0 , a ^ { ( \ell ) } \right ) \quad ( \ell = 1 , \cdots , L ) \quad ( 3 )$$
where W^(ℓ) and b^(ℓ) represent the weight matrix and the bias for layer ℓ, respectively.
Finally, we propose to calculate a sigmoid score for each triple x_n = ( e i , r k , e j ) as the association probability, using the last hidden layer's output and the tail entity vector v_j^(2):
$$f ( x _ { n } ; \Theta ) = \sigma \left ( z ^ { ( L ) } \cdot v _ { j } ^ { ( 2 ) } \right ) \quad ( 4 )$$
where σ(·) is the sigmoid function, i.e., σ( x ) = 1/(1 + e^(−x)). All network parameters of this NAM structure, represented as Θ = { W , V^(1), V^(2), C }, may be jointly learned by maximizing the likelihood function in eq. (1).
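Under illustrative assumptions about layer sizes and initialization (untrained random weights, 8-dim embeddings, L = 2), the feedforward pass of eqs. (2)-(4) can be sketched as:

```python
import numpy as np

# Sketch of the DNN-based NAM forward pass: concatenate the head-entity
# embedding and the relation code, apply L ReLU layers, then take a sigmoid
# of the dot product with the tail-entity embedding. All sizes and the
# 0.1-scaled random initialization are illustrative assumptions.
rng = np.random.default_rng(1)
d, n_layers = 8, 2
v_head = rng.standard_normal(d)      # v_i^(1), head entity embedding
c_rel = rng.standard_normal(d)       # c_k, relation code
v_tail = rng.standard_normal(d)      # v_j^(2), tail entity embedding

z = np.concatenate([v_head, c_rel])  # z^(0) = [v_i^(1), c_k]
dims = [2 * d] + [d] * n_layers
Ws = [0.1 * rng.standard_normal((dims[l + 1], dims[l])) for l in range(n_layers)]
bs = [np.zeros(dims[l + 1]) for l in range(n_layers)]

for W, b in zip(Ws, bs):
    z = np.maximum(0.0, W @ z + b)   # eqs. (2)-(3): a = Wz + b, z = max(0, a)

f = 1.0 / (1.0 + np.exp(-z @ v_tail))  # eq. (4): association probability
```

Training would then plug f into eq. (1) and update { W , V^(1), V^(2), C } by SGD; the sketch stops at the forward computation.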
## Relation-modulated Neural Networks (RMNN)
Particularly for multi-relational data, following the idea in (Xue et al. 2014), we propose to use so-called relation-modulated neural nets (RMNN), as shown in Figure 4.
The RMNN uses the same operations as the DNN to project all entities and relations into low-dimensional continuous space. As shown in Figure 4, we connect the knowledge-specific relation code c_k to all hidden layers in the network.
Figure 4: The relation-modulated neural networks (RMNN).
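A hedged sketch of the RMNN feedforward suggested by the text and Figure 4, assuming the relation code c_k enters every layer through a learned per-layer matrix B^(ℓ), i.e. a^(ℓ) = W^(ℓ) z^(ℓ−1) + B^(ℓ) c_k. This parameterization is our reading of the figure, not a quoted equation, and all sizes and initializations are illustrative:

```python
import numpy as np

# Sketch of an RMNN forward pass: the head entity feeds the input layer,
# while the relation code modulates every hidden layer via B^(l).
rng = np.random.default_rng(2)
d, n_layers = 8, 2
v_head = rng.standard_normal(d)   # head entity embedding
c_rel = rng.standard_normal(d)    # c_k, relation code (shared across layers)
v_tail = rng.standard_normal(d)   # tail entity embedding

Ws = [0.1 * rng.standard_normal((d, d)) for _ in range(n_layers)]
Bs = [0.1 * rng.standard_normal((d, d)) for _ in range(n_layers)]

z = v_head
for W, B in zip(Ws, Bs):
    z = np.maximum(0.0, W @ z + B @ c_rel)  # relation modulates each layer

f = 1.0 / (1.0 + np.exp(-z @ v_tail))       # association probability, as in eq. (4)
```

Because only the relation-specific parameters touch c_k, extending a trained model to a new relation mainly requires learning a new code (and its B matrices), which is consistent with the knowledge-transfer behavior the paper reports for RMNNs.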
</details>
As shown later, this structure is superior in knowledge transfer learning tasks. Therefore, for each layer of RMNNs, instead of using eq.(2), the linear activation signal is computed from the output of the previous layer $z^{(\ell-1)}$ and the relation code $c^{(k)}$ as follows:
$$a ^ { ( \ell ) } = W ^ { ( \ell ) } z ^ { ( \ell - 1 ) } + B ^ { ( \ell ) } c ^ { ( k ) } , \quad ( \ell = 1 \cdots L ) \quad ( 5 )$$
where $W^{(\ell)}$ and $B^{(\ell)}$ represent the normal weight matrix and the relation-specific weight matrix for layer $\ell$. At the topmost layer, we calculate the final score for each triple $x_n = (e_i, r_k, e_j)$ using the relation code as:
$$f ( x _ { n } ; \Theta ) = \sigma \left ( z ^ { ( L ) } \cdot v _ { j } ^ { ( 2 ) } + B ^ { ( L + 1 ) } \cdot c ^ { ( k ) } \right ) . \quad ( 6 )$$
In the same way, all RMNN parameters, denoted as $\Theta = \{ \mathbf{W}, \mathbf{B}, V^{(1)}, V^{(2)}, C \}$, can be jointly learned based on the above maximum likelihood estimation.
The RMNN models are particularly suitable for knowledge transfer learning , where a pre-trained model can be quickly extended to any new relation after observing a few samples from that relation. In this case, we may estimate a new relation code based on the available new samples while keeping the whole network unchanged. Due to its small size, the new relation code can be reliably estimated from only a small number of new samples. Furthermore, model performance in all original relations will not be affected since the model and all original relation codes are not changed during transfer learning.
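The layer computation in eqs. (5) and (6) can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' code: the function and variable names are our own, and the dimensions in the example are chosen arbitrarily.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def rmnn_score(head_vec, rel_code, tail_vec, Ws, Bs, B_top):
    """Score a triple (e_i, r_k, e_j) with an RMNN, following eqs. (5)-(6).

    Ws: list of L weight matrices W^(l); Bs: list of L relation-specific
    matrices B^(l); B_top: the top-layer relation weight B^(L+1).
    Note the relation code feeds into every layer, not just the first.
    """
    z = head_vec                      # z^(0): the head entity embedding
    for W, B in zip(Ws, Bs):
        a = W @ z + B @ rel_code      # eq. (5): a^(l) = W^(l) z^(l-1) + B^(l) c^(k)
        z = relu(a)                   # ReLU nonlinearity, as in the experiments
    # eq. (6): sigmoid of the match between z^(L) and the tail embedding,
    # plus a relation-specific bias term
    return sigmoid(z @ tail_vec + B_top @ rel_code)
```

Because only `rel_code` (and the small `B` projections) carries relation-specific information, adapting the model to a new relation reduces to estimating one new code vector, which is the basis of the transfer-learning property discussed next.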
## Experiments
In this section, we evaluate the proposed NAM models for various reasoning tasks. We first describe the experimental setup and then we report the results from several reasoning tasks, including textual entailment recognition, triple classification in multi-relational KBs, commonsense reasoning and knowledge transfer learning.
## Experimental setup
Here we first introduce some common experimental settings used for all experiments: 1) Entity and sentence representations are composed from their word vectors as in (Socher et al. 2013). All word vectors are initialized from a skip-gram (Mikolov et al. 2013) word embedding model pre-trained on a large English Wikipedia corpus. The dimension of all word embeddings is set to 100 for all experiments; 2) The dimension of all relation codes is set to 50. All relation codes are randomly initialized; 3) For network structures, we use ReLU as the nonlinear activation function and initialize all network parameters according to (Glorot and Bengio 2010). Since the number of training examples in most probabilistic reasoning tasks is relatively small, we adopt dropout (Hinton et al. 2012) during training to avoid over-fitting; 4) During the learning of NAMs, we need negative samples, which are automatically generated by randomly perturbing positive KB triples: $D^- = \{(e_i, r_k, e_\ell) \mid e_\ell \neq e_j \wedge (e_i, r_k, e_j) \in D^+\}$.
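The negative-sample construction $D^-$ above can be sketched as follows. This is a minimal illustration; the helper name and the resampling loop are our own, not from the paper.

```python
import random

def corrupt_tail(triple, entities, positive_set):
    """Generate one negative sample for a positive triple (e_i, r_k, e_j)
    by replacing the tail entity with a randomly chosen entity, per the
    D^- construction above. Resamples until the corrupted triple is not
    a known positive fact."""
    head, rel, tail = triple
    while True:
        new_tail = random.choice(entities)
        if new_tail != tail and (head, rel, new_tail) not in positive_set:
            return (head, rel, new_tail)
```

A usage example: `corrupt_tail(("cat", "IsA", "animal"), vocab, train_facts)` yields a triple with the same head and relation but a perturbed tail.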
For each task, we use the provided development set to tune the training hyperparameters. For example, we tested the number of hidden layers in { 1, 2, 3 }, the initial learning rate in { 0.01, 0.05, 0.1, 0.25, 0.5 }, and the dropout rate in { 0, 0.1, 0.2, 0.3, 0.4 }. We then select the best setting based on performance on the development set: the final model structure uses 2 hidden layers, and the learning rate and dropout rate are set to 0.1 and 0.2, respectively, for all experiments. During model training, the learning rate is halved whenever performance on the development set decreases. Both DNNs and RMNNs are trained using the stochastic gradient descent (SGD) algorithm. We observe that the NAM models converge quickly, after about 30 epochs.
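The training schedule described above (SGD with the learning rate halved once dev performance decreases) can be sketched as below. The `sgd_step` and `accuracy` methods are assumed interfaces for illustration, not the paper's actual code.

```python
def train_with_halving(model, train_data, dev_data, lr=0.1, epochs=30):
    """Run SGD for a fixed number of epochs, halving the learning rate
    whenever accuracy on the development set drops below the best seen."""
    best_dev = 0.0
    for _ in range(epochs):
        for batch in train_data:
            model.sgd_step(batch, lr)          # one SGD update at the current rate
        dev_acc = model.accuracy(dev_data)
        if dev_acc < best_dev:
            lr *= 0.5                          # halve the rate on dev degradation
        best_dev = max(best_dev, dev_acc)
    return model
```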
## Recognizing Textual Entailment
Understanding entailment and contradiction is fundamental to language understanding. Here we conduct experiments on a popular recognizing textual entailment (RTE) task, which aims to recognize the entailment relationship between a pair of English sentences. In this experiment, we use the SNLI dataset (Bowman et al. 2015) to conduct 2-class RTE experiments (entailment or contradiction). All instances not labelled as 'entailment' are treated as contradiction in our experiments. The SNLI dataset contains hundreds of thousands of training examples, ample for training a NAM model. Since this dataset does not include multi-relational data, we only investigate the DNN structure for this task. The final NAM result, along with the baseline performance reported in (Bowman et al. 2015), is listed in Table 2.
Table 2: Experimental results on the RTE task.
| Model | Accuracy (%) |
|----------------------------------------|----------------|
| Edit Distance (Bowman et al. 2015) | 71.9 |
| Classifier (Bowman et al. 2015) | 72.2 |
| Lexical Resources (Bowman et al. 2015) | 75.0 |
| DNN | 84.7 |
From the results, we can see that the proposed DNN-based NAM model achieves considerable improvements over the traditional methods. This indicates that we can better model entailment relationships in natural language by representing sentences in a continuous space and conducting probabilistic reasoning with deep neural networks.
## Triple classification in multi-relational KBs
In this section, we evaluate the proposed NAM models on two popular knowledge triple classification datasets, namely WN11 and FB13 from (Socher et al. 2013) (derived from WordNet and FreeBase), to predict whether new triples hold based on other training facts in the database. The WN11 dataset contains 38,696 unique entities involving 11 different relations in total, while the FB13 dataset covers 13 relations and 75,043 entities. Table 3 summarizes the statistics of these two datasets.
Table 3: The statistics for KBs triple classification datasets. #R is the number of relations. #Ent is the size of the entity set.
| Dataset | # R | # Ent | # Train | # Dev | # Test |
|-----------|-------|---------|-----------|---------|----------|
| WN11 | 11 | 38,696 | 112,581 | 2,609 | 10,544 |
| FB13 | 13 | 75,043 | 316,232 | 5,908 | 23,733 |
The goal of knowledge triple classification is to predict whether a given triple $x_n = (e_i, r_k, e_j)$ is correct or not. We first use the training data to learn the NAM models. Afterwards, we use the development set to tune a global threshold $T$ to make a binary decision: a triple is classified as true if $f(x_n; \Theta) \geq T$; otherwise it is false. The final accuracy is calculated as the proportion of triples in the test set that are classified correctly.
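Tuning the global threshold $T$ on the development set can be sketched as a simple sweep over candidate values; this is an illustrative implementation, not the paper's.

```python
def tune_threshold(scores, labels, candidates=None):
    """Pick the decision threshold T on the development set:
    a triple is classified true iff its model score >= T.
    Sweeps candidate thresholds and keeps the one with the
    highest dev accuracy."""
    if candidates is None:
        candidates = sorted(set(scores))   # every observed score is a candidate
    best_T, best_acc = 0.5, -1.0
    for T in candidates:
        acc = sum((s >= T) == y for s, y in zip(scores, labels)) / len(labels)
        if acc > best_acc:
            best_T, best_acc = T, acc
    return best_T
```

For example, with dev scores `[0.9, 0.8, 0.2, 0.1]` and labels `[True, True, False, False]`, the sweep selects `T = 0.8`, which classifies every dev triple correctly.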
Experimental results on the WN11 and FB13 datasets are given in Table 4, where we compare the two NAM models with all other methods reported on these two datasets. The results clearly show that the NAM methods (DNN and RMNN) achieve comparable performance on these triple classification tasks, and both yield consistent improvements over all existing methods. In particular, the RMNN model yields 3.7% and 1.9% absolute improvements over the popular neural tensor networks (NTN) (Socher et al. 2013) on WN11 and FB13, respectively. Both the DNN and RMNN models are much smaller than NTN in the number of parameters, and they scale well as the number of relation types increases. For example, the DNN and RMNN models for WN11 each have about 7.8 million parameters while NTN has about 15 million. Although the RESCAL and TransE models have about 4 million parameters for WN11, their size grows quickly for tasks with thousands of relation types or more. In addition, the training time of DNN and RMNN is much shorter than that of NTN or TransE since our models converge much faster. For example, we obtained at least a fivefold speedup over NTN on WN11.
## Commonsense Reasoning
Similar to the triple classification task (Socher et al. 2013), in this work we use the ConceptNet KB (Liu and Singh 2004) to construct a new commonsense dataset, named CN14 hereafter.
Table 4: Triple classification accuracy in WN11 and FB13.
| Model | WN11 | FB13 | Avg. |
|-----------------------------|--------|--------|--------|
| SME (Bordes et al. 2012) | 70.0 | 63.7 | 66.9 |
| TransE (Bordes et al. 2013) | 75.9 | 81.5 | 78.7 |
| TransH (Wang et al. 2014) | 78.8 | 83.3 | 81.1 |
| TransR (Lin et al. 2015) | 85.9 | 82.5 | 84.2 |
| NTN (Socher et al. 2013) | 86.2 | 90.0 | 88.1 |
| DNN | 89.3 | 91.5 | 90.4 |
| RMNN | 89.9 | 91.9 | 90.9 |
When building CN14, we first select all facts in ConceptNet related to 14 typical commonsense relations, e.g., UsedFor, CapableOf (see Figure 5 for all 14 relations). Then, we randomly divide the extracted facts into three sets: Train, Dev and Test. Finally, in order to create a test set for classification, we randomly replace entities in correct triples with other entities sampled from the whole vocabulary, yielding a total of 2 × #Test triples (half positive samples and half negative samples). The statistics of CN14 are given in Table 5.
Table 5: The statistics for the CN14 dataset.
| Dataset | # R | # Ent. | # Train | # Dev | # Test |
|-----------|-------|----------|-----------|---------|----------|
| CN14 | 14 | 159,135 | 200,198 | 5,000 | 10,000 |
The CN14 dataset is designed for answering commonsense questions like Is a camel capable of journeying across the desert? The proposed NAM models answer this question by calculating the association probability Pr(E2 | E1), where E1 = { camel, capable of } and E2 = journey across desert. In this paper, we compare the two NAM methods with the popular NTN method (Socher et al. 2013) on this dataset; the overall results are given in Table 6. Both NAM methods outperform NTN on this task, and the DNN and RMNN models obtain similar performance.
Table 6: Accuracy (in %) comparison on CN14.
| Model | Positive | Negative | Total |
|---------|------------|------------|---------|
| NTN | 82.7 | 86.5 | 84.6 |
| DNN | 84.5 | 86.9 | 85.7 |
| RMNN | 85.1 | 87.1 | 86.1 |
Furthermore, Figure 5 shows the classification accuracy of RMNN and NTN on each of the 14 relations in CN14. The accuracy of RMNN varies among relations, from 80.1% (Desires) to 93.5% (CreatedBy). We notice that some commonsense relations (such as Desires and CapableOf) are harder than others (like CreatedBy and CausesDesire). RMNN outperforms NTN on almost all relations.
## Knowledge Transfer Learning
Knowledge transfer between various domains is a characteristic feature and crucial cornerstone of human learning. In this section, we evaluate the proposed NAM models in a knowledge transfer learning scenario, where we adapt a pre-trained model to an unseen relation using only a few training samples from that relation. We randomly select a relation, CausesDesire in CN14, for this experiment. This relation has only 4,800 training samples and 480 test samples. We use the other 13 relations in CN14 to train the baseline NAM models (both DNN and RMNN). During transfer learning, we freeze all NAM parameters, including all weights and entity representations, and only learn a new relation code for CausesDesire from the given samples. Finally, the learned relation code (along with the original NAM model) is used to classify the samples of CausesDesire in the test set. Obviously, this transfer learning does not affect model performance on the original 13 relations because the models are unchanged. Figure 6 shows the results of knowledge transfer learning for CausesDesire as we gradually increase the number of training samples. The results show that RMNN performs much better than DNN in this experiment: RMNN improves significantly on the new relation with only 5-20% of the total training samples for CausesDesire. This demonstrates that connecting the relation code to all hidden layers leads to more effective learning of new relation codes from a relatively small number of training samples.
Figure 5: Accuracy of different relations in CN14.
Figure 6: Accuracy (in %) on the test set of a new relation CausesDesire, shown as a function of the number of training samples from CausesDesire used when updating the relation code only. (Accuracy on the original relations remains 85.7%.)
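The code-only adaptation step described above can be sketched as follows. Here `model_grad_fn` is an assumed interface standing in for backpropagation through the frozen RMNN; only the new relation code is updated.

```python
import numpy as np

def learn_new_relation_code(model_grad_fn, samples, code_dim=50, lr=0.1, epochs=30):
    """Transfer-learning sketch: all network weights and entity embeddings
    stay frozen; only the code of the unseen relation is estimated by SGD.
    model_grad_fn(code, sample) is assumed to return the gradient of the
    training loss w.r.t. the relation code, with the rest of the
    pre-trained model held fixed."""
    code = np.random.default_rng(0).normal(scale=0.1, size=code_dim)
    for _ in range(epochs):
        for sample in samples:
            code -= lr * model_grad_fn(code, sample)   # update the code only
    return code
```

Because the frozen parameters and the original relation codes are untouched, accuracy on the original relations is preserved by construction.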
Next, we also test a more aggressive strategy for this transfer learning setting, where we simultaneously update all network parameters while learning the new relation code. The results are shown in Figure 7. This strategy further improves performance on the new relation, especially as more training samples are added. However, as expected, performance on the original 13 relations deteriorates. The DNN reaches 94.6% on the new relation when all training samples are used, but its performance on the 13 original relations drops dramatically from 85.6% to 75.5%. Once again, RMNN shows an advantage over DNN in this transfer learning setting: its accuracy on the new relation increases from 77.9% to 90.8%, while its accuracy on the original 13 relations drops only slightly, from 85.9% to 82.0%.
Figure 7: Transfer learning results by updating all network parameters. The left figure shows results on the new relation while the right figure shows results on the original relations.
## Extending NAMs for Winograd Schema Data Collection
In the previous experiments, all tasks came with manually constructed training data. However, in many cases, obtaining training data for flexible commonsense reasoning under real-world conditions can itself be very challenging. More specifically, since the proposed neural association model is a typical deep learning technique, a lack of training data makes it difficult to train a robust model. Therefore, in this paper, we attempt to mine useful data for model training. As a first step, we are collecting cause-effect relationships between a set of common words and phrases. We believe this type of knowledge will be a key component for modeling the association relationships between discrete events.
This section describes our approach to automatic cause-effect pair collection as well as the data collection results. We first introduce the common vocabulary created for query generation. After that, we present the detailed algorithm for cause-effect pair collection, followed by the data collection results.
## Common Vocabulary and Query Generation
To avoid the data sparsity problem, we start by constructing a vocabulary of very common words. In our current investigation, the vocabulary contains 7,500 verbs and adjectives. As shown in Table 7, it includes 3,000 verb words, 2,000 verb phrases and 2,500 adjective words. The construction procedure is straightforward. We first extract all words and phrases (divided by part-of-speech tags) from WordNet (Miller 1995). After conducting part-of-speech tagging on a large corpus, we obtain the occurrence frequencies of all these words and phrases by scanning the tagged corpus. Finally, we sort the words and phrases by frequency and select the top N results.
Table 7: Common vocabulary constructed for mining cause-effect event pairs.
| Set | Category | Size |
|-------|-----------------|--------|
| 1 | Verb words | 3000 |
| 2 | Verb phrases | 2000 |
| 3 | Adjective words | 2500 |
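The frequency-based vocabulary selection described above can be sketched as below, under the assumption that the corpus has already been POS-tagged into (token, tag) pairs; the function name and interface are illustrative.

```python
from collections import Counter

def build_common_vocab(tagged_corpus, pos_tags, top_n):
    """Count occurrences of tokens carrying the desired part-of-speech
    tags in a POS-tagged corpus, then keep the top_n most frequent.
    tagged_corpus is an iterable of (token, tag) pairs."""
    counts = Counter(tok for tok, tag in tagged_corpus if tag in pos_tags)
    return [word for word, _ in counts.most_common(top_n)]
```

Running this once per category (verb words, verb phrases, adjective words) with the appropriate tag sets and top-N limits yields the three sets in Table 7.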
Query Generation Based on the common vocabulary, we generate search queries by pairing any two words (or phrases). Currently, we focus only on extracting association relationships between verbs and adjectives. Even for this small vocabulary, the search space is very large (7.5K by 7.5K yields tens of millions of pairs). In this work, we define several patterns for each word or phrase based on two popular semantic dimensions: 1) positive-negative, and 2) active-passive (Osgood 1952). Taking the verbs rob and arrest as an example, each of them has 4 patterns, i.e., (active, positive), (active, negative), (passive, positive) and (passive, negative). Therefore, the query formed by rob and arrest contains 16 possible dimensions, as shown in Figure 8. The task of mining the cause-effect relationship between any two words or phrases then becomes the task of counting the occurrences of all the possible links.
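The 16 dimensions of one query can be enumerated as follows. This is a simplified sketch: the surface templates here ignore verb morphology (e.g., they produce "be rob" rather than the inflected "be robbed"), which a real system would handle with a proper inflector.

```python
from itertools import product

def word_patterns(word):
    """The four surface patterns for a verb along the two semantic
    dimensions above: active/passive x positive/negative."""
    return {
        ("active", "positive"): word,
        ("active", "negative"): f"not {word}",
        ("passive", "positive"): f"be {word}",
        ("passive", "negative"): f"not be {word}",
    }

def query_dimensions(cause, effect):
    """All 16 ordered pattern pairs (4 x 4) for one (cause, effect) query."""
    return list(product(word_patterns(cause).values(),
                        word_patterns(effect).values()))
```

For the pair (rob, arrest), this enumerates the 16 links depicted in Figure 8, from ("rob", "arrest") through ("not be rob", "not be arrest").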
Figure 8: The 16 dimensions of a typical query.
<details>
<summary>Image 8 Details</summary>

### Visual Description
\n
## Diagram: Semantic Network of "Rob" and "Arrest"
### Overview
The image depicts a semantic network illustrating the associations between the words "rob" and "arrest". The network categorizes these associations based on valence (Positive/Negative) and voice (Active/Passive). The diagram uses colored rectangles to represent these associations, with lines connecting related concepts.
### Components/Axes
The diagram consists of two main nodes labeled "rob" (left) and "arrest" (right). Between these nodes are four columns of rectangles, each representing a specific combination of valence and voice:
* **Active, Positive:** (Red rectangles)
* **Active, Negative:** (Green rectangles)
* **Passive, Positive:** (Yellow rectangles)
* **Passive, Negative:** (Purple rectangles)
The title "Association Links" is positioned at the top-center of the diagram. Each rectangle contains a phrase related to the concepts of "rob" and "arrest" in the specified valence and voice. Lines connect each phrase in the "rob" column to each phrase in the "arrest" column, indicating an association.
### Detailed Analysis or Content Details
Here's a breakdown of the phrases within each category:
* **Active, Positive (Red):**
* "rob"
* "arrest"
* **Active, Negative (Green):**
* "hot rob"
* "not arrest"
* **Passive, Positive (Yellow):**
* "be robbed"
* "be arrested"
* **Passive, Negative (Purple):**
* "not be robbed"
* "not be arrested"
Each phrase in the "rob" column is connected to each phrase in the "arrest" column via a gray line. This creates a fully connected network between the two main nodes.
### Key Observations
The diagram demonstrates a comprehensive mapping of semantic relationships between "rob" and "arrest", considering both active and passive voice, as well as positive and negative connotations. The complete connectivity suggests that all combinations of valence and voice are considered relevant associations between the two concepts. The use of color-coding clearly distinguishes the different categories of associations.
### Interpretation
This diagram likely represents a model of semantic association used in computational linguistics or cognitive science. It illustrates how the meaning of words can be understood in relation to other words, taking into account grammatical structure (active/passive voice) and emotional tone (positive/negative valence). The complete connectivity suggests that the model aims to capture all possible relationships, even those that might seem counterintuitive (e.g., "hot rob" associated with "not arrest").
The diagram could be used to:
* **Analyze text:** Identify the semantic relationships between words in a given text.
* **Build knowledge graphs:** Create a network of concepts and their relationships.
* **Improve natural language processing:** Enhance the ability of computers to understand and generate human language.
The inclusion of "not rob" and "not arrest" suggests an attempt to capture negated expressions. The diagram is a visual representation of a complex semantic space, highlighting the multifaceted relationships between seemingly simple concepts. It is a theoretical model, not a depiction of empirical data.
</details>
## Automatic Cause-Effect Pair Collection
Based on the created queries, in this section we present the procedure for extracting cause-effect pairs from large unstructured texts. The overall system framework is shown in Figure 9.
Query Searching The goal of query searching is to find all sentences that may contain the input queries. Since the number of queries is very large, we store all queries in a hash map and conduct string matching during text scanning. In detail, the search program starts by
Figure 9: Automatic pair collection system framework.
<details>
<summary>Image 9 Details</summary>

### Visual Description
## Diagram: Information Retrieval Process Flow
### Overview
The image depicts a diagram illustrating a process flow for information retrieval. It shows how a query is processed through several stages, starting from a vocabulary and text corpus, and culminating in results. The diagram uses rectangular blocks to represent processing steps and cylindrical shapes to represent data stores. Arrows indicate the flow of information between these components.
### Components/Axes
The diagram consists of the following components:
* **Vocab:** A cylindrical data store labeled "Vocab".
* **Text corpus:** A rolled-up document representing a "Text corpus" data store.
* **Query Searching:** A rectangular block labeled "Query Searching".
* **Subject-Object Matching:** A rectangular block labeled "Subject-Object Matching".
* **Sentences:** A cylindrical data store labeled "Sentences".
* **Dependency Parsing:** A rectangular block labeled "Dependency Parsing".
* **Results:** A cylindrical data store labeled "Results".
Arrows connect these components, indicating the flow of data.
### Detailed Analysis or Content Details
The process flow can be described as follows:
1. The "Vocab" data store and the "Text corpus" data store feed into the "Query Searching" block.
2. "Query Searching" outputs to the "Subject-Object Matching" block.
3. The "Sentences" data store feeds into the "Dependency Parsing" block.
4. "Subject-Object Matching" and "Dependency Parsing" both feed into the "Results" data store.
There are no numerical values or scales present in the diagram. It is a conceptual representation of a process.
### Key Observations
The diagram highlights a pipeline architecture for information retrieval. The process involves searching a vocabulary and text corpus, identifying subject-object relationships, parsing dependencies, and ultimately generating results. The parallel input to "Results" suggests that both subject-object matching and dependency parsing contribute to the final output.
### Interpretation
This diagram illustrates a common approach to information retrieval, particularly in the context of knowledge graphs or semantic search. The "Vocab" likely represents a controlled vocabulary or ontology used to standardize terms. The "Text corpus" is the source of information. "Query Searching" identifies relevant text based on a user's query. "Subject-Object Matching" extracts relationships between entities in the text. "Dependency Parsing" analyzes the grammatical structure of sentences to understand the relationships between words. Finally, "Results" presents the retrieved information to the user.
The diagram suggests a system designed to understand the *meaning* of text, not just find keywords. The inclusion of "Dependency Parsing" indicates a focus on semantic analysis. The parallel paths to "Results" suggest that both relational information (subject-object) and grammatical structure contribute to the final output. This is a typical architecture for a system that aims to provide more than just keyword-based search results.
</details>
conducting lemmatization, part-of-speech tagging and dependency parsing on the source corpus. It then scans the corpus from beginning to end; for each sentence, it looks up the words (or phrases) in the hash map to find matches. This strategy keeps the search complexity linear in the size of the corpus, which proved very efficient in our experiments.
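A minimal sketch of this hash-map lookup during a single linear corpus scan, assuming a pre-lemmatized corpus (the function and variable names are illustrative, not from our implementation):

```python
from collections import defaultdict

def build_query_index(queries):
    """Map each phrase to the queries it appears in, so one linear corpus
    scan can match every query at once (one O(1) lookup per token)."""
    index = defaultdict(list)
    for query in queries:
        for phrase in query:
            index[phrase].append(query)
    return index

def scan_corpus(lemmatized_sentences, index):
    """Single pass over the (pre-lemmatized) corpus: collect every
    sentence that contains at least one query phrase."""
    hits = []
    for sent in lemmatized_sentences:
        matched = sorted({tok for tok in sent.split() if tok in index})
        if matched:
            hits.append((sent, matched))
    return hits

queries = [("arrest", "rob")]
corpus = ["tom be arrest because tom rob the man",
          "mary go home"]
hits = scan_corpus(corpus, build_query_index(queries))
# hits → [("tom be arrest because tom rob the man", ["arrest", "rob"])]
```

Because every token lookup hits the hash map directly, the cost grows with the corpus size rather than with the number of queries.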
Subject-Object Matching Based on the dependency parsing results, once we find one phrase of a query, we check whether that phrase is associated with at least one subject or object in the corresponding sentence. At the same time, we record whether the phrase is positive or negative, active or passive. Moreover, to help decide the cause-effect relationship, we check whether the phrase is linked to a connective word; the typical connective words used in this work are because and if . To extract the cause-effect pairs, we design a simple subject-object matching rule, similar to the work of (Peng, Khashabi, and Roth 2015): 1) if the two phrases in one query share the same subject , the relationship between them is straightforward; 2) if the subject of one phrase is the object of the other phrase, we apply the passive pattern to the phrase related to the object . Take the query ( arrest , rob ) as an example. Once we find the sentence 'Tom was arrested because Tom robbed the man' , we obtain its dependency parsing result as shown in Figure 10. Since the verbs arrest and rob share the same subject and the pattern for arrest is passive, we increment the count of the corresponding association link, i.e., the link from the (active, positive) pattern of rob to the (passive, positive) pattern of arrest , by 1.
Figure 10: Dependency parsing result of sentence 'Tom was arrested because Tom robbed the man' .
<details>
<summary>Image 10 Details</summary>

### Visual Description
Icon/Small Image (393x59)
</details>
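As a hedged sketch, the two matching rules above might be implemented as follows; the dict fields (`subject`, `object`, `pattern`) are illustrative stand-ins for the dependency-parse output, not names from our system:

```python
def match_patterns(cause, effect):
    """Apply the two subject-object matching rules to a (cause, effect)
    phrase pair found in one sentence. Each argument carries the verb's
    subject, object, and base pattern (voice, polarity)."""
    if cause["subject"] == effect["subject"]:
        # Rule 1: shared subject -- keep each phrase's own pattern.
        return cause["pattern"], effect["pattern"]
    if cause["subject"] == effect.get("object"):
        # Rule 2: the cause's subject is the effect's object -- flip the
        # effect phrase (the one related to the object) to its passive form.
        _voice, polarity = effect["pattern"]
        return cause["pattern"], ("passive", polarity)
    return None  # no association link is recorded for this pair

# 'Tom was arrested because Tom robbed the man'
rob = {"subject": "Tom", "object": "man", "pattern": ("active", "positive")}
arrest = {"subject": "Tom", "object": None, "pattern": ("passive", "positive")}
print(match_patterns(rob, arrest))
# → (('active', 'positive'), ('passive', 'positive'))
```

On the example sentence, rule 1 fires (both verbs share the subject Tom), so the link from the (active, positive) pattern of rob to the (passive, positive) pattern of arrest is incremented, matching the text above.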
## Data Collection Results
Table 8 shows the corpora we used for collecting the cause-effect pairs and the corresponding data collection results. In total, we extract approximately 254,000 pairs from the different corpora.
Table 8: Data collection results on different corpora.
| Corpus | # Result pairs |
|------------------------------|------------------|
| Gigaword (Graff et al. 2003) | 117,938 |
| Novels (Zhu et al. 2015) | 129,824 |
| CBTest (Hill et al. 2015) | 4,167 |
| BNC (Burnard 1995) | 2,128 |
## Winograd Schema Challenge
Based on all the experiments described in the previous sections, we conclude that the neural association model has the potential to be effective for commonsense reasoning. To further evaluate the proposed neural association model, we conduct experiments on the challenging Winograd Schema problems (Levesque, Davis, and Morgenstern 2011; Morgenstern, Davis, and Ortiz Jr 2016). The Winograd Schema is a commonsense reasoning task proposed in recent years, which has been treated as an alternative to the Turing Test (Turing 1950). Since this is a new AI task, it is very interesting to see whether neural network methods are suitable for solving it. This section describes the progress we have made in attempting to meet the Winograd Schema Challenge. To make clear what the main task of the Winograd Schema is, we first introduce it at a high level. Afterwards, we introduce the system framework as well as the corresponding modules we propose for automatically solving Winograd Schema problems. Finally, we present experiments and discussions on a human-annotated cause-effect dataset.
## Winograd Schema
The Winograd Schema (WS) evaluates a system's commonsense reasoning ability based on a traditional, very difficult natural language processing task: coreference resolution (Levesque, Davis, and Morgenstern 2011; Saba 2015). Winograd Schema problems are carefully designed so that they cannot be easily solved without commonsense knowledge. In fact, even the solution of traditional coreference resolution problems relies on semantics or world knowledge (Rahman and Ng 2011; Strube 2016). To describe the WS in detail, we quote from (Levesque, Davis, and Morgenstern 2011): a WS is a small reading comprehension test involving a single binary question. Here are two examples:
- The trophy would not fit in the brown suitcase because it was too big. What was too big?
  - Answer 0: the trophy
  - Answer 1: the suitcase
- Joan made sure to thank Susan for all the help she had given. Who had given the help?
  - Answer 0: Joan
  - Answer 1: Susan
The correct answers here are obvious for human beings. In each of the questions, the corresponding WS has the following four features:
1. Two parties are mentioned in a sentence by noun phrases. They can be two males, two females, two inanimate objects or two groups of people or objects.
2. A pronoun or possessive adjective is used in the sentence in reference to one of the parties, but it is also of the right sort for the second party. In the case of males, it is 'he/him/his'; for females, it is 'she/her/her'; for inanimate objects, it is 'it/it/its'; and for groups, it is 'they/them/their.'
3. The question involves determining the referent of the pronoun or possessive adjective. Answer 0 is always the first party mentioned in the sentence (but repeated from the sentence for clarity), and Answer 1 is the second party.
4. There is a word (called the special word) that appears in the sentence and possibly the question. When it is replaced by another word (called the alternate word), everything still makes perfect sense, but the answer changes.
Solving WS problems is not easy since the required commonsense knowledge is quite difficult to collect. In the following sections, we are going to describe our work on solving the Winograd Schema problems via neural network methods.
## System Framework
In this paper, we propose that the commonsense knowledge required by many Winograd Schema problems can be formalized as association relationships between discrete events. Take the sentence ' Joan made sure to thank Susan for all the help she had given ' as an example: the commonsense knowledge is that the person who receives help should thank the person who gives it. We believe that by modeling the association between the events receive help and thank , and between give help and thank , we can make the decision by comparing the association probabilities Pr( thank | receive help ) and Pr( thank | give help ) . If the models are well trained, we should obtain the inequality Pr( thank | receive help ) > Pr( thank | give help ) . Following this idea, we utilize the data constructed in the previous section and extend the NAM models to solve WS problems. We design two frameworks for training the NAM models:
- TransMat -NAM: We apply four linear transformation matrices, i.e., matrices for the (active, positive), (active, negative), (passive, positive) and (passive, negative) patterns, to transform both the cause event and the effect event. We then use a NAM to model the cause-effect association between any cause and effect events.
- RelationVec -NAM: In this configuration, we instead treat the 16 typical dimensions shown in Figure 8 as distinct relations, so that there are 16 relation vectors
Figure 11: The model framework for TransMat -NAM.
<details>
<summary>Image 11 Details</summary>

### Visual Description
## Diagram: Neural Association Model Flow
### Overview
The image depicts a simplified diagram illustrating the flow of information through a "Neural Association Model". It shows a linear process starting with a "cause", undergoing two "Transform" stages, and resulting in an "effect". The diagram uses shapes to represent different stages, with arrows indicating the direction of flow.
### Components/Axes
The diagram consists of the following components:
* **Cause:** Represented by a grey circle on the left.
* **Transform (1):** Represented by a yellow rectangle connected to the "cause".
* **Neural Association Model:** Represented by a large red rectangle in the center.
* **Transform (2):** Represented by a yellow rectangle connected to the "Neural Association Model".
* **Effect:** Represented by a grey circle on the right.
* **Arrows:** Lines connecting the components, indicating the direction of information flow.
There are no axes or scales present in this diagram.
### Detailed Analysis or Content Details
The diagram shows a sequential process:
1. A "cause" initiates the process.
2. The "cause" is fed into the first "Transform" stage.
3. The output of the first "Transform" is input into the "Neural Association Model".
4. The output of the "Neural Association Model" is fed into the second "Transform" stage.
5. The output of the second "Transform" results in an "effect".
The diagram does not provide any numerical data or specific details about the transformations or the model itself. It is a high-level conceptual representation.
### Key Observations
The diagram emphasizes a linear, sequential flow of information. The "Neural Association Model" is positioned as the central processing unit, receiving input from the first transformation and providing output to the second. The use of distinct shapes and colors helps to visually differentiate the components.
### Interpretation
This diagram illustrates a simplified model of how a neural association model might process information. The "cause" represents an initial stimulus or input, and the "effect" represents the outcome or result. The "Transform" stages likely represent pre-processing or post-processing steps that prepare the input for the model and interpret the output, respectively. The "Neural Association Model" itself is the core component where the association and processing of information occur.
The diagram suggests a functional relationship between cause and effect, mediated by the neural association model and transformation processes. It is a conceptual illustration and does not provide details about the specific algorithms or mechanisms used within the model or the nature of the transformations. The diagram is useful for understanding the overall flow of information but lacks the detail needed for a technical implementation.
</details>
in the corresponding NAM models. Currently we use the RMNN structure for NAM.
Figure 12: The model framework for RelationVec -NAM.
<details>
<summary>Image 12 Details</summary>

### Visual Description
## Diagram: Neural Association Model
### Overview
The image depicts a diagram illustrating a Neural Association Model. It shows a central red square labeled "Neural Association Model" with incoming arrows from "cause" and "relation" and an outgoing arrow to "effect". The diagram visually represents a process where a cause and a relation feed into a model, resulting in an effect.
### Components/Axes
The diagram consists of four labeled components:
* **Neural Association Model:** A red square positioned centrally.
* **cause:** An oval shape positioned on the bottom-left.
* **relation:** A green oval shape positioned on the top-left.
* **effect:** An oval shape positioned on the bottom-right.
Arrows connect these components, indicating the flow of information or influence.
### Detailed Analysis or Content Details
The diagram shows the following relationships:
* An arrow originates from the "cause" oval and points towards the "Neural Association Model" square.
* An arrow originates from the "relation" oval and points towards the "Neural Association Model" square.
* An arrow originates from the "Neural Association Model" square and points towards the "effect" oval.
The diagram does not contain any numerical data or scales. It is a conceptual representation of a process.
### Key Observations
The diagram highlights the central role of the "Neural Association Model" in mediating the relationship between "cause", "relation", and "effect". The arrows suggest a directional flow, implying that the cause and relation influence the model, which in turn produces an effect.
### Interpretation
The diagram illustrates a simplified model of how neural associations might work. It suggests that an effect is not solely determined by a single cause, but also by the relationship between the cause and other factors. The "Neural Association Model" acts as a processing unit that integrates the cause and relation to generate the effect. This could represent a cognitive process, a machine learning algorithm, or any system where associations between inputs lead to outputs. The diagram is a high-level conceptual representation and does not provide details about the internal workings of the model or the nature of the cause, relation, or effect. It is a visual metaphor for a complex process.
</details>
Training the NAM models under these two configurations is straightforward. All the network parameters, including the relation vectors and the linear transformation matrices, are learned by the standard stochastic gradient descent algorithm.
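For concreteness, a toy forward pass in the spirit of the RelationVec -NAM (RMNN) configuration might look as follows; the layer sizes, random initialization, and scoring the candidate effect by a sigmoid of a dot product are assumptions for illustration, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50  # toy size; the experiments use 50-dim embeddings and 100-dim hidden layers

# Toy parameters for a 2-hidden-layer RMNN-style scorer. The defining
# feature is that the relation vector feeds into *every* hidden layer
# (via C1, C2), not just the input layer.
W1, W2 = rng.normal(0, 0.1, (dim, dim)), rng.normal(0, 0.1, (dim, dim))
C1, C2 = rng.normal(0, 0.1, (dim, dim)), rng.normal(0, 0.1, (dim, dim))

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def associate(cause, relation, effect):
    """Score Pr(effect | cause, relation): run the cause embedding through
    the relation-modulated layers, then squash the dot product with the
    candidate effect embedding into a probability."""
    h1 = relu(cause @ W1 + relation @ C1)
    h2 = relu(h1 @ W2 + relation @ C2)
    return sigmoid(h2 @ effect)

cause, rel, effect = (rng.normal(0, 0.1, dim) for _ in range(3))
p = associate(cause, rel, effect)  # a probability in (0, 1)
```

All of `W1`, `W2`, `C1`, `C2`, the relation vectors and the event embeddings would be updated jointly by stochastic gradient descent, as described above.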
## Experiments
In this section, we introduce our current experiments on solving the Winograd Schema problems. We first describe a cause-effect dataset constructed from the standard WS dataset. Subsequently, the experimental setup is described in detail. After presenting the experimental results, we conclude the section with a discussion.
Cause-Effect Dataset Labelling In this paper, based on the WS dataset available at http://www.cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html , we labelled 78 cause-effect problems among all 278 available WS questions for our experiments. Table 9 shows some typical examples. For each WS problem, we label three verb (or adjective) phrases for the corresponding two parties and the pronoun. For the labelled phrases, we also record the corresponding pattern of each word. Take the word lift as an example: we generate lift for its active, positive pattern; not lift for its active, negative pattern; be lifted for its passive, positive pattern; and not be lifted for its passive, negative pattern. For instance, in the sentence ' The man couldn't lift his son because he was so weak ', we identify weak , not lift and not be lifted for he , the man and his son , respectively. The commonsense knowledge is that somebody who is weak is more likely to have the effect not lift rather than not be lifted . The main task of the NAM in solving this problem is to calculate the association probabilities between these phrases.
Experimental setup The setup of the NAM for this cause-effect task is similar to the settings used in the previous tasks. To represent phrases in the neural association models, we use the bag-of-words (BOW) approach to compose phrases from pre-trained word vectors. Since the vocabulary we use in this experiment contains only 7,500 common verbs and adjectives, some phrases contain out-of-vocabulary (OOV) words. Under the BOW method, a phrase becomes useless if all of its words are OOV. In this paper, we remove all testing samples with useless phrases, which leaves 70 testing cause-effect samples. For the network settings, we set the embedding size to 50 and the
Table 9: Examples of the Cause-Effect dataset labelled from the Winograd Schema Challenge.
| Schema texts | Verb/Adjective 1 | Verb/Adjective 2 | Verb/Adjective 3 |
|---------------------------------------------------------------------------------|--------------------|--------------------|--------------------|
| The man couldn't lift his son because he was so weak | weak | not lift | not be lifted |
| The man couldn't lift his son because he was so heavy | heavy | not lift | not be lifted |
| The fish ate the worm. it was tasty | tasty | eat | be eaten |
| The fish ate the worm. it was hungry | hungry | eat | be eaten |
| Mary tucked her daughter Anne into bed, so that she could sleep | tuck into bed | be tucked into bed | sleep |
| Mary tucked her daughter Anne into bed, so that she could work | tuck into bed | be tucked into bed | work |
| Tom threw his schoolbag down to Ray after he reached the top of the stairs | reach top | throw down | be thrown down |
| Tom threw his schoolbag down to Ray after he reached the bottom of the stairs | reach bottom | throw down | be thrown down |
| Jackson was greatly influenced by Arnold, though he lived two centuries earlier | live earlier | influence | be influenced |
| Jackson was greatly influenced by Arnold, though he lived two centuries later | live later | influence | be influenced |
dimension of the relation vectors to 50. We use 2 hidden layers in the NAM models, and all hidden layer sizes are set to 100. The learning rate is set to 0.01 for all experiments. At the same time, to better control model training, we set the learning rate for the embedding matrices and the relation vectors to 0.025.
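The BOW phrase composition and OOV filtering described above can be sketched as follows; averaging the in-vocabulary word vectors is one common BOW choice (the text does not specify sum vs. average), and the toy 2-dimensional vocabulary is purely illustrative:

```python
import numpy as np

def phrase_vector(phrase, word_vectors):
    """Bag-of-words phrase embedding: average the vectors of the
    in-vocabulary words; return None when every word is OOV (such
    'useless' phrases cause the test sample to be dropped)."""
    vecs = [word_vectors[w] for w in phrase.split() if w in word_vectors]
    if not vecs:
        return None  # useless phrase: all words are OOV
    return np.mean(vecs, axis=0)

vocab = {"not": np.array([1.0, 0.0]), "lift": np.array([0.0, 1.0])}
print(phrase_vector("not lift", vocab))    # → [0.5 0.5]
print(phrase_vector("be helped", vocab))   # → None (both words OOV)
```

Filtering samples whose phrases all come back as `None` is what reduces the labelled set from 78 to 70 testing cause-effect samples.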
Negative sampling is very important for model training on this task. In the TransMat -NAM system, we generate negative samples by randomly selecting patterns that differ from the pattern of the effect event in the positive samples. For example, if the positive training sample is ' hungry (active, positive) causes eat (active, positive) ', we may generate negative samples like ' hungry (active, positive) causes eat (passive, positive) ' or ' hungry (active, positive) causes eat (active, negative) '. In the RelationVec -NAM system, the negative sampling method is more straightforward: we randomly select a different effect event from the whole vocabulary. In the example shown here, possible negative samples would be ' hungry (active, positive) causes happy (active, positive) ', ' hungry (active, positive) causes talk (active, positive) ', and so on.
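The two negative-sampling schemes above can be sketched as follows (a minimal sketch; the sample counts and helper names are illustrative):

```python
import random

# the four (voice, polarity) patterns used throughout this work
PATTERNS = [(v, p) for v in ("active", "passive") for p in ("positive", "negative")]

def negatives_transmat(cause, effect, k=2, rng=random.Random(0)):
    """TransMat-NAM style: keep the effect verb, corrupt its pattern."""
    verb, pattern = effect
    wrong = [p for p in PATTERNS if p != pattern]
    return [(cause, (verb, p)) for p in rng.sample(wrong, k)]

def negatives_relationvec(cause, effect, vocab, k=2, rng=random.Random(0)):
    """RelationVec-NAM style: keep the pattern, replace the effect verb
    with a random different verb from the vocabulary."""
    verb, pattern = effect
    others = [v for v in vocab if v != verb]
    return [(cause, (v, pattern)) for v in rng.sample(others, k)]

pos_cause = ("hungry", ("active", "positive"))
pos_effect = ("eat", ("active", "positive"))
negs = negatives_transmat(pos_cause, pos_effect)
# every TransMat negative keeps the verb 'eat' but with a different (voice, polarity)
```

The TransMat scheme teaches the model to discriminate patterns of the same verb, while the RelationVec scheme teaches it to discriminate among effect events themselves.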
Results The experimental results are shown in Table 10. From the results, we find that the proposed NAM models achieve about 60% accuracy on the cause-effect dataset constructed from the Winograd Schemas. More specifically, the RelationVec -NAM system performs slightly better than the TransMat -NAM system.
Table 10: Results of NAMs on the Winograd Schema CauseEffect dataset.
| Model | Accuracy (%) |
|------------------|----------------|
| TransMat -NAM | 58.6 (41 / 70) |
| RelationVec -NAM | 61.4 (43 / 70) |
In the testing results, we find that the NAM performs well on some examples. For instance, in the phone-call scenario, the proposed NAM generates the following association probabilities:
- Paul tried to call George on the phone, but he wasn't successful. Who was not successful?
  - Paul: Pr( not successful | call ) = 0.7299
  - George: Pr( not successful | be called ) = 0.5430
  - Answer: Paul
- Paul tried to call George on the phone, but he wasn't available. Who was not available?
  - Paul: Pr( not available | call ) = 0.6859
  - George: Pr( not available | be called ) = 0.8306
  - Answer: George
For these testing examples, our model answers the questions by correctly calculating the association probabilities. The probability Pr( not successful | call ) is larger than Pr( not successful | be called ) , while Pr( not available | call ) is smaller than Pr( not available | be called ) . These simple inequalities between the association probabilities agree well with commonsense. Here are some more examples:
- Jim yelled at Kevin because he was so upset. Who was upset?
  - Jim: Pr( yell | be upset ) = 0.9296
  - Kevin: Pr( be yelled | be upset ) = 0.8785
  - Answer: Jim
- Jim comforted Kevin because he was so upset. Who was upset?
  - Jim: Pr( comfort | be upset ) = 0.0282
  - Kevin: Pr( be comforted | be upset ) = 0.5657
  - Answer: Kevin
This example also conveys commonsense knowledge from daily life. We all know that somebody who is upset is more likely to yell at other people; meanwhile, it is also likely that they would be comforted by other people.
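The decision procedure in all of these examples is simply an argmax over association probabilities. A minimal sketch (the probability table below just echoes the scores reported above; the function names are illustrative):

```python
def resolve(candidates, effect_phrase, prob):
    """Answer a WS question by picking the candidate whose event assigns
    the highest association probability to the effect phrase.
    `candidates` is a list of (name, event_phrase) pairs and `prob` is
    a trained model's Pr(effect | event) scorer."""
    return max(candidates, key=lambda cand: prob(cand[1], effect_phrase))[0]

# Toy probability table echoing the scores from the phone-call example.
table = {("call", "not successful"): 0.7299,
         ("be called", "not successful"): 0.5430}
winner = resolve([("Paul", "call"), ("George", "be called")],
                 "not successful",
                 lambda event, effect: table[(event, effect)])
# → "Paul"
```

With a trained NAM in place of the toy table, the same two-line comparison resolves each labelled cause-effect problem.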
## Conclusions
In this paper, we have proposed neural association models (NAM) for probabilistic reasoning. We use neural networks to model association probabilities between any two events in a domain. In this work, we have investigated two model structures, namely DNN and RMNN, for NAMs. Experimental results on several reasoning tasks have shown that both DNNs and RMNNs can outperform the existing methods. This paper also reports some preliminary results on using NAMs for knowledge transfer learning. We have found that the proposed RMNN model can be quickly adapted to a new relation without sacrificing performance on the original relations. After demonstrating the effectiveness of the NAM models, we apply them to a more complex commonsense reasoning problem, the Winograd Schema challenge (Levesque, Davis, and Morgenstern 2011). To support model training for this task, we propose a straightforward method to collect associative phrase pairs from text corpora. Experiments conducted on a set of Winograd Schema problems indicate that the neural association model does solve some problems successfully. However, there is still a long way to go towards fully automatic commonsense reasoning.
## Acknowledgments
We want to thank Prof. Gary Marcus of New York University for his useful comments on commonsense reasoning. We also want to thank Prof. Ernest Davis, Dr. Leora Morgenstern and Dr. Charles Ortiz for their excellent work in organizing the first Winograd Schema Challenge. This paper was supported in part by the Science and Technology Development of Anhui Province, China (Grant No. 2014z02006), the Fundamental Research Funds for the Central Universities (Grant No. WK2350000001) and the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDB02070006).
## References
- [Bengio et al. 2003] Bengio, Y.; Ducharme, R.; Vincent, P.; and Janvin, C. 2003. A neural probabilistic language model. The Journal of Machine Learning Research 3:1137-1155.
- [Bordes et al. 2012] Bordes, A.; Glorot, X.; Weston, J.; and Bengio, Y. 2012. Joint learning of words and meaning representations for open-text semantic parsing. In Proceedings of AISTATS , 127-135.
- [Bordes et al. 2013] Bordes, A.; Usunier, N.; Garcia-Duran, A.; Weston, J.; and Yakhnenko, O. 2013. Translating embeddings for modeling multi-relational data. In Proceedings of NIPS , 2787-2795.
- [Bowman et al. 2015] Bowman, S. R.; Angeli, G.; Potts, C.; and Manning, C. D. 2015. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326 .
- [Bowman 2013] Bowman, S. R. 2013. Can recursive neural tensor networks learn logical reasoning? arXiv preprint arXiv:1312.6192 .
- [Burnard 1995] Burnard, L. 1995. Users Reference Guide, British National Corpus Version 1.0.
- [Collobert et al. 2011] Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; and Kuksa, P. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research 12:2493-2537.
- [Getoor 2007] Getoor, L. 2007. Introduction to statistical relational learning . MIT press.
- [Glorot and Bengio 2010] Glorot, X., and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of AISTATS , 249-256.
- [Graff et al. 2003] Graff, D.; Kong, J.; Chen, K.; and Maeda, K. 2003. English gigaword. Linguistic Data Consortium, Philadelphia .
- [Hill et al. 2015] Hill, F.; Bordes, A.; Chopra, S.; and Weston, J. 2015. The goldilocks principle: Reading children's books with explicit memory representations. arXiv preprint arXiv:1511.02301 .
- [Hinton et al. 2012] Hinton, G. E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. R. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 .
- [Hornik, Stinchcombe, and White 1990] Hornik, K.; Stinchcombe, M.; and White, H. 1990. Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural networks 3(5):551-560.
- [Jensen 1996] Jensen, F. V. 1996. An introduction to Bayesian networks , volume 210. UCL press London.
- [Koller and Friedman 2009] Koller, D., and Friedman, N. 2009. Probabilistic graphical models: principles and techniques . MIT press.
- [LeCun, Bengio, and Hinton 2015] LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. Nature 521(7553):436-444.
- [Levesque, Davis, and Morgenstern 2011] Levesque, H. J.; Davis, E.; and Morgenstern, L. 2011. The winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning .
- [Lin et al. 2015] Lin, Y.; Liu, Z.; Sun, M.; Liu, Y.; and Zhu, X. 2015. Learning entity and relation embeddings for knowledge graph completion. In Proceedings of AAAI .
- [Liu and Singh 2004] Liu, H., and Singh, P. 2004. Conceptnet: a practical commonsense reasoning toolkit. BT technology journal 22(4):211-226.
- [McCarthy 1986] McCarthy, J. 1986. Applications of circumscription to formalizing common-sense knowledge. Artificial Intelligence 28(1):89-116.
- [Mikolov et al. 2013] Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 .
- [Miller 1995] Miller, G. A. 1995. Wordnet: a lexical database for english. Communications of the ACM 38(11):39-41.
- [Minsky 1988] Minsky, M. 1988. Society of mind . Simon and Schuster.
- [Morgenstern, Davis, and Ortiz Jr 2016] Morgenstern, L.; Davis, E.; and Ortiz Jr, C. L. 2016. Planning, executing, and evaluating the winograd schema challenge. AI Magazine 37(1):50-54.
- [Mueller 2014] Mueller, E. T. 2014. Commonsense Reasoning: An Event Calculus Based Approach . Morgan Kaufmann.
- [Nair and Hinton 2010] Nair, V., and Hinton, G. E. 2010. Rectified linear units improve restricted boltzmann machines. In Proceedings of ICML , 807-814.
- [Neapolitan 2012] Neapolitan, R. E. 2012. Probabilistic reasoning in expert systems: theory and algorithms . CreateSpace Independent Publishing Platform.
- [Nickel et al. 2015] Nickel, M.; Murphy, K.; Tresp, V.; and Gabrilovich, E. 2015. A review of relational machine learning for knowledge graphs. arXiv preprint arXiv:1503.00759 .
- [Nickel, Tresp, and Kriegel 2012] Nickel, M.; Tresp, V.; and Kriegel, H.-P. 2012. Factorizing YAGO: scalable machine learning for linked data. In Proceedings of WWW , 271-280. ACM.
- [Osgood 1952] Osgood, C. E. 1952. The nature and measurement of meaning. Psychological bulletin 49(3):197.
- [Pearl 1988] Pearl, J. 1988. Probabilistic reasoning in intelligent systems: Networks of plausible reasoning.
- [Peng, Khashabi, and Roth 2015] Peng, H.; Khashabi, D.; and Roth, D. 2015. Solving hard coreference problems. Urbana 51:61801.
- [Rahman and Ng 2011] Rahman, A., and Ng, V. 2011. Coreference resolution with world knowledge. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1 , 814-824. Association for Computational Linguistics.
- [Richardson and Domingos 2006] Richardson, M., and Domingos, P. 2006. Markov logic networks. Machine learning 62(1-2):107-136.
- [Saba 2015] Saba, W. 2015. On the winograd schema challenge.
- [Socher et al. 2013] Socher, R.; Chen, D.; Manning, C. D.; and Ng, A. 2013. Reasoning with neural tensor networks for knowledge base completion. In Proceedings of NIPS , 926-934.
- [Strube 2016] Strube, M. 2016. The (non) utility of semantics for coreference resolution (corbon remix). In NAACL 2016 workshop on Coreference Resolution Beyond OntoNotes .
- [Turing 1950] Turing, A. M. 1950. Computing machinery and intelligence. Mind 59(236):433-460.
- [Wang et al. 2014] Wang, Z.; Zhang, J.; Feng, J.; and Chen, Z. 2014. Knowledge graph embedding by translating on hyperplanes. In Proceedings of AAAI , 1112-1119. Citeseer.
- [Xue et al. 2014] Xue, S.; Abdel-Hamid, O.; Jiang, H.; Dai, L.; and Liu, Q. 2014. Fast adaptation of deep neural network based on discriminant codes for speech recognition. Audio, Speech, and Language Processing, IEEE/ACM Trans. on 22(12):1713-1725.
- [Zhu et al. 2015] Zhu, Y.; Kiros, R.; Zemel, R.; Salakhutdinov, R.; Urtasun, R.; Torralba, A.; and Fidler, S. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision , 19-27.