# Inferring Past Human Actions in Homes with Abductive Reasoning
Abstract
Abductive reasoning aims to make the most likely inference for a given set of incomplete observations. In this paper, we introduce "Abductive Past Action Inference", a novel research task aimed at identifying the past actions performed by individuals within homes to reach specific states captured in a single image, using abductive inference. The research explores three key abductive inference problems: past action set prediction, past action sequence prediction, and abductive past action verification. We introduce several models tailored for abductive past action inference, including a relational graph neural network, a relational bilinear pooling model, and a relational transformer model. Notably, the newly proposed object-relational bilinear graph encoder-decoder (BiGED) model emerges as the most effective among all methods evaluated, demonstrating strong performance in handling the intricacies of the Action Genome dataset. The contributions of this research significantly advance the ability of deep learning models to reason about current scene evidence and make highly plausible inferences about past human actions. This advancement enables a deeper understanding of events and behaviors, which can enhance decision-making and improve system capabilities across various real-world applications such as Human-Robot Interaction and Elderly Care and Health Monitoring. Code and data are available at https://github.com/LUNAProject22/AAR
1 Introduction
Reasoning is an inherent part of human intelligence, as it allows us to draw conclusions and construct explanations from existing knowledge when dealing with an uncertain scenario. One of the reasoning abilities that humans possess is abductive reasoning, which aims to infer the most compelling explanation for a given set of observed facts based on a logical theory. In this work, we study the new problem of inferring past human actions from visual information using abductive inference. Abduction is an extremely useful tool in our daily life, as we often rely on a set of facts to form the most probable conclusion. In fact, a comprehensive understanding of a situation requires considering both past and future information. The ability to perform abductive reasoning about past human actions is vital for human-robot collaboration, AI-assisted accident and crime investigation, and assistive robotics. Furthermore, robots working in dynamic environments benefit from understanding previous human actions to better anticipate future behaviour or adapt their actions accordingly. Imagine a scenario where a rescue robot enters an elderly person's house to check why he or she is not responding to the routine automated phone call. Upon entering the house, the robot observes its surroundings and notices that the back door is left open but nothing else is out of the ordinary. These observations may lead a rational agent to conclude that the elderly person might have opened the door and gone into the garden, so the robot can immediately make its way to search for him or her there. This example illustrates how a social rescue robot can utilize observed facts from the scene to infer past human actions, thereby reasoning about the individual's whereabouts and ensuring their safety through abductive reasoning.
In recent years, there have been notable efforts in abductive reasoning for computer vision [25, 48, 19]. In particular, [25] generates descriptions of the hypothesis and the premises in natural language given a video snapshot of events. Without the generation of a hypothesis description, these methods boil down to dense video captioning. A similar task is also presented in [19], where, given an image, the model must perform logical abduction to explain the observation in natural language. The use of natural language queries in these tasks presents challenges related to language understanding, making the abductive reasoning task more complicated.
In contrast to these recent works, we challenge a model to infer multiple human actions that may have occurred in the past from a given image. Based on the visual information in the image, objects such as a person, glass, and cabinet may provide clues from which humans can draw conclusions – see Fig. 1. We term this new task Abductive Past Action Inference and benchmark how deep learning models perform on this challenging new task. For this task, the models are not only required to decipher the effects of human actions that result in different environment states but also to solve causal chains of action effects, a task that can be challenging even for humans. Furthermore, the task relies on the model's ability to perform abductive reasoning based on factual evidence, i.e., determining which actions may or may not have been performed based on visual information in the image. Humans can solve this task by using prior experience (knowledge) about actions and their effects and by using reasoning to make logical inferences. Are deep learning models able to perform abductive past action inference by utilizing visual knowledge present in a given image and a learned understanding of the domain? We aim to answer this question in this paper.
Human action can be viewed as an evolution of human-object relationships over time. Therefore, the state of human-object relations in a scene may give away cues about the actions that have been executed by the human. We hypothesize that deep learning models are able to perform logical abduction on past actions using visual and semantic information of human-object relations in the scene. As these human-object relations provide substantial visual cues about actions and the effects of past actions, they make it easier for the models to infer past actions. Conversely, the evidence should in turn support those conclusions (the actions inferred by the model). If a human executed a set of actions $\mathcal{A}$ which resulted in a state whereby a human-object relation set $\mathcal{R}$ is formed as an effect of those executed actions (i.e., $\mathcal{A}\rightarrow\mathcal{R}$), then using the relational information, we can formulate the task as inferring $\mathcal{A}$ from $\mathcal{R}$. Therefore, we argue that human-object relational representations are vital for abductive past action inference and provide further justifications in our experiments.
In this work, our models rely on human-centric object relation tuples such as (person, glass) and (person, closet), obtainable from a single image at the current point in time, to perform abductive past action inference. One can see why these human-centric relations are vital for identifying past actions: the (person, glass) relation may lead to deriving actions such as (person-pouring-water, person-took-glass-from-somewhere), while (person, closet) may imply actions such as (person-opening-closet, person-closing-closet) – see Figure 1. Therefore, we use objects and their relationships in the scene to construct human-centric object relations within each image. These relations are made up of both visual and semantic features of recognized objects. To effectively model relational information, we use bilinear pooling, and to model inter-relational reasoning, we use a new relational graph neural network (GNN). We propose a new model called BiGED that uses both bilinear pooling and GNNs to effectively reason over human-object relations and inter-relational information of an image to perform abductive inference on past actions.
Our contributions are summarized as follows: (1) To the best of our knowledge, we are the first to propose the abductive past action inference task, which involves predicting past actions through abductive inference. (2) We benchmark several image, video, vision-language, and object-relational models on this problem, thereby illustrating the importance of human-object relational representations. Additionally, we develop a new relational rule-based inference model which serves as a relevant baseline for the abductive past action inference task. (3) We propose a novel relational bilinear graph encoder-decoder model (BiGED) to tackle this challenging reasoning problem and show the effectiveness of this new design.
2 Related Work
Our work differs from action recognition [20, 18] in a fundamental way. In action recognition, the objective is to identify the actions executed in the visible data (e.g., a video, or an image in still-image action recognition [15]), so the models can learn from visual cues what an action looks like and what constitutes an action. In our work, we aim to infer past actions that the model has never seen the human performing. The model only sees visual evidence (e.g., human-object relations) in the scene, which is the outcome of executed actions. There are no motion cues or visual patterns of actions that the model can rely on to predict past actions. From a single static image, the machine should infer what actions may have been executed. This is drastically different from classical action recognition and action anticipation tasks.
Abductive past action inference shares similarities with short-term action anticipation [11, 9] and long-term action anticipation [1]. However, there are several notable differences between the two tasks. Firstly, in abductive past action inference, the goal of the model is to identify the most plausible actions executed by a human in the past based on the current evidence, whereas in action anticipation, the model learns to predict future action sequences from current observations. The primary distinction is that in abductive past action inference, the observations (evidence) may imply certain past actions, whereas action anticipation predicts future actions without certainty. In other words, in abductive past action inference, the evidence and clues indicate actions plausibly executed in the past that resulted in that evidence. In action anticipation, by contrast, clues [37], context [14], current actions [13], and knowledge about the task [49] are used to infer probable future actions, but there is no guarantee that the predicted actions will actually be executed by the human. For instance, observing a person cleaning a room with a broom suggests that prior actions, such as picking up the broom, must have happened, among many others. Even if putting away the broom is anticipated at some point in the future, other actions such as holding the broom or opening a window are also possible. Therefore, while action anticipation addresses the uncertainty of future human behavior, abductive past action inference models can utilize scene evidence (such as objects in the scene) to infer the most likely past actions. Additionally, in abductive past action inference, uncertainty arises from the fact that several different action sets may result in similar states $\mathcal{R}$. In our task, models should comprehend the consequences of each executed action and engage in abductive reasoning to infer the most probable set or sequence of past actions.
Another key difference between action anticipation and abductive past action inference is that in action anticipation, predictions made at time $t$ can leverage all past observations. In contrast, abductive past action inference relies solely on present and future information, where new future observations can potentially alter the evidence about past actions, making the inference process more challenging.
Visual Commonsense Reasoning (VCR) [47, 45] and causal video understanding [30, 31] are also related to our work. In VCR [47], given an image, object regions, and a question, the model should answer the question regarding what is happening in the given frame. The model must also provide justifications for the selected answer in relation to the question. The authors of [29] studied a similar problem, where a dynamic story underlying the input image is generated using commonsense reasoning. In particular, VisualCOMET [29] extends VCR and attempts to generate a set of textual descriptions of events at present, a set of commonsense inferences on events before, a set of commonsense inferences on events after, and a set of commonsense inferences on people's intents at present. In this vein, given the complete visual commonsense graph representing an image, they propose two tasks: (1) generate the rest of the visual commonsense graph connected to the current event, and (2) generate a complete set of commonsense inferences. In contrast, given an image without any natural language queries, we recognize visual objects in the scene and how they relate to the human, then use the human-centric object relational representation to infer the most likely actions executed by the human.
Recently, machine learning models capable of logical reasoning have also emerged [6, 50, 4, 16, 22]. Visual scene graph generation [42], spatial-temporal scene graph generation [5], graph neural networks [39, 46, 40], and bilinear pooling methods [12, 10] are also related to our work.
<details>
<summary>extracted/6142422/images/tasks_without.png Details</summary>

Diagram of the abductive action inference workflow in three stages: **Object Detection** (a person in a kitchen holding a glass, with bounding boxes on the glass, the hand, and the full scene), **Relational Modelling** (an arrow linking the glass to the hand, with candidate actions: take glass, hold glass, open cabinet, close cabinet, pour into glass), and **Abductive Action Inference**, comprising (1) the unordered set of past actions, (2) the ordered sequence of past actions (open cabinet, take glass, close cabinet, hold glass, pour into glass), and (3) language query-based action verification (e.g., "Open cabinet? Yes", "Drinking? No", "Washing glass? No").
</details>
Figure 1: Proposed object-relational approach for abductive past action inference. Models are tasked to: 1) abduct the set of past actions, 2) abduct the sequence of past actions, and 3) perform abductive past action verification.
3 Abductive Past Action Inference
Task: Given a single image, models have to infer the past actions executed by humans up to the current moment in time. We name this task Abductive Past Action Inference. Let us denote a human action by $a_{i} \in A$, where $A$ is the set of all actions, and let $E_{1},E_{2},\cdots$ be a collection of evidence from the evidence set $\mathcal{E}$. As the evidence is a result of actions, we can write the logical implication $\mathcal{A}\rightarrow\mathcal{E}$, where $\mathcal{A}$ is the set of actions executed by a human that resulted in the evidence set $\mathcal{E}$. The task then aims to derive 1) the set of past actions, 2) the sequence of past actions that resulted in the current evidence shown in the image, and 3) abductive past action verification. Abductive past action verification is a binary task where the model is given a single image and is required to answer yes or no to an action query (did the person execute action $a_{x}$ in the past?).
3.1 Object-Relational Representation Approach
Our primary hypothesis is that human-object relations are essential for abductive past action inference. Therefore, we propose a human-object relational approach for the task. In all three tasks, our general approach is as follows. We make use of detected humans and objects in the image and then generate a representation for human-centric object relations. Then, using these human-centric object relations, we summarize the visual image, and using neural models, we infer the most likely actions executed by the human. The overview of this approach is shown in Figure 1. Next, we first discuss abductive past action set inference, followed by the details of abductive past action verification.
Abductive past action set prediction.
Let us denote an object by $o \in O$, a predicate category by $p \in P$, and the human by $h$. The $j^{th}$ relation $R_{j}$ is a triplet of the form $\left<h,p,o\right>$. In the $i^{th}$ image, we observe $n$ relations $\mathcal{R}_{i}=\{R_{1},R_{2},\cdots,R_{n}\}$, where $\mathcal{R}_{i}$ is the relation set present in the situation shown in that image. These relations constitute the evidence $\mathcal{E}$. The relation set $\mathcal{R}_{i}$ is an effect of a person executing an action set/sequence $\mathcal{A}_{i}=\{a_{1},\cdots,a_{K}\}$. Therefore, the following association holds.
$$
\mathcal{A}_{i}\rightarrow\mathcal{R}_{i} \tag{1}
$$
However, we do not know which action caused which relation (evidence), as this information is not available. The association reveals that there is a lack of specific knowledge about the exact effects of individual actions, and when multiple actions have been executed, the resulting effect is a combined effect of all executed actions. Consequently, the learning mechanism must uncover the probable cause-and-effect relationships concerning actions. Therefore, given $\mathcal{R}$, we aim to perform abductive past action inference to infer the most likely set of actions executed by the human using neural network learning.
We learn this abduction using the deep neural network functions $\phi()$ and $\phi_{c}()$. The relational model $\phi$ takes the relation set as input and outputs a summary of the relational information as a vector $x_{r}$.
$$
x_{r}=\phi(R_{1},\cdots,R_{n};\theta_{\phi}) \tag{2}
$$
The parameters of the relational model are denoted by $\theta_{\phi}$. The linear classifier $\phi_{c}$, with parameters $\theta_{c}$, takes the relational summary vector $x_{r}$ as input and returns the conditional probability of actions given the relational evidence as follows:
$$
P(a_{1},\cdots,a_{K}|R_{1},\cdots,R_{n})=\phi_{c}(x_{r};\theta_{{c}}) \tag{3}
$$
The training and inference sets comprise images and corresponding action sets $\mathcal{A}_{i}$. From each image, we extract the relation set $\mathcal{R}_{i}$. Therefore, the dataset consists of $\mathcal{D}=\bigcup_{i}\{\mathcal{R}_{i},\mathcal{A}_{i}\}$. Given the training set $\mathcal{D}$, we learn the model functions in Equations 2 and 3 using backpropagation as follows:
$$
\theta_{\phi}^{*},\theta_{c}^{*}=\operatorname{argmin}_{\theta_{\phi},\theta_{c}}\sum_{i}-\log(P(\mathcal{A}_{i}|\mathcal{R}_{i})) \tag{4}
$$
where $\theta_{\phi}^{*},\theta_{c}^{*}$ are the optimal parameters. As this is a multi-label, multi-class classification problem, we utilize the max-margin multi-label loss from PyTorch, with the margin set to 1, during training.
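As a concrete illustration, the learning setup of Equations 2-4 can be sketched in PyTorch as follows. The relational model here is a generic stand-in (a single linear layer with max-pooling over relations), and all dimensions are illustrative, not the paper's actual architecture; only the use of `nn.MultiLabelMarginLoss` (whose margin is fixed at 1) reflects the stated training objective.

```python
# Minimal training sketch for Eqs. (2)-(4); phi and phi_c are illustrative stand-ins.
import torch
import torch.nn as nn

d, num_actions = 16, 10                            # feature dim / |A| (illustrative)

phi = nn.Sequential(nn.Linear(d, d), nn.ReLU())    # stand-in relational model
phi_c = nn.Linear(d, num_actions)                  # linear action classifier

criterion = nn.MultiLabelMarginLoss()              # PyTorch max-margin loss, margin = 1
opt = torch.optim.Adam(list(phi.parameters()) + list(phi_c.parameters()), lr=1e-3)

# One image: n = 5 relation vectors R_1..R_n; ground-truth past actions {1, 4}.
relations = torch.randn(5, d)                      # relation set R_i as an n x d matrix
target = torch.full((1, num_actions), -1, dtype=torch.long)
target[0, :2] = torch.tensor([1, 4])               # valid class indices first, padded with -1

x_r = phi(relations).max(dim=0, keepdim=True).values  # summarize relations into x_r
logits = phi_c(x_r)                                   # action scores, Eq. (3)
loss = criterion(logits, target)                      # max-margin multi-label loss
loss.backward()
opt.step()
```

Note the `-1` padding convention: `MultiLabelMarginLoss` reads target class indices up to the first `-1` in each row.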
Abductive past action verification.
The abductive verification model $\phi_{ver}()$ takes the evidence $\mathcal{E}$ and the semantic representation of the past action (e.g., a textual encoding of the action name) $y_{a}$ as inputs, and outputs a binary classification score indicating whether the evidence supports the action, i.e., $\phi_{ver}(\mathcal{E},y_{a})\rightarrow[0,1]$. Specifically, we encode the past action name using the CLIP [33] text encoder to obtain the textual encoding $y_{a}$ for action class $a$. Then, we concatenate $y_{a}$ with $x_{r}$ and utilize a two-layer MLP to perform binary classification to determine whether action $a$ was executed or not. We use the max-margin loss to train $\phi_{ver}()$. Note that the semantic embedding of action classes ($y_{a}$) is not a necessity here. For example, one might learn the class embeddings from scratch, removing the dependency on language, or use one-hot vectors.
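The verification head described above can be sketched as follows. The CLIP text encoder is stubbed with a random vector so the snippet stays self-contained, and the hidden width is an illustrative assumption; only the overall shape (concatenate $y_a$ with $x_r$, then a two-layer MLP producing a binary score) follows the text.

```python
# Sketch of phi_ver: concat action text embedding y_a with relational summary x_r,
# then classify with a two-layer MLP. Dimensions and hidden width are illustrative.
import torch
import torch.nn as nn

d_rel, d_txt = 16, 512                       # x_r dim / CLIP text-embedding dim (assumed)

phi_ver = nn.Sequential(                     # two-layer MLP binary classifier
    nn.Linear(d_rel + d_txt, 128),
    nn.ReLU(),
    nn.Linear(128, 1),
    nn.Sigmoid(),                            # score in [0, 1]: evidence supports action?
)

x_r = torch.randn(1, d_rel)                  # relational summary of the image
y_a = torch.randn(1, d_txt)                  # stand-in for the CLIP encoding of "open cabinet"

score = phi_ver(torch.cat([x_r, y_a], dim=-1))
```

Swapping the random `y_a` for a learned embedding table or one-hot vectors, as the text notes, would only change how `y_a` is produced, not the head itself.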
3.2 Relational Representation
To obtain the relation representation, we extract features from the human and object regions of each image using a FasterRCNN [36] model with a ResNet101 backbone [17]. Let us denote the human feature by $x_{h}$, the object feature by $x_{o}$, and the feature extracted from the union region of the human and object by $x_{u}$. As we do not know the predicate or relationship label for the relation between $x_{h}$ and $x_{o}$, we use the concatenation of all three visual features $x_{h}$, $x_{o}$, and $x_{u}$ as the joint relational visual feature $x_{v}=[x_{h},x_{o},x_{u}]$. Using FasterRCNN, we can also obtain the object and human categories. We use Glove [32] embeddings to acquire a semantic representation of each human and object in the image. Let us denote the Glove embedding of the human by $y_{h}$ and of the object by $y_{o}$. Then, the semantic representation of the relation is given by $y_{s}=[y_{h},y_{o}]$. Using both visual and semantic representations, we obtain a joint representation for each human-centric object relation in a given image. Therefore, the default representation for a relation $R=\left<h,p,o\right>$ is given by $r=[x_{v},y_{s}]$. Note that we do not have access to the predicate class or any information about the predicate. Next, we present several neural and non-neural models developed in this paper that use relational representations for the abductive past action inference task.
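Assembling the default relation representation $r=[x_{v},y_{s}]$ is a pure concatenation, sketched below with random stand-ins for the detector and GloVe outputs; the 2048-d RoI features and 300-d GloVe embeddings are assumed typical dimensions for a ResNet101 backbone and standard GloVe vectors.

```python
# Building r = [x_v, y_s] from detector and GloVe outputs (random stand-ins).
import torch

d_vis, d_emb = 2048, 300                     # assumed RoI feature dim / GloVe dim

x_h = torch.randn(d_vis)                     # human region feature
x_o = torch.randn(d_vis)                     # object region feature
x_u = torch.randn(d_vis)                     # union-region feature
y_h = torch.randn(d_emb)                     # GloVe embedding of "person"
y_o = torch.randn(d_emb)                     # GloVe embedding of e.g. "glass"

x_v = torch.cat([x_h, x_o, x_u])             # joint relational visual feature
y_s = torch.cat([y_h, y_o])                  # semantic feature of the relation
r = torch.cat([x_v, y_s])                    # default relation representation
```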
The details of abductive past action sequence inference are provided in the supplementary materials (section 4.1). Next, we present our graph neural network model to infer past actions based on relational information.
3.3 GNNED: Relational Graph Neural Network
The graph neural network-based encoder-decoder model summarizes relational information for abductive past action inference. With a slight abuse of notation, let us denote the relational representations by an $n \times d$ matrix $\mathcal{R}=[r_{1},r_{2},\cdots,r_{n}]$, where each $r_{i}$ has $d$ dimensions. In our graph neural network encoder-decoder (GNNED) model, we first project the relational data using a linear function as follows:
$$
\mathcal{R^{\prime}}=\mathcal{R}W_{l}+b_{l} \tag{5}
$$
where $\mathcal{R^{\prime}}=[r^{\prime}_{1},r^{\prime}_{2},\cdots,r^{\prime}_{n}]$. Then, we construct the affinity matrix $W_{A}$, where $W_{A}(i,j)$ is the similarity/affinity between the $i$-th and $j$-th relations in the set. We use Jaccard Vector Similarity, a smooth and fully differentiable affinity measure [9] bounded by $[-1,1]$. Thereafter, we obtain the graph-encoded relational representation as follows:
$$
G_{e}=ReLU((W_{A}\mathcal{R^{\prime}})W_{g}+b_{g}) \tag{6}
$$
where $W_{g}$ and $b_{g}$ are the weight matrix and bias term, respectively. We call Equations 5-6 the graph module. Using the graph module as a base, we develop a graph encoding layer. The proposed relational graph neural network encoder-decoder architecture is shown in Figure 2. By default, our graph encoder-decoder model consists of one graph encoder and three graph decoders. The graph module models the inter-relations between human-object relations and reasons over them to perform abduction on past actions. The graph encoding layer (left) closely resembles a Transformer encoder layer [43]; it consists of dropout layers, layer norm [2], linear layers, and residual connections [17]. The graph decoder layer (right) is likewise similar to a Transformer decoder layer, except for the graph module. Finally, we apply max-pooling at the end of the graph encoder-decoder model to obtain the final image representation $x_{r}$.
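The graph module of Equations 5-6 can be sketched as below. The affinity here uses a Tanimoto-style (Jaccard-like) vector similarity; the exact differentiable Jaccard Vector Similarity of [9] may differ, so treat that choice, and all layer sizes, as assumptions for illustration.

```python
# Sketch of the graph module (Eqs. 5-6); affinity variant and sizes are assumptions.
import torch
import torch.nn as nn

class GraphModule(nn.Module):
    def __init__(self, d_in, d_hid):
        super().__init__()
        self.proj = nn.Linear(d_in, d_hid)   # Eq. 5: R' = R W_l + b_l
        self.g = nn.Linear(d_hid, d_hid)     # W_g, b_g of Eq. 6

    @staticmethod
    def affinity(R):
        # Tanimoto-style similarity between every pair of relation vectors:
        # sim(x, y) = x.y / (|x|^2 + |y|^2 - x.y); a smooth Jaccard-like measure.
        dot = R @ R.t()
        sq = (R * R).sum(dim=-1)
        return dot / (sq[:, None] + sq[None, :] - dot + 1e-8)

    def forward(self, R):
        Rp = self.proj(R)                    # projected relations R'
        W_A = self.affinity(Rp)              # n x n affinity matrix
        return torch.relu(self.g(W_A @ Rp))  # Eq. 6: G_e = ReLU((W_A R') W_g + b_g)

gm = GraphModule(d_in=32, d_hid=16)
G_e = gm(torch.randn(5, 32))                 # 5 relations -> 5 x 16 graph encoding
```

Because `W_A` mixes every relation's features into every other relation's update, stacking this module lets the model reason over inter-relational structure before the final max-pooling.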
<details>
<summary>extracted/6142422/images/graphed_updated.png Details</summary>

Diagram of the graph encoder (left) and graph decoder (right) layers: each section chains a Graph Module with dropout, layer norm, linear layers, and ReLU, mirroring a Transformer encoder/decoder layer; the two Graph Modules are connected across the encoder-decoder split, and the decoder side ends with a final dropout.
</details>
Figure 2: The graph neural network encoder (left) and graph neural network decoder (right) architecture. The residual connections are shown with the $+$ sign.
<details>
<summary>extracted/6142422/images/biged.png Details</summary>

Diagram of the BiGED pipeline: the human feature $x_h$ and the concatenated object features $[x_o, y_o]$ (passed through a linear layer) are each processed by GNNED blocks, combined by a Bilinear Module, and concatenated with the union feature $x_u$ to form the relational feature $x_r$, which a classifier MLP maps to past actions.
</details>
Figure 3: The Bilinear Graph Encoder-Decoder (BiGED) architecture.
3.4 RBP: Relational Bilinear Pooling
To effectively model the higher-order relational information between human and object features, we use bilinear pooling. Given the human representation $x_{h}$ and the object representation $x_{o}$ , we first apply linear projections with weights $W_{h}, W_{o}$ and biases $b_{h}, b_{o}$ , and then use bilinear pooling with a weight tensor $W_{b}$ of size $d \times d \times d$ and linear projection matrices $W_{bl}, W_{jb}$ as follows:
$$
\begin{aligned}
o^{\prime} &= ReLU(W_{o}x_{o}+b_{o}) \\
h^{\prime} &= ReLU(W_{h}x_{h}+b_{h}) \\
r_{b} &= ReLU\big(\big[h^{\prime}W_{b}o^{\prime};\big([h^{\prime};o^{\prime}]W_{bl}+b_{bl}\big)\big]\big)W_{jb}+b_{jb}
\end{aligned} \tag{7}
$$
where $[;]$ denotes vector concatenation and $h^{\prime}W_{b}o^{\prime}$ is the bilinear pooling operator applied over the human and object features. $([h^{\prime};o^{\prime}]W_{bl}+b_{bl})$ is a linear projection of the concatenated human and object features using weight matrix $W_{bl}$ and bias term $b_{bl}$ . In contrast to bilinear pooling, this concatenated linear projection captures direct relational information between the human and object features. We then concatenate the bilinear-pooled vector $h^{\prime}W_{b}o^{\prime}$ with the output of the linear projection, apply ReLU, and apply another linear projection (weight $W_{jb}$ and bias $b_{jb}$ ). Finally, we concatenate the overlap feature $x_{u}$ with the overall model output $r_{b}$ and apply max-pooling across all relational features $[r_{b};x_{u}]$ in the image to obtain $x_{r}$ .
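Equation 7 can be sketched numerically. The following is a minimal NumPy illustration, not the paper's implementation: the feature dimension is made up, and the learned parameters $W_o$, $W_h$, $W_b$, $W_{bl}$, $W_{jb}$ are replaced by random arrays for shape-checking.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # illustrative feature dimension

# Stand-ins for the learned parameters of Eq. 7 (random, illustrative only)
W_o, b_o = rng.standard_normal((d, d)), np.zeros(d)
W_h, b_h = rng.standard_normal((d, d)), np.zeros(d)
W_b = rng.standard_normal((d, d, d))          # bilinear weight tensor, d x d x d
W_bl, b_bl = rng.standard_normal((2 * d, d)), np.zeros(d)
W_jb, b_jb = rng.standard_normal((2 * d, d)), np.zeros(d)

def relu(x):
    return np.maximum(x, 0.0)

def rbp(x_h, x_o):
    h = relu(W_h @ x_h + b_h)                          # h'
    o = relu(W_o @ x_o + b_o)                          # o'
    bilinear = np.einsum("i,ijk,j->k", h, W_b, o)      # h' W_b o'
    linear = np.concatenate([h, o]) @ W_bl + b_bl      # [h'; o'] W_bl + b_bl
    return relu(np.concatenate([bilinear, linear])) @ W_jb + b_jb  # r_b

r_b = rbp(rng.standard_normal(d), rng.standard_normal(d))
print(r_b.shape)  # (8,)
```

In the full model, the $r_b$ of each human-object relation is concatenated with the union feature $x_u$ and max-pooled across all relations in the image to yield $x_r$.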
3.5 BiGED: Bilinear Graph Encoder-Decoder
Finally, to take advantage of both bilinear relational modeling and graph neural network encoder-decoder models, we combine both strategies as shown in Fig 3. The main idea is to replace the object projection function in Equation 7 with a graph neural network encoder-decoder model. Let us denote the graph neural network encoder-decoder model by $f_{Ged}(\cdot)$ . The object projection in Equation 7 is then replaced as follows:
$$
O^{\prime} = f_{Ged}(X_{o}) \tag{10}
$$
where $X_{o}$ denotes all the object features in the image. Afterward, we apply equation 9 before using bilinear modeling to obtain the relational representation. Note that, as there are only one or two humans in an image, we do not use a GNNED to model human semantic features. The inputs to the BiGED model are the visual human features $x_{h}$ , the concatenated visual and semantic object features $[x_{o},y_{o}]$ , and the union features $x_{u}$ . Next, we concatenate the human and object features $[x_{h},x_{o},y_{o}]$ to obtain a joint feature and pass it through a linear layer and another GNNED model. The outputs of the bilinear module and the joint feature-based GNNED model, together with the overlap union feature $x_{u}$ , are concatenated to obtain the final relational representation. Afterward, we use max-pooling to obtain the representation $x_{r}$ for the image. For all models, we employ a linear classifier to infer past actions from the representation vector $x_{r}$ .
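The substitution in Equation 10 can be illustrated schematically. The sketch below stands in for $f_{Ged}$ with a single round of mean-aggregation message passing over a fully connected object graph followed by a decoding projection; the paper's actual GNNED architecture differs, and all dimensions and weights here are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8  # illustrative feature dimension

W_enc = 0.1 * rng.standard_normal((d, d))  # stand-in encoder weights
W_dec = 0.1 * rng.standard_normal((d, d))  # stand-in decoder weights

def relu(x):
    return np.maximum(x, 0.0)

def f_ged(X):
    """Toy GNN encoder-decoder over a fully connected object graph.
    X: (N, d) features for the N objects in the image."""
    msgs = X.mean(axis=0, keepdims=True)  # aggregate messages from all nodes
    H = relu((X + msgs) @ W_enc)          # encode nodes with neighbourhood info
    return relu(H @ W_dec)                # decode back to feature space, (N, d)

X_o = rng.standard_normal((3, d))  # e.g. three objects in the image
O_prime = f_ged(X_o)               # Eq. 10: O' = f_Ged(X_o)
print(O_prime.shape)  # (3, 8)
```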
4 Experiments and Results
4.1 Action Genome Past Action Inference dataset
We extend the Action Genome (AG) dataset [21] and benchmark all models on it for the abductive past action inference task. Built upon the Charades dataset [41], the AG dataset contains 9,848 videos with 476,000+ object bounding boxes and 1.72 million visual relationships annotated across 234,000+ frames. Note that not all video frames in the Charades dataset are used in the AG dataset; only a subset of keyframes is used in AG, and we follow the same protocol. The AG dataset does not provide action annotations. To obtain action annotations for AG images, we leverage the Charades dataset, which contains 157 action classes. The process of generating action sets and sequences using images from Action Genome and action labels from the Charades dataset for the abductive past action inference task is detailed in Section 2 of the supplementary materials.
4.2 Experimental Setup
After obtaining the action annotations of images for a given video, we drop videos having only one image as there are no past images and therefore, no past actions. For the remaining images, we assign action labels from the previous images in two different evaluation setups:
1. Abduct at $T$ : Given an image at time $T$ , we add the action labels of all previous images, together with those of the current image, to the ground truth. Let $\mathcal{A}_{t}$ denote the actions of the $t^{th}$ image; the ground truth action set $\mathcal{A}$ is then given by $\mathcal{A}=\mathop{\cup}\limits_{t=1}^{T}\mathcal{A}_{t}$ .
2. Abduct last image: Based on the first setup, we add an additional task where the model performs inference only on the last image of each video, which contains all past actions. If the last image is at time $T^{\prime}$ , then the action set is $\mathcal{A}=\mathop{\cup}\limits_{t=1}^{T^{\prime}}\mathcal{A}_{t}$ . Note that in the Action Genome dataset, the images are sampled non-uniformly from the Charades videos; the previous image may therefore occur several seconds before the current image. In our abductive past action inference task, the ground truth past action sets are confined to the length of each video. We provide details on the number of images for a set of $n$ past actions in the AG dataset for these setups in the supplementary materials (Section 2 and Figure 4).
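The ground-truth construction in both setups is a simple set union over per-image action labels. A toy example with invented action names:

```python
# Per-image action sets A_t for a short video (action names are illustrative)
past_sets = [
    {"walk through doorway"},                       # A_1
    {"walk through doorway", "open refrigerator"},  # A_2
    {"take food", "close refrigerator"},            # A_3 (current image, T = 3)
]

# Abduct at T: the ground truth is the union of A_1 .. A_T
A = set().union(*past_sets)
print(sorted(A))
```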
4.3 Evaluation Metrics
We utilize the mean Average Precision (mAP), Recall@K (R@K), and mean Recall@K (mR@K) metrics to evaluate the models on the abductive action set prediction and action verification tasks. On average, each image contains 8.9 and 8.2 actions for the Abduct at $T$ and Abduct last image setups, respectively. Therefore, K is set to 10 based on the average number of actions contained in an image. Please see supplementary material Section 3 for more implementation details. We will also release all our code and models for future research.
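As a concrete reading of the R@K metric in this multi-label setting, the sketch below computes the per-image recall of the ground-truth actions among the top-K scored actions and averages over images; the paper's exact evaluation code may differ in details such as tie-breaking.

```python
import numpy as np

def recall_at_k(scores, gt, k=10):
    """Mean per-image Recall@K.
    scores: (N, C) predicted action scores; gt: (N, C) binary labels."""
    topk = np.argsort(-scores, axis=1)[:, :k]  # indices of the top-K actions
    recalls = []
    for pred, labels in zip(topk, gt):
        pos = np.flatnonzero(labels)           # ground-truth action indices
        if pos.size:
            recalls.append(np.isin(pos, pred).mean())
    return float(np.mean(recalls))

scores = np.array([[0.9, 0.1, 0.8, 0.2]])
gt = np.array([[1, 0, 1, 1]])
print(recall_at_k(scores, gt, k=2))  # 2 of 3 true actions in the top-2
```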
4.4 Baseline Models
We benchmark several publicly available image models (Resnet101-2D [17], ViT [7]) and video models (Slow-Fast [8] and Resnet50-3D) using the surrounding 8 frames of an image from the Charades dataset, and Video-Swin-S [27], Mvitv2 [24] and InternVideo [44] models using future $K$ images from Action Genome to explore video-based methods. The value of $K$ is set to the minimum possible frame size for each model, with the default being 5 frames. Image models are pre-trained on ImageNet [38], while video models are pre-trained on the Kinetics400 [23] dataset; we fine-tune these models on our task. We use a batch size of 32 with a learning rate of 1e-5. As for ViT, we use a batch size of 2042 using A100 GPUs. All video-based methods are fine-tuned end-to-end. We also use CLIP in linear-probe and zero-shot settings to perform abduction with several variants of the CLIP model [33]. The details of all other baseline models (Relational Rule-based inference, Relational MLP, and Relational Transformers) are presented in supplementary material Section 1.
4.5 Human Performance Evaluation
Human performance for the abductive past action set inference and verification tasks in the Abduct at $T$ setup is presented in Tables 1 and 3. Performance on the abductive past action sequence inference is provided in the supplementary materials–see Table 1 and Section 4.1. All human experiments for the three sub-problems in the Abduct at $T$ setup follow the same procedure. Evaluators are asked to review 100 randomly sampled test images and manually assess all action classes in the Charades dataset without viewing the ground truth. They then select the likely past actions for each image.
4.6 Results: Abductive Past Action Set Prediction
| Model | mAP | R@10 | mR@10 |
| --- | --- | --- | --- |
| Human Performance | – | 80.60 | 82.81 |
| End-to-end training | | | |
| ResNet101-2D [17] | 9.27 | 18.63 | 11.51 |
| ViT B/32 [7] | 7.27 | 16.84 | 8.82 |
| Resnet50-3D [8] | 8.16 | 16.08 | 7.83 |
| Slow-Fast [8] | 7.91 | 14.42 | 7.65 |
| Video-Swin-S [27] - (K=5) | 14.86 | 34.18 | 19.05 |
| MvitV2 [24] - (K=16) | 14.01 | 34.38 | 15.17 |
| InternVideo [44] - (K=8) | 12.29 | 30.72 | 12.37 |
| Vision-language models | | | |
| CLIP-ViT-B/32 (zero-shot) [33] | 14.07 | 14.88 | 20.88 |
| CLIP-ViT-L/14 (zero-shot) [33] | 19.79 | 21.88 | 27.77 |
| CLIP-ViT-B/32 (linear probe) [33] | 16.16 | 31.25 | 16.38 |
| CLIP-ViT-L/14 (linear probe) [33] | 22.06 | 40.18 | 20.01 |
| Object-relational methods - using GT human/objects | | | |
| Relational Rule-based inference | 26.27 | 48.94 | 36.89 |
| Relational MLP | 27.73 $±$ 0.20 | 42.50 $±$ 0.68 | 25.80 $±$ 0.61 |
| Relational Self Att. Transformer | 33.59 $±$ 0.17 | 56.03 $±$ 0.40 | 40.04 $±$ 1.15 |
| Relational Cross Att. Transformer | 34.73 $±$ 0.05 | 56.89 $±$ 0.47 | 40.75 $±$ 0.57 |
| Relational GNNED | 34.38 $±$ 0.36 | 57.17 $±$ 0.35 | 42.83 $±$ 0.21 |
| Relational Bilinear Pooling (RBP) | 35.55 $±$ 0.30 | 59.98 $±$ 0.68 | 43.53 $±$ 0.63 |
| BiGED | 35.75 $±$ 0.15 | 60.55 $±$ 0.41 | 44.37 $±$ 0.21 |
| BiGED - (K=3) | 36.00 $±$ 0.12 | 60.17 $±$ 0.44 | 42.82 $±$ 0.90 |
| BiGED - (K=5) | 37.34 $±$ 0.21 | 61.16 $±$ 0.56 | 44.07 $±$ 0.87 |
| BiGED - (K=7) | 36.57 $±$ 0.38 | 60.65 $±$ 0.52 | 43.12 $±$ 0.47 |
| Object-relational method - using FasterRCNN labels | | | |
| BiGED | 24.13 $±$ 0.04 | 43.59 $±$ 0.88 | 30.12 $±$ 0.23 |
Table 1: Abductive past action set inference performance using the proposed methods on the Abduct at $T$ setup.
| Model | mAP | R@10 | mR@10 |
| --- | --- | --- | --- |
| Relational Rule-based inference | 26.18 | 44.34 | 33.94 |
| Relational MLP | 25.99 $±$ 0.10 | 38.79 $±$ 0.86 | 23.54 $±$ 0.72 |
| Relational Self Att. Transformer | 30.13 $±$ 0.11 | 47.55 $±$ 0.14 | 35.05 $±$ 0.55 |
| Relational Cross Att. Transformer | 31.07 $±$ 0.20 | 48.33 $±$ 0.15 | 35.32 $±$ 0.50 |
| Relational GNNED | 30.95 $±$ 0.30 | 48.18 $±$ 0.17 | 36.36 $±$ 0.12 |
| RBP | 31.48 $±$ 0.20 | 49.79 $±$ 0.55 | 36.96 $±$ 0.36 |
| BiGED | 31.41 $±$ 0.15 | 49.62 $±$ 0.64 | 36.15 $±$ 0.61 |
| Object-relational method - using FasterRCNN labels | | | |
| BiGED | 22.01 $±$ 0.26 | 37.06 $±$ 0.52 | 25.01 $±$ 0.37 |
Table 2: Abductive past action set inference performance using the proposed methods on the Abduct last image setup.
Our results for the abductive past action set prediction task are shown in Table 1. These results are obtained under the Abduct at $T$ setup. During training, the model learns from every single image in the video sequence independently. Likewise, during inference, the model predicts the past action set for every single image. End-to-end trained models such as Slow-Fast [8], Resnet50-3D, Resnet101-2D, and ViT perform poorly, as it is harder for these models to find the clues needed for the abductive inference task. As there are no direct visual cues to infer previous actions (unlike object recognition or action recognition) from a given image, end-to-end learning becomes hard or near impossible for these models. The Video-Swin-S Transformer model [27] shows promise among end-to-end models due to its use of future context (K future snapshots) and strong video representation capabilities.
On the other hand, multi-modal foundational models such as the CLIP [33] variants obtain better results than vanilla CNN models on this task, perhaps due to the quality of their visual representations. Interestingly, object-relational models such as the MLP and rule-based inference obtain decent performance. One might argue that the performance of human-object relational models is attributable to the use of ground truth object labels in the scene. However, when we incorporated ground truth objects by drawing red bounding boxes as visual prompts for the CLIP [33] model, the performance was poor. This poor performance might be attributed to CLIP's training approach, which aims to align overall image features with corresponding text features. During training, CLIP assumes that the caption text accurately describes the visual content of the image. However, for abductive past action inference, no explicit visual cues are available to indicate the execution of past actions. We also note that the CLIP model demonstrates reasonable zero-shot performance, perhaps because it learns better vision features.
We experimented with generative models like ViLA [26] (instruction tuning) and BLIP (question answering). Details are in the supplementary materials sections 1.4 for GPT-3.5 and section 1.5 for ViLA. After instruction tuning on our dataset, ViLA achieved 10.5 mAP, 29.3 R@10, and 19.8 mR@10. We also tested GPT-3.5 with human-object relations as context, yielding 9.98 mAP, 25.22 R@10, and 20.32 mR@10. Due to the challenges of unconstrained text generation, models like BLIP, ViLA, and GPT-3.5 are excluded from the main comparison table.
The results also suggest that human-object relational representations provide valuable evidence (cues) about what actions may have been executed, in contrast to holistic vision representations. Among object-relational models, the MLP model and rule-based inference perform the worst across all three metrics. Rule-based inference does not use any parametric learning and therefore relies only on dataset statistics. Interestingly, the rule-based method obtains performance similar to the MLP model, indicating that the MLP model merely learns the biases of the dataset.
The relational transformer model improves results over MLP. Furthermore, the relational GNNED performance is comparable to the relational transformers. The transformer variants and GNNED have similar architectural properties and have better relational modeling capacity than the MLP model. These models exploit the interrelation between visual and semantic relational representations to better understand the visual scene. This potentially helps to boost the performance of abductive past action inference.
Surprisingly, Relational Bilinear Pooling (RBP) obtains particularly good results, outperforming the transformer and GNNED models. The way relational reasoning is performed in RBP is fundamentally different from transformers and GNNED: RBP models interactions between the human and object features within a relation more explicitly than GNNED and the Transformer. However, unlike GNNED or the Transformer, RBP is unable to model interactions between relations. Finally, the combination of GNNED and RBP, i.e., BiGED, performs even better. This result is not surprising, as BiGED takes advantage of both inter- and intra-relation modelling. We also experimented with a BiGED model that takes $K$ future frames starting from the frame at $T$ (i.e., Action Genome frames from $T$ to $T+K$ ) as inputs. The results of this experiment suggest that the use of future snapshots helps improve performance.
All object-relational models utilize the ground truth object labels from the AG dataset to obtain semantic representations. We observe a drop in performance when we use predicted objects from the FasterRCNN model. Additionally, FasterRCNN-based object-semantic prediction performs worse than the visual-only BiGED model (Table 4), indicating that incorrect semantics significantly harm performance. Nevertheless, the performance of BiGED with FasterRCNN labels is significantly better than end-to-end trained models and vision-language models. Finally, it should be emphasized that human performance on this task is significantly better than any of the modern AI models, highlighting a substantial research gap in developing AI systems capable of effectively performing abductive past action inference.
4.7 Results: Abduction on the Last Image
We evaluated the object-relational models on the second setup, where we perform abduction on the last image of each video, using models trained in the previous setup. Due to the variety of possible actions in a video sequence, this setup is more challenging. Note that this is a special case of Abduct at $T$ ; it allows us to test the longest time horizon in each video and determine whether the models are still able to abduct actions. Results in Table 2 show lower performance across all models compared to the previous setup, indicating the task's increased difficulty. The MLP model and rule-based inference perform poorly. The GNNED, RBP, and BiGED methods outperform the Transformer model, despite GNNED's architectural similarity to the Transformer. BiGED achieves the highest mAP, while RBP excels in R@10 and mR@10.
| Model | mAP | R@10 | mR@10 |
| --- | --- | --- | --- |
| Human Performance | – | 92.26 | 93.71 |
| Relational MLP | 26.58 $±$ 0.37 | 41.71 $±$ 0.82 | 25.40 $±$ 0.66 |
| Relational Self Att. Transformer | 27.94 $±$ 0.35 | 45.72 $±$ 1.42 | 30.12 $±$ 2.30 |
| RBP | 32.19 $±$ 0.44 | 53.76 $±$ 0.89 | 38.44 $±$ 0.67 |
| BiGED | 34.13 $±$ 0.39 | 57.39 $±$ 0.10 | 41.97 $±$ 0.36 |
Table 3: Abductive past action verification performance using the proposed methods on the Abduct at $T$ setup.
4.8 Results: Abductive Past Action Verification
We present abductive past action verification results in Table 3 using the object-relational approach. We use the ground truth human and object class names to obtain the semantic representation. As the query is in textual form (i.e., the action class name), abductive past action verification resembles a human-friendly task: it is easy to answer yes or no to the question “Did the person execute action $a_{i}$ in this image to arrive at this state?" Interestingly, performance on this task is slightly lower than the main results in Table 1. Even though the task is mentally more straightforward for humans, it appears slightly harder for the machine, which now has to understand the complexities of human language.
4.9 Ablation on semantic vs visual
We use both visual and semantic features (Glove embeddings of object names) to obtain the relational features (see Section 3.2). We ablate the impact of visual and semantic features for each model on Abductive Past Action Set Prediction (Abduct at $T$ ); the results are shown in Table 4. While RBP achieves the best performance using ground truth object semantics, BiGED is the best-performing model on visual data alone by a considerable margin, making it the overall best method. We conclude that while semantic features are effective, visual and semantic features are complementary.
| Model | Visual | Semantic | Visual + Semantic |
| --- | --- | --- | --- |
| Rule-based inference | – | 26.27 | – |
| MLP | 17.82 | 18.55 | 27.73 |
| Transformer | 21.30 | 32.81 | 33.59 |
| Relational GNNED | 21.55 | 32.82 | 34.38 |
| RBP | 22.15 | 33.03 | 35.55 |
| BiGED | 24.62 | 30.25 | 35.75 |
Table 4: mAP on Abductive Past Action Set Prediction (Abduct at T) using visual and semantic features.
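To make the ablation concrete: the object representation concatenates a visual feature with a Glove embedding of the object's class name (Section 3.2). The sketch below uses invented dimensions and a stubbed embedding table in place of real Glove vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
d_vis, d_sem = 8, 5  # illustrative dimensions (real Glove vectors are larger)

# Stub embedding table standing in for Glove lookups of object class names
glove = {
    "cup": rng.standard_normal(d_sem),
    "table": rng.standard_normal(d_sem),
}

def object_representation(x_o, class_name):
    """Concatenate the visual feature x_o with the semantic embedding y_o."""
    y_o = glove[class_name]
    return np.concatenate([x_o, y_o])

rep = object_representation(rng.standard_normal(d_vis), "cup")
print(rep.shape)  # (13,)
```

Dropping either half of this concatenation corresponds to the visual-only and semantic-only columns of Table 4.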
Qualitative Results. The qualitative results in supplementary material Section 4.4 demonstrate that RBP and BiGED infer past actions more accurately. Generalizability of BiGED. A visual-only BiGED model trained to infer past actions was evaluated for video-level action recognition. It obtained 50.5 mAP on the Charades dataset without any tuning. Although not state-of-the-art, this result suggests the value of the abductive past action inference model for general action understanding using only visual features.
5 Discussion & Conclusion
This paper introduces abductive past action inference, a task involving past action set prediction, sequence prediction, and verification, all formulated as closed-set classification tasks. Our experiments show that while deep learning models can perform these tasks to some extent, holistic end-to-end models are ineffective. Large multi-modal models like CLIP show promise, but our proposed human-object relational approaches—such as relational graph neural networks, bilinear pooling, and the BiGED model—outperform them, demonstrating the value of object-relational modelling. We find conditional text generation unsuitable for this task due to limited control, and even advanced foundational models fail after instruction tuning. Overall, human-object-centric video representations emerge as the most effective approach, and abductive past action inference may enhance general human action understanding.
Acknowledgment. This research/project is supported by the National Research Foundation, Singapore, under its NRF Fellowship (Award NRF-NRFF14-2022-0001) and this research is partially supported by MOE grant RG100/23. It was also partly funded by an ASTAR CRF award to Cheston Tan and supported by the ASTAR CIS (ACIS) Scholarship awarded to Clement Tan. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of these organizations.
References
- [1] Yazan Abu Farha, Alexander Richard, and Juergen Gall. When will you do what?-anticipating temporal occurrences of activities. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5343–5352, 2018.
- [2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- [3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- [4] Le-Wen Cai, Wang-Zhou Dai, Yu-Xuan Huang, Yu-Feng Li, Stephen H Muggleton, and Yuan Jiang. Abductive learning with ground knowledge base. In IJCAI, pages 1815–1821, 2021.
- [5] Yuren Cong, Wentong Liao, Hanno Ackermann, Bodo Rosenhahn, and Michael Ying Yang. Spatial-temporal transformer for dynamic scene graph generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16372–16382, 2021.
- [6] Wang-Zhou Dai, Qiuling Xu, Yang Yu, and Zhi-Hua Zhou. Bridging machine learning and logical reasoning by abductive learning. Advances in Neural Information Processing Systems, 32, 2019.
- [7] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- [8] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6202–6211, 2019.
- [9] Basura Fernando and Samitha Herath. Anticipating human actions by correlating past with the future with jaccard similarity measures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13224–13233, 2021.
- [10] Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847, 2016.
- [11] Antonino Furnari and Giovanni Maria Farinella. What would you expect? anticipating egocentric actions with rolling-unrolling lstms and modality attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6252–6261, 2019.
- [12] Yang Gao, Oscar Beijbom, Ning Zhang, and Trevor Darrell. Compact bilinear pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
- [13] Rohit Girdhar and Kristen Grauman. Anticipative video transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 13505–13515, October 2021.
- [14] Dayoung Gong, Joonseok Lee, Manjin Kim, Seong Jong Ha, and Minsu Cho. Future transformer for long-term action anticipation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3052–3061, 2022.
- [15] Guodong Guo and Alice Lai. A survey on still image based human action recognition. Pattern Recognition, 47(10):3343–3361, 2014.
- [16] Zhongyi Han, Le-Wen Cai, Wang-Zhou Dai, Yu-Xuan Huang, Benzheng Wei, Wei Wang, and Yilong Yin. Abductive subconcept learning. Science China Information Sciences, 66(2):122103, 2023.
- [17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- [18] Samitha Herath, Mehrtash Harandi, and Fatih Porikli. Going deeper into action recognition: A survey. Image and vision computing, 60:4–21, 2017.
- [19] Jack Hessel, Jena D Hwang, Jae Sung Park, Rowan Zellers, Chandra Bhagavatula, Anna Rohrbach, Kate Saenko, and Yejin Choi. The abduction of sherlock holmes: A dataset for visual abductive reasoning. arXiv preprint arXiv:2202.04800, 2022.
- [20] Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J Black. Towards understanding action recognition. In Proceedings of the IEEE international conference on computer vision, pages 3192–3199, 2013.
- [21] Jingwei Ji, Ranjay Krishna, Li Fei-Fei, and Juan Carlos Niebles. Action genome: Actions as compositions of spatio-temporal scene graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10236–10247, 2020.
- [22] Yang Jin, Linchao Zhu, and Yadong Mu. Complex video action reasoning via learnable markov logic network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3242–3251, 2022.
- [23] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
- [24] Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. Mvitv2: Improved multiscale vision transformers for classification and detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4804–4814, 2022.
- [25] Chen Liang, Wenguan Wang, Tianfei Zhou, and Yi Yang. Visual abductive reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15565–15575, 2022.
- [26] Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models, 2023.
- [27] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video swin transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3202–3211, 2022.
- [28] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- [29] Jae Sung Park, Chandra Bhagavatula, Roozbeh Mottaghi, Ali Farhadi, and Yejin Choi. Visualcomet: Reasoning about the dynamic context of a still image. In European Conference on Computer Vision, pages 508–524. Springer, 2020.
- [30] Paritosh Parmar, Eric Peh, Ruirui Chen, Ting En Lam, Yuhan Chen, Elston Tan, and Basura Fernando. Causalchaos! dataset for comprehensive causal action question answering over longer causal chains grounded in dynamic visual scenes. arXiv preprint arXiv:2404.01299, 2024.
- [31] Paritosh Parmar, Eric Peh, and Basura Fernando. Learning to visually connect actions and their effects. arXiv preprint arXiv:2401.10805, 2024.
- [32] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.
- [33] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- [34] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.
- [35] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- [36] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015.
- [37] Debaditya Roy, Ramanathan Rajendiran, and Basura Fernando. Interaction region visual transformer for egocentric action anticipation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 6740–6750, January 2024.
- [38] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. CoRR, abs/1409.0575, 2014.
- [39] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE transactions on neural networks, 20(1):61–80, 2008.
- [40] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In European semantic web conference, pages 593–607. Springer, 2018.
- [41] Gunnar A Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In European Conference on Computer Vision, pages 510–526. Springer, 2016.
- [42] Kaihua Tang, Yulei Niu, Jianqiang Huang, Jiaxin Shi, and Hanwang Zhang. Unbiased scene graph generation from biased training. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3716–3725, 2020.
- [43] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- [44] Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, Limin Wang, and Yu Qiao. Internvideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191, 2022.
- [45] Aming Wu, Linchao Zhu, Yahong Han, and Yi Yang. Connective cognition network for directional visual commonsense reasoning. Advances in Neural Information Processing Systems, 32, 2019.
- [46] Changqian Yu, Yifan Liu, Changxin Gao, Chunhua Shen, and Nong Sang. Representative graph neural network. In European Conference on Computer Vision, pages 379–396. Springer, 2020.
- [47] Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6720–6731, 2019.
- [48] Hao Zhang, Yeo Keat Ee, and Basura Fernando. Rca: Region conditioned adaptation for visual abductive reasoning. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 9455–9464, 2024.
- [49] Qi Zhao, Shijie Wang, Ce Zhang, Changcheng Fu, Minh Quan Do, Nakul Agarwal, Kwonjoon Lee, and Chen Sun. Antgpt: Can large language models help long-term action anticipation from videos? In ICLR, 2024.
- [50] Tianyang Zhong, Yaonai Wei, Li Yang, Zihao Wu, Zhengliang Liu, Xiaozheng Wei, Wenjun Li, Junjie Yao, Chong Ma, Xiang Li, et al. Chatabl: Abductive learning via natural language interaction with chatgpt. arXiv preprint arXiv:2304.11107, 2023.
6 Supplementary - Other Baseline Methods
In this section, we give the details of all other baseline methods used in the paper.
6.1 Rule-based Abductive Past Action Inference
In abductive past action inference, we assume the following logical association holds,
$$
\{a_{1},a_{2},a_{3},\cdots,a_{K}\}\rightarrow\{R_{1},R_{2},\cdots,R_{N}\} \tag{11}
$$
where $\{R_{1},R_{2},\cdots,R_{N}\}$ is the relation set $\mathcal{R}$ present in an image and $\{a_{1},a_{2},a_{3},\cdots,a_{K}\}$ is the action set $\mathcal{A}$ executed by the human to arrive at the state captured in the image. Note that the set of all actions is denoted by $A$, where $\mathcal{A}\subset A$.
In rule-based inference, each relation takes the symbolic form $R_{k}=\langle H,o_{k}\rangle$, where $H$ is the human and $o_{k}$ is the $k^{th}$ object label in the image. As the human is common to all relations, we omit it from each relation. The relational association is then updated as follows:
$$
\{a_{1},a_{2},a_{3},\cdots,a_{K}\}\rightarrow\{o_{1},o_{2},\cdots,o_{N}\} \tag{12}
$$
for any image. In rule-based abductive past action set inference, for each given object pattern $\{o_{1},o_{2},\cdots,o_{N}\}$, we count the occurrence of each action $a_{j}$. Let us denote the frequency of action $a_{j}$ for object pattern $\mathcal{O}_{q}=\{o_{1},o_{2},\cdots,o_{N}\}$ over the entire training set by $C_{j}^{q}$. Therefore, for each object pattern $\mathcal{O}_{q}$, we obtain a frequency vector over all past actions denoted by:
$$
\mathbf{C^{q}}=[C_{1}^{q},C_{2}^{q},\cdots,C_{|A|}^{q}] \tag{13}
$$
Then, we can convert these frequencies into probabilities using softmax:
$$
P(A|\mathcal{O}_{q})=softmax([C_{1}^{q},C_{2}^{q},\cdots,C_{|A|}^{q}]) \tag{14}
$$
We use this to perform abductive past action set inference using the test set. Given a test image, we first obtain the object pattern $\mathcal{O}$ . Next, we obtain the action probability vector for the object pattern from the training set using Equation 14. If an object pattern does not exist in the training set, we assign equal probability to each action.
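As a concrete illustration, the frequency-counting inference of Equations 12–14 can be sketched as follows. The object patterns, action indices, and helper names are hypothetical toy values, not the paper's actual data:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def build_counts(training_pairs, num_actions):
    """Accumulate C^q_j: how often action j co-occurs with object pattern q."""
    counts = {}
    for objects, actions in training_pairs:
        c = counts.setdefault(frozenset(objects), np.zeros(num_actions))
        for a in actions:
            c[a] += 1
    return counts

def infer_action_probs(counts, objects, num_actions):
    key = frozenset(objects)
    if key not in counts:
        # Unseen object pattern: assign equal probability to each action.
        return np.full(num_actions, 1.0 / num_actions)
    return softmax(counts[key])

# Toy training pairs (object pattern, executed action indices); 4 action classes.
train = [({"cup", "chair"}, [0, 2]), ({"cup", "chair"}, [0]), ({"sofa"}, [3])]
counts = build_counts(train, num_actions=4)
p = infer_action_probs(counts, {"cup", "chair"}, num_actions=4)
```

At test time the object pattern is looked up directly; the uniform fallback handles patterns never seen in training.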
6.2 Relational MLP
The MLP consists of two layers. The human feature $x_{h}$, object feature $x_{o}$, and the union region of both human and object $x_{u}$, obtained from the ResNet-101 FasterRCNN backbone, are concatenated to form the joint relational visual feature $x_{v}$. The semantic representation $y_{s}$ is formed by concatenating the Glove [32] embeddings of the human $y_{h}$ and object $y_{o}$. We perform max pooling on the relational features $\mathcal{R}_{i}=\{r_{1},r_{2},\ldots,r_{n}\}$ of a given image, where each $r_{j}=[x_{v},y_{s}]$ is the concatenation of visual and semantic features. Afterward, we pass the pooled feature into the 2-layer MLP. The inputs and outputs of the first layer are $D$-dimensional and we apply dropout with $p=0.5$. The last layer of the MLP is the classification layer. Lastly, we apply a sigmoid function before applying the multi-label margin loss to train the model.
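A minimal sketch of this baseline's forward pass, using NumPy with random stand-in weights (the actual model is trained with dropout and the multi-label margin loss, both omitted here; the dimensions are toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8            # joint relational feature dimension (toy value)
NUM_ACTIONS = 5  # toy number of action classes

# Hypothetical weights standing in for the trained 2-layer MLP.
W1, b1 = rng.normal(size=(D, D)), np.zeros(D)
W2, b2 = rng.normal(size=(D, NUM_ACTIONS)), np.zeros(NUM_ACTIONS)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relational_mlp(relations):
    """relations: (n, D) array, one row r_j = [x_v, y_s] per human-object pair."""
    pooled = relations.max(axis=0)         # max pooling over the relation set
    h = np.maximum(pooled @ W1 + b1, 0.0)  # first layer + ReLU
    return sigmoid(h @ W2 + b2)            # per-action scores in (0, 1)

scores = relational_mlp(rng.normal(size=(3, D)))
```

Max pooling makes the image representation invariant to the ordering of the relations, which is appropriate since they form a set.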
6.3 Relational Transformer
Transformers [43] are a popular class of deep learning models. They are effective at capturing relationships between far-apart elements in a set or a sequence. In this work, we use transformers as a set summarization model. Multi-head self-attention Transformer: Specifically, we utilize a multi-head self-attention (MHSA) transformer model. The MHSA transformer contains one encoder and three decoder layers by default. We do not use any positional encoding as we are summarizing a set. Given the set of relational representations of an image $\mathcal{R}_{i}=\{r_{1},r_{2},\cdots,r_{n}\}$, the transformer model outputs a tensor of size $n\times d$, where $d$ is the size of the relational representation. Afterward, we use max-pooling to obtain an image representation vector $x_{r}$. A visual illustration of this model is shown in Fig. 4 (left). Cross-attention Transformer: Similar to the multi-head self-attention transformer, we use one encoder and three decoder layers. The inputs to the transformer encoder comprise the concatenated visual and semantic features of the human and objects $[x_{h},y_{h},x_{o},y_{o}]$, excluding the union feature $x_{u}$.
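The set-summarization idea can be illustrated with a single self-attention layer followed by max-pooling. This toy single-head sketch omits the multi-head structure, feed-forward sublayers, and the three decoder layers of the actual model:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8  # size of the relational representation (toy value)

# Hypothetical projections for a single attention head.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def self_attention(R):
    """R: (n, d) set of relational vectors; no positional encoding (set input)."""
    Q, K, V = R @ Wq, R @ Wk, R @ Wv
    logits = Q @ K.T / np.sqrt(d)
    att = np.exp(logits - logits.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)  # row-wise softmax
    return att @ V                         # (n, d) contextualised relations

def summarise(R):
    return self_attention(R).max(axis=0)   # max-pool to image vector x_r

R = rng.normal(size=(4, d))
x_r = summarise(R)
```

Because self-attention without positional encoding is permutation-equivariant and max-pooling is permutation-invariant, reordering the relations leaves $x_{r}$ unchanged.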
Figure 4: (left) The relational multi-head self-attention transformer. (right) The relational cross-attention transformer.
6.4 Relational GPT-3.5 Past Action Inference
GPT and its later versions [34, 35, 3] have revolutionized the AI field by solving many natural language processing and reasoning tasks. Here, we use the GPT-3.5 turbo version to perform abductive past action inference. To do this, we generate a query prompt as well as a contextual description for each image using the ground truth relational annotations based on subject-predicate-object triplet relations. In contrast to all other methods, we utilize the ground truth predicate labels for GPT-3.5. An example of the contextual description and textual prompt is shown in Figure 5. In addition, an answer generated by GPT-3.5 is shown in Figure 6. We specifically design the prompt such that GPT-3.5 responses are constrained to the ground truth action sets within the dataset. Based on the responses from the GPT-3.5 model, we construct a score vector where each predicted action is marked with a score of 1 and all others with 0. We call this hard matching, as we assign 1 if and only if the GPT-3.5 model outputs the exact action class name given in the input prompt.
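The hard-matching score construction can be sketched as follows; the action class names are a hypothetical four-class subset standing in for the 157 Charades classes:

```python
# Toy subset of action class names used in the prompt (hypothetical).
ACTION_CLASSES = ["Holding some food", "Holding a sandwich",
                  "Opening a window", "Holding a dish"]

def hard_match_scores(gpt_answer_lines, action_classes):
    """Score 1 iff GPT-3.5 outputs the exact class name from the prompt."""
    predicted = {line.strip() for line in gpt_answer_lines}
    return [1 if name in predicted else 0 for name in action_classes]

answer = ["Holding a sandwich", "Opening a window"]
scores = hard_match_scores(answer, ACTION_CLASSES)  # [0, 1, 1, 0]
```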
Figure 5: The context description and the textual prompt used for the GPT-3.5 turbo model.
Figure 6: Answer generated by GPT-3.5 turbo model. The correct answers are shown in green color whereas false positives and negatives are shown in red. This example is cherry-picked.
The GPT-3.5 model is able to generate reasonable answers for some images (see Fig. 6). However, most of the time the GPT-3.5 answers are either overly conservative or overly aggressive. For example, GPT-3.5 may respond "There is not enough information given in the context to determine the specific actions the person executed to arrive in the described state", while in other instances it selects all action classes. This may be the main reason for the poor performance of GPT-3.5. However, it should be noted that the GPT model is fed with more information than all other baselines, as we also provide the predicate relation to GPT-3.5. We also note that the GPT-3.5 + CLIP (Text) model with both soft and hard scores performs better than the hard-score-only method. Assuming that large language models such as GPT-3.5 are capable of human-like reasoning, we can perhaps suggest that abductive inference requires more than text-based and commonsense reasoning. The fact that pure rule-based inference performs better than GPT-3.5 with less information suggests that GPT-3.5 is not well suited for abductive past action inference, perhaps because it lacks a detailed understanding of some human behaviors and the effects of human actions.
6.5 VILA Fine-tuning for Past Action Inference
With the proven success of Large Language Models (LLMs) across various NLP tasks, recent research has extended their capabilities to vision tasks, resulting in the development of Visual Language Models (VLMs). These models are typically enhanced through prompt-tuning (where the LLM is frozen) or fine-tuning. We employ a fine-tuned VLM, VILA [26], which not only advances state-of-the-art performance on vision tasks but also retains robust capabilities in text processing. VILA demonstrates strong reasoning abilities in multi-image analysis, contextual learning, and zero/few-shot learning scenarios. Hence, we leverage VILA for the task of abductive past action set inference.
7 Details on Dataset Creation
Figure 7: Number of snapshots (in $\log_{2}$) for sets of $n$ past actions in the Action Genome test set. (a) Abduct at $T$; (b) Abduct last snapshot.
How to generate action sets and sequences? To obtain the ground truth action set $\mathcal{A}$ for an image in the Action Genome dataset using the Charades action labels, we first compute the time $t$ for each individual frame within a video sequence by using the formula: ${t}=\frac{v_{d}}{n}$ , where $v_{d}$ and $n$ denote the video duration and the number of frames in the video respectively. Then, we multiply the current frame number $f_{n}$ with ${t}$ to obtain the current time, $t_{c}=t× f_{n}$ .
Action sets: As each video contains multiple actions, we check whether the current time of the frame $t_{c}$ falls within the start time $t_{s}$ and end time $t_{e}$ of each action. If it does, we add the ground truth action label to the per-frame action set $\mathcal{A}_{t}$ of the image. To obtain the ground truth action set for the $t^{th}$ image, we take the union of all per-frame action sets from $t=1$ up to and including the $t^{th}$ image.
Action sequences: We sort the start time $t_{s}$ of the actions contained in the video in ascending order. Then, for each image, if the current time of the frame is greater than the start time of the action ( $t_{c}≥ t_{s}$ ), we add it to the sequence.
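The timing computation and the set/sequence construction above can be sketched as follows. The action labels and times are toy values, and the sketch assumes frames sample the video densely, so the cumulative union equals all actions started by $t_c$:

```python
def frame_time(video_duration, num_frames, frame_number):
    """t = v_d / n per frame, so the current time is t_c = t * f_n."""
    return (video_duration / num_frames) * frame_number

def past_action_set(actions, t_c):
    """Union of per-frame action sets up to t_c: every action whose start
    time t_s is at or before t_c has appeared in some earlier frame."""
    return {label for label, t_s, t_e in actions if t_s <= t_c}

def past_action_sequence(actions, t_c):
    """Actions started by t_c, ordered by ascending start time t_s."""
    return [label for t_s, label in sorted(
        (t_s, label) for label, t_s, t_e in actions if t_c >= t_s)]

# Toy annotations: (label, start time, end time) in seconds.
acts = [("open door", 0.0, 2.0), ("walk", 1.0, 5.0), ("sit", 6.0, 8.0)]
t_c = frame_time(video_duration=10.0, num_frames=100, frame_number=40)
```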
We provide details on the number of images for a set of $n$ past actions in the AG dataset for these setups in Figure 7. As can be seen from these statistics, the majority of the images have more than five actions and some images have as many as 26 actions.
8 Implementation Details
We use FasterRCNN [36] with a ResNet-101 [17] backbone to extract human and object features from each image, based on the ground truth person and object bounding boxes provided by AG, for all object-relational models. We load the pre-trained weights provided by [5], trained on the AG training set, which obtain 24.6 mAP at 0.5 IoU with COCO metrics. The parameters of the FasterRCNN are kept fixed during both training and inference for the abductive past action inference task. Our default human and object visual representations have 512 dimensions, obtained from the 2048-dimensional FasterRCNN visual features via linear mappings. During training, we train the models for 10 epochs and set the batch size to 1 video (a video contains many frames). We assume the frames are i.i.d. Note that even though there are multiple images in a batch, the images are processed in parallel for the transformer models and individually for the graph models; there is no sharing of information between images. We use the AdamW [28] optimizer with an initial learning rate of 1e-5 along with a scheduler that decreases the learning rate by a factor of 0.5 down to a minimum of 1e-7. We utilize Glove [32] word embeddings of size 200 for the human and object semantic features. In addition, gradient clipping with a maximal norm of 5 is applied. Moreover, we report the mean across 3 different runs for each configuration to ensure accurate performance estimates. All models (except end-to-end and ViT) are trained on a single RTX3090 or A5000 GPU. For CLIP, we use publicly available implementations [33]. We use the public API of OpenAI for GPT-3.5 models.
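Two of these training details, global-norm gradient clipping (max norm 5) and the halving learning-rate schedule with a 1e-7 floor, can be sketched in isolation. In practice we rely on the standard optimizer and scheduler implementations; this NumPy sketch only illustrates the arithmetic:

```python
import numpy as np

def clip_grad_norm(grads, max_norm=5.0):
    """Global-norm gradient clipping: rescale all gradients if the joint
    L2 norm exceeds max_norm."""
    total = float(np.sqrt(sum((g ** 2).sum() for g in grads)))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads, min(total, max_norm)

def decayed_lr(num_decays, lr0=1e-5, factor=0.5, min_lr=1e-7):
    """Learning rate after `num_decays` scheduler steps: halve each time,
    never dropping below the minimum."""
    return max(lr0 * factor ** num_decays, min_lr)

grads = [np.full(100, 1.0)]  # global norm = sqrt(100) = 10
clipped, norm = clip_grad_norm(grads)
```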
9 Additional Experiments
9.1 Abductive Past Action Sequence Prediction
Next, we formulate the abductive past action sequence prediction task based on the Abduct at $T$ setup. We attach a GRU / transformer decoder to our existing object-relational models. To train both sequence prediction models, we freeze the object detector and the relational model $\phi()$. For the GRU, we use the relational vector $x_{r}$ as the initial hidden state and pass the action distribution obtained from $\phi_{c}()$ as its input. The transformer decoder takes the non-pooled relational features (a matrix of size $n\times d$) as the key and value, and the max-pooled relational feature $x_{r}$ as the query. The outputs of these models are fed into a linear classifier to produce action sequences autoregressively. The results are reported in Table 5. The BiGED model obtains slightly better performance than the rest. Although the performance of these models is suboptimal, we note that humans are also unable to obtain satisfactory results (only 14.00% accuracy). As we are constrained to the information available in a single frame, the solution space contains a substantial number of valid sequence permutations; the task is therefore extremely challenging. The poor human performance also hints at how humans may use abduction: perhaps humans do not resolve full causal chains when performing abduction, as doing so is very challenging.
We use the Hamming Loss to evaluate the action sequence prediction models as follows:
$$
H=\frac{1}{NL}\sum_{n=1}^{N}\sum_{l=1}^{L}\left[y_{l}^{(n)}\neq\hat{y}_{l}^{(n)}\right] \tag{15}
$$
where $N$ is the total number of samples, $L$ is the sequence length, and $[\cdot]$ denotes the Iverson bracket. The reported accuracy is then $(1-H)\times 100$.
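Equation 15 maps directly to code; the sketch below assumes all sequences share the same length $L$ (padding/masking for variable lengths is omitted):

```python
def hamming_accuracy(y_true, y_pred):
    # y_true, y_pred: lists of equal-length action sequences (lists of labels).
    total = sum(len(seq) for seq in y_true)          # N * L positions
    errors = sum(
        yl != pl
        for ys, ps in zip(y_true, y_pred)
        for yl, pl in zip(ys, ps)
    )
    h = errors / total        # Hamming loss H (Eq. 15)
    return (1 - h) * 100      # accuracy, as defined above

# Two mismatches out of six positions -> H = 1/3, accuracy ~ 66.67
acc = hamming_accuracy([[1, 2, 3], [4, 5, 6]], [[1, 2, 0], [4, 0, 6]])
```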
Table 5: Abductive past action sequence prediction using the proposed methods on the Abduct at $T$ setup.
| Model | Accuracy (GRU) | Accuracy (Transformer) |
| --- | --- | --- |
| Human performance | 14.00 | – |
| Relational MLP | 9.43 $±$ 0.13 | 9.59 $±$ 0.06 |
| Relational Self Att. Transformer | 9.72 $±$ 0.06 | 9.95 $±$ 0.07 |
| Relational Cross Att. Transformer | 9.69 $±$ 0.18 | 9.96 $±$ 0.12 |
| Relational GNNED | 9.81 $±$ 0.05 | 10.11 $±$ 0.19 |
| RBP | 10.48 $±$ 0.05 | 10.22 $±$ 0.12 |
| BiGED | 10.54 $±$ 0.15 | 10.1 $±$ 0.14 |
9.2 Ablation Study
Ablation on graph affinity function: By default, we use the Jaccard Vector Similarity as the affinity $W_{A}(i,j)$ for the GNNED and BiGED models. Here, we ablate the impact of this design choice by comparing it with cosine similarity and dot product. As can be seen from the results in Table 6, the Jaccard Vector Similarity (JVS) obtains better results than cosine similarity and dot product. This behavior can be attributed to the fully differentiable and bounded nature of JVS in contrast to the dot product or cosine similarity.
Table 6: Ablation on graph affinity using Abduct at T setup.
| Affinity | | | |
| --- | --- | --- | --- |
| Jaccard Vector Similarity | 35.75 | 60.55 | 44.37 |
| Cosine Similarity | 34.17 | 57.98 | 41.97 |
| Dot product | 28.81 | 54.68 | 38.38 |
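For reference, a standard differentiable form of a vector (weighted) Jaccard similarity over nonnegative features is sketched below. Whether this matches the paper's exact JVS definition is an assumption, but it illustrates the bounded, almost-everywhere-differentiable behavior noted above:

```python
import torch

def jaccard_vector_similarity(a, b, eps=1e-8):
    # Weighted (soft) Jaccard between nonnegative feature vectors:
    # sum of element-wise minima over sum of element-wise maxima.
    # Bounded in [0, 1], unlike the dot product, and differentiable
    # almost everywhere.
    num = torch.minimum(a, b).sum(dim=-1)
    den = torch.maximum(a, b).sum(dim=-1)
    return num / (den + eps)
```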
Impact of semantic features and learning scheduler: Apart from the two setups mentioned above, we also use a third setup for ablations, chosen for faster experimentation. In this setup, the ground-truth past action set is formed from the current and previous images, i.e., $\mathcal{A}=\mathcal{A}_{t-1}\cup\mathcal{A}_{t}$. We retrain all object-relational models with the corresponding past action sets and perform the ablation studies on the relational self-attention transformer. We expect these findings to generalize to the other setups.
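The ground-truth construction for this third setup is simply a set union over the two frames; for multi-hot label vectors it is an element-wise maximum. A small sketch with illustrative labels:

```python
# Action sets of the previous and current frames (illustrative labels).
a_prev = {"opening a laptop", "sitting in a chair"}
a_curr = {"working on a laptop", "sitting in a chair"}
# Ground truth for the third setup: the union of both sets.
a_union = a_prev | a_curr

# Equivalently, for multi-hot label vectors, the union is an element-wise max.
prev_vec = [1, 1, 0]
curr_vec = [0, 1, 1]
union_vec = [max(p, c) for p, c in zip(prev_vec, curr_vec)]
```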
We evaluate the effect of visual and semantic (GloVe [32]) features in Table 7. The use of semantic features provides a substantial performance boost across all metrics. We attribute this increase to the contextual information provided by the semantics: knowing the semantics of objects enables the model to identify and relate actions more effectively, providing a more intuitive basis for reasoning about them. The learning rate scheduler also yields a considerable improvement for the transformer model. We therefore use semantic features and the learning rate scheduler for all our models.
Table 7: Ablation study for the impact of semantic features and scheduler on the abductive past action set inference for the Abduct from current and previous images setup using self-attention transformer.
| Configuration | | | |
| --- | --- | --- | --- |
| Visual only | 21.42 $±$ 0.13 | 46.44 $±$ 0.12 | 34.24 $±$ 0.42 |
| Visual + scheduler | 21.93 $±$ 0.16 | 47.04 $±$ 0.44 | 34.80 $±$ 0.47 |
| Visual + semantic | 35.40 $±$ 0.16 | 68.47 $±$ 0.06 | 54.90 $±$ 0.52 |
| Visual + semantic + scheduler | 35.77 $±$ 0.30 | 69.16 $±$ 0.50 | 55.70 $±$ 0.47 |
9.3 Object-Relational Model Parameters
Table 8: The object-relational model parameters for the abductive past action inference task.
| Model | Parameters |
| --- | --- |
| Relational MLP | 13.4M |
| Relational Self Att. Transformer | 101.2M |
| Relational Cross Att. Transformer | 65.9M |
| Relational GNNED | 80.7M |
| RBP | 373.4M |
| BiGED | 213.6M |
The parameter counts of the proposed object-relational models are shown in Table 8. The rule-based inference model has no parameters and is therefore omitted from the table. As shown by the results reported earlier, the GNNED model outperforms the transformer model despite having fewer parameters. In addition, our proposed BiGED model has fewer parameters than the RBP model while performing comparably or better. These results further demonstrate the effectiveness of the proposed GNNED, RBP, and BiGED models for the challenging task of abductive past action inference.
Figure 8: Manually selected qualitative results produced by each model for abductive past action set inference (Abduct last image setup) on the AG test dataset. The first column shows the image and its corresponding ground-truth past actions. The remaining columns display the actions predicted by each model, with correct predictions highlighted in green and incorrect predictions in red.
9.4 Qualitative Results
We compare qualitative results for the abductive past action set prediction task in Figure 8. For each image, we take the top-k predicted actions from each model, where k equals the number of ground-truth past action labels for that image. All models demonstrate some ability to perform abductive past action inference. The first image contains a person, laptop, table, cup, and dish; the second contains a person, floor, blanket, bag, and vacuum. In both scenarios, RBP and BiGED infer past actions more accurately.
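This per-image evaluation protocol (take as many top-scoring predictions as there are ground-truth past actions) can be sketched as follows, with illustrative scores and labels:

```python
def topk_predictions(scores, num_gt):
    # scores: dict mapping action label -> predicted score for one image.
    # Return the num_gt highest-scoring actions, matching the number of
    # ground-truth past action labels for that image.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:num_gt]

scores = {"holding a laptop": 0.9, "drinking": 0.4, "cooking": 0.7}
preds = topk_predictions(scores, num_gt=2)
# -> ["holding a laptop", "cooking"]
```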