## Diagram: Relational Feature Extraction
### Overview
The diagram illustrates a relational feature extraction pipeline. Human, object, and union features pass through GNNED modules, a Linear Layer, and a Bilinear Module, and the results are concatenated into a relational feature *x\_r*, which a classifier MLP uses to predict past actions.
### Components/Axes
* **Input Features:**
* Human Feature: *x\_h* (represented by a red block)
* Object Feature: \[*x\_o*, *y\_o*] (represented by a green block)
* Union Feature: *x\_u* (represented by an orange block)
* **Processing Layers:**
* GNNED (Graph Neural Network Embedding and Decoding) (represented by light green rounded rectangles)
* Linear Layer (represented by a light purple rounded rectangle)
* Bilinear Module (represented by a light red rounded rectangle)
* Classifier MLP (Multilayer Perceptron) (represented by a light purple rounded rectangle)
* **Intermediate Features:**
* Concatenated Human and Object Features (represented by a block with red and green sections)
* Output of GNNED (Human Feature branch) (represented by a green block)
* Output of Linear Layer (Object Feature branch) (represented by a light purple block)
* Output of GNNED (Object Feature branch) (represented by a blue block)
* Output of Bilinear Module (Human Feature branch) (represented by a yellow block)
* Relational Feature *x\_r* (represented by a block with yellow, blue, and orange sections)
* **Output:**
* Past Actions
### Detailed Analysis
1. **Human Feature Branch:**
* The "Human Feature" *x\_h* (red block) is fed into a "GNNED" layer (light green rounded rectangle).
* The output of the "GNNED" layer (green block) is then fed into a "Bilinear Module" (light red rounded rectangle).
* The output of the "Bilinear Module" (yellow block) is concatenated with the output of the object feature branch.
2. **Object Feature Branch:**
* The "Object Feature" \[*x\_o*, *y\_o*] (green block) is concatenated with the "Human Feature" *x\_h* (red block).
* The concatenated feature (red and green block) is fed into a "Linear Layer" (light purple rounded rectangle).
* The output of the "Linear Layer" (light purple block) is then fed into a "GNNED" layer (light green rounded rectangle).
* The output of the "GNNED" layer (blue block) is concatenated with the output of the human feature branch.
3. **Union Feature Branch:**
* The "Union Feature" *x\_u* (orange block) is concatenated with the output of the human and object feature branches.
4. **Concatenation and Classification:**
* The outputs of the human and object feature branches (yellow and blue blocks, respectively) and the "Union Feature" *x\_u* (orange block) are concatenated to form the "Relational Features" (yellow, blue, and orange block).
* These "Relational Features" are represented as *x\_r* (yellow, blue, and orange block) and are fed into a "Classifier MLP" (light purple rounded rectangle).
* The output of the "Classifier MLP" is "Past Actions".
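
The wiring described in the steps above can be sketched in a few lines of plain Python. This is only an illustration of the data flow, not the actual model: the diagram does not specify the internals of GNNED or the Bilinear Module, so every module here is replaced by a hypothetical dense-layer stand-in (`dense`), and all dimensions (`D_H`, `D_O`, `D_U`, `D_HID`, `N_ACTIONS`) are assumed values chosen for the example.

```python
import random


def dense(in_dim, out_dim, rng, relu=False):
    """Build a dense layer y = W x + b (optionally with ReLU).

    Assumption: used as a stand-in for every module in the diagram,
    because the diagram does not show what GNNED or the Bilinear Module compute.
    """
    W = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    b = [0.0] * out_dim

    def apply(x):
        if len(x) != in_dim:
            raise ValueError(f"expected {in_dim} inputs, got {len(x)}")
        y = [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]
        return [max(0.0, v) for v in y] if relu else y

    return apply


def relational_feature(x_h, x_o, x_u, m):
    """Combine the three branches into x_r, following the diagram's wiring."""
    # Human branch: GNNED -> Bilinear Module (yellow block).
    human = m["bilinear"](m["gnned_h"](x_h))
    # Object branch: concat(human, object) -> Linear Layer -> GNNED (blue block).
    obj = m["gnned_o"](m["linear"](x_h + x_o))
    # Relational feature: concat(yellow, blue, orange).
    return human + obj + x_u


# Hypothetical dimensions, chosen only for this example.
D_H, D_O, D_U, D_HID, N_ACTIONS = 8, 6, 5, 4, 3
rng = random.Random(0)

modules = {
    "gnned_h": dense(D_H, D_HID, rng, relu=True),
    "bilinear": dense(D_HID, D_HID, rng),
    "linear": dense(D_H + D_O, D_HID, rng),
    "gnned_o": dense(D_HID, D_HID, rng, relu=True),
}
# Classifier MLP: one hidden layer followed by an output layer.
mlp_hidden = dense(2 * D_HID + D_U, D_HID, rng, relu=True)
mlp_out = dense(D_HID, N_ACTIONS, rng)

x_h = [rng.random() for _ in range(D_H)]
x_o = [rng.random() for _ in range(D_O)]
x_u = [rng.random() for _ in range(D_U)]

x_r = relational_feature(x_h, x_o, x_u, modules)
scores = mlp_out(mlp_hidden(x_r))  # one score per candidate past action
print(len(x_r), len(scores))  # 13 3
```

The length of `x_r` is the sum of the two branch output widths and the union feature width, which is why the union feature passes through unchanged as the last `D_U` entries.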
### Key Observations
* The diagram illustrates a multi-branch architecture for feature extraction.
* GNNED layers are used in both the human and object feature branches.
* A bilinear module is used in the human feature branch.
* Concatenation is used to combine features from different branches.
* The final output is a prediction of "Past Actions" based on the extracted relational features.
### Interpretation
The diagram depicts a system that models the relationship between a human and an object in a scene, together with a union feature, most likely for action recognition or prediction. The GNNED modules suggest that the system is meant to capture relationships between entities. The bilinear module likely captures interactions between the human and object features, although the diagram as described shows it receiving only the GNNED output of the human branch. Concatenating the branch outputs lets the classifier draw on all three sources at once. The final MLP classifier predicts past actions from these integrated features, which indicates that the goal is to infer what has already happened from the observed relationships.