## Hierarchical Human Parsing with Typed Part-Relation Reasoning
Wenguan Wang¹,²†, Hailong Zhu³†, Jifeng Dai⁴, Yanwei Pang³‡, Jianbing Shen², Ling Shao²
¹ETH Zurich, Switzerland  ²Inception Institute of Artificial Intelligence, UAE
³Tianjin Key Laboratory of Brain-inspired Intelligence Technology, School of Electrical and Information Engineering, Tianjin University, China  ⁴SenseTime Research
{wenguanwang.ai, hlzhu2009}@gmail.com  https://github.com/hlzhu09/Hierarchical-Human-Parsing
## Abstract
Human parsing aims at pixel-wise human semantic understanding. As human bodies are inherently hierarchically structured, how to model human structures is the central theme in this task. Focusing on this, we seek to simultaneously exploit the representational capacity of deep graph networks and the hierarchical structure of human bodies. In particular, we make the following two contributions. First, three kinds of part relations, i.e., decomposition, composition, and dependency, are, for the first time, completely and precisely described by three distinct relation networks. This is in stark contrast to previous parsers, which only focus on a portion of the relations and adopt a type-agnostic relation modeling strategy. More expressive relation information can be captured by explicitly imposing the specific characteristics of the different relations on the parameters of the relation networks. Second, previous parsers largely ignore the need for an approximation algorithm over the loopy human hierarchy; we instead adopt an iterative reasoning process that assimilates generic message-passing networks with their edge-typed, convolutional counterparts. With these efforts, our parser lays the foundation for more sophisticated and flexible human relation patterns of reasoning. Comprehensive experiments on five datasets demonstrate that our parser sets a new state-of-the-art on each.
## 1. Introduction
Human parsing involves segmenting human bodies into semantic parts, e.g., head, arm, leg, etc. It has attracted tremendous attention in the literature, as it enables fine-grained human understanding and finds a wide spectrum of human-centric applications, such as human behavior analysis [50, 14], human-robot interaction [16], and many others.
Human bodies present a highly structured hierarchy and body parts inherently interact with each other. As
† The first two authors contributed equally to this work.
‡ Corresponding author: Yanwei Pang.
Figure 1: Illustration of our hierarchical human parser. (a) Input image. (b) The human hierarchy in (a), where dashed arrows indicate dependency relations and solid arrows indicate de-/compositional relations. (c) In our parser, three distinct relation networks are designed to address the specific characteristics of the different part relations, i.e., R^dec, R^com, and R^dep stand for the decompositional, compositional, and dependency relation networks, respectively. Iterative inference is performed for better approximation. For visual clarity, some nodes are omitted. (d) Our hierarchical parsing results.
shown in Fig. 1(b), there are different relations between parts [42, 59, 49]: decompositional and compositional relations (solid arrows) between constituent parts and entire parts (e.g., {upper body, lower body} and full body), and dependency relations (dashed arrows) between kinematically connected parts (e.g., hand and arm). Thus the central problem in human parsing is how to model such relations. Recently, numerous structured human parsers have been proposed [64, 15, 22, 63, 47, 73, 60, 20]. Their notable successes indeed demonstrate the benefit of exploiting structure in this problem. However, three major limitations in human structure modeling are still observed. (1) The structural information utilized is typically weak and the relation types studied are incomplete. Most efforts [64, 15, 22, 63, 47] directly encode human pose information into the parsing model, so they exploit only trivial structural information, not to mention needing extra pose annotations. In addition, previous structured parsers focus on only one or two of the aforementioned part relations, not all of them. For example, [20] only considers
dependency relations, and [73] relies on decompositional relations. (2) Only a single relation model is learnt to reason over the different kinds of relations, without considering their essential and distinct geometric constraints. Such a relation modeling strategy is over-general and simplistic, and does not characterize the diverse part relations well. (3) According to graph theory, as the human body yields a complex, cyclic topology, iterative inference is desirable for approximating the optimal result. However, current methods [22, 63, 47, 73, 60] are primarily built upon an immediate, feed-forward prediction scheme.
To respond to the above challenges and enable a deeper understanding of human structures, we develop a unified, structured human parser that precisely describes a more complete set of part relations, and efficiently reasons over structures through a message-passing, feed-back inference scheme. To address the first two issues, we start with an in-depth and comprehensive analysis of three essential relations, namely decomposition, composition, and dependency. Three distinct relation networks (R^dec, R^com, and R^dep in Fig. 1(c)) are elaborately designed and explicitly constrained to satisfy the specific, intrinsic characteristics of each relation. Then, we construct our parser as a tree-like, end-to-end trainable graph model, where the nodes represent the human parts and the edges are built upon the relation networks. For the third issue, a modified, relation-typed convolutional message passing procedure (Fig. 1(c)) is performed over the human hierarchy, enabling our method to obtain better parsing results from a global view. All components, i.e., the part nodes, edge (relation) functions, and message passing modules, are fully differentiable, enabling our whole framework to be end-to-end trainable and, in turn, facilitating learning about parts, relations, and inference algorithms.
More crucially, our structured human parser can be viewed as an essential variant of message passing neural networks (MPNNs) [19, 56], yet it is significantly differentiated in two aspects. (1) Most previous MPNNs are edge-type-agnostic, while ours addresses relation-typed structure reasoning with higher expressive capability. (2) By replacing the multilayer perceptron (MLP) based MPNN units with convolutional counterparts, our parser gains a spatial information preserving property, which is desirable for such a pixel-wise prediction task.
We extensively evaluate our approach on five standard human parsing datasets [22, 63, 44, 31, 45], achieving state-of-the-art performance on all of them (§4.2). In addition, from ablation studies of each essential component of our parser (§4.3), three key insights are found: (1) Exploring the different relations residing in human bodies is valuable for human parsing. (2) Distinctly and explicitly modeling the different types of relations better supports human structure reasoning. (3) Message passing based feed-back inference is able to reinforce parsing results.
## 2. Related Work
Human parsing: Over the past decade, active research has been devoted to pixel-level human semantic understanding. Early approaches tended to leverage image regions [35, 67, 68], hand-crafted features [57, 7], part templates [2, 11, 10] and human keypoints [66, 35, 67, 68], and typically explored certain heuristics over human body configurations [3, 11, 10] in a CRF [66, 28], structured model [67, 11], grammar model [3, 42, 10], or generative model [13, 51] framework. Recent advances have been driven by the streamlined designs of deep learning architectures. Some pioneering efforts revisit the classic template matching strategy [31, 36], address local and global cues [34], or use tree-LSTMs to gather structure information [32, 33]. However, due to the use of superpixels [34, 32, 33] or HOG features [44], they are fragmentary and time-consuming. Subsequent attempts thus follow a more elegant FCN architecture, addressing multi-level cues [5, 62], feature aggregation [45, 71, 38], adversarial learning [70, 46, 37], or cross-domain knowledge [37, 65, 20]. To further explore inherent structures, numerous approaches [64, 71, 22, 63, 15, 47] choose to directly encode pose information into the parsers, relying, however, on off-the-shelf pose estimators [18, 17] or additional annotations. Some others consider top-down [73] or multi-source semantic [60] information over hierarchical human layouts. Though impressive, they ignore iterative inference and seldom address explicit relation modeling, so they easily suffer from weak expressive ability and the risk of sub-optimal results.
Building on the general success of these works, we take a further step towards more precisely describing the different relations residing in human bodies, i.e., decomposition, composition, and dependency, and towards iterative, spatial-information preserving inference over the human hierarchy.
Graph neural networks (GNNs): GNNs have a rich history (dating back to [53]) and have seen a veritable explosion of interest in the research community over the last few years [23]. GNNs effectively learn graph representations in an end-to-end manner, and can generally be divided into two broad classes: Graph Convolutional Networks (GCNs) and Message Passing Graph Networks (MPGNs). The former [12, 48, 27] directly extend classical CNNs to non-Euclidean data. Their simple architecture promotes their popularity, but limits their modeling capability for complex structures [23]. MPGNs [19, 72, 56, 58] parameterize all the nodes, edges, and information fusion steps in graph learning, leading to more complicated yet flexible architectures.
Our structured human parser, which falls in the second category, can be viewed as an early attempt to explore GNNs in the area of human parsing. In contrast to conventional MPGNs, which are mainly MLP-based and edge-type-agnostic, we provide a spatial information preserving and relation-type aware graph learning scheme.
Figure 2: Illustration of our structured human parser for hierarchical human parsing during the training phase. The main components of the flowchart are marked (a)-(h). Please refer to § 3 for more details. Best viewed in color.
## 3. Our Approach
## 3.1. Problem Definition
Formally, we represent the human semantic structure as a directed, hierarchical graph G = (V, E, Y). As shown in Fig. 2(a), the node set V = ∪³_{l=1} V_l represents human parts at three different semantic levels, including the leaf nodes V₁ (i.e., the most fine-grained parts: head, arm, hand, etc.), which are typically considered in common human parsers, two middle-level nodes V₂ = {upper-body, lower-body}, and one root node V₃ = {full-body}¹. The edge set E ⊆ V × V represents the relations between human parts (nodes), i.e., the directed edge e = (u, v) ∈ E links node u to node v: u → v. Each node v and each edge (u, v) are associated with feature vectors h_v and h_{u,v}, respectively. y_v ∈ Y indicates the groundtruth segmentation map of part (node) v, and the groundtruth maps Y are also organized in a hierarchical manner: Y = ∪³_{l=1} Y_l.
Our human parser is trained in a graph learning scheme, using full supervision from existing human parsing datasets. For a test sample, it effectively infers the node and edge representations by reasoning about human structures at the level of individual parts and their relations, and by iteratively fusing information over the human structure.
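To make the graph concrete, here is a minimal sketch of G as plain Python data. The node names follow Fig. 2(a), but the exact node set and edge list are illustrative assumptions, since the hierarchy is configured per dataset:

```python
# Minimal sketch of the three-level human hierarchy G = (V, E).
# Node names and the exact edge list are illustrative assumptions.
V1 = ["head", "torso", "upper-arm", "lower-arm", "upper-leg", "lower-leg"]
V2 = ["upper-body", "lower-body"]
V3 = ["full-body"]
V = V1 + V2 + V3

children = {  # decomposition: parent -> constituents
    "full-body": ["upper-body", "lower-body"],
    "upper-body": ["head", "torso", "upper-arm", "lower-arm"],
    "lower-body": ["upper-leg", "lower-leg"],
}
siblings = {  # dependency: kinematically connected parts (symmetric)
    "head": ["torso"], "torso": ["head", "upper-arm", "upper-leg"],
    "upper-arm": ["torso", "lower-arm"], "lower-arm": ["upper-arm"],
    "upper-leg": ["torso", "lower-leg"], "lower-leg": ["upper-leg"],
}

# Directed, typed edge set E: (u, v, relation type)
E = [(u, v, "dec") for u, cs in children.items() for v in cs] \
  + [(v, u, "com") for u, cs in children.items() for v in cs] \
  + [(u, v, "dep") for u, vs in siblings.items() for v in vs]
```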
## 3.2. Structured Human Parsing Network
Node embedding: As an initial step, a learnable projection function is used to map the input image representation into node (part) features, in order to obtain sufficient expressive power. Formally, let us denote the input image feature as x ∈ R^{W×H×C}, which comes from a DeepLabV3 [6]-like backbone network (Fig. 2(b)), and the projection function as P: R^{W×H×C} ↦ R^{W×H×c×|V|}, where |V| indicates the number of nodes. The node embeddings {h_v ∈ R^{W×H×c}}_{v∈V} are initialized by (Fig. 2(d)):
$$\{h_v\}_{v\in\mathcal{V}} = P(x), \tag{1}$$
where each node embedding h_v is a (W, H, c)-dimensional tensor that encodes full spatial details (Fig. 2(c)).
¹ As in the classic settings of graph models, there is also a 'dummy' node in V, used for interpreting the background class. As it does not interact with the other semantic human parts (nodes), we omit this node for conceptual clarity.
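As a rough PyTorch sketch of the projection (our variable names; the layer choice follows §3.3, which implements P as a 3×3 convolution with ReLU), the backbone feature can be expanded into c·|V| channels and split into per-node embeddings:

```python
import torch
import torch.nn as nn

class NodeProjection(nn.Module):
    """P: (B, C, H, W) -> |V| node embeddings, each of shape (B, c, H, W)."""
    def __init__(self, C=256, c=64, num_nodes=9):  # num_nodes is illustrative
        super().__init__()
        self.c = c
        self.proj = nn.Sequential(
            nn.Conv2d(C, c * num_nodes, kernel_size=3, padding=1),
            nn.ReLU(inplace=True))

    def forward(self, x):
        h = self.proj(x)                      # (B, c*|V|, H, W)
        return torch.split(h, self.c, dim=1)  # tuple of |V| node tensors

x = torch.randn(2, 256, 60, 60)  # backbone feature at 1/8 input resolution
h_v = NodeProjection()(x)        # h_v[i]: embedding of the i-th node
```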
Typed human part relation modeling: Basically, an edge embedding h_{u,v} captures the relations between nodes u and v. Most previous structured human parsers [73, 60] work in an edge-type-agnostic manner, i.e., a unified, shared relation network R: R^{W×H×c} × R^{W×H×c} ↦ R^{W×H×c} is used to capture all the relations: h_{u,v} = R(h_u, h_v). Such a strategy may lose the discriminability of individual relation types and has no explicit bias towards modeling geometric and anatomical constraints. In contrast, we formulate h_{u,v} in a relation-typed manner R^r:
$$h_{u,v} = R^{r}\big(F^{r}(h_u),\, h_v\big), \tag{2}$$
where r ∈ {dec, com, dep}. F^r(·) is an attention-based relation-adaption operation, which is used to enhance the original node embedding h_u by addressing the geometric characteristics of relation r. The attention mechanism is favored here as it allows trainable and flexible feature enhancement and explicitly encodes specific relation constraints. From the view of the information diffusion mechanism in graph theory [53], if there exists an edge (u, v) that links a starting node u to a destination v, then v should receive incoming information (i.e., h_{u,v}) from u. Thus, we use F^r(·) to make h_u better accommodate the target v. R^r is edge-type-specific, employing the more tractable feature F^r(h_u) in place of h_u, so a more expressive relation feature h_{u,v} for v can be obtained, which further benefits the final parsing results. In this way, we learn more sophisticated and expressive relation patterns within human bodies.
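Concretely, following the implementation note in §3.3 (each R^r concatenates F^r(h_u) with h_v and applies a 3×3 convolution with ReLU), a minimal sketch with one separately parameterized network per relation type might look like this; the adaption F^r is left as identity here:

```python
import torch
import torch.nn as nn

class RelationNet(nn.Module):
    """R^r: (F^r(h_u), h_v) -> edge embedding h_{u,v}; one instance per r."""
    def __init__(self, c=64):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * c, c, kernel_size=3, padding=1),
            nn.ReLU(inplace=True))

    def forward(self, h_u_adapted, h_v):
        return self.fuse(torch.cat([h_u_adapted, h_v], dim=1))

# Three separately parameterized networks, one per relation type.
relation_nets = nn.ModuleDict({r: RelationNet() for r in ("dec", "com", "dep")})
h_u, h_v = torch.randn(2, 64, 60, 60), torch.randn(2, 64, 60, 60)
h_uv = relation_nets["dec"](h_u, h_v)  # F^dec(h_u) omitted (identity) here
```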
1) Decompositional relation modeling: Decompositional relations (solid arrows in Fig. 2(a)) are represented by the vertical edges from parent nodes to their corresponding child nodes in the human hierarchy G. For example, the parent node full-body can be separated into {upper-body, lower-body}, and upper-body can be decomposed into {head, torso, upper-arm, lower-arm}. Formally, for a node u, let us denote its child node set as C_u. Our decompositional relation network R^dec aims to learn the rule for 'breaking down' u into its constituent parts C_u (Fig. 3):
$$h_{u,v} = R^{dec}\big(F^{dec}(h_u),\, h_v\big), \quad F^{dec}(h_u) = att^{dec}_{u,v}(h_u) \circ h_u, \tag{3}$$
Figure 3: Illustration of our decompositional relation modeling. (a) Decompositional relations between the upper-body node (u) and its constituents (C_u). (b) With the decompositional attentions {att^dec_{u,v}(h_u)}_{v∈C_u}, F^dec learns how to 'break down' the upper-body node and generates more tractable features for its constituents. In the relation-adapted feature F^dec(h_u), the responses from the background and other irrelevant parts are suppressed.
where '∘' indicates the attention-based feature enhancement operation, and att^dec_{u,v}(h_u) ∈ [0,1]^{W×H} produces an attention map. For each sub-node v ∈ C_u of u, att^dec_{u,v}(h_u) is defined as:
$$att^{dec}_{u,v}(h_u) = \mathrm{PSM}\big(\big[\sigma^{dec}_{v}(h_u)\big]_{v\in\mathcal{C}_u}\big), \tag{4}$$
where PSM(·) stands for pixel-wise soft-max, '[·]' represents channel-wise concatenation, and σ^dec_v(h_u) ∈ R^{W×H} computes a specific significance map for v. By enforcing Σ_{v∈C_u} att^dec_{u,v} = 1, {att^dec_{u,v}(h_u)}_{v∈C_u} forms a decompositional attention mechanism, i.e., it allocates disparate attentions over h_u. To recap, the decompositional attention, conditioned on h_u, lets u pass separate high-level information to different child nodes C_u (see Fig. 3(b)). Here att^dec_{u,v}(·) is node-specific and separately learnt for the three entire nodes in V₂ ∪ V₃, namely full-body, upper-body, and lower-body; the subscript u,v is added to reflect this point. In addition, for each parent node u, the groundtruth maps Y_{C_u} = {y_v}_{v∈C_u} ∈ {0,1}^{W×H×|C_u|} of all the child nodes C_u can be used as supervision signals to train its decompositional attention {att^dec_{u,v}(h_u)}_{v∈C_u} ∈ [0,1]^{W×H×|C_u|}:
$$\mathcal{L}^{dec}_{u} = \mathcal{L}_{CE}\big(\{att^{dec}_{u,v}(h_u)\}_{v\in\mathcal{C}_u},\, \mathcal{Y}_{\mathcal{C}_u}\big), \tag{5}$$
where L_CE represents the standard cross-entropy loss.

2) Compositional relation modeling: In the human hierarchy G, compositional relations are represented by the vertical, upward edges from child to parent nodes. To address this type of relation, we design a compositional relation network R^com as (Fig. 4):
$$h_{u,v} = R^{com}\big(F^{com}(h_u),\, h_v\big), \quad F^{com}(h_u) = att^{com}_{v}\big([h_{u'}]_{u'\in\mathcal{C}_v}\big) \circ h_u, \tag{6}$$
Here att^com_v: R^{W×H×c×|C_v|} ↦ [0,1]^{W×H} is a compositional attention, implemented by a 1×1 convolutional layer. The rationale behind this design is that, for a parent node v, att^com_v gathers statistics from all the child nodes C_v and is used to enhance each sub-node feature h_u. As att^com_v is compositional in nature, its enhanced feature F^com(h_u) is
Figure 4: Illustration of our compositional relation modeling. (a) Compositional relations between the lower-body node (v) and its constituents (C_v). (b) The compositional attention att^com_v([h_{u'}, h_u]) gathers information from all the constituents C_v and lets F^com enhance all the lower-body related features of C_v.
more 'friendly' to the parent node v, compared to h_u. Thus, R^com is able to generate more expressive relation features by considering compositional structures (see Fig. 4(b)).
For each parent node v ∈ V₂ ∪ V₃, with its groundtruth map y_v ∈ {0,1}^{W×H}, the compositional attention over all its child nodes C_v is trained by minimizing the following loss:
$$\mathcal{L}^{com}_{v} = \mathcal{L}_{CE}\big(att^{com}_{v}\big([h_{u}]_{u\in\mathcal{C}_v}\big),\, y_v\big), \tag{7}$$
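As a minimal sketch of the two attentions defined above (1×1 convolutions, as stated; the sigmoid squashing of the single compositional map and all variable names are our assumptions):

```python
import torch
import torch.nn as nn

class DecompositionalAttention(nn.Module):
    """att^dec_{u,v}: per-child significance maps from h_u, normalized by a
    pixel-wise softmax so the attentions over all children sum to 1."""
    def __init__(self, c=64, num_children=4):
        super().__init__()
        self.sig = nn.Conv2d(c, num_children, kernel_size=1)  # sigma^dec_v

    def forward(self, h_u):
        return torch.softmax(self.sig(h_u), dim=1)  # (B, |C_u|, H, W)

class CompositionalAttention(nn.Module):
    """att^com_v: one map in [0,1] from the concatenated child features."""
    def __init__(self, c=64, num_children=2):
        super().__init__()
        self.att = nn.Conv2d(c * num_children, 1, kernel_size=1)

    def forward(self, child_feats):  # list of (B, c, H, W) tensors
        return torch.sigmoid(self.att(torch.cat(child_feats, dim=1)))

h_u = torch.randn(2, 64, 60, 60)
att_dec = DecompositionalAttention()(h_u)        # enhance h_u per child
att_com = CompositionalAttention()([h_u, h_u])   # enhance each child feature
```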
3) Dependency relation modeling: In G, dependency relations are represented as horizontal edges (dashed arrows in Fig. 2(a)), describing pairwise, kinematic connections between human parts, such as (head, torso), (upper-leg, lower-leg), etc. Two kinematically connected human parts are spatially adjacent, and their dependency relation essentially addresses context information. For a node u, with its kinematically connected siblings K_u, a dependency relation network R^dep is designed as (Fig. 5):
$$h_{u,v} = R^{dep}\big(F^{dep}(h_u),\, h_v\big), \quad F^{dep}(h_u) = att^{dep}_{u,v}\big(F^{cont}(h_u)\big) \circ F^{cont}(h_u), \tag{8}$$
where F^cont(h_u) ∈ R^{W×H×c} is used to extract the context of u, and att^dep_{u,v}(F^cont(h_u)) ∈ [0,1]^{W×H} is a dependency attention that produces an attention map for each sibling node v, conditioned on u's context F^cont(h_u). Specifically, inspired by non-local self-attention [55, 61], the context extraction module F^cont is designed as:
$$F^{cont}(h_u) = \varphi(xA), \quad A = \mathrm{softmax}\big({h'_u}^{\!\top} W x'\big) \in [0,1]^{WH\times WH}, \tag{9}$$
where h'_u ∈ R^{(c+8)×(WH)} and x' ∈ R^{(C+8)×(WH)} are node (part) and image representations augmented with spatial information, respectively, flattened into matrix format. The last eight channels of h'_u and x' encode spatial coordinate information [25], where the first six dimensions are the normalized horizontal and vertical positions, and the last two dimensions are the normalized width and height of the feature, i.e., 1/W and 1/H. W ∈ R^{(c+8)×(C+8)} is learned as a linear transformation based node-to-context projection
Figure 5: Illustration of our dependency relation modeling. (a) Dependency relations between the upper-body node (u) and its siblings (K_u). (b) The dependency attention {att^dep_{u,v}(F^cont(h_u))}_{v∈K_u}, derived from u's contextual information F^cont(h_u), gives separate importance to the different siblings K_u.
function. The node feature h'_u, used as a query, retrieves the reference image feature x' for its context information. As a result, the affinity matrix A stores the attention weight between the query and the reference at each spatial location, accounting for both visual and spatial information. Then, u's context is collected as a weighted sum of the original image feature x with the column-wise normalized weight matrix A: xA ∈ R^{C×(WH)}. A 1×1 convolution based linear embedding function φ: R^{W×H×C} ↦ R^{W×H×c} is applied for feature dimension compression, i.e., to make the channel dimensions of the different edge embeddings consistent.
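A sketch of this coordinate-augmented non-local context extraction, under the definitions above; the tiling of the six position channels is simplified, and the module and variable names are ours:

```python
import torch
import torch.nn as nn

def add_coords(f):
    """Append 8 spatial channels (simplified: tiled normalized coordinates
    for the first six, then 1/W and 1/H for the last two)."""
    B, _, H, W = f.shape
    ys = torch.linspace(0, 1, H).view(1, 1, H, 1).expand(B, 3, H, W)
    xs = torch.linspace(0, 1, W).view(1, 1, 1, W).expand(B, 3, H, W)
    wh = f.new_tensor([1.0 / W, 1.0 / H]).view(1, 2, 1, 1).expand(B, 2, H, W)
    return torch.cat([f, ys, xs, wh], dim=1)  # (B, ch+8, H, W)

class ContextExtractor(nn.Module):
    """F^cont: query node feature h_u against image feature x (non-local)."""
    def __init__(self, c=64, C=256):
        super().__init__()
        self.W = nn.Linear(C + 8, c + 8, bias=False)  # node-to-context proj.
        self.phi = nn.Conv2d(C, c, kernel_size=1)     # channel compression

    def forward(self, h_u, x):
        B, C, H, W = x.shape
        q = add_coords(h_u).flatten(2)                        # (B, c+8, HW)
        r = self.W(add_coords(x).flatten(2).transpose(1, 2))  # (B, HW, c+8)
        A = torch.softmax(r @ q, dim=1)       # (B, HW, HW), column-normalized
        ctx = (x.flatten(2) @ A).view(B, C, H, W)             # x A
        return self.phi(ctx)                                  # (B, c, H, W)
```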
For each sibling node v ∈ K_u of u, att^dep_{u,v} is defined as:
$$att^{dep}_{u,v}\big(F^{cont}(h_u)\big) = \mathrm{PSM}\big(\big[\sigma^{dep}_{v}\big(F^{cont}(h_u)\big)\big]_{v\in\mathcal{K}_u}\big), \tag{10}$$
Here σ^dep_v(·) ∈ R^{W×H} gives an importance map for v, using a 1×1 convolutional layer. Through the pixel-wise soft-max operation PSM(·), we enforce Σ_{v∈K_u} att^dep_{u,v} = 1, leading to a dependency attention mechanism that assigns exclusive attentions over F^cont(h_u) to the corresponding sibling nodes K_u. Such a dependency attention is learned via:
$$\mathcal{L}^{dep}_{u} = \mathcal{L}_{CE}\big(\{att^{dep}_{u,v}(F^{cont}(h_u))\}_{v\in\mathcal{K}_u},\, \mathcal{Y}_{\mathcal{K}_u}\big), \tag{11}$$
where Y_{K_u} ∈ {0,1}^{W×H×|K_u|} stands for the groundtruth maps {y_v}_{v∈K_u} of all the sibling nodes K_u of u.
Iterative inference over the human hierarchy: Human bodies present a hierarchical structure. According to graph theory, approximate inference algorithms should be used for such a loopy structure G. However, previous structured human parsers directly produce the final node representation h_v by either simply accounting for the information from the parent node u [73]: h_v ← R(h_u, h_v), where v ∈ C_u; or from its neighbors N_v [60]: h_v ← Σ_{u∈N_v} R(h_u, h_v). They ignore the fact that, in such a structured setting, information is organized in a complex system. Iterative algorithms offer a more favorable solution, i.e., the node representation should be updated iteratively by aggregating the messages from its neighbors; after several iterations, the representation can approximate the optimal result [53]. In graph theory parlance, the iterative algorithm can be realized by a parametric message passing process, which is defined in terms of a message function M and a node update function U, and runs for T steps. For each node v, the message passing process recursively collects information (messages) m_v from the neighbors N_v to enrich the node embedding h_v:
$$m^{(t+1)}_{v} = \textstyle\sum_{u\in\mathcal{N}_v} M\big(h^{(t)}_{u},\, h^{(t)}_{v},\, h_{u,v}\big), \quad h^{(t+1)}_{v} = U\big(h^{(t)}_{v},\, m^{(t+1)}_{v}\big), \tag{12}$$
where h^{(t)}_v stands for v's state in the t-th iteration. Recurrent neural networks are typically used to address the iterative nature of the update function U.
Inspired by previous message passing algorithms, our iterative algorithm is designed as (Fig. 2(e)-(f)):

$$m^{(t+1)}_{v} = \textstyle\sum_{r\in\{dec,com,dep\}} \sum_{u\in\mathcal{N}^{r}_{v}} R^{r}\big(F^{r}(h^{(t)}_{u}),\, h^{(t)}_{v}\big), \tag{13}$$

$$h^{(t+1)}_{v} = U_{convGRU}\big(h^{(t)}_{v},\, m^{(t+1)}_{v}\big), \tag{14}$$
where the initial state h^{(0)}_v is obtained by Eq. 1, and N^r_v denotes v's neighbors under relation type r. Here, the message aggregation step (Eq. 13) is achieved by per-edge relation function terms, i.e., node v updates its state h_v by absorbing all the incoming information along the different relations. As for the update function U in Eq. 14, we use a convGRU [54], which replaces the fully-connected units in the original MLP-based GRU with convolution operations, to describe its repeated activation behavior and to address the pixel-wise nature of human parsing simultaneously. Compared to previous parsers, which are typically based on feed-forward architectures, our message-passing inference essentially provides a feed-back mechanism, encouraging effective reasoning over the cyclic human hierarchy G.
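One step of the relation-typed message passing (Eqs. 13-14) might be sketched as follows; `edges`, `relation_nets`, and `adapt` stand for the typed edge list, R^r, and F^r from the earlier sketches, and the convGRU cell is written out with the standard GRU gate layout since PyTorch has no built-in one:

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """GRU update with 3x3 convolutions in place of fully-connected units."""
    def __init__(self, c=64):
        super().__init__()
        self.gates = nn.Conv2d(2 * c, 2 * c, 3, padding=1)  # reset/update
        self.cand = nn.Conv2d(2 * c, c, 3, padding=1)       # candidate state

    def forward(self, h, m):
        r, z = torch.sigmoid(self.gates(torch.cat([h, m], 1))).chunk(2, dim=1)
        n = torch.tanh(self.cand(torch.cat([r * h, m], 1)))
        return (1 - z) * h + z * n

def message_passing_step(h, edges, relation_nets, adapt, gru):
    """h: {node: (B,c,H,W)}; edges: list of (u, v, r). Returns new states."""
    msg = {v: torch.zeros_like(f) for v, f in h.items()}
    for u, v, r in edges:                     # Eq. 13: typed aggregation
        msg[v] = msg[v] + relation_nets[r](adapt[r](h[u]), h[v])
    return {v: gru(h[v], msg[v]) for v in h}  # Eq. 14: convGRU update
```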
Loss function: At each step t, to obtain the predictions Ŷ^{(t)}_l = {ŷ^{(t)}_v ∈ [0,1]^{W×H}}_{v∈V_l} for the l-th level nodes V_l, we apply a convolutional readout function O: R^{W×H×c} ↦ R^{W×H} over {h^{(t)}_v}_{v∈V} (Fig. 2(g)), followed by a pixel-wise soft-max (PSM) for normalization:
$$\hat{y}^{(t)}_{v} = \mathrm{PSM}\big(O(h^{(t)}_{v})\big), \tag{15}$$
Given the hierarchical human parsing results {Ŷ^{(t)}_l}³_{l=1} and the corresponding groundtruths {Y_l}³_{l=1}, the learning task of the iterative inference can be posed as the minimization of the following loss (Fig. 2(h)):
$$\mathcal{L}_{parsing} = \textstyle\sum_{t=1}^{T} \sum_{l=1}^{3} \sum_{v\in\mathcal{V}_l} \mathcal{L}_{CE}\big(\hat{y}^{(t)}_{v},\, y_{v}\big), \tag{16}$$
Considering Eqs. 5, 7, 11, and 16, the overall loss is defined as:
$$\mathcal{L} = \mathcal{L}_{parsing} + \alpha\big(\textstyle\sum_{u} \mathcal{L}^{dec}_{u} + \sum_{v} \mathcal{L}^{com}_{v} + \sum_{u} \mathcal{L}^{dep}_{u}\big), \tag{17}$$
where the coefficient α is empirically set to 0.1. We set the total number of inference steps T = 2 and study how the performance changes with the number of inference iterations in §4.3.
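A compact sketch of this deeply supervised objective (Eqs. 16-17), assuming per-level label maps and precomputed attention losses; the tensor layouts are our assumptions:

```python
import torch.nn.functional as F

def total_loss(logits, targets, att_losses, alpha=0.1):
    """logits[t][l]: (B, |V_l|, H, W) readout at step t, level l;
    targets[l]: (B, H, W) integer part labels at level l;
    att_losses: precomputed L^dec / L^com / L^dep terms (Eqs. 5, 7, 11)."""
    parsing = sum(F.cross_entropy(logits[t][l], targets[l])
                  for t in range(len(logits)) for l in range(3))  # Eq. 16
    return parsing + alpha * sum(att_losses)                      # Eq. 17
```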
## 3.3. Implementation Details
Node embedding: A DeepLabV3 network [6] serves as the backbone architecture, producing a 256-channel image representation whose spatial dimensions are 1/8 those of the input image. The projection function P: R^{W×H×C} ↦ R^{W×H×c×|V|} in Eq. 1 is implemented by a 3×3 convolutional layer with ReLU nonlinearity, where C = 256 and |V| (i.e., the number of nodes) is set according to the settings of the different human parsing datasets. We set the channel size of the node features to c = 64 to maintain high computational efficiency.
Relation networks: Each typed relation network R^r in Eq. 2 concatenates the relation-adapted feature F^r(h_u) from the source node u with the destination node v's feature h_v as input, and outputs the relation representation: h_{u,v} = R^r([F^r(h_u), h_v]). R^r: R^{W×H×2c} ↦ R^{W×H×c} is implemented by a 3×3 convolutional layer with ReLU nonlinearity.
Iterative inference: In Eq. 14, the update function U_convGRU is implemented by a convolutional GRU with 3×3 convolution kernels. The readout function O in Eq. 15 applies a 1×1 convolution for the feature-to-prediction projection. In addition, before feeding a node feature h^{(t)}_v into O, we use a light-weight decoder (built on the principle of upsampling the node feature and merging it with a low-level feature of the backbone network) that outputs the segmentation mask at 1/4 the spatial resolution of the input image.
As seen, all the units of our parser are built on convolution operations, leading to spatial information preservation.
## 4. Experiments
## 4.1. Experimental Settings
Datasets:² Five standard benchmark datasets [22, 63, 44, 31, 45] are used for performance evaluation. LIP [22] contains 50,462 single-person images, which are collected from realistic scenarios and divided into 30,462 images for training, 10,000 for validation, and 10,000 for test. The pixel-wise annotations cover 19 human part categories (e.g., face, left-/right-arms, left-/right-legs, etc.). PASCAL-Person-Part [63] includes 3,533 multi-person images with challenging poses and viewpoints. Each image is pixel-wise annotated with six classes (i.e., head, torso, upper-/lower-arms, and upper-/lower-legs). It is split into 1,716 and 1,817 images for training and test. ATR [31] is a challenging human parsing dataset, which has 7,700 single-person images with dense annotations over 17 categories (e.g., face, upper-clothes, left-/right-arms, left-/right-legs, etc.). There are 6,000, 700, and 1,000 images for training, validation, and test, respectively. PPSS [44] is a collection of 3,673 single-pedestrian images from 171 surveillance videos and provides pixel-wise annotations for hair, face, upper-/lower-clothes, arm, and leg. It presents diverse real-world
² As the datasets provide different human part labels, we make proper modifications to our human hierarchy. Labels that do not convey human structure, such as hat and sun-glasses, are treated as isolated nodes.
Table 1: Comparison of pixel accuracy, mean accuracy and mIoU on LIP val [22]. † indicates extra pose information used.

| Method | pixAcc. | Mean Acc. | Mean IoU |
|----------------------|-----------|-------------|------------|
| SegNet [1] | 69.04 | 24.00 | 18.17 |
| FCN-8s [41] | 76.06 | 36.75 | 28.29 |
| DeepLabV2 [4] | 82.66 | 51.64 | 41.64 |
| Attention [5] | 83.43 | 54.39 | 42.92 |
| † Attention+SSL [22] | 84.36 | 54.94 | 44.73 |
| DeepLabV3+ [6] | 84.09 | 55.62 | 44.80 |
| ASN [43] | - | - | 45.41 |
| † SSL [22] | - | - | 46.19 |
| MMAN [46] | 85.24 | 57.60 | 46.93 |
| † SS-NAN [71] | 87.59 | 56.03 | 47.92 |
| HSP-PRI [26] | 85.07 | 60.54 | 48.16 |
| † MuLA [47] | 88.50 | 60.50 | 49.30 |
| PSPNet [69] | 86.23 | 61.33 | 50.56 |
| CE2P [39] | 87.37 | 63.20 | 53.10 |
| BraidNet [40] | 87.60 | 66.09 | 54.42 |
| CNIF [60] | 88.03 | 68.80 | 57.74 |
| Ours | 89.05 | 70.58 | 59.25 |
challenges, e.g., pose variations, illumination changes, and occlusions. There are 1,781 and 1,892 images for training and testing, respectively. Fashion Clothing [45] has 4,371 images gathered from Colorful Fashion Parsing [35], Fashionista [67], and Clothing Co-Parsing [68]. It covers 17 clothing categories (e.g., hair, pants, shoes, upper-clothes, etc.), with 3,934 images for training and 437 for test.
Training: ResNet101 [24], pre-trained on ImageNet [52], is used to initialize our DeepLabV3 [6] backbone. The remaining layers are randomly initialized. We train our model on each of the five aforementioned datasets with their respective training samples, separately. Following common practice [39, 21, 60], we randomly augment each training sample with a scaling factor in [0.5, 2.0], a crop size of 473×473, and horizontal flipping. For optimization, we use the standard SGD solver, with a momentum of 0.9 and a weight decay of 0.0005. To schedule the learning rate, we employ the polynomial annealing procedure [4, 69], where the learning rate is multiplied by (1 − iter/total_iter)^power with power = 0.9.
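The polynomial annealing schedule in code (the base learning rate shown in the usage comment is a placeholder, not the paper's value):

```python
def poly_lr(base_lr, it, total_it, power=0.9):
    """Polynomial annealing: lr = base_lr * (1 - iter/total_iter)^power."""
    return base_lr * (1.0 - float(it) / total_it) ** power

# e.g., update before every iteration (base_lr is a placeholder):
# for g in optimizer.param_groups:
#     g["lr"] = poly_lr(base_lr, it, total_it)
```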
Testing: For each test sample, we set the long side of the image to 473 pixels and maintain the original aspect ratio. As in [69, 47], we average the parsing results over a five-scale image pyramid with horizontal flipping, i.e., scaling factors from 0.5 to 1.5 in intervals of 0.25.
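A sketch of this multi-scale, flipped testing procedure; the `model` returning per-pixel logits is an assumption, and the left/right label-channel swap that flipping strictly requires is omitted for brevity:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def multi_scale_test(model, image, scales=(0.5, 0.75, 1.0, 1.25, 1.5)):
    """Average parsing logits over scaled and horizontally flipped inputs."""
    _, _, H, W = image.shape
    avg = 0
    for s in scales:
        x = F.interpolate(image, scale_factor=s, mode="bilinear",
                          align_corners=False)
        for flip in (False, True):
            xi = torch.flip(x, dims=[3]) if flip else x
            logits = model(xi)                              # (B, K, h, w)
            if flip:
                logits = torch.flip(logits, dims=[3])       # undo the flip
            avg = avg + F.interpolate(logits, size=(H, W), mode="bilinear",
                                      align_corners=False)
    return avg / (2 * len(scales))
```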
Reproducibility: Our method is implemented in PyTorch and trained on four NVIDIA Tesla V100 GPUs (32GB memory per card). All the experiments are performed on one NVIDIA TITAN Xp GPU with 12GB memory. To provide full details of our approach, our code will be made publicly available.
Evaluation: For fair comparison, we follow the official evaluation protocols of each dataset. For LIP, following [71], we report pixel accuracy, mean accuracy and mean Intersection-over-Union (mIoU). For PASCAL-Person-Part and PPSS, following [62, 63, 46], the performance is evaluated in terms of mIoU. For ATR and Fashion Clothing, as
| Method | Head | Torso | U-Arm | L-Arm | U-Leg | L-Leg | B.G. | Ave. |
|---------------------|--------|---------|---------|---------|---------|---------|--------|--------|
| HAZN [62] | 80.79 | 59.11 | 43.05 | 42.76 | 38.99 | 34.46 | 93.59 | 56.11 |
| Attention [5] | 81.47 | 59.06 | 44.15 | 42.50 | 38.28 | 35.62 | 93.65 | 56.39 |
| LG-LSTM [33] | 82.72 | 60.99 | 45.40 | 47.76 | 42.33 | 37.96 | 88.63 | 57.97 |
| Attention+SSL [22] | 83.26 | 62.40 | 47.80 | 45.58 | 42.32 | 39.48 | 94.68 | 59.36 |
| Attention+MMAN [46] | 82.58 | 62.83 | 48.49 | 47.37 | 42.80 | 40.40 | 94.92 | 59.91 |
| Graph LSTM [32] | 82.69 | 62.68 | 46.88 | 47.71 | 45.66 | 40.93 | 94.59 | 60.16 |
| SS-NAN [71] | 86.43 | 67.28 | 51.09 | 48.07 | 44.82 | 42.15 | 97.23 | 62.44 |
| Structure LSTM [30] | 82.89 | 67.15 | 51.42 | 48.72 | 51.72 | 45.91 | 97.18 | 63.57 |
| Joint [63] | 85.50 | 67.87 | 54.72 | 54.30 | 48.25 | 44.76 | 95.32 | 64.39 |
| DeepLabV2 [4] | - | - | - | - | - | - | - | 64.94 |
| MuLA [47] | 84.6 | 68.3 | 57.5 | 54.1 | 49.6 | 46.4 | 95.6 | 65.1 |
| PCNet [73] | 86.81 | 69.06 | 55.35 | 55.27 | 50.21 | 48.54 | 96.07 | 65.9 |
| Holistic [29] | 86.00 | 69.85 | 56.63 | 55.92 | 51.46 | 48.82 | 95.73 | 66.34 |
| WSHP [15] | 87.15 | 72.28 | 57.07 | 56.21 | 52.43 | 50.36 | 97.72 | 67.6 |
| DeepLabV3+ [6] | 87.02 | 72.02 | 60.37 | 57.36 | 53.54 | 48.52 | 96.07 | 67.84 |
| SPGNet [8] | 87.67 | 71.41 | 61.69 | 60.35 | 52.62 | 48.80 | 95.98 | 68.36 |
| PGN [21] | 90.89 | 75.12 | 55.83 | 64.61 | 55.42 | 41.57 | 95.33 | 68.4 |
| CNIF [60] | 88.02 | 72.91 | 64.31 | 63.52 | 55.61 | 54.96 | 96.02 | 70.76 |
| Ours | 89.73 | 75.22 | 66.87 | 66.21 | 58.69 | 58.17 | 96.94 | 73.12 |
Table 2: Per-class comparison of mIoU on PASCAL-Person-Part test [63].
| Method | pixAcc. | F.G. Acc. | Prec. | Recall | F-1 |
|----------------|-----------|-------------|---------|----------|-------|
| Yamaguchi [67] | 84.38 | 55.59 | 37.54 | 51.05 | 41.8 |
| Paperdoll [66] | 88.96 | 62.18 | 52.75 | 49.43 | 44.76 |
| M-CNN [36] | 89.57 | 73.98 | 64.56 | 65.17 | 62.81 |
| ATR [31] | 91.11 | 71.04 | 71.69 | 60.25 | 64.38 |
| DeepLabV2 [4] | 94.42 | 82.93 | 78.48 | 69.24 | 73.53 |
| PSPNet [69] | 95.2 | 80.23 | 79.66 | 73.79 | 75.84 |
| Attention [5] | 95.41 | 85.71 | 81.3 | 73.55 | 77.23 |
| DeepLabV3+ [6] | 95.96 | 83.04 | 80.41 | 78.79 | 79.49 |
| Co-CNN [34] | 96.02 | 83.57 | 84.95 | 77.66 | 80.14 |
| LG-LSTM [33] | 96.18 | 84.79 | 84.64 | 79.43 | 80.97 |
| TGPNet [45] | 96.45 | 87.91 | 83.36 | 80.22 | 81.76 |
| CNIF [60] | 96.26 | 87.91 | 84.62 | 86.41 | 85.51 |
| Ours | 96.84 | 89.23 | 86.17 | 88.35 | 87.25 |
Table 3: Comparison of accuracy, foreground accuracy, average precision, recall and F1-score on ATR test [31].
in [45, 60], we report pixel accuracy, foreground accuracy, average precision, average recall, and average F1-score.
## 4.2. Quantitative and Qualitative Results
LIP [22]: LIP is a gold-standard benchmark for human parsing. Table 1 reports comparison results against 16 state-of-the-art methods on LIP val. We first find that general semantic segmentation methods [1, 41, 4, 6] tend to perform worse than dedicated human parsers. This indicates the importance of reasoning about human structures in this problem. In addition, though recent human parsers obtain impressive results, our model still outperforms all competitors by a large margin. For instance, in terms of pixAcc., mean Acc., and mean IoU, our parser surpasses the best performing method, CNIF [60], by 1.02%, 1.78%, and 1.51%, respectively. We would also like to emphasize that our parser does not use additional pose [22, 71, 47] or edge [39] information.
PASCAL-Person-Part [63]: In Table 2, we compare our method against 18 recent methods on PASCAL-Person-Part test using the IoU score. From the results, we can again see that our approach outperforms all other methods; specifically, 73.12% vs. 70.76%
| Method | pixAcc. | F.G. Acc. | Prec. | Recall | F-1 |
|----------------|-----------|-------------|---------|----------|-------|
| Yamaguchi [67] | 81.32 | 32.24 | 23.74 | 23.68 | 22.67 |
| Paperdoll [66] | 87.17 | 50.59 | 45.8 | 34.2 | 35.13 |
| DeepLabV2 [4] | 87.68 | 56.08 | 35.35 | 39 | 37.09 |
| Attention [5] | 90.58 | 64.47 | 47.11 | 50.35 | 48.68 |
| TGPNet [45] | 91.25 | 66.37 | 50.71 | 53.18 | 51.92 |
| CNIF [60] | 92.2 | 68.59 | 56.84 | 59.47 | 58.12 |
| Ours | 93.12 | 70.57 | 58.73 | 61.72 | 60.19 |
Table 4: Comparison of pixel accuracy, foreground pixel accuracy, average precision, average recall, and average F1-score on Fashion Clothing test [45].
| Method | Head | Face | U-Cloth | Arms | L-Cloth | Legs | B.G. | Ave. |
|-----------|--------|--------|-----------|--------|-----------|--------|--------|--------|
| DL [44] | 22 | 29.1 | 57.3 | 10.6 | 46.1 | 12.9 | 68.6 | 35.2 |
| DDN [44] | 35.5 | 44.1 | 68.4 | 17 | 61.7 | 23.8 | 80 | 47.2 |
| ASN [43] | 51.7 | 51 | 65.9 | 29.5 | 52.8 | 20.3 | 83.8 | 50.7 |
| MMAN[46] | 53.1 | 50.2 | 69 | 29.4 | 55.9 | 21.4 | 85.7 | 52.1 |
| LCPC [9] | 55.6 | 46.6 | 71.9 | 30.9 | 58.8 | 24.6 | 86.2 | 53.5 |
| CNIF [60] | 67.6 | 60.8 | 80.8 | 46.8 | 69.5 | 28.7 | 90.6 | 60.5 |
| Ours | 68.8 | 63.2 | 81.7 | 49.3 | 70.8 | 32 | 91.4 | 65.3 |
Table 5: Comparison of mIoU on PPSS test [44].
of CNIF [60] and 68.40% of PGN [21] in terms of mIoU. Such a performance gain is particularly impressive considering that further improvement on this dataset is very challenging.
ATR [31]: Table 3 presents comparisons with 14 previous methods on ATR test. Our approach sets a new state-of-the-art on all five metrics, outperforming all other methods by a large margin. For example, our parser provides a considerable performance gain in F-1 score, i.e., 1.74% and 5.49% higher than the current top-two performing methods, CNIF [60] and TGPNet [45], respectively.
Fashion Clothing [45]: The quantitative comparison results with six competitors on Fashion Clothing test are summarized in Table 4. Our model yields an F-1 score of 60.19%, while those for Attention [5], TGPNet [45], and CNIF [60] are 48.68%, 51.92%, and 58.12%, respectively. This again demonstrates our superior performance.
PPSS [44]: Table 5 compares our method against six famous methods on the PPSS test set. The evaluation results demonstrate that our human parser achieves 65.3% mIoU, with substantial gains over the second best, CNIF [60], and third best, LCPC [9], of 4.8% and 11.8%, respectively.
Runtime comparison: As our parser does not require extra pre-/post-processing steps (e.g., the human pose used in [63], over-segmentation in [32, 30], and CRF in [63]), it achieves a high speed of 12 fps (on PASCAL-Person-Part), faster than most of its counterparts, such as Joint [63] (0.1 fps), Attention+SSL [22] (2.0 fps), MMAN [46] (3.5 fps), SS-NAN [71] (2.0 fps), and LG-LSTM [33] (3.0 fps).
Qualitative results: Some qualitative comparison results on PASCAL-Person-Part test are depicted in Fig. 6. We can see that our approach outputs more precise parsing results than the other competitors [6, 21, 71, 60], despite rare poses (2nd row) and occlusions (3rd row). In addition, with its better understanding of human structures,
Figure 6: Visual comparison on PASCAL-Person-Part test. Our model (c) generates more accurate predictions, compared to other famous methods [6, 21, 71, 60] (d-g). Regions improved by our parser are highlighted with red boxes. Best viewed in color.
<details>
<summary>Image 6 Details</summary>

### Visual Description
## Image Series: Semantic Segmentation Comparison
### Overview
The image presents a series of comparisons of semantic segmentation results for three different scenes: people on a beach, a person with a dog, and a cyclist on a road. Each scene has an original image and six different segmentation outputs, labeled (b) through (g). The segmentation outputs are visually compared to a "Ground-truth" segmentation (b) and the proposed method "Ours" (c). The other methods are DeepLabV3+ [6], PGN [21], SS-NAN [71], and CNIF [60]. Each segmentation output highlights different objects with distinct colors, and a red bounding box is drawn around a specific region of interest in each image to facilitate comparison.
### Components/Axes
The image is organized into three rows, each representing a different scene. Each row contains seven columns:
1. **(a) Image:** The original input image.
2. **(b) Ground-truth:** The manually annotated, correct segmentation.
3. **(c) Ours:** The segmentation result of the proposed method.
4. **(d) DeepLabV3+ [6]:** Segmentation result of the DeepLabV3+ method.
5. **(e) PGN [21]:** Segmentation result of the PGN method.
6. **(f) SS-NAN [71]:** Segmentation result of the SS-NAN method.
7. **(g) CNIF [60]:** Segmentation result of the CNIF method.
There are no explicit axes or scales. The comparison is visual, based on the quality of the segmentation in each output.
### Detailed Analysis or Content Details
**Row 1: Beach Scene**
* **(a) Image:** Two people in wetsuits standing on a beach with boats in the background.
* **(b) Ground-truth:** Segmentation shows distinct regions for people (blue), wetsuits (various colors), water (green), sky (yellow), and boats (purple).
* **(c) Ours:** Segmentation is similar to ground truth, with good delineation of people and wetsuits.
* **(d) DeepLabV3+ [6]:** Segmentation is generally good, but some areas of the wetsuits are misclassified.
* **(e) PGN [21]:** Segmentation shows some blurring and misclassification in the wetsuit areas.
* **(f) SS-NAN [71]:** Segmentation is less accurate, with more misclassifications in the wetsuit and background.
* **(g) CNIF [60]:** Segmentation is similar to SS-NAN, with noticeable inaccuracies.
**Row 2: Person with Dog Scene**
* **(a) Image:** A person sitting on a couch with a dog.
* **(b) Ground-truth:** Segmentation shows distinct regions for the person (red), dog (yellow), couch (green), and background (purple).
* **(c) Ours:** Segmentation is accurate, closely matching the ground truth.
* **(d) DeepLabV3+ [6]:** Segmentation is good, but some areas of the dog are misclassified.
* **(e) PGN [21]:** Segmentation shows some blurring and misclassification in the dog and couch areas.
* **(f) SS-NAN [71]:** Segmentation is less accurate, with more misclassifications in the dog and couch.
* **(g) CNIF [60]:** Segmentation is similar to SS-NAN, with noticeable inaccuracies.
**Row 3: Cyclist Scene**
* **(a) Image:** A cyclist riding on a road.
* **(b) Ground-truth:** Segmentation shows distinct regions for the cyclist (blue), bicycle (green), road (gray), and background (purple).
* **(c) Ours:** Segmentation is accurate, closely matching the ground truth.
* **(d) DeepLabV3+ [6]:** Segmentation is good, but some areas of the bicycle are misclassified.
* **(e) PGN [21]:** Segmentation shows some blurring and misclassification in the bicycle and road areas.
* **(f) SS-NAN [71]:** Segmentation is less accurate, with more misclassifications in the bicycle and road.
* **(g) CNIF [60]:** Segmentation is similar to SS-NAN, with noticeable inaccuracies.
### Key Observations
* The proposed method "Ours" consistently produces segmentation results that are very close to the ground truth in all three scenes.
* DeepLabV3+ [6] generally performs well, but exhibits some misclassifications, particularly in complex regions like the wetsuits and the dog.
* PGN [21], SS-NAN [71], and CNIF [60] consistently show lower segmentation accuracy compared to "Ours" and DeepLabV3+ [6].
* The red bounding boxes highlight areas where the segmentation methods differ most from the ground truth, allowing for a focused comparison.
### Interpretation
The image demonstrates a comparative evaluation of different semantic segmentation methods. The "Ours" method appears to be the most accurate, consistently providing segmentation results that closely match the ground truth, which suggests it is effective at accurately identifying and delineating body parts in complex scenes. The other methods, while capable of producing reasonable segmentations, exhibit more errors and inaccuracies, particularly in areas with intricate details or challenging lighting conditions. The visual comparison, facilitated by the red bounding boxes, allows for a quick and intuitive assessment of the strengths and weaknesses of each method.
</details>
our parser produces more robust results and eliminates interference from the background (1st row). The last row gives a challenging case, in which our parser still correctly recognizes the confusing parts of the person in the middle.
Overall, our human parser attains strong performance across all five datasets. We believe this is due to our typed relation modeling and iterative inference algorithm, which together enable more tractable part-feature learning and better approximations.
## 4.3. Diagnostic Experiments
To demonstrate how each component in our parser contributes to the overall performance, a series of ablation experiments is conducted on PASCAL-Person-Part test.
Type-specific relation modeling: We first investigate the necessity of comprehensively exploring different relations and discuss the effectiveness of our type-specific relation modeling strategy. Concretely, we study six variant models, as listed in Table 6: (1) 'Baseline' denotes the approach that only uses the initial node embeddings $\{h_v^{(0)}\}_{v\in\mathcal{V}}$, without any relation information; (2) 'Type-agnostic' shows the performance when modeling different human part relations in a type-agnostic manner: $h_{u,v} = \mathcal{R}([h_u, h_v])$; (3) 'Type-specific w/o $F_r$' gives the performance without the relation-adaptation operation $F_r$ in Eq. 2: $h_{u,v} = \mathcal{R}_r([h_u, h_v])$; (4-6) 'Decomposition relation', 'Composition relation', and 'Dependency relation' are three variants that each consider only one of the three relation categories, using our type-specific relation modeling strategy (Eq. 2). Four main conclusions can be drawn: (1) Structural information is essential for human parsing, as all the structured models outperform 'Baseline'. (2) Typed relation modeling leads to more effective human structure learning, as 'Type-specific w/o $F_r$' improves over 'Type-agnostic' by 1.28%. (3) Exploring different kinds of relations is meaningful, as the variants using individual relation types outperform 'Baseline', and our full model, which considers all three kinds of relations, achieves the best performance. (4) Encoding relation-specific constraints helps with relation pattern learning, as our full model is better than the one without
Table 6: Ablation study (§4.3) on PASCAL-Person-Part test.
| Component | Module | mIoU | Δ mIoU | time (ms) |
|-----------------------|---------------------------|-------|--------|-----------|
| Reference | Full model (2 iterations) | 73.12 | - | 81 |
| Relation modeling | Baseline | 68.84 | -4.28 | 46 |
| | Type-agnostic | 70.37 | -2.75 | 55 |
| | Type-specific w/o $F_r$ | 71.65 | -1.47 | 55 |
| | Decomposition relation | 71.38 | -1.74 | 50 |
| | Composition relation | 69.35 | -3.77 | 49 |
| | Dependency relation | 69.43 | -3.69 | 52 |
| Iterative inference T | 0 iterations | 68.84 | -4.28 | 46 |
| | 1 iteration | 72.17 | -0.95 | 59 |
| | 3 iterations | 73.19 | +0.07 | 93 |
| | 4 iterations | 73.22 | +0.10 | 105 |
| | 5 iterations | 73.23 | +0.11 | 116 |
relation adaptation, i.e., 'Type-specific w/o $F_r$' (see the sketch below).
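The following PyTorch sketch illustrates the relation-modeling variants compared in Table 6. The module names, the 1×1-convolution parameterization, and the stand-in for the relation-adaptation operation $F_r$ are all illustrative assumptions; this is not the paper's implementation of Eq. 2.

```python
import torch
import torch.nn as nn

class TypeAgnostic(nn.Module):
    """'Type-agnostic' variant: one shared relation network R for all
    edges, h_{u,v} = R([h_u, h_v])."""
    def __init__(self, dim: int):
        super().__init__()
        self.R = nn.Conv2d(2 * dim, dim, kernel_size=1)

    def forward(self, h_u, h_v, r=None):      # relation type r is ignored
        return self.R(torch.cat([h_u, h_v], dim=1))

class TypeSpecific(nn.Module):
    """Type-specific variants: one relation network R_r per relation type,
    optionally preceded by a relation-adaptation transform standing in for
    F_r (use_Fr=True gives the full-model setting, False the 'w/o F_r' row)."""
    def __init__(self, dim: int, relations=('decomp', 'comp', 'dep'), use_Fr=True):
        super().__init__()
        self.R = nn.ModuleDict({r: nn.Conv2d(2 * dim, dim, 1) for r in relations})
        self.adapt = (nn.ModuleDict({r: nn.Conv2d(dim, dim, 1) for r in relations})
                      if use_Fr else None)

    def forward(self, h_u, h_v, r: str):
        if self.adapt is not None:             # relation-specific adaptation
            h_u, h_v = self.adapt[r](h_u), self.adapt[r](h_v)
        return self.R[r](torch.cat([h_u, h_v], dim=1))
```

Restricting the edge set to a single relation type recovers the 'Decomposition/Composition/Dependency relation' rows of Table 6.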
Iterative inference: Table 6 also reports the performance of our parser with respect to the iteration step $t$ in Eq. 13 and Eq. 14. Note that when $t=0$, only the initial node features are used. It can be observed that setting $T=2$ or $T=3$ provides a consistent accuracy boost of 4∼5%, on average, over $T=0$; however, increasing $T$ beyond 3 yields only marginal gains (around 0.1%). Accordingly, we choose $T=2$ as a better trade-off between accuracy and computation time.
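Schematically, the $T$-step inference amounts to a small loop of typed message passing over the loopy hierarchy. In the sketch below, `relation_net` and `update` are hypothetical placeholders for the typed relation networks above and the node-update rule of Eq. 13/14, which are not reproduced here.

```python
def iterative_inference(h0, graph, relation_net, update, T=2):
    """h0: dict mapping each node to its initial embedding.
    graph: list of directed, typed edges (u, v, r).
    T = 0 simply returns the initial embeddings (the '0 iterations' row)."""
    h = dict(h0)
    for _ in range(T):
        # collect typed messages along every edge
        msgs = {v: [] for v in h}
        for u, v, r in graph:
            msgs[v].append(relation_net(h[u], h[v], r))
        # update each node from its aggregated incoming messages;
        # `update` is assumed to return h[v] unchanged when msgs[v] is empty
        h = {v: update(h[v], msgs[v]) for v in h}
    return h
```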
## 5. Conclusion
In the human semantic parsing task, structure modeling is an essential, albeit inherently difficult, avenue to explore. This work proposed a hierarchical human parser that addresses this issue in two respects. First, three distinct relation networks are designed to precisely describe the compositional and decompositional relations between constituent and entire parts, and to aid dependency learning over kinematically connected parts. Second, to address inference over the loopy human structure, our parser relies on a convolutional, message-passing-based approximation algorithm, which enjoys the advantages of iterative optimization and spatial information preservation. These designs enable strong performance across five widely adopted benchmark datasets, consistently outperforming all other competitors.
## References
- [1] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE TPAMI , 39(12):2481-2495, 2017.
- [2] Yihang Bo and Charless C Fowlkes. Shape-based pedestrian parsing. In CVPR , 2011.
- [3] Hong Chen, Zi Jian Xu, Zi Qiang Liu, and Song Chun Zhu. Composite templates for cloth modeling and sketching. In CVPR , 2006.
- [4] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE TPAMI , 40(4):834-848, 2018.
- [5] Liang-Chieh Chen, Yi Yang, Jiang Wang, Wei Xu, and Alan L Yuille. Attention to scale: Scale-aware semantic image segmentation. In CVPR , 2016.
- [6] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV , 2018.
- [7] Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, and Alan Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In CVPR , 2014.
- [8] Bowen Cheng, Liang-Chieh Chen, Yunchao Wei, Yukun Zhu, Zilong Huang, Jinjun Xiong, Thomas Huang, Wen-Mei Hwu, and Honghui Shi. Spgnet: Semantic prediction guidance for scene parsing. In ICCV , 2019.
- [9] Kang Dang and Junsong Yuan. Location constrained pixel classifiers for image parsing with regular spatial layout. In BMVC , 2014.
- [10] Jian Dong, Qiang Chen, Xiaohui Shen, Jianchao Yang, and Shuicheng Yan. Towards unified human parsing and pose estimation. In CVPR , 2014.
- [11] Jian Dong, Qiang Chen, Wei Xia, Zhongyang Huang, and Shuicheng Yan. A deformable mixture parsing model with parselets. In ICCV , 2013.
- [12] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In NIPS , 2015.
- [13] S Eslami and Christopher Williams. A generative model for parts-based object segmentation. In NIPS , 2012.
- [14] Lifeng Fan, Wenguan Wang, Siyuan Huang, Xinyu Tang, and Song-Chun Zhu. Understanding human gaze communication by spatio-temporal graph reasoning. In ICCV , 2019.
- [15] Hao-Shu Fang, Guansong Lu, Xiaolin Fang, Jianwen Xie, Yu-Wing Tai, and Cewu Lu. Weakly and semi supervised human body part parsing via pose-guided knowledge transfer. In CVPR , 2018.
- [16] Hao-Shu Fang, Chenxi Wang, Minghao Gou, and Cewu Lu. Graspnet: A large-scale clustered and densely annotated dataset for object grasping. arXiv preprint arXiv:1912.13470 , 2019.
- [17] Hao-Shu Fang, Shuqin Xie, Yu-Wing Tai, and Cewu Lu. Rmpe: Regional multi-person pose estimation. In ICCV , 2017.
- [18] Hao-Shu Fang, Yuanlu Xu, Wenguan Wang, Xiaobai Liu, and Song-Chun Zhu. Learning pose grammar to encode human body configuration for 3d pose estimation. In AAAI , 2018.
- [19] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In ICML , 2017.
- [20] Ke Gong, Yiming Gao, Xiaodan Liang, Xiaohui Shen, Meng Wang, and Liang Lin. Graphonomy: Universal human parsing via graph transfer learning. In CVPR , 2019.
- [21] Ke Gong, Xiaodan Liang, Yicheng Li, Yimin Chen, Ming Yang, and Liang Lin. Instance-level human parsing via part grouping network. In ECCV , 2018.
- [22] Ke Gong, Xiaodan Liang, Dongyu Zhang, Xiaohui Shen, and Liang Lin. Look into person: Self-supervised structuresensitive learning and a new benchmark for human parsing. In CVPR , 2017.
- [23] William L Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584 , 2017.
- [24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR , 2016.
- [25] Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. Segmentation from natural language expressions. In ECCV , 2016.
- [26] Mahdi M. Kalayeh, Emrah Basaran, Muhittin Gokmen, Mustafa E. Kamasak, and Mubarak Shah. Human semantic parsing for person re-identification. In CVPR , 2018.
- [27] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR , 2017.
- [28] Lubor Ladicky, Philip HS Torr, and Andrew Zisserman. Human pose estimation using a joint pixel-wise and part-wise formulation. In CVPR , 2013.
- [29] Qizhu Li, Anurag Arnab, and Philip HS Torr. Holistic, instance-level human parsing. In BMVC , 2017.
- [30] Xiaodan Liang, Liang Lin, Xiaohui Shen, Jiashi Feng, Shuicheng Yan, and Eric P Xing. Interpretable structureevolving lstm. In CVPR , 2017.
- [31] Xiaodan Liang, Si Liu, Xiaohui Shen, Jianchao Yang, Luoqi Liu, Jian Dong, Liang Lin, and Shuicheng Yan. Deep human parsing with active template regression. IEEE TPAMI , 37(12):2402-2414, 2015.
- [32] Xiaodan Liang, Xiaohui Shen, Jiashi Feng, Liang Lin, and Shuicheng Yan. Semantic object parsing with graph lstm. In ECCV , 2016.
- [33] Xiaodan Liang, Xiaohui Shen, Donglai Xiang, Jiashi Feng, Liang Lin, and Shuicheng Yan. Semantic object parsing with local-global long short-term memory. In CVPR , 2016.
- [34] Xiaodan Liang, Chunyan Xu, Xiaohui Shen, Jianchao Yang, Si Liu, Jinhui Tang, Liang Lin, and Shuicheng Yan. Human parsing with contextualized convolutional neural network. In ICCV , 2015.
- [35] Si Liu, Jiashi Feng, Csaba Domokos, Hui Xu, Junshi Huang, Zhenzhen Hu, and Shuicheng Yan. Fashion parsing with weak color-category labels. TMM , 16(1):253-265, 2014.
- [36] Si Liu, Xiaodan Liang, Luoqi Liu, Xiaohui Shen, Jianchao Yang, Changsheng Xu, Liang Lin, Xiaochun Cao, and Shuicheng Yan. Matching-cnn meets knn: Quasi-parametric human parsing. In CVPR , 2015.
- [37] Si Liu, Yao Sun, Defa Zhu, Guanghui Ren, Yu Chen, Jiashi Feng, and Jizhong Han. Cross-domain human parsing via adversarial feature and label adaptation. In AAAI , 2018.
- [38] Si Liu, Changhu Wang, Ruihe Qian, Han Yu, Renda Bao, and Yao Sun. Surveillance video parsing with single frame supervision. In CVPR , 2017.
- [39] Ting Liu, Tao Ruan, Zilong Huang, Yunchao Wei, Shikui Wei, Yao Zhao, and Thomas Huang. Devil in the details: Towards accurate single and multiple human parsing. arXiv preprint arXiv:1809.05996 , 2018.
- [40] Xinchen Liu, Meng Zhang, Wu Liu, Jingkuan Song, and Tao Mei. Braidnet: Braiding semantics and details for accurate human parsing. In ACMMM , 2019.
- [41] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR , 2015.
- [42] Long Zhu, Yuanhao Chen, Yifei Lu, Chenxi Lin, and Alan Yuille. Max margin and/or graph learning for parsing the human body. In CVPR , 2008.
- [43] Pauline Luc, Camille Couprie, Soumith Chintala, and Jakob Verbeek. Semantic segmentation using adversarial networks. In NIPS-workshop , 2016.
- [44] Ping Luo, Xiaogang Wang, and Xiaoou Tang. Pedestrian parsing via deep decompositional network. In ICCV , 2013.
- [45] Xianghui Luo, Zhuo Su, Jiaming Guo, Gengwei Zhang, and Xiangjian He. Trusted guidance pyramid network for human parsing. In ACMMM , 2018.
- [46] Yawei Luo, Zhedong Zheng, Liang Zheng, Tao Guan, Junqing Yu, and Yi Yang. Macro-micro adversarial network for human parsing. In ECCV , 2018.
- [47] Xuecheng Nie, Jiashi Feng, and Shuicheng Yan. Mutual learning to adapt for joint human parsing and pose estimation. In ECCV , 2018.
- [48] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In ICML , 2016.
- [49] Seyoung Park, Bruce Xiaohan Nie, and Song-Chun Zhu. Attribute and-or grammar for joint parsing of human pose, parts and attributes. IEEE TPAMI , 40(7):1555-1569, 2018.
- [50] Siyuan Qi, Wenguan Wang, Baoxiong Jia, Jianbing Shen, and Song-Chun Zhu. Learning human-object interactions by graph parsing neural networks. In ECCV , 2018.
- [51] Ingmar Rauschert and Robert T Collins. A generative model for simultaneous estimation of human body shape and pixellevel segmentation. In ECCV , 2012.
- [52] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. Imagenet large scale visual recognition challenge. IJCV , 115(3):211-252, 2015.
- [53] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks , 20(1):61-80, 2008.
- [54] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In NIPS , 2015.
- [55] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS , 2017.
- [56] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In ICLR , 2018.
- [57] Nan Wang and Haizhou Ai. Who blocks who: Simultaneous clothing segmentation for grouping images. In ICCV , 2011.
- [58] Wenguan Wang, Xiankai Lu, Jianbing Shen, David J Crandall, and Ling Shao. Zero-shot video object segmentation via attentive graph neural networks. In ICCV , 2019.
- [59] Wenguan Wang, Yuanlu Xu, Jianbing Shen, and Song-Chun Zhu. Attentive fashion grammar network for fashion landmark detection and clothing category classification. In CVPR , 2018.
- [60] Wenguan Wang, Zhijie Zhang, Siyuan Qi, Jianbing Shen, Yanwei Pang, and Ling Shao. Learning compositional neural information fusion for human parsing. In ICCV , 2019.
- [61] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR , 2018.
- [62] Fangting Xia, Peng Wang, Liang-Chieh Chen, and Alan L Yuille. Zoom better to see clearer: Human and object parsing with hierarchical auto-zoom net. In ECCV , 2016.
- [63] Fangting Xia, Peng Wang, Xianjie Chen, and Alan L Yuille. Joint multi-person pose estimation and semantic part segmentation. In CVPR , 2017.
- [64] Fangting Xia, Jun Zhu, Peng Wang, and Alan L Yuille. Poseguided human parsing by an and/or graph using pose-context features. In AAAI , 2016.
- [65] Wenqiang Xu, Yonglu Li, and Cewu Lu. Srda: Generating instance segmentation annotation via scanning, reasoning and domain adaptation. In ECCV , 2018.
- [66] Kota Yamaguchi, M Hadi Kiapour, and Tamara L Berg. Paper doll parsing: Retrieving similar styles to parse clothing items. In ICCV , 2013.
- [67] Kota Yamaguchi, M Hadi Kiapour, Luis E Ortiz, and Tamara L Berg. Parsing clothing in fashion photographs. In CVPR , 2012.
- [68] Wei Yang, Ping Luo, and Liang Lin. Clothing co-parsing by joint image segmentation and labeling. In CVPR , 2014.
- [69] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR , 2017.
- [70] Jian Zhao, Jianshu Li, Yu Cheng, Terence Sim, Shuicheng Yan, and Jiashi Feng. Understanding humans in crowded scenes: Deep nested adversarial learning and a new benchmark for multi-human parsing. In ACMMM , 2018.
- [71] Jian Zhao, Jianshu Li, Xuecheng Nie, Fang Zhao, Yunpeng Chen, Zhecan Wang, Jiashi Feng, and Shuicheng Yan. Selfsupervised neural aggregation networks for human parsing. In CVPR-workshop , 2017.
- [72] Zilong Zheng, Wenguan Wang, Siyuan Qi, and Song-Chun Zhu. Reasoning visual dialogs with structural and partial observations. In CVPR , 2019.
- [73] Bingke Zhu, Yingying Chen, Ming Tang, and Jinqiao Wang. Progressive cognitive human parsing. In AAAI , 2018.