# A Generalist $\mathtt{FaceX}$ via Learning Unified Facial Representation
**Project Page**: https://diffusion-facex.github.io
Abstract
This work presents the $\mathtt{FaceX}$ framework, a novel facial generalist model capable of handling diverse facial tasks simultaneously. To achieve this goal, we initially formulate a unified facial representation for a broad spectrum of facial editing tasks, which macroscopically decomposes a face into fundamental identity, intra-personal variation, and environmental factors. Based on this, we introduce Facial Omni-Representation Decomposing (FORD) for seamless manipulation of various facial components, microscopically decomposing the core aspects of most facial editing tasks. Furthermore, leveraging the prior of a pretrained StableDiffusion (SD) to enhance generation quality and accelerate training, we design Facial Omni-Representation Steering (FORS) to first assemble unified facial representations and then effectively steer the SD-aware generation process via the efficient Facial Representation Controller (FRC). Our versatile $\mathtt{FaceX}$ achieves competitive performance compared to elaborate task-specific models on popular facial editing tasks. Code and models are available at https://github.com/diffusion-facex/FaceX.
<details>
<summary>2401.00551v1/x1.png Details</summary>

Collage of facial editing examples arranged around a central "FaceX" node, connected by dashed lines to eight task labels with before/after face images: Attr Swap, Face Swap, Head Swap, and Inpaint on the left; Pose Edit, Gaze Edit, Exp Edit, and Animate on the right.
</details>
Figure 1: The facial generalist $\mathtt{FaceX}$ handles diverse facial tasks, ranging from popular face/head swapping and motion-aware face reenactment/animation to semantic-aware attribute editing/inpainting, with one unified model, while achieving competitive performance, significantly advancing the research of general facial models.
1 Introduction
<details>
<summary>2401.00551v1/x2.png Details</summary>

Conceptual diagram of the universal facial representation: on the left, a face is decomposed into identity, intra-personal variation, and environmental factors; on the right, four tasks (Reenactment, Face Swapping, Head Swapping, Animation) each assemble the shared components (ID, Expression, Pose, Gaze, Facial Texture, Hair, Illumination, Background), with labels indicating whether each component comes from the source, the target, or is ignored.
</details>
Figure 2: Left: Proposed facial omni-representation equation that divides one face into a combination of different fine-grained attributes. Right: The attributes of the generated images under different tasks correspond to the decomposition of source and target facial attributes. Here, we analyze four representative facial tasks. For details of other facial tasks, please refer to our supplementary materials.
Facial editing encompasses both low-level tasks, e.g., facial inpainting [59] and domain stylization [10], and high-level tasks, e.g., region-aware face/head/attribute swapping [45, 24, 39, 25, 28] and motion-aware pose/gaze/expression control [64, 55, 49]. These tasks have extensive applications in various domains, including entertainment, social media, and security. The primary challenge in facial editing is to modify the target attributes while consistently preserving identity and the unaffected attributes. Notably, there is also a need for in-the-wild generalization to ensure practical applicability.
Previous GAN-based methods leverage the disentangled latent space of StyleGAN [18], enabling attribute manipulation by navigating within the latent space along suitable directions. Thanks to the powerful generative capabilities of Diffusion Models (DM), recent works have embraced this technique to enhance the quality of facial generation in various editing tasks. However, disentangling and controlling facial attributes using DM in a zero-shot manner remains an unresolved issue. For example, Face0 [43] enables one-shot identity insertion but struggles with attribute disentanglement. DiffusionRig [5] achieves pose/expression control via the physical DECA [9], but requires a time-consuming fine-tuning procedure for identity generalization. DiffTalk [38] relies on landmark-guided inpainting to keep other parts intact. The recent DiffSwap [61] uses identity features along with an identity loss to maintain identity, and employs DDIM [40] inversion to preserve other parts. The above methods are designed with elaborate modules tailored for specific tasks, rendering them challenging to generalize across different tasks, thereby limiting their versatility and increasing the R&D cost in practical applications. In contrast, universal models, with higher practical value, have garnered significant success in the fields of NLP [1, 30] and segmentation [19]. However, a universal facial editing model is still absent due to the diverse nature of facial tasks.
To address this issue, for the first time, we present a generalist facial editing model, termed $\mathtt{FaceX}$ . Our method handles extensive facial editing tasks with a unified model (see Fig. 1), while maintaining the ability to disentangle and edit various attributes when generating high-quality images. Specifically, there are two significant designs in our $\mathtt{FaceX}$ :
$1)$ Facial Omni-Representation Decomposing: We establish a coherent facial representation for a wide range of facial editing tasks, inspired by probabilistic LDA [15, 31]. Our solution introduces a unified facial representation equation to macroscopically decompose a face into three factors:
$$
\displaystyle\mathtt{X}=\mathcal{G}(\alpha,\beta,\gamma), \tag{1}
$$
where identity $\alpha$ , intra-personal variation $\beta$ , and environmental factors $\gamma$ are fundamental attributes that characterize a face $\mathtt{X}$ , and $\mathcal{G}$ denotes a powerful generative model. Furthermore, we assume that the intra-personal variation can be decomposed into motion, facial texture, and hair, while the environmental factors correspond to illumination and background. As shown in Fig. 2, $\mathtt{FaceX}$ enables clear formula-level task decomposition, easy manipulation, and quick adaptation to various facial editing tasks, making a versatile and efficient solution possible. More specifically, we adopt a pretrained face recognition model [3] to obtain identity features, the pretrained D3DFR model [4] to obtain 3D coefficients for motion variations, and a vision image encoder (e.g., DINOv2 [29] or CLIP [32]) to comprehensively model facial texture, hair, and environment. Leveraging our disentangled omni-representation, we can manipulate different features for diverse editing tasks, cf., Sec. 3.3.
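The decomposition in Eq. 1 can be sketched as a simple routine that routes one image through the three groups of pretrained encoders. This is a minimal illustration only: the function name `decompose_face` and the toy encoder callables are hypothetical stand-ins for the cited models, not the actual implementation.

```python
from typing import Callable, Dict

def decompose_face(image,
                   id_encoder: Callable,     # pretrained face recognition model [3]
                   d3dfr: Callable,          # pretrained D3DFR model [4]
                   image_encoder: Callable   # e.g., DINOv2 [29] or CLIP [32]
                   ) -> Dict[str, object]:
    """Split a face into identity (alpha), intra-personal variation (beta),
    and environmental factors (gamma), mirroring Eq. 1."""
    return {
        "alpha": id_encoder(image),              # identity
        "beta": {                                # intra-personal variation
            "motion": d3dfr(image),              # 3D pose/expression coefficients
            "texture": image_encoder(image),
            "hair": image_encoder(image),
        },
        "gamma": {                               # environmental factors
            "illumination": d3dfr(image),        # lighting from the 3DMM fit
            "background": image_encoder(image),
        },
    }

# Toy lambdas stand in for the real pretrained networks.
rep = decompose_face("face.png",
                     id_encoder=lambda x: "id_feat",
                     d3dfr=lambda x: "3dmm_coeffs",
                     image_encoder=lambda x: "vit_tokens")
```

Editing a task then amounts to swapping entries of this dictionary between a source and a target face before generation.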
$2)$ Steering and Controlling Omni-representation in DM: With the proposed universal facial representation, a core challenge is how to extract and utilize it to control the generation process of DM. Specifically, we utilize the prior of a pretrained StableDiffusion (SD) to enhance generation quality and accelerate training. Existing methods augmenting conditional control in SD employ different fine-tuning approaches: i) The intuitive approach concatenates the input and the noise latent, and fine-tunes the entire U-net, which incurs significant training costs. ii) ControlNet [58] and T2I-Adapter [26] fine-tune additional encoders while fixing the U-net. However, they are only suitable for localized control, lacking low-level texture control. iii) Text-guided control effectively alters texture, but mapping facial representation to the CLIP text domain with a fixed U-net [36] fails at texture reconstruction. Inspired by the gated self-attention in GLIGEN [22] with grounding conditions, we propose a powerful Facial Omni-Representation Steering module (Sec. 3.3) to aggregate task-specific rich information from the input facial images, and then design an efficient and effective Facial Representation Controller (Sec. 3.4) to enable Stable Diffusion to support fine-grained facial representation modulation.
Overall, our contribution can be summarized as follows:
- To the best of our knowledge, the proposed $\mathtt{FaceX}$ is the first generalist facial editing model that seamlessly addresses a variety of facial tasks through a single model.
- We propose a unified facial representation to macroscopically formulate facial compositions, and further design a Facial Omni-Representation Decomposing (FORD) module to microscopically decompose the core aspects of most facial editing tasks to easily manipulate various facial details, including ID, texture, motion, attribute, etc.
- We introduce the Facial Omni-Representation Steering (FORS) to first assemble unified facial representations and then effectively steer SD-aware generation process by the efficient Facial Representation Controller (FRC).
- Extensive experiments on eight tasks validate the unity, efficiency, and efficacy of our $\mathtt{FaceX}$ . Ablation studies affirm the necessity and effectiveness of each module.
2 Related Works
Diffusion Models have made significant progress in image generation, demonstrating exceptional sample quality [13]. Employing a denoising process through the U-Net structure, these models iteratively refine Gaussian noise to generate clean data. However, the quadratic growth in memory and computational demands, primarily due to the self-attention layers in the U-Net, poses a challenge that escalates with increasing input resolution. Recent advancements emphasize speeding up the training and sampling of DMs. Latent DMs (LDMs) [35] are trained in a latent embedding space instead of the pixel space. Additionally, LDMs introduce cross-attention over conditional input feature maps at multiple resolutions in the U-Net, effectively guiding denoising.
Face Editing encompasses both low- and high-level tasks [53, 59, 54, 56, 20, 10, 23, 2, 45, 24, 39, 48, 47, 25, 28, 50, 64, 55, 49, 57]. DifFace [51] retrains the DM from scratch on pre-collected data for face restoration. Face0 [43] facilitates one-shot identity insertion and text-based facial attribute editing. DiffusionRig [5] achieves pose and expression control via the physical buffers of DECA [9] but requires fine-tuning for identity generalization. DiffTalk [38] relies on landmarks and inpainting for talking face generation when the mouth region is driven by audio. DiffSwap [61] leverages landmarks to control expression and pose, uses face ID features as conditions, and relies on a single denoising-step loss to maintain identity.
Existing facial editing tasks encounter common challenges, involving disentangling and editing different attributes, preserving identity or other non-edited attributes during editing, and facilitating generalization for real-world applications. Therefore, instead of adopting the conventional single-model-single-task approach, we comprehensively model facial representations and establish a unified editing framework, supporting single-model-multi-task scenarios.
Condition-guided Controllable SD. The incorporation of conditions can be primarily divided into four categories: 1) Concatenating the control conditions at the input and fully fine-tuning the U-Net is suitable for localized conditions but significantly increases the training cost, e.g., HumanSD [16] and Composer [14]. 2) Projecting and adding conditions to the timestep embedding, or concatenating them with CLIP [32] word embeddings used as context input for the cross-attention layers, is effective for global conditions such as intensity, color, and style. However, fine-tuning the entire U-Net with text-condition pairs (e.g., Composer [14]) incurs high training cost, while fixing the U-Net requires optimization for each condition. 3) Fine-tuning additional encoders while fixing the U-Net is suitable for localized control but not for low-level texture control (e.g., ControlNet [58], T2I-Adapter [26], and LayoutDiffusion [62]). 4) Introducing extra attention layers in the U-Net to incorporate conditions, e.g., GLIGEN [22]. In this paper, we adopt a method akin to GLIGEN for incorporating the unified facial representation, empirically demonstrating its efficiency and effectiveness.
3 Methods
3.1 Preliminary Diffusion Models
Denoising Diffusion Probabilistic Models (DDPMs) are a class of generative models that recover high-quality images from Gaussian noise (i.e., the denoising process) by learning to reverse a Markov chain (i.e., the diffusion process): $\bm{x}_{t}\sim\mathcal{N}\left(\sqrt{\alpha_{t}}\bm{x}_{t-1},\left(1-\alpha_{t}\right)\bm{I}\right)$ , where $\bm{x}_{t}$ is the random variable at the $t$ -th timestep and $\alpha_{t}$ is a predefined coefficient. In practice, $\bm{x}_{t}=\sqrt{\bar{\alpha}_{t}}\bm{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}\bm{\epsilon}$ is used as an approximation to facilitate efficient training, where $\bar{\alpha}_{t}=\prod_{s=1}^{t}\alpha_{s}$ and $\bm{\epsilon}\sim\mathcal{N}(\mathbf{0},\bm{I})$ . By minimizing the ELBO of the diffusion process, the training objective is simplified to $\mathbb{E}_{\bm{x}_{0},\bm{\epsilon},t}\left[\left\|\bm{\epsilon}-\bm{\epsilon}_{\theta}\left(\bm{x}_{t},t\right)\right\|_{2}^{2}\right]$ . At inference, the U-Net-based denoising autoencoder $\bm{\epsilon}_{\theta}\left(\bm{x}_{t},t\right)$ is applied step by step to obtain the final $\bm{x}_{0}$ . As naive DDPMs are computationally costly, the Latent Diffusion Model (LDM) [34] proposes to train the model in a latent space $\bm{z}$ compressed by VQGAN [8]; this basic paradigm is also adopted in this paper.
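The closed-form forward process and the simplified objective above can be sketched in a few lines. This is an illustrative NumPy sketch with a toy linear schedule; the schedule values and shapes are assumptions for demonstration, not the paper's training setup.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # assumed variance schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)           # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t, eps):
    """Closed-form forward process: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

def simple_loss(eps_pred, eps):
    """Simplified objective: mean squared error between true and predicted noise."""
    return float(np.mean((eps - eps_pred) ** 2))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))          # toy "clean image"
eps = rng.standard_normal((4, 4))
xt = q_sample(x0, t=500, eps=eps)         # one noisy training sample
```

A perfect noise predictor $\bm{\epsilon}_{\theta}=\bm{\epsilon}$ drives this loss to zero, which is what training pushes the U-Net toward.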
<details>
<summary>2401.00551v1/x3.png Details</summary>

Pipeline diagram of the $\mathtt{FaceX}$ framework: FORD extracts features via $\varphi^{ID}$, $\varphi^{Reg}$, $\varphi^{Parse}$, $\varphi^{3DMM}$, and $\varphi^{Gaze}$ (with flattening, an FPN Adapter, and mask pooling); FORS assembles them through a Task-specific Region Assembler, a Representation Adapter, and an SD Adapter; FRC injects the resulting conditional representation into the frozen denoising U-Net via an added cross-attention layer followed by zero convolution within the iterative denoising process. Color coding distinguishes fixed, trainable, and shared parameters.
</details>
Figure 3: Overview of the $\mathtt{FaceX}$ framework, which consists of: 1) Facial Omni-Representation Decomposing (FORD) $\bm{\varphi}=\{\bm{\varphi}^{ID},\bm{\varphi}^{Reg},\bm{\varphi}^{Parse},\bm{\varphi}^{3DMM},\bm{\varphi}^{Gaze}\}$ decomposes facial component representations, i.e., $\bm{f}^{ID}$ , $\bm{f}^{R}$ , $\bm{f}^{L}$ , $\bm{f}^{T}$ , $\bm{f}^{E}$ , $\bm{f}^{P}$ , and $\bm{f}^{G}$ . 2) Facial Omni-Representation Steering (FORS) $\bm{\phi}$ contains a Task-specific Representation Assembler to assemble various attributes extracted from the source image $\bm{I}^{S}$ and target image $\bm{I}^{T}$ , which pass through a Representation Adapter $\bm{\phi}^{R}$ to yield $\bm{f}^{Rep}$ ; and a Task-specific Region Assembler to assemble different regions to obtain the inpainting reference image $\bm{I}^{R}$ , which is then processed by an image encoder $\bm{\phi}^{Inp}$ to obtain $\bm{f}^{Inp}$ . After concatenation with $\bm{f}^{Rep}$ , it is processed by the SD Adapter $\bm{\phi}^{SD}$ to obtain the conditional representation $\bm{f}^{SD}$ that is fed into the conditional denoising U-Net $\bm{\epsilon}_{\theta}$ . 3) Facial Representation Controller (FRC): given the basic concatenation of fixed self-/cross-attention operations, we add one extra cross-attention layer. Under the control of $\bm{f}^{SD}$ , it enables generating task-specific output images $\bm{I}^{O}$ . Notably, due to the plug-and-play nature of FRC, representations can be seamlessly integrated by cross-attention layers, allowing the diffusion model to be substituted with any other personalized model from the community.
3.2 Facial Omni-Representation Decomposing
Based on the unified facial representation in Eq. 1, we apply it to actual modeling, i.e., we extract the different facial components with various pre-trained models. As shown on the left side of Fig. 3, the unified facial representation includes:
Identity Features. We use a face recognition model $\bm{\varphi}^{ID}$ [3] to extract discriminative identity features. Unlike prior works that select the highly discriminative features of the last layer, we select the uncompressed feature map of the previous layer, which is flattened as the identity embedding $\bm{f}^{ID}$ . We believe this manner offers richer facial spatial information, while balancing discriminative and generative capabilities.
Region Features. In Fig. 2 -Left, the region features include facial texture, hair, and background. In practical modeling, we further divide facial texture into smaller regions for representation, including eyebrows, eyes, nose, lips, ears, and skin. To align with the SD text space, CLIP ViT [7, 32] is used as the encoder $\bm{\varphi}^{Reg}$ , instead of the PSP [33] commonly used in prior works. However, compared to the hierarchical structure of PSP, the uniform resolution of ViT limits the granularity of spatial information. To address this issue, we employ a learnable FPN Adapter to recover the spatial relationships at a higher resolution. The face parsing model [6] $\bm{\varphi}^{Parse}$ is used to obtain regional masks, and the region features are extracted via mask pooling. Besides CLIP ViT, we also ablate ViTs from different models in Sec. 4.3, finding that the pretrained weights and whether they are fine-tuned significantly impact convergence speed and generated image quality.
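Mask pooling, as used here, averages the encoder's feature map inside each parsing mask to get one embedding per region. A minimal NumPy sketch, assuming a dense $(H, W, C)$ feature map and binary $(R, H, W)$ masks (the shapes and `mask_pool` name are illustrative, not the actual implementation):

```python
import numpy as np

def mask_pool(feat: np.ndarray, masks: np.ndarray) -> np.ndarray:
    """Average-pool an (H, W, C) feature map inside each of R binary
    (R, H, W) parsing masks, yielding an (R, C) region embedding matrix."""
    r = masks.shape[0]
    pooled = np.zeros((r, feat.shape[-1]))
    for i in range(r):
        m = masks[i].astype(bool)
        if m.any():                        # empty regions stay all-zero
            pooled[i] = feat[m].mean(axis=0)
    return pooled

feat = np.random.default_rng(0).standard_normal((16, 16, 8))
masks = np.zeros((3, 16, 16))
masks[0, :8] = 1                           # region 0: top half
masks[1, 8:] = 1                           # region 1: bottom half; region 2 empty
regions = mask_pool(feat, masks)           # shape (3, 8)
```

Each row of `regions` is one region token; the pooling discards spatial layout, which is exactly the structural-information loss the FPN Adapter and inpainting reference are introduced to compensate for.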
Motion Descriptor. The 3D pose/expression coefficients $\bm{f}^{P}$ / $\bm{f}^{E}$ extracted by the pretrained D3DFR model [4] $\bm{\varphi}^{3DMM}$ , together with the gaze embedding $\bm{f}^{G}$ extracted by the gaze estimator [6] $\bm{\varphi}^{Gaze}$ , form a complete motion descriptor. Additionally, the disentangled facial texture $\bm{f}^{T}$ and lighting $\bm{f}^{L}$ work together with the skin region features to enhance the facial generation quality.
3.3 Facial Omni-Representation Steering
The disentangled facial representation can be flexibly recombined for various facial editing tasks, as illustrated in Fig. 1. We propose three components to reassemble and fuse features to steer the task-specific generation process.
Task-specific Representation Assembler reassembles the representations of the source and target images at the feature level, obtaining the reassembled features $\bm{f}^{Rep}$ via a Representation Adapter $\bm{\phi}^{R}$ , which consists of a linear layer per representation to transform the feature dimension for further concatenation. Complex facial editing tasks, including reenactment, face swapping, and head swapping, are used as examples here. For all three tasks, the identity features and motion descriptors come from the source and target images, respectively. The combination of region features differs for each task, which is detailed in Sec. 3.4.
Although mask pooling of region features makes appearance editing easier, it results in loss of structural information, leading to increased training difficulty and lack of detail in the generated results. To tackle this issue, prior works commonly use masks as structure guidance [11, 66]. However, mask-based structure guidance only supports aligned attribute swapping and struggles to handle motion transformation. For instance, when swapping a front-facing head onto a side profile, the mask also needs to rotate accordingly. Otherwise, the strong structural constraints will lead to a result where the front-facing face is forcibly squeezed into the side profile. HS-diffusion [44] attempts to address these motion-caused structural changes by training an additional mask converter, but the outcomes are not satisfactory.
Task-specific Region Assembler is introduced to tackle this problem. Different regions are assembled at the image level to obtain the region-swapped image $\bm{I}^{R}$ , which acts as the inpainting reference for the model. $\bm{I}^{R}$ differs for each task, which is detailed in Sec. 3.4. The inpainting reference $\bm{I}^{R}$ goes through an image encoder $\bm{\phi}^{Inp}$ and obtains the image representation $\bm{f}^{Inp}$ . Instead of imposing strong structural constraints through masks, introducing the inpainting reference provides structural clues for the model and meanwhile encourages reasonable imagination. Furthermore, this approach introduces additional rich and detailed local structural information, such as hair texture.
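The image-level assembly amounts to compositing: pixels selected by a parsing mask are taken from one image and the rest from the other, yielding the inpainting reference $\bm{I}^{R}$. A simplified NumPy sketch; the `assemble_reference` name and the single-mask face-swap setup are illustrative assumptions, since the actual region-to-task mapping is given in Sec. 3.4.

```python
import numpy as np

def assemble_reference(src: np.ndarray, tgt: np.ndarray,
                       src_mask: np.ndarray) -> np.ndarray:
    """Paste masked source regions onto the target image: pixels where
    src_mask == 1 come from the source, the rest from the target."""
    m = src_mask[..., None].astype(src.dtype)   # broadcast mask over channels
    return m * src + (1.0 - m) * tgt

# Toy 4x4 RGB images: all-ones source, all-zeros target.
src = np.ones((4, 4, 3))
tgt = np.zeros((4, 4, 3))
face_mask = np.zeros((4, 4))
face_mask[1:3, 1:3] = 1                         # e.g., the face region for face swap
ref = assemble_reference(src, tgt, face_mask)   # inpainting reference I^R
```

Because `ref` is a soft structural hint rather than a hard mask constraint, the model can still adjust the pasted region's geometry (e.g., rotating a frontal face onto a profile) during generation.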
SD Adapter $\bm{\phi}^{SD}$ adapts the concatenated facial representation to obtain $\bm{f}^{SD}$ , effectively steering subsequent SD-aware generation process.
Diverse and Mixture Editing is realized by our single model, allowing modifications like glasses, beards, shapes, hairstyles, inpainting, or even their combinations. This enhances the interactivity of editing, facilitated by the intuitive image-level region assembler. To the best of our knowledge, $\mathtt{FaceX}$ stands out as the pioneering work achieving cross-task mixture editing, surpassing the capabilities of existing task-specific methods. We hope it serves as a seed with potential to inspire novel and intriguing applications in the future.
3.4 Facial Representation Controller
For conditional generative models, a core challenge is how to effectively and efficiently use the rich facial representation $\bm{f}^{SD}$ to guide the generation process of the target image $\bm{I}^{O}$ . Here, we utilize the prior of a pretrained StableDiffusion (SD) [35] to accelerate training and enhance generation quality. Unlike recent efficient finetuning schemes [6], we propose a Facial Representation Controller (FRC) module to extend the basic Transformer block in LDM [34]. Specifically, the original Transformer block of LDM consists of two attention layers: one self-attention over the visual tokens $\bm{v}$ , followed by cross-attention from context tokens $\bm{f}^{SD}$ . By considering the residual connection, the two layers can be written as:
$$
\displaystyle\bm{v}=\bm{v}+\operatorname{SelfAttn_{fix}}(\bm{v}),\quad\bm{v}=\bm{v}+\operatorname{CrossAttn_{fix}}\left(\bm{v},\bm{f}^{SD}\right), \tag{2}
$$
When $\bm{f}^{SD}$ is used as the condition, we empirically find that these two frozen layers alone capture coarse identity and motion, but reconstruct texture details poorly (cf. the qualitative results in Fig. 11-right). We hypothesize that this is because the SD text space, unlike the StyleGAN latent space, is not a continuous, dense facial semantic space, making it difficult to map facial representations into it. However, finetuning the entire SD to adapt to the facial domain is computationally expensive, and we wish to preserve the SD prior as much as possible. Therefore, instead of finetuning the original cross-attention layer, we add a new cross-attention layer after it. Fine-tuning only this newly added layer teaches the network to accept facial representations that modulate the intermediate U-Net features. We further append a zero convolution layer after the new cross-attention layer, so that training starts from a point equivalent to the original U-Net:
$$
\bm{v}=\bm{v}+\operatorname{ZeroConv}\left(\operatorname{CrossAttn_{ft}}\left(\bm{v},\bm{f}^{SD}\right)\right). \tag{3}
$$
Compared to finetuning the entire SD, this approach is more efficient and effective. Moreover, owing to the plug-and-play design, our generalist facial editing model supports loading the personalized models of SD from the community, which can be easily extended to other tasks such as animation.
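To make the zero-initialization argument concrete, the block below is a minimal numpy sketch of Eqs. (2) and (3), not the actual SD/LDM implementation: a toy single-head attention stands in for the frozen SD layers, and a zero-initialized matrix plays the role of ZeroConv, so at the start of training the extended block computes exactly what the original block does.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                 # toy feature dimension
v = rng.normal(size=(4, d))           # visual tokens
f_sd = rng.normal(size=(3, d))        # facial representation f^SD

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attn(q_in, kv_in, Wq, Wk, Wv):
    """Minimal single-head attention standing in for SD's attention layers."""
    q, k, val = q_in @ Wq, kv_in @ Wk, kv_in @ Wv
    return softmax(q @ k.T / np.sqrt(d)) @ val

# frozen weights of the pretrained block, plus the new trainable layer
Wq1, Wk1, Wv1 = [rng.normal(size=(d, d)) for _ in range(3)]
Wq2, Wk2, Wv2 = [rng.normal(size=(d, d)) for _ in range(3)]
Wq3, Wk3, Wv3 = [rng.normal(size=(d, d)) for _ in range(3)]
W_zero = np.zeros((d, d))             # zero-initialized projection ("ZeroConv")

def original_block(x):
    x = x + attn(x, x, Wq1, Wk1, Wv1)        # frozen self-attention
    x = x + attn(x, f_sd, Wq2, Wk2, Wv2)     # frozen cross-attention, Eq. (2)
    return x

def frc_block(x):
    x = original_block(x)
    # new fine-tuned cross-attention followed by the zero projection, Eq. (3)
    return x + attn(x, f_sd, Wq3, Wk3, Wv3) @ W_zero
```

Because `W_zero` starts at zero, the new branch contributes nothing at initialization; gradients through the new cross-attention weights let it grow during fine-tuning.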
<details>
<summary>2401.00551v1/x4.png Details</summary>

### Visual Description
# Technical Document: Image Editing Task Representation Table
## Overview
The image presents a structured table detailing image editing tasks, their representation parameters, and operational regions. The table is organized into four primary sections: **Attribute Editing**, **Face Swap**, **Head Swap**, and **Reenact/Animate/Inpaint**. Each section specifies source/target attributes, inpainting references, and operations.
---
## Table Structure
### Header Row
| **Tasks** | **Representation** | **Region** |
|-------------------------|--------------------------|--------------------------|
| | **Source** | **Target** |
| | **Inpainting Ref.** | **Operation** |
---
### Section 1: Attribute Editing
| **Tasks** | **Representation** | **Region** |
|-------------------------|--------------------------|--------------------------|
| **Attribute Editing** | **Source**: Any | **Target**: The others |
| | **Inpainting Ref.**: | **Operation**: |
| | [Image: Face with eyes masked out] | **Mask Out / Add New** |
---
### Section 2: Face Swap
| **Tasks** | **Representation** | **Region** |
|-------------------------|--------------------------|--------------------------|
| **Face Swap** | **Source**: | **Target**: |
| | - Eyebrows | - Background |
| | - Eyes | - Ears |
| | - Nose | - Hair |
| | - Lips | |
| | - Skin | |
| | **Inpainting Ref.**: | **Operation**: |
| | [Image: Face fully masked out] | **Mask Out + Dilate** |
---
### Section 3: Head Swap
| **Tasks** | **Representation** | **Region** |
|-------------------------|--------------------------|--------------------------|
| **Head Swap** | **Source**: | **Target**: |
| | - Eyebrows | - Background |
| | - Eyes | |
| | - Nose | |
| | - Lips | |
| | - Hair | |
| | - Ears | |
| | - Skin | |
| | **Inpainting Ref.**: | **Operation**: |
| | [Image: Grayscale head with white outline] | **Mask Out + Dilate + Grayscale Source Head** |
---
### Section 4: Reenact/Animate/Inpaint
| **Tasks** | **Representation** | **Region** |
|-------------------------|--------------------------|--------------------------|
| **Reenact/Animate/Inpaint** | **Source**: All | **Target**: None |
| | **Inpainting Ref.**: | **Operation**: |
| | [Image: Normal face] | **None** |
---
## Key Observations
1. **Task-Specific Attributes**:
- **Attribute Editing**: Modifies arbitrary attributes ("Any") targeting other regions.
- **Face Swap/Head Swap**: Focuses on facial features (eyebrows, eyes, nose, lips, hair, skin) with background as a target.
- **Reenact/Animate/Inpaint**: Applies to all attributes with no target region.
2. **Inpainting References**:
- **Attribute Editing**: Partial masking (eyes).
- **Face Swap**: Full facial masking.
- **Head Swap**: Grayscale head with white outline.
- **Reenact/Animate/Inpaint**: No masking (normal face).
3. **Operations**:
- **Attribute Editing**: Masking or adding new elements.
- **Face Swap/Head Swap**: Masking + dilation.
- **Head Swap**: Additional grayscale source head integration.
- **Reenact/Animate/Inpaint**: No operation required.
---
## Notes
- **Language**: All text is in English.
- **Visual Elements**: Images in the "Inpainting Ref." column are critical for understanding operational context (e.g., masking patterns, grayscale conversion).
- **No Data Trends**: The table is categorical, not numerical, so trend analysis is inapplicable.
</details>
Figure 4: Illustrations on task-specific representation and region assemblers, showing omni-representation decomposing of popular facial tasks. The representation here indicates the region feature $\bm{f}^{R}$ , encompassing facial texture, hair and background, as inherited from Fig. 2. However, with more detailed divisions, facial texture is further separated into eyebrows, eyes, nose, lips, ears, and skin.
3.5 Training and Inference Details
Generalist Model. During training, both Task-specific Region and Representation Assemblers utilize the assembly method of head swapping. During testing, they perform according to the definitions of each task. This is because head swapping encompasses both reenactment and face swapping subtasks. In a nutshell, our generalist single model is trained once and supports diverse facial editing tasks.
Specialized Models. Other facial editing tasks impose much weaker requirements on region-attribute disentanglement than head swapping. To further improve performance on these subtasks, we finetune our model on each of them. In both training and testing, the Task-specific Region and Representation Assemblers follow the definition of the respective task.
Task-specific Representation Assembler. The representation combination methods for each task are defined in Fig. 4. For reenactment, all source region features are used. For face swapping, the eyebrows, eyes, nose, lips, and skin features of the source image are combined with other features of the target image. For head swapping, the eyebrows, eyes, nose, lips, hair, ears, and skin features of the source image are combined with other features of the target image.
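As a sketch of this per-task lookup, the snippet below encodes the Fig. 4 definitions as plain Python data. The region names and dictionary layout are illustrative, not the actual implementation.

```python
ALL_REGIONS = ["eyebrows", "eyes", "nose", "lips", "skin",
               "hair", "ears", "background"]

# which regions come from the source for each task (per Fig. 4)
SOURCE_REGIONS = {
    "reenactment": set(ALL_REGIONS),                       # all from source
    "face_swap": {"eyebrows", "eyes", "nose", "lips", "skin"},
    "head_swap": {"eyebrows", "eyes", "nose", "lips",
                  "skin", "hair", "ears"},
}

def assemble_features(src_feats, tgt_feats, task):
    """Pick each region feature f^R from the source or the target per task."""
    use_src = SOURCE_REGIONS[task]
    return {r: (src_feats[r] if r in use_src else tgt_feats[r])
            for r in ALL_REGIONS}

src = {r: "src_" + r for r in ALL_REGIONS}
tgt = {r: "tgt_" + r for r in ALL_REGIONS}
combined = assemble_features(src, tgt, "head_swap")
```

For head swapping, hair and ears follow the source while the background follows the target; face swapping additionally keeps the target's hair and ears.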
Task-specific Region Assembler. The region combination methods for each task are defined in Fig. 4. For face reenactment, the entire source image is used. For face swapping, the source face is recombined with the hair and background of the target. To avoid residual irrelevant information, the union of the source and target face areas is dilated. For head swapping, the grayscale source head is recombined with the target background, and the edges are cut out using dilation.
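The dilation-and-grayscale logic for head swapping can be sketched in numpy as follows. This is a toy illustration under stated assumptions: a 4-neighborhood dilation stands in for a proper morphological operator (e.g. `cv2.dilate`), and `head_swap_reference` is a hypothetical helper, not the authors' code.

```python
import numpy as np

def dilate(mask, iterations=1):
    """Toy 4-neighborhood binary dilation (a stand-in for cv2.dilate)."""
    m = mask.copy()
    for _ in range(iterations):
        p = np.pad(m, 1)
        m = p[:-2, 1:-1] | p[2:, 1:-1] | p[1:-1, :-2] | p[1:-1, 2:] | m
    return m

def head_swap_reference(src, tgt, head_mask):
    """Hypothetical head-swap assembler: grayscale source head pasted on the
    target, with a dilated border cut out for the model to inpaint."""
    gray = src.mean(axis=-1, keepdims=True).repeat(3, axis=-1)
    cut = dilate(head_mask, iterations=2)   # remove residual target pixels
    ref = tgt.astype(float).copy()
    ref[cut] = 0.0                          # dilated edge masked out
    ref[head_mask] = gray[head_mask]        # grayscale source head
    return ref

# toy example: uniform target, source whose per-pixel channel mean is 60
src = np.zeros((8, 8, 3))
src[..., 0], src[..., 1], src[..., 2] = 30.0, 60.0, 90.0
tgt = np.full((8, 8, 3), 10.0)
head_mask = np.zeros((8, 8), dtype=bool)
head_mask[3:5, 3:5] = True
ref = head_swap_reference(src, tgt, head_mask)
```

The zeroed ring between the head and the background is what the model is asked to inpaint, which avoids hard seams at the boundary.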
4 Experiment
Dataset. We train $\mathtt{FaceX}$ on the CelebV [65] dataset. For the face reenactment task, we evaluate on FFHQ [17] and VoxCeleb1 [27] test sets. For face swapping tasks, we evaluate on FaceForensics++ [37] (FF++). For head swapping tasks, we evaluate our model using FFHQ [17] dataset. Additionally, we randomly collect images of well-known individuals from the Internet to demonstrate the qualitative results of each sub-task.
Metrics. We evaluate different methods from three perspectives: 1) Motion. We assess motion accuracy by computing the average $L_{2}$ distance of pose, expression, and gaze embeddings between the generated and target faces; each embedding is derived from its respective estimator. 2) Identity. We compute the cosine similarity of the identity features, extracted by a face recognition model, between the generated and source faces. 3) Image Quality. We use the Fréchet Inception Distance (FID) to assess the quality of the generated faces.
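The motion and identity metrics reduce to simple embedding distances. A minimal numpy sketch is given below; the estimator and face-recognition networks that produce the embeddings are assumed external.

```python
import numpy as np

def motion_error(gen_emb, tgt_emb):
    """Average L2 distance between generated and target motion embeddings
    (pose / expression / gaze, each from its external estimator)."""
    return float(np.mean(np.linalg.norm(gen_emb - tgt_emb, axis=-1)))

def id_similarity(gen_id, src_id):
    """Cosine similarity of identity features from a face recognizer."""
    return float(gen_id @ src_id
                 / (np.linalg.norm(gen_id) * np.linalg.norm(src_id)))

# toy embeddings: 5 frames of 4-dim motion codes offset by 1 per dimension
pose_err = motion_error(np.ones((5, 4)), np.ones((5, 4)) + 1.0)
sim = id_similarity(np.array([1.0, 0.0]), np.array([1.0, 0.0]))
```

FID is computed with a standard off-the-shelf implementation and is omitted here.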
Training Details. We start training from the StableDiffusion v1-5 model and OpenAI’s clip-vit-large-patch14 vision model at a resolution of $256$ . For higher resolutions of $512$ or $768$ , we finetune on SD v2.0. As the head swapping task exercises all framework components and thus encompasses a comprehensive set of sub-capabilities, we designate the head-swapping model as our generalist model. Training our generalist model takes 20k steps on 4 V100 GPUs, with a constant learning rate of $1e{-5}$ and a batch size of $32$ . Notably, inpainting and animation require no additional finetuning: the generalist model inherently possesses robust inpainting capabilities, and animation can be accomplished at test time by directly loading community model weights. For face reenactment and swapping, we further finetune for $15$ k and $5$ k steps respectively, with a subset of framework components. To facilitate classifier-free guidance sampling, we train the model without conditions on $10\%$ of the instances.
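The classifier-free guidance setup can be sketched as follows. `drop_condition` and `cfg` are hypothetical helpers illustrating the standard recipe (condition dropout during training, guided prediction at sampling), not the exact training code.

```python
import numpy as np

rng = np.random.default_rng(0)

def drop_condition(cond, null_cond, p=0.1):
    """During training, replace the facial condition with a null embedding
    with probability p, so the model also learns the unconditional mode."""
    return null_cond if rng.random() < p else cond

def cfg(eps_cond, eps_uncond, scale):
    """Classifier-free guidance: extrapolate from the unconditional noise
    prediction toward the conditional one at sampling time."""
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

At `scale=1` the guided prediction reduces to the conditional one; larger scales trade diversity for condition fidelity.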
4.1 Results of Popular Facial Tasks
Our generalist model encapsulates the capabilities of all subtasks, liberating facial editing from fixed-structure appearance modifications in specific tasks, enabling dynamic facial edits, and broadening the diversity of possible edits. However, the intricate disentanglement of representations and regions leads to a relative performance decrease on tasks that require less decoupling, e.g., face reenactment and swapping. To address this, we fine-tune the generalist model on these specific tasks, mitigating the performance drop and improving their metrics.
<details>
<summary>2401.00551v1/x5.png Details</summary>

### Visual Description
# Technical Document: Image Analysis
## Labels and Axis Titles
- **Column Headers (Top Row):**
- `Source`
- `Target`
- `TPSM`
- `DAM`
- `FADM`
- `Ours`
- **Row Structure:**
- 4 rows of images, each row containing 6 images (one per column).
- No axis markers or legends present beyond the column headers.
## Spatial Grounding
- **Legend Placement:**
- Column headers are positioned at the **top row**, spanning horizontally across the image.
- No additional legend or colorbar is visible.
## Component Isolation
### Header
- Column headers define the six categories: `Source`, `Target`, `TPSM`, `DAM`, `FADM`, and `Ours`.
### Main Chart
- **Structure:**
- A 4x6 grid of images.
- Each column represents a category, with images arranged vertically (top to bottom).
### Footer
- No footer content present.
## Image Content Description
- **Images:**
- Each cell contains a face with varying hairstyles, expressions, and accessories (e.g., hats, wigs).
- **Row 1:**
- `Source`: Young child with dark hair.
- `Target`: Woman with red hair and makeup.
- `TPSM`, `DAM`, `FADM`, `Ours`: Child with altered expressions (e.g., open mouth, side glances).
- **Row 2:**
- `Source`: Elderly person with a colorful hat.
- `Target`: Woman with blonde hair and makeup.
- `TPSM`, `DAM`, `FADM`, `Ours`: Elderly person with exaggerated facial features (e.g., wrinkles, distorted expressions).
- **Row 3:**
- `Source`: Person with bright yellow hair.
- `Target`: Woman with blonde hair and makeup.
- `TPSM`, `DAM`, `FADM`, `Ours`: Yellow-haired person with varying expressions (e.g., neutral, smiling).
- **Row 4:**
- `Source`: Man with glasses and microphone.
- `Target`: Man with short hair and neutral expression.
- `TPSM`, `DAM`, `FADM`, `Ours`: Man with altered expressions (e.g., smiling, speaking).
## Notes
- **No Numerical Data or Charts:** The image is a qualitative comparison of facial transformations, not a quantitative dataset.
- **Textual Elements:** All labels are in English. No embedded text within images.
- **Color Consistency:** No color-coding applied to images; differences are visual (e.g., hairstyles, expressions).
</details>
Figure 5: Qualitative comparison results on face reenactment.
<details>
<summary>2401.00551v1/x6.png Details</summary>

### Visual Description
# Technical Document Extraction: Face-Swapping Method Comparison
## Image Structure Overview
The image is a comparative visualization of face-swapping techniques, organized into a grid with labeled rows and columns. It contains **no numerical data** or charts, but includes textual labels and image examples.
---
## Row 1: Gender Swap Comparison
### Labels and Components
- **Columns**:
1. **Source**: Original face images (2 examples: male, female).
2. **Target**: Desired face images (2 examples: male, female).
3. **HifiFace**: Output from HifiFace method (2 examples).
4. **E4S**: Output from E4S method (2 examples).
5. **DiffSwap**: Output from DiffSwap method (2 examples).
6. **BlendFace**: Output from BlendFace method (2 examples).
7. **Ours**: Output from the proposed method (2 examples).
### Spatial Grounding
- All labels are positioned at the top of their respective columns.
- Images are aligned horizontally under each label.
---
## Row 2: Feature-Specific Swaps (Female Target)
### Labels and Components
- **Columns**:
1. **Target**: Female face image (blonde hair, white shirt).
2. **Eyebrows**: Modified eyebrow variations.
3. **Nose**: Modified nose variations.
4. **Lips**: Modified lip variations.
### Spatial Grounding
- Labels are positioned above each column.
- Images show incremental changes to specific facial features.
---
## Row 3: Feature-Specific Swaps (Male Target)
### Labels and Components
- **Columns**:
1. **Target**: Male face image (blonde hair, blue shirt).
2. **Eyebrows**: Modified eyebrow variations.
3. **Nose**: Modified nose variations.
4. **Lips**: Modified lip variations.
### Spatial Grounding
- Labels are positioned above each column.
- Images demonstrate feature-specific adjustments.
---
## Row 4: Additional Feature Swaps (Male Target)
### Labels and Components
- **Columns**:
1. **Target**: Male face image (dark hair, suit).
2. **Eyebrows**: Modified eyebrow variations.
3. **Nose**: Modified nose variations.
4. **Lips**: Modified lip variations.
### Spatial Grounding
- Labels are positioned above each column.
- Images show variations in facial features.
---
## Key Observations
1. **Method Comparison**: The first row compares gender-swapped results across five methods (HifiFace, E4S, DiffSwap, BlendFace, Ours).
2. **Feature Isolation**: Subsequent rows focus on isolated facial features (eyebrows, nose, lips) for both male and female targets.
3. **Visual Trends**: No numerical trends exist; the image relies on qualitative visual differences between methods.
4. **Language**: All text is in English. No non-English content is present.
---
## Conclusion
This image serves as a qualitative benchmark for face-swapping techniques, emphasizing method performance and feature-specific adjustments. No numerical data or structured tables are included.
</details>
Figure 6: Top: Qualitative comparison results on face swapping. Bottom: Controllable face swapping.
Table 1: Quantitative experiments on cross-identity face reenactment, using VoxCeleb test images to drive the FFHQ images.
| | Exp Err. $\downarrow$ | Pose Err. $\downarrow$ | Gaze Err. $\downarrow$ | ID Simi. $\uparrow$ | FID $\downarrow$ |
| --- | --- | --- | --- | --- | --- |
| CVPR’22 TPSM | 6.10 | 0.0535 | 0.0900 | 0.5836 | 50.43 |
| CVPR’22 DAM | 6.31 | 0.0626 | 0.0967 | 0.5534 | 54.13 |
| CVPR’23 FADM | 6.71 | 0.0821 | 0.1242 | 0.6522 | 42.22 |
| Ours-Generalist | 5.45 | 0.0542 | 0.0758 | 0.6612 | 43.34 |
| Ours-Finetuned Specialized | 5.03 | 0.0503 | 0.0614 | 0.6778 | 35.67 |
Table 2: Quantitative results for face swapping on FF++.
| | Exp Err. $\downarrow$ | Pose Err. $\downarrow$ | Gaze Err. $\downarrow$ | ID Simi. $\uparrow$ | FID $\downarrow$ |
| --- | --- | --- | --- | --- | --- |
| IJCAI’21 HifiFace | 5.50 | 0.0506 | 0.0650 | 0.4971 | 21.88 |
| CVPR’23 E4S | 5.23 | 0.0497 | 0.0791 | 0.4792 | 36.56 |
| Ours-Generalist | 5.29 | 0.0503 | 0.0693 | 0.5031 | 44.32 |
| Ours-Finetuned Specialized | 5.14 | 0.0501 | 0.0674 | 0.5088 | 36.24 |
Face Reenactment. In Fig. 5, we compare $\mathtt{FaceX}$ with SoTA methods, including the GAN-based TPSM [60] and DAM [41], and the diffusion-based FADM [52]. When handling unseen identities at the same resolution, our method consistently generates significantly superior results with richer texture details, e.g., teeth, hair, and accessories. Our approach maintains identity faithfully even when source faces differ in ethnicity and age, exhibit extreme poses, or are partially occluded. Tab. 1 quantitatively demonstrates that our model delivers more precise motion control.
<details>
<summary>2401.00551v1/x7.png Details</summary>

### Visual Description
# Technical Document: Image Analysis
## Overview
The image is a collage of **8 photographs** arranged in a **2x4 grid**, with labels indicating different processing stages or methods. Each column represents a distinct category, and each row corresponds to a different individual. The text in the image is primarily in **English**.
---
## Labels and Categories
### Column Labels (Top Row)
1. **Source**: Original, unprocessed image.
2. **Target**: Reference image for transformation.
3. **Ours**: Output from the proposed method.
4. **HeSer**: Output from a competing method (likely "HeSer" as a placeholder name).
### Row Labels (Implicit)
- **Row 1**: Individual with **gray hair**, wearing a suit.
- **Row 2**: Individual with **dark hair and beard**, casual attire.
---
## Textual Elements
### Background Text
- **Top Row, Column 1 (Source)**:
- Visible text: `"AON"` and `"adidas"` logos in the background.
- **Bottom Row, Column 1 (Source)**:
- Visible text: `"OOO"` in a decorative font.
---
## Image Descriptions
### Row 1 (Gray-Haired Individual)
1. **Source**:
- Formal attire (suit), neutral expression.
- Background: Dark with `"AON"` and `"adidas"` branding.
2. **Target**:
- Blurred face, similar pose to Source.
- Background: Red-and-white striped pattern (possibly a flag).
3. **Ours**:
- Slightly altered facial features (e.g., sharper jawline).
- Background: Similar to Target.
4. **HeSer**:
- Further refined features compared to "Ours".
- Background: Consistent with Target/Ours.
### Row 2 (Dark-Haired Individual)
1. **Source**:
- Casual attire, neutral expression.
- Background: Green with decorative gold patterns.
2. **Target**:
- Blurred face, similar pose to Source.
- Background: Dark gradient.
3. **Ours**:
- Enhanced facial details (e.g., more defined beard).
- Background: Consistent with Target.
4. **HeSer**:
- Most refined output, with subtle texture adjustments.
- Background: Matches Target/Ours.
---
## Observations
1. **Purpose**: The collage compares the effectiveness of image processing methods (e.g., face enhancement, style transfer).
2. **Method Comparison**:
- **"Ours"** and **"HeSer"** appear to refine the "Target" image, with "HeSer" showing marginally better results in some cases.
3. **Textual Context**: Background text suggests corporate or branded environments, possibly for marketing or identity verification use cases.
---
## Conclusion
The image serves as a visual comparison of image processing techniques, with labels explicitly defining the stages or methods applied. No numerical data or charts are present; the focus is on qualitative differences in output quality.
</details>
Figure 7: Qualitative comparison with HeSer on head swapping.
<details>
<summary>2401.00551v1/x8.png Details</summary>

### Visual Description
# Technical Document Extraction: Image Analysis
## Image Description
The image is a collage of 12 photographs arranged in a **3x4 grid** (3 rows, 4 columns). Each row contains three distinct sections labeled **Source**, **Target**, and **Result**, with corresponding images beneath each label. The layout suggests a comparison or transformation process between the Source and Target images, resulting in the Final image.
---
### Labels and Axis Titles
- **Top Row Labels** (repeated across columns):
- **Column 1**: `Source`, `Target`, `Result`
- **Column 2**: `Source`, `Target`, `Result`
- **Column 3**: `Source`, `Target`, `Result`
- **Column 4**: `Source`, `Target`, `Result`
- **No axis titles, legends, or axis markers** are present in the image.
---
### Textual Content in Images
- **No embedded text** is visible within the photographs themselves. All text is limited to the labels at the top of the grid.
---
### Spatial Grounding
- **Labels**: Positioned at the **top edge** of each column, centered above the corresponding images.
- **Images**: Arranged in a **3x4 grid** with equal spacing between rows and columns.
---
### Component Isolation
#### Header
- **Text**: `Source`, `Target`, `Result` (repeated for each column).
#### Main Grid
- **Structure**:
- **Row 1**:
- **Column 1**: Source (man and woman embracing), Target (man with dark hair), Result (man with blonde hair).
- **Column 2**: Source (woman with curly hair), Target (blonde woman), Result (woman with dark hair).
- **Column 3**: Source (woman with curly hair), Target (blonde woman), Result (woman with dark hair).
- **Column 4**: Source (woman with curly hair), Target (blonde woman), Result (woman with dark hair).
- **Row 2**:
- **Column 1**: Source (child with blonde hair), Target (woman with shaved head), Result (child with blonde hair).
- **Column 2**: Source (woman with dark hair), Target (woman with red hair), Result (woman with dark hair).
- **Column 3**: Source (woman with dark hair), Target (woman with red hair), Result (woman with dark hair).
- **Column 4**: Source (woman with dark hair), Target (woman with red hair), Result (woman with dark hair).
- **Row 3**:
- **Column 1**: Source (man with glasses), Target (man with red hair), Result (man with glasses).
- **Column 2**: Source (man with red hair), Target (man with glasses), Result (man with red hair).
- **Column 3**: Source (man with glasses), Target (man with red hair), Result (man with glasses).
- **Column 4**: Source (woman with dark hair), Target (man with dark hair), Result (woman with dark hair).
---
### Observations
1. **Pattern**: Each row appears to demonstrate a transformation or comparison between the Source and Target images, resulting in the Final image.
2. **Subjects**: The images include adults, children, and individuals with varying hairstyles, clothing, and expressions.
3. **No Numerical Data**: The image does not contain charts, diagrams, or data tables. It is purely a collage of photographs.
---
### Conclusion
The image is a visual comparison grid with no embedded textual data beyond the labels. It does not contain numerical values, trends, or diagrammatic components requiring further analysis.
</details>
Figure 8: Qualitative results on head swapping.
<details>
<summary>2401.00551v1/x9.png Details</summary>

### Visual Description
# Technical Document: Image Transformation Grid Analysis
## Structure Overview
The image presents a 5x4 grid of facial transformations, organized as follows:
### Row Labels (Vertical Axis)
1. **Source**
- Original reference images for each subject
2. **Change Identity**
- Facial features altered to resemble different individuals
3. **Change Motion**
- Dynamic expressions or head movements applied
4. **Change Hairstyle**
- Varied haircuts/styles implemented
5. **Add Attribute**
- Accessories added to base images
### Column Labels (Horizontal Axis)
1. **Scarlett Johansson**
2. **Emma Stone**
3. **Chris Evans**
4. **Kristen Wiig**
## Key Observations
### Row-Specific Transformations
1. **Change Identity**
- Each subject's face is replaced with another individual's features while maintaining original pose/angle
- Example: Scarlett Johansson's face transformed to resemble Emma Stone
2. **Change Motion**
- Subjects shown in different facial expressions (e.g., smiling, surprised)
- Head orientation varies (frontal, 3/4 profile)
3. **Change Hairstyle**
- Hair length, color, and styling altered
- Includes both natural and dramatic changes (e.g., short to long hair)
4. **Add Attribute**
- Physical additions to base images:
- **Sunglasses** (black frames)
- **Glasses** (red frames)
- **Beard** (full facial hair)
- **Mustache** (isolated upper lip hair)
## Spatial Grounding
- **Legend Position**: Row labels act as categorical legend (y-axis)
- **Color Consistency**: No color-coded data series present
- **Coordinate System**:
- x-axis: Subject identity (Scarlett Johansson → Kristen Wiig)
- y-axis: Transformation type (Source → Add Attribute)
## Component Isolation
### Header
- Grid title: "Image Transformation Grid"
- Axes labels: "Change" (y-axis), "Identity" (x-axis)
### Main Chart
- 20 transformation examples (5 rows × 4 columns)
- Each cell contains:
- Source image (top row)
- Modified version (subsequent rows)
- Attribute icons (bottom row)
### Footer
- No explicit footer elements present
## Trend Verification
- **Identity Transformation**: Linear progression from original to altered features
- **Motion Variation**: No quantitative trend; qualitative expression changes
- **Hairstyle Evolution**: No measurable progression; categorical changes
- **Attribute Addition**: Discrete additions without quantitative scaling
## Data Table Reconstruction
| Subject | Source | Change Identity | Change Motion | Change Hairstyle | Add Attribute (Sunglasses) | Add Attribute (Glasses) | Add Attribute (Beard) | Add Attribute (Mustache) |
|------------------|--------|-----------------|---------------|------------------|----------------------------|-------------------------|-----------------------|--------------------------|
| Scarlett Johansson | Original | Altered Face 1 | Smiling | Short Hair | Not Applicable | Not Applicable | Not Applicable | Not Applicable |
| Emma Stone | Original | Altered Face 2 | Surprised | Long Hair | Not Applicable | Not Applicable | Not Applicable | Not Applicable |
| Chris Evans | Original | Altered Face 3 | Laughing | Bangs | Not Applicable | Not Applicable | Not Applicable | Not Applicable |
| Kristen Wiig | Original | Altered Face 4 | Profile View | Curly Hair | Not Applicable | Not Applicable | Not Applicable | Not Applicable |
*Note: "Altered Face X" represents identity-swapped versions of the source images.*
## Critical Notes
1. No numerical data or quantitative measurements present
2. All transformations are qualitative visual changes
3. Attribute additions appear as discrete elements rather than cumulative modifications
</details>
Figure 9: Progressive Editing using our generalist model.
<details>
<summary>2401.00551v1/x10.png Details</summary>

### Visual Description
# Technical Document Extraction: Image Analysis
## Image Description
The image contains two sets of three photographs each, arranged horizontally. Each set demonstrates a visual transformation process labeled with textual annotations. No numerical data, charts, or diagrams are present.
---
### Left Set (Male Subject)
1. **Inpaint**
- Original photograph with a rectangular black block obscuring the face.
- Subject: Male with dark hair, wearing a black jacket.
- Background: Neutral, light-colored.
2. **Stylization1**
- Enhanced version of the original image.
- Facial features (eyes, nose, mouth) reconstructed with realistic details.
- Hair and clothing details sharpened.
3. **Stylization2**
- Further refined version of the original image.
- Increased realism in facial textures and lighting.
- Subtle adjustments to hair and clothing contours.
---
### Right Set (Female Subject)
1. **Inpaint**
- Original photograph with a rectangular black block obscuring the face.
- Subject: Female with blonde hair, wearing light makeup.
- Background: Soft pink gradient.
2. **Stylization1**
- Enhanced version of the original image.
- Facial features reconstructed with natural skin tones and makeup details.
- Hair texture and lighting improved.
3. **Stylization2**
- Further refined version of the original image.
- Hyper-realistic rendering of facial features and hair.
- Enhanced depth in eye and lip details.
---
### Key Observations
- **Label Consistency**: Both sets use identical labels ("Inpaint," "Stylization1," "Stylization2") to denote progression stages.
- **Transformation Process**:
- **Inpaint**: Base image with face obscured.
- **Stylization1**: Initial reconstruction of facial features.
- **Stylization2**: Final, highly detailed output.
- **Subject Differences**:
- Left set features a male subject with dark hair and a black jacket.
- Right set features a female subject with blonde hair and makeup.
- **Backgrounds**: Neutral for the male subject, soft pink gradient for the female subject.
---
### Notes
- No numerical data, legends, or axis markers are present.
- The image focuses on qualitative visual transformations rather than quantitative analysis.
- All textual information is in English.
</details>
Figure 10: Extension to face inpainting and animation.
Face Swapping. Fig. 6-left shows a comparative analysis between $\mathtt{FaceX}$ and the recent HifiFace [45] and E4S [24]. HifiFace adopts a target-oriented strategy, emphasizing fidelity to the target in terms of facial color and texture. On the contrary, the source-oriented E4S prioritizes adherence to the source characteristics. Our method strives to preserve the facial texture and certain skin color features from the source while maintaining harmony with the target environment. Since E4S employs a face enhancement model to improve image resolution, for fairness we apply the same model to both the HifiFace results and ours. Fig. 6-right shows controllable attribute swapping results. By applying masked fusion during the inference sampling process, diffusion-based methods facilitate the selective swapping of a portion of the facial area, enabling the seamless integration of the substituted region with its surroundings.
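Masked fusion during sampling can be sketched as a per-step blend of the model's denoised latent with a re-noised copy of the target. The sketch below is a generic RePaint-style illustration under a DDPM noising schedule ($\bar{\alpha}_t$), not the exact implementation; `masked_fusion` is a hypothetical helper.

```python
import numpy as np

def masked_fusion(x_t_edit, x0_target, mask, alpha_bar_t, rng):
    """One blending step of the sampling loop: keep the model's denoised
    latent inside the swap mask and a re-noised target latent outside it."""
    noise = rng.normal(size=x0_target.shape)
    x_t_target = (np.sqrt(alpha_bar_t) * x0_target
                  + np.sqrt(1.0 - alpha_bar_t) * noise)
    return mask * x_t_edit + (1.0 - mask) * x_t_target

# toy latents: the edit region (mask=1) keeps the edited value,
# the rest is replaced by the (re-noised) target
x_edit = np.full((2, 2), 5.0)
x0_tgt = np.full((2, 2), 2.0)
m = np.array([[1.0, 0.0], [1.0, 0.0]])
out = masked_fusion(x_edit, x0_tgt, m, alpha_bar_t=1.0,
                    rng=np.random.default_rng(0))
```

Repeating this blend at every denoising step keeps the unedited area pinned to the target while the swapped region is synthesized coherently against it.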
Quantitatively, $\mathtt{FaceX}$ exhibits competitive performance with SoTA methods in Tab. 2. E4S employs target face parsing masks to constrain the output image structure, ensuring strict alignment with the target. Consequently, it manifests a closer resemblance to the target in terms of both pose and expression. Our approach reduces structural constraints to enhance flexibility in motion control.
Head Swapping. As HeSer [39], the recent SoTA, is not open-source, we compare against crops from its paper in Fig. 7. Unlike the target-oriented HeSer, we prioritize source texture and skin color while harmonizing with the target. HeSer uses multiple images of the source face to extract identity and performs a two-stage process, first reenacting the source face and then conducting face swapping. In contrast, our one-shot, one-stage framework demonstrates comparable identity and motion consistency while achieving much higher image quality. Further, Fig. 8 evaluates $\mathtt{FaceX}$ on datasets with more complex environments beyond the VoxCeleb dataset used by HeSer, where lighting conditions are consistently dim. The results show that our $\mathtt{FaceX}$ accurately maintains skin color across various ethnicities and adapts to the target lighting conditions.
Progressive Editing across Diverse Facial Tasks. Fig. 9 illustrates the diverse facial editing capabilities of our generalist model, showcasing the progressive achievement of editing identity, motion, and semantic attributes. Note that the arrangement and order of facial features may be arbitrary. In contrast to previous methods limited by fixed structures, our approach supports flexible combination of different editing capabilities, enhancing the diversity of editing possibilities.
<details>
<summary>2401.00551v1/x11.png Details</summary>

### Visual Description
# Facial Transformation Comparison
## Image Description
The image is a horizontal grid of 11 labeled facial transformation results, comparing different methods. Each column represents a distinct method, with the first two columns labeled "Source" and "Target" as reference points. The remaining columns represent various transformation techniques applied to the source image to approximate the target.
### Labels and Categories
1. **Source**: Original facial image used as input.
2. **Target**: Desired facial transformation outcome.
3. **Finetune**: Method using fine-tuning for transformation.
4. **CLIP**: Method leveraging CLIP model for alignment.
5. **BLIP**: Method using BLIP model for transformation.
6. **DINOv2**: Method based on DINOv2 architecture.
7. **Farl**: Method utilizing Farl framework.
8. **MAE**: Method employing Masked Autoencoder (MAE).
9. **CLIP Multi-level**: Multi-level CLIP-based approach.
10. **CLIP-DINO Fuse**: Hybrid method combining CLIP and DINO.
11. **Fix Unet**: Method using Fix Unet architecture.
### Spatial Grounding
- All labels are positioned above their respective columns.
- No axis titles, legends, or numerical data are present.
- No heatmaps, charts, or diagrams are included.
### Observations
- The image focuses on qualitative visual comparisons rather than quantitative metrics.
- Each method's output is displayed as a side-by-side facial image.
- No additional textual annotations or data tables are visible.
## Conclusion
This image provides a visual comparison of facial transformation methods without embedded numerical data, charts, or diagrams. The labels and their spatial arrangement are the primary textual elements.
</details>
Figure 11: Left: Ablation of using different visual encoders. Right: Fixing U-net without FRC results in a failure to reconstruct texture.
<details>
<summary>2401.00551v1/x12.png Details</summary>

### Visual Description
# Technical Document Extraction: Face Swapping Analysis
## Image Structure
The image is a collage of **eight photographs** arranged in **two rows** and **four columns**. Each row pairs a reference image (source or target) with the results of different face-swapping configurations. The layout is as follows:
### Row Labels
1. **Top Row**: Labeled **"Source"** (leftmost image).
2. **Bottom Row**: Labeled **"Target"** (leftmost image).
### Column Labels (Techniques)
- **Column 1**: **"Full"** (full face-swapping technique).
- **Column 2**: **"w/o Region Assemble"** (face-swapping without region assembly).
- **Column 3**: **"w/o Representation Assemble"** (face-swapping without representation assembly).
## Spatial Grounding of Labels
- **Top Row**:
- **Source Image**: Positioned at **[x=0, y=0]** (top-left corner).
- **Column Labels**:
- "Full" at **[x=1, y=0]**.
- "w/o Region Assemble" at **[x=2, y=0]**.
- "w/o Representation Assemble" at **[x=3, y=0]**.
- **Bottom Row**:
- **Target Image**: Positioned at **[x=0, y=1]** (bottom-left corner).
- **Column Labels**:
- "Full" at **[x=1, y=1]**.
- "w/o Region Assemble" at **[x=2, y=1]**.
- "w/o Representation Assemble" at **[x=3, y=1]**.
## Image Content Description
### Top Row ("Source")
1. **Source Image**: A woman with **blonde hair**, wearing a **black top**, smiling. Background includes a **bar setting** with bottles.
2. **Full Technique**: The source face is fully swapped onto the target body. The result shows a **younger appearance** with **blonde hair** and a **black top**.
3. **w/o Region Assemble**: The face swap appears **less seamless**, with **blonde hair** but **inconsistent lighting** and **blurred edges**.
4. **w/o Representation Assemble**: The face retains **original features** (e.g., **red lipstick**) but is misaligned with the target body.
### Bottom Row ("Target")
1. **Target Image**: An older woman with **gray hair**, wearing a **red top**, smiling. Background shows a **landscape** with a body of water.
2. **Full Technique**: The target face is fully swapped onto the source body. The result shows **gray hair**, **red top**, and a **landscape background**.
3. **w/o Region Assemble**: The face swap is **partially misaligned**, with **gray hair** but **inconsistent lighting** and **blurred edges**.
4. **w/o Representation Assemble**: The face retains **original features** (e.g., **red lipstick**) but is misaligned with the source body.
## Observations
- **Full Technique**: Produces the most seamless face swaps, preserving hair color, clothing, and background context.
- **w/o Region Assemble**: Results in **blurred edges** and **lighting mismatches**, reducing realism.
- **w/o Representation Assemble**: Causes **feature misalignment** (e.g., lipstick color mismatch) and **background inconsistencies**.
## Notes
- No numerical data, charts, or legends are present. The image focuses on qualitative comparisons of face-swapping techniques.
- All text is in **English**. No other languages are detected.
</details>
Figure 12: Qualitative comparison of our model under different ablative configurations.
Inpainting and Animation. Benefiting from our fine-tuning strategy, freezing the U-net weights during training and loading community personalized model weights during testing enables stylization. Fig. 10 showcases animated stylizations with watercolor and oil-painting brushstrokes. Our method also demonstrates robust inpainting by retaining the SD prior: it generates reasonable facial inpainting results even when confronted with substantial facial voids.
4.2 Ablation Study
Choice of Visual Encoders. Since facial tasks may rely heavily on representations from pre-trained models, we ablate different visual encoders on face reenactment in Fig. 11, i.e., CLIP-based ViT [32, 7], DINOv2 [29], FARL [63], BLIP [21], and MAE [12]. We draw the following conclusions: 1) Finetuning the visual encoders converges significantly faster than fixing them; despite variations in convergence speed, the different ViT backbones ultimately yield closely aligned results. 2) Initializing with CLIP ViT weights gives the fastest convergence during finetuning and also the best results when the weights are fixed, which might be attributed to the alignment between CLIP's visual branch and SD's text branch. 3) With fixed weights, the performance hierarchy is CLIP > DINOv2 = BLIP > FARL > MAE. Neither fusing multi-stage features from CLIP ViT nor combining features from CLIP and DINOv2 yields superior results.
Task-specific Region Assembler. Since mask pooling discards structural information, the Task-specific Region Assembler supplies the model's direct structural guidance; removing it leaves the model without this guidance, and it tends to generate ambiguous outcomes, as demonstrated in Fig. 12 and Tab. 3.
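To make the structural information loss concrete, here is a small numpy sketch of mask pooling (our own illustrative code, not the paper's implementation): each semantic region is collapsed into a single averaged feature vector, so any spatial arrangement of features inside the region is discarded.

```python
import numpy as np

def mask_pool(features, seg_mask, num_regions):
    """Average-pool feature vectors within each semantic region.

    features : (C, H, W) feature map
    seg_mask : (H, W) integer map assigning each pixel a region id
    Returns a (num_regions, C) matrix with one pooled vector per region.
    The spatial layout inside each region is lost -- this is the
    structural information the Region Assembler must re-inject.
    """
    C = features.shape[0]
    flat = features.reshape(C, -1)   # (C, H*W)
    ids = seg_mask.reshape(-1)       # (H*W,)
    pooled = np.zeros((num_regions, C))
    for r in range(num_regions):
        sel = ids == r
        if sel.any():
            pooled[r] = flat[:, sel].mean(axis=1)
    return pooled

# Toy example: 2 channels, 2x2 map, top row is region 0, bottom row region 1.
feats = np.arange(8, dtype=float).reshape(2, 2, 2)
seg = np.array([[0, 0], [1, 1]])
print(mask_pool(feats, seg, 2))  # [[0.5, 4.5], [2.5, 6.5]]
```

Note that any permutation of pixels inside a region produces the identical pooled vector, which is exactly why structure must be supplied separately.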
Task-specific Representation Assembler. The Task-specific Region Assembler provides only structural guidance; the Task-specific Representation Assembler must supply the local appearance information. When this information is missing, the generated results exhibit color bias.
Facial Representation Controller. With the U-net frozen and FRC removed, finetuning the FORS module alone can capture coarse identity and motion, but generating detailed textures becomes difficult, as shown in Fig. 11 (right).
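The training configuration discussed throughout this section (frozen SD U-net, trainable FORS/FRC) follows a standard PyTorch pattern. The sketch below uses tiny stand-in modules; `unet`, `fors`, and `frc` are placeholders, not the actual architectures.

```python
import torch
from torch import nn

# Placeholder modules standing in for the SD denoiser and the added
# conditioning modules (FORS, FRC) -- not the real architectures.
unet = nn.Conv2d(4, 4, 3, padding=1)
fors = nn.Linear(512, 768)
frc = nn.Linear(768, 768)

# Freeze the diffusion backbone so it keeps its pretrained prior.
for p in unet.parameters():
    p.requires_grad_(False)

# Only the newly introduced parameters receive gradients.
trainable = [p for m in (fors, frc) for p in m.parameters()]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```

Because the optimizer only sees the adapter parameters, optimizer state and gradient computation for the backbone are avoided entirely, which is what makes convergence cheaper than training from scratch.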
4.3 Discussion on Efficiency
As a diffusion-based method, our approach holds no advantage in inference time over GAN-based methods such as TPSM, DAM, and HifiFace; however, it achieves notably higher image quality. Specifically, in contrast to the face swapping method E4S, which requires pre-alignment with a reenactment model, our method unifies everything within a single model. Likewise, the head swapping method HeSer necessitates fine-tuning on multiple images of the source identity, whereas we preserve identity in a one-shot manner. Among diffusion-based methods, FADM first obtains a coarse driving result from a previous reenactment model and then refines it with DDPM, whereas our method operates as a single unified model. Regarding training cost, we freeze the parameters of the SD U-net and fine-tune only the additionally introduced parameters, leading to faster convergence than FADM, which trains from scratch.
| Configurations | SSIM $\uparrow$ | PSNR $\uparrow$ | RMSE $\downarrow$ | FID $\downarrow$ |
| --- | --- | --- | --- | --- |
| w/o Region Assemble | 0.6580 | 14.79 | 3.32 | 45.31 |
| w/o Representation Assemble | 0.7520 | 18.24 | 1.78 | 29.27 |
| Our Full Model | 0.7960 | 19.15 | 1.31 | 27.95 |
Table 3: Quantitative comparison of our model under different ablative configurations. The reconstruction performance is measured.
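For reference, the PSNR and RMSE metrics in Tab. 3 can be computed as in the generic numpy sketch below; the paper does not specify the pixel value range or any scaling, so absolute numbers may differ from the table.

```python
import numpy as np

def rmse(a, b):
    """Root-mean-square error between two images; lower is better."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio in dB; higher is better."""
    mse = np.mean((a - b) ** 2)
    return float(10.0 * np.log10(max_val ** 2 / mse))

# Toy check: a constant offset of 0.5 on every pixel.
a = np.zeros((8, 8))
b = np.full((8, 8), 0.5)
print(rmse(a, b))  # 0.5
```

SSIM and FID require windowed statistics and a pretrained Inception network respectively, so in practice they come from libraries such as scikit-image and a standard FID implementation rather than a few lines of numpy.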
5 Conclusion and Future Works
In this paper, we propose a novel generalist $\mathtt{FaceX}$ to accomplish a variety of facial tasks by formulating a coherent facial representation for a wide range of facial editing tasks. Specifically, we design a novel FORD to easily manipulate various facial details, and a FORS to first assemble unified facial representations and then effectively steer the SD-aware generation process by the designed FRC. Extensive experiments on various facial tasks demonstrate the unification, efficiency, and effectiveness of the proposed method.
Limitations and Future Works. As this paper aims to design a general facial editing model, it may be suboptimal on some metrics for certain tasks. In the future, we will further explore more effective methods, including investigating the integration of large language models or large vocabulary size settings [42, 46] for task expansion.
Social Impacts. Generating synthetic faces increases the risk of image forgery abuse. Forgery detection models should be developed in parallel to mitigate this risk.
References
- Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Chen et al. [2020] Renwang Chen, Xuanhong Chen, Bingbing Ni, and Yanhao Ge. Simswap: An efficient framework for high fidelity face swapping. In Proceedings of the 28th ACM International Conference on Multimedia, pages 2003–2011, 2020.
- Deng et al. [2019a] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019a.
- Deng et al. [2019b] Yu Deng, Jiaolong Yang, Sicheng Xu, Dong Chen, Yunde Jia, and Xin Tong. Accurate 3d face reconstruction with weakly-supervised learning: From single image to image set. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 0–0, 2019b.
- Ding et al. [2023] Zheng Ding, Xuaner Zhang, Zhihao Xia, Lars Jebe, Zhuowen Tu, and Xiuming Zhang. Diffusionrig: Learning personalized priors for facial appearance editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12736–12746, 2023.
- Dosovitskiy et al. [2021a] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021a.
- Dosovitskiy et al. [2021b] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021b.
- Esser et al. [2021] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021.
- Feng et al. [2021] Yao Feng, Haiwen Feng, Michael J Black, and Timo Bolkart. Learning an animatable detailed 3d face model from in-the-wild images. ACM Transactions on Graphics (ToG), 40(4):1–13, 2021.
- Gal et al. [2022] Rinon Gal, Or Patashnik, Haggai Maron, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. Stylegan-nada: Clip-guided domain adaptation of image generators. ACM Transactions on Graphics (TOG), 41(4):1–13, 2022.
- Goel et al. [2023] Vidit Goel, Elia Peruzzo, Yifan Jiang, Dejia Xu, Nicu Sebe, Trevor Darrell, Zhangyang Wang, and Humphrey Shi. Pair-diffusion: Object-level image editing with structure-and-appearance paired diffusion models. arXiv preprint arXiv:2303.17546, 2023.
- He et al. [2022] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022.
- Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
- Huang et al. [2023] Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao, and Jingren Zhou. Composer: Creative and controllable image synthesis with composable conditions. arXiv preprint arXiv:2302.09778, 2023.
- Ioffe [2006] Sergey Ioffe. Probabilistic linear discriminant analysis. In Computer Vision–ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, May 7-13, 2006, Proceedings, Part IV 9, pages 531–542. Springer, 2006.
- Ju et al. [2023] Xuan Ju, Ailing Zeng, Chenchen Zhao, Jianan Wang, Lei Zhang, and Qiang Xu. Humansd: A native skeleton-guided diffusion model for human image generation. arXiv preprint arXiv:2304.04269, 2023.
- Karras et al. [2019] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019.
- Karras et al. [2020] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8110–8119, 2020.
- Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
- Li et al. [2021] Jia Li, Zhaoyang Li, Jie Cao, Xingguang Song, and Ran He. Faceinpainter: High fidelity face adaptation to heterogeneous domains. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5089–5098, 2021.
- Li et al. [2022] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
- Li et al. [2023] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22511–22521, 2023.
- Liu et al. [2021] Mingcong Liu, Qiang Li, Zekui Qin, Guoxin Zhang, Pengfei Wan, and Wen Zheng. Blendgan: Implicitly gan blending for arbitrary stylized face generation. Advances in Neural Information Processing Systems, 34:29710–29722, 2021.
- Liu et al. [2023] Zhian Liu, Maomao Li, Yong Zhang, Cairong Wang, Qi Zhang, Jue Wang, and Yongwei Nie. Fine-grained face swapping via regional gan inversion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8578–8587, 2023.
- Luo et al. [2022] Yuchen Luo, Junwei Zhu, Keke He, Wenqing Chu, Ying Tai, Chengjie Wang, and Junchi Yan. Styleface: Towards identity-disentangled face generation on megapixels. In European conference on computer vision, 2022.
- Mou et al. [2023] Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
- Nagrani et al. [2017] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. Voxceleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612, 2017.
- Nirkin et al. [2022] Yuval Nirkin, Yosi Keller, and Tal Hassner. Fsganv2: Improved subject agnostic face swapping and reenactment. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):560–575, 2022.
- Oquab et al. [2023] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
- Ouyang et al. [2022] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Prince et al. [2011] Simon Prince, Peng Li, Yun Fu, Umar Mohammed, and James Elder. Probabilistic models for inference about identity. IEEE transactions on pattern analysis and machine intelligence, 34(1):144–157, 2011.
- Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- Richardson et al. [2021] Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. Encoding in style: a stylegan encoder for image-to-image translation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2287–2296, 2021.
- Rombach et al. [2022a] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022a.
- Rombach et al. [2022b] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022b.
- Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
- Rossler et al. [2019] Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1–11, 2019.
- Shen et al. [2023] Shuai Shen, Wenliang Zhao, Zibin Meng, Wanhua Li, Zheng Zhu, Jie Zhou, and Jiwen Lu. Difftalk: Crafting diffusion models for generalized audio-driven portraits animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1982–1991, 2023.
- Shu et al. [2022] Changyong Shu, Hemao Wu, Hang Zhou, Jiaming Liu, Zhibin Hong, Changxing Ding, Junyu Han, Jingtuo Liu, Errui Ding, and Jingdong Wang. Few-shot head swapping in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10789–10798, 2022.
- Song et al. [2020] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
- Tao et al. [2022] Jiale Tao, Biao Wang, Borun Xu, Tiezheng Ge, Yuning Jiang, Wen Li, and Lixin Duan. Structure-aware motion transfer with deformable anchor model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3637–3646, 2022.
- Touvron et al. [2023] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Valevski et al. [2023] Dani Valevski, Danny Wasserman, Yossi Matias, and Yaniv Leviathan. Face0: Instantaneously conditioning a text-to-image model on a face. arXiv preprint arXiv:2306.06638, 2023.
- Wang et al. [2023] Qinghe Wang, Lijie Liu, Miao Hua, Pengfei Zhu, Wangmeng Zuo, Qinghua Hu, Huchuan Lu, and Bing Cao. Hs-diffusion: Semantic-mixing diffusion for head swapping. arXiv:2212.06458, 2023.
- Wang et al. [2021] Yuhan Wang, Xu Chen, Junwei Zhu, Wenqing Chu, Ying Tai, Chengjie Wang, Jilin Li, Yongjian Wu, Feiyue Huang, and Rongrong Ji. Hififace: 3d shape and semantic prior guided high fidelity face swapping. arXiv preprint arXiv:2106.09965, 2021.
- Wu et al. [2023] Jianzong Wu, Xiangtai Li, Shilin Xu, Haobo Yuan, Henghui Ding, Yibo Yang, Xia Li, Jiangning Zhang, Yunhai Tong, Xudong Jiang, Bernard Ghanem, and Dacheng Tao. Towards open vocabulary learning: A survey. arXiv preprint, 2023.
- Xu et al. [2022a] Chao Xu, Jiangning Zhang, Yue Han, Guanzhong Tian, Xianfang Zeng, Ying Tai, Yabiao Wang, Chengjie Wang, and Yong Liu. Designing one unified framework for high-fidelity face reenactment and swapping. In European Conference on Computer Vision, pages 54–71. Springer, 2022a.
- Xu et al. [2022b] Chao Xu, Jiangning Zhang, Miao Hua, Qian He, Zili Yi, and Yong Liu. Region-aware face swapping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7632–7641, 2022b.
- Xu et al. [2023] Chao Xu, Junwei Zhu, Jiangning Zhang, Yue Han, Wenqing Chu, Ying Tai, Chengjie Wang, Zhifeng Xie, and Yong Liu. High-fidelity generalized emotional talking face generation with multi-modal emotion space learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
- Xu et al. [2022c] Yangyang Xu, Bailin Deng, Junle Wang, Yanqing Jing, Jia Pan, and Shengfeng He. High-resolution face swapping via latent semantics disentanglement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7642–7651, 2022c.
- Yue and Loy [2022] Zongsheng Yue and Chen Change Loy. Difface: Blind face restoration with diffused error contraction. arXiv preprint arXiv:2212.06512, 2022.
- Zeng et al. [2023] Bohan Zeng, Xuhui Liu, Sicheng Gao, Boyu Liu, Hong Li, Jianzhuang Liu, and Baochang Zhang. Face animation with an attribute-guided diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 628–637, 2023.
- Zeng et al. [2020] Xianfang Zeng, Yusu Pan, Mengmeng Wang, Jiangning Zhang, and Yong Liu. Realistic face reenactment via self-supervised disentangling of identity and pose. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 12757–12764, 2020.
- Zhang et al. [2020a] Jiangning Zhang, Liang Liu, Zhucun Xue, and Yong Liu. Apb2face: Audio-guided face reenactment with auxiliary pose and blink signals. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4402–4406. IEEE, 2020a.
- Zhang et al. [2020b] Jiangning Zhang, Xianfang Zeng, Mengmeng Wang, Yusu Pan, Liang Liu, Yong Liu, Yu Ding, and Changjie Fan. Freenet: Multi-identity face reenactment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020b.
- Zhang et al. [2021a] Jiangning Zhang, Xianfang Zeng, Chao Xu, and Yong Liu. Real-time audio-guided multi-face reenactment. IEEE Signal Processing Letters, 29:1–5, 2021a.
- Zhang et al. [2023a] Jiangning Zhang, Xiangtai Li, Jian Li, Liang Liu, Zhucun Xue, Boshen Zhang, Zhengkai Jiang, Tianxin Huang, Yabiao Wang, and Chengjie Wang. Rethinking mobile block for efficient attention-based models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1389–1400, 2023a.
- Zhang et al. [2023b] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023b.
- Zhang et al. [2021b] Wendong Zhang, Junwei Zhu, Ying Tai, Yunbo Wang, Wenqing Chu, Bingbing Ni, Chengjie Wang, and Xiaokang Yang. Context-aware image inpainting with learned semantic priors. In International Joint Conference on Artificial Intelligence, 2021b.
- Zhao and Zhang [2022] Jian Zhao and Hui Zhang. Thin-plate spline motion model for image animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3657–3666, 2022.
- Zhao et al. [2023] Wenliang Zhao, Yongming Rao, Weikang Shi, Zuyan Liu, Jie Zhou, and Jiwen Lu. Diffswap: High-fidelity and controllable face swapping via 3d-aware masked diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8568–8577, 2023.
- Zheng et al. [2023] Guangcong Zheng, Xianpan Zhou, Xuewei Li, Zhongang Qi, Ying Shan, and Xi Li. Layoutdiffusion: Controllable diffusion model for layout-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22490–22499, 2023.
- Zheng et al. [2022] Yinglin Zheng, Hao Yang, Ting Zhang, Jianmin Bao, Dongdong Chen, Yangyu Huang, Lu Yuan, Dong Chen, Ming Zeng, and Fang Wen. General facial representation learning in a visual-linguistic manner. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18697–18709, 2022.
- Zhu et al. [2022a] Feida Zhu, Junwei Zhu, Wenqing Chu, Ying Tai, Zhifeng Xie, Xiaoming Huang, and Chengjie Wang. Hifihead: One-shot high fidelity neural head synthesis with 3d control. In International Joint Conference on Artificial Intelligence, 2022a.
- Zhu et al. [2022b] Hao Zhu, Wayne Wu, Wentao Zhu, Liming Jiang, Siwei Tang, Li Zhang, Ziwei Liu, and Chen Change Loy. Celebv-hq: A large-scale video facial attributes dataset. In European conference on computer vision, pages 650–667. Springer, 2022b.
- Zhu et al. [2020] Peihao Zhu, Rameen Abdal, Yipeng Qin, and Peter Wonka. Sean: Image synthesis with semantic region-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5104–5113, 2020.