2502.01341
# AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Document Understanding
Abstract
Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models hinges on having a good connector that maps visual features generated by a vision encoder to a shared embedding space with the LLM while preserving semantic similarity. Existing connectors, such as multilayer perceptrons (MLPs), lack inductive bias to constrain visual features within the linguistic structure of the LLM's embedding space, making them data-hungry and prone to cross-modal misalignment. In this work, we propose a novel vision-text alignment method, AlignVLM, that maps visual features to a weighted average of LLM text embeddings. Our approach leverages the linguistic priors encoded by the LLM to ensure that visual features are mapped to regions of the space that the LLM can effectively interpret. AlignVLM is particularly effective for document understanding tasks, where visual and textual modalities are highly correlated. Our extensive experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods, with larger gains on document understanding tasks and under low-resource setups. We provide further analysis demonstrating its efficiency and robustness to noise.
1 Introduction
Vision-Language Models (VLMs) have gained significant traction in recent years as a powerful framework for multimodal document understanding tasks that involve interpreting both the visual and textual contents of scanned documents (Kim et al., 2022; Lee et al., 2023; Liu et al., 2023a, 2024; Hu et al., 2024; Wang et al., 2023a; Rodriguez et al., 2024b). Such tasks are common in real-world commercial applications, including invoice parsing (Park et al., 2019), form reading (Jaume et al., 2019), and document question answering (Mathew et al., 2021b). VLM architectures typically consist of three components: (i) a vision encoder to process raw images, (ii) a Large Language Model (LLM) pre-trained on text, and (iii) a connector module that maps the visual features from the vision encoder into the LLM's semantic space.
A central challenge in this pipeline is to effectively map the continuous feature embeddings of the vision encoder into the latent space of the LLM while preserving the semantic properties of visual concepts. Existing approaches can be broadly categorized into deep fusion and shallow fusion methods. Deep fusion methods, such as NVLM (Dai et al., 2024), Flamingo (Alayrac et al., 2022), CogVLM (Wang et al., 2023b), and Llama 3.2-Vision (Grattafiori et al., 2024), integrate visual and textual features by introducing additional cross-attention and feed-forward layers at each layer of the LLM. While effective at enhancing cross-modal interaction, these methods substantially increase the parameter count of the VLM compared to the base LLM, resulting in high computational overhead and reduced efficiency.
<details>
<summary>x1.png Details</summary>

Radar chart comparing four connectors on the Llama-3.2-3B backbone: Perceiver Resampler (red), Ovis (blue), MLP (green), and Align (orange, ours). Axes correspond to the KLC, WTQ, TabFact, ChartQA, TextVQA, TableVQA, DocVQA, InfoVQA, and DeepForm benchmarks, with the legend at the bottom. The Align polygon encloses the other connectors on most axes.
</details>
Figure 1: Performance of Different VLM Connectors. The proposed Align connector outperforms other methods across benchmarks using the same training configuration. Radial distance is the proportion of the maximal score, truncated at $0.7$ (black dot).
In contrast, shallow fusion methods project visual features from the vision encoder into the LLM input embedding space using either multilayer perceptrons (MLPs) (Liu et al., 2023b, 2024), convolution mappings such as HoneyBee (Cha et al., 2024) and H-Reducer (Hu et al., 2024), or attention-based mechanisms such as the Perceiver Resampler (Li et al., 2023b; Laurençon et al., 2024; Alayrac et al., 2022). This approach is more parameter-efficient and computationally lighter than deep fusion methods. However, these connectors lack inductive bias to ensure that the projected features remain within the region spanned by the LLM's pretrained text embeddings. Consequently, the projected visual features may fall outside the distribution the LLM was trained on, leading to noisy or misaligned representations. Moreover, these mappings are typically learned from scratch, making them data-inefficient and less effective under low-resource conditions.
Recent methods like Ovis (Lu et al., 2024) attempt to alleviate these issues by introducing separate visual embeddings, indexed by the vision encoder outputs and combined to construct the visual inputs to the LLM. However, this approach significantly increases the parameter count due to the massive embedding matrix and requires extensive training to learn a new embedding space, without guaranteeing alignment with the LLM's input latent space.
To address these limitations, this paper introduces AlignVLM, a novel framework that sidesteps direct projection of visual features into the LLM embedding space. Instead, our proposed connector, Align, maps visual features into probability distributions over the LLM's existing pretrained vocabulary embeddings, which are then combined into a weighted representation of the text embeddings. By constraining each visual feature to be a convex combination of the LLM text embeddings, our approach leverages the linguistic priors already encoded in the LLM's text space. This ensures that the resulting visual features lie within the convex hull of the LLM's embedding space, reducing the risk of noisy or out-of-distribution inputs and improving alignment between modalities. The connector thus enables faster convergence and stronger performance, particularly in low-resource scenarios.
Our experimental results show that Align improves performance on various document understanding tasks, outperforming prior connector methods, with especially large gains in low-data regimes. We summarize our main contributions as follows:
- We propose a novel connector, Align, to bridge the representation gap between vision and text modalities.
- We introduce a family of Vision-Language Models, AlignVLM, that achieves state-of-the-art performance on multimodal document understanding tasks by leveraging Align.
- We conduct extensive experiments demonstrating the robustness and effectiveness of Align across different LLM sizes and training data setups.
We release our code and research artifacts at alignvlm.github.io.
2 Related Work
2.1 Vision-Language Models
Over the past few years, Vision-Language Models (VLMs) have achieved remarkable progress, largely due to advances in Large Language Models (LLMs). Initially demonstrating breakthroughs in text understanding and generation (Brown et al., 2020; Raffel et al., 2023; Achiam et al., 2023; Grattafiori et al., 2024; Qwen et al., 2025; Team, 2024), LLMs are now increasingly used to effectively interpret visual inputs (Liu et al., 2023b; Li et al., 2024; Wang et al., 2024; Chen et al., 2024b; Dai et al., 2024; Drouin et al., 2024; Rodriguez et al., 2022). This progress has enabled real-world applications across diverse domains, particularly in multimodal document understanding for tasks like form reading (Svetlichnaya, 2020), document question answering (Mathew et al., 2021b), and chart question answering (Masry et al., 2022). VLMs commonly adopt a three-component architecture: a pretrained vision encoder (Zhai et al., 2023; Radford et al., 2021), an LLM, and a connector module. A key challenge for VLMs is effectively aligning visual features with the LLM's semantic space to enable accurate and meaningful multimodal interpretation.
2.2 Vision-Language Alignment for Multimodal Models
Existing vision-language alignment approaches can be classified into deep fusion and shallow fusion. Deep fusion methods integrate visual and textual features by modifying the LLM's architecture, adding cross-attention and feed-forward layers. For example, Flamingo (Alayrac et al., 2022) employs the Perceiver Resampler, which uses fixed latent embeddings to attend to vision features and fuses them into the LLM via gated cross-attention layers. Similarly, NVLM (Dai et al., 2024) adopts gated cross-attention while replacing the Perceiver Resampler with a simpler MLP. CogVLM (Wang et al., 2023b) extends this approach by incorporating new feed-forward (FFN) and QKV layers for the vision modality within every layer of the LLM. While these methods improve cross-modal alignment, they significantly increase parameter counts and computational overhead, making them less efficient.
On the other hand, shallow fusion methods are more computationally efficient, mapping visual features into the LLM's embedding space without altering its architecture. These methods can be categorized into four main types: (1) MLP-based mapping, such as LLaVA (Liu et al., 2023b) and PaliGemma (Beyer et al., 2024), which use multilayer perceptrons (MLPs) to project visual features but often produce misaligned or noisy features due to a lack of constraints and inductive bias (Rodriguez et al., 2024b); (2) cross-attention mechanisms, such as the Q-Former in BLIP-2 (Li et al., 2023b), which uses a fixed set of latent embeddings to cross-attend to visual features but may still produce noisy or OOD visual features; (3) convolution-based mechanisms, such as HoneyBee (Cha et al., 2024) and H-Reducer (Hu et al., 2024), which leverage convolutional or ResNet (He et al., 2015) layers to preserve spatial locality while reducing dimensionality; and (4) visual embeddings, such as those introduced by Ovis (Lu et al., 2024), which use embeddings indexed by the vision encoder's outputs to produce the visual inputs. While this regularizes feature mapping, it adds substantial parameter overhead and creates a new vision embedding space, risking misalignment with the LLM's text embedding space. Encoder-free VLMs, such as Fuyu-8B (https://www.adept.ai/blog/fuyu-8b) and EVE (Diao et al., 2024), eliminate dedicated vision encoders but show degraded performance (Beyer et al., 2024).
In contrast, AlignVLM maps visual features from the vision encoder into probability distributions over the LLM's text embeddings, using them to compute a convex combination. By leveraging the linguistic priors encoded in the LLM's vocabulary, AlignVLM ensures that visual features remain within the convex hull of the text embeddings. This design mitigates noisy or out-of-distribution projections and achieves stronger multimodal alignment, particularly in tasks that require joint representation of both modalities, such as multimodal document understanding, and in low-resource settings.
<details>
<summary>x2.png Details</summary>

Architecture diagram. An example document image (an infographic with statistics about Americans) and the question "What percentage of Americans are online?" are shown on the left. The image is processed by the vision encoder (blue), whose outputs pass through the Align module (dashed orange box: linear layer, layer norm, LM head, layer norm, softmax). The resulting probabilities are combined with the LLM embedding matrix (pink) via a weighted average sum to form the vision inputs, which are concatenated with the selected text embeddings and fed to the LLM, producing the response "90%".
</details>
Figure 2: AlignVLM Model Architecture. The vision encoder extracts image features, which are processed to produce probabilities over the LLM embeddings. A weighted average combines these probabilities with embeddings to generate vision input vectors. Text inputs are tokenized, and the corresponding embeddings are selected from the embedding matrix, which is then used as input to the LLM. We display the vision layers in blue, and the text layers in purple.
3 Methodology
3.1 Model Architecture
The overall model architecture, shown in Figure 2, consists of three main components:
(1) Vision Encoder.
To handle high-resolution images of different aspect ratios, we divide each input image into multiple tiles according to one of the predefined aspect ratios (e.g., $1{:}1,\,1{:}2,\,...,\,9{:}1$) chosen via a coverage ratio (Lu et al., 2024; Chen et al., 2024a). Due to limited computational resources, we set the maximum number of tiles to 9. Each tile is further partitioned into $14\times 14$ patches, projected into vectors, and processed by a SigLIP-400M vision encoder (Zhai et al., 2023) to extract contextual visual features.
Each tile $t\in\{1,\cdots,T\}$ is divided into $N_{t}$ patches
$$
\mathbf{P}_{t}=\{\mathbf{p}_{t,1},\cdots,\mathbf{p}_{t,N_{t}}\},
$$
where $\mathbf{p}_{t,i}$ is the $i$ -th patch of tile $t$ . The vision encoder maps these patches to a set of visual feature vectors
$$
\mathbf{F}_{t}=\mathrm{VisionEncoder}(\mathbf{P}_{t}),\quad\mathbf{F}_{t}=\{\mathbf{f}_{t,1},\cdots,\mathbf{f}_{t,N_{t}}\},\quad\mathbf{f}_{t,i}\in\mathbb{R}^{d}.
$$
Finally, we concatenate the feature sets across all tiles into a single output
$$
\mathbf{F}=\mathrm{concat}\Bigl(\mathbf{F}_{1},\mathbf{F}_{2},\cdots,\mathbf{F}_{T}\Bigr).
$$
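To make the tiling pipeline concrete, the following is a minimal sketch, assuming a fixed square tile grid and a generic `vision_encoder` callable; the aspect-ratio and coverage-ratio selection logic described above is simplified away.

```python
import torch

def encode_tiled_image(image: torch.Tensor, vision_encoder, grid=(3, 3), max_tiles=9):
    """Split an image into tiles, encode each tile, and concatenate the features.

    `image` is (C, H, W); `vision_encoder` maps a (1, C, h, w) tile to (1, N_t, d)
    patch features. The fixed `grid` here stands in for the aspect-ratio selection
    described in the text.
    """
    rows, cols = grid
    assert rows * cols <= max_tiles, "at most 9 tiles are used"
    _, H, W = image.shape
    tile_h, tile_w = H // rows, W // cols
    tile_feats = []
    for r in range(rows):
        for c in range(cols):
            tile = image[:, r * tile_h:(r + 1) * tile_h, c * tile_w:(c + 1) * tile_w]
            tile_feats.append(vision_encoder(tile.unsqueeze(0)))  # F_t
    return torch.cat(tile_feats, dim=1)  # F = concat(F_1, ..., F_T)
```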
(2) Align Module.
This module aligns the visual features with the LLM. A linear layer $\mathbf{W}_{1}\in\mathbb{R}^{D\times d}$ first projects the visual features $\mathbf{F}\in\mathbb{R}^{T\cdot N_{t}\times d}$ to the LLM's token embedding space: one $\mathbb{R}^{D}$ vector per token. A second linear layer $\mathbf{W}_{2}\in\mathbb{R}^{V\times D}$ (initialized from the LLM's language-model head), followed by a softmax, produces a probability distribution $\mathbf{P}_{\text{vocab}}$ over the LLM's vocabulary ($V$ tokens)
$$
\mathbf{P}_{\text{vocab}}=\operatorname{softmax}(\operatorname{LayerNorm}(\mathbf{W}_{2}\operatorname{LayerNorm}(\mathbf{W}_{1}\mathbf{F}))) \tag{1}
$$
We then use the LLM text embeddings $\mathbf{E}_{\text{text}}\in\mathbb{R}^{V\times D}$ to compute a weighted sum
$$
\mathbf{F}_{\text{align}}^{\prime}=\mathbf{P}_{\text{vocab}}^{\top}\mathbf{E}_{\text{text}}. \tag{2}
$$
Finally, we concatenate $\mathbf{F}_{\text{align}}^{\prime}$ with the tokenized text embeddings to form the LLM input
$$
\mathbf{H}_{\text{input}}=\mathrm{concat}\bigl(\mathbf{F}_{\text{align}}^{\prime},\mathbf{E}_{\text{text}}(\mathbf{x})\bigr),
$$
where $\mathbf{E}_{\text{text}}(\mathbf{x})$ is obtained by tokenizing the input text $\mathbf{x}=(x_{1},\cdots,x_{M})$ and selecting the corresponding embeddings from $\mathbf{E}_{\text{text}}$ such that
$$
\mathbf{E}_{\text{text}}(\mathbf{x})=\bigl[\mathbf{E}_{\text{text}}(x_{1}),\cdots,\mathbf{E}_{\text{text}}(x_{M})\bigr]. \tag{3}
$$
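For illustration, the data flow of Eqs. (1)-(2) can be expressed as a minimal PyTorch sketch; the class and argument names (e.g., passing in the LM-head weight and the embedding matrix explicitly) are illustrative assumptions rather than our released implementation.

```python
import torch
import torch.nn as nn

class AlignConnector(nn.Module):
    """Sketch of the Align module: project visual features to the LLM width,
    turn them into a distribution over the vocabulary, and take a convex
    combination of the LLM's text embeddings."""

    def __init__(self, vision_dim: int, llm_dim: int, lm_head_weight: torch.Tensor):
        super().__init__()
        vocab_size = lm_head_weight.shape[0]                  # V
        self.w1 = nn.Linear(vision_dim, llm_dim)              # W1: d -> D
        self.norm1 = nn.LayerNorm(llm_dim)
        self.w2 = nn.Linear(llm_dim, vocab_size, bias=False)  # W2: D -> V
        self.w2.weight.data.copy_(lm_head_weight)             # init from the LM head
        self.norm2 = nn.LayerNorm(vocab_size)

    def forward(self, vision_feats: torch.Tensor, text_embeddings: torch.Tensor):
        # vision_feats: (num_visual_tokens, d); text_embeddings: E_text, (V, D)
        logits = self.norm2(self.w2(self.norm1(self.w1(vision_feats))))
        p_vocab = torch.softmax(logits, dim=-1)                # Eq. (1)
        return p_vocab @ text_embeddings                       # Eq. (2): convex combination
```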
(3) Large Language Model.
We feed the concatenated vision and text vectors, $\mathbf{H}_{\text{input}}$ , into the LLM, which then generates output text auto-regressively. To demonstrate the effectiveness of our alignment technique, we experiment with the Llama 3.1 model family (Grattafiori et al., 2024). These models offer state-of-the-art performance and permissive licenses, making them suitable for commercial applications. In particular, we utilize Llama 3.2-1B, Llama 3.2-3B, and Llama 3.1-8B.
3.2 Motivation and relation with existing methods
By construction, each $\mathbb{R}^{D}$ representation in $\mathbf{F}_{\text{align}}^{\prime}$ is constrained to the convex hull of the points $\mathbf{E}_{\text{text}}$, thus concentrating the visual features in the part of latent space that the LLM can effectively interpret. Moreover, we argue that our initialization of $\mathbf{W}_{2}$ to the language model head is an inductive bias toward recycling some of the semantics of these text tokens into visual tokens. This contrasts with past methods that have been proposed to adapt the vision encoder outputs $\mathbf{F}\in\mathbb{R}^{T\cdot N_{t}\times d}$ to an $\mathbf{F}^{\prime}\in\mathbb{R}^{T\cdot N_{t}\times D}$ to be fed to the LLM. Here, we consider two examples in more detail, highlighting these contrasts.
(1) MLP Connector Liu et al. (2023b) applies a linear projection with parameters $\mathbf{W}_{\text{MLP}}\in\mathbb{R}^{D\times d}$ and $\mathbf{b}_{\text{MLP}}\in\mathbb{R}^{D}$, followed by an activation function $\sigma$ (e.g., ReLU)
$$
\mathbf{F}_{\text{MLP}}^{\prime}=\sigma(\mathbf{W}_{\text{MLP}}\mathbf{F}+\mathbf{b}_{\text{MLP}}).
$$
These parameters are all learned from scratch, without any bias aligning them to text embeddings.
(2) Visual Embedding Table Lu et al. (2024) introduces an entirely new set of visual embeddings $\mathbf{E}_{\text{VET}}\in\mathbb{R}^{K\times D}$ which, together with the weights $\mathbf{W}_{\text{VET}}\in\mathbb{R}^{K\times d}$, specifies
$$
\mathbf{F}_{\text{VET}}^{\prime}=\operatorname{softmax}(\mathbf{W}_{\text{VET}}\mathbf{F})^{\top}\mathbf{E}_{\text{VET}}.
$$
When $D<d$ , our $\mathbf{W}_{2}\mathbf{W}_{1}$ amounts to a low-rank version of $\mathbf{W}_{\text{VET}}$ . There is thus much more to learn to obtain $\mathbf{F}_{\text{VET}}^{\prime}$ , and there is again no explicit pressure to align it with the text embeddings.
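For contrast, minimal sketches of these two baselines are given below; the dimensions and the table size $K$ are placeholders, and neither sketch ties its output to the LLM's text embeddings.

```python
import torch
import torch.nn as nn

class MLPConnector(nn.Module):
    """Direct projection (Liu et al., 2023b): all parameters learned from scratch."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)    # W_MLP, b_MLP

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.proj(vision_feats))    # F'_MLP = sigma(W F + b)

class VisualEmbeddingTable(nn.Module):
    """Ovis-style connector (Lu et al., 2024): a new K x D embedding table E_VET."""
    def __init__(self, vision_dim: int, llm_dim: int, table_size: int = 65536):
        super().__init__()
        self.w_vet = nn.Linear(vision_dim, table_size, bias=False)           # W_VET
        self.e_vet = nn.Parameter(torch.randn(table_size, llm_dim) * 0.02)   # E_VET

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        probs = torch.softmax(self.w_vet(vision_feats), dim=-1)
        return probs @ self.e_vet                                             # F'_VET
```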
3.3 Training Datasets & Stages
We train our model in three stages:
Stage 1.
This stage focuses on training the Align Module to map visual features to the LLM's text embeddings effectively. We use the CC-12M dataset Changpinyo et al. (2021), a large-scale web dataset commonly used for VLM pretraining Liu et al. (2023b), which contains 12M image-text pairs. However, due to broken or unavailable links, we retrieved 8.1M pairs. This dataset facilitates the alignment of visual features with the text embedding space of the LLM. During this stage, we train the full model, as this approach improves performance and stabilizes the Align Module training.
Stage 2.
The goal is to enhance the model's document understanding capabilities, such as OCR, document structure comprehension, in-depth reasoning, and instruction-following. We leverage the BigDocs-7.5M dataset Rodriguez et al. (2024a), a curated collection of license-permissive datasets for multimodal document understanding. This dataset aligns with the Accountability, Responsibility, and Transparency (ART) principles Bommasani et al. (2023); Vogus and Llansó (2021), ensuring compliance for commercial applications. As in Stage 1, we train the full model during this stage.
Stage 3.
To enhance the model's instruction-tuning capabilities, particularly for downstream tasks like question answering, we further train it on the DocDownstream Rodriguez et al. (2024a); Hu et al. (2024) instruction tuning dataset. In this stage, the vision encoder is frozen, focusing training exclusively on the LLM and Align module.
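A sketch of this trainability schedule is shown below; the module names (`vision_encoder`, `align`, `llm`) are assumptions about how the model might be organized.

```python
def set_trainable_modules(model, stage: int) -> None:
    """Stages 1-2 train the full model; Stage 3 freezes the vision encoder and
    updates only the LLM and the Align module."""
    for p in model.parameters():
        p.requires_grad = True
    if stage == 3:
        for p in model.vision_encoder.parameters():
            p.requires_grad = False
```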
4 Experimental Setup
Table 1: Main Results on General Document Benchmarks. We compare AlignVLM (ours) with state-of-the-art (SOTA) open and closed-source instructed models, and with base models that we trained using the process described in Section 3.3. AlignVLM models outperform all Base VLM models trained in the same data regime. Our models also perform competitively across document benchmarks, even compared with SOTA models whose data regimes are more targeted and optimized. Color coding for comparison: closed-source models, open-source models below 7B parameters, open-source models between 7-12B parameters.
| Model | DocVQA VAL | InfoVQA VAL | DeepForm TEST | KLC TEST | WTQ TEST | TabFact TEST | ChartQA TEST | TextVQA VAL | TableVQA TEST | Avg. Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Closed-Source VLMs | | | | | | | | | | |
| (Opaque Training Data) | | | | | | | | | | |
| Claude-3.5 Sonnet | 88.48 | 59.05 | 31.41 | 24.82 | 47.13 | 53.48 | 51.84 | 71.42 | 81.27 | 56.54 |
| GeminiPro-1.5 | 91.23 | 73.94 | 32.16 | 24.07 | 50.29 | 71.22 | 34.68 | 68.16 | 80.43 | 58.46 |
| GPT-4o 20240806 | 92.80 | 66.37 | 38.39 | 29.92 | 46.63 | 81.10 | 85.70 | 70.46 | 72.87 | 64.91 |
| Open-Source Instruct VLMs | | | | | | | | | | |
| (Semi-Opaque Training Data) | | | | | | | | | | |
| Janus- 1.3B (Wu et al., 2024a) | 30.15 | 17.09 | 0.62 | 15.06 | 9.30 | 51.34 | 57.20 | 51.97 | 18.67 | 27.93 |
| Qwen2-VL- 2B (Wang et al., 2024) | 89.16 | 64.11 | 32.38 | 25.18 | 38.20 | 57.21 | 73.40 | 79.90 | 43.07 | 55.84 |
| Qwen2.5-VL- 3B (Wang et al., 2024) | 93.00 | 75.83 | 32.84 | 24.82 | 53.46 | 71.16 | 83.91 | 79.29 | 71.66 | 65.10 |
| InternVL-2.5- 2B (Chen et al., 2024b) | 87.70 | 61.85 | 13.14 | 16.58 | 36.33 | 57.26 | 74.96 | 76.85 | 42.20 | 51.87 |
| InternVL-3- 2B (Zhu et al., 2025) | 87.33 | 66.99 | 37.90 | 29.79 | 39.44 | 59.91 | 75.32 | 78.69 | 43.46 | 57.64 |
| DeepSeek-VL2-Tiny- 3.4B (Wu et al., 2024b) | 88.57 | 63.88 | 25.11 | 19.04 | 35.07 | 52.15 | 80.92 | 80.48 | 56.30 | 55.72 |
| Phi3.5-Vision- 4B (Abdin et al., 2024) | 86.00 | 56.20 | 10.47 | 7.49 | 17.18 | 30.43 | 82.16 | 73.12 | 70.70 | 48.19 |
| Qwen2-VL- 7B (Wang et al., 2024) | 93.83 | 76.12 | 34.55 | 23.37 | 52.52 | 74.68 | 83.16 | 84.48 | 53.97 | 64.08 |
| Qwen2.5-VL- 7B (Bai et al., 2025) | 94.88 | 82.49 | 42.21 | 24.26 | 61.96 | 78.56 | 86.00 | 85.35 | 76.10 | 70.20 |
| LLaVA-NeXT- 7B (Xu et al., 2024) | 63.51 | 30.90 | 1.30 | 5.35 | 20.06 | 52.83 | 52.12 | 65.10 | 32.87 | 36.00 |
| DocOwl1.5- 8B (Hu et al., 2024) | 80.73 | 49.94 | 68.84 | 37.99 | 38.87 | 79.67 | 68.56 | 68.91 | 52.60 | 60.68 |
| InternVL-2.5- 8B (Chen et al., 2024b) | 91.98 | 75.36 | 34.55 | 22.31 | 50.33 | 74.75 | 82.84 | 79.00 | 52.10 | 62.58 |
| InternVL-3- 8B (Zhu et al., 2025) | 91.99 | 73.90 | 51.24 | 36.41 | 53.60 | 72.27 | 85.60 | 82.41 | 53.26 | 66.74 |
| Fuyu- 8B (Bavishi et al., 2023) | 48.97 | 23.09 | 4.78 | 6.63 | 14.55 | 47.91 | 44.36 | 46.02 | 15.49 | 22.97 |
| Ovis-1.6-Gemma2- 9B (Lu et al., 2024) | 88.84 | 73.97 | 45.16 | 23.91 | 50.72 | 76.66 | 81.40 | 77.73 | 48.33 | 62.96 |
| Llama3.2- 11B (Grattafiori et al., 2024) | 82.71 | 36.62 | 1.78 | 3.47 | 23.03 | 58.33 | 23.80 | 54.28 | 22.40 | 34.04 |
| Pixtral- 12B (Agrawal et al., 2024) | 87.67 | 49.45 | 27.37 | 24.07 | 45.18 | 73.53 | 71.80 | 76.09 | 67.13 | 58.03 |
| Document Understanding Instructed Models | | | | | | | | | | |
| (Instruction Tuned on BigDocs-7.5M + DocDownStream (Rodriguez et al., 2024a; Hu et al., 2024)) | | | | | | | | | | |
| Qwen2-VL- 2B (base+) (Wang et al., 2024) | 57.23 | 31.88 | 49.31 | 34.39 | 31.61 | 64.75 | 68.60 | 61.01 | 47.53 | 49.59 |
| AlignVLM -Llama-3.2- 1B (ours) | 72.42 | 38.16 | 60.47 | 33.71 | 28.66 | 71.31 | 65.44 | 48.81 | 50.29 | 52.14 |
| AlignVLM -Llama-3.2- 3B (ours) | 79.63 | 44.53 | 63.49 | 35.25 | 38.59 | 78.51 | 71.88 | 57.38 | 60.10 | 58.81 |
| DocOwl1.5- 8B (base+) (Hu et al., 2024) | 78.70 | 47.62 | 64.39 | 36.93 | 35.69 | 72.65 | 65.80 | 67.30 | 49.03 | 57.56 |
| Llama3.2- 11B (base+) (Grattafiori et al., 2024) | 78.99 | 44.27 | 67.05 | 37.22 | 40.18 | 78.04 | 71.40 | 68.46 | 56.73 | 60.26 |
| AlignVLM -Llama-3.1- 8B (ours) | 81.18 | 53.75 | 63.25 | 35.50 | 45.31 | 83.04 | 75.00 | 64.60 | 64.33 | 62.88 |
Setup.
We conduct all experiments using 8 nodes of H100 GPUs, totaling 64 GPUs. For model training, we leverage the MS-Swift framework (Zhao et al., 2024) for its flexibility. Additionally, we utilize the DeepSpeed framework (Aminabadi et al., 2022), specifically the ZeRO-3 configuration, to enable efficient parallel training across multiple nodes. Detailed hyperparameters are outlined in Appendix A.1.
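A minimal DeepSpeed configuration along these lines might look as follows; the batch sizes and precision settings are placeholders, not the values from Appendix A.1.

```python
# Hypothetical ZeRO-3 configuration dictionary (values are illustrative).
deepspeed_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # partition optimizer states, gradients, and parameters across GPUs
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}
```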
Baselines.
Our work focuses on architectural innovations, so we ensure that all baselines are trained on the same datasets. To enable fair comparisons, we evaluate our models against a set of Base VLMs fine-tuned on the same instruction-tuning tasks (Stages 2 and 3) as our models, using the BigDocs-7.5M and BigDocs-DocDownstream datasets. This approach ensures consistent training data, avoiding biases introduced by the Instruct versions of VLMs, which are often trained on undisclosed instruction-tuning datasets. Due to the scarcity of recently released publicly available Base VLMs, we primarily compare our model against the following Base VLMs of varying sizes: Qwen2-VL-2B (Wang et al., 2024), DocOwl1.5-8B (Hu et al., 2024), and Llama 3.2-11B (Grattafiori et al., 2024).
For additional context, we also include results from the Instruct versions of recent VLMs of different sizes: Phi3.5-Vision-4B (Abdin et al., 2024), Qwen2-VL-2B and 7B (Wang et al., 2024), Qwen2.5-VL-7B (Qwen et al., 2025), LLaVA-NeXT-7B (Liu et al., 2024), InternVL2.5-2B and 8B (Chen et al., 2024b), InternVL3-2B and 8B (Zhu et al., 2025), Janus-1.3B (Wu et al., 2024a), DeepSeek-VL2-Tiny (Wu et al., 2024b), Ovis1.6-Gemma-9B (Lu et al., 2024), Llama3.2-11B (Grattafiori et al., 2024), DocOwl1.5-8B (Hu et al., 2024), and Pixtral-12B (Agrawal et al., 2024).
Evaluation Benchmarks.
We evaluate our models on a diverse range of document understanding benchmarks that assess the model's capabilities in OCR, chart reasoning, table processing, and form comprehension. In particular, we employ the VLMEvalKit (Duan et al., 2024) framework and report the results on the following popular benchmarks: DocVQA (Mathew et al., 2021b), InfoVQA (Mathew et al., 2021a), DeepForm (Svetlichnaya, 2020), KLC (Stanisławek et al., 2021), WTQ (Pasupat and Liang, 2015), TabFact (Chen et al., 2020), ChartQA (Masry et al., 2022), TextVQA (Singh et al., 2019), and TableVQA (Kim et al., 2024).
5 Results
Table 2: Impact of Connector Designs on VLM Performance: We present the results of experiments evaluating different connector designs for conditioning LLMs on visual features. Our proposed Align connector is compared against a basic Multi-Layer Perceptron (MLP), the Perceiver Resampler, and Ovis. The results demonstrate that Align consistently outperforms these alternatives across all benchmarks.
| Model | DocVQA VAL | InfoVQA VAL | DeepForm TEST | KLC TEST | WTQ TEST | TabFact TEST | ChartQA TEST | TextVQA VAL | TableVQA TEST | Avg. Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.2-3B- MLP | 71.46 | 37.56 | 62.07 | 33.36 | 28.94 | 73.22 | 66.48 | 53.56 | 50.96 | 53.06 |
| Llama-3.2-3B- Perceiver R. | 69.08 | 34.13 | 57.08 | 31.75 | 27.95 | 71.93 | 65.16 | 51.33 | 47.76 | 50.68 |
| Llama-3.2-3B- Ovis | 74.68 | 42.11 | 58.02 | 33.50 | 33.13 | 76.67 | 67.92 | 52.60 | 53.93 | 54.72 |
| Llama-3.2-3B- Align (ours) | 79.63 | 44.53 | 63.49 | 35.25 | 38.59 | 78.51 | 71.88 | 57.38 | 60.10 | 58.81 |
5.1 Main Results
Table 1 presents the performance of AlignVLM compared to state-of-the-art (SOTA) open- and closed-source instructed models, as well as baseline Base VLMs fine-tuned in the same instruction-tuning setup. The results demonstrate that AlignVLM consistently outperforms all Base VLMs within the same size category and achieves competitive performance against SOTA Instruct VLMs despite being trained on a more limited data regime. Below, we provide a detailed analysis.
AlignVLM vs. Base VLMs.
Our AlignVLM models, based on Llama 3.2-1B and Llama 3.2-3B, significantly outperform the corresponding Base VLM, Qwen2-VL-2B, by up to 9.22%. Notably, AlignVLM -Llama-3.2-3B surpasses DocOwl1.5-8B, which has 4B more parameters, demonstrating the effectiveness of Align in enhancing multimodal capabilities compared to traditional shallow fusion methods (e.g., MLPs). Furthermore, our 8B model achieves a 2.62% improvement over Llama3.2-11B despite sharing the same Base LLM, Llama3.1-8B. Since all models in this comparison were trained on the same instruction-tuning setup, this experiment provides a controlled evaluation, isolating the impact of architectural differences rather than dataset biases. Consequently, these results suggest that AlignVLM outperforms VLMs with shallow fusion techniques and surpasses parameter-heavy deep fusion VLMs, such as Llama3.2-11B, while maintaining a more efficient architecture.
AlignVLM vs. Instruct VLMs.
Even though open-source Instruct models are trained on significantly larger, often undisclosed instruction-tuning datasets, AlignVLM achieves competitive performance. For example, AlignVLM -Llama-3.2-3B (58.81%) outperforms other strong instruction-tuned VLMs in its size class, such as Qwen2-VL-2B and InternVL-3-2B, by considerable margins (2.97% and 1.17%, respectively). While it falls slightly behind Qwen2.5-VL-3B, a direct comparison is not entirely fair, as the latter was trained on a proprietary instruction-tuning dataset.
Additionally, our 8B model outperforms significantly larger models such as Llama 3.2-11B and Pixtral-12B by substantial margins. It also surpasses InternVL-2.5-8B and performs competitively with Qwen2.5-VL-7B, though a direct comparison may not be entirely fair since Qwen2.5-VL-7B was trained on an undisclosed instruction-tuning dataset. Finally, AlignVLM also exhibits comparable performance to closed-source models like GeminiPro-1.5 and GPT-4o.
Overall, these results validate the effectiveness of Align and establish AlignVLM as a state-of-the-art model for multimodal document understanding.
5.2 Impact of Connector Designs on VLM Performance
5.2.1 High-Resource Training Regime
To assess the effectiveness of our Align module, we compare it against three different and widely used shallow fusion VLM connectors: MLP, Perceiver Resampler, and Ovis. These experiments were conducted under precisely identical training conditions (datasets, hyperparameters, training stages) as outlined in Appendix A.1, ensuring a fair and rigorous comparison. The results in Table 2 show that Align consistently outperforms all alternatives, demonstrating its superiority in aligning visual and textual modalities for multimodal document understanding. MLP and Perceiver Resampler achieve the lowest performance, 53.06% and 50.68%, respectively, due to their direct feature projection, which lacks an explicit mechanism to align visual features with the LLM's text space, leading to misalignment. Ovis introduces a separate visual embedding table, but this additional complexity does not significantly improve alignment, yielding only 54.72% accuracy. In contrast, Align ensures that visual features remain within the convex hull of the LLM's text latent space, leveraging the linguistic priors of the LLM to enhance alignment and mitigate noisy embeddings. This design leads to the highest performance (58.81%), establishing Align as the most effective connector for integrating vision and language in multimodal document understanding. We provide some example outputs of the Llama-3.2-3B models with different connector designs in Appendix A.4. Furthermore, we include an analysis of the runtime efficiency and memory usage of different connectors in Appendix A.2.
5.2.2 Low-Resource Training Regime
The previous section focused on large-scale training setups involving millions of data samples (BigDocs-7.5M), which require significant compute resources and limit the number of baselines that we were able to compare against. Here, we examine whether Align remains effective in a low-resource setting.
We conduct additional experiments using SigLIP-400M as the vision encoder and Llama-3.2-3B as the language model, fine-tuned on the LLaVA-NeXT dataset Liu et al. (2024), which contains 779K samples. We follow the official LLaVA-NeXT configuration for both training stages. (i) Pretraining: the model is trained on the LLaVA-558K image-caption dataset Liu et al. (2024), freezing both the LLM and vision encoder while fine-tuning the connector (learning rate = 1e-3, batch size = 32, 1 epoch on 8 × H100 GPUs). To handle high-resolution document images, we adopt the "anyres_max_9" strategy with grid weaving from 1×1 to 6×6, supporting resolutions up to 2304×2304 with 729 tokens per grid; (ii) Instruction tuning: the model is further fine-tuned on the LLaVA-NeXT-779K instruction dataset with learning rates of 1e-5 for the LLM and connector, 2e-6 for the vision encoder, batch size = 8, for 1 epoch.
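For reference, the two-stage recipe above can be summarized in a small configuration dictionary; the field names are illustrative, and the values are taken from the description above.

```python
low_resource_recipe = {
    "pretraining": {                    # LLaVA-558K image-caption data
        "trainable": ["connector"],     # LLM and vision encoder frozen
        "learning_rate": 1e-3,
        "batch_size": 32,
        "epochs": 1,
    },
    "instruction_tuning": {             # LLaVA-NeXT-779K data
        "trainable": ["llm", "connector", "vision_encoder"],
        "learning_rate": {"llm": 1e-5, "connector": 1e-5, "vision_encoder": 2e-6},
        "batch_size": 8,
        "epochs": 1,
    },
}
```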
This lightweight setup allows direct comparison across more connector architectures, including MLP Liu et al. (2023a), Perceiver Resampler, Ovis Lu et al. (2024), H-Reducer (1×4) Hu et al. (2024), and HoneyBee (C-Abstractor) Cha et al. (2024), all trained under identical conditions for fairness. Since the LLaVA-NeXT dataset is general-purpose and not exclusively document-focused like BigDocs-7.5M (Rodriguez et al., 2024a), it allows us to evaluate whether the Align connector generalizes beyond document understanding to broader visual reasoning. Accordingly, we assess all models on a comprehensive suite of benchmarks spanning both document understanding and general vision-language tasks. The document understanding benchmarks include DocVQA Mathew et al. (2021b), InfoVQA Mathew et al. (2021a), ChartQA Masry et al. (2022), and TextVQA Singh et al. (2019). For general vision-language evaluation, we report results on MMMU-dev Yue et al. (2024), SeedBench Li et al. (2023a), MMVet Yu et al. (2024), POPE (Li et al., 2023c), and GQA (Hudson and Manning, 2019).
Table 3: Connector Performance under a Low-Resource Training Regime: We evaluate the effectiveness of more shallow-fusion connectors when trained on limited data. The Align connector achieves the highest performance, with notably larger gains on document understanding tasks, demonstrating its data efficiency and strong inductive bias.
| Model | DocVQA | InfoVQA | ChartQA | TextVQA | Doc. Avg. | MMMU | SeedBench | MMVet | POPE | GQA | Gen. Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.2-3B-MLP | 42.11 | 19.93 | 48.44 | 51.97 | 40.61 | 33.33 | 58.54 | 31.14 | 87.35 | 57.62 | 53.59 |
| Llama-3.2-3B-Perceiver | 32.18 | 18.10 | 40.00 | 44.31 | 33.64 | 35.22 | 63.70 | 26.19 | 84.92 | 55.86 | 53.17 |
| Llama-3.2-3B-Ovis | 57.73 | 26.39 | 54.52 | 55.60 | 48.56 | 31.89 | 60.97 | 30.41 | 88.26 | 56.23 | 53.55 |
| Llama-3.2-3B-Hreducer | 34.59 | 17.57 | 45.64 | 47.13 | 36.23 | 35.00 | 61.82 | 28.39 | 87.48 | 58.24 | 54.18 |
| Llama-3.2-3B-HoneyBee | 55.86 | 19.36 | 55.32 | 58.13 | 47.16 | 32.11 | 61.18 | 34.31 | 89.28 | 54.79 | 54.33 |
| Llama-3.2-3B- Align (ours) | 71.43 | 30.50 | 69.72 | 65.63 | 59.32 | 35.33 | 63.27 | 35.32 | 88.85 | 61.67 | 56.88 |
As summarized in Table 3, Align consistently outperforms other connectors under this low-data regime, with stronger gains on document understanding tasks. The wider performance margin between Align and other connectors under limited data (Table 3) compared to the high-resource setting (Table 2) underscores the benefit of its inductive bias. By grounding visual features within the LLM's text embedding space, Align learns more efficiently from fewer samples, unlike direct-projection connectors that rely heavily on large datasets. This makes Align especially valuable for resource-constrained environments such as academic labs or small-scale industrial research setups, where both data and compute are limited.
5.3 Probability Distribution over Text Tokens Analysis
To better understand the behavior of Align, we examine the probability distribution, $\mathbf{P}_{\text{vocab}}$ in Eq (1), over the LLM's text vocabulary generated from visual features. Specifically, we process 100 document images through the vision encoder and Align, then average the resulting probability distributions across all image patches. The final distribution is shown in Figure 3. As illustrated, the distribution is dense (rather than sparse), with the highest probability assigned to a single token being 0.0118. This can be explained by the vision feature space being continuous and of much higher cardinality than the discrete text space. Indeed, while the LLM has 128K distinct vocabulary tokens, an image patch (e.g., 14×14 pixels) contains continuous, high-dimensional information that cannot be effectively mapped to a single or a few discrete tokens.
Table 4: Performance comparison when evaluating Align with the full text embedding vocabulary (128K) versus the reduced subset of 3.4K high-probability embeddings. The results show negligible performance degradation, indicating that Align relies primarily on a small subset of embeddings.
| Model | DocVQA VAL | InfoVQA VAL | DeepForm TEST | KLC TEST | WTQ TEST | TabFact TEST | ChartQA TEST | TextVQA VAL | TableVQA TEST | Avg. Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.2-3B- Align (Full Embeddings) | 79.63 | 44.53 | 63.49 | 35.25 | 38.59 | 78.51 | 71.88 | 57.38 | 60.10 | 58.81 |
| Llama-3.2-3B- Align (3.4K Embeddings) | 79.40 | 44.13 | 63.64 | 35.02 | 38.26 | 78.83 | 71.72 | 57.48 | 59.80 | 58.69 |
We conducted a deeper analysis of the token probability distributions produced by the Align connector. Our observations show that Align consistently assigns high probabilities to approximately 3.4K tokens from the entire vocabulary, while the remaining tokens receive negligible probabilities (below $10^{-6}$). To better understand this behavior, we applied Principal Component Analysis (PCA) to reduce the dimensionality of the embeddings and visualized them in a two-dimensional space, as shown in Figure 4. The visualization reveals that these 3.4K tokens densely and comprehensively span the latent space of the LLM's text embeddings. To validate this finding, we conducted additional evaluation experiments in which we retained only these 3.4K high-probability embeddings in the Align connector, entirely removing the rest during evaluation. As shown in Table 4, the performance difference compared to using the full embedding set (128K) was negligible. This confirms that Align effectively leverages and combines a compact subset of embeddings to map visual features into semantically meaningful regions within the LLM's latent text space. Moreover, this suggests that Align can be further optimized through targeted embedding pruning to improve computational efficiency without sacrificing performance.
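A sketch of this pruning analysis follows, assuming access to the per-patch distributions of Eq. (1); the probability threshold and the function name are assumptions.

```python
import torch

def prune_align_vocabulary(p_vocab: torch.Tensor, e_text: torch.Tensor, threshold: float = 1e-6):
    """Keep only the token embeddings that Align actually uses.

    `p_vocab` stacks the distributions of Eq. (1) for many image patches,
    shape (num_patches, V); `e_text` is the full embedding matrix, shape (V, D).
    Tokens whose average probability stays below `threshold` are dropped.
    """
    mean_probs = p_vocab.mean(dim=0)                        # average over patches
    kept_ids = torch.nonzero(mean_probs > threshold).squeeze(-1)
    return kept_ids, e_text[kept_ids]                       # ~3.4K rows reported above
```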
<details>
<summary>figures/prob_dist_image.png Details</summary>

Line plot titled "Probability Distribution over the LLM's Text Tokens". X-axis: token index (0 to 120,000); y-axis: probability (0 to 0.012). A peak of roughly 0.0115 appears near token index 0, followed by a long tail of small probabilities (on the order of $10^{-4}$) across the remaining tokens.
</details>
Figure 3: Probability distribution over LLM tokens, highlighting dense probabilities for whitespace tokens.
<details>
<summary>figures/vision-to-text/alignvlm_embeds.png Details</summary>

Scatter plot titled "2D PCA of Embeddings (AlignVLM)". Axes: PC1 and PC2, each ranging from roughly -0.15 to 0.15. Red points mark the most influential embeddings in AlignVLM; blue points mark the remaining embeddings.
</details>
Figure 4: PCA of Align Embeddings: The principal components of the most influential embeddings in the Align Connector span most of the feature space represented by all embeddings.
5.4 Robustness to Noise Analysis
To evaluate the robustness of our Align connector to noisy visual features, we conduct an experiment where random Gaussian noise is added to the visual features produced by the vision encoder before passing them into the connector. Specifically, given the visual features $\mathbf{F}\in\mathbb{R}^{N\times d}$ output by the vision encoder (where $N$ is the number of feature vectors and $d$ is their dimensionality), we perturb them as
$$
\widetilde{\mathbf{F}}=\mathbf{F}+\mathbf{N},\quad\mathbf{N}\sim\mathcal{N}(0,\sigma=3).
$$
Table 5: Robustness to Noise. Comparison of Avg. Scores with and without Gaussian noise ( $\sigma=3$ ), including performance drop ( $\Delta$ ).
| Model | Without Noise | With Noise | Drop ( $\Delta$ ) |
| --- | --- | --- | --- |
| Llama-3.2-3B-MLP | 53.06 | 27.52 | $\downarrow 25.54$ |
| Llama-3.2-3B- Align (ours) | 58.81 | 57.14 | $\downarrow\textbf{1.67}$ |
As shown in Table 5, our Align connector demonstrates high robustness to noise, with only a 1.67% average drop in performance. In contrast, the widely adopted MLP connector suffers a significant performance degradation of 25.54%, highlighting its vulnerability to noisy inputs. Furthermore, we measured the average cosine distance between the original and noise-perturbed visual embeddings using both the Align and MLP connectors. Align showed significantly lower distances (0.0036) than MLP (0.3938), further validating its robustness to noise. These empirical results support our hypothesis that leveraging the knowledge encoded in the LLM's text embeddings and constraining the visual features within the convex hull of the text latent space act as a regularization mechanism, reducing the model's sensitivity to noisy visual features.
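The perturbation and the cosine-distance measurement above can be sketched as follows; `connector` is a generic callable (for Align it would additionally receive the text embedding matrix), and the function name is an assumption.

```python
import torch
import torch.nn.functional as F

def noise_sensitivity(connector, vision_feats: torch.Tensor, sigma: float = 3.0) -> float:
    """Add Gaussian noise to the vision features and return the average cosine
    distance between the clean and perturbed connector outputs."""
    noisy_feats = vision_feats + sigma * torch.randn_like(vision_feats)
    clean_out = connector(vision_feats)
    noisy_out = connector(noisy_feats)
    cos_sim = F.cosine_similarity(clean_out, noisy_out, dim=-1)
    return float((1.0 - cos_sim).mean())
```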
6 Conclusion
We introduce Align, a novel connector designed to align vision and language latent spaces in vision-language models (VLMs), specifically enhancing multimodal document understanding. By improving cross-modal alignment and minimizing noisy embeddings, our models, AlignVLM, which leverage Align, achieve state-of-the-art performance across diverse document understanding tasks. This includes outperforming base VLMs trained on the same datasets and achieving competitive performance with open-source instruct models trained on undisclosed data. Extensive experiments and ablations validate the robustness and effectiveness of Align compared to existing connector designs, establishing it as a significant contribution to vision-language modeling. Future work will explore training on more diverse instruction-tuning datasets to generalize to broader domains.
References
- Abdin et al. [2024] M. Abdin, J. Aneja, H. Awadalla, A. Awadallah, A. A. Awan, N. Bach, A. Bahree, A. Bakhtiari, J. Bao, H. Behl, A. Benhaim, M. Bilenko, J. Bjorck, S. Bubeck, M. Cai, Q. Cai, V. Chaudhary, D. Chen, D. Chen, W. Chen, Y.-C. Chen, Y.-L. Chen, H. Cheng, P. Chopra, X. Dai, M. Dixon, R. Eldan, V. Fragoso, J. Gao, M. Gao, M. Gao, A. Garg, A. D. Giorno, A. Goswami, S. Gunasekar, E. Haider, J. Hao, R. J. Hewett, W. Hu, J. Huynh, D. Iter, S. A. Jacobs, M. Javaheripi, X. Jin, N. Karampatziakis, P. Kauffmann, M. Khademi, D. Kim, Y. J. Kim, L. Kurilenko, J. R. Lee, Y. T. Lee, Y. Li, Y. Li, C. Liang, L. Liden, X. Lin, Z. Lin, C. Liu, L. Liu, M. Liu, W. Liu, X. Liu, C. Luo, P. Madan, A. Mahmoudzadeh, D. Majercak, M. Mazzola, C. C. T. Mendes, A. Mitra, H. Modi, A. Nguyen, B. Norick, B. Patra, D. Perez-Becker, T. Portet, R. Pryzant, H. Qin, M. Radmilac, L. Ren, G. de Rosa, C. Rosset, S. Roy, O. Ruwase, O. Saarikivi, A. Saied, A. Salim, M. Santacroce, S. Shah, N. Shang, H. Sharma, Y. Shen, S. Shukla, X. Song, M. Tanaka, A. Tupini, P. Vaddamanu, C. Wang, G. Wang, L. Wang, S. Wang, X. Wang, Y. Wang, R. Ward, W. Wen, P. Witte, H. Wu, X. Wu, M. Wyatt, B. Xiao, C. Xu, J. Xu, W. Xu, J. Xue, S. Yadav, F. Yang, J. Yang, Y. Yang, Z. Yang, D. Yu, L. Yuan, C. Zhang, C. Zhang, J. Zhang, L. L. Zhang, Y. Zhang, Y. Zhang, Y. Zhang, and X. Zhou. Phi-3 technical report: A highly capable language model locally on your phone, 2024. URL https://arxiv.org/abs/2404.14219.
- Achiam et al. [2023] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Agrawal et al. [2024] P. Agrawal, S. Antoniak, E. B. Hanna, B. Bout, D. Chaplot, J. Chudnovsky, D. Costa, B. D. Monicault, S. Garg, T. Gervet, S. Ghosh, A. Héliou, P. Jacob, A. Q. Jiang, K. Khandelwal, T. Lacroix, G. Lample, D. L. Casas, T. Lavril, T. L. Scao, A. Lo, W. Marshall, L. Martin, A. Mensch, P. Muddireddy, V. Nemychnikova, M. Pellat, P. V. Platen, N. Raghuraman, B. Rozière, A. Sablayrolles, L. Saulnier, R. Sauvestre, W. Shang, R. Soletskyi, L. Stewart, P. Stock, J. Studnia, S. Subramanian, S. Vaze, T. Wang, and S. Yang. Pixtral 12b, 2024. URL https://arxiv.org/abs/2410.07073.
- Alayrac et al. [2022] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M. Monteiro, J. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. Binkowski, R. Barreira, O. Vinyals, A. Zisserman, and K. Simonyan. Flamingo: a visual language model for few-shot learning, 2022. URL https://arxiv.org/abs/2204.14198.
- Aminabadi et al. [2022] R. Y. Aminabadi, S. Rajbhandari, M. Zhang, A. A. Awan, C. Li, D. Li, E. Zheng, J. Rasley, S. Smith, O. Ruwase, and Y. He. Deepspeed inference: Enabling efficient inference of transformer models at unprecedented scale, 2022. URL https://arxiv.org/abs/2207.00032.
- Anthropic [2024] Anthropic. The claude 3 model family: Opus, sonnet, haiku. 2024.
- Bai et al. [2025] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin. Qwen2.5-vl technical report, 2025. URL https://arxiv.org/abs/2502.13923.
- Bavishi et al. [2023] R. Bavishi, E. Elsen, C. Hawthorne, M. Nye, A. Odena, A. Somani, and S. Taşırlar. Introducing our multimodal models, 2023. URL https://www.adept.ai/blog/fuyu-8b.
- Beyer et al. [2024] L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdulmohsin, M. Tschannen, E. Bugliarello, T. Unterthiner, D. Keysers, S. Koppula, F. Liu, A. Grycner, A. Gritsenko, N. Houlsby, M. Kumar, K. Rong, J. Eisenschlos, R. Kabra, M. Bauer, M. Bošnjak, X. Chen, M. Minderer, P. Voigtlaender, I. Bica, I. Balazevic, J. Puigcerver, P. Papalampidi, O. Henaff, X. Xiong, R. Soricut, J. Harmsen, and X. Zhai. Paligemma: A versatile 3b vlm for transfer, 2024. URL https://arxiv.org/abs/2407.07726.
- Bommasani et al. [2023] R. Bommasani, K. Klyman, S. Longpre, S. Kapoor, N. Maslej, B. Xiong, D. Zhang, and P. Liang. The foundation model transparency index, 2023. URL https://arxiv.org/abs/2310.12941.
- Brown et al. [2020] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877â1901, 2020.
- Cha et al. [2024] J. Cha, W. Kang, J. Mun, and B. Roh. Honeybee: Locality-enhanced projector for multimodal llm, 2024. URL https://arxiv.org/abs/2312.06742.
- Changpinyo et al. [2021] S. Changpinyo, P. Sharma, N. Ding, and R. Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts, 2021. URL https://arxiv.org/abs/2102.08981.
- Chen et al. [2020] W. Chen, H. Wang, J. Chen, Y. Zhang, H. Wang, S. Li, X. Zhou, and W. Y. Wang. Tabfact: A large-scale dataset for table-based fact verification. In International Conference Learning Representations, 2020.
- Chen et al. [2024a] Z. Chen, W. Wang, H. Tian, S. Ye, Z. Gao, E. Cui, W. Tong, K. Hu, J. Luo, Z. Ma, J. Ma, J. Wang, X. Dong, H. Yan, H. Guo, C. He, B. Shi, Z. Jin, C. Xu, B. Wang, X. Wei, W. Li, W. Zhang, B. Zhang, P. Cai, L. Wen, X. Yan, M. Dou, L. Lu, X. Zhu, T. Lu, D. Lin, Y. Qiao, J. Dai, and W. Wang. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites, 2024a. URL https://arxiv.org/abs/2404.16821.
- Chen et al. [2024b] Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024b.
- Dai et al. [2024] W. Dai, N. Lee, B. Wang, Z. Yang, Z. Liu, J. Barker, T. Rintamaki, M. Shoeybi, B. Catanzaro, and W. Ping. Nvlm: Open frontier-class multimodal llms. arXiv preprint arXiv: 2409.11402, 2024.
- Diao et al. [2024] H. Diao, Y. Cui, X. Li, Y. Wang, H. Lu, and X. Wang. Unveiling encoder-free vision-language models. arXiv preprint arXiv:2406.11832, 2024.
- Drouin et al. [2024] A. Drouin, M. Gasse, M. Caccia, I. H. Laradji, M. D. Verme, T. Marty, L. Boisvert, M. Thakkar, Q. Cappart, D. Vazquez, N. Chapados, and A. Lacoste. Workarena: How capable are web agents at solving common knowledge work tasks?, 2024. URL https://arxiv.org/abs/2403.07718.
- Duan et al. [2024] H. Duan, J. Yang, Y. Qiao, X. Fang, L. Chen, Y. Liu, X. Dong, Y. Zang, P. Zhang, J. Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 11198–11201, 2024.
- Dubey et al. [2024] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, A. Goyal, A. Hartshorn, A. Yang, A. Mitra, A. Sravankumar, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- Grattafiori et al. [2024] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, A. Yang, A. Fan, A. Goyal, A. Hartshorn, et al. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783.
- He et al. [2015] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition, 2015. URL https://arxiv.org/abs/1512.03385.
- Hu et al. [2024] A. Hu, H. Xu, J. Ye, M. Yan, L. Zhang, B. Zhang, C. Li, J. Zhang, Q. Jin, F. Huang, and J. Zhou. mplug-docowl 1.5: Unified structure learning for ocr-free document understanding, 2024. URL https://arxiv.org/abs/2403.12895.
- Hudson and Manning [2019] D. A. Hudson and C. D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering, 2019. URL https://arxiv.org/abs/1902.09506.
- Jaume et al. [2019] G. Jaume, H. K. Ekenel, and J.-P. Thiran. Funsd: A dataset for form understanding in noisy scanned documents, 2019. URL https://arxiv.org/abs/1905.13538.
- Kim et al. [2022] G. Kim, T. Hong, M. Yim, J. Nam, J. Park, J. Yim, W. Hwang, S. Yun, D. Han, and S. Park. Ocr-free document understanding transformer, 2022. URL https://arxiv.org/abs/2111.15664.
- Kim et al. [2024] Y. Kim, M. Yim, and K. Y. Song. Tablevqa-bench: A visual question answering benchmark on multiple table domains. arXiv preprint arXiv:2404.19205, 2024.
- Laurençon et al. [2024] H. Laurençon, L. Tronchon, M. Cord, and V. Sanh. What matters when building vision-language models?, 2024. URL https://arxiv.org/abs/2405.02246.
- Lee et al. [2023] K. Lee, M. Joshi, I. Turc, H. Hu, F. Liu, J. Eisenschlos, U. Khandelwal, P. Shaw, M.-W. Chang, and K. Toutanova. Pix2struct: Screenshot parsing as pretraining for visual language understanding, 2023. URL https://arxiv.org/abs/2210.03347.
- Li et al. [2023a] B. Li, R. Wang, G. Wang, Y. Ge, Y. Ge, and Y. Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension, 2023a. URL https://arxiv.org/abs/2307.16125.
- Li et al. [2024] B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y. Li, Z. Liu, and C. Li. Llava-onevision: Easy visual task transfer, 2024. URL https://arxiv.org/abs/2408.03326.
- Li et al. [2023b] J. Li, D. Li, S. Savarese, and S. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023b. URL https://arxiv.org/abs/2301.12597.
- Li et al. [2023c] Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, and J.-R. Wen. Evaluating object hallucination in large vision-language models, 2023c. URL https://arxiv.org/abs/2305.10355.
- Liu et al. [2023a] H. Liu, C. Li, Y. Li, and Y. J. Lee. Improved baselines with visual instruction tuning, 2023a.
- Liu et al. [2023b] H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning, 2023b.
- Liu et al. [2024] H. Liu, C. Li, Y. Li, B. Li, Y. Zhang, S. Shen, and Y. J. Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/.
- Lu et al. [2024] S. Lu, Y. Li, Q.-G. Chen, Z. Xu, W. Luo, K. Zhang, and H.-J. Ye. Ovis: Structural embedding alignment for multimodal large language model, 2024. URL https://arxiv.org/abs/2405.20797.
- Masry et al. [2022] A. Masry, D. X. Long, J. Q. Tan, S. Joty, and E. Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022.
- Mathew et al. [2021a] M. Mathew, V. Bagal, R. P. Tito, D. Karatzas, E. Valveny, and C. V. Jawahar. Infographicvqa, 2021a. URL https://arxiv.org/abs/2104.12756.
- Mathew et al. [2021b] M. Mathew, D. Karatzas, and C. V. Jawahar. Docvqa: A dataset for vqa on document images, 2021b. URL https://arxiv.org/abs/2007.00398.
- OpenAI et al. [2023] OpenAI, J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, R. Avila, I. Babuschkin, S. Balaji, V. Balcom, P. Baltescu, H. Bao, M. Bavarian, J. Belgum, I. Bello, et al. Gpt-4 technical report. arXiv preprint arXiv: 2303.08774, 2023.
- Park et al. [2019] S. Park, S. Shin, B. Lee, J. Lee, J. Surh, M. Seo, and H. Lee. Cord: A consolidated receipt dataset for post-ocr parsing. Document Intelligence Workshop at Neural Information Processing Systems, 2019.
- Pasupat and Liang [2015] P. Pasupat and P. Liang. Compositional semantic parsing on semi-structured tables. In Annual Meeting of the Association for Computational Linguistics, 2015.
- Qwen et al. [2025] Qwen, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, and Z. Qiu. Qwen2.5 technical report, 2025. URL https://arxiv.org/abs/2412.15115.
- Radford et al. [2021] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision, 2021. URL https://arxiv.org/abs/2103.00020.
- Raffel et al. [2023] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2023. URL https://arxiv.org/abs/1910.10683.
- Rodriguez et al. [2024a] J. Rodriguez, X. Jian, S. S. Panigrahi, T. Zhang, A. Feizi, A. Puri, A. Kalkunte, F. Savard, A. Masry, S. Nayak, R. Awal, M. Massoud, A. Abaskohi, Z. Li, S. Wang, P.-A. Noël, M. L. Richter, S. Vadacchino, S. Agarwal, S. Biswas, S. Shanian, Y. Zhang, N. Bolger, K. MacDonald, S. Fauvel, S. Tejaswi, S. Sunkara, J. Monteiro, K. D. Dvijotham, T. Scholak, N. Chapados, S. Kharagani, S. Hughes, M. Özsu, S. Reddy, M. Pedersoli, Y. Bengio, C. Pal, I. Laradji, S. Gella, P. Taslakian, D. Vazquez, and S. Rajeswar. Bigdocs: An open and permissively-licensed dataset for training multimodal models on document and code tasks, 2024a. URL https://arxiv.org/abs/2412.04626.
- Rodriguez et al. [2022] J. A. Rodriguez, D. Vazquez, I. Laradji, M. Pedersoli, and P. Rodriguez. Ocr-vqgan: Taming text-within-image generation, 2022. URL https://arxiv.org/abs/2210.11248.
- Rodriguez et al. [2024b] J. A. Rodriguez, A. Puri, S. Agarwal, I. H. Laradji, P. Rodriguez, S. Rajeswar, D. Vazquez, C. Pal, and M. Pedersoli. Starvector: Generating scalable vector graphics code from images and text, 2024b. URL https://arxiv.org/abs/2312.11556.
- Singh et al. [2019] A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Batra, D. Parikh, and M. Rohrbach. Towards vqa models that can read. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
- Stanisławek et al. [2021] T. Stanisławek, F. Graliński, A. Wróblewska, D. Lipiński, A. Kaliska, P. Rosalska, B. Topolski, and P. Biecek. Kleister: key information extraction datasets involving long documents with complex layouts. In International Conference on Document Analysis and Recognition, 2021.
- Svetlichnaya [2020] S. Svetlichnaya. Deepform: Understand structured documents at scale, 2020.
- Team [2024] G. Team. Gemini: A family of highly capable multimodal models, 2024. URL https://arxiv.org/abs/2312.11805.
- Vogus and Llansó [2021] C. Vogus and E. Llansó. Making transparency meaningful: A framework for policymakers. Center for Democracy and Technology, 2021.
- Wang et al. [2023a] D. Wang, N. Raman, M. Sibue, Z. Ma, P. Babkin, S. Kaur, Y. Pei, A. Nourbakhsh, and X. Liu. Docllm: A layout-aware generative language model for multimodal document understanding, 2023a. URL https://arxiv.org/abs/2401.00908.
- Wang et al. [2024] P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, Y. Fan, K. Dang, M. Du, X. Ren, R. Men, D. Liu, C. Zhou, J. Zhou, and J. Lin. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution, 2024. URL https://arxiv.org/abs/2409.12191.
- Wang et al. [2023b] W. Wang, Q. Lv, W. Yu, W. Hong, J. Qi, Y. Wang, J. Ji, Z. Yang, L. Zhao, X. Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023b.
- Wu et al. [2024a] C. Wu, X. Chen, Z. Wu, Y. Ma, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, C. Ruan, and P. Luo. Janus: Decoupling visual encoding for unified multimodal understanding and generation, 2024a. URL https://arxiv.org/abs/2410.13848.
- Wu et al. [2024b] Z. Wu, X. Chen, Z. Pan, X. Liu, W. Liu, D. Dai, H. Gao, Y. Ma, C. Wu, B. Wang, Z. Xie, Y. Wu, K. Hu, J. Wang, Y. Sun, Y. Li, Y. Piao, K. Guan, A. Liu, X. Xie, Y. You, K. Dong, X. Yu, H. Zhang, L. Zhao, Y. Wang, and C. Ruan. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding, 2024b. URL https://arxiv.org/abs/2412.10302.
- Xu et al. [2024] R. Xu, Y. Yao, Z. Guo, J. Cui, Z. Ni, C. Ge, T.-S. Chua, Z. Liu, M. Sun, and G. Huang. Llava-uhd: an lmm perceiving any aspect ratio and high-resolution images. European Conference on Computer Vision, 2024. doi: 10.48550/arXiv.2403.11703.
- Yu et al. [2024] W. Yu, Z. Yang, L. Li, J. Wang, K. Lin, Z. Liu, X. Wang, and L. Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities, 2024. URL https://arxiv.org/abs/2308.02490.
- Yue et al. [2024] X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun, C. Wei, B. Yu, R. Yuan, R. Sun, M. Yin, B. Zheng, Z. Yang, Y. Liu, W. Huang, H. Sun, Y. Su, and W. Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi, 2024. URL https://arxiv.org/abs/2311.16502.
- Zhai et al. [2023] X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training, 2023. URL https://arxiv.org/abs/2303.15343.
- Zhang et al. [2024] T. Zhang, S. Wang, L. Li, G. Zhang, P. Taslakian, S. Rajeswar, J. Fu, B. Liu, and Y. Bengio. Vcr: Visual caption restoration. arXiv preprint arXiv: 2406.06462, 2024.
- Zhao et al. [2024] Y. Zhao, J. Huang, J. Hu, X. Wang, Y. Mao, D. Zhang, Z. Jiang, Z. Wu, B. Ai, A. Wang, W. Zhou, and Y. Chen. Swift: a scalable lightweight infrastructure for fine-tuning, 2024. URL https://arxiv.org/abs/2408.05517.
- Zhu et al. [2025] J. Zhu, W. Wang, Z. Chen, Z. Liu, S. Ye, L. Gu, H. Tian, Y. Duan, W. Su, J. Shao, Z. Gao, E. Cui, X. Wang, Y. Cao, Y. Liu, X. Wei, H. Zhang, H. Wang, W. Xu, H. Li, J. Wang, N. Deng, S. Li, Y. He, T. Jiang, J. Luo, Y. Wang, C. He, B. Shi, X. Zhang, W. Shao, J. He, Y. Xiong, W. Qu, P. Sun, P. Jiao, H. Lv, L. Wu, K. Zhang, H. Deng, J. Ge, K. Chen, L. Wang, M. Dou, L. Lu, X. Zhu, T. Lu, D. Lin, Y. Qiao, J. Dai, and W. Wang. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models, 2025. URL https://arxiv.org/abs/2504.10479.
Appendix A Appendix
A.1 Experimental Setup
We provide detailed hyperparameters of our experiments in Table 6.
Table 6: Detailed hyperparameters for each training stage across different LLM backbones.
| LLM Backbone | Stage | Trainable Parameters | Batch Size | Text Max Length | Epochs | Learning Rate |
| --- | --- | --- | --- | --- | --- | --- |
| Llama 3.2-1B | Stage-1 | Full Model | 512 | 1024 | 1 | $1 \times 10^{-5}$ |
| Llama 3.2-1B | Stage-2 | Full Model | 512 | 2048 | 1 | $5 \times 10^{-5}$ |
| Llama 3.2-1B | Stage-3 | LLM & Connector | 512 | 2048 | 5 | $5 \times 10^{-5}$ |
| Llama 3.2-3B | Stage-1 | Full Model | 512 | 1024 | 1 | $1 \times 10^{-5}$ |
| Llama 3.2-3B | Stage-2 | Full Model | 256 | 2048 | 1 | $5 \times 10^{-5}$ |
| Llama 3.2-3B | Stage-3 | LLM & Connector | 256 | 2048 | 5 | $5 \times 10^{-5}$ |
| Llama 3.1-8B | Stage-1 | Full Model | 512 | 1024 | 1 | $1 \times 10^{-5}$ |
| Llama 3.1-8B | Stage-2 | Full Model | 256 | 2048 | 1 | $1 \times 10^{-5}$ |
| Llama 3.1-8B | Stage-3 | LLM & Connector | 256 | 2048 | 5 | $1 \times 10^{-5}$ |
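For reference, the Stage-3 settings for the Llama 3.2-3B backbone could be written as a plain training configuration; this is a minimal sketch in which the key names are hypothetical and only the values come from Table 6.

```python
# Hypothetical key names; values taken from Table 6 (Llama 3.2-3B, Stage-3).
stage3_llama32_3b = {
    "trainable_parameters": "llm_and_connector",  # per Table 6, only the LLM and connector are updated in Stage-3
    "batch_size": 256,
    "text_max_length": 2048,
    "epochs": 5,
    "learning_rate": 5e-5,
}
```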
A.2 Runtime Comparison Between Connectors
One caveat of the Align connector is that it includes an additional LM head layer, which slightly increases the total number of parameters. However, this addition has a negligible impact on runtime efficiency due to its simple structure: it introduces only a few matrix multiplication operations (as shown in Equations 1 and 2) rather than stacking many layers that require sequential processing, as in deep fusion methods.
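To make the cost concrete, the extra work amounts to one projection onto the vocabulary followed by a softmax and a matrix product with the LLM text-embedding table. The sketch below is a minimal PyTorch rendering of that computation; the class and parameter names are hypothetical, and the real connector includes details omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignConnectorSketch(nn.Module):
    """Minimal sketch of an Align-style connector (hypothetical names, simplified)."""

    def __init__(self, vision_dim: int, llm_embedding: nn.Embedding):
        super().__init__()
        vocab_size, hidden_dim = llm_embedding.weight.shape
        self.proj = nn.Linear(vision_dim, hidden_dim)                 # map vision features to the LLM hidden size
        self.lm_head = nn.Linear(hidden_dim, vocab_size, bias=False)  # the additional LM-head-like layer
        self.llm_embedding = llm_embedding                            # LLM text-embedding table

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_patches, vision_dim)
        h = self.proj(visual_feats)                                   # (B, P, hidden)
        probs = F.softmax(self.lm_head(h), dim=-1)                    # (B, P, vocab): distribution over text tokens
        # Convex combination of text embeddings -> features that live in the LLM's input space
        return probs @ self.llm_embedding.weight                      # (B, P, hidden)
```

Per visual token this is two additional matrix multiplications and a softmax, which is consistent with the small runtime differences reported in Table 7.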
To empirically validate this claim, we benchmarked the runtime and memory usage of models equipped with different connector types (MLP, Align, Ovis, and Perceiver), following the same experimental setup as in Table 2. As shown in Table 7, the results demonstrate that although the Align connector delivers notably superior performance (see Table 2), the variations in inference speed and GPU memory usage among the connectors remain minimal.
Table 7: Runtime and memory comparison between different connector designs. The results show that Align introduces negligible computational overhead compared to other connectors.
| Model | Samples | Avg Time (s) | Tokens/sec | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| Llama-3.2-3B-MLP | 2500 | 0.161 | 118.3 | 10.9 |
| Llama-3.2-3B-Perceiver | 2500 | 0.140 | 135.1 | 10.9 |
| Llama-3.2-3B-Ovis | 2500 | 0.155 | 122.5 | 10.8 |
| Llama-3.2-3B-Align | 2500 | 0.165 | 115.4 | 10.9 |
Overall, the empirical evidence confirms that the Align connector achieves an effective balance between computational efficiency and performance. It introduces only a negligible increase in runtime and memory usage while providing substantial gains in overall accuracy.
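The measurements in Table 7 can be obtained with a simple timing loop. The sketch below assumes a Hugging Face-style `generate()` interface and pre-tokenized batches already placed on the GPU; it is illustrative rather than the exact harness we used.

```python
import time
import torch

@torch.no_grad()
def benchmark(model, batches, device="cuda", max_new_tokens=32):
    """Return (avg seconds per batch, tokens/sec, peak GPU memory in GB) for one model."""
    torch.cuda.reset_peak_memory_stats(device)
    total_time, total_new_tokens = 0.0, 0
    for inputs in batches:                            # each item: dict of input tensors on `device`
        torch.cuda.synchronize(device)
        start = time.perf_counter()
        out = model.generate(**inputs, max_new_tokens=max_new_tokens)
        torch.cuda.synchronize(device)
        total_time += time.perf_counter() - start
        total_new_tokens += out.numel() - inputs["input_ids"].numel()  # count newly generated tokens only
    peak_gb = torch.cuda.max_memory_allocated(device) / 1024**3
    return total_time / len(batches), total_new_tokens / total_time, peak_gb
```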
A.3 Pixel-Level Tasks Analysis
To rigorously evaluate the ability of vision-language models to integrate fine-grained visual and textual pixel-level cues, we test our model on the VCR benchmark [Zhang et al., 2024], which requires the model to recover partially occluded text using pixel-level hints from the revealed parts of the text. This task stresses a VLM's text-image alignment in extreme situations. Current state-of-the-art models such as GPT-4V [OpenAI et al., 2023], Claude 3.5 Sonnet [Anthropic, 2024], and Llama-3.2 [Dubey et al., 2024] significantly underperform humans on the hard VCR task due to their inability to process subtle pixel-level cues in occluded text regions. These models frequently discard critical visual tokens during image tokenization based on semantic priors, overlooking the interplay between partial character strokes and the surrounding visual context. To evaluate performance on VCR, we modify our Stage 3 SFT dataset composition by replacing the exclusive use of DocDownstream with a 5:1 blended ratio of DocDownstream and VCR training data. This adjustment enables a direct evaluation of our Align architecture's ability to leverage pixel-level character cues.
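The 5:1 mixture described above can be implemented by weighted sampling over the two SFT sources. The snippet below is a minimal sketch under that assumption; the dataset objects and the function name are placeholders, not the exact data pipeline used in our experiments.

```python
import random

def mixed_sft_examples(docdownstream, vcr, ratio=(5, 1), seed=0):
    """Yield SFT examples, drawing from DocDownstream and VCR at a 5:1 ratio (sketch)."""
    rng = random.Random(seed)
    sources = [docdownstream, vcr]   # two list-like datasets (placeholders)
    weights = list(ratio)            # sampling weights, e.g. 5 : 1
    while True:
        source = rng.choices(sources, weights=weights, k=1)[0]
        yield source[rng.randrange(len(source))]
```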
From the experimental outcomes, it is evident that AlignVLM consistently outperforms the MLP Connector Model across both easy and hard settings of the pixel-level VCR task (see Figure 5), with improvements ranging from 10.18% on the hard setting to 14.41% on the easy setting.
We provide a case study on VCR in Figure 6, featuring four representative examples. In Figure 6(a), it is evident that the MLP connector model fails to capture semantic consistency as effectively as AlignVLM. The phrase "The commune first *census* in written history in" (where the words in italics are generated by the model while the rest are in the image) is not as semantically coherent as the phrase generated by Align, "The commune first *appears* in written history in".
Beyond the issue of semantic fluency, in Figure 6(b) we also observe that AlignVLM successfully identifies the uncovered portion of the letter "g" in "accounting" and uses it as a pixel-level hint to infer the correct word. In contrast, the MLP model fails to effectively attend to this crucial detail.
Figures 6(c) and 6(d) show examples where AlignVLM fails on the VCR task. These carefully selected instances show that our method confuses names of landmarks with common words when the two are very similar. As seen in the examples, AlignVLM mistakes "Llanengan" for "Llanongan" and "Gorden" for "Garden". In both instances, the pairs differ by one character, perhaps indicating that AlignVLM tends to align vision representations to more common tokens in the vocabulary. One approach that could mitigate such errors would be to train AlignVLM on more contextually relevant data.
<details>
<summary>x3.png Details</summary>

### Visual Description
# Technical Document Extraction: Bar Chart Analysis
## Chart Type
Bar chart comparing model performance across two evaluation categories.
## Axes Labels
- **X-axis**: "Exact Match (%)" (percentage scale from 0 to 70)
- **Y-axis**: Categorical axis with two labels:
- "VCR EN Easy"
- "VCR EN Hard"
## Legend
- **Placement**: Bottom of chart
- **Color Coding**:
- `Blue`: Llama-3.2-3B-Align (Ours)
- `Orange`: Llama-3.2-3B-MLP
## Data Points
### VCR EN Easy
- **Llama-3.2-3B-Align**: 65.84% (Blue bar)
- **Llama-3.2-3B-MLP**: 51.43% (Orange bar)
### VCR EN Hard
- **Llama-3.2-3B-Align**: 48.07% (Blue bar)
- **Llama-3.2-3B-MLP**: 37.89% (Orange bar)
## Visual Trends
1. **Performance Gap**:
- Align model consistently outperforms MLP in both categories
- Largest gap in "VCR EN Easy" (14.41% difference)
- Smaller gap in "VCR EN Hard" (10.18% difference)
2. **Category Performance**:
- Both models show higher performance in "VCR EN Easy" vs "VCR EN Hard"
- Align maintains absolute advantage across all metrics
## Spatial Grounding
- Legend positioned at bottom center
- Bars aligned vertically under respective category labels
- Color consistency verified: Blue bars match Align legend, Orange bars match MLP legend
## Technical Observations
- Chart uses percentage-based visualization for direct performance comparison
- Error bars not present; data appears to represent mean values
- No additional annotations or statistical significance markers visible
## Language Analysis
- All text in English
- No non-English content detected
## Data Reconstruction Table
| Category | Llama-3.2-3B-Align | Llama-3.2-3B-MLP |
|----------------|--------------------|------------------|
| VCR EN Easy | 65.84% | 51.43% |
| VCR EN Hard | 48.07% | 37.89% |
## Conclusion
The chart demonstrates that the Llama-3.2-3B-Align model achieves superior exact match performance across both evaluation categories compared to the MLP baseline, with particularly strong performance in the "VCR EN Easy" category.
</details>
Figure 5: Comparison of Llama-3.2-3B-Align and Llama-3.2-3B-MLP on the Easy and Hard VCR tasks.
<details>
<summary>figures/vcr_example1.png Details</summary>

### Visual Description
# Technical Document Extraction: Map of Atel Commune, Romania
## Image Description
The image is a satellite-style map of a geographic region labeled as "Atel Commune" in Romania. Key features include:
- **Yellow outline**: Highlights the commune's boundaries.
- **White lines**: Represent roads, administrative boundaries, or natural features.
- **Green shading**: Indicates forested or vegetated areas.
- **Lighter patches**: Suggests urbanized zones or cleared land.
## Textual Content Extraction
### Embedded Text
1. **Location Label**:
- "Atel Commune in Sibiu County, Transylvania, Romania."
2. **Description of Atel Commune**:
- "It is composed of two villages, Atel and DupuÈ."
3. **Historical Note** (partially truncated):
- "The commune first appeared in written history in..." (text cuts off mid-sentence).
### Language Analysis
- **Primary Language**: English.
- **Secondary Language**: Romanian (partial phrase: "AÈel Èi DupÄpÈ").
- English Translation: "Atel and DupuÈ."
## Structural Analysis
- **No Charts/Diagrams**: The image contains no numerical data, heatmaps, or graphs.
- **No Data Tables**: No tabular information is present.
- **Legend Absence**: No legend is visible to cross-reference colors or labels.
## Spatial Grounding
- **Key Features**:
- Yellow outline (Atel Commune) centered in the map.
- White lines (roads/boundaries) distributed across the region.
- No explicit coordinates or scale provided.
## Trend Verification
- **Not Applicable**: No data series or trends to analyze due to absence of charts.
## Component Isolation
1. **Header**: Map title and location label.
2. **Main Chart**: Satellite imagery with highlighted commune and boundaries.
3. **Footer**: Descriptive text about the commune's composition and history.
## Final Output
The image provides a geographic overview of Atel Commune, Romania, but lacks quantitative data or structured datasets. Extracted text confirms the commune's composition (Atel and Dupuș villages) and its location in Transylvania. Historical context is incomplete due to text truncation.
</details>
| GT: | (appears in written history in) |
| --- | --- |
| MLP: | (census in written history in) ✗ |
| Align: | (appears in written history in) ✓ |
(a) Positive Example 1
<details>
<summary>figures/vcr_example2.png Details</summary>

### Visual Description
# Technical Document: Ghana Telephone Numbering Plan Map
## Map Description
The image depicts a color-coded administrative map of Ghana, divided into 10 distinct regions. Each region is labeled with a three-digit numbering code (e.g., "031", "032") and assigned a unique color. The map is presented on a white background with gray borders.
### Legend
The legend at the bottom of the map provides the following key:
- **Blue (#0000FF)**: 031
- **Red (#FF0000)**: 032
- **Light Blue (#ADD8E6)**: 033
- **Purple (#800080)**: 034
- **Green (#008000)**: 035
- **Yellow (#FFFF00)**: 036
- **Dark Blue (#00008B)**: 037
- **Pink (#FFC0CB)**: 038
- **Light Purple (#E6E6FA)**: 039
- **Light Green (#90EE90)**: 030
### Spatial Grounding
1. **031** (Blue): Located in the southwestern coastal region.
2. **032** (Red): Central region, adjacent to 031 and 033.
3. **033** (Light Blue): Southern region, bordering 032 and 034.
4. **034** (Purple): Eastern region, adjacent to 033 and 035.
5. **035** (Green): Northern region, bordering 034 and 036.
6. **036** (Yellow): Northeastern region, adjacent to 035 and 037.
7. **037** (Dark Blue): Central-northern region, bordering 036 and 038.
8. **038** (Pink): Small northeastern enclave, adjacent to 037.
9. **039** (Light Purple): Small northwestern region, adjacent to 037.
10. **030** (Light Green): Small southeastern region, adjacent to 034 and 036.
## Textual Content
The map includes the following textual information:
</details>
| GT: | (the system used for assigning) |
| --- | --- |
| MLP: | (the system used for accounting) ✗ |
| Align: | (the system used for assigning) ✓ |
(b) Positive Example 2
<details>
<summary>figures/vcr_example3.png Details</summary>

### Visual Description
# Technical Document Extraction: Penrhyn Dû Mines Image Analysis
## Image Description
The image depicts a narrow, dimly lit tunnel with rough, uneven walls composed of layered rock formations. A faint light source illuminates the path at the tunnel's end, creating a gradient of visibility from the foreground to the background. The tunnel's floor appears uneven and partially obscured by shadows. No human figures, machinery, or modern infrastructure are visible, emphasizing the natural and historical context of the site.
## Textual Information Extraction
The image contains the following text block (transcribed verbatim):
> "The Penrhyn DĂč Mines are a collection of mines situated in the LlĆ·n Peninsula. It encompasses the Penrhyn, Assheton, Western and"
### Language Analysis
- **Primary Language**: English
- **Secondary Language**: Welsh (Cymraeg)
- **Transcribed Terms**:
- `Penrhyn Dû` → "Black Penrhyn" (Welsh: "Penrhyn" = "head/peninsula"; "Dû" = "black")
- `Llŷn Peninsula` → "Llŷn Peninsula" (proper noun; no direct translation required)
### Textual Incompleteness
The final sentence is truncated, ending with "and" without completing the list of mine names. The incomplete phrase suggests missing data or a formatting error in the original source.
## Structural Analysis
- **No Charts/Diagrams**: The image contains no graphical data representations (e.g., heatmaps, line charts, or tables).
- **No Labels/Axis Titles**: Absent due to the absence of quantitative data visualization.
- **No Legends**: Not applicable.
## Spatial and Contextual Notes
- The tunnel's architecture suggests historical mining activity, likely associated with the Penrhyn Dû Mines.
- The Llŷn Peninsula (Welsh: *Peninsula Llŷn*) is a geographic region in Wales, UK, known for its rugged coastline and historical sites.
## Conclusion
The image provides a visual and textual reference to the Penrhyn Dû Mines, a historical mining site in the Llŷn Peninsula. The incomplete textual description and lack of modern infrastructure highlight the site's antiquity and natural setting. No quantitative data or diagrammatic elements are present in the image.
</details>
| GT: | (mines situated near Llanengan on) |
| --- | --- |
| MLP: | (mines situated near Llanengan on) ✓ |
| Align: | (mines situated near Llanongan on) ✗ |
(c) Negative Example 1
<details>
<summary>figures/vcr_example4.png Details</summary>

### Visual Description
# Technical Document Extraction: Image Analysis
## Image Description
The image depicts a brick municipal building with a gabled roof. The structure features a central entrance with a covered portico. The building's facade includes a prominent sign with text. Surrounding the building are trees and a parking area with a single visible vehicle.
## Textual Content Extraction
### Primary Signage
- **Text:** "CITY OF FAIRMOUNT"
- **Font:** Yellow uppercase letters
- **Placement:** Top of the building facade
- **Additional Text:** Partially visible text below the main sign (obscured by redaction)
### Secondary Text Blocks
1. **First Block (Visible Portion):**
- "Fairmount is a city in Cobb County, Georgia, United States."
- "As of the 2010 census it had a population of 720."
2. **Second Block (Visible Portion):**
- "Cobb County is home to New Echota,"
- Remaining text redacted (represented by black blocks)
### Redaction Analysis
- Multiple sections of text are obscured by black rectangular blocks, preventing full transcription.
- Redacted content appears to continue the narrative about Fairmount and Cobb County.
## Structural Analysis
- **Building Type:** Municipal facility (likely city hall or administrative center)
- **Architectural Features:**
- Brick construction
- Gabled roof with white trim
- Covered entrance portico
- Parking area with handicap-accessible spaces
## Geographic Context
- **Location:** Fairmount, Cobb County, Georgia, United States
- **Population Data:** 720 residents (2010 census)
- **Regional Reference:** Cobb County (home to New Echota)
## Limitations
- Partial text visibility due to redaction
- No numerical data beyond population figure
- No visual charts/diagrams present
## Conclusion
The image provides limited textual information about Fairmount, Georgia, including its county affiliation, population statistics, and regional context. Critical details about the city's history or governance are obscured through redaction.
</details>
| GT: | (Gorden County is home to) |
| --- | --- |
| MLP: | (Gorden County is home to) ✓ |
| Align: | (Garden County is home to) ✗ |
(d) Negative Example 2
Figure 6: Case Study for Pixel-Level Tasks. We provide examples of our proposed Align connector compared with the Multi-Layer Perceptron (MLP) connector. The Align connector tends to map visual elements to more common words. GT is the ground truth.
A.4 Case Studies
In this section, we provide case studies for the experiments in Section 5.1. Specifically, we provide examples of our Llama-3.2-3B-Align and its counterpart models with alternative connectors, Llama-3.2-3B-MLP and Llama-3.2-3B-Ovis, on three different datasets: KLC [Stanisławek et al., 2021], DocVQA [Mathew et al., 2021b], and TextVQA [Singh et al., 2019]. The examples are shown in Figures 7, 8, and 9.
<details>
<summary>figures/case_1.jpg Details</summary>

### Visual Description
# Technical Document Analysis: Ardingly College Limited Directors' Report and Financial Statements
## Document Overview
- **Title**: Ardingly College Limited
- **Subtitle**: DIRECTORS' REPORT AND FINANCIAL STATEMENTS FOR THE YEAR ENDED 31ST AUGUST 2017
- **Date Range**: Year ended 31st August 2017
- **Language**: English (no other languages detected)
## Header Section
- **Logo**:
- Left: Ardingly College logo featuring a pelican with outstretched wings.
- Right: Globe icon with the tagline "World Ready" beneath it.
- **Text**:
- "Ardingly College Limited" (centered, bold, serif font).
- "DIRECTORS' REPORT AND FINANCIAL STATEMENTS" (centered, uppercase, sans-serif font).
- "FOR THE YEAR ENDED 31ST AUGUST 2017" (centered, uppercase, sans-serif font).
## Footer Section
- **Registration Details**:
- **Charity Registration No.**: 1076456
- **Company Registration No.**: 03779971 (England and Wales)
## Additional Notes
- **Top Right Corner**: Handwritten annotation "EXTERNAL" (likely indicating document classification).
- **Border Design**: Decorative border with floral motifs at corners and vertical lines on the left edge.
## Structural Analysis
1. **Header**:
- Logo placement: Top center (pelican and globe).
- Title hierarchy: Primary title > Subtitle > Date.
2. **Body**:
- No charts, diagrams, or data tables present.
- Text-only content with clear typographic hierarchy.
3. **Footer**:
- Registration numbers aligned to the bottom center.
## Key Observations
- **Purpose**: Formal financial reporting document for Ardingly College Limited.
- **Compliance**: Includes mandatory registration numbers for transparency.
- **Design**: Minimalist layout with emphasis on institutional branding (pelican and globe).
## Limitations
- No numerical data, charts, or diagrams to analyze.
- No explicit mention of financial metrics or operational details in the cover page.
## Conclusion
This document serves as the cover page for Ardingly College Limited's annual directors' report and financial statements. It provides institutional branding, regulatory compliance details, and a clear temporal reference for the reporting period. Further analysis would require access to subsequent pages containing the actual report and financial data.
</details>
| Question: | What is the value for the charity name? |
| --- | --- |
| GT: | (Ardingly College Ltd.) |
| MLP: | (Ardington College Ltd.) ✗ |
| Ovis: | (Ardington College Ltd.) ✗ |
| Align: | (Ardingly College Ltd.) ✓ |
(a) Positive Example #1
<details>
<summary>figures/case_2.jpg Details</summary>

### Visual Description
# Trustees' Annual Report for the Period
## Header Section
- **Title**: Trustees' Annual Report for the period
- **Period Dates**:
- **Start**: 11/02/2016
- **End**: 10/02/2017
## Section A: Reference and Administration Details
### Charity Information
- **Charity Name**: Tender Grace Christian Centre
- **Other Names**: [Blank]
- **Registered Charity Number**: 1165532
- **Principal Address**:
- Flat 13
- 39 Trinity Rise
- London
- SW2 2QP
### Trustee Information
#### Names of the Charity Trustees Who Manage the Charity
| Trustee Name | Office (if any) | Dates Acted If Not For Whole Year | Name of Person (or Body) Entitled to Appoint Trustee (if any) |
|-----------------------------|-----------------|-----------------------------------|---------------------------------------------------------------|
| 1. Mr Anthony Yakubu Membu | | | |
| 2. Miss Happiness Ngosi Jan-Nnyeruka | | | |
| 3. Mrs Olabisi Comfort Ogunjobi | | | |
| 4. Mrs Olayinka Adeola Oyesanya | | | |
| 5. [Blank] | | | |
| 6. [Blank] | | | |
| 7. [Blank] | | | |
| 8. [Blank] | | | |
| 9. [Blank] | | | |
| 10. [Blank] | | | |
| 11. [Blank] | | | |
| 12. [Blank] | | | |
| 13. [Blank] | | | |
| 14. [Blank] | | | |
| 15. [Blank] | | | |
| 16. [Blank] | | | |
| 17. [Blank] | | | |
| 18. [Blank] | | | |
| 19. [Blank] | | | |
| 20. [Blank] | | | |
#### Names of the Trustees for the Charity (Custodian Trustees)
| Name | Dates Acted If Not For Whole Year |
|--------------------------|-----------------------------------|
| [Blank] | |
| [Blank] | |
## Footer
- **Document Code**: TAR 1
- **Date**: March 2012
</details>
| Question: | What is the value for the address postcode? |
| --- | --- |
| GT: | (SW2 2QP) |
| MLP: | (SW22 0PQ) ✗ |
| Ovis: | (SW2 2OP) ✗ |
| Align: | (SW2 2QP) ✓ |
(b) Positive Example #2
<details>
<summary>figures/case_3.jpg Details</summary>

### Visual Description
# Technical Document Extraction: Annual Report Cover Analysis
## Textual Information Extraction
### Primary Text Elements
- **Title**: "ANNUAL REPORT AND FINANCIAL STATEMENTS 2015"
- Position: Centered at bottom of image
- Font: Bold white sans-serif
- Size: Approximately 24pt
- **Organization Logo**:
- **Text**: "human appeal" (lowercase)
- **Logo**: Circular emblem with globe motif
- Position: Bottom-left corner (overlaid on image)
- Size: 1.5cm diameter
### Secondary Text Elements
- **Image Metadata** (visible in top-left corner):
- "Charity No. 1154288 | Company Reg. No. 8553893 | Scotland Reg. No. SC004481"
- Font: Small white serif
- Position: Vertical text along left edge
## Image Description
### Composition
1. **Foreground**:
- Central subject: Young child (approx. 8-10 years old)
- Pose: Right hand raised to head, smiling broadly
- Attire: School uniform (white collared shirt + purple vest)
- Notable detail: Small white hair accessory
2. **Background**:
- Blurred group of children in similar uniforms
- Visible elements: Wooden classroom furniture, notebook on desk
3. **Color Palette**:
- Dominant: Purple (vest), White (shirt), Skin tones
- Accent: Blue (hair accessory)
### Spatial Grounding
- **Logo Position**: [x=25, y=920] (bottom-left corner)
- **Title Position**: [x=400, y=850] (centered bottom)
## Data Analysis
### Chart/Diagram Elements
- **Absence**: No charts, diagrams, or data tables present
- **Visual Focus**: Human-centric imagery with symbolic educational context
## Trend Verification
- **Not Applicable**: No numerical data or trends to analyze
## Component Isolation
1. **Header**: Image metadata (top-left)
2. **Main Content**: Child portrait (centered)
3. **Footer**: Title and logo (bottom)
## Conclusion
This document cover emphasizes human impact through imagery rather than quantitative data. The design prioritizes emotional engagement with beneficiaries over statistical representation.
</details>
| Question: | What is the value for the charity name? |
| --- | --- |
| GT: | (Human Appeal) |
| MLP: | (Humanitarian Agenda) ✗ |
| Ovis: | (Human Appeal) ✓ |
| Align: | (Human Rightsappeal) ✗ |
(c) Negative Example #1
<details>
<summary>figures/case_4.jpg Details</summary>

### Visual Description
# Technical Document Extraction: Bishop's Stortford Baptist Church Annual Report 2017
## Header Section
- **Logo**: "BSBC" with a cross symbol
- **Tagline**: "bishops stortford baptist church"
- **Mission Statement**:
*"This Church aims to be a community of believers living according to the Bible's teaching where people become fully devoted followers of Jesus Christ"*
## Main Content
### Title
- **Document**: "Annual Report 2017"
### Introduction
- **Purpose**:
*"We are pleased to present a report of the life and work of the Church during 2017."*
### Aim and Purposes
- **Vision**:
*"The vision of Bishopâs Stortford Baptist Church is to be a community of believers living according to the Bibleâs teaching so that Christâs redemptive purposes can be accomplished in the world."*
- **Mission**:
*"The Churchâs mission is to enable people to become fully devoted followers of Jesus Christ by the power of the Holy Spirit."*
- **Key Activities**:
- Worship, discipleship, fellowship, outreach, and service
- Deepening relationship with God and expanding influence
### Objectives and Overview
- **Commitment**:
*"The Church is committed to carrying out its activities in line with the Baptist tradition and to encourage and enable people in the local area to attend and take part in its services and meetings."*
- **Activities**:
1. Weekly worship, prayer, Bible study, preaching, and teaching (additional services for non-English speakers)
2. Baptisms on request
3. Meetings for young people
4. Small group meetings (LIFE Groups) for all ages
5. Pastoral care
6. Discipleship for Christian service
7. Evangelism and mission
8. Support for Christian and social charitable action locally
9. Support for overseas mission
10. Involvement with local and national Baptist associations
### Ministry Areas
- **Classification**:
- Worship
- Fellowship
- Reaching Out
- Discipleship and Social Action
- **Support**:
- Resources Team (led by Deacons)
- Spiritual oversight and pastoral care administered by Elders
## Footer
- **Page Number**: "1"
## Notes
- **Language**: All text is in English. No other languages are present.
- **Formatting**: Text is structured hierarchically with bold headings and bullet points.
- **Visual Elements**: The logo includes a stylized cross integrated with the "BSBC" text.
</details>
| Question: | What is the value for the post town address? |
| --- | --- |
| GT: | (Bishop's Stortford) |
| MLP: | (Stortford) ✗ |
| Ovis: | (Bishop's Stortford) ✓ |
| Align: | (Stortford) ✗ |
(d) Negative Example #2
Figure 7: Case Study for Connector Comparison on the KLC dataset [Stanisławek et al., 2021]. We show four qualitative examples (including two correct and two incorrect examples) comparing Llama-3.2-3B-Align to the same architecture with different connectors, Llama-3.2-3B-MLP and Llama-3.2-3B-Ovis. "GT" denotes the ground truth.
<details>
<summary>figures/case_5.jpg Details</summary>

### Visual Description
# Technical Document Extraction: Conference Schedule
## Overview
The document is a structured schedule for a two-day conference/workshop, detailing sessions, leaders, rooms, and breaks. Below is a precise transcription of all textual information.
---
### **Thursday, June 29**
#### **Afternoon**
- **1:00 PM**
**Topic:** Techniques of Interviewing (Mrs. Fink)
**Room:** 123
- **1:30 PM**
**Topic:** Practice Interviews
**Structure:**
| Group | Leader | Room |
|-------|--------------|-------------------------------|
| I | Mrs. Fink | 123 State Health Department |
| II | Miss Grass | 802 State Health Department |
| III | Miss Peck | 627 State Health Department |
| IV | Mr. Price | 510 School of Public Health |
| V | Dr. Croley | 522 School of Public Health |
- **2:45 PM**
**Break:** Recess
- **3:15 PM**
**Topic:** Practice Interviews (continued)
**Note:** Same groups, same rooms
---
### **Friday, June 30**
#### **Morning**
- **8:00 AM**
**Group A:** Statistical Aspects of Epidemiologic Research (Dr. Gaffey)
**Room:** 802
- **9:45 AM**
**Break:** Recess
- **10:15 AM**
**Group A:** Problems in Research Design (Dr. Reynolds)
**Room:** 123
**Group B:** Statistical Aspects of Epidemiologic Research (Dr. Gaffey)
**Room:** 802
- **12:00 PM**
**Lunch**
#### **Afternoon**
- **1:00 PM**
**Topic:** Construction and Use of Questionnaires
**Structure:**
| Group | Leader | Room |
|-------|--------------|-------------------------------|
| A | Dr. Fink | 123 |
| B | Dr. Mellinger| 522 School of Public Health |
- **2:45 PM**
**Break:** Recess
- **3:15 PM**
**Topic:** Construction and Use of Questionnaires (continued)
**Note:** Same groups, same rooms
---
### **Key Observations**
1. **Room Assignments:**
- Rooms 123, 802, 510, and 522 are used for sessions.
- Repeated use of rooms for continued sessions (e.g., Practice Interviews).
2. **Session Leaders:**
- Mrs. Fink, Dr. Gaffey, Dr. Reynolds, Dr. Fink, and Dr. Mellinger are primary leaders.
3. **Recurring Patterns:**
- Sessions alternate between statistical/epidemiologic research and practical workshops (e.g., interviews, questionnaires).
4. **Breaks:**
- Recesses at 2:45 PM on both days.
---
### **Data Table Reconstruction**
#### **Thursday Practice Interviews**
| Group | Leader | Room |
|-------|--------------|-------------------------------|
| I | Mrs. Fink | 123 State Health Department |
| II | Miss Grass | 802 State Health Department |
| III | Miss Peck | 627 State Health Department |
| IV | Mr. Price | 510 School of Public Health |
| V | Dr. Croley | 522 School of Public Health |
#### **Friday Questionnaires Sessions**
| Group | Leader | Room |
|-------|--------------|-------------------------------|
| A | Dr. Fink | 123 |
| B | Dr. Mellinger| 522 School of Public Health |
---
### **Conclusion**
The schedule is organized chronologically, with clear demarcation of sessions, leaders, and room assignments. No charts or diagrams are present; all information is textual. Critical details include room numbers, session topics, and continuity notes (e.g., "same groups, same rooms").
</details>
| Question: | What does the afternoon session begin on June 29? |
| --- | --- |
| GT: | (1:00) |
| MLP: | (2:45) ✗ |
| Ovis: | (3:30) ✗ |
| Align: | (1:00) ✓ |
(a) Positive Example #1
<details>
<summary>figures/case_6.jpg Details</summary>

### Visual Description
# Hemoglobin and Hematocrit Data Analysis - Massachusetts
## Hemoglobin Data - Massachusetts
**Key Finding**: 8% of the surveyed population had unsatisfactory hemoglobin levels (ICNND guidelines).
### Table: Hemoglobin Levels by Age and Gender
| Age Group | Def. (Males) | Low (Males) | Total (Males) | Def. (Females) | Low (Females) | Total (Females) |
|-----------|--------------|-------------|---------------|----------------|---------------|-----------------|
| 0-5 yr | 4.5 | 8.0 | 12.5 | 1.5 | 7.0 | 8.5 |
| 6-12 | 0.2 | 3.8 | 4.0 | 0.2 | 5.0 | 5.2 |
| 13-16 | 3.6 | 12.7 | 16.3 | 0.0 | 3.5 | 3.5 |
| 17-59 | 1.2 | 10.0 | 11.2 | 1.1 | 6.0 | 7.1 |
| 60+ | 0.7 | 14.3 | 15.0 | 0.5 | 4.7 | 5.2 |
---
## Hematocrit Data - Massachusetts
**Key Finding**: 9.2% of the surveyed population had unsatisfactory hematocrit levels (ICNND guidelines).
### Table: Hematocrit Levels by Age and Gender
| Age Group | Def. (Males) | Low (Males) | Total (Males) | Def. (Females) | Low (Females) | Total (Females) |
|-----------|--------------|-------------|---------------|----------------|---------------|-----------------|
| 0-5 yr | 4.4 | 4.0 | 8.4 | 0.5 | 1.5 | 2.0 |
| 6-12 | 0.0 | 3.9 | 3.9 | 0.0 | 5.4 | 5.4 |
| 13-16 | 1.2 | 15.0 | 16.2 | 0.0 | 2.9 | 2.9 |
| 17-59 | 0.7 | 10.0 | 10.7 | 0.5 | 7.8 | 8.3 |
| 60+ | 0.0 | 20.0 | 20.0 | 0.5 | 3.8 | 4.3 |
---
## General Socio-economic Data - Massachusetts
**Key Metrics**:
- **Total persons examined**: 4,568
- **Mean family size**: 5.31
- **Mean income**: $6,500 per annum
- **Mean poverty index ratio**: 2.29
- **% of families below poverty index ratio of 1.00**: 19.5%
---
## Source
https://www.industrydocuments.ucsf.edu/docs/tnbf0227
</details>
| Question: | What levels does the second table indicate? |
| --- | --- |
| GT: | (hematocrit data - Massachusetts) |
| MLP: | (SATISFACTORY) ✗ |
| Ovis: | (Females) ✗ |
| Align: | (hematocrit data - Massachusetts) ✓ |
(b) Positive Example #2
<details>
<summary>figures/case_7.jpg Details</summary>

### Visual Description
# Policy on Document Control
**Policy No. 8**
## Chapter 1: General Provisions
### Article 1 (Purpose)
This Policy shall be intended to stipulate basic matters on creation, storage, and disposal, etc. of documents (including electromagnetic records) handled by the Company and exceptional treatments, etc. in order to respond to lawsuits and other legal/administrative proceedings (hereinafter referred to as "disputes, etc."), so that document information can be properly and effectively managed and utilized.
### Article 2 (General Provisions)
1. Documents of the Company shall be handled in accordance with this Policy, in addition to those stipulated in the "Policy on Handling Contract Documents, etc. and Corporate Seals (Policy No. 6)", the "Policy on Confidential Information Management (Policy No. 18)", and the "Policy on IT Security (Policy No. 67)".
2. Each core organization shall set forth standards for creation, storage, disposal, and other detailed treatment of documents to be handled at the core organization (hereinafter referred to as "Function Standards"), in accordance with the provisions stipulated herein.
### Article 3 (Definitions)
Definitions of the terms used in this Policy shall be as set forth in the following items.
1. **"Documents"** shall mean approval documents, notification documents, reports, conference materials, correspondences, fax/telex transmissions, bills, various records, drawings, microfilms, photographs, video/audio media, electronic mail, and any other documents and electromagnetic records handled in connection with Company business (including those created by external parties, but excluding publications such as newspapers, magazines, and books, etc.).
2. **"Disks"** shall mean electromagnetic recording media, including floppy disks, CD-Rs, and DVD-Rs, etc.
3. **"Storage"** shall mean managing documents in any of the following methods and places for specified periods of time. Documents stored shall be referred to as "Stored Documents."
1. By the originals or write-protected disks, in lockable cabinets or desks, or designated warehouses.
2. By electromagnetic means, in access-controlled servers.
**Footer**
Confidential - Subject to Protective Order
TAKJ-TPC-00000252
Source: https://www.industrydocuments.ucsf.edu/docs/jpjf0226
</details>
| Question: | What type of policy is described in this document? |
| --- | --- |
| GT: | (Policy on Document Control) |
| MLP: | (Policy on Document Control) ✓ |
| Ovis: | (General Provisions) ✗ |
| Align: | (Document Control) ✗ |
(c) Negative Example #1
<details>
<summary>figures/case_8.jpg Details</summary>

### Visual Description
# Technical Document Analysis: "The Ability of Dietary Glycan to Lower Elevated Blood Cholesterol and Triglycerides in Rats"
**Date:** June 18, 1975
---
## **Experiment Overview**
Ten male rats were fed a **control diet** (basal + cholesterol and cholic acid) for four weeks and divided into two groups:
1. **Group 2A**: Continued on the control diet.
2. **Group 2B**: Received a test diet identical to the control diet except **15 parts of glycan replaced 15 parts of sucrose**.
3. **Group #1**: Fed a **basal diet** for four weeks, then switched back to the basal diet.
**Procedure**:
- Whole blood was drawn weekly from the caudal vein for serum cholesterol analysis.
- At the end of four weeks, whole blood was drawn via heart puncture for additional analysis.
---
## **Blood Analysis Table**
| Group | #1 (mg% 0 Wk.) | #2a (mg% 4 Wk.) | #3b (mg% 4 Wk.) |
|-------------|------------------|------------------|------------------|
| **Cholesterol** | 96 | 157 | 157 |
| | 96 | 148 | 115 |
| | 96 | 156 | 99 |
| | 103 | 133 | 111 |
| **Triglycerides** | 293 | 203 | 169 |
---
## **Key Observations**
1. **Cholesterol Levels**:
- Group #1 (basal diet) showed stable cholesterol levels (~96 mg%) over four weeks.
- Group #2a (control diet) had elevated cholesterol (157–156 mg%) initially, decreasing to 133 mg% by week 4.
- Group #3b (glycan-modified diet) showed the lowest cholesterol (115–111 mg%) by week 4.
2. **Triglycerides**:
- Group #1 had significantly higher triglycerides (293 mg%) compared to Groups #2a (203 mg%) and #3b (169 mg%).
---
## **Methodological Notes**
- **Control Diet**: Basal + cholesterol and cholic acid.
- **Test Diet**: Same as control but with glycan replacing sucrose.
- **Sampling**: Weekly blood draws (caudal vein) and final heart puncture for comprehensive analysis.
---
## **Source**
https://www.industrydocuments.ucsf.edu/docs/hnhd0227
</details>
| Question: | What was the diet fed to the #1 group? |
| --- | --- |
| GT: | (basal diet) |
| MLP: | (basel diet) ✗ |
| Ovis: | (Whole blood) ✗ |
| Align: | (control diet) ✗ |
(d) Negative Example #2
Figure 8: Case Study for Connector Comparison on the DocVQA dataset [Mathew et al., 2021b]. We show four qualitative examples (including two correct and two incorrect examples) comparing Llama-3.2-3B-Align to the same architecture with different connectors, Llama-3.2-3B-MLP and Llama-3.2-3B-Ovis. "GT" denotes the ground truth.
<details>
<summary>figures/case_9.jpg Details</summary>

### Visual Description
# Technical Document Extraction Report
## Textual Information
- **Primary Text**: "good bye" (lowercase, handwritten, centered on envelope)
- **Language**: English (no other languages detected)
## Visual Description
### Envelope
- **Type**: Standard postal envelope
- **Border Design**: Red and blue diagonal stripes (alternating pattern)
- **Text Placement**: Centered on the envelope face
- **Condition**: Unsealed, flap folded inward
### Suitcase
- **Color**: Beige/cream
- **Material**: Likely fabric or coated leather
- **Hardware**: Brass-colored metal clasps and handle mechanism
- **Handle**: Leather strap with brass fittings
- **Position**: Horizontal orientation, occupying ~80% of image height
### Background
- **Pattern**: Black and white polka dots with decorative flourishes
- **Texture**: Textured surface (possibly fabric or paper)
- **Contrast**: High contrast between background and suitcase
## Spatial Analysis
- **Envelope Position**: [x: 150, y: 160] to [x: 890, y: 500] (relative to image center)
- **Suitcase Position**: Occupies [y: 0 to y: 850] (full height, centered horizontally)
- **Handle Location**: [x: 230, y: 780] to [x: 770, y: 890]
## Data Extraction Notes
- No charts, diagrams, or data tables present
- No numerical values, categories, or structured data to extract
- No legends, axis titles, or measurement markers identified
## Trend Verification
- No applicable trends (no data series present)
## Component Isolation
1. **Header**: No distinct header elements
2. **Main Content**: Envelope on suitcase (central focus)
3. **Footer**: Background pattern visible at bottom
## Conclusion
The image contains no quantitative data or structured information. All textual content is limited to the handwritten "good bye" message on a decorative envelope. Visual elements suggest a farewell or parting theme through the combination of postal imagery and luggage.
</details>
| Question: | What greeting is written on the letter? |
| --- | --- |
| GT: | (good bye) |
| MLP: | (good) ✗ |
| Ovis: | (good buy) ✗ |
| Align: | (good bye) ✓ |
(a) Positive Example #1
<details>
<summary>figures/case_10.jpg Details</summary>

### Visual Description
# Technical Document Extraction: Skoscan Atomic Clock Display
## Header Section
- **Brand/Model**: Skoscan Atomic Clock
- Symbol: Satellite icon (indicating atomic time synchronization)
- **Primary Display**: Digital clock interface
## Main Data Extraction
### Time
- **Label**: TIME
- **Value**: `12:04`
### Date
- **Label**: DATE
- **Value**: `1/14` (Format: Month/Day)
### Temperature Readings
- **Indoor Temperature**
- Label: INDOOR
- Value: `68.4°F`
- **Outdoor Temperature**
- Label: OUTDOOR
- Value: `-1.4°F`
## Design Features
- **Display Frame**: Black plastic casing
- **Speaker Grille**: Perforated pattern at the bottom of the device
- **Background Context**: Mounted on a wooden surface (partial view)
## Notes
- No charts, diagrams, or non-English text detected.
- All textual elements are in English.
- Spatial grounding: Labels and values are centrally aligned on the display.
- No trend analysis or data series required (static display).
</details>
| Question: | What indoor temperature is shown? |
| --- | --- |
| GT: | (68.4) |
| MLP: | (68 F) ✗ |
| Ovis: | (40.0) ✗ |
| Align: | (68.4) ✓ |
(b) Positive Example #2
<details>
<summary>figures/case_11.jpg Details</summary>

### Visual Description
# Technical Document Extraction: Promotional Poster Analysis
## Summary
This image is a promotional poster for a health club named "topnotch health club." It features bold typography, illustrative figures, and a vibrant color scheme. No charts, diagrams, or data tables are present. Below is a structured extraction of all textual and visual components.
---
### Header Section
- **Primary Text**:
- `"Professional Passionate Personal"`
- Font: Bold, white, with a shadow effect.
- Position: Top-center, spanning three lines.
---
### Main Visual Section
- **Illustrated Figures**:
- **Female Figure**:
- Pose: Pointing gesture with right hand.
- Attire: White tank top, red wristband.
- Accessories: White earbuds.
- **Male Figure**:
- Pose: Standing beside the female figure.
- Attire: Red t-shirt, dark blue jeans.
- Accessories: Silver smartwatch on left wrist.
- **Overlay Text**:
- `"We're the club 4U!"`
- Font: Playful, white with a blue outline.
- Position: Diagonally across the figures' midsection.
- Exclamation mark: Large, stylized, white with a blue outline.
---
### Footer Section
- **Club Branding**:
- **Logo**:
- Symbol: Red abstract "T" (resembling a cross).
- Text: `"topnotch health club"`
- Font: Bold, dark blue.
- Position: Bottom-left, aligned with the logo.
- **Slogan**:
- `"dare to be different"`
- Font: Red, handwritten-style.
- Position: Bottom-right, below the logo.
---
### Background and Design Elements
- **Background**:
- Color: Gradient blue with abstract white and dark blue shapes.
- Texture: Subtle bokeh effect (small white dots).
---
### Key Observations
1. **Typography Hierarchy**:
- Header text is the most prominent, followed by the overlay slogan, and finally the footer branding.
2. **Color Scheme**:
- Dominant colors: Blue (background), white (primary text), red (accent elements).
3. **Visual Messaging**:
- The phrase `"We're the club 4U!"` emphasizes inclusivity and personalization.
- The slogan `"dare to be different"` reinforces the club's unique value proposition.
---
### Notes on Absent Elements
- No charts, diagrams, or data tables are present.
- No non-English text or translations required.
This extraction ensures all textual and visual components are documented for technical reference.
</details>
| Question: | What type of club is advertised? |
| --- | --- |
| GT: | (health club) |
| MLP: | (topnote health club) ✗ |
| Ovis: | (health club) ✓ |
| Align: | (professional passionate personal) ✗ |
(c) Negative Example #1
<details>
<summary>figures/case_12.jpg Details</summary>

### Visual Description
# Technical Document: Hadiyah Plus Advertisement Analysis
## Header Section
- **Main Title**: "Hadiyah Plus"
- Font: Stylized cursive, gold color
- Position: Top center, spans full width
- **Tagline**: "MORE REWARDS WITH EVERY SWIPE"
- Font: Bold uppercase, dark brown
- Position: Directly below title, centered
## Main Content
### City Destinations
A horizontal row of five images representing travel destinations, each labeled with a city name in a brown banner:
1. **Singapore**
- Image: Merlion statue spraying water
- Position: Leftmost
2. **Penang**
- Image: Urban skyline with mountains
- Position: Second from left
3. **Bangkok**
- Image: Golden-roofed temple
- Position: Center
4. **Bali**
- Image: Balinese temple with water feature
- Position: Second from right
5. **Hong Kong**
- Image: Skyline with harbor
- Position: Rightmost
### Loyalty Programme Badge
- **Text**: "AirAsia BIG LOYALTY Programme"
- Font: Bold uppercase, white
- Background: Red circular badge
- Position: Bottom right corner
## Footer Section
- **Offer Text**: "You can now convert your Hadiyah Points to BIG Points to redeem for free flights."
- Font: Bold, dark brown
- Position: Centered, spans full width
### Credit Card Visuals
Three credit card images at the bottom:
1. **Hadiyah Plus Card**
- Color: Gold
- Labels: "Hadiyah Plus" (top), "BIG" (bottom)
- Position: Left
2. **BIG Loyalty Card**
- Color: Silver
- Labels: "BIG Loyalty" (top), "Hadiyah Plus" (bottom)
- Position: Center
3. **Mastercard/Visa Card**
- Color: Black
- Labels: "Mastercard" (left), "Visa" (right)
- Position: Right
## Design Elements
- **Background**: Gold gradient with subtle geometric patterns
- **Layout**: Symmetrical, centered alignment for text and images
## Notes
- **Language**: All text in English.
- **No Data Charts/Tables**: The image is promotional; no numerical data or trends present.
- **Spatial Grounding**:
- Red badge: [x: 750, y: 560] (relative to image dimensions)
- Credit cards: Horizontally aligned at bottom edge.
This document captures all textual and structural elements for technical reference.
</details>
| Question: | What credit card is this? |
| --- | --- |
| GT: | (hadiah plus) |
| MLP: | (hadiah plus) ✓ |
| Ovis: | (american big loyalty program) ✗ |
| Align: | (hadia plus) ✗ |
(d) Negative Example #2
Figure 9: Case Study for Connector Comparison on the TextVQA dataset [Singh et al., 2019]. We show four qualitative examples (including two correct and two incorrect examples) comparing Llama-3.2-3B-Align to the same architecture with different connectors, Llama-3.2-3B-MLP and Llama-3.2-3B-Ovis. "GT" denotes the ground truth.
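The ✓/✗ marks in Figures 8 and 9 indicate whether a connector's answer matches the ground truth. The exact scoring rule is not restated alongside these case studies, so the snippet below is only a minimal illustrative sketch, assuming lowercase/whitespace normalization followed by an ANLS-style string-similarity fallback; the helper names `normalize` and `is_correct` are hypothetical.

```python
import re
from difflib import SequenceMatcher

def normalize(answer: str) -> str:
    # Lowercase and collapse whitespace before comparing answers.
    return re.sub(r"\s+", " ", answer.lower().strip())

def is_correct(prediction: str, ground_truth: str, threshold: float = 0.5) -> bool:
    # Exact match after normalization, otherwise fall back to a
    # string-similarity check loosely inspired by ANLS-style scoring.
    pred, gt = normalize(prediction), normalize(ground_truth)
    if pred == gt:
        return True
    return SequenceMatcher(None, pred, gt).ratio() >= threshold

# Answers from Figure 9(b): Align matches the ground truth, Ovis does not.
print(is_correct("68.4", "68.4"))  # True
print(is_correct("40.0", "68.4"))  # False
```

Note that the official protocols differ per benchmark (ANLS for DocVQA, VQA accuracy over annotator answers for TextVQA), so this simplified check is for illustration only.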