# AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Document Understanding
## Abstract
Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models hinges on having a good connector that maps visual features generated by a vision encoder to a shared embedding space with the LLM while preserving semantic similarity. Existing connectors, such as multilayer perceptrons (MLPs), lack inductive bias to constrain visual features within the linguistic structure of the LLM's embedding space, making them data-hungry and prone to cross-modal misalignment. In this work, we propose a novel vision-text alignment method, AlignVLM, that maps visual features to a weighted average of LLM text embeddings. Our approach leverages the linguistic priors encoded by the LLM to ensure that visual features are mapped to regions of the space that the LLM can effectively interpret. AlignVLM is particularly effective for document understanding tasks, where visual and textual modalities are highly correlated. Our extensive experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods, with larger gains on document understanding tasks and under low-resource setups. We provide further analysis demonstrating its efficiency and robustness to noise.
## 1 Introduction
Vision-Language Models (VLMs) have gained significant traction in recent years as a powerful framework for multimodal document understanding tasks that involve interpreting both the visual and textual contents of scanned documents (Kim et al., 2022; Lee et al., 2023; Liu et al., 2023a, 2024; Hu et al., 2024; Wang et al., 2023a; Rodriguez et al., 2024b). Such tasks are common in real-world commercial applications, including invoice parsing (Park et al., 2019), form reading (Jaume et al., 2019), and document question answering (Mathew et al., 2021b). VLM architectures typically consist of three components: (i) a vision encoder to process raw images, (ii) a Large Language Model (LLM) pre-trained on text, and (iii) a connector module that maps the visual features from the vision encoder into the LLM's semantic space.
A central challenge in this pipeline is to effectively map the continuous feature embeddings of the vision encoder into the latent space of the LLM while preserving the semantic properties of visual concepts. Existing approaches can be broadly categorized into deep fusion and shallow fusion methods. Deep fusion methods, such as NVLM (Dai et al., 2024), Flamingo (Alayrac et al., 2022), CogVLM (Wang et al., 2023b), and Llama 3.2-Vision (Grattafiori et al., 2024), integrate visual and textual features by introducing additional cross-attention and feed-forward layers at each layer of the LLM. While effective at enhancing cross-modal interaction, these methods substantially increase the parameter count of the VLM compared to the base LLM, resulting in high computational overhead and reduced efficiency.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Radar Chart: Model Performance Comparison Across Visual Question Answering and Document Understanding Benchmarks
### Overview
The image is a radar chart (spider plot) comparing the performance of four different AI models across nine distinct benchmarks. The chart uses a multi-axis layout where each axis represents a specific benchmark, and the distance from the center indicates the performance score (higher is better). The four models are represented by colored lines with filled areas underneath, creating overlapping polygons.
### Components/Axes
* **Chart Type:** Radar Chart / Spider Plot
* **Axes (Benchmarks):** Nine axes radiate from the center, each labeled with a benchmark name. Clockwise from the top:
1. **DeepForm**
2. **InfoVQA**
3. **DocVQA**
4. **TableVQA**
5. **TextVQA**
6. **ChartQA**
7. **TabFact**
8. **WTQ**
9. **KLC**
* **Legend (Bottom Center):** A legend identifies the four data series:
* **Brown Line with Circle Markers:** `Llama-3.2-3B-Perceiver R.`
* **Green Line with Circle Markers:** `Llama-3.2-3B-MLP`
* **Blue Line with Circle Markers:** `Llama-3.2-3B-Ovis`
* **Orange Line with Circle Markers:** `Llama-3.2-3B-Align (ours)`
* **Data Points:** Each axis has numerical values plotted for each model, connected by lines. The values are explicitly labeled on the chart near their respective data points.
### Detailed Analysis
The following table reconstructs the performance scores for each model on each benchmark. Values are transcribed directly from the chart labels.
| Benchmark | Llama-3.2-3B-Perceiver R. (Brown) | Llama-3.2-3B-MLP (Green) | Llama-3.2-3B-Ovis (Blue) | Llama-3.2-3B-Align (ours) (Orange) |
| :--- | :--- | :--- | :--- | :--- |
| **DeepForm** | 57.08 | 62.07 | 58.02 | **63.49** |
| **InfoVQA** | 34.13 | 37.56 | 42.11 | **44.53** |
| **DocVQA** | 47.76 | 69.08 | 74.68 | **79.63** |
| **TableVQA** | 50.96 | 53.56 | 53.93 | **60.1** |
| **TextVQA** | 51.33 | 52.6 | 53.93 | **57.38** |
| **ChartQA** | 65.16 | 66.48 | 67.92 | **71.88** |
| **TabFact** | 71.93 | 73.22 | 76.67 | **78.51** |
| **WTQ** | 28.94 | 33.13 | 33.13 | **38.59** |
| **KLC** | 31.75 | 33.36 | 33.5 | **35.25** |
**Visual Trend Verification:**
* **Llama-3.2-3B-Align (Orange):** Forms the outermost polygon on the chart. Its line consistently encloses the lines of the other three models across all axes, indicating superior performance on every benchmark.
* **Llama-3.2-3B-Ovis (Blue):** Generally forms the second-outermost layer, closely following the orange line but consistently inside it.
* **Llama-3.2-3B-MLP (Green):** Typically resides inside the blue polygon, showing lower performance than Ovis but higher than Perceiver R. on most tasks.
* **Llama-3.2-3B-Perceiver R. (Brown):** Forms the innermost polygon, indicating the lowest performance among the four models across all benchmarks.
### Key Observations
1. **Consistent Hierarchy:** There is a clear and consistent performance hierarchy across all nine benchmarks: `Align (ours)` > `Ovis` > `MLP` > `Perceiver R.`. No benchmark shows a reversal of this order.
2. **Performance Spread:** The performance gap between the best (`Align`) and worst (`Perceiver R.`) model varies significantly by task.
* **Largest Gaps:** DocVQA (79.63 vs. 47.76, a 31.87-point difference) and InfoVQA (44.53 vs. 34.13, a 10.4-point difference).
* **Smallest Gaps:** KLC (35.25 vs. 31.75, a 3.5-point difference) and TextVQA (57.38 vs. 51.33, a 6.05-point difference).
3. **Benchmark Difficulty:** The absolute scores suggest varying difficulty across benchmarks for these models.
* **Highest Scores:** Models achieve their highest scores on **TabFact** (all models >71) and **DocVQA** (top model nearly 80).
* **Lowest Scores:** Models struggle most with **WTQ** (all models <39) and **KLC** (all models <36).
4. **Model Strengths:** While `Align` leads everywhere, its most dominant performances are in document and visual understanding tasks like **DocVQA** and **InfoVQA**.
### Interpretation
This radar chart serves as a comprehensive benchmark evaluation, likely from a research paper introducing the `Llama-3.2-3B-Align` model. The data demonstrates that the proposed `Align` method provides a consistent and significant improvement over three baseline variants (`Perceiver R.`, `MLP`, `Ovis`) of the same underlying 3B-parameter Llama model architecture across a diverse suite of tasks involving visual question answering, document understanding, table parsing, and chart interpretation.
The consistent hierarchy suggests that the modifications in the `Align` variant are fundamentally more effective for multimodal reasoning than the architectural choices in the other variants. The particularly large gains on **DocVQA** and **InfoVQA** indicate that the `Align` approach may be especially adept at extracting and reasoning about information from complex, text-rich documents and images. Conversely, the smaller gaps on **KLC** and **TextVQA** might suggest these tasks rely on capabilities where the architectural differences have less impact, or they represent a closer performance ceiling for this model scale.
The chart effectively argues for the superiority of the `Align` method by showing it is not just better on average, but universally better across every single measured dimension of performance.
</details>
Figure 1: Performance of Different VLM Connectors. The proposed Align connector outperforms other methods across benchmarks using the same training configuration. Radial distance is the proportion of the maximal score, truncated at $0.7$ (black dot).
In contrast, shallow fusion methods project visual features from the vision encoder into the LLM input embedding space using either multilayer perceptrons (MLPs) (Liu et al., 2023b, 2024), convolution mappings such as HoneyBee (Cha et al., 2024) and H-Reducer (Hu et al., 2024), or attention-based mechanisms such as the Perceiver Resampler (Li et al., 2023b; Laurençon et al., 2024; Alayrac et al., 2022). This approach is more parameter-efficient and computationally lighter than deep fusion methods. However, these connectors lack inductive bias to ensure that the projected features remain within the region spanned by the LLM's pretrained text embeddings. Consequently, the projected visual features may fall outside the distribution the LLM was trained on, leading to noisy or misaligned representations. Moreover, these mappings are typically learned from scratch, making them data-inefficient and less effective under low-resource conditions.
Recent methods like Ovis (Lu et al., 2024) attempt to alleviate these issues by introducing separate visual embeddings, indexed by the vision encoder outputs and combined to construct the visual inputs to the LLM. However, this approach significantly increases the parameter count due to the massive embedding matrix, and it requires extensive training to learn a new embedding space without guaranteeing alignment with the LLM's input latent space.
To address these limitations, this paper introduces AlignVLM, a novel framework that sidesteps direct projection of visual features into the LLM embedding space. Instead, our proposed connector, Align, maps visual features into probability distributions over the LLM's existing pretrained vocabulary embeddings, which are then combined into a weighted representation of the text embeddings. By constraining each visual feature to be a convex combination of the LLM text embeddings, our approach leverages the linguistic priors already encoded in the LLM's text space. This ensures that the resulting visual features lie within the convex hull of the LLM's embedding space, reducing the risk of noisy or out-of-distribution inputs and improving alignment between modalities. The connector thus enables faster convergence and stronger performance, particularly in low-resource scenarios.
Our experimental results show that Align improves performance on various document understanding tasks, outperforming prior connector methods, with especially large gains in low-data regimes. We summarize our main contributions as follows:
- We propose a novel connector, Align, to bridge the representation gap between vision and text modalities.
- We introduce a family of Vision-Language Models, AlignVLM, that achieves state-of-the-art performance on multimodal document understanding tasks by leveraging Align.
- We conduct extensive experiments demonstrating the robustness and effectiveness of Align across different LLM sizes and training data setups.
We release our code and research artifacts at alignvlm.github.io.
## 2 Related Work
### 2.1 Vision-Language Models
Over the past few years, Vision-Language Models (VLMs) have achieved remarkable progress, largely due to advances in Large Language Models (LLMs). Initially demonstrating breakthroughs in text understanding and generation (Brown et al., 2020; Raffel et al., 2023; Achiam et al., 2023; Grattafiori et al., 2024; Qwen et al., 2025; Team, 2024), LLMs are now increasingly used to effectively interpret visual inputs (Liu et al., 2023b; Li et al., 2024; Wang et al., 2024; Chen et al., 2024b; Dai et al., 2024; Drouin et al., 2024; Rodriguez et al., 2022). This progress has enabled real-world applications across diverse domains, particularly in multimodal document understanding for tasks like form reading (Svetlichnaya, 2020), document question answering (Mathew et al., 2021b), and chart question answering (Masry et al., 2022). VLMs commonly adopt a three-component architecture: a pretrained vision encoder (Zhai et al., 2023; Radford et al., 2021), an LLM, and a connector module. A key challenge for VLMs is effectively aligning visual features with the LLM's semantic space to enable accurate and meaningful multimodal interpretation.
### 2.2 Vision-Language Alignment for Multimodal Models
Existing vision-language alignment approaches can be classified into deep fusion and shallow fusion. Deep fusion methods integrate visual and textual features by modifying the LLM's architecture, adding cross-attention and feed-forward layers. For example, Flamingo (Alayrac et al., 2022) employs the Perceiver Resampler, which uses fixed latent embeddings to attend to vision features and fuses them into the LLM via gated cross-attention layers. Similarly, NVLM (Dai et al., 2024) adopts gated cross-attention while replacing the Perceiver Resampler with a simpler MLP. CogVLM (Wang et al., 2023b) extends this approach by incorporating new feed-forward (FFN) and QKV layers for the vision modality within every layer of the LLM. While these methods improve cross-modal alignment, they significantly increase parameter counts and computational overhead, making them less efficient.
On the other hand, shallow fusion methods are more computationally efficient, mapping visual features into the LLM's embedding space without altering its architecture. These methods can be categorized into four main types: (1) MLP-based mapping, such as LLaVA (Liu et al., 2023b) and PaliGemma (Beyer et al., 2024), which use multilayer perceptrons (MLPs) to project visual features but often produce misaligned or noisy features due to a lack of constraints and inductive bias (Rodriguez et al., 2024b); (2) cross-attention mechanisms, such as the Q-Former in BLIP-2 (Li et al., 2023b), which uses a fixed set of latent embeddings to cross-attend to visual features but may still produce noisy or OOD visual features; (3) convolution-based mechanisms, such as HoneyBee (Cha et al., 2024) and H-Reducer (Hu et al., 2024), which leverage convolutional or ResNet (He et al., 2015) layers to preserve spatial locality while reducing dimensionality; and (4) visual embeddings, such as those introduced by Ovis (Lu et al., 2024), which use embeddings indexed by the vision encoder's outputs to produce the visual inputs. While this regularizes feature mapping, it adds substantial parameter overhead and creates a new vision embedding space, risking misalignment with the LLM's text embedding space. Encoder-free VLMs, like Fuyu-8B (https://www.adept.ai/blog/fuyu-8b) and EVE (Diao et al., 2024), eliminate dedicated vision encoders but show degraded performance (Beyer et al., 2024).
In contrast, AlignVLM maps visual features from the vision encoder into probability distributions over the LLM's text embeddings, using them to compute a convex combination. By leveraging the linguistic priors encoded in the LLM's vocabulary, AlignVLM ensures that visual features remain within the convex hull of the text embeddings. This design mitigates noisy or out-of-distribution projections and achieves stronger multimodal alignment, particularly in tasks that require joint representation of both modalities, such as multimodal document understanding, and in low-resource settings.
<details>
<summary>x2.png Details</summary>

### Visual Description
## System Architecture Diagram: Multimodal Alignment Module
### Overview
This image is a technical system architecture diagram illustrating the data flow and processing pipeline for a multimodal AI model. It specifically details how visual inputs (infographics) and a text-based question are processed, aligned, and fed into a Large Language Model (LLM) to generate a response. The diagram emphasizes an "Align Module" that integrates vision and text embeddings.
### Components/Axes
The diagram is organized into a left-to-right flow with distinct processing pathways for vision and text, which converge in a central alignment module.
**1. Input Examples (Left Side):**
* Six small infographic panels are shown as example visual inputs. Each has a title and a graphic.
* Top Row (Left to Right):
* "AMERICANS WANT TO BE PROTECTED FROM BEING SCAMMED" (Image: A person at a desk with a shield).
* "AMERICANS LOVE VIDEO GAMES" (Image: A grid of pixelated icons).
* "AMERICANS ARE HYPER-SOCIAL" (Image: A network diagram of connected people).
* Bottom Row (Left to Right):
* "AMERICANS ARE CONNECTED" (Image: A row of stylized people icons).
* "AMERICANS LOVE ROUTINE" (Image: A pie chart).
* "AMERICANS ARE HIGHLY INFLUENCED BY OTHERS" (Image: A flowchart with icons).
* Below these panels is a text question: **"Question: What percentage of Americans are online?"**
**2. Vision Processing Pathway (Blue):**
* **Vision Encoder:** A blue box that receives the visual inputs (the infographics).
* **Align Module (Vision Sub-Path):** A dashed orange box labeled **"Align Module"** at the top. Inside, the vision pathway continues:
* A sequence of processing layers: `Linear` -> `Layer Norm` -> `LM Head (LLM)` (highlighted in pink) -> `Layer Norm` -> `Softmax`.
* The output goes to a **"Weighted Average Sum"** block.
* The final output is a stack of blue rectangles labeled **"Vision Inputs"**.
**3. Text Processing Pathway (Pink):**
* **Text Tokenizer:** A pink box that receives the text question.
* **LLM Embedding Matrix:** A large pink block representing the model's embedding space.
* **Align Module (Text Sub-Path):** Within the same dashed orange box:
* A pink arrow labeled **"Full Embedding Matrix"** points from the LLM Embedding Matrix to the "Weighted Average Sum" block in the vision path.
* Another pink arrow labeled **"Selected Text Embeddings"** points from the LLM Embedding Matrix to a stack of pink rectangles labeled **"Text Inputs"**.
**4. Integration and Output (Right Side):**
* The **"Vision Inputs"** (blue stack) and **"Text Inputs"** (pink stack) are shown as parallel inputs feeding into a large pink box labeled **"LLM"**.
* An arrow points from the LLM box to the final output: **"Response: 90%"**.
### Detailed Analysis
The diagram meticulously maps the transformation of raw multimodal data into a format suitable for an LLM.
* **Flow & Relationships:** The core process involves two parallel streams:
1. **Vision Stream:** Raw images -> Vision Encoder -> Alignment processing (Linear, Norm, LM Head, Softmax) -> Weighted Average -> Vision Embeddings.
2. **Text Stream:** Raw text question -> Tokenizer -> Lookup in LLM Embedding Matrix -> Selected Text Embeddings.
* **The Align Module's Role:** This is the critical junction. It uses the **"Full Embedding Matrix"** (from the text model's knowledge) to inform the **"Weighted Average Sum"** operation on the vision features. This suggests the module aligns visual features with the semantic space of the language model. The "LM Head (LLM)" layer within the vision pathway is a key indicator of this cross-modal alignment.
* **Spatial Grounding:** The "Align Module" label is centered at the top of the dashed box. The "Weighted Average Sum" is positioned in the upper-right quadrant of this module. The "Vision Inputs" and "Text Inputs" stacks are vertically aligned to the right of the Align Module, clearly showing they are the final, aligned products ready for the LLM.
* **Component Isolation:**
* **Header/Inputs:** The left third defines the problem space (visual context + specific query).
* **Main Processing:** The central two-thirds detail the encoding and alignment mechanics.
* **Footer/Output:** The right side shows the output.
The diagram does not contain numerical data charts or graphs. It is a process flow diagram. Therefore, there are no numerical data points, trends, or outliers to extract. The only quantitative value is the final model output: **"90%"**.
### Key Observations
1. **Asymmetric Processing:** The vision pathway undergoes more explicit, multi-step transformation (Linear, Norm, Head, Softmax, Weighted Average) before becoming "Vision Inputs," while the text pathway appears to involve a more direct selection from the embedding matrix.
2. **Central Role of the LLM Embedding Matrix:** This matrix is not just for text; it actively guides the alignment of visual features via the "Full Embedding Matrix" connection, indicating a deeply integrated multimodal design.
3. **Purpose-Built for Question Answering:** The entire pipeline is framed around answering a specific question about the visual input, as demonstrated by the sample question and the direct "Response: 90%" output.
### Interpretation
This diagram illustrates a sophisticated architecture for a **visually-grounded language model**. The system doesn't just process an image and a question separately; it actively *aligns* the visual information with the linguistic and conceptual space of the LLM before any reasoning occurs.
The "Align Module" is the key innovation shown. By using the LLM's own embedding matrix to weight and select relevant visual features, the model ensures that the "Vision Inputs" it ultimately sees are already primed to be relevant to the textual query and the LLM's internal knowledge. This is a Peircean investigative process: the visual data (the sign, or *representamen*) of the infographics is interpreted (aligned) through the lens of the LLM's conceptual framework (the interpretant) to produce a meaningful answer (the object, which is the fact "90%").
The final "Response: 90%" to the question "What percentage of Americans are online?" implies that one of the input infographics (likely "AMERICANS ARE CONNECTED" or "AMERICANS ARE HYPER-SOCIAL") contained this specific data point, and the model successfully extracted and reported it. The architecture is designed to make such extraction robust by ensuring visual features are contextually relevant before the LLM ever processes them.
</details>
Figure 2: AlignVLM Model Architecture. The vision encoder extracts image features, which are processed to produce probabilities over the LLM embeddings. A weighted average of the LLM embeddings, using these probabilities as weights, produces the vision input vectors. Text inputs are tokenized and their embeddings are selected from the embedding matrix; the vision and text embeddings are then concatenated and fed as input to the LLM. We display the vision layers in blue and the text layers in purple.
## 3 Methodology
### 3.1 Model Architecture
The overall model architecture, shown in Figure 2, consists of three main components:
(1) Vision Encoder.
To handle high-resolution images of different aspect ratios, we divide each input image into multiple tiles according to one of the predefined aspect ratios (e.g., $1{:}1, 1{:}2, \dots, 9{:}1$) chosen via a coverage ratio (Lu et al., 2024; Chen et al., 2024a). Due to limited computational resources, we set the maximum number of tiles to 9. Each tile is further partitioned into $14 \times 14$ patches, projected into vectors, and processed by a SigLIP-400M vision encoder (Zhai et al., 2023) to extract contextual visual features.
Each tile $t \in \{1, \dots, T\}$ is divided into $N_t$ patches
$$
P_t = \{p_{t,1}, \dots, p_{t,N_t}\},
$$
where $p_{t,i}$ is the $i$-th patch of tile $t$. The vision encoder maps these patches to a set of visual feature vectors
$$
F_t = \mathrm{VisionEncoder}(P_t), \qquad F_t = \{f_{t,1}, \dots, f_{t,N_t}\}, \qquad f_{t,i} \in \mathbb{R}^d.
$$
Finally, we concatenate the feature sets across all tiles into a single output
$$
F = \mathrm{concat}\bigl(F_1, F_2, \dots, F_T\bigr).
$$
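For illustration, the sketch below shows how such a tiling step could be implemented. It is a minimal sketch, not the released pipeline: the 378-pixel tile resolution, the aspect-ratio-matching grid selection, and the function names are assumptions; the text only fixes the $14 \times 14$ patch size, the maximum of 9 tiles, and the coverage-ratio criterion of Lu et al. (2024).

```python
import torch
import torch.nn.functional as F

# Assumed sizes: the paper fixes 14x14 patches and at most 9 tiles; the 378-pixel
# tile resolution (378 / 14 = 27 patches per side, i.e. 729 patches per tile) is
# an illustrative choice, not a value stated in the text.
TILE_SIZE, PATCH_SIZE, MAX_TILES = 378, 14, 9

def choose_grid(h: int, w: int) -> tuple[int, int]:
    """Pick the (rows, cols) tile grid whose aspect ratio best matches the image,
    a simplified stand-in for the coverage-ratio criterion cited in the paper."""
    candidates = [(r, c) for r in range(1, MAX_TILES + 1)
                  for c in range(1, MAX_TILES + 1) if r * c <= MAX_TILES]
    return min(candidates, key=lambda rc: abs(w / h - rc[1] / rc[0]))

def tile_and_patchify(image: torch.Tensor) -> torch.Tensor:
    """image: (3, H, W) -> patches: (num_tiles, patches_per_tile, 3 * PATCH_SIZE**2)."""
    _, h, w = image.shape
    rows, cols = choose_grid(h, w)
    resized = F.interpolate(image[None], size=(rows * TILE_SIZE, cols * TILE_SIZE),
                            mode="bilinear", align_corners=False)[0]
    # Cut the resized image into tiles, then each tile into 14x14 patches.
    tiles = resized.unfold(1, TILE_SIZE, TILE_SIZE).unfold(2, TILE_SIZE, TILE_SIZE)
    tiles = tiles.permute(1, 2, 0, 3, 4).reshape(-1, 3, TILE_SIZE, TILE_SIZE)
    patches = tiles.unfold(2, PATCH_SIZE, PATCH_SIZE).unfold(3, PATCH_SIZE, PATCH_SIZE)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(tiles.shape[0], -1, 3 * PATCH_SIZE ** 2)
    return patches  # each patch sequence is then encoded by SigLIP into d-dimensional features
```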
(2) Align Module.
This module aligns the visual features with the LLM. A linear layer $W_1 \in \mathbb{R}^{D \times d}$ first projects the visual features $F \in \mathbb{R}^{T \cdot N_t \times d}$ to the LLM's token embedding space: one $\mathbb{R}^D$ vector per token. A second linear layer $W_2 \in \mathbb{R}^{V \times D}$ (initialized from the LLM's language-model head), followed by a softmax, produces a probability simplex $P_{\text{vocab}}$ over the LLM's vocabulary ($V$ tokens)
$$
P_{\text{vocab}} = \mathrm{softmax}\bigl(W_2 (W_1 F)\bigr). \tag{1}
$$
We then use the LLM text embeddings $E_{\text{text}} \in \mathbb{R}^{V \times D}$ to compute a weighted sum
$$
F^{\prime}_{\text{align}} = P_{\text{vocab}}^{\top} E_{\text{text}}. \tag{2}
$$
Finally, we concatenate $F^{\prime}_{\text{align}}$ with the tokenized text embeddings to form the LLM input
$$
H_{\text{input}} = \mathrm{concat}\bigl(F^{\prime}_{\text{align}}, E_{\text{text}}(x)\bigr),
$$
where $E_{\text{text}}(x)$ is obtained by tokenizing the input text $x = (x_1, \dots, x_M)$ and selecting the corresponding embeddings from $E_{\text{text}}$ such that
$$
E_{\text{text}}(x) = \bigl[E_{\text{text}}(x_1), \dots, E_{\text{text}}(x_M)\bigr]. \tag{3}
$$
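As a concrete reference, the following PyTorch sketch implements Eqs. (1)-(2) with the Linear → LayerNorm → LM head → LayerNorm → Softmax ordering shown in Figure 2. It is a hedged sketch rather than the released implementation: the class name, the use of `nn.LayerNorm`, the buffer holding $E_{\text{text}}$, and the row-major shape convention (one feature vector per row) are assumptions.

```python
import torch
import torch.nn as nn

class AlignConnector(nn.Module):
    """Sketch of the Align module: project, score against the vocabulary, softmax,
    then take a convex combination of the LLM's text embeddings."""

    def __init__(self, d_vision: int, d_llm: int, lm_head: nn.Linear,
                 text_embeddings: torch.Tensor):
        super().__init__()
        self.proj = nn.Linear(d_vision, d_llm)               # W_1 in R^{D x d}
        self.norm1 = nn.LayerNorm(d_llm)
        vocab_size = lm_head.out_features                    # V
        self.lm_head = nn.Linear(d_llm, vocab_size, bias=False)
        self.lm_head.weight.data.copy_(lm_head.weight.data)  # W_2 initialized from the LM head
        self.norm2 = nn.LayerNorm(vocab_size)
        self.register_buffer("text_embeddings", text_embeddings)  # E_text in R^{V x D}

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (num_patches, d_vision), one row per visual feature vector
        h = self.norm1(self.proj(vision_feats))                       # (N, D)
        p_vocab = torch.softmax(self.norm2(self.lm_head(h)), dim=-1)  # Eq. (1): (N, V)
        return p_vocab @ self.text_embeddings                         # Eq. (2): (N, D)
```

Since every output row is a probability-weighted average of rows of $E_{\text{text}}$, it lies in the convex hull of the text embeddings by construction, which is the property discussed in Section 3.2.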
(3) Large Language Model.
We feed the concatenated vision and text vectors, $H_{\text{input}}$, into the LLM, which then generates output text auto-regressively. To demonstrate the effectiveness of our alignment technique, we experiment with the Llama 3 model family (Grattafiori et al., 2024). These models offer state-of-the-art performance and permissive licenses, making them suitable for commercial applications. In particular, we utilize Llama 3.2-1B, Llama 3.2-3B, and Llama 3.1-8B.
### 3.2 Motivation and relation with existing methods
By construction, each $\mathbb{R}^D$ representation in $F^{\prime}_{\text{align}}$ is constrained to the convex hull of the points $E_{\text{text}}$, thus concentrating the visual features in the part of latent space that the LLM can effectively interpret. Moreover, we argue that our initialization of $W_2$ to the language model head is an inductive bias toward recycling some of the semantics of these text tokens into visual tokens. This contrasts with past methods that have been proposed to adapt the vision encoder outputs $F \in \mathbb{R}^{T \cdot N_t \times d}$ to an $F^{\prime} \in \mathbb{R}^{T \cdot N_t \times D}$ to be fed to the LLM. Here, we consider two examples in more detail, highlighting these contrasts.
(1) MLP Connector Liu et al. (2023b) applies a linear projection with parameters $W_{\text{MLP}} \in \mathbb{R}^{D \times d}$ and $b_{\text{MLP}} \in \mathbb{R}^{D}$, followed by an activation function $\sigma$ (e.g., ReLU)
$$
F^{\prime}_{\text{MLP}} = \sigma(W_{\text{MLP}} F + b_{\text{MLP}}).
$$
These parameters are all learned from scratch, without any bias aligning them to text embeddings.
(2) Visual Embedding Table Lu et al. (2024) introduces an entirely new set of visual embeddings $E_{\text{VET}} \in \mathbb{R}^{K \times D}$ which, together with the weights $W_{\text{VET}} \in \mathbb{R}^{K \times d}$, specifies
$$
F^{\prime}_{\text{VET}} = (W_{\text{VET}} F)^{\top} E_{\text{VET}}.
$$
When $D < d$, our $W_2 W_1$ amounts to a low-rank version of $W_{\text{VET}}$. There is thus much more to learn to obtain $F^{\prime}_{\text{VET}}$, and there is again no explicit pressure to align it with the text embeddings.
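For contrast, the two baselines above reduce to a few lines each. This is a minimal sketch under stated assumptions: the GELU activation in the MLP connector (the text only names ReLU as an example of $\sigma$) and the softmax over the $K$ table entries in the Ovis-style variant are illustrative choices, not the exact baseline implementations.

```python
import torch
import torch.nn as nn

class MLPConnector(nn.Module):
    """F'_MLP = sigma(W_MLP F + b_MLP); all parameters are learned from scratch."""
    def __init__(self, d_vision: int, d_llm: int):
        super().__init__()
        self.proj = nn.Linear(d_vision, d_llm)
        self.act = nn.GELU()  # assumed activation; the paper only names ReLU as an example

    def forward(self, f: torch.Tensor) -> torch.Tensor:  # f: (N, d_vision)
        return self.act(self.proj(f))  # (N, D), unconstrained in the LLM latent space

class VisualEmbeddingTable(nn.Module):
    """Ovis-style connector: index a freshly learned table E_VET instead of E_text."""
    def __init__(self, d_vision: int, d_llm: int, k: int):
        super().__init__()
        self.w_vet = nn.Linear(d_vision, k, bias=False)   # W_VET in R^{K x d}
        self.e_vet = nn.Parameter(torch.randn(k, d_llm))  # E_VET in R^{K x D}, learned from scratch

    def forward(self, f: torch.Tensor) -> torch.Tensor:  # f: (N, d_vision)
        weights = torch.softmax(self.w_vet(f), dim=-1)    # softmax added here as an assumption
        return weights @ self.e_vet                       # (N, D)
```

Neither module involves $E_{\text{text}}$, which is the structural difference that Align exploits.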
### 3.3 Training Datasets & Stages
We train our model in three stages:
Stage 1.
This stage focuses on training the Align Module to map visual features to the LLM's text embeddings effectively. We use the CC-12M dataset Changpinyo et al. (2021), a large-scale web dataset commonly used for VLM pretraining Liu et al. (2023b), which contains 12M image-text pairs. However, due to broken or unavailable links, we retrieved 8.1M pairs. This dataset facilitates the alignment of visual features with the text embedding space of the LLM. During this stage, we train the full model, as this approach improves performance and stabilizes the Align Module training.
Stage 2.
The goal is to enhance the model's document understanding capabilities, such as OCR, document structure comprehension, in-depth reasoning, and instruction-following. We leverage the BigDocs-7.5M dataset Rodriguez et al. (2024a), a curated collection of license-permissive datasets for multimodal document understanding. This dataset aligns with the Accountability, Responsibility, and Transparency (ART) principles Bommasani et al. (2023); Vogus and Llansó (2021), ensuring compliance for commercial applications. As in Stage 1, we train the full model during this stage.
Stage 3.
To enhance the model's instruction-tuning capabilities, particularly for downstream tasks like question answering, we further train it on the DocDownstream Rodriguez et al. (2024a); Hu et al. (2024) instruction tuning dataset. In this stage, the vision encoder is frozen, focusing training exclusively on the LLM and Align module.
## 4 Experimental Setup
Table 1: Main Results on General Document Benchmarks. We compare AlignVLM (ours) with state-of-the-art (SOTA) open and closed-source instructed models, and with base models that we trained using the process described in Section 3.3. AlignVLM models outperform all Base VLM models trained in the same data regime. Our models also perform competitively across document benchmarks even compared with SOTA models, in which the data regime is more targeted and optimized. Color coding for comparison: closed-source models, open-source models below 7B parameters, open-source models between 7-12B parameters.
| Model | DocVQA VAL | InfoVQA VAL | DeepForm TEST | KLC TEST | WTQ TEST | TabFact TEST | ChartQA TEST | TextVQA VAL | TableVQA TEST | Avg. Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Closed-Source VLMs | | | | | | | | | | |
| (Opaque Training Data) | | | | | | | | | | |
| Claude-3.5 Sonnet | 88.48 | 59.05 | 31.41 | 24.82 | 47.13 | 53.48 | 51.84 | 71.42 | 81.27 | 56.54 |
| GeminiPro-1.5 | 91.23 | 73.94 | 32.16 | 24.07 | 50.29 | 71.22 | 34.68 | 68.16 | 80.43 | 58.46 |
| GPT-4o 20240806 | 92.80 | 66.37 | 38.39 | 29.92 | 46.63 | 81.10 | 85.70 | 70.46 | 72.87 | 64.91 |
| Open-Source Instruct VLMs | | | | | | | | | | |
| (Semi-Opaque Training Data) | | | | | | | | | | |
| Janus- 1.3B (Wu et al., 2024a) | 30.15 | 17.09 | 0.62 | 15.06 | 9.30 | 51.34 | 57.20 | 51.97 | 18.67 | 27.93 |
| Qwen2-VL- 2B (Wang et al., 2024) | 89.16 | 64.11 | 32.38 | 25.18 | 38.20 | 57.21 | 73.40 | 79.90 | 43.07 | 55.84 |
| Qwen2.5-VL- 3B (Wang et al., 2024) | 93.00 | 75.83 | 32.84 | 24.82 | 53.46 | 71.16 | 83.91 | 79.29 | 71.66 | 65.10 |
| InternVL-2.5- 2B (Chen et al., 2024b) | 87.70 | 61.85 | 13.14 | 16.58 | 36.33 | 57.26 | 74.96 | 76.85 | 42.20 | 51.87 |
| InternVL-3- 2B (Zhu et al., 2025) | 87.33 | 66.99 | 37.90 | 29.79 | 39.44 | 59.91 | 75.32 | 78.69 | 43.46 | 57.64 |
| DeepSeek-VL2-Tiny- 3.4B (Wu et al., 2024b) | 88.57 | 63.88 | 25.11 | 19.04 | 35.07 | 52.15 | 80.92 | 80.48 | 56.30 | 55.72 |
| Phi3.5-Vision- 4B (Abdin et al., 2024) | 86.00 | 56.20 | 10.47 | 7.49 | 17.18 | 30.43 | 82.16 | 73.12 | 70.70 | 48.19 |
| Qwen2-VL- 7B (Wang et al., 2024) | 93.83 | 76.12 | 34.55 | 23.37 | 52.52 | 74.68 | 83.16 | 84.48 | 53.97 | 64.08 |
| Qwen2.5-VL- 7B (Bai et al., 2025) | 94.88 | 82.49 | 42.21 | 24.26 | 61.96 | 78.56 | 86.00 | 85.35 | 76.10 | 70.20 |
| LLaVA-NeXT- 7B (Xu et al., 2024) | 63.51 | 30.90 | 1.30 | 5.35 | 20.06 | 52.83 | 52.12 | 65.10 | 32.87 | 36.00 |
| DocOwl1.5- 8B (Hu et al., 2024) | 80.73 | 49.94 | 68.84 | 37.99 | 38.87 | 79.67 | 68.56 | 68.91 | 52.60 | 60.68 |
| InternVL-2.5- 8B (Chen et al., 2024b) | 91.98 | 75.36 | 34.55 | 22.31 | 50.33 | 74.75 | 82.84 | 79.00 | 52.10 | 62.58 |
| InternVL-3- 8B (Zhu et al., 2025) | 91.99 | 73.90 | 51.24 | 36.41 | 53.60 | 72.27 | 85.60 | 82.41 | 53.26 | 66.74 |
| Fuyu- 8B (Bavishi et al., 2023) | 48.97 | 23.09 | 4.78 | 6.63 | 14.55 | 47.91 | 44.36 | 46.02 | 15.49 | 22.97 |
| Ovis-1.6-Gemma2- 9B (Lu et al., 2024) | 88.84 | 73.97 | 45.16 | 23.91 | 50.72 | 76.66 | 81.40 | 77.73 | 48.33 | 62.96 |
| Llama3.2- 11B (Grattafiori et al., 2024) | 82.71 | 36.62 | 1.78 | 3.47 | 23.03 | 58.33 | 23.80 | 54.28 | 22.40 | 34.04 |
| Pixtral- 12B (Agrawal et al., 2024) | 87.67 | 49.45 | 27.37 | 24.07 | 45.18 | 73.53 | 71.80 | 76.09 | 67.13 | 58.03 |
| Document Understanding Instructed Models | | | | | | | | | | |
| (Instruction Tuned on BigDocs-7.5M + DocDownStream (Rodriguez et al., 2024a; Hu et al., 2024)) | | | | | | | | | | |
| Qwen2-VL- 2B (base+) (Wang et al., 2024) | 57.23 | 31.88 | 49.31 | 34.39 | 31.61 | 64.75 | 68.60 | 61.01 | 47.53 | 49.59 |
| AlignVLM -Llama-3.2- 1B (ours) | 72.42 | 38.16 | 60.47 | 33.71 | 28.66 | 71.31 | 65.44 | 48.81 | 50.29 | 52.14 |
| AlignVLM -Llama-3.2- 3B (ours) | 79.63 | 44.53 | 63.49 | 35.25 | 38.59 | 78.51 | 71.88 | 57.38 | 60.10 | 58.81 |
| DocOwl1.5- 8B (base+) (Hu et al., 2024) | 78.70 | 47.62 | 64.39 | 36.93 | 35.69 | 72.65 | 65.80 | 67.30 | 49.03 | 57.56 |
| Llama3.2- 11B (base+) (Grattafiori et al., 2024) | 78.99 | 44.27 | 67.05 | 37.22 | 40.18 | 78.04 | 71.40 | 68.46 | 56.73 | 60.26 |
| AlignVLM -Llama-3.1- 8B (ours) | 81.18 | 53.75 | 63.25 | 35.50 | 45.31 | 83.04 | 75.00 | 64.60 | 64.33 | 62.88 |
Setup.
We conduct all experiments using 8 nodes of H100 GPUs, totaling 64 GPUs. For model training, we leverage the MS-Swift framework (Zhao et al., 2024) for its flexibility. Additionally, we utilize the DeepSpeed framework (Aminabadi et al., 2022), specifically the ZeRO-3 configuration, to optimize efficient parallel training across multiple nodes. Detailed hyperparameters are outlined in Appendix A.1.
Baselines.
Our work focuses on architectural innovations, so we ensure that all baselines are trained on the same datasets. To enable fair comparisons, we evaluate our models against a set of Base VLMs fine-tuned on the same instruction-tuning tasks (Stages 2 and 3) as our models, using the BigDocs-7.5M and BigDocs-DocDownstream datasets. This approach ensures consistent training data, avoiding biases introduced by the Instruct versions of VLMs, which are often trained on undisclosed instruction-tuning datasets. Due to the scarcity of recently released publicly available Base VLMs, we primarily compare our model against the following Base VLMs of varying sizes: Qwen2-VL-2B (Wang et al., 2024), DocOwl1.5-8B (Hu et al., 2024), and Llama 3.2-11B (Grattafiori et al., 2024).
For additional context, we also include results from the Instruct versions of recent VLMs of different sizes: Phi3.5-Vision-4B (Abdin et al., 2024), Qwen2-VL-2B and 7B (Wang et al., 2024), Qwen2.5-VL-7B (Qwen et al., 2025), LLaVA-NeXT-7B (Liu et al., 2024), InternVL2.5-2B and 8B (Chen et al., 2024b), InternVL3-2B and 8B (Zhu et al., 2025), Janus-1.3B (Wu et al., 2024a), DeepSeek-VL2-Tiny (Wu et al., 2024b), Ovis1.6-Gemma-9B (Lu et al., 2024), Llama3.2-11B (Grattafiori et al., 2024), DocOwl1.5-8B (Hu et al., 2024), and Pixtral-12B (Agrawal et al., 2024).
Evaluation Benchmarks.
We evaluate our models on a diverse range of document understanding benchmarks that assess the model's capabilities in OCR, chart reasoning, table processing, and form comprehension. In particular, we employ the VLMEvalKit (Duan et al., 2024) framework and report the results on the following popular benchmarks: DocVQA (Mathew et al., 2021b), InfoVQA (Mathew et al., 2021a), DeepForm (Svetlichnaya, 2020), KLC (Stanisławek et al., 2021), WTQ (Pasupat and Liang, 2015), TabFact (Chen et al., 2020), ChartQA (Masry et al., 2022), TextVQA (Singh et al., 2019), and TableVQA (Kim et al., 2024).
## 5 Results
Table 2: Impact of Connector Designs on VLM Performance: We present the results of experiments evaluating different connector designs for conditioning LLMs on visual features. Our proposed Align connector is compared against a basic Multi-Layer Perceptron (MLP), the Perceiver Resampler, and Ovis. The results demonstrate that Align consistently outperforms these alternatives across all benchmarks.
| Model | DocVQA VAL | InfoVQA VAL | DeepForm TEST | KLC TEST | WTQ TEST | TabFact TEST | ChartQA TEST | TextVQA VAL | TableVQA TEST | Avg. Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.2-3B- MLP | 71.46 | 37.56 | 62.07 | 33.36 | 28.94 | 73.22 | 66.48 | 53.56 | 50.96 | 53.06 |
| Llama-3.2-3B- Perceiver R. | 69.08 | 34.13 | 57.08 | 31.75 | 27.95 | 71.93 | 65.16 | 51.33 | 47.76 | 50.68 |
| Llama-3.2-3B- Ovis | 74.68 | 42.11 | 58.02 | 33.50 | 33.13 | 76.67 | 67.92 | 52.60 | 53.93 | 54.72 |
| Llama-3.2-3B- Align (ours) | 79.63 | 44.53 | 63.49 | 35.25 | 38.59 | 78.51 | 71.88 | 57.38 | 60.10 | 58.81 |
### 5.1 Main Results
Table 1 presents the performance of AlignVLM compared to state-of-the-art (SOTA) open- and closed-source instructed models, as well as baseline Base VLMs fine-tuned in the same instruction-tuning setup. The results demonstrate that AlignVLM consistently outperforms all Base VLMs within the same size category and achieves competitive performance against SOTA Instruct VLMs despite being trained on a more limited data regime. Below, we provide a detailed analysis.
AlignVLM vs. Base VLMs.
Our AlignVLM models, based on Llama 3.2-1B and Llama 3.2-3B, significantly outperform the corresponding Base VLM, Qwen2-VL-2B, by up to 9.22%. Notably, AlignVLM -Llama-3.2-3B surpasses DocOwl1.5-8B, which has 4B more parameters, demonstrating the effectiveness of Align in enhancing multimodal capabilities compared to traditional shallow fusion methods (e.g., MLPs). Furthermore, our 8B model achieves a 2.62% improvement over Llama3.2-11B despite sharing the same Base LLM, Llama3.1-8B. Since all models in this comparison were trained on the same instruction-tuning setup, this experiment provides a controlled evaluation, isolating the impact of architectural differences rather than dataset biases. Consequently, these results suggest that AlignVLM outperforms VLMs with shallow fusion techniques and surpasses parameter-heavy deep fusion VLMs, such as Llama3.2-11B, while maintaining a more efficient architecture.
AlignVLM vs. Instruct VLMs.
Even as open-source Instruct models are trained on significantly larger, often undisclosed instruction-tuning datasets, AlignVLM achieves competitive performance. For example, AlignVLM -Llama-3.2-3B (58.81%) outperforms other strong instruction-tuned VLMs in its size class, such as Qwen2-VL-2B and InternVL-3-2B, by considerable margins (2.97% and 1.17%, respectively). While it falls slightly behind Qwen2.5-VL-3B, a direct comparison is not entirely fair, as the latter was trained on a proprietary instruction-tuning dataset.
Additionally, our 8B model outperforms significantly larger models such as Llama 3.2-11B and Pixtral-12B by substantial margins. It also surpasses InternVL-2.5-8B and performs competitively with Qwen2.5-VL-7B, though a direct comparison may not be entirely fair since Qwen2.5-VL-7B was trained on an undisclosed instruction-tuning dataset. Finally, AlignVLM also exhibits comparable performance to closed-source models like GeminiPro-1.5 and GPT-4o.
Overall, these results validate the effectiveness of Align and establish AlignVLM as a state-of-the-art model for multimodal document understanding.
### 5.2 Impact of Connector Designs on VLM Performance
#### 5.2.1 High-Resource Training Regime
To assess the effectiveness of our Align module, we compare it against three different and widely used shallow fusion VLM connectors: MLP, Perceiver Resampler, and Ovis. These experiments were conducted under identical training conditions (datasets, hyperparameters, training stages), as outlined in Appendix A.1, ensuring a fair and rigorous comparison. The results in Table 2 show that Align consistently outperforms all alternatives, demonstrating its superiority in aligning visual and textual modalities for multimodal document understanding. MLP and Perceiver Resampler achieve the lowest performance, 53.06% and 50.68%, respectively, due to their direct feature projection, which lacks an explicit mechanism to align visual features with the LLM's text space, leading to misalignment. Ovis introduces a separate visual embedding table, but this additional complexity does not significantly improve alignment, yielding only 54.72% accuracy. In contrast, Align ensures that visual features remain within the convex hull of the LLM's text latent space, leveraging the linguistic priors of the LLM to enhance alignment and mitigate noisy embeddings. This design leads to the highest performance (58.81%), establishing Align as the most effective connector for integrating vision and language in multimodal document understanding. We provide some example outputs of the Llama-3.2-3B models with different connector designs in Appendix A.4. Furthermore, we include an analysis of the runtime efficiency and memory usage of different connectors in Appendix A.2.
#### 5.2.2 Low-Resource Training Regime
The previous section focused on large-scale training setups involving millions of data samples (BigDocs-7.5M), which require significant compute resources and limit the number of baselines that we were able to compare against. Here, we examine whether Align remains effective in a low-resource setting.
We conduct additional experiments using SigLIP-400M as the vision encoder and Llama-3.2-3B as the language model, fine-tuned on the LLaVA-NeXT dataset Liu et al. (2024), which contains 779K samples. We follow the official LLaVA-NeXT configuration for both training stages. (i) Pretraining: the model is trained on the LLaVA-558K image-caption dataset Liu et al. (2024), freezing both the LLM and vision encoder while fine-tuning the connector (learning rate = 1e-3, batch size = 32, 1 epoch on 8 × H100 GPUs). To handle high-resolution document images, we adopt the "anyres_max_9" strategy with grids ranging from 1×1 to 6×6, supporting resolutions up to 2304×2304 with 729 tokens per grid; (ii) Instruction tuning: the model is further fine-tuned on the LLaVA-NeXT-779K instruction dataset with learning rates of 1e-5 for the LLM and connector, 2e-6 for the vision encoder, batch size = 8, for 1 epoch.
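The configuration above can be summarized as follows; the values are transcribed from the text, while the dictionary keys are illustrative and do not correspond to actual MS-Swift or LLaVA-NeXT argument names.

```python
# Summary of the two-stage low-resource setup described above (values from the text;
# keys are illustrative, not real framework arguments).
low_resource_config = {
    "vision_encoder": "SigLIP-400M",
    "llm": "Llama-3.2-3B",
    "image_handling": {"strategy": "anyres_max_9", "grids": "1x1 ... 6x6",
                       "max_resolution": (2304, 2304), "tokens_per_grid": 729},
    "stage_1_pretraining": {
        "dataset": "LLaVA-558K image-caption pairs",
        "trainable": ["connector"],          # LLM and vision encoder frozen
        "learning_rate": 1e-3, "batch_size": 32, "epochs": 1, "hardware": "8 x H100",
    },
    "stage_2_instruction_tuning": {
        "dataset": "LLaVA-NeXT-779K",
        "learning_rates": {"llm": 1e-5, "connector": 1e-5, "vision_encoder": 2e-6},
        "batch_size": 8, "epochs": 1,
    },
}
```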
This lightweight setup allows direct comparison across more connector architectures, including MLP Liu et al. (2023a), Perceiver Resampler, Ovis Lu et al. (2024), H-Reducer (1×4) Hu et al. (2024), and HoneyBee (C-Abstractor) Cha et al. (2024), all trained under identical conditions for fairness. Since the LLaVA-NeXT dataset is general-purpose and not exclusively document-focused like BigDocs-7.5M (Rodriguez et al., 2024a), it allows us to evaluate whether the Align connector generalizes beyond document understanding to broader visual reasoning. Accordingly, we assess all models on a comprehensive suite of benchmarks spanning both document understanding and general vision-language tasks. The document understanding benchmarks include DocVQA Mathew et al. (2021b), InfoVQA Mathew et al. (2021a), ChartQA Masry et al. (2022), and TextVQA Singh et al. (2019). For general vision-language evaluation, we report results on MMMU-dev Yue et al. (2024), SeedBench Li et al. (2023a), MMVet Yu et al. (2024), POPE (Li et al., 2023c), and GQA (Hudson and Manning, 2019).
Table 3: Connector Performance under a Low-Resource Training Regime: We evaluate the effectiveness of more shallow-fusion connectors when trained on limited data. The Align connector achieves the highest performance, with notably larger gains on document understanding tasks, demonstrating its data efficiency and strong inductive bias.
| Model | DocVQA | InfoVQA | ChartQA | TextVQA | Doc. Avg. | MMMU | SeedBench | MMVet | POPE | GQA | Gen. Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.2-3B-MLP | 42.11 | 19.93 | 48.44 | 51.97 | 40.61 | 33.33 | 58.54 | 31.14 | 87.35 | 57.62 | 53.59 |
| Llama-3.2-3B-Perceiver | 32.18 | 18.10 | 40.00 | 44.31 | 33.64 | 35.22 | 63.70 | 26.19 | 84.92 | 55.86 | 53.17 |
| Llama-3.2-3B-Ovis | 57.73 | 26.39 | 54.52 | 55.60 | 48.56 | 31.89 | 60.97 | 30.41 | 88.26 | 56.23 | 53.55 |
| Llama-3.2-3B-Hreducer | 34.59 | 17.57 | 45.64 | 47.13 | 36.23 | 35.00 | 61.82 | 28.39 | 87.48 | 58.24 | 54.18 |
| Llama-3.2-3B-HoneyBee | 55.86 | 19.36 | 55.32 | 58.13 | 47.16 | 32.11 | 61.18 | 34.31 | 89.28 | 54.79 | 54.33 |
| Llama-3.2-3B- Align (ours) | 71.43 | 30.50 | 69.72 | 65.63 | 59.32 | 35.33 | 63.27 | 35.32 | 88.85 | 61.67 | 56.88 |
As summarized in Table 3, Align consistently outperforms other connectors under this low-data regime, with stronger gains on document understanding tasks. The wider performance margin between Align and other connectors under limited data (Table 3) compared to the high-resource setting (Table 2) underscores the benefit of its inductive bias. By grounding visual features within the LLM's text embedding space, Align learns more efficiently from fewer samples, unlike direct-projection connectors that rely heavily on large datasets. This makes Align especially valuable for resource-constrained environments such as academic labs or small-scale industrial research setups, where both data and compute are limited.
### 5.3 Probability Distribution over Text Tokens Analysis
To better understand the behavior of Align, we examine the probability distribution $P_{\text{vocab}}$ in Eq. (1), over the LLM's text vocabulary, generated from visual features. Specifically, we process 100 document images through the vision encoder and Align, then average the resulting probability distributions across all image patches. The final distribution is shown in Figure 3. As illustrated, the distribution is dense (rather than sparse), with the highest probability assigned to a single token being 0.0118. This can be explained by the vision feature space being continuous and of much higher cardinality than the discrete text space. Indeed, while the LLM has 128K distinct vocabulary tokens, an image patch (e.g., 14×14 pixels) contains continuous, high-dimensional information that cannot be effectively mapped to a single or a few discrete tokens.
Table 4: Performance comparison when evaluating Align with the full text embedding vocabulary (128K) versus the reduced subset of 3.4K high-probability embeddings. The results show negligible performance degradation, indicating that Align relies primarily on a small subset of embeddings.
| Model | DocVQA VAL | InfoVQA VAL | DeepForm TEST | KLC TEST | WTQ TEST | TabFact TEST | ChartQA TEST | TextVQA VAL | TableVQA TEST | Avg. Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.2-3B- Align (Full Embeddings) | 79.63 | 44.53 | 63.49 | 35.25 | 38.59 | 78.51 | 71.88 | 57.38 | 60.10 | 58.81 |
| Llama-3.2-3B- Align (3.4K Embeddings) | 79.40 | 44.13 | 63.64 | 35.02 | 38.26 | 78.83 | 71.72 | 57.48 | 59.80 | 58.69 |
We conducted a deeper analysis of the token probability distributions produced by the Align connector. Our observations show that Align consistently assigns high probabilities to approximately 3.4K tokens from the entire vocabulary, while the remaining tokens receive negligible probabilities (below $10^{-6}$). To better understand this behavior, we applied Principal Component Analysis (PCA) to reduce the dimensionality of the embeddings and visualized them in a two-dimensional space, as shown in Figure 4. The visualization reveals that these 3.4K tokens densely and comprehensively span the latent space of the LLM's text embeddings. To validate this finding, we conducted additional evaluation experiments in which we retained only these 3.4K high-probability embeddings in the Align connector, entirely removing the rest during evaluation. As shown in Table 4, the performance difference compared to using the full embedding set (128K) was negligible. This confirms that Align effectively leverages and combines a compact subset of embeddings to map visual features into semantically meaningful regions within the LLM's latent text space. Moreover, this suggests that Align can be further optimized through targeted embedding pruning to improve computational efficiency without sacrificing performance.
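A hedged sketch of this analysis, reusing the `AlignConnector` sketch from Section 3.1, is shown below; the function names and the $10^{-6}$ cutoff used to select the retained token ids are illustrative.

```python
import torch

@torch.no_grad()
def mean_vocab_distribution(connector, vision_encoder, images):
    """Average P_vocab (Eq. 1) over every patch of a set of document images,
    reusing the AlignConnector sketch from Section 3.1."""
    totals, count = None, 0
    for image in images:                                    # e.g. 100 document pages
        feats = vision_encoder(image)                       # (N_patches, d)
        h = connector.norm1(connector.proj(feats))
        p_vocab = torch.softmax(connector.norm2(connector.lm_head(h)), dim=-1)
        totals = p_vocab.sum(0) if totals is None else totals + p_vocab.sum(0)
        count += p_vocab.shape[0]
    return totals / count                                   # the distribution plotted in Figure 3

def influential_ids(mean_p: torch.Tensor, threshold: float = 1e-6):
    """Token ids whose mean probability is non-negligible (~3.4K in the paper's run)."""
    return (mean_p > threshold).nonzero(as_tuple=True)[0]
```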
<details>
<summary>figures/prob_dist_image.png Details</summary>

### Visual Description
## Probability Distribution Chart: LLM Text Tokens
### Overview
The image displays a bar chart titled "Probability Distribution over the LLM's Text Tokens." It visualizes the probability assigned to different tokens (indexed numerically) by a Large Language Model (LLM). The chart shows a highly skewed distribution, with a very small number of tokens receiving high probability and the vast majority having very low probability.
### Components/Axes
* **Title:** "Probability Distribution over the LLM's Text Tokens" (centered at the top).
* **X-Axis:**
* **Label:** "Tokens Index" (centered below the axis).
* **Scale:** Linear scale from 0 to approximately 130,000.
* **Major Tick Marks:** 0, 20000, 40000, 60000, 80000, 100000, 120000.
* **Y-Axis:**
* **Label:** "Probability" (centered to the left of the axis, rotated 90 degrees).
* **Scale:** Linear scale from 0.000 to 0.012.
* **Major Tick Marks:** 0.000, 0.002, 0.004, 0.006, 0.008, 0.010, 0.012.
* **Data Series:** A single series represented by light blue vertical bars. There is no legend, as only one data category is plotted.
* **Spatial Layout:** The chart occupies the central area of the image. The title is at the top-center. The y-axis label is on the left, and the x-axis label is at the bottom-center. The data bars originate from the x-axis (y=0).
### Detailed Analysis
* **Trend Verification:** The visual trend is an extremely sharp decline. A single, very tall bar is present at the far left (lowest token index). Following this peak, the bar heights drop dramatically and remain consistently low across the rest of the x-axis, with minor, random-looking fluctuations.
* **Key Data Points (Approximate):**
* **Peak Probability:** The highest bar is located at or very near Token Index 0. Its height reaches approximately **0.0118** (just below the 0.012 mark).
* **Secondary Peaks:** A few other bars in the very low index range (0-5000) show noticeable but much smaller peaks, with probabilities around **0.002 to 0.003**.
* **General Distribution:** For the vast majority of token indices (from ~5,000 to ~130,000), the probability values are very low, mostly fluctuating between **0.000 and 0.002**. Many bars are near zero, with occasional spikes rarely exceeding **0.0025**.
### Key Observations
1. **Extreme Skew:** The distribution is dominated by a single, massive outlier at the beginning of the token index. This indicates that one or a very small set of tokens (likely common words, punctuation, whitespace, or special control tokens) are assigned a disproportionately high probability by the model.
2. **Long Tail:** After the initial peak, the distribution exhibits a "long tail" where thousands of tokens have non-zero but very low probabilities. This is characteristic of language, where a few words are extremely common, and a vast vocabulary of less common words exists.
3. **Sparsity and Noise:** The tail appears noisy and sparse, with probabilities jumping up and down without a smooth pattern. This suggests that within the long tail, the model's confidence varies significantly from one token to the next, likely based on the specific context provided when this distribution was generated.
4. **Vocabulary Size:** The x-axis extends to just beyond 120,000, suggesting the LLM's vocabulary size is in the range of **128,000 to 130,000 tokens**.
### Interpretation
This chart provides a visual confirmation of fundamental principles in natural language processing and LLM behavior:
* **Zipf's Law in Action:** The distribution strongly resembles a Zipfian distribution, a power law frequently observed in language corpora where the frequency of a word is inversely proportional to its rank. The highest-ranked token (index 0) is vastly more probable than the 1000th-ranked token.
* **Model Behavior Insight:** The single dominant peak suggests the model has learned that, in the absence of strong contextual cues, it is safest to predict a very common token (like a space or a period). The long, low-probability tail represents the model's nuanced understanding that many other words are possible, but far less likely without specific context.
* **Implications for Generation:** This distribution explains why LLM text generation can sometimes feel generic or repetitive: the model is inherently biased towards a small set of high-probability tokens. Techniques like temperature sampling or top-k/top-p filtering are used to manipulate this raw distribution to encourage more diverse and creative outputs by down-weighting the dominant peak and sampling more from the long tail.
* **Technical Context:** The chart likely represents the model's output probability distribution for a *single* prediction step (the next token) given a specific input prompt. The exact shape would change dramatically with different input contexts.
</details>
Figure 3: Probability distribution over LLM tokens, highlighting dense probabilities for whitespace tokens.
<details>
<summary>figures/vision-to-text/alignvlm_embeds.png Details</summary>

### Visual Description
## 2D PCA Scatter Plot: Embeddings (AlignVLM)
### Overview
This image is a 2D scatter plot visualizing the results of a Principal Component Analysis (PCA) performed on embedding vectors from a system or model named "AlignVLM". The plot projects high-dimensional embedding data onto its first two principal components (PC1 and PC2) to reveal the structure and distribution of the data points. It compares two distinct groups of embeddings.
### Components/Axes
* **Chart Title:** "2D PCA of Embeddings (AlignVLM)" (centered at the top).
* **X-Axis:** Labeled "PC1". The scale ranges from approximately -0.15 to 0.15, with major tick marks at intervals of 0.05.
* **Y-Axis:** Labeled "PC2". The scale ranges from approximately -0.10 to 0.10, with major tick marks at intervals of 0.05.
* **Legend:** Located in the top-left corner of the plot area.
* **Light Blue Dot:** Labeled "Other Embeddings".
* **Red Dot:** Labeled "Most Influential Embeds in AlignVLM".
* **Data Points:** The plot contains thousands of individual points, each representing a single embedding vector projected into this 2D space.
### Detailed Analysis
* **Data Series & Spatial Distribution:**
* **"Most Influential Embeds in AlignVLM" (Red Points):** This series forms a very dense, roughly elliptical cluster centered near the origin (PC1 â 0, PC2 â 0). The cluster is tightest at the center and becomes slightly more diffuse towards its edges. The red points dominate the central region of the plot, spanning approximately from PC1 = -0.08 to 0.10 and PC2 = -0.08 to 0.08.
* **"Other Embeddings" (Light Blue Points):** This series is more widely dispersed. While many blue points are intermingled within the central red cluster, a significant portion extends further out, particularly towards the left side of the plot (negative PC1 values). The blue points show a broader spread, reaching extremes near PC1 = -0.15 and PC2 = -0.10. Their density is lower than the red points in the core region.
* **Trend Verification:** There is no temporal trend. The visual pattern is one of **distribution and density**. The red points exhibit a trend of high concentration around the center. The blue points exhibit a trend of wider dispersion, with a notable directional spread towards the negative PC1 axis.
### Key Observations
1. **Core vs. Periphery:** The "Most Influential" embeddings (red) are highly concentrated in the central region of the PCA space, suggesting they are more similar to each other in the original high-dimensional space.
2. **Dispersion of "Other" Embeddings:** The "Other Embeddings" (blue) show greater variance. Their extension to the left (negative PC1) indicates a subset of embeddings that are distinct from the central, influential cluster.
3. **Overlap and Separation:** There is substantial overlap between the two groups, meaning many "Other" embeddings reside in the same region as the "Most Influential" ones. However, the influential group forms a denser core, while the other group defines the broader periphery, especially on the left.
4. **Asymmetry:** The overall distribution is not perfectly symmetrical. The spread along the PC1 axis (horizontal) appears greater than along the PC2 axis (vertical), and the "Other" embeddings pull the overall distribution more strongly towards the negative PC1 direction.
### Interpretation
This PCA plot provides a visual diagnostic of the embedding space within the AlignVLM system. The tight clustering of the "Most Influential Embeds" suggests these embeddings form a coherent, central semantic or functional core. They likely represent the most common, prototypical, or impactful concepts the model has learned.
The wider dispersion of the "Other Embeddings" indicates greater diversity or variance among the remaining data. The subset stretching towards negative PC1 may represent outlier concepts, specialized terms, or data points that are less aligned with the model's primary focus. The significant overlap implies that influence isn't strictly segregated into a separate region of the embedding space; rather, the most influential points are simply the densest part of a larger continuum.
From a technical perspective, this visualization could be used to assess the quality of embedding alignment, identify potential outliers for data cleaning, or understand the representational structure that underpins the model's decision-making. The clear central cluster for influential items might be a target for optimization or a sign of a well-organized latent space.
</details>
Figure 4: PCA of Align Embeddings: The principal components of the most influential embeddings in the Align Connector span most of the feature space represented by all embeddings.
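The analysis behind Figure 4 can be sketched in a few lines. The snippet below is a minimal illustration (not the authors' analysis script), assuming access to the LLM's text-embedding matrix and a per-token score `avg_align_weights` recording the average softmax weight each token embedding receives from the Align connector over a probe set; both names are placeholders.

```python
# Minimal sketch of a Figure-4-style PCA plot; inputs are assumed, not the paper's code.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_align_pca(embeddings: np.ndarray, avg_align_weights: np.ndarray, top_k: int = 2000):
    """embeddings: (V, d) LLM text-embedding matrix.
    avg_align_weights: (V,) hypothetical average Align softmax weight per token."""
    coords = PCA(n_components=2).fit_transform(embeddings)      # project to 2D
    mask = np.zeros(len(embeddings), dtype=bool)
    mask[np.argsort(avg_align_weights)[-top_k:]] = True         # most influential tokens

    plt.scatter(coords[~mask, 0], coords[~mask, 1], s=2, c="lightblue", label="Other Embeddings")
    plt.scatter(coords[mask, 0], coords[mask, 1], s=2, c="red", label="Most Influential Embeds in AlignVLM")
    plt.xlabel("PC1"); plt.ylabel("PC2"); plt.legend(loc="upper left")
    plt.show()
```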
### 5.4 Robustness to Noise Analysis
To evaluate the robustness of our Align connector to noisy visual features, we conduct an experiment where random Gaussian noise is added to the visual features produced by the vision encoder before passing them into the connector. Specifically, given the visual features $F \in \mathbb{R}^{N \times d}$ output by the vision encoder (where $N$ is the number of feature vectors and $d$ is their dimensionality), we perturb them as
$$
\widetilde{F} = F + N, \qquad N \sim \mathcal{N}(0, \sigma = 3).
$$
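In code, this perturbation is a single element-wise addition. The following PyTorch sketch illustrates it; the helper name and the example dimensions are illustrative rather than taken from our implementation.

```python
import torch

def perturb_features(features: torch.Tensor, sigma: float = 3.0) -> torch.Tensor:
    """Add element-wise Gaussian noise N(0, sigma) to visual features F of shape (N, d)."""
    return features + sigma * torch.randn_like(features)

# Stand-in for the vision-encoder output (hypothetical N=576 patch features of size d=1024).
features = torch.randn(576, 1024)
noisy_features = perturb_features(features)
```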
Table 5: Robustness to Noise. Comparison of Avg. Scores with and without Gaussian noise ($\sigma = 3$), including performance drop ($\Delta$).
| Model | Without Noise | With Noise | Drop ($\Delta$) |
| --- | --- | --- | --- |
| Llama-3.2-3B-MLP | 53.06 | 27.52 | $-25.54$ |
| Llama-3.2-3B-Align (ours) | 58.81 | 57.14 | $-\textbf{1.67}$ |
As shown in Table 5, our Align connector demonstrates high robustness to noise, with only a 1.67% average drop in performance. In contrast, the widely adopted MLP connector suffers a significant performance degradation of 25.54%, highlighting its vulnerability to noisy inputs. Furthermore, we measured the average cosine distance between the original and noise-perturbed visual embeddings using both the Align and MLP connectors. Align showed significantly lower distances (0.0036) than MLP (0.3938), further validating its robustness to noise. These empirical results support our hypothesis that leveraging the knowledge encoded in the LLM's text embeddings and constraining the visual features within the convex hull of the text latent space act as a regularization mechanism, reducing the model's sensitivity to noisy visual features.
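The cosine-distance measurement described above can be sketched as follows; `connector` stands for either the Align or the MLP connector, and the function is one plausible reading of the metric (average of one minus cosine similarity between connector outputs on clean and noisy features), not the exact evaluation code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def avg_cosine_distance(connector, feats: torch.Tensor, sigma: float = 3.0) -> float:
    """Average cosine distance between connector outputs on clean vs. noisy visual features."""
    clean = connector(feats)
    noisy = connector(feats + sigma * torch.randn_like(feats))
    return (1.0 - F.cosine_similarity(clean, noisy, dim=-1)).mean().item()
```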
## 6 Conclusion
We introduce Align, a novel connector designed to align vision and language latent spaces in vision-language models (VLMs), specifically enhancing multimodal document understanding. By improving cross-modal alignment and reducing sensitivity to noisy visual features, our AlignVLM models, which leverage Align, achieve state-of-the-art performance across diverse document understanding tasks: they outperform base VLMs trained on the same datasets and are competitive with open-source instruct models trained on undisclosed data. Extensive experiments and ablations validate the robustness and effectiveness of Align compared to existing connector designs, establishing it as a significant contribution to vision-language modeling. Future work will explore training on more diverse instruction-tuning datasets to generalize to broader domains.
## References
- Abdin et al. [2024] M. Abdin, J. Aneja, H. Awadalla, A. Awadallah, A. A. Awan, N. Bach, A. Bahree, A. Bakhtiari, J. Bao, H. Behl, A. Benhaim, M. Bilenko, J. Bjorck, S. Bubeck, M. Cai, Q. Cai, V. Chaudhary, D. Chen, D. Chen, W. Chen, Y.-C. Chen, Y.-L. Chen, H. Cheng, P. Chopra, X. Dai, M. Dixon, R. Eldan, V. Fragoso, J. Gao, M. Gao, M. Gao, A. Garg, A. D. Giorno, A. Goswami, S. Gunasekar, E. Haider, J. Hao, R. J. Hewett, W. Hu, J. Huynh, D. Iter, S. A. Jacobs, M. Javaheripi, X. Jin, N. Karampatziakis, P. Kauffmann, M. Khademi, D. Kim, Y. J. Kim, L. Kurilenko, J. R. Lee, Y. T. Lee, Y. Li, Y. Li, C. Liang, L. Liden, X. Lin, Z. Lin, C. Liu, L. Liu, M. Liu, W. Liu, X. Liu, C. Luo, P. Madan, A. Mahmoudzadeh, D. Majercak, M. Mazzola, C. C. T. Mendes, A. Mitra, H. Modi, A. Nguyen, B. Norick, B. Patra, D. Perez-Becker, T. Portet, R. Pryzant, H. Qin, M. Radmilac, L. Ren, G. de Rosa, C. Rosset, S. Roy, O. Ruwase, O. Saarikivi, A. Saied, A. Salim, M. Santacroce, S. Shah, N. Shang, H. Sharma, Y. Shen, S. Shukla, X. Song, M. Tanaka, A. Tupini, P. Vaddamanu, C. Wang, G. Wang, L. Wang, S. Wang, X. Wang, Y. Wang, R. Ward, W. Wen, P. Witte, H. Wu, X. Wu, M. Wyatt, B. Xiao, C. Xu, J. Xu, W. Xu, J. Xue, S. Yadav, F. Yang, J. Yang, Y. Yang, Z. Yang, D. Yu, L. Yuan, C. Zhang, C. Zhang, J. Zhang, L. L. Zhang, Y. Zhang, Y. Zhang, Y. Zhang, and X. Zhou. Phi-3 technical report: A highly capable language model locally on your phone, 2024. URL https://arxiv.org/abs/2404.14219.
- Achiam et al. [2023] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Agrawal et al. [2024] P. Agrawal, S. Antoniak, E. B. Hanna, B. Bout, D. Chaplot, J. Chudnovsky, D. Costa, B. D. Monicault, S. Garg, T. Gervet, S. Ghosh, A. Héliou, P. Jacob, A. Q. Jiang, K. Khandelwal, T. Lacroix, G. Lample, D. L. Casas, T. Lavril, T. L. Scao, A. Lo, W. Marshall, L. Martin, A. Mensch, P. Muddireddy, V. Nemychnikova, M. Pellat, P. V. Platen, N. Raghuraman, B. Rozière, A. Sablayrolles, L. Saulnier, R. Sauvestre, W. Shang, R. Soletskyi, L. Stewart, P. Stock, J. Studnia, S. Subramanian, S. Vaze, T. Wang, and S. Yang. Pixtral 12b, 2024. URL https://arxiv.org/abs/2410.07073.
- Alayrac et al. [2022] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M. Monteiro, J. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. Binkowski, R. Barreira, O. Vinyals, A. Zisserman, and K. Simonyan. Flamingo: a visual language model for few-shot learning, 2022. URL https://arxiv.org/abs/2204.14198.
- Aminabadi et al. [2022] R. Y. Aminabadi, S. Rajbhandari, M. Zhang, A. A. Awan, C. Li, D. Li, E. Zheng, J. Rasley, S. Smith, O. Ruwase, and Y. He. Deepspeed inference: Enabling efficient inference of transformer models at unprecedented scale, 2022. URL https://arxiv.org/abs/2207.00032.
- Anthropic [2024] Anthropic. The claude 3 model family: Opus, sonnet, haiku. 2024.
- Bai et al. [2025] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin. Qwen2.5-vl technical report, 2025. URL https://arxiv.org/abs/2502.13923.
- Bavishi et al. [2023] R. Bavishi, E. Elsen, C. Hawthorne, M. Nye, A. Odena, A. Somani, and S. Taşırlar. Introducing our multimodal models, 2023. URL https://www.adept.ai/blog/fuyu-8b.
- Beyer et al. [2024] L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdulmohsin, M. Tschannen, E. Bugliarello, T. Unterthiner, D. Keysers, S. Koppula, F. Liu, A. Grycner, A. Gritsenko, N. Houlsby, M. Kumar, K. Rong, J. Eisenschlos, R. Kabra, M. Bauer, M. Bošnjak, X. Chen, M. Minderer, P. Voigtlaender, I. Bica, I. Balazevic, J. Puigcerver, P. Papalampidi, O. Henaff, X. Xiong, R. Soricut, J. Harmsen, and X. Zhai. Paligemma: A versatile 3b vlm for transfer, 2024. URL https://arxiv.org/abs/2407.07726.
- Bommasani et al. [2023] R. Bommasani, K. Klyman, S. Longpre, S. Kapoor, N. Maslej, B. Xiong, D. Zhang, and P. Liang. The foundation model transparency index, 2023. URL https://arxiv.org/abs/2310.12941.
- Brown et al. [2020] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Cha et al. [2024] J. Cha, W. Kang, J. Mun, and B. Roh. Honeybee: Locality-enhanced projector for multimodal llm, 2024. URL https://arxiv.org/abs/2312.06742.
- Changpinyo et al. [2021] S. Changpinyo, P. Sharma, N. Ding, and R. Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts, 2021. URL https://arxiv.org/abs/2102.08981.
- Chen et al. [2020] W. Chen, H. Wang, J. Chen, Y. Zhang, H. Wang, S. Li, X. Zhou, and W. Y. Wang. Tabfact: A large-scale dataset for table-based fact verification. In International Conference on Learning Representations, 2020.
- Chen et al. [2024a] Z. Chen, W. Wang, H. Tian, S. Ye, Z. Gao, E. Cui, W. Tong, K. Hu, J. Luo, Z. Ma, J. Ma, J. Wang, X. Dong, H. Yan, H. Guo, C. He, B. Shi, Z. Jin, C. Xu, B. Wang, X. Wei, W. Li, W. Zhang, B. Zhang, P. Cai, L. Wen, X. Yan, M. Dou, L. Lu, X. Zhu, T. Lu, D. Lin, Y. Qiao, J. Dai, and W. Wang. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites, 2024a. URL https://arxiv.org/abs/2404.16821.
- Chen et al. [2024b] Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024b.
- Dai et al. [2024] W. Dai, N. Lee, B. Wang, Z. Yang, Z. Liu, J. Barker, T. Rintamaki, M. Shoeybi, B. Catanzaro, and W. Ping. Nvlm: Open frontier-class multimodal llms. arXiv preprint arXiv: 2409.11402, 2024.
- Diao et al. [2024] H. Diao, Y. Cui, X. Li, Y. Wang, H. Lu, and X. Wang. Unveiling encoder-free vision-language models. arXiv preprint arXiv:2406.11832, 2024.
- Drouin et al. [2024] A. Drouin, M. Gasse, M. Caccia, I. H. Laradji, M. D. Verme, T. Marty, L. Boisvert, M. Thakkar, Q. Cappart, D. Vazquez, N. Chapados, and A. Lacoste. Workarena: How capable are web agents at solving common knowledge work tasks?, 2024. URL https://arxiv.org/abs/2403.07718.
- Duan et al. [2024] H. Duan, J. Yang, Y. Qiao, X. Fang, L. Chen, Y. Liu, X. Dong, Y. Zang, P. Zhang, J. Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 11198–11201, 2024.
- Dubey et al. [2024] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, A. Goyal, A. Hartshorn, A. Yang, A. Mitra, A. Sravankumar, and et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- Grattafiori et al. [2024] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, A. Yang, A. Fan, A. Goyal, A. Hartshorn, A. Yang, A. Mitra, A. Sravankumar, A. Korenev, A. Hinsvark, A. Rao, A. Zhang, A. Rodriguez, A. Gregerson, A. Spataru, B. Roziere, B. Biron, B. Tang, B. Chern, C. Caucheteux, C. Nayak, C. Bi, C. Marra, C. McConnell, C. Keller, C. Touret, C. Wu, C. Wong, C. C. Ferrer, C. Nikolaidis, D. Allonsius, D. Song, D. Pintz, D. Livshits, D. Wyatt, D. Esiobu, D. Choudhary, D. Mahajan, D. Garcia-Olano, D. Perino, D. Hupkes, E. Lakomkin, E. AlBadawy, E. Lobanova, E. Dinan, E. M. Smith, F. Radenovic, F. GuzmĂĄn, F. Zhang, G. Synnaeve, G. Lee, G. L. Anderson, G. Thattai, G. Nail, G. Mialon, G. Pang, G. Cucurell, H. Nguyen, H. Korevaar, H. Xu, H. Touvron, I. Zarov, I. A. Ibarra, I. Kloumann, I. Misra, I. Evtimov, J. Zhang, J. Copet, J. Lee, J. Geffert, J. Vranes, J. Park, J. Mahadeokar, J. Shah, J. van der Linde, J. Billock, J. Hong, J. Lee, J. Fu, J. Chi, J. Huang, J. Liu, J. Wang, J. Yu, J. Bitton, J. Spisak, J. Park, J. Rocca, J. Johnstun, J. Saxe, J. Jia, K. V. Alwala, K. Prasad, K. Upasani, K. Plawiak, K. Li, K. Heafield, K. Stone, K. El-Arini, K. Iyer, K. Malik, K. Chiu, K. Bhalla, K. Lakhotia, L. Rantala-Yeary, L. van der Maaten, L. Chen, L. Tan, L. Jenkins, L. Martin, L. Madaan, L. Malo, L. Blecher, L. Landzaat, L. de Oliveira, M. Muzzi, M. Pasupuleti, M. Singh, M. Paluri, M. Kardas, M. Tsimpoukelli, M. Oldham, M. Rita, M. Pavlova, M. Kambadur, M. Lewis, M. Si, M. K. Singh, M. Hassan, N. Goyal, N. Torabi, N. Bashlykov, N. Bogoychev, N. Chatterji, N. Zhang, O. Duchenne, O. Ăelebi, P. Alrassy, P. Zhang, P. Li, P. Vasic, P. Weng, P. Bhargava, P. Dubal, P. Krishnan, P. S. Koura, P. Xu, Q. He, Q. Dong, R. Srinivasan, R. Ganapathy, R. Calderer, R. S. Cabral, R. Stojnic, R. Raileanu, R. Maheswari, R. Girdhar, R. Patel, R. Sauvestre, R. Polidoro, R. Sumbaly, R. Taylor, R. Silva, R. Hou, R. Wang, S. Hosseini, S. Chennabasappa, S. Singh, S. Bell, S. S. Kim, S. Edunov, S. Nie, S. Narang, S. Raparthy, S. Shen, S. Wan, S. Bhosale, S. Zhang, S. Vandenhende, S. Batra, S. Whitman, S. Sootla, S. Collot, S. Gururangan, S. Borodinsky, T. Herman, T. Fowler, T. Sheasha, T. Georgiou, T. Scialom, T. Speckbacher, T. Mihaylov, T. Xiao, U. Karn, V. Goswami, V. Gupta, V. Ramanathan, V. Kerkez, V. Gonguet, V. Do, V. Vogeti, V. Albiero, V. Petrovic, W. Chu, W. Xiong, W. Fu, W. Meers, X. Martinet, X. Wang, X. Wang, X. E. Tan, X. Xia, X. Xie, X. Jia, X. Wang, Y. Goldschlag, Y. Gaur, Y. Babaei, Y. Wen, Y. Song, Y. Zhang, Y. Li, Y. Mao, Z. D. Coudert, Z. Yan, Z. Chen, Z. Papakipos, A. Singh, A. Srivastava, A. Jain, A. Kelsey, A. Shajnfeld, A. Gangidi, A. Victoria, A. Goldstand, A. Menon, A. Sharma, A. Boesenberg, A. Baevski, A. Feinstein, A. Kallet, A. Sangani, A. Teo, A. Yunus, A. Lupu, A. Alvarado, A. Caples, A. Gu, A. Ho, A. Poulton, A. Ryan, A. Ramchandani, A. Dong, A. Franco, A. Goyal, A. Saraf, A. Chowdhury, A. Gabriel, A. Bharambe, A. Eisenman, A. Yazdan, B. James, B. Maurer, B. Leonhardi, B. Huang, B. Loyd, B. D. Paola, B. Paranjape, B. Liu, B. Wu, B. Ni, B. Hancock, B. Wasti, B. Spence, B. Stojkovic, B. Gamido, B. Montalvo, C. Parker, C. Burton, C. Mejia, C. Liu, C. Wang, C. Kim, C. Zhou, C. Hu, C.-H. Chu, C. Cai, C. Tindal, C. Feichtenhofer, C. Gao, D. Civin, D. Beaty, D. Kreymer, D. Li, D. Adkins, D. Xu, D. Testuggine, D. David, D. Parikh, D. Liskovich, D. Foss, D. Wang, D. Le, D. Holland, E. Dowling, E. Jamil, E. Montgomery, E. 
Presani, E. Hahn, E. Wood, E.-T. Le, E. Brinkman, E. Arcaute, E. Dunbar, E. Smothers, F. Sun, F. Kreuk, F. Tian, F. Kokkinos, F. Ozgenel, F. Caggioni, F. Kanayet, F. Seide, G. M. Florez, G. Schwarz, G. Badeer, G. Swee, G. Halpern, G. Herman, G. Sizov, Guangyi, Zhang, G. Lakshminarayanan, H. Inan, H. Shojanazeri, H. Zou, H. Wang, H. Zha, H. Habeeb, H. Rudolph, H. Suk, H. Aspegren, H. Goldman, H. Zhan, I. Damlaj, I. Molybog, I. Tufanov, I. Leontiadis, I.-E. Veliche, I. Gat, J. Weissman, J. Geboski, J. Kohli, J. Lam, J. Asher, J.-B. Gaya, J. Marcus, J. Tang, J. Chan, J. Zhen, J. Reizenstein, J. Teboul, J. Zhong, J. Jin, J. Yang, J. Cummings, J. Carvill, J. Shepard, J. McPhie, J. Torres, J. Ginsburg, J. Wang, K. Wu, K. H. U, K. Saxena, K. Khandelwal, K. Zand, K. Matosich, K. Veeraraghavan, K. Michelena, K. Li, K. Jagadeesh, K. Huang, K. Chawla, K. Huang, L. Chen, L. Garg, L. A, L. Silva, L. Bell, L. Zhang, L. Guo, L. Yu, L. Moshkovich, L. Wehrstedt, M. Khabsa, M. Avalani, M. Bhatt, M. Mankus, M. Hasson, M. Lennie, M. Reso, M. Groshev, M. Naumov, M. Lathi, M. Keneally, M. Liu, M. L. Seltzer, M. Valko, M. Restrepo, M. Patel, M. Vyatskov, M. Samvelyan, M. Clark, M. Macey, M. Wang, M. J. Hermoso, M. Metanat, M. Rastegari, M. Bansal, N. Santhanam, N. Parks, N. White, N. Bawa, N. Singhal, N. Egebo, N. Usunier, N. Mehta, N. P. Laptev, N. Dong, N. Cheng, O. Chernoguz, O. Hart, O. Salpekar, O. Kalinli, P. Kent, P. Parekh, P. Saab, P. Balaji, P. Rittner, P. Bontrager, P. Roux, P. Dollar, P. Zvyagina, P. Ratanchandani, P. Yuvraj, Q. Liang, R. Alao, R. Rodriguez, R. Ayub, R. Murthy, R. Nayani, R. Mitra, R. Parthasarathy, R. Li, R. Hogan, R. Battey, R. Wang, R. Howes, R. Rinott, S. Mehta, S. Siby, S. J. Bondu, S. Datta, S. Chugh, S. Hunt, S. Dhillon, S. Sidorov, S. Pan, S. Mahajan, S. Verma, S. Yamamoto, S. Ramaswamy, S. Lindsay, S. Lindsay, S. Feng, S. Lin, S. C. Zha, S. Patil, S. Shankar, S. Zhang, S. Zhang, S. Wang, S. Agarwal, S. Sajuyigbe, S. Chintala, S. Max, S. Chen, S. Kehoe, S. Satterfield, S. Govindaprasad, S. Gupta, S. Deng, S. Cho, S. Virk, S. Subramanian, S. Choudhury, S. Goldman, T. Remez, T. Glaser, T. Best, T. Koehler, T. Robinson, T. Li, T. Zhang, T. Matthews, T. Chou, T. Shaked, V. Vontimitta, V. Ajayi, V. Montanez, V. Mohan, V. S. Kumar, V. Mangla, V. Ionescu, V. Poenaru, V. T. Mihailescu, V. Ivanov, W. Li, W. Wang, W. Jiang, W. Bouaziz, W. Constable, X. Tang, X. Wu, X. Wang, X. Wu, X. Gao, Y. Kleinman, Y. Chen, Y. Hu, Y. Jia, Y. Qi, Y. Li, Y. Zhang, Y. Zhang, Y. Adi, Y. Nam, Yu, Wang, Y. Zhao, Y. Hao, Y. Qian, Y. Li, Y. He, Z. Rait, Z. DeVito, Z. Rosnbrick, Z. Wen, Z. Yang, Z. Zhao, and Z. Ma. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783.
- He et al. [2015] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition, 2015. URL https://arxiv.org/abs/1512.03385.
- Hu et al. [2024] A. Hu, H. Xu, J. Ye, M. Yan, L. Zhang, B. Zhang, C. Li, J. Zhang, Q. Jin, F. Huang, and J. Zhou. mplug-docowl 1.5: Unified structure learning for ocr-free document understanding, 2024. URL https://arxiv.org/abs/2403.12895.
- Hudson and Manning [2019] D. A. Hudson and C. D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering, 2019. URL https://arxiv.org/abs/1902.09506.
- Jaume et al. [2019] G. Jaume, H. K. Ekenel, and J.-P. Thiran. Funsd: A dataset for form understanding in noisy scanned documents, 2019. URL https://arxiv.org/abs/1905.13538.
- Kim et al. [2022] G. Kim, T. Hong, M. Yim, J. Nam, J. Park, J. Yim, W. Hwang, S. Yun, D. Han, and S. Park. Ocr-free document understanding transformer, 2022. URL https://arxiv.org/abs/2111.15664.
- Kim et al. [2024] Y. Kim, M. Yim, and K. Y. Song. Tablevqa-bench: A visual question answering benchmark on multiple table domains. arXiv preprint arXiv:2404.19205, 2024.
- Laurençon et al. [2024] H. Laurençon, L. Tronchon, M. Cord, and V. Sanh. What matters when building vision-language models?, 2024. URL https://arxiv.org/abs/2405.02246.
- Lee et al. [2023] K. Lee, M. Joshi, I. Turc, H. Hu, F. Liu, J. Eisenschlos, U. Khandelwal, P. Shaw, M.-W. Chang, and K. Toutanova. Pix2struct: Screenshot parsing as pretraining for visual language understanding, 2023. URL https://arxiv.org/abs/2210.03347.
- Li et al. [2023a] B. Li, R. Wang, G. Wang, Y. Ge, Y. Ge, and Y. Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension, 2023a. URL https://arxiv.org/abs/2307.16125.
- Li et al. [2024] B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y. Li, Z. Liu, and C. Li. Llava-onevision: Easy visual task transfer, 2024. URL https://arxiv.org/abs/2408.03326.
- Li et al. [2023b] J. Li, D. Li, S. Savarese, and S. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023b. URL https://arxiv.org/abs/2301.12597.
- Li et al. [2023c] Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, and J.-R. Wen. Evaluating object hallucination in large vision-language models, 2023c. URL https://arxiv.org/abs/2305.10355.
- Liu et al. [2023a] H. Liu, C. Li, Y. Li, and Y. J. Lee. Improved baselines with visual instruction tuning, 2023a.
- Liu et al. [2023b] H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning, 2023b.
- Liu et al. [2024] H. Liu, C. Li, Y. Li, B. Li, Y. Zhang, S. Shen, and Y. J. Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/.
- Lu et al. [2024] S. Lu, Y. Li, Q.-G. Chen, Z. Xu, W. Luo, K. Zhang, and H.-J. Ye. Ovis: Structural embedding alignment for multimodal large language model, 2024. URL https://arxiv.org/abs/2405.20797.
- Masry et al. [2022] A. Masry, D. X. Long, J. Q. Tan, S. Joty, and E. Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022.
- Mathew et al. [2021a] M. Mathew, V. Bagal, R. P. Tito, D. Karatzas, E. Valveny, and C. V. Jawahar. Infographicvqa, 2021a. URL https://arxiv.org/abs/2104.12756.
- Mathew et al. [2021b] M. Mathew, D. Karatzas, and C. V. Jawahar. Docvqa: A dataset for vqa on document images, 2021b. URL https://arxiv.org/abs/2007.00398.
- OpenAI et al. [2023] OpenAI, J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, R. Avila, I. Babuschkin, S. Balaji, V. Balcom, P. Baltescu, H. Bao, M. Bavarian, J. Belgum, I. Bello, et al. Gpt-4 technical report. arXiv preprint arXiv: 2303.08774, 2023.
- Park et al. [2019] S. Park, S. Shin, B. Lee, J. Lee, J. Surh, M. Seo, and H. Lee. Cord: A consolidated receipt dataset for post-ocr parsing. Document Intelligence Workshop at Neural Information Processing Systems, 2019.
- Pasupat and Liang [2015] P. Pasupat and P. Liang. Compositional semantic parsing on semi-structured tables. In Annual Meeting of the Association for Computational Linguistics, 2015.
- Qwen et al. [2025] Qwen, :, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, and Z. Qiu. Qwen2.5 technical report, 2025. URL https://arxiv.org/abs/2412.15115.
- Radford et al. [2021] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision, 2021. URL https://arxiv.org/abs/2103.00020.
- Raffel et al. [2023] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2023. URL https://arxiv.org/abs/1910.10683.
- Rodriguez et al. [2024a] J. Rodriguez, X. Jian, S. S. Panigrahi, T. Zhang, A. Feizi, A. Puri, A. Kalkunte, F. Savard, A. Masry, S. Nayak, R. Awal, M. Massoud, A. Abaskohi, Z. Li, S. Wang, P.-A. Noël, M. L. Richter, S. Vadacchino, S. Agarwal, S. Biswas, S. Shanian, Y. Zhang, N. Bolger, K. MacDonald, S. Fauvel, S. Tejaswi, S. Sunkara, J. Monteiro, K. D. Dvijotham, T. Scholak, N. Chapados, S. Kharagani, S. Hughes, M. Özsu, S. Reddy, M. Pedersoli, Y. Bengio, C. Pal, I. Laradji, S. Gella, P. Taslakian, D. Vazquez, and S. Rajeswar. Bigdocs: An open and permissively-licensed dataset for training multimodal models on document and code tasks, 2024a. URL https://arxiv.org/abs/2412.04626.
- Rodriguez et al. [2022] J. A. Rodriguez, D. Vazquez, I. Laradji, M. Pedersoli, and P. Rodriguez. Ocr-vqgan: Taming text-within-image generation, 2022. URL https://arxiv.org/abs/2210.11248.
- Rodriguez et al. [2024b] J. A. Rodriguez, A. Puri, S. Agarwal, I. H. Laradji, P. Rodriguez, S. Rajeswar, D. Vazquez, C. Pal, and M. Pedersoli. Starvector: Generating scalable vector graphics code from images and text, 2024b. URL https://arxiv.org/abs/2312.11556.
- Singh et al. [2019] A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Batra, D. Parikh, and M. Rohrbach. Towards vqa models that can read. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
- Stanisławek et al. [2021] T. Stanisławek, F. Graliński, A. Wróblewska, D. Lipiński, A. Kaliska, P. Rosalska, B. Topolski, and P. Biecek. Kleister: key information extraction datasets involving long documents with complex layouts. In International Conference on Document Analysis and Recognition, 2021.
- Svetlichnaya [2020] S. Svetlichnaya. Deepform: Understand structured documents at scale, 2020.
- Team [2024] G. Team. Gemini: A family of highly capable multimodal models, 2024. URL https://arxiv.org/abs/2312.11805.
- Vogus and Llansó [2021] C. Vogus and E. Llansó. Making transparency meaningful: A framework for policymakers. Center for Democracy and Technology, 2021.
- Wang et al. [2023a] D. Wang, N. Raman, M. Sibue, Z. Ma, P. Babkin, S. Kaur, Y. Pei, A. Nourbakhsh, and X. Liu. Docllm: A layout-aware generative language model for multimodal document understanding, 2023a. URL https://arxiv.org/abs/2401.00908.
- Wang et al. [2024] P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, Y. Fan, K. Dang, M. Du, X. Ren, R. Men, D. Liu, C. Zhou, J. Zhou, and J. Lin. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution, 2024. URL https://arxiv.org/abs/2409.12191.
- Wang et al. [2023b] W. Wang, Q. Lv, W. Yu, W. Hong, J. Qi, Y. Wang, J. Ji, Z. Yang, L. Zhao, X. Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023b.
- Wu et al. [2024a] C. Wu, X. Chen, Z. Wu, Y. Ma, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, C. Ruan, and P. Luo. Janus: Decoupling visual encoding for unified multimodal understanding and generation, 2024a. URL https://arxiv.org/abs/2410.13848.
- Wu et al. [2024b] Z. Wu, X. Chen, Z. Pan, X. Liu, W. Liu, D. Dai, H. Gao, Y. Ma, C. Wu, B. Wang, Z. Xie, Y. Wu, K. Hu, J. Wang, Y. Sun, Y. Li, Y. Piao, K. Guan, A. Liu, X. Xie, Y. You, K. Dong, X. Yu, H. Zhang, L. Zhao, Y. Wang, and C. Ruan. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding, 2024b. URL https://arxiv.org/abs/2412.10302.
- Xu et al. [2024] R. Xu, Y. Yao, Z. Guo, J. Cui, Z. Ni, C. Ge, T.-S. Chua, Z. Liu, M. Sun, and G. Huang. Llava-uhd: an lmm perceiving any aspect ratio and high-resolution images. European Conference on Computer Vision, 2024. doi: 10.48550/arXiv.2403.11703.
- Yu et al. [2024] W. Yu, Z. Yang, L. Li, J. Wang, K. Lin, Z. Liu, X. Wang, and L. Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities, 2024. URL https://arxiv.org/abs/2308.02490.
- Yue et al. [2024] X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun, C. Wei, B. Yu, R. Yuan, R. Sun, M. Yin, B. Zheng, Z. Yang, Y. Liu, W. Huang, H. Sun, Y. Su, and W. Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi, 2024. URL https://arxiv.org/abs/2311.16502.
- Zhai et al. [2023] X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training, 2023. URL https://arxiv.org/abs/2303.15343.
- Zhang et al. [2024] T. Zhang, S. Wang, L. Li, G. Zhang, P. Taslakian, S. Rajeswar, J. Fu, B. Liu, and Y. Bengio. Vcr: Visual caption restoration. arXiv preprint arXiv: 2406.06462, 2024.
- Zhao et al. [2024] Y. Zhao, J. Huang, J. Hu, X. Wang, Y. Mao, D. Zhang, Z. Jiang, Z. Wu, B. Ai, A. Wang, W. Zhou, and Y. Chen. Swift: a scalable lightweight infrastructure for fine-tuning, 2024. URL https://arxiv.org/abs/2408.05517.
- Zhu et al. [2025] J. Zhu, W. Wang, Z. Chen, Z. Liu, S. Ye, L. Gu, H. Tian, Y. Duan, W. Su, J. Shao, Z. Gao, E. Cui, X. Wang, Y. Cao, Y. Liu, X. Wei, H. Zhang, H. Wang, W. Xu, H. Li, J. Wang, N. Deng, S. Li, Y. He, T. Jiang, J. Luo, Y. Wang, C. He, B. Shi, X. Zhang, W. Shao, J. He, Y. Xiong, W. Qu, P. Sun, P. Jiao, H. Lv, L. Wu, K. Zhang, H. Deng, J. Ge, K. Chen, L. Wang, M. Dou, L. Lu, X. Zhu, T. Lu, D. Lin, Y. Qiao, J. Dai, and W. Wang. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models, 2025. URL https://arxiv.org/abs/2504.10479.
## Appendix A Appendix
### A.1 Experimental Setup
We provide detailed hyperparameters of our experiments in Table 6.
Table 6: Detailed hyperparameters for each training stage across different LLM backbones.
| LLM Backbone | Stage | Trainable Parameters | Batch Size | Text Max Length | Epochs | Learning Rate |
| --- | --- | --- | --- | --- | --- | --- |
| Llama 3.2-1B | Stage-1 | Full Model | 512 | 1024 | 1 | $1\times 10^{-5}$ |
| Llama 3.2-1B | Stage-2 | Full Model | 512 | 2048 | 1 | $5\times 10^{-5}$ |
| Llama 3.2-1B | Stage-3 | LLM & Connector | 512 | 2048 | 5 | $5\times 10^{-5}$ |
| Llama 3.2-3B | Stage-1 | Full Model | 512 | 1024 | 1 | $1\times 10^{-5}$ |
| Llama 3.2-3B | Stage-2 | Full Model | 256 | 2048 | 1 | $5\times 10^{-5}$ |
| Llama 3.2-3B | Stage-3 | LLM & Connector | 256 | 2048 | 5 | $5\times 10^{-5}$ |
| Llama 3.1-8B | Stage-1 | Full Model | 512 | 1024 | 1 | $1\times 10^{-5}$ |
| Llama 3.1-8B | Stage-2 | Full Model | 256 | 2048 | 1 | $1\times 10^{-5}$ |
| Llama 3.1-8B | Stage-3 | LLM & Connector | 256 | 2048 | 5 | $1\times 10^{-5}$ |
### A.2 Runtime Comparison Between Connectors
One caveat of the Align connector is that it includes an additional LM head layer, which slightly increases the total number of parameters. However, this addition has a negligible impact on runtime efficiency due to its simple structure: it introduces only a few matrix multiplications (as shown in Equations 1 and 2) rather than stacking many complex layers that require sequential processing, as in deep fusion methods.
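For concreteness, the PyTorch sketch below shows one way such a connector can be written: visual features are projected to the LLM hidden size, scored against the vocabulary with an LM-head-style linear layer, and the resulting softmax weights are used to average the LLM's text embeddings, keeping the output inside their convex hull. The layer names and exact projection are illustrative and should not be read as the paper's Equations 1 and 2.

```python
import torch
import torch.nn as nn

class AlignConnectorSketch(nn.Module):
    """Illustrative connector: visual features -> convex combination of text embeddings."""
    def __init__(self, vision_dim: int, llm_dim: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)                 # map to LLM hidden size
        self.lm_head = nn.Linear(llm_dim, vocab_size)              # extra LM-head layer
        self.text_embeddings = nn.Embedding(vocab_size, llm_dim)   # tied to the LLM in practice

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        h = self.proj(visual_feats)                                # (N, llm_dim)
        probs = self.lm_head(h).softmax(dim=-1)                    # (N, vocab_size), convex weights
        # Weighted average of text embeddings stays inside their convex hull.
        return probs @ self.text_embeddings.weight                 # (N, llm_dim)
```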
To empirically validate this claim, we benchmarked the runtime and memory usage of models equipped with different connector types (MLP, Align, Ovis, and Perceiver), following the same experimental setup as in Table 2. As shown in Table 7, the results demonstrate that although the Align connector delivers notably superior performance (see Table 2), the variations in inference speed and GPU memory usage among the connectors remain minimal.
Table 7: Runtime and memory comparison between different connector designs. The results show that Align introduces negligible computational overhead compared to other connectors.
| Model | Samples | Avg Time (s) | Tokens/sec | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| Llama-3.2-3B-MLP | 2500 | 0.161 | 118.3 | 10.9 |
| Llama-3.2-3B-Perceiver | 2500 | 0.140 | 135.1 | 10.9 |
| Llama-3.2-3B-Ovis | 2500 | 0.155 | 122.5 | 10.8 |
| Llama-3.2-3B-Align | 2500 | 0.165 | 115.4 | 10.9 |
Overall, the empirical evidence confirms that the Align connector achieves an effective balance between computational efficiency and performance. It introduces only a negligible increase in runtime and memory usage while providing substantial gains in overall accuracy.
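The measurements in Table 7 can be gathered with a loop like the one below; `model` is assumed to expose a Hugging Face-style `generate` method and `dataloader` to yield batched tensor inputs, so this is a sketch of the methodology rather than our benchmarking script.

```python
import time
import torch

@torch.no_grad()
def benchmark(model, dataloader, device="cuda"):
    """Return (avg seconds per sample, generated token ids per second, peak GPU memory in GB)."""
    model.eval().to(device)
    torch.cuda.reset_peak_memory_stats(device)
    total_time, total_tokens, n_samples = 0.0, 0, 0
    for batch in dataloader:
        start = time.perf_counter()
        out = model.generate(**{k: v.to(device) for k, v in batch.items()})
        torch.cuda.synchronize(device)
        total_time += time.perf_counter() - start
        total_tokens += out.numel()          # counts all returned token ids
        n_samples += out.shape[0]
    peak_gb = torch.cuda.max_memory_allocated(device) / 1024**3
    return total_time / n_samples, total_tokens / total_time, peak_gb
```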
### A.3 Pixel-Level Tasks Analysis
To rigorously evaluate the ability of vision-language models to integrate fine-grained visual and textual pixel-level cues, we test our model on the VCR benchmark [Zhang et al., 2024], which requires the model to recover partially occluded text using pixel-level hints from the revealed parts of the text. This task stresses a VLM's alignment of text and image in extreme situations. Current state-of-the-art models such as GPT-4V [OpenAI et al., 2023], Claude 3.5 Sonnet [Anthropic, 2024], and Llama-3.2 [Dubey et al., 2024] significantly underperform humans on the hard VCR task because they cannot process subtle pixel-level cues in occluded text regions. These models frequently discard critical visual tokens during image tokenization based on semantic priors, overlooking the interplay between partial character strokes and contextual visual scenes. To evaluate performance on VCR, we modify our Stage 3 SFT dataset composition by replacing the exclusive use of DocDownstream with a 5:1 blended ratio of DocDownstream and VCR training data. This adjustment enables a direct evaluation of the ability of our Align architecture to leverage pixel-level character cues.
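One simple way to realize the 5:1 blend is sketched below, assuming `doc_downstream` and `vcr_train` are lists of training examples; the sampling scheme beyond the stated ratio is not specified here, so this is only one plausible instantiation.

```python
import random

def blend_5_to_1(doc_downstream: list, vcr_train: list, seed: int = 0) -> list:
    """Mix DocDownstream and VCR examples at a 5:1 sample-count ratio, then shuffle."""
    rng = random.Random(seed)
    n_vcr = len(doc_downstream) // 5                 # one VCR example per five DocDownstream examples
    vcr_subset = rng.choices(vcr_train, k=n_vcr)     # sample with replacement in case VCR is small
    mixed = list(doc_downstream) + vcr_subset
    rng.shuffle(mixed)
    return mixed
```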
From the experimental outcomes, it is evident that AlignVLM consistently outperforms the MLP Connector Model across both easy and hard settings of the pixel-level VCR task (see Figure 5), with improvements ranging from 10.18% on the hard setting to 14.41% on the easy setting.
We provide a case study on VCR in Figure 6, featuring four representative examples. In Figure 6(a), the MLP connector model fails to capture semantic consistency as effectively as AlignVLM: its completion "The commune first *census in written history in*" (where the words in italics are generated by the model and the rest appear in the image) is not as semantically coherent as the completion generated by Align, "The commune first *appears in written history in*".
Beyond the issue of semantic fluency, in Figure 6(b) we also observe that AlignVLM successfully identifies the uncovered portion of the letter âgâ in âaccountingâ and uses it as a pixel-level hint to infer the correct word. In contrast, the MLP model fails to effectively attend to this crucial detail.
Figures 6(c) and 6(d) show examples where AlignVLM fails on the VCR task. These hand-picked instances show that our method confuses place names with similar-looking, more common words. As seen in the examples, AlignVLM mistakes "Llanengan" for "Llanongan" and "Gorden" for "Garden". In both instances the pairs differ by a single character, suggesting that AlignVLM tends to align visual representations to more common tokens in the vocabulary. Training AlignVLM on more contextually relevant data could potentially mitigate such errors.
<details>
<summary>x3.png Details</summary>

### Visual Description
## Horizontal Bar Chart: Model Performance on VCR Tasks
### Overview
The image displays a horizontal bar chart comparing the performance of two language models on two variants of the Visual Caption Restoration (VCR) task in English. The performance metric is "Exact Match (%)". The chart clearly shows that one model, labeled as "Ours," outperforms the other on both task difficulties.
### Components/Axes
* **Chart Type:** Horizontal grouped bar chart.
* **Y-Axis (Vertical):** Lists the two task categories.
* Top category: `VCR EN Easy`
* Bottom category: `VCR EN Hard`
* **X-Axis (Horizontal):** Represents the performance metric.
* **Label:** `Exact Match (%)`
* **Scale:** Linear scale from 0 to approximately 70, with major tick marks at 0, 20, 40, and 60.
* **Legend:** Positioned at the bottom center of the chart.
* **Light Blue Bar:** `Llama-3.2-3B-Align (Ours)`
* **Light Orange Bar:** `Llama-3.2-3B-MLP`
* **Data Labels:** Numerical values are printed at the end of each bar, indicating the exact percentage.
### Detailed Analysis
The chart presents the following specific data points:
**1. VCR EN Easy Task:**
* **Llama-3.2-3B-Align (Ours) [Light Blue Bar]:** The bar extends to the right, ending at a data label of **65.84%**. This is the highest value on the chart.
* **Llama-3.2-3B-MLP [Light Orange Bar]:** The bar is shorter, ending at a data label of **51.43%**.
**2. VCR EN Hard Task:**
* **Llama-3.2-3B-Align (Ours) [Light Blue Bar]:** The bar extends to a data label of **48.07%**.
* **Llama-3.2-3B-MLP [Light Orange Bar]:** This is the shortest bar on the chart, ending at a data label of **37.89%**.
**Trend Verification:**
* For both models, performance is higher on the "Easy" task compared to the "Hard" task. The blue bar for "Easy" is longer than the blue bar for "Hard," and the same relationship holds for the orange bars.
* For both task difficulties, the "Llama-3.2-3B-Align (Ours)" model (blue) achieves a higher score than the "Llama-3.2-3B-MLP" model (orange). The blue bar is consistently longer than the orange bar within each task group.
### Key Observations
1. **Consistent Performance Hierarchy:** The "Align" model demonstrates a clear and consistent performance advantage over the "MLP" model across both evaluated task difficulties.
2. **Task Difficulty Impact:** Both models experience a significant drop in performance when moving from the "Easy" to the "Hard" variant of the VCR EN task. The "Align" model's score drops by approximately 17.77 percentage points (65.84% to 48.07%), while the "MLP" model's score drops by approximately 13.54 percentage points (51.43% to 37.89%).
3. **Performance Gap:** The absolute performance gap between the two models is larger on the "Easy" task (14.41 percentage points) than on the "Hard" task (10.18 percentage points).
### Interpretation
This chart provides a direct, quantitative comparison of two model variants on a visual reasoning benchmark. The data suggests that the architectural or training modification designated as "Align" in "Llama-3.2-3B-Align" yields a substantial improvement in exact match accuracy over the "MLP" variant for this specific task.
The universal drop in scores from "Easy" to "Hard" validates the task design, confirming that the "Hard" subset presents a greater challenge. The fact that the "Align" model maintains a lead even on the harder task indicates that its performance gains are robust and not limited to simpler examples.
From a research perspective, this visualization efficiently communicates the success of the "Align" method. The clear visual separation of the bars, reinforced by the precise numerical labels, leaves little ambiguity about the relative effectiveness of the two approaches on the VCR EN benchmark. The chart is designed to highlight the superiority of the authors' proposed model ("Ours").
</details>
Figure 5: Comparison of Llama-3.2-3B-Align and Llama-3.2-3B-MLP on the Easy and Hard VCR tasks.
<details>
<summary>figures/vcr_example1.png Details</summary>

### Visual Description
## Map with Descriptive Text: Ațel Commune, Romania
### Overview
The image consists of two primary components: a topographical map of a geographical region and a block of descriptive text below it. The map is a green-shaded relief map showing terrain features, with no visible labels, legends, or scales. The text provides administrative and compositional information about the location depicted.
### Components
1. **Map Component (Top):**
* **Type:** Topographical/relief map.
* **Visuals:** Depicts a landmass with varied shades of green, indicating elevation and terrain (likely hills and valleys). Lighter, yellowish-green patches are visible, which may represent settlements, agricultural areas, or different land cover. The overall shape is irregular, suggesting an administrative boundary.
* **Labels/Axes:** None present. No city names, rivers, roads, or coordinate markers are visible.
* **Legend:** None present.
2. **Text Component (Bottom):**
* **Position:** Located directly below the map, occupying the lower third of the image.
* **Language:** Romanian.
* **Font:** A standard serif typeface (e.g., Times New Roman).
* **Content:** A descriptive paragraph about the commune.
### Detailed Analysis / Content Details
**Text Transcription (Romanian):**
The visible text reads:
"AÈel este o comunÄ Ăźn judeÈul Sibiu, Transilvania, RomĂąnia.
Este compusÄ din douÄ sate, AÈel Èi DupuÈ. Comuna s-a..."
*(The text is cut off at the end of the line.)*
**English Translation:**
"AÈel is a commune in Sibiu County, Transylvania, Romania.
It is composed of two villages, AÈel and DupuÈ. The commune first..."
**Map Visual Analysis:**
* The map shows a contiguous, roughly oval-shaped region with a jagged perimeter.
* The terrain appears hilly or mountainous, with darker green shading likely indicating valleys or lower elevations and lighter shades indicating ridges or higher ground.
* Several concentrated light-yellow patches are scattered within the region, with the largest one located slightly left of the center. These are inferred to be the built-up areas of the villages mentioned in the text (Ațel and Dupuș).
### Key Observations
1. **Information Asymmetry:** The map provides purely visual, unlabeled geographical context, while all specific factual data (name, location, composition) is contained in the text.
2. **Text Incompleteness:** The descriptive text is truncated, ending mid-sentence ("Comuna s-a..."), indicating that additional historical or administrative information is missing from the image.
3. **Lack of Cartographic Elements:** The map lacks standard reference points such as a north arrow, scale bar, coordinate grid, or labels for neighboring regions, rivers, or roads, limiting its standalone utility for precise geographical analysis.
4. **Visual-Textual Correlation:** The light patches on the map correspond logically to the "two villages" mentioned in the text, providing a visual representation of their approximate locations and relative sizes within the commune's territory.
### Interpretation
This image serves as a basic informational snippet, likely from an encyclopedia, textbook, or official document, introducing the Ațel commune. The data suggests the following:
* **Administrative Structure:** Ațel is a second-level administrative division (a commune) within Sibiu County, in the historical region of Transylvania, Romania. Its structure is composite, consisting of two distinct villages.
* **Geographical Context:** The commune encompasses a defined, hilly territory. The map visually communicates the physical landscape and the dispersed settlement pattern within its borders, even without labels.
* **Purpose and Limitation:** The combination of map and text is designed to give a quick, foundational understanding of "what and where" Ațel is. However, the cut-off text and lack of map details prevent a full understanding. The missing conclusion of the sentence ("Comuna s-a...") likely pertains to the commune's formation or first documented mention, which is a common format for such descriptions. To gain complete information, one would need to consult the full source document or a complete geographical database.
</details>
| GT: | (appears in written history in) |
| --- | --- |
| MLP: | (census in written history in) ✗ |
| Align: | (appears in written history in) ✓ |
(a) Positive Example 1
<details>
<summary>figures/vcr_example2.png Details</summary>

### Visual Description
## Diagram: Map of Ghana Telephone Area Codes
### Overview
The image is a colored map diagram of Ghana, divided into distinct geographical regions. Each region is assigned a three-digit telephone area code, displayed directly within its boundaries. Below the map, a partial block of text provides context about the Ghana telephone numbering plan.
### Components/Axes
**Map Regions and Area Codes:**
The map is segmented into 10 distinct regions, each filled with a unique color and labeled with a three-digit number. The regions and their approximate spatial positions are as follows:
* **Top-left (Northwest):** Blue region, labeled **039**.
* **Top-center (North):** Yellow region, labeled **038**.
* **Upper-central (North-central):** Purple region, labeled **037**.
* **Center-left (West-central):** Light green/lime region, labeled **035**.
* **Center-right (East-central):** Dark green region, labeled **036**.
* **Center (Central):** Red region, labeled **032**.
* **Lower-center (South-central):** Light purple/lavender region, labeled **034**.
* **Bottom-left (Southwest):** Cyan/light blue region, labeled **031**.
* **Bottom-center (South):** Light blue region, labeled **033**.
* **Bottom-right (Southeast):** Pink region, labeled **030**.
**Text Block (Below Map):**
The text is in English and is partially cut off at the right margin. The visible portion reads:
"The Ghana telephone numbering plan is [text cut off] telephone numbers in Ghana. It is regulated by the National Communications"
### Detailed Analysis
* **Data Series:** The primary data is categorical, associating specific geographical areas of Ghana with unique telephone area codes.
* **Spatial Distribution:** The area codes are not assigned in a simple sequential geographical order (e.g., north to south). For example, 037 covers a large northern area, while 030, 031, 032, 033, and 034 are clustered in the southern half of the country.
* **Color Coding:** Each region has a distinct, solid color fill. There is no separate legend; the area code labels are placed directly on the colored regions for identification.
### Key Observations
1. **Non-Sequential Numbering:** The area codes do not follow a clear numerical progression based on geography (e.g., lower numbers in the south, higher in the north). Codes like 030, 031, 032, 033, and 034 are all in the southern half, while 037, 038, and 039 are in the north.
2. **Regional Clustering:** Several area codes are grouped in the more densely populated southern coastal region (030, 031, 032, 033, 034), which may correspond to major cities or administrative regions like Accra, Kumasi, and Takoradi.
3. **Large Northern Regions:** The northern part of the country is divided into fewer, larger regions with area codes 037, 038, and 039, suggesting lower population density or different administrative boundaries.
4. **Incomplete Text:** The descriptive text is truncated, preventing a full understanding of the plan's description or the complete name of the regulatory body (likely the "National Communications Authority").
### Interpretation
This diagram serves as a visual reference for the geographic allocation of telephone area codes in Ghana. It demonstrates that the numbering plan is not based on a simple grid or sequential system but likely aligns with historical administrative regions, population centers, or telecommunications infrastructure zones.
The clustering of multiple area codes (030-034) in the south indicates higher demand for telephone numbers in that area, consistent with it being the economic and administrative heart of the country. The regulatory body mentioned, the National Communications [Authority], is responsible for managing this numbering resource to ensure efficient telecommunication services.
**Missing Information:** The diagram does not specify which cities or towns fall within each area code region. The truncated text likely contained important details about the plan's structure (e.g., number length, trunk prefix) or history. For a complete technical document, this map would need to be supplemented with a table matching area codes to specific cities or regions and the full explanatory text.
</details>
| GT: | (the system used for assigning) |
| --- | --- |
| MLP: | (the system used for accounting) ✗ |
| Align: | (the system used for assigning) ✓ |
(b) Positive Example 2
<details>
<summary>figures/vcr_example3.png Details</summary>

### Visual Description
## Photograph: Flooded Mine Tunnel with Text Overlay
### Overview
The image is a photograph of a narrow, flooded underground mine tunnel, with a block of descriptive text overlaid at the bottom. The tunnel recedes into the distance, with rough-hewn stone walls and a water-filled floor reflecting the dim light. The text provides historical and geographical context about the mine system.
### Components/Axes
**Visual Components:**
1. **Tunnel Structure:** A long, straight, and narrow passage with walls made of rough, uneven stone blocks. The ceiling is arched.
2. **Water:** The floor of the tunnel is completely submerged under calm, dark water, which reflects the tunnel walls and the distant light.
3. **Lighting:** The scene is dimly lit. A bright light source is visible at the far end of the tunnel, creating a strong vanishing point and illuminating the water's surface in the distance. The foreground and mid-ground are in deep shadow.
4. **Perspective:** The photograph is taken from a low angle, looking straight down the length of the tunnel, emphasizing its depth and confinement.
**Text Overlay (Bottom of Image):**
* **Position:** A white text block is superimposed over the lower portion of the photograph.
* **Content:** The text is in English and appears to be an excerpt from a description. It is partially cut off, indicated by ellipses.
### Detailed Analysis / Content Details
**Transcription of Visible Text:**
The text reads as follows, with `[...]` indicating where the text is cut off or illegible:
> "The Penrhyn Ddu Mines are a collection of
> [...]
> the Llŷn Peninsula.
> It encompasses the Penrhyn,
> Asheton, Western and
> [...]"
**Visual Analysis of the Tunnel:**
* **Trend/State:** The tunnel is in a static, flooded state. The water appears still and deep, suggesting long-term abandonment and water ingress.
* **Construction:** The walls are built from irregular stone blocks, indicating historical, manual construction rather than modern boring techniques. The arched ceiling is a common structural feature in mining for stability.
* **Atmosphere:** The image conveys a sense of isolation, history, and inaccessibility. The single point of light at the end draws the eye but offers no detail of what lies beyond.
### Key Observations
1. **Flooded Condition:** The primary observable fact is that this section of the mine is completely waterlogged, making it inaccessible without specialized equipment.
2. **Historical Construction:** The stonework points to an older mining operation, likely from the 18th or 19th century.
3. **Textual Information:** The text identifies the site as the "Penrhyn Ddu Mines," located on the "Llŷn Peninsula" (a region in North Wales, UK). It lists at least three specific mines within the collection: Penrhyn, Asheton, and Western.
4. **Incomplete Data:** The text is truncated, so the full description and the complete list of mines in the collection are not available from this image alone.
### Interpretation
This image serves as both a visual record and a historical document. The photograph provides direct evidence of the current physical state of the Penrhyn Ddu Minesâspecifically, that they are flooded and abandoned. This has implications for historical preservation, safety, and potential future exploration or study.
The text overlay connects the visual evidence to a specific geographical and historical context. It defines the "Penrhyn Ddu Mines" not as a single site but as a *collection* of mines (Penrhyn, Asheton, Western, and others) operating across the Llŷn Peninsula. This suggests a historically significant mining district rather than an isolated operation.
The contrast between the dark, confined, flooded tunnel and the distant light is a powerful visual metaphor. It can be read as representing the obscurity of history (the dark tunnel) with a point of inquiry or discovery (the light) at its end. The flooding itself is a key data point, indicating the end of the mine's economic life and its reclamation by natural processes. For a technical document, this image would be crucial for illustrating the challenges of accessing or surveying historical mine workings in this region.
</details>
| GT: | (mines situated near Llanengan on) |
| --- | --- |
| MLP: | (mines situated near Llanengan on) ✓ |
| Align: | (mines situated near Llanongan on) ✗ |
(c) Negative Example 1
<details>
<summary>figures/vcr_example4.png Details</summary>

### Visual Description
## Photograph: Fairmount City Hall and Descriptive Text
### Overview
The image is a photograph of a single-story, red brick municipal building identified as the "City of Fairmount." Below the photograph is a block of printed text providing demographic and geographic information about the town of Fairmount, Georgia. The image appears to be a scan or photograph of a printed document, possibly from a book, brochure, or informational plaque.
### Components
1. **Photograph (Top Section):**
* **Subject:** A symmetrical, red brick building with a gabled roof. The architecture features a central entrance portico supported by two white columns. A set of double glass doors forms the main entrance, flanked by two multi-pane windows. The building has a white cornice and trim.
* **Signage:** Centered above the entrance portico, gold or yellow lettering on the brick reads: "CITY OF FAIRMOUNT".
* **Setting:** The building is set back from a paved area (likely a street or parking lot). Green grass and mature deciduous trees are visible in the background under a partly cloudy blue sky.
2. **Text Block (Bottom Section):**
* A paragraph of black text on a white background, positioned directly below the photograph. The text is in English.
### Content Details
**Transcription of Text Block:**
The text reads as follows (line breaks are approximate based on the image):
"Fairmount is a city in Gordon
County, Georgia, United States. As of the 2010 census,
the city had a population of 720. Gordon County is
located in northwest Georgia. The county seat is Calhoun.
Fairmount is located in the southeastern part of the county,
approximately 10 miles (16 km) southeast of Calhoun. It is
also located near the southern border of the county, near
the border with Bartow County. The city is part of the
Dalton, Georgia Metropolitan Statistical Area. Nearby
communities include: Cartersville, White, Rydal, and New Echota."
**Key Data Points Extracted:**
* **Entity:** Fairmount
* **Type:** City
* **Location:** Gordon County, Georgia, United States.
* **Population:** 720 (as of the 2010 census).
* **County Seat of Gordon County:** Calhoun.
* **Relative Location (within county):** Southeastern part, approximately 10 miles (16 km) southeast of Calhoun.
* **Relative Location (state):** Northwest Georgia.
* **Metropolitan Area:** Part of the Dalton, Georgia Metropolitan Statistical Area.
* **Nearby Communities:** Cartersville, White, Rydal, New Echota.
### Key Observations
* The photograph serves as a visual identifier for the subject of the text.
* The text provides a concise, factual summary typical of an encyclopedia entry or official document.
* The population figure is explicitly tied to a specific historical data point (the 2010 census), indicating the information may not reflect current demographics.
* The description uses precise geographical relationships ("southeastern part," "10 miles southeast," "near the southern border") to ground the town's location.
### Interpretation
This image-text pair functions as a **geographic and demographic snapshot**. The photograph establishes the town's civic identity through its city hall, while the text provides the essential data for understanding its place within administrative and population hierarchies (city, county, metropolitan area).
The information is **investigative and locational**. It answers the fundamental questions: "What is Fairmount?" (a city), "Where is it?" (Gordon County, GA, with specific coordinates relative to other towns), and "How big is it?" (population 720 in 2010). The mention of nearby communities like New Echota (a historically significant site) and Cartersville provides cultural and regional context.
The primary utility of this extracted data is for **reference, research, or documentation**. It allows someone to understand Fairmount's basic profile without needing the original source. The explicit citation of the 2010 census is a critical piece of metadata, warning the reader that the population figure is a historical data point and should be verified for current accuracy.
</details>
| GT: | (Gorden County is home to) |
| --- | --- |
| MLP: | (Gorden County is home to) ✓ |
| Align: | (Garden County is home to) ✗ |
(d) Negative Example 2
Figure 6: Case Study for Pixel-Level Tasks. We provide examples of our proposed Align connector compared with the Multi-Layer Perceptron (MLP) connector. The Align connector tends to map visual elements to more common words in the vocabulary. GT is the ground truth.
### A.4 Case Studies
In this section, we provide case studies for the experiments in Section 5.1. Specifically, we provide examples of our Llama-3.2-3B-Align and its counterpart models with alternative connectors, Llama-3.2-3B-MLP and Llama-3.2-3B-Ovis, on three different datasets: KLC [Stanisławek et al., 2021], DocVQA [Mathew et al., 2021b], and TextVQA [Singh et al., 2019]. The examples are shown in Figures 7, 8, and 9.
<details>
<summary>figures/case_1.jpg Details</summary>

### Visual Description
## Document Cover Page: Ardingly College Limited Financial Report
### Overview
This image is the cover page of a formal financial document for Ardingly College Limited. It is a black-and-white scan of a printed page, featuring a decorative border, the institution's logo, and the title of the report. A handwritten annotation appears in the top-right corner.
### Components/Axes
The page is structured within a double-lined decorative border with ornate corner flourishes. The elements are centered vertically and horizontally within this frame.
1. **Top-Right Corner (Outside Border):** Handwritten text in black ink: "EXTERNAL".
2. **Top Center (Inside Border):** The logo and name of the institution.
* **Logo:** A line drawing of a stork or similar bird standing on a base, next to a globe on a stand.
* **Text:** "ARDINGLY" in a large, serif font. Below it, in a smaller font: "College". Beneath the globe: the tagline "World Ready".
3. **Center of Page:** The main title of the document.
* **Primary Title:** "Ardingly College Limited" in a large, italicized serif font.
* **Secondary Title:** "DIRECTORS' REPORT AND FINANCIAL STATEMENTS" in all-caps, sans-serif font.
* **Reporting Period:** "FOR THE YEAR ENDED 31ST AUGUST 2017" in all-caps, sans-serif font.
4. **Bottom Center (Inside Border):** Legal registration information.
* "Charity Registration No. 1076456"
* "Company Registration No. 03779971 (England and Wales)"
### Content Details
All text on the page is in English. The transcription is as follows:
* (Handwritten, top-right): EXTERNAL
* (Logo text): ARDINGLY College / World Ready
* (Main Title): Ardingly College Limited
* (Report Title): DIRECTORS' REPORT AND FINANCIAL STATEMENTS
* (Period): FOR THE YEAR ENDED 31ST AUGUST 2017
* (Charity Info): Charity Registration No. 1076456
* (Company Info): Company Registration No. 03779971 (England and Wales)
### Key Observations
* The document is for the financial year ending **31 August 2017**.
* The entity is structured as both a registered charity (No. 1076456) and a limited company (No. 03779971) registered in England and Wales.
* The handwritten "EXTERNAL" note suggests this copy was designated for distribution outside the organization.
* The design is formal and traditional, using a decorative border and serif fonts, appropriate for an official financial report.
### Interpretation
This cover page identifies the document as the official annual financial report for Ardingly College Limited, a charitable company. The dual registration numbers are critical legal identifiers, confirming its status as a regulated entity in the UK. The report covers the academic/financial year concluding at the end of August 2017, which is a common year-end for educational institutions. The "EXTERNAL" marking implies this document was prepared for stakeholders, regulators, or public record, rather than for internal use only. The page serves as a formal gateway to the detailed financial statements and directors' commentary contained within the full report.
</details>
| Question: | What is the value for the charity name? |
| --- | --- |
| GT: | (Ardingly College Ltd.) |
| MLP: | (Ardington College Ltd.) ✗ |
| Ovis: | (Ardington College Ltd.) ✗ |
| Align: | (Ardingly College Ltd.) ✓ |
(a) Positive Example #1
<details>
<summary>figures/case_2.jpg Details</summary>

### Visual Description
## [Document Form]: Trustees' Annual Report - Section A
### Overview
This image is a scanned page from a "Trustees' Annual Report" form issued by the Charity Commission. It is Section A, titled "Reference and administration details," for the charity "TENDER GRACE CHRISTIAN CENTRE." The report covers the period from 11 February 2016 to 10 February 2017. The form is partially completed with the charity's core identification details and a list of its trustees.
### Components/Axes
The document is structured as a form with labeled fields and tables.
* **Header:** Contains the Charity Commission logo (top-left) and the report title "Trustees' Annual Report for the period."
* **Reporting Period:** Defined in a table with "From" and "To" dates.
* **Section Header:** A black bar with white text reading "Section A Reference and administration details."
* **Form Fields:** Labeled boxes for "Charity name," "Other names charity is known by," "Registered charity number (if any)," and "Charity's principal address" (with sub-fields for address lines and "Postcode").
* **Trustee Table:** A table titled "Names of the charity trustees who manage the charity" with four columns:
1. Trustee name
2. Office (if any)
3. Dates acted if not for whole year
4. Name of person (or body) entitled to appoint trustee (if any)
* **Secondary Table:** A smaller, empty table titled "Names of the trustees for the charity, if any, (for example, any custodian trustees)" with columns for "Name" and "Dates acted if not for whole year."
* **Footer:** Contains the text "TAR", the page number "1", and the form revision date "March 2012."
### Detailed Analysis
**Header & Period:**
* **Charity Commission Logo:** Present in the top-left corner.
* **Report Title:** "Trustees' Annual Report for the period"
* **Period Start Date (From):** 11 02 2016 (11 February 2016)
* **Period End Date (To):** 10 02 2017 (10 February 2017)
**Section A - Reference and administration details:**
* **Charity name:** TENDER GRACE CHRISTIAN CENTRE
* **Other names charity is known by:** [This field is blank]
* **Registered charity number (if any):** 1165532
* **Charity's principal address:**
* Line 1: FLAT 13
* Line 2: 39 TRINITY RISE
* Line 3: LONDON
* **Postcode:** SW2 2QP
**Names of the charity trustees who manage the charity:**
The table lists four trustees. All columns except "Trustee name" are blank for all entries.
1. **Trustee name:** Mr Anthony Yakubu Membu
2. **Trustee name:** Miss Happiness Ngosi Jan-Nnyeruka
3. **Trustee name:** Mrs Olabisi Comfort Ogunjobi
4. **Trustee name:** Mrs Olayinka Adeola Oyesanya
* Rows 5 through 20 in the table are empty.
**Secondary Table (Custodian Trustees):**
* This table is entirely blank.
**Footer:**
* **Left:** TAR
* **Center:** 1
* **Right:** March 2012
### Key Observations
1. **Selective Completion:** The form is only partially filled. Core identification data (name, number, address, reporting period) and the list of trustee names are provided. All other fields, including trustee roles, appointment details, and dates of service, are left blank.
2. **Trustee Count:** Four individuals are listed as managing trustees.
3. **Document Version:** The form itself is a standard template from March 2012, used for this 2016-2017 report.
4. **Address Format:** The principal address is a residential flat in London.
### Interpretation
This document serves as the official administrative record for the charity "TENDER GRACE CHRISTIAN CENTRE" for the 2016-2017 fiscal year. It establishes the charity's legal identity (name, registration number) and provides a point of contact (address). The list of trustees defines the individuals legally responsible for the charity's management and governance during that period.
The significant amount of blank information suggests one of two scenarios: either the charity's reporting was incomplete, or this is a standard practice where only the mandatory name field is populated, with other details (like specific offices or appointment bodies) considered non-essential for this summary page. The absence of data in the "custodian trustees" table indicates the charity does not use that legal structure. The document is a factual snapshot of governance structure at a specific point in time, lacking narrative or financial data which would be in subsequent sections of the full report.
</details>
| Question: | What is the value for the address postcode? |
| --- | --- |
| GT: | (SW2 2QP) |
| MLP: | (SW22 0PQ) ✗ |
| Ovis: | (SW2 2OP) ✗ |
| Align: | (SW2 2QP) ✓ |
(b) Positive Example #2
<details>
<summary>figures/case_3.jpg Details</summary>

### Visual Description
## Document Cover: Annual Report and Financial Statements 2015
### Overview
This image is the front cover of a printed annual report. It features a full-bleed photograph of a smiling young girl in a classroom setting, overlaid with the report's title and the organization's branding. The design uses a strong purple and white color scheme.
### Components/Axes
The cover is composed of three primary textual components and one photographic background.
1. **Main Title Text (Bottom Center):**
* **Primary Line:** "ANNUAL REPORT" in large, bold, white, sans-serif capital letters.
* **Secondary Line:** "AND FINANCIAL STATEMENTS 2015" in smaller, white, sans-serif capital letters, positioned directly below the primary line.
2. **Branding Banner (Left Edge):**
* A vertical purple banner runs along the left edge of the cover.
* **Text:** "human appeal" is written vertically in white, lowercase, sans-serif letters, reading from bottom to top.
* **Logo:** A circular, intricate white logo is positioned at the bottom of this banner.
3. **Registration Information (Top Left Edge):**
* Very small, vertical white text runs along the top left edge of the photograph.
* **Transcription:** "Charity No. 1154288 | Company Reg No. 8553893 | Scottish Reg No. SC046481"
### Detailed Analysis
* **Photograph:** The central image is a close-up, eye-level shot of a young Black girl with braided hair. She is smiling broadly, showing her teeth, and has her right hand raised to her head. She is wearing a white collared shirt under a purple pinafore dress. The background is softly blurred but shows other children in similar uniforms, suggesting a classroom environment. The lighting is natural and warm.
* **Spatial Grounding:**
* The main title text is centered horizontally and positioned in the lower third of the cover, overlaid on the photograph.
* The purple branding banner is anchored to the left edge, spanning from the bottom to about one-third up the cover's height.
* The registration text is aligned to the very top-left corner, running vertically down the margin.
* **Color Scheme:** The dominant colors are the purple of the banner and the girl's dress, the white of the text and shirts, and the natural skin tones and muted background colors of the photograph.
### Key Observations
* The photograph is emotionally engaging, focusing on a single, joyful subject to create a human connection.
* The typography is clean and modern, using a sans-serif font for clarity and impact.
* The purple color is used consistently as a brand identifier in both the banner and the subject's clothing.
* All textual information is presented in English.
### Interpretation
This is the cover for the 2015 Annual Report and Financial Statements of the charity "Human Appeal." The design choices serve a clear purpose:
* **Human Connection:** The powerful photograph of a happy child in a learning environment directly communicates the charity's focus on humanitarian aid, likely in education or child welfare. It puts a human face to the organization's work.
* **Professional Transparency:** The prominent title clearly states the document's formal purpose (annual report and financials), while the inclusion of charity and company registration numbers establishes legal legitimacy and transparency.
* **Brand Identity:** The consistent use of purple and the clean layout reinforce a professional and recognizable brand image. The vertical banner is a distinctive design element.
The cover effectively balances emotional appeal (the photograph) with formal, factual information (the title and registration details), setting the tone for a document that will detail both the impact and the financial stewardship of the charity's work in 2015.
</details>
| Question: | What is the value for the charity name? |
| --- | --- |
| GT: | (Human Appeal) |
| MLP: | (Humanitarian Agenda) ✗ |
| Ovis: | (Human Appeal) ✓ |
| Align: | (Human Rightsappeal) ✗ |
(c) Negative Example #1
<details>
<summary>figures/case_4.jpg Details</summary>

### Visual Description
## Document Page: Annual Report Introduction
### Overview
This image is a scanned page from the "Annual Report 2017" of Bishop's Stortford Baptist Church. It is the first page (page 1) of the document, serving as an introduction that outlines the church's vision, mission, purposes, and a high-level overview of its activities and organizational structure for the year 2017. The document is text-based with a formal, report-style layout.
### Components/Axes
* **Header/Logo (Top Right):** A logo featuring the stylized letters "BSBC" with a cross integrated into the design. Below the letters, in smaller text, is the full name: "bishops stortford baptist church".
* **Main Title (Centered):** "Bishop's Stortford Baptist Church" in a large, bold font.
* **Subtitle/Mission Statement (Centered, below main title):** "This Church aims to be a community of believers living according to the Bible's teaching where people become fully devoted followers of Jesus Christ".
* **Report Title (Centered):** "Annual Report 2017" in a large, bold font.
* **Introductory Sentence:** "We are pleased to present a report of the life and work of the Church during 2017."
* **Section 1: "Aim and Purposes" (Centered Heading):** A paragraph describing the church's vision and mission.
* **Section 2: "Objectives and Overview" (Centered Heading):** A paragraph introducing the church's commitment, followed by a bulleted list of activities, and concluding with a paragraph on ministry areas and leadership structure.
* **Footer (Bottom Center):** The page number "1".
### Detailed Analysis / Content Details
**Full Text Transcription:**
**(Logo)**
BSBC
bishops stortford baptist church
**Bishop's Stortford Baptist Church**
This Church aims to be a community of believers living according to the Bible's teaching where people become fully devoted followers of Jesus Christ
**Annual Report 2017**
We are pleased to present a report of the life and work of the Church during 2017.
**Aim and Purposes**
The vision of Bishop's Stortford Baptist Church is to be a community of believers living according to the Bible's teaching so that Christ's redemptive purposes can be accomplished in the world. The Church's mission is to enable people to become fully devoted followers of Jesus Christ by the power of the Holy Spirit. To that end, we understand the purposes of the Church to be worship, discipleship, fellowship, outreach and service. The fellowship of the Church seeks to go deeper with God, become closer to each other and wider in our influence.
**Objectives and Overview**
The Church is committed to carrying out its activities in line with the Baptist tradition and to encourage and enable people in the local area to attend and take part in its services and meetings. Activities include:
* Services each week for worship, prayer, Bible study, preaching and teaching, including additional services for those in the local area whose first language is not English
* Baptisms on request
* Meetings for young people
* Small group meetings (LIFE Groups) for all ages
* Pastoral care
* Discipleship for Christian service
* Evangelism and mission
* Support of Christian and social charitable action within the local area
* Support for mission overseas
* Involvement with the local and national Baptist associations
The ministry areas of the Church are classified into Worship, Fellowship, Reaching Out, Discipleship and Social Action, supplemented by a Resources Team (all led by the Deacons) and with spiritual oversight and pastoral care administered by the Elders.
**1**
### Key Observations
* **Document Structure:** The page follows a clear hierarchical structure: Logo -> Church Name & Mission -> Report Title -> Introductory Statement -> Detailed Sections (Aim/Purposes, Objectives/Overview).
* **Core Themes:** The text repeatedly emphasizes key themes: being a "community of believers," living by "the Bible's teaching," and becoming "fully devoted followers of Jesus Christ."
* **Five-Fold Purpose:** The church explicitly defines its purposes as "worship, discipleship, fellowship, outreach and service."
* **Activity Scope:** The bulleted list shows a wide range of activities, from weekly services and baptisms to pastoral care, youth meetings, small groups (LIFE Groups), local charitable support, and overseas missions.
* **Organizational Structure:** The final paragraph outlines a clear ministry structure with five classified areas (Worship, Fellowship, Reaching Out, Discipleship, Social Action) led by Deacons, with spiritual oversight from Elders.
### Interpretation
This document serves as a foundational statement of identity and intent for Bishop's Stortford Baptist Church for the year 2017. It is not a report of numerical data or specific outcomes but rather a declaration of the church's *raison d'être* and its planned scope of work.
* **What it demonstrates:** The report establishes the theological and operational framework within which the church functions. It connects high-level vision ("community of believers") to practical, on-the-ground activities (weekly services, LIFE Groups, charitable support). The mention of services for non-English first language speakers indicates an awareness of and response to local demographic diversity.
* **Relationship between elements:** The "Aim and Purposes" section provides the theological "why," while the "Objectives and Overview" section details the practical "how." The organizational structure at the end explains the "who" responsible for executing these objectives.
* **Notable aspects:** The language is inclusive and community-focused ("become closer to each other"). The structure is typical of a formal annual report, aiming to provide transparency and accountability to its members and possibly wider stakeholders. The absence of specific metrics or financial data on this page suggests they would be contained in subsequent sections of the full report. The page sets the stage for more detailed accounts of the church's life and work in 2017.
</details>
| Question: | What is the value for the post town address? |
| --- | --- |
| GT: | (Bishop's Stortford) |
| MLP: | (Stortford) ✗ |
| Ovis: | (Bishop's Stortford) ✓ |
| Align: | (Stortford) ✗ |
(d) Negative Example #2
Figure 7: Case Study for Connector Comparison on the KLC dataset [Stanisławek et al., 2021]. We show four qualitative examples (including two correct and two incorrect examples) comparing Llama-3.2-3B-Align to the same architecture with different connectors, Llama-3.2-3B-MLP and Llama-3.2-3B-Ovis. "GT" denotes the ground truth.
<details>
<summary>figures/case_5.jpg Details</summary>

### Visual Description
## Document Schedule: Workshop Agenda for June 29-30
### Overview
This image is a scanned page (page -3-) from a typed document outlining the schedule for a two-day workshop or training program. The schedule details sessions for the afternoon of Thursday, June 29, and the full day of Friday, June 30. The content is in English. The document appears to be part of a public health or epidemiology training, featuring sessions on interviewing techniques, research design, epidemiology statistics, and questionnaire construction.
### Components/Axes (Document Structure)
The document is structured chronologically, divided by date and time of day (Morning/Afternoon). Key structural components include:
| Component | Description |
| :--- | :--- |
| **Date Headers** | "Thursday, June 29" and "Friday, June 30". |
| **Time Blocks** | "Afternoon" (for June 29) and "Morning" and "Afternoon" (for June 30). |
| **Time Column** | A left-aligned column listing specific times (e.g., 1:00, 1:30, 8:00). |
| **Activity/Topic Column** | The central column describing the session topic or activity. |
| **Leader/Group Column** | For specific sessions, this lists the group identifier (I-V, A, B) and the session leader's name. |
| **Room/Location Column** | A right-aligned column specifying the room number and, in some cases, the building (e.g., "State Health Department", "School of Public Health"). |
| **Source Line** | A footer at the very bottom of the page: "Source: https://www.industrydocuments.ucsf.edu/docs/gpcg0227". |
### Detailed Analysis / Content Details
**Thursday, June 29 - Afternoon**
* **1:00:** "Techniques of Interviewing (Mrs. Fink)" - Room 123.
* **1:30:** "Practice Interviews". This session is broken into five groups:
* **Group I:** Leader: Mrs. Fink. Room: 123 State Health Department.
* **Group II:** Leader: Miss Grass. Room: 802 State Health Department.
* **Group III:** Leader: Miss Peck. Room: 627 State Health Department.
* **Group IV:** Leader: Mr. Price. Room: 510 School of Public Health.
* **Group V:** Leader: Dr. Croley. Room: 522 School of Public Health.
* **2:45:** "Recess".
* **3:15:** "Practice Interviews (continued)". Note: "Same groups, same rooms".
**Friday, June 30 - Morning**
* **8:00:** Two concurrent sessions:
* **Group A:** "Statistical Aspects of Epidemiologic Research (Dr. Gaffey)" - Room 802.
* **Group B:** "Problems in Research Design (Dr. Reynolds)" - Room 123.
* **9:45:** "Recess".
* **10:15:** Two concurrent sessions (groups switch topics/rooms):
* **Group A:** "Problems in Research Design (Dr. Reynolds)" - Room 123.
* **Group B:** "Statistical Aspects of Epidemiologic Research (Dr. Gaffey)" - Room 802.
* **12:00:** "Lunch".
**Friday, June 30 - Afternoon**
* **1:00:** "Construction and Use of Questionnaires". This session is split into two groups:
* **Group A:** (Dr. Fink) - Room 123.
* **Group B:** (Dr. Mellinger) - Room 522 School of Public Health.
* **2:45:** "Recess".
* **3:15:** "Construction and Use of Questionnaire (continued)". Note: "Same groups, same rooms".
### Key Observations
1. **Group Structure:** The workshop uses two distinct grouping systems. Thursday's practice interviews use Roman numeral groups (I-V), while Friday's sessions use alphabetic groups (A & B).
2. **Room Utilization:** Specific rooms are consistently used for certain topics or leaders. Room 123 is used for interviewing and research design. Room 802 is used for statistical aspects. The School of Public Health rooms (510, 522) are used for practice interviews and questionnaire sessions.
3. **Session Rotation:** On Friday morning, Groups A and B swap both their topics and their physical rooms between the 8:00 and 10:15 sessions.
4. **Leader Roles:** Mrs. Fink leads both an interviewing techniques lecture and a practice group. Dr. Gaffey and Dr. Reynolds each lead one topic but teach it to both groups in separate sessions.
5. **Temporal Pattern:** Each major instructional block (Practice Interviews, Questionnaires) is split by a recess and then continued, suggesting a workshop format with breaks for reflection or informal discussion.
### Interpretation
This schedule outlines a structured, intensive training workshop likely aimed at public health professionals, researchers, or graduate students. The curriculum progresses from practical skills (interviewing) to core methodological competencies (research design, epidemiology statistics) and finally to tool development (questionnaires). The use of small, leader-facilitated practice groups indicates an emphasis on interactive, hands-on learning. The rotation of groups through different topics and rooms on Friday morning ensures all participants receive the same core instruction despite the concurrent session format. The document's source (UCSF Industry Documents Library) suggests it may be part of a historical archive related to public health research or industry practices. The clear, organized layout reflects a formal educational or professional development setting.
</details>
| Question: | What time does the afternoon session begin on June 29? |
| --- | --- |
| GT: | (1:00) |
| MLP: | (2:45) ✗ |
| Ovis: | (3:30) ✗ |
| Align: | (1:00) ✓ |
(a) Positive Example #1
<details>
<summary>figures/case_6.jpg Details</summary>

### Visual Description
## Technical Document: National Nutrition Survey Data - Massachusetts
### Overview
The image is a scanned page from a technical report titled "National Nutrition Survey." It presents statistical data on hemoglobin and hematocrit levels, along with general socio-economic indicators, for the surveyed population in Massachusetts. The data is categorized by age group and sex (Males, Females) and is based on ICNND (Interdepartmental Committee on Nutrition for National Defense) guidelines for defining "unsatisfactory" levels. The document is in English.
### Components/Axes
The document is structured into three distinct sections, each with a title and corresponding data table(s).
1. **Section 1: Hemoglobin data - Massachusetts**
* **Header Text:** "8% of the surveyed population had unsatisfactory hemoglobin levels (ICNND guidelines)."
* **Table Structure:** A single table split into two main columns: "Males" and "Females." Each of these is subdivided into columns for "Age," "Def." (Deficient), "Low," and "Total Unsatisfactory." The "Total Unsatisfactory" is the sum of "Def." and "Low."
* **Age Groups (Rows):** 0-5 yr, 6-12, 13-16, 17-59, 60+.
2. **Section 2: Hematocrit data - Massachusetts**
* **Header Text:** "9.2% of the surveyed population had unsatisfactory hematocrit levels (ICNND guidelines)."
* **Table Structure:** Identical to the Hemoglobin table, with columns for "Males" (Age, Def., Low, Total Unsatisfactory) and "Females" (Def., Low, Total Unsatisfactory).
* **Age Groups (Rows):** 0-5 yr, 6-12, 13-16, 17-59, 60+.
3. **Section 3: General Socio-economic data - Massachusetts**
* **Structure:** A simple two-column list of metrics and their corresponding values.
* **Metrics:** Total number of persons examined, Mean family size, Mean income: $ per annum, Mean poverty index ratio, % of families below poverty index ratio of 1.00.
4. **Footer:** A source URL is visible at the very bottom: `Source: https://www.industrydocuments.ucsf.edu/docs/tnbf0227`
### Detailed Analysis
#### **Hemoglobin Data Table (Percentage of Population)**
| Age Group | Males - Def. | Males - Low | Males - Total Unsatisfactory | Females - Def. | Females - Low | Females - Total Unsatisfactory |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **0-5 yr** | 4.5 | 8.0 | **12.5** | 1.5 | 7.0 | **8.5** |
| **6-12** | 0.2 | 3.8 | **4.0** | 0.2 | 5.0 | **5.2** |
| **13-16** | 3.6 | 12.7 | **16.3** | 0.0 | 3.5 | **3.5** |
| **17-59** | 1.2 | 10.0 | **11.2** | 1.1 | 6.0 | **7.1** |
| **60+** | 0.7 | 14.3 | **15.0** | 0.5 | 4.7 | **5.2** |
#### **Hematocrit Data Table (Percentage of Population)**
| Age Group | Males - Def. | Males - Low | Males - Total Unsatisfactory | Females - Def. | Females - Low | Females - Total Unsatisfactory |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **0-5 yr** | 4.4 | 4.0 | **8.4** | 0.5 | 1.5 | **2.0** |
| **6-12** | 0.0 | 3.9 | **3.9** | 0.0 | 5.4 | **5.4** |
| **13-16** | 1.2 | 15.0 | **16.2** | 0.0 | 2.9 | **2.9** |
| **17-59** | 0.7 | 10.0 | **10.7** | 0.5 | 7.8 | **8.3** |
| **60+** | 0.0 | 20.0 | **20.0** | 0.5 | 3.8 | **4.3** |
#### **General Socio-economic Data**
* Total number of persons examined: **4,568**
* Mean family size: **5.31**
* Mean income: $ per annum: **6,500**
* Mean poverty index ratio: **2.29**
* % of families below poverty index ratio of 1.00: **19.5**
### Key Observations
1. **Age-Related Trends:** For both hemoglobin and hematocrit, the highest rates of unsatisfactory levels in males are found in the adolescent (13-16) and elderly (60+) groups. In females, the pattern is less consistent, but the 17-59 age group often shows relatively high rates.
2. **Sex-Based Differences:** Males consistently show higher percentages of unsatisfactory hemoglobin levels across all age groups compared to females. For hematocrit, males also show higher rates, particularly in the 13-16 and 60+ age groups.
3. **Deficiency vs. Low Levels:** In most categories, the "Low" category constitutes a much larger portion of the "Total Unsatisfactory" than the "Def." (Deficient) category. The exception is the 0-5 year old male group for hemoglobin, where deficiency (4.5%) is more than half of the total unsatisfactory rate (12.5%).
4. **Socio-economic Context:** The surveyed population has a relatively large mean family size (5.31) and a significant portion (19.5%) of families living below the poverty line, which provides potential context for the nutritional health data presented.
### Interpretation
This document provides a snapshot of nutritional health indicators (hemoglobin and hematocrit) within the context of socio-economic conditions in Massachusetts at the time of the survey. The data suggests that **adolescent males (13-16) and elderly males (60+) were the most vulnerable groups** for both hemoglobin and hematocrit deficiencies. The stark contrast between male and female rates, especially in the 13-16 age group (16.3% vs. 3.5% for hemoglobin), may point to biological factors, dietary differences, or other gender-specific influences.
The socio-economic data is crucial for interpretation. A mean income of $6,500 and a 19.5% poverty rate indicate economic hardship for a substantial segment of the surveyed population. This economic stress likely correlates with the observed nutritional deficiencies, as poverty limits access to nutrient-rich foods. The report's use of ICNND guidelines frames the data within a standardized public health assessment tool, allowing for comparison with other populations or time periods. The primary value of this document is in identifying specific at-risk demographic groups (by age and sex) and linking their nutritional status to broader socio-economic challenges, which would be essential for targeting public health interventions.
</details>
| Question: | What levels does the second table indicate? |
| --- | --- |
| GT: | (hematocrit data - Massachusetts) |
| MLP: | (SATISFACTORY) ✗ |
| Ovis: | (Females) ✗ |
| Align: | (hematocrit data - Massachusetts) ✓ |
(b) Positive Example #2
<details>
<summary>figures/case_7.jpg Details</summary>

### Visual Description
## Document: Policy on Document Control (Policy No. 8)
### Overview
The image displays a scanned page of a formal corporate policy document titled "Policy on Document Control," designated as Policy No. 8. The page contains the beginning of the policy, specifically Chapter 1, "General Provisions," which includes Articles 1 through 3. The document is marked as confidential and appears to be part of a legal exhibit.
### Components/Structure
The document is structured as a formal policy with the following key components:
1. **Header:** A two-cell table at the top.
* Left Cell: "Policy on Document Control"
* Right Cell: "Policy No. 8"
2. **Chapter Title:** Centered below the header: "Chapter 1 General Provisions"
3. **Article Sections:** The main body consists of three numbered articles, each with a title in parentheses.
4. **Exhibit Stamp:** A rectangular stamp in the top-right corner with the word "EXHIBIT" and a handwritten identifier "R-7". Vertical text along the left edge of the stamp reads "PENGAD 800-631-6989".
5. **Footer:**
* Left: "Confidential - Subject to Protective Order"
* Right: "TAKJ-TPC-00000252"
* Bottom Center (Source Line): "Source: https://www.industrydocuments.ucsf.edu/docs/jpjf0226" with a partially obscured Bates number "P2379-00001" overlaid.
### Content Details (Transcription)
**Article 1 (Purpose)**
This Policy shall be intended to stipulate basic matters on creation, storage, and disposal, etc. of documents (including electromagnetic records) handled by the Company and exceptional treatments, etc. in order to respond to lawsuits and other legal/administrative proceedings (hereinafter referred to as "disputes, etc."), so that document information can be properly and effectively managed and utilized.
**Article 2 (General Provisions)**
(1) Documents of the Company shall be handled in accordance with this Policy, in addition to those stipulated in the "Policy on Handling Contract Documents, etc. and Corporate Seals (Policy No. 6)", the "Policy on Confidential Information Management (Policy No. 18)", and the "Policy on IT Security (Policy No. 67)".
(2) Each core organization shall set forth standards for creation, storage, disposal, and other detailed treatment of documents to be handled at the core organization (hereinafter referred to as "Function Standards"), in accordance with the provisions stipulated herein.
**Article 3 (Definitions)**
Definitions of the terms used in this Policy shall be as set forth in the following items.
1. "Documents" shall mean approval documents, notification documents, reports, conference materials, correspondences, fax/telex transmissions, bills, various records, drawings, microfilms, photographs, video/audio media, electronic mail, and any other documents and electromagnetic records handled in connection with Company business (including those created by external parties, but excluding publications such as newspapers, magazines, and books, etc.).
2. "Disks" shall mean electromagnetic recording media, including floppy disks, CD-Rs, and DVD-Rs, etc.
3. "Storage" shall mean managing documents in any of the following methods and places for specified periods of time. Documents stored shall be referred to as "Stored Documents."
1) By the originals or write-protected disks, in lockable cabinets or desks, or designated warehouses
2) By electromagnetic means, in access-controlled servers
### Key Observations
* **Document Control Framework:** The policy establishes a foundational framework for managing all company documents, explicitly linking document handling to legal preparedness ("disputes, etc.").
* **Inter-Policy References:** It integrates with other existing corporate policies (Nos. 6, 18, and 67), indicating a structured governance system.
* **Broad Definition of "Documents":** The definition is comprehensive, covering physical and digital media (including email, audio/video), and includes documents from external parties. It explicitly excludes publicly available publications.
* **Definition of "Storage":** Storage is defined by method and location, encompassing both physical security (lockable cabinets) and digital security (access-controlled servers).
* **Legal Context:** The "Confidential - Subject to Protective Order" footer and the "EXHIBIT" stamp strongly suggest this document was produced as evidence in litigation or a regulatory proceeding.
### Interpretation
This document is the introductory chapter of a corporate records management policy. Its primary purpose is to establish authority and definitions for controlling information assets. The emphasis on responding to "lawsuits and other legal/administrative proceedings" reveals that a key driver for this policy is **legal risk mitigation and discovery compliance**. By standardizing how documents are created, stored, and disposed of, the company aims to ensure it can efficiently locate and produce relevant information when legally required, while also protecting confidential information.
The policy creates a hierarchy: this Policy No. 8 sets the general rules, while "Function Standards" are delegated to individual core organizations for detailed implementation. The very broad definition of "Documents" ensures the policy's applicability across all modern business communications and records. The inclusion of "electromagnetic records" and "access-controlled servers" shows an acknowledgment of digital information as a critical component of corporate records. The document's appearance as a legal exhibit underscores the real-world importance of such policies in litigation contexts, where the existence and adherence to a document control policy can be scrutinized.
</details>
| Question: | What type of policy is described in this document? |
| --- | --- |
| GT: | (Policy on Document Control) |
| MLP: | (Policy on Document Control) ✓ |
| Ovis: | (General Provisions) ✗ |
| Align: | (Document Control) ✗ |
(c) Negative Example #1
<details>
<summary>figures/case_8.jpg Details</summary>

### Visual Description
## Technical Document: Scientific Report with Data Table
### Overview
This image is a scanned page of a typed scientific report dated June 18, 1975. The document details an animal study investigating the effect of a dietary glycan on blood cholesterol and triglyceride levels in rats. It contains a descriptive paragraph outlining the experimental methodology and a data table titled "Blood Analysis" presenting the results. There are handwritten annotations on the table.
### Components/Axes
* **Document Header:** "June 18, 1975" (top right).
* **Title:** "The Ability of Dietary Glycan to Lower Elevated Blood Cholesterol and Triglycerides in Rats."
* **Methodology Paragraph:** Describes the experimental groups, diets, and blood sampling procedure.
* **Data Table:**
* **Title:** "Blood Analysis" (underlined).
* **Column Headers:** "Group", "#1", "#2a" (with a handwritten 'a'), "#3b" (with a handwritten 'b').
* **Row Labels:**
* "Cholesterol mg % 0 Wk."
* "1"
* "2"
* "3"
* "4"
* "Triglyceride mg % 4 Wk."
* **Source Watermark:** "Source: https://www.industrydocuments.ucsf.edu/docs/hnhd0227" (bottom margin).
### Detailed Analysis
**1. Transcription of Methodology Paragraph:**
"Ten male rats were fed the control diet (basal + cholesterol and cholic acid) for four weeks and divided into two groups. One group (2A) continued to receive the control diet. The second group (2B) received a test diet of the same composition as the control diet except that 15 parts of Glycan replaced 15 parts of sucrose. Another group (#1) of ten male rats were fed the basal diet for four weeks and then continued on the basal diet. At weekly intervals, whole blood was drawn from the caudal vein for the determination of serum cholesterol. At the end of four weeks, whole blood was drawn by heart puncture."
**2. Transcription and Reconstruction of Data Table:**
*Note: Handwritten annotations 'a' and 'b' are appended to the column headers #2 and #3, respectively. A circle is drawn around the value "293" in the Triglyceride row.*
| Group | #1 | #2a | #3b |
| :--- | :--- | :--- | :--- |
| **Cholesterol mg % 0 Wk.** | 96 | 157 | 157 |
| **1** | 96 | 148 | 115 |
| **2** | 96 | 156 | 99 |
| **3** | 96 | 148 | 115 |
| **4** | 103 | 133 | 111 |
| **Triglyceride mg % 4 Wk.** | **293** (circled) | 203 | 169 |
**3. Data Series Trends:**
* **Group #1 (Basal Diet):** Cholesterol levels are stable at 96 mg% for weeks 0-3, with a slight increase to 103 mg% at week 4. Triglycerides at week 4 are 293 mg%.
* **Group #2a (Control Diet - continued):** Cholesterol starts elevated at 157 mg%. It fluctuates slightly (148, 156, 148) before decreasing to 133 mg% at week 4. Triglycerides at week 4 are 203 mg%.
* **Group #3b (Glycan Test Diet):** Cholesterol starts elevated at 157 mg% (same as Group #2a). It shows a marked decrease to 115 mg% at week 1, drops further to 99 mg% at week 2, then rises slightly to 115 and 115 mg% in weeks 3 and 4. Triglycerides at week 4 are 169 mg%.
### Key Observations
1. **Baseline Disparity:** Groups #2a and #3b began the measured period (Week 0) with identical, elevated cholesterol levels (157 mg%), significantly higher than the basal diet group #1 (96 mg%). This confirms the initial "elevated" state mentioned in the title.
2. **Glycan Effect on Cholesterol:** Group #3b (Glycan) showed a rapid and substantial reduction in cholesterol from 157 mg% to 115 mg% within the first week, a 26.8% decrease. The level remained suppressed relative to its starting point and relative to Group #2a for the entire four-week period.
3. **Control Diet Trend:** Group #2a (Control) showed a much slower, modest decline in cholesterol over four weeks (from 157 to 133 mg%, a 15.3% decrease).
4. **Triglyceride Levels:** At the four-week endpoint, triglyceride levels were highest in the basal diet group (#1: 293 mg%), intermediate in the control diet group (#2a: 203 mg%), and lowest in the glycan diet group (#3b: 169 mg%).
5. **Handwritten Annotations:** The 'a' and 'b' on the column headers and the circle around the triglyceride value for Group #1 suggest this document was used for review or analysis, with specific attention drawn to the high triglyceride level in the basal diet group.
### Interpretation
The data suggests that the dietary glycan is effective in lowering elevated blood cholesterol in rats. The most compelling evidence is the rapid divergence between the glycan group (#3b) and the control group (#2a) after week 0, despite both starting from the same elevated baseline. The glycan intervention appears to have a more potent and immediate effect on cholesterol reduction than simply continuing the cholesterol-elevating control diet.
Furthermore, the glycan diet was associated with the lowest triglyceride levels at the study's end, indicating a potential broader lipid-lowering benefit. The high triglycerides in the basal diet group (#1) are an interesting outlier; this group was not on a cholesterol-elevating diet, yet had the highest triglycerides, which may reflect a specific metabolic response to the basal diet composition in this model.
The handwritten circle around the triglyceride value for Group #1 (293 mg%) highlights it as a point of interest, possibly indicating an unexpected or particularly significant result that warranted further investigation. Overall, the document provides clear, quantitative support for the hypothesis stated in its title.
</details>
| Question: | What was the diet fed to the #1 group? |
| --- | --- |
| GT: | (basal diet) |
| MLP: | (basel diet) ✗ |
| Ovis: | (Whole blood) ✗ |
| Align: | (control diet) ✗ |
(d) Negative Example #2
Figure 8: Case Study for Connector Comparison on the DocVQA dataset [Mathew et al., 2021b]. We show four qualitative examples (including two correct and two incorrect examples) comparing Llama-3.2-3B-Align to the same architecture with different connectors, Llama-3.2-3B-MLP and Llama-3.2-3B-Ovis. "GT" denotes the ground truth.
<details>
<summary>figures/case_9.jpg Details</summary>

### Visual Description
## Photograph: Farewell Envelope on Vintage Suitcase
### Overview
The image is a close-up, top-down photograph of a white airmail-style envelope resting on the textured surface of a vintage suitcase. The composition is simple and evocative, centered on the handwritten message on the envelope.
### Components & Spatial Layout
1. **Primary Subject (Envelope):**
* **Position:** Centered horizontally in the upper two-thirds of the frame.
* **Description:** A standard white paper envelope, oriented horizontally with its triangular flap closed and facing upwards.
* **Distinctive Feature:** The envelope's border is decorated with a repeating pattern of diagonal stripes in alternating red and blue, characteristic of classic airmail stationery.
* **Text:** The words "**good bye**" are handwritten in the center of the envelope's back (the side with the flap). The script is in a simple, lowercase, black ink style.
2. **Supporting Object (Suitcase):**
* **Position:** Occupies the majority of the frame, serving as the background surface for the envelope.
* **Description:** The top surface of a hard-shell, vintage suitcase. The material appears to be a textured, off-white or light beige vinyl or leatherette, showing subtle signs of age and wear (slight discoloration and texture variation).
* **Hardware:** At the bottom edge of the frame, the suitcase's handle and latch mechanisms are visible.
* **Handle:** A central, curved handle made of a similar beige material, attached via two tarnished brass or bronze-colored metal brackets.
* **Latches:** Two metal latch mechanisms are partially visible on the far left and right edges, also showing a tarnished, aged finish.
3. **Background Surface:**
* **Position:** Visible only at the very bottom edge of the frame, beneath the suitcase handle.
* **Description:** A fabric surface with a black-and-white pattern. The pattern consists of small black dots and sections of wavy, parallel black lines on a white background, suggesting a tablecloth or bedspread.
### Detailed Analysis
* **Text Transcription:** The only textual content is the handwritten phrase: `good bye`.
* **Color Palette:** The scene is dominated by neutral tones (off-white, beige) with sharp accents from the red and blue stripes on the envelope and the dark black of the handwritten text and background pattern.
* **Texture:** Contrasting textures are prominent: the smooth paper of the envelope, the pebbled texture of the suitcase, the metallic hardness of the hardware, and the soft weave of the patterned fabric.
* **Lighting:** The lighting is soft and diffuse, casting very gentle shadows, indicating an indoor setting with ambient light.
### Key Observations
* The envelope is pristine and unmarked except for the central message, suggesting it is the focal point and carrier of meaning.
* The vintage suitcase implies themes of travel, journey, departure, or storage of memories.
* The airmail styling of the envelope reinforces the theme of distance and communication.
* The composition is deliberately staged, placing the "good bye" message directly atop an object associated with leaving.
### Interpretation
This image is a symbolic still life rather than a technical document containing factual data. It constructs a narrative of farewell.
* **Thematic Elements:** The combination of the airmail envelope (a vessel for messages across distance) and the suitcase (a vessel for belongings during travel) creates a strong metaphor for departure, ending, or a journey's conclusion.
* **Emotional Tone:** The lowercase, handwritten "good bye" feels personal, quiet, and perhaps melancholic or resigned, rather than formal or dramatic. The vintage aesthetic adds a layer of nostalgia.
* **Relationship of Elements:** The suitcase provides the context (travel, transition), while the envelope delivers the specific, poignant message. The patterned fabric at the bottom grounds the scene in a domestic setting, possibly indicating the point of departure.
* **Absence of Data:** The image contains no charts, graphs, quantitative data, or technical diagrams. Its information is entirely symbolic and emotional, conveyed through the careful arrangement of objects and text. The "data" here is the implied story of a goodbye connected to a journey.
</details>
| Question: | What greeting is written on the letter? |
| --- | --- |
| GT: | (good bye) |
| MLP: | (good) ✗ |
| Ovis: | (good buy) ✗ |
| Align: | (good bye) ✓ |
(a) Positive Example #1
<details>
<summary>figures/case_10.jpg Details</summary>

### Visual Description
## Digital Display: SkyScan Atomic Clock
### Overview
The image shows a close-up photograph of a digital "SkyScan Atomic Clock" device. The device has a dark gray or black plastic casing with a large LCD screen divided into two main sections: a primary time display and a secondary information bar below it. The screen is backlit, causing some glare on the left side of the time digits. The device is positioned against a blurred background suggesting a wooden surface and a yellow wall.
### Components/Axes
The display is segmented into distinct informational zones with clear labels:
1. **Header/Branding (Top Center):**
* Text: `SKYSCAN` (with a stylized logo of a radio tower emitting waves to its right).
* Text below logo: `ATOMIC CLOCK`.
2. **Primary Time Display (Center, Large Digits):**
* Main readout: `12:04`.
* A small icon resembling a radio tower with waves is visible between the hour and minute digits, indicating atomic sync status.
* Label to the right of the digits: `TIME`.
3. **Secondary Information Bar (Bottom, Smaller Digits):**
* This bar is divided into labeled sections from left to right:
* **Label:** `DATE` (left-aligned above the first set of digits).
* **Date Readout:** `1/1`.
* **Unlabeled Numeric Field:** `40`. (This could represent a secondary time format like seconds, a counter, or another metric, but is not explicitly labeled).
* **Label:** `INDOOR` (centered above the next readout).
* **Indoor Temperature Readout:** `68.4` with a small `°F` symbol to its upper right.
* **Label:** `OUTDOOR` (right-aligned above the final readout).
* **Outdoor Temperature Readout:** `-14` with a small `°F` symbol to its upper right.
### Detailed Analysis
* **Time:** The clock displays the time as **12:04**. The presence of the atomic sync icon suggests it is (or is designed to be) synchronized with a radio time signal for high accuracy.
* **Date:** The date is displayed as **1/1**, which is most commonly interpreted as January 1st. The year is not displayed.
* **Temperature Data:**
* **Indoor Temperature:** A reading of **68.4°F** is shown, indicating a comfortable room temperature.
* **Outdoor Temperature:** A reading of **-14°F** is shown, indicating an extremely cold external environment.
* **Unlabeled Field:** The number **40** appears between the date and indoor temperature. Without a label, its meaning is ambiguous. It could be seconds (making the full time 12:04:40), a humidity percentage, a channel number, or a counter.
### Key Observations
1. **Extreme Temperature Differential:** There is a massive **82.4°F difference** between the indoor (68.4°F) and outdoor (-14°F) temperatures. This is the most striking data point.
2. **Atomic Synchronization:** The branding and icon emphasize this is an "Atomic Clock," highlighting its primary feature of automatic, precise timekeeping via radio signal.
3. **Display Hierarchy:** The time is given primary visual importance with the largest digits. Secondary information (date, temperatures) is grouped below in a smaller, but clearly segmented, format.
4. **Missing Context:** The year is not displayed, and the purpose of the numeric field "40" is not defined by any on-screen label.
### Interpretation
This image captures a snapshot of environmental and temporal data from a household device. The data suggests the photo was taken on **January 1st** at **12:04 PM** (or AM, though 12:04 PM is more likely given the active display), in a location experiencing **severe winter conditions** (outdoor temp: -14°F). The indoor environment is maintained at a standard, comfortable temperature.
The device itself is designed for at-a-glance information retrieval, prioritizing time. The "Atomic" branding implies a user preference for accuracy and reliability. The unlabeled "40" represents a gap in the information provided by the interface; its meaning would require consulting the device's manual or observing its change over time (e.g., if it counts up to 60, it's likely seconds).
The most significant takeaway is the stark contrast between the controlled, warm indoor environment and the harsh, freezing conditions outside, a common winter scenario in temperate or cold climates. The clock serves as a monitor of both human-made order (precise time, heated shelter) and natural extremes.
</details>
| Question: | What indoor temperature is shown? |
| --- | --- |
| GT: | (68.4) |
| MLP: | (68 F) ✗ |
| Ovis: | (40.0) ✗ |
| Align: | (68.4) ✓ |
(b) Positive Example #2
<details>
<summary>figures/case_11.jpg Details</summary>

### Visual Description
## Promotional Poster: Topnotch Health Club
### Overview
This is a vertical promotional poster or flyer for a health club named "topnotch health club." The design features a central anime-style illustration of a man and a woman against a dynamic blue background with abstract light streaks and particle effects. The poster uses bold, stylized typography to convey its marketing message.
### Components/Axes
The poster is composed of several distinct textual and graphical elements arranged vertically:
1. **Header Text (Top):**
* **Text:** "Proffesional Passionate Personal"
* **Position:** Top third of the poster, left-aligned.
* **Style:** Large, bold, white sans-serif font with a subtle drop shadow. The word "Proffesional" is notably misspelled (should be "Professional").
2. **Central Illustration (Middle):**
* **Subjects:** A man and a woman depicted in a modern anime/cartoon style.
* **Woman:** Positioned slightly in front, pointing to the left with her right hand. She has long brown hair, wears a white tank top, orange wristbands, and a black armband with a white rectangular device (resembling a fitness tracker or MP3 player).
* **Man:** Standing behind the woman, looking forward. He has short black hair and wears a red t-shirt.
* **Background:** A vibrant blue gradient with white and light blue abstract streaks and dots, suggesting energy and motion.
3. **Logo/Club Name (Center-Bottom):**
* **Text:** "we're the club 4U!"
* **Position:** Overlaid on the lower torso of the illustrated figures.
* **Style:** Highly stylized, bubbly white font with a thick blue outline and a drop shadow, creating a 3D effect. The "4U!" is particularly large and prominent.
4. **Footer Information (Bottom):**
* **Primary Text:** "topnotch health club"
* **Position:** Bottom left, below the central logo.
* **Style:** Dark blue or purple sans-serif font.
* **Graphic:** A red cross or plus symbol (+) is integrated to the left of the text.
* **Tagline:** "dare to be different"
* **Position:** Bottom right, below the primary club name.
* **Style:** Red, cursive/script font.
### Detailed Analysis
* **Textual Content & Hierarchy:** The poster's text is structured to first present core values ("Proffesional Passionate Personal"), then the club's identity ("we're the club 4U!"), and finally its formal name and slogan. The most visually dominant text is the central "4U!" logo.
* **Color Palette:** The scheme is dominated by blues (background, text outlines), with accents of white (main text), red (man's shirt, cross symbol, tagline), and orange (woman's wristbands). This creates a high-contrast, energetic feel.
* **Spatial Grounding:** The legend (the three value words) is at the top-left. The main graphical focus (the couple) is centered. The primary call-to-action/identity ("we're the club 4U!") is centered over the figures. The formal branding ("topnotch health club") is anchored at the bottom-left, balanced by the tagline at the bottom-right.
### Key Observations
1. **Spelling Error:** The word "Proffesional" in the header is misspelled. The correct spelling is "Professional."
2. **Target Demographic:** The use of anime-style art, vibrant colors, and the "4U!" phrasing suggests a marketing focus on a younger demographic (teens to young adults).
3. **Implied Services:** The fitness tracker on the woman's arm and the energetic background imply a focus on modern, tech-integrated fitness and an active lifestyle.
4. **Brand Messaging:** The three header words ("Professional, Passionate, Personal") aim to establish trust, enthusiasm, and customized service. The tagline "dare to be different" positions the club as unique or unconventional.
### Interpretation
This poster is a marketing tool designed to attract new members to the "topnotch health club." It employs a youthful, energetic aesthetic to differentiate itself from more traditional or clinical gym branding. The core message is one of a modern, passionate, and personalized fitness experience ("the club 4U!"). The visual elementsâthe dynamic background, active pose of the woman, and integrated fitness techâall reinforce themes of energy, movement, and contemporary lifestyle. The misspelling in the header, while likely an oversight, could undermine the "Professional" claim for observant viewers. Overall, the poster prioritizes creating a vibe of fun, individuality, and modern fitness over detailing specific services or prices.
</details>
| Question: | What type of club is advertised? |
| --- | --- |
| GT: | (health club) |
| MLP: | (topnote health club) ✗ |
| Ovis: | (health club) ✓ |
| Align: | (professional passionate personal) ✗ |
(c) Negative Example #1
<details>
<summary>figures/case_12.jpg Details</summary>

### Visual Description
## Advertisement: Hadiah Plus Loyalty Program Promotion
### Overview
This image is a promotional advertisement for the "Hadiah Plus" loyalty program, highlighting a partnership with the "AirAsia BIG LOYALTY Programme." The ad encourages users to convert "Hadiah Points" into "BIG Points" to redeem for free flights to various Asian destinations. The visual design uses a gold and beige color scheme with a collage of travel destination photos.
### Components/Axes
The image is structured into several distinct regions:
1. **Header Region (Top):**
* **Main Title:** "Hadiah Plus" in a large, gold, cursive script font.
* **Subtitle:** "MORE REWARDS WITH EVERY SWIPE" in a smaller, dark grey, all-caps serif font.
2. **Destination Showcase Region (Center):**
* A horizontal row of five destination names, each above a corresponding photograph.
* **Destination Labels (Left to Right):** "Singapore", "Penang", "Bangkok", "Bali", "Hong Kong".
* **Photographs (Left to Right):**
* Singapore: The Merlion statue at night.
* Penang: A coastal cityscape with modern and colonial buildings.
* Bangkok: A traditional Thai temple (Wat) with golden spires.
* Bali: A multi-tiered Balinese temple (Pura) by water.
* Hong Kong: The skyline featuring the Bank of China Tower.
3. **Partnership Badge (Center-Right):**
* A prominent red circular badge overlaps the Bali and Hong Kong images.
* **Text within badge:** "AirAsia BIG LOYALTY Programme" in white text. "BIG" is the largest and boldest word.
4. **Call-to-Action Region (Lower Center):**
* **Primary Text:** "You can now convert your Hadiah Points to BIG Points to redeem for free flights." in a dark grey serif font.
5. **Product Showcase Region (Bottom):**
* Three credit cards are displayed side-by-side.
* **Left Card:** Gold-colored, appears to be a "Maybank" card (logo partially visible). Card number prefix: `5199 **** ****`.
* **Center Card:** Silver-colored, also appears to be a "Maybank" card. Card number prefix: `5199 **** ****`.
* **Right Card:** Black-colored, appears to be a "Maybank" card. Card number prefix: `5326 5678 9012 3456`. The name "AZMAN HUSIN" is partially visible.
### Detailed Analysis
* **Text Transcription:**
* "Hadiah Plus"
* "MORE REWARDS WITH EVERY SWIPE"
* "Singapore", "Penang", "Bangkok", "Bali", "Hong Kong"
* "AirAsia BIG LOYALTY Programme"
* "You can now convert your Hadiah Points to BIG Points to redeem for free flights."
* Partial credit card numbers and a cardholder name ("AZMAN HUSIN") are visible.
* **Visual Layout & Spatial Grounding:**
* The title and subtitle are centered at the top.
* The five destination images are arranged in a tight, contiguous horizontal grid in the center of the ad.
* The red "AirAsia BIG LOYALTY Programme" badge is positioned in the lower-right quadrant of the destination grid, overlapping the fourth (Bali) and fifth (Hong Kong) images.
* The call-to-action text is centered below the destination grid.
* The three credit cards are centered at the very bottom of the image.
* **Color & Design:**
* The background is a textured beige/gold gradient with a subtle geometric pattern.
* The primary accent color is the bright red of the AirAsia badge, designed to draw the eye.
* The credit cards represent different tiers (Gold, Silver, Black), suggesting different levels of membership or card benefits.
### Key Observations
1. **Targeted Marketing:** The ad explicitly links a financial product (credit cards, likely from Maybank) with travel rewards, targeting consumers interested in leisure travel to popular Asian destinations.
2. **Partnership Emphasis:** The most visually striking element after the title is the red AirAsia badge, highlighting the strategic partnership as the core value proposition.
3. **Aspirational Imagery:** The use of iconic landmarks (Merlion, Thai temples, Hong Kong skyline) serves to evoke aspirational feelings associated with travel and vacation.
4. **Clear Value Proposition:** The text states a direct, simple benefit: convert points from one program (Hadiah) to another (BIG) for a tangible reward (free flights).
### Interpretation
This advertisement is a cross-promotional tool between a bank's loyalty program ("Hadiah Plus," likely associated with Maybank based on card logos) and AirAsia's frequent flyer program. The core message is one of **enhanced utility and flexibility** for existing Hadiah Points holders. By enabling conversion to BIG Points, the program unlocks the specific, high-value redemption option of airline tickets, which may not have been directly available through the Hadiah program alone.
The choice of destinationsâSingapore, Penang, Bangkok, Bali, Hong Kongâindicates a focus on key short-haul leisure and business travel hubs within Southeast Asia and East Asia, aligning perfectly with AirAsia's route network. The inclusion of three different card tiers suggests the promotion is available to a broad base of cardholders, from standard to premium. The ad effectively uses visual shorthand (travel photos, airline logo) and a clear call-to-action to communicate a financial benefit in terms of experiential reward.
</details>
| Question: | What credit card is this? |
| --- | --- |
| GT: | (hadiah plus) |
| MLP: | (hadiah plus) ✓ |
| Ovis: | (american big loyalty program) ✗ |
| Align: | (hadia plus) ✗ |
(d) Negative Example #2
Figure 9: Case Study for Connector Comparison on the TextVQA dataset [Singh et al., 2019]. We show four qualitative examples (including two correct and two incorrect examples) comparing Llama-3.2-3B-Align to the same architecture with different connectors, Llama-3.2-3B-MLP and Llama-3.2-3B-Ovis. "GT" denotes the ground truth.