# AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Document Understanding
Abstract
Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models hinges on having a good connector that maps visual features generated by a vision encoder to a shared embedding space with the LLM while preserving semantic similarity. Existing connectors, such as multilayer perceptrons (MLPs), lack inductive bias to constrain visual features within the linguistic structure of the LLM’s embedding space, making them data-hungry and prone to cross-modal misalignment. In this work, we propose a novel vision-text alignment method, AlignVLM, that maps visual features to a weighted average of LLM text embeddings. Our approach leverages the linguistic priors encoded by the LLM to ensure that visual features are mapped to regions of the space that the LLM can effectively interpret. AlignVLM is particularly effective for document understanding tasks, where visual and textual modalities are highly correlated. Our extensive experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods, with larger gains on document understanding tasks and under low-resource setups. We provide further analysis demonstrating its efficiency and robustness to noise.
1 Introduction
Vision-Language Models (VLMs) have gained significant traction in recent years as a powerful framework for multimodal document understanding tasks that involve interpreting both the visual and textual contents of scanned documents (Kim et al., 2022; Lee et al., 2023; Liu et al., 2023a, 2024; Hu et al., 2024; Wang et al., 2023a; Rodriguez et al., 2024b). Such tasks are common in real-world commercial applications, including invoice parsing (Park et al., 2019), form reading (Jaume et al., 2019), and document question answering (Mathew et al., 2021b). VLM architectures typically consist of three components: (i) a vision encoder to process raw images, (ii) a Large Language Model (LLM) pre-trained on text, and (iii) a connector module that maps the visual features from the vision encoder into the LLM’s semantic space.
A central challenge in this pipeline is to effectively map the continuous feature embeddings of the vision encoder into the latent space of the LLM while preserving the semantic properties of visual concepts. Existing approaches can be broadly categorized into deep fusion and shallow fusion methods. Deep fusion methods, such as NVLM (Dai et al., 2024), Flamingo (Alayrac et al., 2022), CogVLM (Wang et al., 2023b), and Llama 3.2-Vision (Grattafiori et al., 2024), integrate visual and textual features by introducing additional cross-attention and feed-forward layers at each layer of the LLM. While effective at enhancing cross-modal interaction, these methods substantially increase the parameter count of the VLM compared to the base LLM, resulting in high computational overhead and reduced efficiency.
Figure 1: Performance of Different VLM Connectors. The proposed Align connector outperforms other methods across benchmarks using the same training configuration. Radial distance is the proportion of the maximal score, truncated at $0.7$ (black dot).
In contrast, shallow fusion methods project visual features from the vision encoder into the LLM input embedding space using either multilayer perceptrons (MLPs) (Liu et al., 2023b, 2024), convolutional mappings such as HoneyBee (Cha et al., 2024) and H-Reducer (Hu et al., 2024), or attention-based mechanisms such as the Perceiver Resampler (Li et al., 2023b; Laurençon et al., 2024; Alayrac et al., 2022). This approach is more parameter-efficient and computationally lighter than deep fusion methods. However, these connectors lack the inductive bias to ensure that the projected features remain within the region spanned by the LLM’s pretrained text embeddings. Consequently, the projected visual features may fall outside the distribution the LLM was trained on, leading to noisy or misaligned representations. Moreover, these mappings are typically learned from scratch, making them data-inefficient and less effective under low-resource conditions.
Recent methods like Ovis (Lu et al., 2024) attempt to alleviate these issues by introducing separate visual embeddings, indexed from the vision encoder outputs and combined to construct the visual inputs to the LLM. However, this approach significantly increases the parameter count due to the massive embedding matrix, and requires extensive training to learn a new embedding space without guaranteeing alignment with the LLM’s input latent space.
To address these limitations, this paper introduces AlignVLM, a novel framework that sidesteps direct projection of visual features into the LLM embedding space. Instead, our proposed connector, Align, maps visual features into probability distributions over the LLM’s existing pretrained vocabulary embeddings, which are then combined into a weighted representation of the text embeddings. By constraining each visual feature as a convex combination of the LLM text embeddings, our approach leverages the linguistic priors already encoded in the LLM’s text space. This ensures that the resulting visual features lie within the convex hull of the LLM’s embedding space, reducing the risk of noisy or out-of-distribution inputs and improving alignment between modalities. The connector thus enables faster convergence and stronger performance, particularly in low-resource scenarios.
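To make the convex-combination property concrete, the following minimal numpy sketch (ours, purely illustrative; the vocabulary size, embedding dimension, and all variable names are assumptions) shows that softmax weights place the resulting visual token inside the convex hull of the text embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 1000, 64                      # toy vocabulary size and embedding dimension
E_text = rng.normal(size=(V, D))     # stand-in for the LLM's text embedding matrix

logits = rng.normal(size=V)          # logits produced from one visual feature
p = np.exp(logits - logits.max())
p /= p.sum()                         # softmax: p >= 0 and p sums to 1 (up to float error)

f_align = p @ E_text                 # weighted average of the text embeddings

# Convex-hull sanity check: every coordinate of f_align is bounded by the
# per-coordinate extremes of the text embeddings.
assert np.all(f_align >= E_text.min(axis=0)) and np.all(f_align <= E_text.max(axis=0))
```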
Our experimental results show that Align improves performance on various document understanding tasks, outperforming prior connector methods, with especially large gains in low-data regimes. We summarize our main contributions as follows:
- We propose a novel connector, Align, to bridge the representation gap between vision and text modalities.
- We introduce a family of Vision-Language Models, AlignVLM, that achieves state-of-the-art performance on multimodal document understanding tasks by leveraging Align.
- We conduct extensive experiments demonstrating the robustness and effectiveness of Align across different LLM sizes and training data setups.
We release our code and research artifacts at alignvlm.github.io.
2 Related Work
2.1 Vision-Language Models
Over the past few years, Vision-Language Models (VLMs) have achieved remarkable progress, largely due to advances in Large Language Models (LLMs). Initially demonstrating breakthroughs in text understanding and generation (Brown et al., 2020; Raffel et al., 2023; Achiam et al., 2023; Grattafiori et al., 2024; Qwen et al., 2025; Team, 2024), LLMs are now increasingly used to effectively interpret visual inputs (Liu et al., 2023b; Li et al., 2024; Wang et al., 2024; Chen et al., 2024b; Dai et al., 2024; Drouin et al., 2024; Rodriguez et al., 2022). This progress has enabled real-world applications across diverse domains, particularly in multimodal document understanding for tasks like form reading (Svetlichnaya, 2020), document question answering (Mathew et al., 2021b), and chart question answering (Masry et al., 2022). VLMs commonly adopt a three-component architecture: a pretrained vision encoder (Zhai et al., 2023; Radford et al., 2021), an LLM, and a connector module. A key challenge for VLMs is effectively aligning visual features with the LLM’s semantic space to enable accurate and meaningful multimodal interpretation.
2.2 Vision-Language Alignment for Multimodal Models
Existing vision-language alignment approaches can be classified into deep fusion and shallow fusion. Deep fusion methods integrate visual and textual features by modifying the LLM’s architecture, adding cross-attention and feed-forward layers. For example, Flamingo (Alayrac et al., 2022) employs the Perceiver Resampler, which uses fixed latent embeddings to attend to vision features and fuses them into the LLM via gated cross-attention layers. Similarly, NVLM (Dai et al., 2024) adopts gated cross-attention while replacing the Perceiver Resampler with a simpler MLP. CogVLM (Wang et al., 2023b) extends this approach by incorporating new feed-forward (FFN) and QKV layers for the vision modality within every layer of the LLM. While these methods improve cross-modal alignment, they significantly increase parameter counts and computational overhead, making them less efficient.
On the other hand, shallow fusion methods are more computationally efficient, mapping visual features into the LLM’s embedding space without altering its architecture. These methods can be categorized into four main types: (1) MLP-based mappings, such as LLaVA (Liu et al., 2023b) and PaliGemma (Beyer et al., 2024), which use multilayer perceptrons (MLPs) to project visual features but often produce misaligned or noisy features due to a lack of constraints and inductive bias (Rodriguez et al., 2024b); (2) cross-attention mechanisms, such as the Q-Former in BLIP-2 (Li et al., 2023b), which uses a fixed set of latent embeddings to cross-attend to visual features but may still produce noisy or out-of-distribution (OOD) visual features; (3) convolution-based mechanisms, such as HoneyBee (Cha et al., 2024) and H-Reducer (Hu et al., 2024), which leverage convolutional or ResNet (He et al., 2015) layers to preserve spatial locality while reducing dimensionality; and (4) visual embeddings, such as those introduced by Ovis (Lu et al., 2024), which use embeddings indexed by the vision encoder’s outputs to produce the visual inputs. While this regularizes the feature mapping, it adds substantial parameter overhead and creates a new vision embedding space, risking misalignment with the LLM’s text embedding space. Encoder-free VLMs, like Fuyu-8B (Bavishi et al., 2023) and EVE (Diao et al., 2024), eliminate dedicated vision encoders but show degraded performance (Beyer et al., 2024).
In contrast, AlignVLM maps visual features from the vision encoder into probability distributions over the LLM’s text embeddings, using them to compute a convex combination. By leveraging the linguistic priors encoded in the LLM’s vocabulary, AlignVLM ensures that visual features remain within the convex hull of the text embeddings. This design mitigates noisy or out-of-distribution projections and achieves stronger multimodal alignment, particularly in tasks that require a joint representation of both modalities, such as multimodal document understanding, and in low-resource settings.
Figure 2: AlignVLM Model Architecture. The vision encoder extracts image features, which are processed to produce probabilities over the LLM embeddings. A weighted average combines these probabilities with embeddings to generate vision input vectors. Text inputs are tokenized, and the corresponding embeddings are selected from the embedding matrix, which is then used as input to the LLM. We display the vision layers in blue, and the text layers in purple.
3 Methodology
3.1 Model Architecture
The overall model architecture, shown in Figure 2, consists of three main components:
(1) Vision Encoder.
To handle high-resolution images of different aspect ratios, we divide each input image into multiple tiles according to one of the predefined aspect ratios (e.g., $1{:}1,\,1{:}2,\,\ldots,\,9{:}1$) chosen via a coverage ratio (Lu et al., 2024; Chen et al., 2024a). Due to limited computational resources, we set the maximum number of tiles to 9. Each tile is further partitioned into $14\times 14$ patches, projected into vectors, and processed by a SigLIP-400M vision encoder (Zhai et al., 2023) to extract contextual visual features.
Each tile $t\in\{1,\cdots,T\}$ is divided into $N_{t}$ patches
$$
\mathbf{P}_{t}=\{\mathbf{p}_{t,1},\cdots,\mathbf{p}_{t,N_{t}}\},
$$
where $\mathbf{p}_{t,i}$ is the $i$ -th patch of tile $t$ . The vision encoder maps these patches to a set of visual feature vectors
$$
\mathbf{F}_{t}=\mathrm{VisionEncoder}(\mathbf{P}_{t}),\quad\mathbf{F}_{t}=\{\mathbf{f}_{t,1},\cdots,\mathbf{f}_{t,N_{t}}\},\quad\mathbf{f}_{t,i}\in\mathbb{R}^{d}.
$$
Finally, we concatenate the feature sets across all tiles into a single output
$$
\mathbf{F}=\mathrm{concat}\Bigl(\mathbf{F}_{1},\mathbf{F}_{2},\cdots,\mathbf{F}_{T}\Bigr).
$$
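The tiling-and-concatenation step can be sketched as follows (a schematic, ours; it assumes the vision encoder is an opaque callable returning one feature vector per patch, and all function and variable names are illustrative):

```python
import torch

def encode_tiled_image(tiles: torch.Tensor, vision_encoder) -> torch.Tensor:
    """Encode the T tiles of one document image and concatenate their features.

    tiles: (T, 3, H, W) tensor of crops produced by the aspect-ratio tiler.
    Returns F: (T * N_t, d), one d-dimensional vector per 14x14 patch.
    """
    per_tile = []
    for t in range(tiles.shape[0]):
        f_t = vision_encoder(tiles[t].unsqueeze(0))  # (1, N_t, d) patch features
        per_tile.append(f_t.squeeze(0))              # (N_t, d)
    return torch.cat(per_tile, dim=0)                # F = concat(F_1, ..., F_T)
```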
(2) Align Module.
This module aligns the visual features with the LLM. A linear layer $\mathbf{W}_{1}\in\mathbb{R}^{D\times d}$ first projects the visual features $\mathbf{F}\in\mathbb{R}^{T\cdot N_{t}\times d}$ into the LLM’s token embedding space: one $\mathbb{R}^{D}$ vector per token. A second linear layer $\mathbf{W}_{2}\in\mathbb{R}^{V\times D}$ (initialized from the LLM’s language-model head), followed by layer normalization and a softmax, produces a probability distribution $\mathbf{P}_{\text{vocab}}$ over the LLM’s vocabulary of $V$ tokens
$$
\mathbf{P}_{\text{vocab}}=\operatorname{softmax}(\operatorname{LayerNorm}(\mathbf{W}_{2}\operatorname{LayerNorm}(\mathbf{W}_{1}\mathbf{F}))) \tag{1}
$$
We then use the LLM text embeddings $\mathbf{E}_{\text{text}}∈\mathbb{R}^{V× D}$ to compute a weighted sum
$$
\mathbf{F}_{\text{align}}^{\prime}=\mathbf{P}_{\text{vocab}}^{\top}\mathbf{E}_{\text{text}}. \tag{2}
$$
Finally, we concatenate $\mathbf{F}_{\text{align}}^{\prime}$ with the tokenized text embeddings to form the LLM input
$$
\mathbf{H}_{\text{input}}=\mathrm{concat}\bigl(\mathbf{F}_{\text{align}}^{\prime},\mathbf{E}_{\text{text}}(\mathbf{x})\bigr),
$$
where $\mathbf{E}_{\text{text}}(\mathbf{x})$ is obtained by tokenizing the input text $\mathbf{x}=(x_{1},\cdots,x_{M})$ and selecting the corresponding embeddings from $\mathbf{E}_{\text{text}}$ such that
$$
\mathbf{E}_{\text{text}}(\mathbf{x})=\bigl[\mathbf{E}_{\text{text}}(x_{1}),\cdots,\mathbf{E}_{\text{text}}(x_{M})\bigr]. \tag{3}
$$
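The Align module of Eqs. (1)–(3) can be written compactly in PyTorch. The sketch below is ours, not the released implementation; it assumes `lm_head_weight` and `text_embeddings` are the LLM’s output-projection and input-embedding matrices, both in $\mathbb{R}^{V\times D}$:

```python
import torch
import torch.nn as nn

class AlignConnector(nn.Module):
    def __init__(self, d_vision: int, d_llm: int,
                 lm_head_weight: torch.Tensor, text_embeddings: torch.Tensor):
        super().__init__()
        vocab_size = lm_head_weight.shape[0]
        self.w1 = nn.Linear(d_vision, d_llm)                # W1 in R^{D x d}
        self.norm1 = nn.LayerNorm(d_llm)
        self.w2 = nn.Linear(d_llm, vocab_size, bias=False)  # W2 in R^{V x D}
        self.w2.weight.data.copy_(lm_head_weight)           # initialized from the LM head
        self.norm2 = nn.LayerNorm(vocab_size)
        self.register_buffer("e_text", text_embeddings)     # E_text in R^{V x D}

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        """f: (num_visual_tokens, d_vision) -> aligned features (.., d_llm)."""
        logits = self.norm2(self.w2(self.norm1(self.w1(f))))
        p_vocab = torch.softmax(logits, dim=-1)             # Eq. (1)
        return p_vocab @ self.e_text                        # Eq. (2): convex combination
```

A usage sketch, with illustrative dimensions: `AlignConnector(d_vision=1152, d_llm=3072, lm_head_weight=llm.lm_head.weight.data, text_embeddings=llm.get_input_embeddings().weight.data)`; the output is then concatenated with the tokenized text embeddings to form $\mathbf{H}_{\text{input}}$ as above.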
(3) Large Language Model.
We feed the concatenated vision and text vectors, $\mathbf{H}_{\text{input}}$, into the LLM, which then generates output text auto-regressively. To demonstrate the effectiveness of our alignment technique, we experiment with the Llama 3 model family (Grattafiori et al., 2024). These models offer state-of-the-art performance and permissive licenses, making them suitable for commercial applications. In particular, we utilize Llama 3.2-1B, Llama 3.2-3B, and Llama 3.1-8B.
3.2 Motivation and relation with existing methods
By construction, each $\mathbb{R}^{D}$ representation in $\mathbf{F}_{\text{align}}^{\prime}$ is constrained to the convex hull of the points in $\mathbf{E}_{\text{text}}$, thus concentrating the visual features in the part of the latent space that the LLM can effectively interpret. Moreover, we argue that our initialization of $\mathbf{W}_{2}$ to the language-model head is an inductive bias toward recycling some of the semantics of these text tokens into visual tokens. This contrasts with past methods that adapt the vision encoder outputs $\mathbf{F}\in\mathbb{R}^{T\cdot N_{t}\times d}$ to an $\mathbf{F}^{\prime}\in\mathbb{R}^{T\cdot N_{t}\times D}$ to be fed to the LLM. Here, we consider two examples in more detail, highlighting these contrasts.
(1) MLP Connector Liu et al. (2023b) applies a linear projection with parameters $\mathbf{W}_{\text{MLP}}∈\mathbb{R}^{D× d}$ and $\mathbf{b}_{\text{MLP}}∈\mathbb{R}^{D}$ , followed by an activation function $\sigma$ (e.g., ReLU)
$$
\mathbf{F}_{\text{MLP}}^{\prime}=\sigma(\mathbf{W}_{\text{MLP}}\mathbf{F}+\mathbf{b}_{\text{MLP}}).
$$
These parameters are all learned from scratch, without any bias aligning them to text embeddings.
(2) Visual Embedding Table Lu et al. (2024) introduces an entire new set of visual embeddings $\mathbf{E}_{\text{VET}}∈\mathbb{R}^{K× D}$ which, together with the weights $\mathbf{W}_{\text{VET}}∈\mathbb{R}^{K× d}$ , specifies
$$
\mathbf{F}_{\text{VET}}^{\prime}=\operatorname{softmax}(\mathbf{W}_{\text{VET}}\mathbf{F})^{\top}\mathbf{E}_{\text{VET}}.
$$
When $D<d$ , our $\mathbf{W}_{2}\mathbf{W}_{1}$ amounts to a low-rank version of $\mathbf{W}_{\text{VET}}$ . There is thus much more to learn to obtain $\mathbf{F}_{\text{VET}}^{\prime}$ , and there is again no explicit pressure to align it with the text embeddings.
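For contrast, here are hedged sketches of the two baselines just discussed (illustrative code, ours; class and parameter names are assumptions, not the original implementations):

```python
import torch
import torch.nn as nn

class MLPConnector(nn.Module):
    """Direct projection learned from scratch: F'_MLP = sigma(W F + b)."""
    def __init__(self, d_vision: int, d_llm: int):
        super().__init__()
        self.proj = nn.Linear(d_vision, d_llm)  # W_MLP, b_MLP

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.proj(f))  # nothing ties the output to E_text

class VisualEmbeddingTable(nn.Module):
    """Ovis-style connector: convex combination over K *new* visual embeddings."""
    def __init__(self, d_vision: int, d_llm: int, k: int):
        super().__init__()
        self.w_vet = nn.Linear(d_vision, k, bias=False)           # W_VET in R^{K x d}
        self.e_vet = nn.Parameter(torch.randn(k, d_llm) * 0.02)   # E_VET, learned from scratch

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.w_vet(f), dim=-1) @ self.e_vet
```

Both map $\mathbf{F}$ into $\mathbb{R}^{D}$, but neither receives any explicit pressure to stay aligned with the LLM’s text embedding space.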
3.3 Training Datasets & Stages
We train our model in three stages:
Stage 1.
This stage focuses on training the Align Module to map visual features to the LLM’s text embeddings effectively. We use the CC-12M dataset Changpinyo et al. (2021), a large-scale web dataset commonly used for VLM pretraining Liu et al. (2023b), which contains 12M image-text pairs. However, due to broken or unavailable links, we retrieved 8.1M pairs. This dataset facilitates the alignment of visual features with the text embedding space of the LLM. During this stage, we train the full model, as this approach improves performance and stabilizes the Align Module training.
Stage 2.
The goal is to enhance the model’s document understanding capabilities, such as OCR, document structure comprehension, in-depth reasoning, and instruction-following. We leverage the BigDocs-7.5M dataset Rodriguez et al. (2024a), a curated collection of license-permissive datasets for multimodal document understanding. This dataset aligns with the Accountability, Responsibility, and Transparency (ART) principles Bommasani et al. (2023); Vogus and Llansó (2021), ensuring compliance for commercial applications. As in Stage 1, we train the full model during this stage.
Stage 3.
To enhance the model’s instruction-tuning capabilities, particularly for downstream tasks like question answering, we further train it on the DocDownstream Rodriguez et al. (2024a); Hu et al. (2024) instruction tuning dataset. In this stage, the vision encoder is frozen, focusing training exclusively on the LLM and Align module.
4 Experimental Setup
Table 1: Main Results on General Document Benchmarks. We compare AlignVLM (ours) with state-of-the-art (SOTA) open and closed-source instructed models, and with base models that we trained using the process described in Section 3.3. AlignVLM models outperform all Base VLM models trained in the same data regime. Our models also perform competitively across document benchmarks even compared with SOTA models, in which the data regime is more targeted and optimized. Color coding for comparison: closed-source models, open-source models below 7B parameters, open-source models between 7-12B parameters.
| Model | DocVQA VAL | InfoVQA VAL | DeepForm TEST | KLC TEST | WTQ TEST | TabFact TEST | ChartQA TEST | TextVQA VAL | TableVQA TEST | Avg. Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Closed-Source VLMs | | | | | | | | | | |
| (Opaque Training Data) | | | | | | | | | | |
| Claude-3.5 Sonnet | 88.48 | 59.05 | 31.41 | 24.82 | 47.13 | 53.48 | 51.84 | 71.42 | 81.27 | 56.54 |
| GeminiPro-1.5 | 91.23 | 73.94 | 32.16 | 24.07 | 50.29 | 71.22 | 34.68 | 68.16 | 80.43 | 58.46 |
| GPT-4o 20240806 | 92.80 | 66.37 | 38.39 | 29.92 | 46.63 | 81.10 | 85.70 | 70.46 | 72.87 | 64.91 |
| Open-Source Instruct VLMs | | | | | | | | | | |
| (Semi-Opaque Training Data) | | | | | | | | | | |
| Janus-1.3B (Wu et al., 2024a) | 30.15 | 17.09 | 0.62 | 15.06 | 9.30 | 51.34 | 57.20 | 51.97 | 18.67 | 27.93 |
| Qwen2-VL-2B (Wang et al., 2024) | 89.16 | 64.11 | 32.38 | 25.18 | 38.20 | 57.21 | 73.40 | 79.90 | 43.07 | 55.84 |
| Qwen2.5-VL-3B (Wang et al., 2024) | 93.00 | 75.83 | 32.84 | 24.82 | 53.46 | 71.16 | 83.91 | 79.29 | 71.66 | 65.10 |
| InternVL-2.5-2B (Chen et al., 2024b) | 87.70 | 61.85 | 13.14 | 16.58 | 36.33 | 57.26 | 74.96 | 76.85 | 42.20 | 51.87 |
| InternVL-3-2B (Zhu et al., 2025) | 87.33 | 66.99 | 37.90 | 29.79 | 39.44 | 59.91 | 75.32 | 78.69 | 43.46 | 57.64 |
| DeepSeek-VL2-Tiny-3.4B (Wu et al., 2024b) | 88.57 | 63.88 | 25.11 | 19.04 | 35.07 | 52.15 | 80.92 | 80.48 | 56.30 | 55.72 |
| Phi3.5-Vision-4B (Abdin et al., 2024) | 86.00 | 56.20 | 10.47 | 7.49 | 17.18 | 30.43 | 82.16 | 73.12 | 70.70 | 48.19 |
| Qwen2-VL-7B (Wang et al., 2024) | 93.83 | 76.12 | 34.55 | 23.37 | 52.52 | 74.68 | 83.16 | 84.48 | 53.97 | 64.08 |
| Qwen2.5-VL-7B (Bai et al., 2025) | 94.88 | 82.49 | 42.21 | 24.26 | 61.96 | 78.56 | 86.00 | 85.35 | 76.10 | 70.20 |
| LLaVA-NeXT-7B (Xu et al., 2024) | 63.51 | 30.90 | 1.30 | 5.35 | 20.06 | 52.83 | 52.12 | 65.10 | 32.87 | 36.00 |
| DocOwl1.5-8B (Hu et al., 2024) | 80.73 | 49.94 | 68.84 | 37.99 | 38.87 | 79.67 | 68.56 | 68.91 | 52.60 | 60.68 |
| InternVL-2.5-8B (Chen et al., 2024b) | 91.98 | 75.36 | 34.55 | 22.31 | 50.33 | 74.75 | 82.84 | 79.00 | 52.10 | 62.58 |
| InternVL-3-8B (Zhu et al., 2025) | 91.99 | 73.90 | 51.24 | 36.41 | 53.60 | 72.27 | 85.60 | 82.41 | 53.26 | 66.74 |
| Fuyu-8B (Bavishi et al., 2023) | 48.97 | 23.09 | 4.78 | 6.63 | 14.55 | 47.91 | 44.36 | 46.02 | 15.49 | 22.97 |
| Ovis-1.6-Gemma2-9B (Lu et al., 2024) | 88.84 | 73.97 | 45.16 | 23.91 | 50.72 | 76.66 | 81.40 | 77.73 | 48.33 | 62.96 |
| Llama3.2-11B (Grattafiori et al., 2024) | 82.71 | 36.62 | 1.78 | 3.47 | 23.03 | 58.33 | 23.80 | 54.28 | 22.40 | 34.04 |
| Pixtral-12B (Agrawal et al., 2024) | 87.67 | 49.45 | 27.37 | 24.07 | 45.18 | 73.53 | 71.80 | 76.09 | 67.13 | 58.03 |
| Document Understanding Instructed Models | | | | | | | | | | |
| (Instruction Tuned on BigDocs-7.5M + DocDownStream (Rodriguez et al., 2024a; Hu et al., 2024)) | | | | | | | | | | |
| Qwen2-VL-2B (base+) (Wang et al., 2024) | 57.23 | 31.88 | 49.31 | 34.39 | 31.61 | 64.75 | 68.60 | 61.01 | 47.53 | 49.59 |
| AlignVLM-Llama-3.2-1B (ours) | 72.42 | 38.16 | 60.47 | 33.71 | 28.66 | 71.31 | 65.44 | 48.81 | 50.29 | 52.14 |
| AlignVLM-Llama-3.2-3B (ours) | 79.63 | 44.53 | 63.49 | 35.25 | 38.59 | 78.51 | 71.88 | 57.38 | 60.10 | 58.81 |
| DocOwl1.5-8B (base+) (Hu et al., 2024) | 78.70 | 47.62 | 64.39 | 36.93 | 35.69 | 72.65 | 65.80 | 67.30 | 49.03 | 57.56 |
| Llama3.2-11B (base+) (Grattafiori et al., 2024) | 78.99 | 44.27 | 67.05 | 37.22 | 40.18 | 78.04 | 71.40 | 68.46 | 56.73 | 60.26 |
| AlignVLM-Llama-3.1-8B (ours) | 81.18 | 53.75 | 63.25 | 35.50 | 45.31 | 83.04 | 75.00 | 64.60 | 64.33 | 62.88 |
Setup.
We conduct all experiments using 8 nodes of H100 GPUs, totaling 64 GPUs. For model training, we leverage the MS-Swift framework (Zhao et al., 2024) for its flexibility. Additionally, we utilize the DeepSpeed framework (Aminabadi et al., 2022), specifically the ZeRO-3 configuration, to optimize efficient parallel training across multiple nodes. Detailed hyperparameters are outlined in Appendix A.1.
Baselines.
Our work focuses on architectural innovations, so we ensure that all baselines are trained on the same datasets. To enable fair comparisons, we evaluate our models against a set of Base VLMs fine-tuned on the same instruction-tuning tasks (Stages 2 and 3) as our models, using the BigDocs-7.5M and BigDocs-DocDownstream datasets. This approach ensures consistent training data, avoiding biases introduced by the Instruct versions of VLMs, which are often trained on undisclosed instruction-tuning datasets. Due to the scarcity of recently released, publicly available Base VLMs, we primarily compare our model against the following Base VLMs of varying sizes: Qwen2-VL-2B (Wang et al., 2024), DocOwl1.5-8B (Hu et al., 2024), and Llama 3.2-11B (Grattafiori et al., 2024).
For additional context, we also include results from the Instruct versions of recent VLMs of different sizes: Phi3.5-Vision-4B (Abdin et al., 2024), Qwen2-VL-2B and 7B (Wang et al., 2024), Qwen2.5-VL-7B (Qwen et al., 2025), LLaVA-NeXT-7B (Liu et al., 2024), InternVL2.5-2B and 8B (Chen et al., 2024b), InternVL3-2B and 8B (Zhu et al., 2025), Janus-1.3B (Wu et al., 2024a), DeepSeek-VL2-Tiny (Wu et al., 2024b), Ovis1.6-Gemma-9B (Lu et al., 2024), Llama3.2-11B (Grattafiori et al., 2024), DocOwl1.5-8B (Hu et al., 2024), and Pixtral-12B (Agrawal et al., 2024).
Evaluation Benchmarks.
We evaluate our models on a diverse range of document understanding benchmarks that assess the model’s capabilities in OCR, chart reasoning, table processing, and form comprehension. In particular, we employ the VLMEvalKit (Duan et al., 2024) framework and report the results on the following popular benchmarks: DocVQA (Mathew et al., 2021b), InfoVQA (Mathew et al., 2021a), DeepForm (Svetlichnaya, 2020), KLC (Stanisławek et al., 2021), WTQ (Pasupat and Liang, 2015), TabFact (Chen et al., 2020), ChartQA (Masry et al., 2022), TextVQA (Singh et al., 2019), and TableVQA (Kim et al., 2024).
5 Results
Table 2: Impact of Connector Designs on VLM Performance: We present the results of experiments evaluating different connector designs for conditioning LLMs on visual features. Our proposed Align connector is compared against a basic Multi-Layer Perceptron (MLP), the Perceiver Resampler, and Ovis. The results demonstrate that Align consistently outperforms these alternatives across all benchmarks.
| Model | DocVQA VAL | InfoVQA VAL | DeepForm TEST | KLC TEST | WTQ TEST | TabFact TEST | ChartQA TEST | TextVQA VAL | TableVQA TEST | Avg. Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.2-3B-MLP | 71.46 | 37.56 | 62.07 | 33.36 | 28.94 | 73.22 | 66.48 | 53.56 | 50.96 | 53.06 |
| Llama-3.2-3B-Perceiver R. | 69.08 | 34.13 | 57.08 | 31.75 | 27.95 | 71.93 | 65.16 | 51.33 | 47.76 | 50.68 |
| Llama-3.2-3B-Ovis | 74.68 | 42.11 | 58.02 | 33.50 | 33.13 | 76.67 | 67.92 | 52.60 | 53.93 | 54.72 |
| Llama-3.2-3B-Align (ours) | 79.63 | 44.53 | 63.49 | 35.25 | 38.59 | 78.51 | 71.88 | 57.38 | 60.10 | 58.81 |
5.1 Main Results
Table 1 presents the performance of AlignVLM compared to state-of-the-art (SOTA) open- and closed-source instructed models, as well as baseline Base VLMs fine-tuned in the same instruction-tuning setup. The results demonstrate that AlignVLM consistently outperforms all Base VLMs within the same size category and achieves competitive performance against SOTA Instruct VLMs despite being trained on a more limited data regime. Below, we provide a detailed analysis.
AlignVLM vs. Base VLMs.
Our AlignVLM models, based on Llama 3.2-1B and Llama 3.2-3B, significantly outperform the corresponding Base VLM, Qwen2-VL-2B, by up to 9.22%. Notably, AlignVLM-Llama-3.2-3B surpasses DocOwl1.5-8B, which has 4B more parameters, demonstrating the effectiveness of Align in enhancing multimodal capabilities compared to traditional shallow fusion methods (e.g., MLPs). Furthermore, our 8B model achieves a 2.62% improvement over Llama3.2-11B despite sharing the same Base LLM, Llama3.1-8B. Since all models in this comparison were trained on the same instruction-tuning setup, this experiment provides a controlled evaluation, isolating the impact of architectural differences rather than dataset biases. Consequently, these results suggest that AlignVLM outperforms VLMs with shallow fusion techniques and surpasses parameter-heavy deep fusion VLMs, such as Llama3.2-11B, while maintaining a more efficient architecture.
AlignVLM vs. Instruct VLMs.
Even though open-source Instruct models are trained on significantly larger, often undisclosed instruction-tuning datasets, AlignVLM achieves competitive performance. For example, AlignVLM-Llama-3.2-3B (58.81%) outperforms other strong instruction-tuned VLMs in its size class, such as Qwen2-VL-2B and InternVL-3-2B, by considerable margins (2.97% and 1.17%, respectively). While it falls slightly behind Qwen2.5-VL-3B, a direct comparison is not entirely fair, as the latter was trained on a proprietary instruction-tuning dataset.
Additionally, our 8B model outperforms significantly larger models such as Llama 3.2-11B and Pixtral-12B by substantial margins. It also surpasses InternVL-2.5-8B and performs competitively with Qwen2.5-VL-7B, though a direct comparison may not be entirely fair since Qwen2.5-VL-7B was trained on an undisclosed instruction-tuning dataset. Finally, AlignVLM also exhibits comparable performance to closed-source models like GeminiPro-1.5 and GPT-4o.
Overall, these results validate the effectiveness of Align and establish AlignVLM as a state-of-the-art model for multimodal document understanding.
5.2 Impact of Connector Designs on VLM Performance
5.2.1 High-Resource Training Regime
To assess the effectiveness of our Align module, we compare it against three different and widely used shallow fusion VLM connectors: MLP, Perceiver Resampler, and Ovis. These experiments were conducted under identical training conditions (datasets, hyperparameters, training stages), as outlined in Appendix A.1, ensuring a fair and rigorous comparison. The results in Table 2 show that Align consistently outperforms all alternatives, demonstrating its superiority in aligning the visual and textual modalities for multimodal document understanding. MLP and Perceiver Resampler achieve the lowest performance, 53.06% and 50.68%, respectively, due to their direct feature projection, which lacks an explicit mechanism to align visual features with the LLM’s text space, leading to misalignment. Ovis introduces a separate visual embedding table, but this additional complexity does not significantly improve alignment, yielding only 54.72% accuracy. In contrast, Align ensures that visual features remain within the convex hull of the LLM’s text latent space, leveraging the linguistic priors of the LLM to enhance alignment and mitigate noisy embeddings. This design leads to the highest performance (58.81%), establishing Align as the most effective connector for integrating vision and language in multimodal document understanding. We provide some example outputs of the Llama-3.2-3B models with different connector designs in Appendix A.4. Furthermore, we include an analysis of the runtime efficiency and memory usage of different connectors in Appendix A.2.
5.2.2 Low-Resource Training Regime
The previous section focused on large-scale training setups involving millions of data samples (BigDocs-7.5M), which require significant compute resources and limit the number of baselines that we were able to compare against. Here, we examine whether Align remains effective in a low-resource setting.
We conduct additional experiments using SigLIP-400M as the vision encoder and Llama-3.2-3B as the language model, fine-tuned on the LLaVA-NeXT dataset Liu et al. (2024), which contains 779K samples. We follow the official LLaVA-NeXT configuration for both training stages. (i) Pretraining: the model is trained on the LLaVA-558K image–caption dataset Liu et al. (2024), freezing both the LLM and vision encoder while fine-tuning the connector (learning rate = 1e-3, batch size = 32, 1 epoch on 8 × H100 GPUs). To handle high-resolution document images, we adopt the "anyres_max_9" strategy with grid weaving from 1×1 to 6×6, supporting resolutions up to 2304×2304 with 729 tokens per grid; (ii) Instruction tuning: the model is further fine-tuned on the LLaVA-NeXT-779K instruction dataset with learning rates of 1e-5 for the LLM and connector, 2e-6 for the vision encoder, batch size = 8, for 1 epoch.
This lightweight setup allows direct comparison across more connector architectures, including MLP Liu et al. (2023a), Perceiver Resampler, Ovis Lu et al. (2024), H-Reducer (1×4) Hu et al. (2024), and HoneyBee (C-Abstractor) Cha et al. (2024), all trained under identical conditions for fairness. Since the LLaVA-NeXT dataset is general-purpose and not exclusively document-focused like BigDocs-7.5M (Rodriguez et al., 2024a), it allows us to evaluate whether the Align connector generalizes beyond document understanding to broader visual reasoning. Accordingly, we assess all models on a comprehensive suite of benchmarks spanning both document understanding and general vision-language tasks. The document understanding benchmarks include DocVQA Mathew et al. (2021b), InfoVQA Mathew et al. (2021a), ChartQA Masry et al. (2022), and TextVQA Singh et al. (2019). For general vision-language evaluation, we report results on MMMU-dev (Yue et al., 2024), SeedBench (Li et al., 2023a), MMVet (Yu et al., 2024), POPE (Li et al., 2023c), and GQA (Hudson and Manning, 2019).
Table 3: Connector Performance under a Low-Resource Training Regime: We evaluate the effectiveness of a broader set of shallow-fusion connectors trained on limited data. The Align connector achieves the highest performance, with notably larger gains on document understanding tasks, demonstrating its data efficiency and strong inductive bias.
| Model | DocVQA | InfoVQA | ChartQA | TextVQA | Doc. Avg. | MMMU | SeedBench | MMVet | POPE | GQA | Gen. Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.2-3B-MLP | 42.11 | 19.93 | 48.44 | 51.97 | 40.61 | 33.33 | 58.54 | 31.14 | 87.35 | 57.62 | 53.59 |
| Llama-3.2-3B-Perceiver | 32.18 | 18.10 | 40.00 | 44.31 | 33.64 | 35.22 | 63.70 | 26.19 | 84.92 | 55.86 | 53.17 |
| Llama-3.2-3B-Ovis | 57.73 | 26.39 | 54.52 | 55.60 | 48.56 | 31.89 | 60.97 | 30.41 | 88.26 | 56.23 | 53.55 |
| Llama-3.2-3B-Hreducer | 34.59 | 17.57 | 45.64 | 47.13 | 36.23 | 35.00 | 61.82 | 28.39 | 87.48 | 58.24 | 54.18 |
| Llama-3.2-3B-HoneyBee | 55.86 | 19.36 | 55.32 | 58.13 | 47.16 | 32.11 | 61.18 | 34.31 | 89.28 | 54.79 | 54.33 |
| Llama-3.2-3B-Align (ours) | 71.43 | 30.50 | 69.72 | 65.63 | 59.32 | 35.33 | 63.27 | 35.32 | 88.85 | 61.67 | 56.88 |
As summarized in Table 3, Align consistently outperforms the other connectors under this low-data regime, with stronger gains on document understanding tasks. The wider performance margin between Align and the other connectors under limited data (Table 3), compared to the high-resource setting (Table 2), underscores the benefit of its inductive bias. By grounding visual features within the LLM’s text embedding space, Align learns more efficiently from fewer samples, unlike direct-projection connectors that rely heavily on large datasets. This makes Align especially valuable for resource-constrained environments such as academic labs or small-scale industrial research setups, where both data and compute are limited.
5.3 Probability Distribution over Text Tokens Analysis
To better understand the behavior of Align, we examine the probability distribution $\mathbf{P}_{\text{vocab}}$ in Eq. (1), over the LLM’s text vocabulary, generated from visual features. Specifically, we process 100 document images through the vision encoder and Align, then average the resulting probability distributions across all image patches. The final distribution is shown in Figure 3. As illustrated, the distribution is dense (rather than sparse), with the highest probability assigned to any single token being 0.0118. This can be explained by the vision feature space being continuous and of much higher cardinality than the discrete text space. Indeed, while the LLM has 128K distinct vocabulary tokens, an image patch (e.g., 14×14 pixels) contains continuous, high-dimensional information that cannot be effectively mapped to a single or a few discrete tokens.
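The averaging procedure can be sketched as follows (ours; it reuses the fields of the `AlignConnector` sketch from Section 3.1 and assumes `images` yields preprocessed document images):

```python
import torch

@torch.no_grad()
def average_vocab_distribution(images, vision_encoder, align) -> torch.Tensor:
    """Average P_vocab (Eq. 1) over every patch of every image."""
    total, n_patches = None, 0
    for img in images:                      # e.g., 100 document images
        f = vision_encoder(img)             # (num_patches, d) visual features
        logits = align.norm2(align.w2(align.norm1(align.w1(f))))
        p = torch.softmax(logits, dim=-1)   # (num_patches, V)
        total = p.sum(dim=0) if total is None else total + p.sum(dim=0)
        n_patches += p.shape[0]
    avg = total / n_patches                 # (V,) averaged distribution
    print("max single-token probability:", avg.max().item())
    return avg
```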
Table 4: Performance comparison when evaluating Align with the full text embedding vocabulary (128K) versus the reduced subset of 3.4K high-probability embeddings. The results show negligible performance degradation, indicating that Align relies primarily on a small subset of embeddings.
| Model | DocVQA VAL | InfoVQA VAL | DeepForm TEST | KLC TEST | WTQ TEST | TabFact TEST | ChartQA TEST | TextVQA VAL | TableVQA TEST | Avg. Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.2-3B-Align (Full Embeddings) | 79.63 | 44.53 | 63.49 | 35.25 | 38.59 | 78.51 | 71.88 | 57.38 | 60.10 | 58.81 |
| Llama-3.2-3B-Align (3.4K Embeddings) | 79.40 | 44.13 | 63.64 | 35.02 | 38.26 | 78.83 | 71.72 | 57.48 | 59.80 | 58.69 |
We conducted a deeper analysis of the token probability distributions produced by the Align connector. Our observations show that Align consistently assigns high probabilities to approximately 3.4K tokens from the entire vocabulary, while the remaining tokens receive negligible probabilities (below $10^{-6}$ ). To better understand this behavior, we applied Principal Component Analysis (PCA) to reduce the dimensionality of the embeddings and visualized them in a two-dimensional space, as shown in Figure 4. The visualization reveals that these 3.4K tokens densely and comprehensively span the latent space of the LLM’s text embeddings. To validate this finding, we conducted additional evaluation experiments in which we retained only these 3.4K high-probability embeddings in the Align connector, entirely removing the rest during evaluation. As shown in Table 4, the performance difference compared to using the full embedding set (128K) was negligible. This confirms that Align effectively leverages and combines a compact subset of embeddings to map visual features into semantically meaningful regions within the LLM’s latent text space. Moreover, this suggests that Align can be further optimized through targeted embedding pruning to improve computational efficiency without sacrificing performance.
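A simplified version of this pruning evaluation can be sketched as follows (ours; it restricts the softmax to the kept token indices and renormalizes, a small simplification of the full pipeline, again reusing the `AlignConnector` fields from the earlier sketch):

```python
import torch

@torch.no_grad()
def pruned_align_forward(f, align, keep: torch.Tensor) -> torch.Tensor:
    """keep: indices of the ~3.4K tokens with non-negligible average probability."""
    logits = align.norm2(align.w2(align.norm1(align.w1(f))))  # full (N, V) logits
    p = torch.softmax(logits[:, keep], dim=-1)  # softmax restricted to kept tokens
    return p @ align.e_text[keep]               # convex combo over kept embeddings

# e.g., keep = torch.nonzero(avg_p > 1e-6).squeeze(-1), with avg_p from the
# averaging analysis above.
```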
Figure 3: Probability distribution over LLM tokens, highlighting dense probabilities for whitespace tokens.
Figure 4: PCA of Align Embeddings: The principal components of the most influential embeddings in the Align Connector span most of the feature space represented by all embeddings.
5.4 Robustness to Noise Analysis
To evaluate the robustness of our Align connector to noisy visual features, we conduct an experiment where random Gaussian noise is added to the visual features produced by the vision encoder before passing them into the connector. Specifically, given the visual features $\mathbf{F}\in\mathbb{R}^{N\times d}$ output by the vision encoder (where $N$ is the number of feature vectors and $d$ is their dimensionality), we perturb them as
$$
\widetilde{\mathbf{F}}=\mathbf{F}+\mathbf{N},\quad\mathbf{N}\sim\mathcal{N}(0,\sigma=3).
$$
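This probe, and the cosine-distance measurement reported below, can be sketched as follows (ours; `connector` stands for any of the modules compared in Table 5):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def noise_sensitivity(f: torch.Tensor, connector, sigma: float = 3.0) -> float:
    """Mean cosine distance between connector outputs on clean vs. noisy features."""
    f_noisy = f + sigma * torch.randn_like(f)          # F~ = F + N(0, sigma)
    out_clean, out_noisy = connector(f), connector(f_noisy)
    cos = F.cosine_similarity(out_clean, out_noisy, dim=-1)
    return (1.0 - cos).mean().item()
```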
Table 5: Robustness to Noise. Comparison of Avg. Scores with and without Gaussian noise ( $\sigma=3$ ), including performance drop ( $\Delta$ ).
| Model | Without Noise | With Noise | Drop ( $\Delta$ ) |
| --- | --- | --- | --- |
| Llama-3.2-3B-MLP | 53.06 | 27.52 | $\downarrow 25.54$ |
| Llama-3.2-3B-Align (ours) | 58.81 | 57.14 | $\downarrow\textbf{1.67}$ |
As shown in Table 5, our Align connector demonstrates high robustness to noise, with only a 1.67% average drop in performance. In contrast, the widely adopted MLP connector suffers a significant performance degradation of 25.54%, highlighting its vulnerability to noisy inputs. Furthermore, we measured the average cosine distance between the original and noise-perturbed visual embeddings using both the Align and MLP connectors. Align showed significantly lower distances (0.0036) than MLP (0.3938), further validating its robustness to noise. These empirical results support our hypothesis that leveraging the knowledge encoded in the LLM’s text embeddings and constraining the visual features within the convex hull of the text latent space act as a regularization mechanism, reducing the model’s sensitivity to noisy visual features.
6 Conclusion
We introduce Align, a novel connector designed to align vision and language latent spaces in vision-language models (VLMs), specifically enhancing multimodal document understanding. By improving cross-modal alignment and minimizing noisy embeddings, our models, AlignVLM, which leverage Align, achieve state-of-the-art performance across diverse document understanding tasks. This includes outperforming base VLMs trained on the same datasets and achieving competitive performance with open-source instruct models trained on undisclosed data. Extensive experiments and ablations validate the robustness and effectiveness of Align compared to existing connector designs, establishing it as a significant contribution to vision-language modeling. Future work will explore training on more diverse instruction-tuning datasets to generalize to broader domains.
References
- Abdin et al. [2024] M. Abdin, J. Aneja, H. Awadalla, A. Awadallah, A. A. Awan, N. Bach, A. Bahree, A. Bakhtiari, J. Bao, H. Behl, A. Benhaim, M. Bilenko, J. Bjorck, S. Bubeck, M. Cai, Q. Cai, V. Chaudhary, D. Chen, D. Chen, W. Chen, Y.-C. Chen, Y.-L. Chen, H. Cheng, P. Chopra, X. Dai, M. Dixon, R. Eldan, V. Fragoso, J. Gao, M. Gao, M. Gao, A. Garg, A. D. Giorno, A. Goswami, S. Gunasekar, E. Haider, J. Hao, R. J. Hewett, W. Hu, J. Huynh, D. Iter, S. A. Jacobs, M. Javaheripi, X. Jin, N. Karampatziakis, P. Kauffmann, M. Khademi, D. Kim, Y. J. Kim, L. Kurilenko, J. R. Lee, Y. T. Lee, Y. Li, Y. Li, C. Liang, L. Liden, X. Lin, Z. Lin, C. Liu, L. Liu, M. Liu, W. Liu, X. Liu, C. Luo, P. Madan, A. Mahmoudzadeh, D. Majercak, M. Mazzola, C. C. T. Mendes, A. Mitra, H. Modi, A. Nguyen, B. Norick, B. Patra, D. Perez-Becker, T. Portet, R. Pryzant, H. Qin, M. Radmilac, L. Ren, G. de Rosa, C. Rosset, S. Roy, O. Ruwase, O. Saarikivi, A. Saied, A. Salim, M. Santacroce, S. Shah, N. Shang, H. Sharma, Y. Shen, S. Shukla, X. Song, M. Tanaka, A. Tupini, P. Vaddamanu, C. Wang, G. Wang, L. Wang, S. Wang, X. Wang, Y. Wang, R. Ward, W. Wen, P. Witte, H. Wu, X. Wu, M. Wyatt, B. Xiao, C. Xu, J. Xu, W. Xu, J. Xue, S. Yadav, F. Yang, J. Yang, Y. Yang, Z. Yang, D. Yu, L. Yuan, C. Zhang, C. Zhang, J. Zhang, L. L. Zhang, Y. Zhang, Y. Zhang, Y. Zhang, and X. Zhou. Phi-3 technical report: A highly capable language model locally on your phone, 2024. URL https://arxiv.org/abs/2404.14219.
- Achiam et al. [2023] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Agrawal et al. [2024] P. Agrawal, S. Antoniak, E. B. Hanna, B. Bout, D. Chaplot, J. Chudnovsky, D. Costa, B. D. Monicault, S. Garg, T. Gervet, S. Ghosh, A. Héliou, P. Jacob, A. Q. Jiang, K. Khandelwal, T. Lacroix, G. Lample, D. L. Casas, T. Lavril, T. L. Scao, A. Lo, W. Marshall, L. Martin, A. Mensch, P. Muddireddy, V. Nemychnikova, M. Pellat, P. V. Platen, N. Raghuraman, B. Rozière, A. Sablayrolles, L. Saulnier, R. Sauvestre, W. Shang, R. Soletskyi, L. Stewart, P. Stock, J. Studnia, S. Subramanian, S. Vaze, T. Wang, and S. Yang. Pixtral 12b, 2024. URL https://arxiv.org/abs/2410.07073.
- Alayrac et al. [2022] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M. Monteiro, J. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. Binkowski, R. Barreira, O. Vinyals, A. Zisserman, and K. Simonyan. Flamingo: a visual language model for few-shot learning, 2022. URL https://arxiv.org/abs/2204.14198.
- Aminabadi et al. [2022] R. Y. Aminabadi, S. Rajbhandari, M. Zhang, A. A. Awan, C. Li, D. Li, E. Zheng, J. Rasley, S. Smith, O. Ruwase, and Y. He. Deepspeed inference: Enabling efficient inference of transformer models at unprecedented scale, 2022. URL https://arxiv.org/abs/2207.00032.
- Anthropic [2024] Anthropic. The claude 3 model family: Opus, sonnet, haiku. 2024.
- Bai et al. [2025] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin. Qwen2.5-vl technical report, 2025. URL https://arxiv.org/abs/2502.13923.
- Bavishi et al. [2023] R. Bavishi, E. Elsen, C. Hawthorne, M. Nye, A. Odena, A. Somani, and S. Taşırlar. Introducing our multimodal models, 2023. URL https://www.adept.ai/blog/fuyu-8b.
- Beyer et al. [2024] L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdulmohsin, M. Tschannen, E. Bugliarello, T. Unterthiner, D. Keysers, S. Koppula, F. Liu, A. Grycner, A. Gritsenko, N. Houlsby, M. Kumar, K. Rong, J. Eisenschlos, R. Kabra, M. Bauer, M. Bošnjak, X. Chen, M. Minderer, P. Voigtlaender, I. Bica, I. Balazevic, J. Puigcerver, P. Papalampidi, O. Henaff, X. Xiong, R. Soricut, J. Harmsen, and X. Zhai. Paligemma: A versatile 3b vlm for transfer, 2024. URL https://arxiv.org/abs/2407.07726.
- Bommasani et al. [2023] R. Bommasani, K. Klyman, S. Longpre, S. Kapoor, N. Maslej, B. Xiong, D. Zhang, and P. Liang. The foundation model transparency index, 2023. URL https://arxiv.org/abs/2310.12941.
- Brown et al. [2020] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Cha et al. [2024] J. Cha, W. Kang, J. Mun, and B. Roh. Honeybee: Locality-enhanced projector for multimodal llm, 2024. URL https://arxiv.org/abs/2312.06742.
- Changpinyo et al. [2021] S. Changpinyo, P. Sharma, N. Ding, and R. Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts, 2021. URL https://arxiv.org/abs/2102.08981.
- Chen et al. [2020] W. Chen, H. Wang, J. Chen, Y. Zhang, H. Wang, S. Li, X. Zhou, and W. Y. Wang. Tabfact: A large-scale dataset for table-based fact verification. In International Conference on Learning Representations, 2020.
- Chen et al. [2024a] Z. Chen, W. Wang, H. Tian, S. Ye, Z. Gao, E. Cui, W. Tong, K. Hu, J. Luo, Z. Ma, J. Ma, J. Wang, X. Dong, H. Yan, H. Guo, C. He, B. Shi, Z. Jin, C. Xu, B. Wang, X. Wei, W. Li, W. Zhang, B. Zhang, P. Cai, L. Wen, X. Yan, M. Dou, L. Lu, X. Zhu, T. Lu, D. Lin, Y. Qiao, J. Dai, and W. Wang. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites, 2024a. URL https://arxiv.org/abs/2404.16821.
- Chen et al. [2024b] Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024b.
- Dai et al. [2024] W. Dai, N. Lee, B. Wang, Z. Yang, Z. Liu, J. Barker, T. Rintamaki, M. Shoeybi, B. Catanzaro, and W. Ping. Nvlm: Open frontier-class multimodal llms. arXiv preprint arXiv:2409.11402, 2024.
- Diao et al. [2024] H. Diao, Y. Cui, X. Li, Y. Wang, H. Lu, and X. Wang. Unveiling encoder-free vision-language models. arXiv preprint arXiv:2406.11832, 2024.
- Drouin et al. [2024] A. Drouin, M. Gasse, M. Caccia, I. H. Laradji, M. D. Verme, T. Marty, L. Boisvert, M. Thakkar, Q. Cappart, D. Vazquez, N. Chapados, and A. Lacoste. Workarena: How capable are web agents at solving common knowledge work tasks?, 2024. URL https://arxiv.org/abs/2403.07718.
- Duan et al. [2024] H. Duan, J. Yang, Y. Qiao, X. Fang, L. Chen, Y. Liu, X. Dong, Y. Zang, P. Zhang, J. Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 11198–11201, 2024.
- Dubey et al. [2024] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, A. Goyal, A. Hartshorn, A. Yang, A. Mitra, A. Sravankumar, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- Grattafiori et al. [2024] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783.
- He et al. [2015] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition, 2015. URL https://arxiv.org/abs/1512.03385.
- Hu et al. [2024] A. Hu, H. Xu, J. Ye, M. Yan, L. Zhang, B. Zhang, C. Li, J. Zhang, Q. Jin, F. Huang, and J. Zhou. mplug-docowl 1.5: Unified structure learning for ocr-free document understanding, 2024. URL https://arxiv.org/abs/2403.12895.
- Hudson and Manning [2019] D. A. Hudson and C. D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering, 2019. URL https://arxiv.org/abs/1902.09506.
- Jaume et al. [2019] G. Jaume, H. K. Ekenel, and J.-P. Thiran. Funsd: A dataset for form understanding in noisy scanned documents, 2019. URL https://arxiv.org/abs/1905.13538.
- Kim et al. [2022] G. Kim, T. Hong, M. Yim, J. Nam, J. Park, J. Yim, W. Hwang, S. Yun, D. Han, and S. Park. Ocr-free document understanding transformer, 2022. URL https://arxiv.org/abs/2111.15664.
- Kim et al. [2024] Y. Kim, M. Yim, and K. Y. Song. Tablevqa-bench: A visual question answering benchmark on multiple table domains. arXiv preprint arXiv:2404.19205, 2024.
- Laurençon et al. [2024] H. Laurençon, L. Tronchon, M. Cord, and V. Sanh. What matters when building vision-language models?, 2024. URL https://arxiv.org/abs/2405.02246.
- Lee et al. [2023] K. Lee, M. Joshi, I. Turc, H. Hu, F. Liu, J. Eisenschlos, U. Khandelwal, P. Shaw, M.-W. Chang, and K. Toutanova. Pix2struct: Screenshot parsing as pretraining for visual language understanding, 2023. URL https://arxiv.org/abs/2210.03347.
- Li et al. [2023a] B. Li, R. Wang, G. Wang, Y. Ge, Y. Ge, and Y. Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension, 2023a. URL https://arxiv.org/abs/2307.16125.
- Li et al. [2024] B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y. Li, Z. Liu, and C. Li. Llava-onevision: Easy visual task transfer, 2024. URL https://arxiv.org/abs/2408.03326.
- Li et al. [2023b] J. Li, D. Li, S. Savarese, and S. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023b. URL https://arxiv.org/abs/2301.12597.
- Li et al. [2023c] Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, and J.-R. Wen. Evaluating object hallucination in large vision-language models, 2023c. URL https://arxiv.org/abs/2305.10355.
- Liu et al. [2023a] H. Liu, C. Li, Y. Li, and Y. J. Lee. Improved baselines with visual instruction tuning, 2023a.
- Liu et al. [2023b] H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning, 2023b.
- Liu et al. [2024] H. Liu, C. Li, Y. Li, B. Li, Y. Zhang, S. Shen, and Y. J. Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/.
- Lu et al. [2024] S. Lu, Y. Li, Q.-G. Chen, Z. Xu, W. Luo, K. Zhang, and H.-J. Ye. Ovis: Structural embedding alignment for multimodal large language model, 2024. URL https://arxiv.org/abs/2405.20797.
- Masry et al. [2022] A. Masry, D. X. Long, J. Q. Tan, S. Joty, and E. Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022.
- Mathew et al. [2021a] M. Mathew, V. Bagal, R. P. Tito, D. Karatzas, E. Valveny, and C. V. Jawahar. Infographicvqa, 2021a. URL https://arxiv.org/abs/2104.12756.
- Mathew et al. [2021b] M. Mathew, D. Karatzas, and C. V. Jawahar. Docvqa: A dataset for vqa on document images, 2021b. URL https://arxiv.org/abs/2007.00398.
- OpenAI et al. [2023] OpenAI, J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, R. Avila, I. Babuschkin, S. Balaji, V. Balcom, P. Baltescu, H. Bao, M. Bavarian, J. Belgum, I. Bello, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Park et al. [2019] S. Park, S. Shin, B. Lee, J. Lee, J. Surh, M. Seo, and H. Lee. Cord: A consolidated receipt dataset for post-ocr parsing. Document Intelligence Workshop at Neural Information Processing Systems, 2019.
- Pasupat and Liang [2015] P. Pasupat and P. Liang. Compositional semantic parsing on semi-structured tables. In Annual Meeting of the Association for Computational Linguistics, 2015.
- Qwen et al. [2025] Qwen Team, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, and Z. Qiu. Qwen2.5 technical report, 2025. URL https://arxiv.org/abs/2412.15115.
- Radford et al. [2021] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision, 2021. URL https://arxiv.org/abs/2103.00020.
- Raffel et al. [2023] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2023. URL https://arxiv.org/abs/1910.10683.
- Rodriguez et al. [2024a] J. Rodriguez, X. Jian, S. S. Panigrahi, T. Zhang, A. Feizi, A. Puri, A. Kalkunte, F. Savard, A. Masry, S. Nayak, R. Awal, M. Massoud, A. Abaskohi, Z. Li, S. Wang, P.-A. Noël, M. L. Richter, S. Vadacchino, S. Agarwal, S. Biswas, S. Shanian, Y. Zhang, N. Bolger, K. MacDonald, S. Fauvel, S. Tejaswi, S. Sunkara, J. Monteiro, K. D. Dvijotham, T. Scholak, N. Chapados, S. Kharagani, S. Hughes, M. Özsu, S. Reddy, M. Pedersoli, Y. Bengio, C. Pal, I. Laradji, S. Gella, P. Taslakian, D. Vazquez, and S. Rajeswar. Bigdocs: An open and permissively-licensed dataset for training multimodal models on document and code tasks, 2024a. URL https://arxiv.org/abs/2412.04626.
- Rodriguez et al. [2022] J. A. Rodriguez, D. Vazquez, I. Laradji, M. Pedersoli, and P. Rodriguez. Ocr-vqgan: Taming text-within-image generation, 2022. URL https://arxiv.org/abs/2210.11248.
- Rodriguez et al. [2024b] J. A. Rodriguez, A. Puri, S. Agarwal, I. H. Laradji, P. Rodriguez, S. Rajeswar, D. Vazquez, C. Pal, and M. Pedersoli. Starvector: Generating scalable vector graphics code from images and text, 2024b. URL https://arxiv.org/abs/2312.11556.
- Singh et al. [2019] A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Batra, D. Parikh, and M. Rohrbach. Towards vqa models that can read. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
- Stanisławek et al. [2021] T. Stanisławek, F. Graliński, A. Wróblewska, D. Lipiński, A. Kaliska, P. Rosalska, B. Topolski, and P. Biecek. Kleister: key information extraction datasets involving long documents with complex layouts. In International Conference on Document Analysis and Recognition, 2021.
- Svetlichnaya [2020] S. Svetlichnaya. Deepform: Understand structured documents at scale, 2020.
- Team [2024] Gemini Team. Gemini: A family of highly capable multimodal models, 2024. URL https://arxiv.org/abs/2312.11805.
- Vogus and Llansóe [2021] C. Vogus and E. Llansóe. Making transparency meaningful: A framework for policymakers. Center for Democracy and Technology, 2021.
- Wang et al. [2023a] D. Wang, N. Raman, M. Sibue, Z. Ma, P. Babkin, S. Kaur, Y. Pei, A. Nourbakhsh, and X. Liu. Docllm: A layout-aware generative language model for multimodal document understanding, 2023a. URL https://arxiv.org/abs/2401.00908.
- Wang et al. [2024] P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, Y. Fan, K. Dang, M. Du, X. Ren, R. Men, D. Liu, C. Zhou, J. Zhou, and J. Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution, 2024. URL https://arxiv.org/abs/2409.12191.
- Wang et al. [2023b] W. Wang, Q. Lv, W. Yu, W. Hong, J. Qi, Y. Wang, J. Ji, Z. Yang, L. Zhao, X. Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023b.
- Wu et al. [2024a] C. Wu, X. Chen, Z. Wu, Y. Ma, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, C. Ruan, and P. Luo. Janus: Decoupling visual encoding for unified multimodal understanding and generation, 2024a. URL https://arxiv.org/abs/2410.13848.
- Wu et al. [2024b] Z. Wu, X. Chen, Z. Pan, X. Liu, W. Liu, D. Dai, H. Gao, Y. Ma, C. Wu, B. Wang, Z. Xie, Y. Wu, K. Hu, J. Wang, Y. Sun, Y. Li, Y. Piao, K. Guan, A. Liu, X. Xie, Y. You, K. Dong, X. Yu, H. Zhang, L. Zhao, Y. Wang, and C. Ruan. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding, 2024b. URL https://arxiv.org/abs/2412.10302.
- Xu et al. [2024] R. Xu, Y. Yao, Z. Guo, J. Cui, Z. Ni, C. Ge, T.-S. Chua, Z. Liu, M. Sun, and G. Huang. Llava-uhd: an lmm perceiving any aspect ratio and high-resolution images. European Conference on Computer Vision, 2024. doi: 10.48550/arXiv.2403.11703.
- Yu et al. [2024] W. Yu, Z. Yang, L. Li, J. Wang, K. Lin, Z. Liu, X. Wang, and L. Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities, 2024. URL https://arxiv.org/abs/2308.02490.
- Yue et al. [2024] X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun, C. Wei, B. Yu, R. Yuan, R. Sun, M. Yin, B. Zheng, Z. Yang, Y. Liu, W. Huang, H. Sun, Y. Su, and W. Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi, 2024. URL https://arxiv.org/abs/2311.16502.
- Zhai et al. [2023] X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training, 2023. URL https://arxiv.org/abs/2303.15343.
- Zhang et al. [2024] T. Zhang, S. Wang, L. Li, G. Zhang, P. Taslakian, S. Rajeswar, J. Fu, B. Liu, and Y. Bengio. Vcr: Visual caption restoration. arXiv preprint arXiv:2406.06462, 2024.
- Zhao et al. [2024] Y. Zhao, J. Huang, J. Hu, X. Wang, Y. Mao, D. Zhang, Z. Jiang, Z. Wu, B. Ai, A. Wang, W. Zhou, and Y. Chen. Swift: A scalable lightweight infrastructure for fine-tuning, 2024. URL https://arxiv.org/abs/2408.05517.
- Zhu et al. [2025] J. Zhu, W. Wang, Z. Chen, Z. Liu, S. Ye, L. Gu, H. Tian, Y. Duan, W. Su, J. Shao, Z. Gao, E. Cui, X. Wang, Y. Cao, Y. Liu, X. Wei, H. Zhang, H. Wang, W. Xu, H. Li, J. Wang, N. Deng, S. Li, Y. He, T. Jiang, J. Luo, Y. Wang, C. He, B. Shi, X. Zhang, W. Shao, J. He, Y. Xiong, W. Qu, P. Sun, P. Jiao, H. Lv, L. Wu, K. Zhang, H. Deng, J. Ge, K. Chen, L. Wang, M. Dou, L. Lu, X. Zhu, T. Lu, D. Lin, Y. Qiao, J. Dai, and W. Wang. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models, 2025. URL https://arxiv.org/abs/2504.10479.
Appendix A Appendix
A.1 Experimental Setup
We provide detailed hyperparameters of our experiments in Table 6.
Table 6: Detailed hyperparameters for each training stage across different LLM backbones.
| LLM Backbone | Llama 3.2-1B | | | Llama 3.2-3B | | | Llama 3.1-8B | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Stage | Stage-1 | Stage-2 | Stage-3 | Stage-1 | Stage-2 | Stage-3 | Stage-1 | Stage-2 | Stage-3 |
| Trainable Parameters | Full Model | Full Model | LLM & Connector | Full Model | Full Model | LLM & Connector | Full Model | Full Model | LLM & Connector |
| Batch Size | 512 | 512 | 512 | 512 | 256 | 256 | 512 | 256 | 256 |
| Text Max Length | 1024 | 2048 | 2048 | 1024 | 2048 | 2048 | 1024 | 2048 | 2048 |
| Epochs | 1 | 1 | 5 | 1 | 1 | 5 | 1 | 1 | 5 |
| Learning Rate | $1\times 10^{-5}$ | $5\times 10^{-5}$ | $5\times 10^{-5}$ | $1\times 10^{-5}$ | $5\times 10^{-5}$ | $5\times 10^{-5}$ | $1\times 10^{-5}$ | $1\times 10^{-5}$ | $1\times 10^{-5}$ |
A.2 Runtime Comparison Between Connectors
One caveat of the Align connector is that it includes an additional LM-head layer, which slightly increases the total parameter count. However, this addition has a negligible impact on runtime efficiency because of its simple structure: it introduces only a few matrix multiplication operations (as shown in Equations 1 and 2), rather than stacking many complex layers that require sequential processing, as in deep fusion methods. A minimal sketch of this computation is given below.
To empirically validate this claim, we benchmarked the runtime and memory usage of models equipped with different connector types (MLP, Align, Ovis, and Perceiver), following the same experimental setup as in Table 2. As shown in Table 7, the results demonstrate that although the Align connector delivers notably superior performance (see Table 2), the variations in inference speed and GPU memory usage among the connectors remain minimal.
Table 7: Runtime and memory comparison between different connector designs. The results show that Align introduces negligible computational overhead compared to other connectors.
| Model | Samples | Avg Time (s) | Tokens/sec | GPU Memory (GB) |
| --- | --- | --- | --- | --- |
| Llama-3.2-3B-MLP | 2500 | 0.161 | 118.3 | 10.9 |
| Llama-3.2-3B-Perceiver | 2500 | 0.140 | 135.1 | 10.9 |
| Llama-3.2-3B-Ovis | 2500 | 0.155 | 122.5 | 10.8 |
| Llama-3.2-3B-Align | 2500 | 0.165 | 115.4 | 10.9 |
Overall, the empirical evidence confirms that the Align connector achieves an effective balance between computational efficiency and performance. It introduces only a negligible increase in runtime and memory usage while providing substantial gains in overall accuracy.
A.3 Pixel-Level Tasks Analysis
To rigorously evaluate the ability of vision-language models to integrate fine-grained visual and textual pixel-level cues, we test our model on the VCR benchmark [Zhang et al., 2024], which requires recovering partially occluded text using pixel-level hints from the revealed parts. This task stresses a VLM's text-image alignment in extreme situations. Current state-of-the-art models such as GPT-4V [OpenAI et al., 2023], Claude 3.5 Sonnet [Anthropic, 2024], and Llama-3.2 [Dubey et al., 2024] significantly underperform humans on the hard VCR task because they fail to process subtle pixel-level cues in occluded text regions: during image tokenization they frequently discard critical visual tokens based on semantic priors, overlooking the interplay between partial character strokes and the surrounding visual scene. To evaluate performance on VCR, we modify our Stage 3 SFT data composition, replacing the exclusive use of DocDownstream with a 5:1 blend of DocDownstream and VCR training data. This adjustment enables a direct evaluation of the Align connector's ability to leverage pixel-level character cues.
The experimental outcomes show that AlignVLM consistently outperforms the MLP connector model on both the easy and hard settings of the pixel-level VCR task (see Figure 5), with improvements of 14.41% on the easy setting and 10.18% on the hard setting.
We provide a case study on VCR in Figure 6, featuring four representative examples. In Figure 6(a), the MLP connector model fails to capture semantic consistency as effectively as AlignVLM: its completion “The commune first census in written history in” (where the occluded words are generated by the model and the rest appear in the image) is less semantically coherent than the completion generated by Align, “The commune first appears in written history in”.
Beyond the issue of semantic fluency, in Figure 6(b) we also observe that AlignVLM successfully identifies the uncovered portion of the letter “g” in “accounting” and uses it as a pixel-level hint to infer the correct word. In contrast, the MLP model fails to effectively attend to this crucial detail.
Figures 6(c) and 6(d) show examples where AlignVLM fails on the VCR task. These hand-picked instances show that our method confuses landmark names with similar common words: AlignVLM mistakes “Llanengan” for “Llanongan” and “Gorden” for “Garden”. In both instances the pair differs by a single character, perhaps indicating that AlignVLM tends to align visual representations to more common tokens in the vocabulary. Training AlignVLM on more contextually relevant data could potentially mitigate such errors.
<details>
<summary>x3.png Details</summary>

### Visual Description
## Bar Chart: VCR EN Easy/Hard Exact Match Comparison
### Overview
This is a horizontal bar chart comparing the "Exact Match (%)" scores of two models, "Llama-3.2-3B-Align (Ours)" and "Llama-3.2-3B-MLP", on two datasets: "VCR EN Easy" and "VCR EN Hard". The chart visually represents the performance of each model on each dataset using bar lengths proportional to the exact match percentage.
### Components/Axes
* **X-axis:** "Exact Match (%)" - Scale ranges from 0 to 60, with increments of 20.
* **Y-axis:** Two categories: "VCR EN Easy" and "VCR EN Hard".
* **Legend:** Located at the top-right of the chart.
* "Llama-3.2-3B-Align (Ours)" - Represented by a light blue color.
* "Llama-3.2-3B-MLP" - Represented by a light orange color.
### Detailed Analysis
The chart contains four horizontal bars, two for each dataset, representing the performance of each model.
* **VCR EN Easy:**
* "Llama-3.2-3B-Align (Ours)" - The light blue bar extends to approximately 65.84% on the x-axis.
* "Llama-3.2-3B-MLP" - The light orange bar extends to approximately 51.43% on the x-axis.
* **VCR EN Hard:**
* "Llama-3.2-3B-Align (Ours)" - The light blue bar extends to approximately 48.07% on the x-axis.
* "Llama-3.2-3B-MLP" - The light orange bar extends to approximately 37.89% on the x-axis.
### Key Observations
* "Llama-3.2-3B-Align (Ours)" consistently outperforms "Llama-3.2-3B-MLP" on both "VCR EN Easy" and "VCR EN Hard" datasets.
* The performance gap between the two models is larger on the "VCR EN Easy" dataset than on the "VCR EN Hard" dataset.
* Both models exhibit a performance drop when moving from the "Easy" to the "Hard" dataset, as expected.
### Interpretation
The data suggests that the Llama-3.2-3B-Align (Ours) model is more effective at achieving exact matches on the VCR (Visual Caption Restoration) task than the Llama-3.2-3B-MLP model. The larger gap on the Easy setting indicates that the alignment may be particularly beneficial in less challenging contexts, while the drop from Easy to Hard for both models reflects the increased difficulty of recovering heavily occluded text. The consistent advantage across both difficulty levels strengthens this conclusion.
</details>
Figure 5: Comparison of Llama-3.2-3B-Align and Llama-3.2-3B-MLP on the easy and hard VCR tasks.
<details>
<summary>figures/vcr_example1.png Details</summary>

### Visual Description
## Map: Atel, Romania
### Overview
The image depicts a map of the Atel commune in Alba County, Transylvania, Romania. The map is a satellite or aerial view, showing the terrain and river systems. A yellow line highlights a specific feature within the region. Below the map is a block of text providing descriptive information about the commune.
### Components/Axes
The image consists of:
* **Map:** A geographical representation of the Atel commune.
* **Text Block:** A description of the commune, its location, and composition.
* **Highlighted Line:** A yellow line tracing a feature within the map.
### Detailed Analysis or Content Details
The map shows a predominantly green landscape, indicating forested areas and vegetation. A network of lighter-colored lines represents rivers and streams flowing through the terrain. The shape of the commune is irregular, defined by the natural boundaries of the landscape.
The text block reads:
"Atel is a commune in Alba County, Transylvania, Romania. It is composed of two villages, Atel and Dupuş. The commune first…"
The text is incomplete, with the final sentence cut off.
### Key Observations
* The commune is located in a hilly or mountainous region, as evidenced by the terrain depicted on the map.
* The presence of numerous rivers and streams suggests a well-watered environment.
* The highlighted yellow line appears to follow a road or a ridge line.
* The text indicates that the commune consists of two villages: Atel and Dupuş.
### Interpretation
The image provides a visual and textual overview of the Atel commune in Romania. The map highlights the geographical characteristics of the area, while the text provides basic information about its location and composition. The incomplete sentence suggests that the text may have been part of a larger document describing the history or administrative details of the commune. The highlighted line could represent a key transportation route or a significant geographical feature within the region. The data suggests a rural, geographically defined community. The map and text work together to provide a basic understanding of the commune's location and characteristics.
</details>
| GT: | (appears in written history in) |
| --- | --- |
| MLP: | (census in written history in) ✗ |
| Align: | (appears in written history in) ✓ |
(a) Positive Example 1
<details>
<summary>figures/vcr_example2.png Details</summary>

### Visual Description
## Map: Ghana Telephone Numbering Plan
### Overview
The image is a map of Ghana, divided into regions, each associated with a specific area code. The map visually represents the Ghana telephone numbering plan. The text below the map provides a brief description of the plan and its regulation.
### Components/Axes
The map itself is the primary component. Each region of Ghana is colored differently and labeled with a two-digit area code. The text below the map serves as a caption. There are no explicit axes in the traditional chart sense.
### Detailed Analysis or Content Details
The map shows the following area codes assigned to different regions of Ghana:
* **030:** Located in the southeastern region of Ghana. (Light Green)
* **031:** Located in the southwestern coastal region of Ghana. (Blue)
* **032:** Located in the central region of Ghana. (Red)
* **033:** Located in the south-central region of Ghana. (Orange)
* **034:** Located in the south-central region of Ghana, adjacent to 032 and 033. (Dark Red)
* **035:** Located in the central region of Ghana, north of 032. (Green)
* **036:** Located in the eastern region of Ghana. (Dark Green)
* **037:** Located in the north-central region of Ghana. (Purple)
* **038:** Located in the northeastern region of Ghana. (Yellow)
* **039:** Located in the northwestern region of Ghana. (Dark Blue)
The text below the map reads: "The Ghana telephone numbering plan is … regulating telephone numbers in Ghana. It is regulated by the National Communications…" The text is incomplete and partially obscured.
### Key Observations
The area codes are distributed geographically across Ghana. The numbering appears to follow a regional pattern, with lower numbers generally assigned to southern regions and higher numbers to northern regions. The text indicates that the plan is officially regulated.
### Interpretation
The map illustrates how telephone numbers are organized within Ghana. The area codes allow for efficient routing of calls and identification of geographic locations. The regulation by the National Communications authority suggests a structured and controlled system for managing telephone numbers. The incomplete text suggests that the full description of the numbering plan would likely detail the structure of the numbers beyond the area code (e.g., subscriber numbers). The visual representation of the numbering plan is useful for understanding the geographic distribution of telephone services within Ghana.
</details>
| GT: | (the system used for assigning) |
| --- | --- |
| MLP: | (the system used for accounting) ✗ |
| Align: | (the system used for assigning) ✓ |
(b) Positive Example 2
<details>
<summary>figures/vcr_example3.png Details</summary>

### Visual Description
## Photograph: Penrhyn Dû Mines Tunnel
### Overview
The image is a photograph depicting the interior of a long, narrow tunnel, presumably within the Penrhyn Dû Mines. The tunnel appears to be constructed of rough-hewn stone, and is illuminated by a light source at the end, creating a strong perspective effect. The photograph is accompanied by a block of text.
### Components/Axes
There are no axes or components in the traditional sense of a chart or diagram. The key elements are the tunnel itself, the stone construction, the light source, and the accompanying text.
### Detailed Analysis or Content Details
The tunnel walls are uneven and textured, suggesting they were carved directly into the rock. The floor of the tunnel appears to be covered in a layer of debris or sediment. The light source at the end of the tunnel is bright, creating a strong contrast with the darker areas closer to the viewer.
The accompanying text reads:
"The Penrhyn Dû Mines are a collection of … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … 
… … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … 
… … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … 
… … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … 
… … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … … …
</details>
| GT: | (mines situated near Llanengan on) |
| --- | --- |
| MLP: | (mines situated near Llanengan on) ✓ |
| Align: | (mines situated near Llanongan on) ✗ |
(c) Negative Example 1
<details>
<summary>figures/vcr_example4.png Details</summary>

### Visual Description
## Photograph: City of Fairmount Building
### Overview
The image is a photograph of a single-story brick building. The building appears to be a municipal or public service building. Below the image is a block of text providing information about the town of Fairmount.
### Components/Axes
The image itself does not have axes or components in the traditional chart/diagram sense. The key visual element is the building itself. The text below provides descriptive information.
### Detailed Analysis or Content Details
The building is constructed of red brick and has a gabled roof. Above the entrance, on the facade of the building, is a sign that reads: "CITY OF FAIRMOUNT / GORDON COUNTY GOVERNMENT BUILDING". There is a small covered entrance with columns. A parking lot is visible in front of the building.
The text below the image reads:
"Fairmount is a town in Gordon County, Georgia, United States. As of the 2010 census it had a population of 720. Coordinates: 34°41′31″N 84°54′41″W New Echota, the capital of the Cherokee Nation from 1825 to its forced removal in 1838, is located near Fairmount."
### Key Observations
The photograph depicts a small-town government building. The text indicates Fairmount is a relatively small town with a population of 720 as of 2010. The proximity to New Echota, the former capital of the Cherokee Nation, is a notable historical detail.
### Interpretation
The image and accompanying text provide a snapshot of Fairmount, Georgia. The building suggests a functional, local government presence. The population figure indicates a rural or small-town setting. The mention of New Echota highlights the area's historical significance related to the Cherokee Nation and the Trail of Tears. The combination of the visual and textual information paints a picture of a small, historically-rooted community in Georgia. The text is partially obscured, suggesting it may have been extracted from a source with formatting issues or a low-resolution image.
</details>
| GT: | (Gorden County is home to) |
| --- | --- |
| MLP: | (Gorden County is home to) ✓ |
| Align: | (Garden County is home to) ✗ |
(d) Negative Example 2
Figure 6: Case study for pixel-level tasks. We provide examples of our proposed Align connector compared with the multilayer perceptron (MLP) connector. The Align connector tends to map visual elements to more common words. GT is the ground truth.
A.4 Case Studies
In this section, we provide case studies for the experiments in Section 5.1. Specifically, we show example outputs of Llama-3.2-3B-Align and its counterpart models with alternative connectors, Llama-3.2-3B-MLP and Llama-3.2-3B-Ovis, on three datasets: KLC [Stanisławek et al., 2021], DocVQA [Mathew et al., 2021b], and TextVQA [Singh et al., 2019]. The examples are shown in Figures 7, 8, and 9.
<details>
<summary>figures/case_1.jpg Details</summary>

### Visual Description
## Document: Ardingly College Limited - Directors' Report and Financial Statements
### Overview
The image presents the cover page of a document titled "Ardingly College Limited - Directors' Report and Financial Statements" for the year ended August 31st, 2017. It includes the college logo, branding, and registration information. The document appears to be a formal financial report.
### Components/Axes
* **Title:** Ardingly College Limited
* **Subtitle:** DIRECTORS' REPORT AND FINANCIAL STATEMENTS
* **Year Ended:** 31st AUGUST 2017
* **Logo:** Ardingly College crest with the words "ARDINGLY College" above it. A globe with the text "World Ready" is positioned to the right of the crest.
* **Charity Registration No.:** 1076456
* **Company Registration No.:** 03779971 (England and Wales)
* **Top-Right Corner:** The word "EXTERNAL" is written in the top-right corner.
* **Decorative Elements:** A red border surrounds the text and logo, with small floral-like designs in each corner. A perforated edge is visible on the left side of the document.
### Detailed Analysis or Content Details
The document is primarily textual. The key information extracted is:
* **Organization Name:** Ardingly College Limited
* **Report Type:** Directors' Report and Financial Statements
* **Reporting Period:** Year ended 31st August 2017
* **Charity Registration Number:** 1076456
* **Company Registration Number:** 03779971
* **Jurisdiction:** England and Wales
* **Branding Slogan:** "World Ready"
### Key Observations
The document is a formal report, indicated by the specific terminology ("Directors' Report," "Financial Statements") and the inclusion of registration numbers. The "EXTERNAL" marking suggests this is a copy intended for distribution outside the organization. The branding emphasizes a "World Ready" ethos.
### Interpretation
This document serves as the official record of Ardingly College Limited's financial performance and governance for the year ending August 31st, 2017. It is likely intended for stakeholders including shareholders, donors, regulatory bodies, and the public. The inclusion of both charity and company registration numbers indicates that the college operates as both a registered charity and a limited company. The "EXTERNAL" marking suggests that this is a version of the report intended for public consumption or distribution to external parties. The branding slogan "World Ready" suggests the college's mission is to prepare students for global citizenship.
</details>
| Question: | What is the value for the charity name? |
| --- | --- |
| GT: | (Ardingly College Ltd.) |
| MLP: | (Ardington College Ltd.) ✗ |
| Ovis: | (Ardington College Ltd.) ✗ |
| Align: | (Ardingly College Ltd.) ✓ |
(a) Positive Example #1
<details>
<summary>figures/case_2.jpg Details</summary>

### Visual Description
## Form: Trustees' Annual Report
### Overview
This is a scanned form titled "Trustees' Annual Report" used by the Charity Commission. It requests reference and administration details for a charity, along with information about the charity trustees who manage it. The form is for the period from 11/02/2016 to 10/02/2017.
### Components/Axes
The form is divided into sections:
* **Header:** Contains the Charity Commission logo, report title, and period dates.
* **Section A:** "Reference and administration details" - includes fields for charity name, other names, registered number, principal address, and postcode.
* **Table:** "Names of the charity trustees who manage the charity" - with columns for Trustee name, Office (if any), Dates acted if not for whole year, and Name of person (or body) entitled to appoint trustee (if any).
* **Footer:** Contains "TAR" and the date "March 2012".
### Content Details
**Header Details:**
* **Charity Commission Logo:** Top-left corner.
* **Report Title:** "Trustees' Annual Report for the period" - centered at the top.
* **Period Start Date:** 11/02/2016
* **Period End Date:** 10/02/2017
**Section A Details:**
* **Charity Name:** TENDER GRACE CHRISTIAN CENTRE
* **Other names charity is known by:** (Blank)
* **Registered number (if any):** 1165532
* **Charity's principal address:** FLAT 13, 39 TRINITY RISE, LONDON
* **Postcode:** SW2 2QP
**Table Details:**
The table lists the names of charity trustees. The first four trustees are listed with their names:
1. Mr Anthony Yakubu Membu
2. Miss Happiness Ngosi Jan-Nnyeruka
3. Mrs Olabisi Comfort Ogunjobi
4. Mrs Olayinka Adeola Oyesanya
The remaining rows (5-20) are blank. There are columns for "Office (if any)", "Dates acted if not for whole year", and "Name of person (or body) entitled to appoint trustee (if any)", all of which are blank for the listed trustees.
Below the main table is a smaller table with the title: "Names of the trustees for the charity, if any, (for example, any custodian trustees)" with columns for "Name" and "Dates acted if not for whole year". This table is also blank.
**Footer Details:**
* **TAR:** Bottom-left corner.
* **Date:** March 2012 - bottom-right corner.
### Key Observations
The form is partially completed. Only the first four trustees are listed, and the rest of the table is empty. The form appears to be a standard template for annual reporting to the Charity Commission. The date in the footer (March 2012) is significantly earlier than the reporting period (2016-2017), suggesting the form template is outdated.
### Interpretation
This document represents a snapshot of a charity's administrative information. The limited data suggests the charity, Tender Grace Christian Centre, had at least four trustees during the reporting period. The blank sections indicate either that the charity had no further trustees to report, or that the form was not fully completed. The discrepancy between the form's date and the reporting period suggests a potential issue with using outdated templates, which could lead to incomplete or inaccurate reporting. The form's structure is designed to ensure transparency and accountability in the management of charitable organizations. The presence of fields for appointment authorities highlights the importance of understanding the governance structure of the charity.
</details>
| Question: | What is the value for the address postcode? |
| --- | --- |
| GT: | (SW2 2QP) |
| MLP: | (SW22 0PQ) ✗ |
| Ovis: | (SW2 2OP) ✗ |
| Align: | (SW2 2QP) ✓ |
(b) Positive Example #2
<details>
<summary>figures/case_3.jpg Details</summary>

### Visual Description
## Photograph: Annual Report Cover - Human Capital 2015
### Overview
The image is a photograph used as the cover for the "Human Capital Annual Report and Financial Statements 2015". It depicts a young, smiling girl in a classroom setting. The overall impression is positive and focuses on education and opportunity. The image does not contain charts, graphs, or diagrams with quantifiable data. It is primarily a visual communication tool.
### Components/Axes
The image consists of the following elements:
* **Main Image:** A photograph of a young girl in a classroom.
* **Text Overlay:** "ANNUAL REPORT" and "AND FINANCIAL STATEMENTS 2015" in white text on a purple background.
* **Logo:** "Human Capital" logo in the bottom-left corner, consisting of a circular emblem with text and a stylized design.
* **Legal Text:** Small text in the top-right corner containing registration information.
### Detailed Analysis or Content Details
The photograph shows a young girl with dark skin and short, dark hair. She is smiling broadly, looking directly at the camera. She is wearing a white collared shirt with a purple vest. She is seated at a desk in a classroom, with other children visible in the blurred background.
The text overlay reads:
* **"ANNUAL REPORT"** - Large, bold white text.
* **"AND FINANCIAL STATEMENTS 2015"** - Smaller white text below "ANNUAL REPORT".
The logo in the bottom-left corner contains the text:
* **"Human Capital"** - Text within a circular emblem.
The legal text in the top-right corner reads:
* "Charity No. 1142619 | Company Reg No. 08260030 | Scottish Reg No. SC044411"
### Key Observations
The image is designed to evoke positive emotions and associate the "Human Capital" organization with education and the well-being of children. The use of a smiling child is a common technique in charitable organization branding. The year "2015" clearly indicates the report's timeframe.
### Interpretation
The image serves as a visual representation of the organization's mission and values. The focus on a child in a learning environment suggests that "Human Capital" invests in the development and potential of young people. The annual report likely details the organization's activities and financial performance related to this mission. The legal text provides transparency and accountability by listing the organization's registration numbers. The image is a carefully constructed piece of visual communication intended to build trust and encourage engagement with the report's content. The choice of a vibrant purple color in the overlay may be intended to convey creativity, wisdom, and ambition.
</details>
| Question: | What is the value for the charity name? |
| --- | --- |
| GT: | (Human Appeal) |
| MLP: | (Humanitarian Agenda) ✗ |
| Ovis: | (Human Appeal) ✓ |
| Align: | (Human Rightsappeal) ✗ |
(c) Negative Example #1
<details>
<summary>figures/case_4.jpg Details</summary>

### Visual Description
## Document: Bishop's Stortford Baptist Church Annual Report 2017
### Overview
This document is the Annual Report for Bishop's Stortford Baptist Church for the year 2017. It outlines the church's aim, purposes, objectives, and overview of activities. The report is a textual document with a logo and page numbering.
### Components/Axes
The document consists of the following elements:
* **Logo:** Located in the top-right corner, displaying "BSB" and "bishops stortford baptist church".
* **Title:** "Bishop's Stortford Baptist Church" in large font.
* **Subtitle:** "This Church aims to be a community of believers living according to the Bible's teaching where people become fully devoted followers of Jesus Christ"
* **Report Title:** "Annual Report 2017"
* **Introductory Paragraph:** "We are pleased to present a report of the life and work of the Church during 2017."
* **Section Title:** "Aim and Purposes"
* **Section Title:** "Objectives and Overview"
* **Bulleted List:** A list of activities and objectives of the church.
* **Page Number:** "1" located at the bottom-right corner.
### Detailed Analysis or Content Details
The document contains the following textual information:
**Bishop's Stortford Baptist Church**
This Church aims to be a community of believers living according to the Bible's teaching where people become fully devoted followers of Jesus Christ.
**Annual Report 2017**
We are pleased to present a report of the life and work of the Church during 2017.
**Aim and Purposes**
The vision of Bishop’s Stortford Baptist Church is to be a community of believers living according to the Bible’s teaching so that Christ’s redemptive purposes can be accomplished in the world. The Church’s mission is to enable people to become fully devoted followers of Jesus Christ by the power of the Holy Spirit. To that end, we understand the purposes of the Church to be worship, discipleship, fellowship, outreach and service. The fellowship of the Church seeks to go deeper with God, become closer to each other and wider in our influence.
**Objectives and Overview**
The Church is committed to carrying out its activities in line with the Baptist tradition and to encourage and enable people in the local area to attend and take part in its services and meetings. Activities include:
* Services each week for worship, prayer, Bible study, preaching and teaching, including additional services for those in the local area whose first language is not English
* Baptisms on request
* Meetings for young people
* Small group meetings (LIFE Groups) for all ages
* Pastoral care
* Discipleship for Christian service
* Evangelism and mission
* Support of Christian and social charitable action within the local area
* Support for mission overseas
* Involvement with the local and national Baptist associations
The ministry areas of the Church are classified into Worship, Fellowship, Reaching Out, Discipleship and Social Action, supplemented by a Resources Team (all led by the Deacons) and with spiritual oversight and pastoral care administered by the Elders.
### Key Observations
The document is a straightforward report outlining the church's goals and activities. The bulleted list provides a clear overview of the services and programs offered. The document emphasizes community, faith, and outreach.
### Interpretation
This Annual Report serves as a communication tool to the church members and potentially the wider community. It demonstrates the church's commitment to its core values and its active involvement in both local and global initiatives. The report aims to provide transparency and accountability regarding the church's operations and impact. The emphasis on discipleship, fellowship, and outreach suggests a holistic approach to faith and community engagement. The document does not contain any quantitative data or charts, it is purely descriptive.
</details>
| Question: | What is the value for the post town address? |
| --- | --- |
| GT: | (Bishop’s Stortford) |
| MLP: | (Stortford) ✗ |
| Ovis: | (Bishop’s Stortford) ✓ |
| Align: | (Stortford) ✗ |
(d) Negative Example #2
Figure 7: Case Study for Connector Comparison on the KLC dataset [Stanisławek et al., 2021]. We show four qualitative examples (two correct and two incorrect) comparing Llama-3.2-3B-Align to the same architecture with different connectors, Llama-3.2-3B-MLP and Llama-3.2-3B-Ovis. “GT” denotes the ground truth.
<details>
<summary>figures/case_5.jpg Details</summary>

### Visual Description
## Document: Conference Schedule - June 29-30
### Overview
This document presents a schedule for a conference held on Thursday, June 29th and Friday, June 30th. It details the time, activity, leader (or group), and room for each session. The document appears to be a scanned page from a printed program, indicated by the page number "-3-" at the top.
### Components/Axes
The document is structured as a timetable with the following columns:
* **Time:** Listed in hourly increments (e.g., 1:00, 1:30, 8:00).
* **Activity:** Description of the session (e.g., Techniques of Interviewing, Practice Interviews).
* **Leader/Group:** Name of the instructor or group leading the session (e.g., Mrs. Fink, Group A).
* **Room:** Location of the session (e.g., Room 123, 802 State Health Department).
The schedule is divided into sections for "Afternoon" and "Morning" for each day. Recess periods are also indicated.
### Detailed Analysis or Content Details
**Thursday, June 29**
* **1:00:** Techniques of Interviewing (Mrs. Fink) - Room 123
* **1:30:** Practice Interviews -
* Group I: Mrs. Fink - 123 State Health Department
* Group II: Miss Grass - 802 State Health Department
* Group III: Miss Peck - 627 State Health Department
* Group IV: Mr. Price - 510 School of Public Health
* Group V: Dr. Croley - 522 School of Public Health
* **2:45:** Recess
* **3:15:** Practice Interviews (continued) - Same groups, same rooms
**Friday, June 30**
* **8:00:**
* Group A: Statistical Aspects of Epidemiologic Research (Dr. Gaffey) - Room 802
* Group B: Problems in Research Design (Dr. Reynolds) - Room 123
* **9:45:** Recess
* **10:15:**
* Group A: Problems in Research Design (Dr. Reynolds) - Room 123
* Group B: Statistical Aspects of Epidemiologic Research (Dr. Gaffey) - Room 802
* **12:00:** Lunch
* **1:00:** Construction and Use of Questionnaires
* Group A: (Dr. Fink) - Room 123
* Group B: (Dr. Mellinger) - Room 522 School of Public Health
* **2:45:** Recess
* **3:15:** Construction and Use of Questionnaire (continued) - Same groups, same rooms
**Source:** https://www.industrydocuments.ucsf.edu/docs/gpcg0227
### Key Observations
The schedule is organized around parallel sessions, with groups A and B often attending different activities simultaneously. Practice interviews are a significant component of the Thursday schedule. The location of sessions varies between State Health Departments and the School of Public Health.
### Interpretation
This document provides a detailed program for a training or conference focused on public health research methodologies. The topics covered include interviewing techniques, epidemiological research, research design, and questionnaire construction. The use of group work and practice sessions suggests a hands-on, applied learning approach. The source URL indicates this document is part of a larger collection of industry documents, potentially related to tobacco industry research, though the content itself doesn't directly indicate this. The document is a factual record of scheduled events and does not contain subjective opinions or analyses.
</details>
| Question: | What time does the afternoon session begin on June 29? |
| --- | --- |
| GT: | (1:00) |
| MLP: | (2:45) ✗ |
| Ovis: | (3:30) ✗ |
| Align: | (1:00) ✓ |
(a) Positive Example #1
<details>
<summary>figures/case_6.jpg Details</summary>

### Visual Description
## Data Tables: Hemoglobin, Hematocrit, and Socio-economic Data - Massachusetts
### Overview
The image contains three data tables presenting results from a National Nutrition Survey conducted in Massachusetts. The first table details hemoglobin levels, the second hematocrit levels, and the third provides general socio-economic data. All data is categorized by age and sex (Males/Females).
### Components/Axes
The tables are structured with 'Age' as the primary category on the left column. Within each age group, data is further divided into 'Def.' (Deficient), 'Low', and 'Total Unsatisfactory' levels. The tables are separated by descriptive headers indicating the measured variable (Hemoglobin, Hematocrit) and the survey source (National Nutrition Survey). The final table presents socio-economic indicators without age/sex categorization.
### Detailed Analysis or Content Details
**1. Hemoglobin Data - Massachusetts**
* **Header:** "Hemoglobin data – Massachusetts" and "8% of the surveyed population had unsatisfactory hemoglobin levels (ICNND guidelines)."
* **Age 0-5 yr:**
* Males: Def. = 4.5, Low = 8.0, Total Unsatisfactory = 12.5
* Females: Def. = 1.5, Low = 7.0, Total Unsatisfactory = 8.5
* **Age 6-12:**
* Males: Def. = 0.2, Low = 3.8, Total Unsatisfactory = 4.0
* Females: Def. = 0.2, Low = 5.0, Total Unsatisfactory = 5.2
* **Age 13-16:**
* Males: Def. = 3.6, Low = 12.7, Total Unsatisfactory = 16.3
* Females: Def. = 0.0, Low = 3.5, Total Unsatisfactory = 3.5
* **Age 17-59:**
* Males: Def. = 1.2, Low = 10.0, Total Unsatisfactory = 11.2
* Females: Def. = 1.1, Low = 6.0, Total Unsatisfactory = 7.1
* **Age 60+:**
* Males: Def. = 0.7, Low = 14.3, Total Unsatisfactory = 15.0
* Females: Def. = 0.5, Low = 4.7, Total Unsatisfactory = 5.2
**2. Hematocrit Data - Massachusetts**
* **Header:** "Hematocrit data – Massachusetts" and "9.2% of the surveyed population had unsatisfactory hematocrit levels (ICNND guidelines)."
* **Age 0-5 yr:**
* Males: Def. = 4.4, Low = 4.0, Total Unsatisfactory = 8.4
* Females: Def. = 0.5, Low = 1.5, Total Unsatisfactory = 2.0
* **Age 6-12:**
* Males: Def. = 0.0, Low = 3.9, Total Unsatisfactory = 3.9
* Females: Def. = 0.0, Low = 5.4, Total Unsatisfactory = 5.4
* **Age 13-16:**
* Males: Def. = 1.2, Low = 15.0, Total Unsatisfactory = 16.2
* Females: Def. = 0.0, Low = 2.9, Total Unsatisfactory = 2.9
* **Age 17-59:**
* Males: Def. = 0.7, Low = 10.0, Total Unsatisfactory = 10.7
* Females: Def. = 0.5, Low = 7.8, Total Unsatisfactory = 8.3
* **Age 60+:**
* Males: Def. = 0.0, Low = 20.0, Total Unsatisfactory = 20.0
* Females: Def. = 0.5, Low = 3.8, Total Unsatisfactory = 4.3
**3. General Socio-economic Data - Massachusetts**
* **Header:** "General Socio-economic data – Massachusetts"
* Total number of persons examined: 4,568
* Mean family size: 5.31
* Mean income ($ per annum): 6,500
* Mean poverty index ratio: 2.29
* % of families below poverty index ratio of 1.00: 19.5
**Source:** https://www.industrydocuments.ucsf.edu/docs/tntB0227
### Key Observations
* The 'Total Unsatisfactory' levels generally increase with age for both hemoglobin and hematocrit, particularly in males.
* In both hemoglobin and hematocrit data, the 'Def.' levels are consistently low across all age groups, especially in females.
* The hematocrit data shows a significantly higher 'Total Unsatisfactory' rate for males aged 60+ (20.0%) compared to other groups.
* Nearly 20% of families in Massachusetts were below the poverty index ratio of 1.00, suggesting a substantial portion of the population faced economic hardship.
### Interpretation
The data suggests potential nutritional deficiencies, specifically related to iron (hemoglobin) and red blood cell production (hematocrit), within the Massachusetts population surveyed. The higher rates of unsatisfactory levels in older males could indicate age-related physiological changes or dietary factors. The socio-economic data provides context, indicating that a significant percentage of families lived below the poverty line, which could contribute to nutritional deficiencies. The ICNND guidelines referenced suggest these findings were evaluated against international standards for nutritional assessment. The source link points to a repository of industry documents, potentially indicating the data was collected or analyzed in relation to food or health-related industries. The data provides a snapshot of the nutritional and economic status of a population in Massachusetts during the time of the survey.
</details>
| Question: | What levels does the second table indicate? |
| --- | --- |
| GT: | (hematocrit data - Massachusetts) |
| MLP: | (SATISFACTORY) ✗ |
| Ovis: | (Females) ✗ |
| Align: | (hematocrit data - Massachusetts) ✓ |
(b) Positive Example #2
<details>
<summary>figures/case_7.jpg Details</summary>

### Visual Description
## Document: Policy on Document Control - Policy No. 8
### Overview
The image presents a scanned document titled "Policy on Document Control," designated as Policy No. 8. It outlines general provisions regarding the creation, storage, and disposal of documents within the company. The document is structured into articles and definitions, detailing the standards and procedures for handling various types of information.
### Components/Axes
The document is structured as follows:
* **Header:** "EXHIBIT" (top-right), "Policy on Document Control" (top-center), "Policy No. 8" (top-right) and a handwritten "R-7" (top-right).
* **Body:** Divided into Articles 1, 2, and 3, each with subsections.
* **Footer:** "Confidential - Subject to Protective Order" (bottom-left) and "Source: https://www.indupgs2378-0001x.ics.ufl.edu/docs/taki-tpc-0000252" (bottom-right).
### Detailed Analysis or Content Details
**Article 1 (Purpose)**
"This Policy shall be intended to stipulate basic matters on creation, storage, and disposal, etc. of documents (including electromagnetic records) handled by the Company and exceptional treatments, etc. in order to respond to lawsuits and other legal/administrative proceedings (hereinafter referred to as “disputes, etc.”), so that document information can be properly and effectively managed and utilized."
**Article 2 (General Provisions)**
"Documents of the Company shall be handled in accordance with this Policy, in addition to those stipulated in the “Policy on Handling Contract Documents, etc. and Corporate Seals (Policy No. 6),” the “Policy on Confidential Information Management (Policy No. 18),” and the “Policy on IT Security (Policy No. 67).” (2) Each core organization shall set forth standards for creation, storage, disposal, and other detailed treatment of documents to be handled at the core organization (hereinafter referred to as “Function Standards”), in accordance with the provisions stipulated herein."
**Article 3 (Definitions)**
1. "“Documents” shall mean approval documents, notification documents, reports, conference materials, correspondences, fax/telex transmissions, bills, various records, drawings, microfilms, photographs, video/audio media, electronic mail, and any other documents and electromagnetic records handled in connection with Company business (including those created with external parties, but excluding publications such as newspapers, magazines, and books, etc.)."
2. "“Disks” shall mean electromagnetic recording media, including floppy disks, CD-Rs, and DVD-Rs, etc."
3. "“Storage” shall mean managing documents in any of the following methods and places for specified periods of time. Documents stored shall be referred to as “Stored Documents.”
1. By the originals or write-protected disks, in lockable cabinets or desks, or designated warehouses
2. By electromagnetic means, in access-controlled servers"
**Footer Information:**
* Confidential - Subject to Protective Order
* Source: https://www.indupgs2378-0001x.ics.ufl.edu/docs/taki-tpc-0000252
### Key Observations
The document is a formal policy statement, emphasizing the importance of proper document management for legal and administrative purposes. It references other related policies within the company, indicating a comprehensive approach to information governance. The definitions provided are broad, encompassing both physical and electronic records.
### Interpretation
This document establishes a framework for managing company records to ensure compliance and protect the organization's interests. The policy aims to standardize document handling procedures across different departments ("core organizations") while acknowledging the use of both traditional and modern storage methods. The reference to "disputes, etc." suggests a proactive approach to potential legal challenges. The inclusion of definitions clarifies the scope of the policy and minimizes ambiguity. The source URL indicates the document originates from a university-related system (ufl.edu), potentially suggesting a research or educational context for the company. The "Confidential" marking highlights the sensitive nature of the information covered by the policy.
</details>
| Question: | What type of policy is described in this document? |
| --- | --- |
| GT: | (Policy on Document Control) |
| MLP: | (Policy on Document Control) ✓ |
| Ovis: | (General Provisions) ✗ |
| Align: | (Document Control) ✗ |
(c) Negative Example #1
<details>
<summary>figures/case_8.jpg Details</summary>

### Visual Description
## Data Table: Blood Analysis of Rats Fed Different Diets
### Overview
This document presents a table summarizing the results of a study investigating the effect of dietary glycan on blood cholesterol and triglyceride levels in rats. The study involved three groups of rats fed different diets over a four-week period. The table displays cholesterol levels measured weekly from week 0 to week 4, along with triglyceride levels at week 4, for each group.
### Components/Axes
The document consists of a title, introductory text describing the experimental setup, and a data table. The table has the following structure:
* **Column 1:** "Group" - Identifies the rat group (#1, #2, #3).
* **Column 2:** "Cholesterol mg % 0 Wk." - Cholesterol levels at week 0.
* **Column 3:** "Cholesterol mg % 1 Wk." - Cholesterol levels at week 1.
* **Column 4:** "Cholesterol mg % 2 Wk." - Cholesterol levels at week 2.
* **Column 5:** "Cholesterol mg % 3 Wk." - Cholesterol levels at week 3.
* **Column 6:** "Cholesterol mg % 4 Wk." - Cholesterol levels at week 4.
* **Column 7:** "Triglyceride mg % 4 Wk." - Triglyceride levels at week 4.
The document is dated June 18, 1975, and includes a source link at the bottom.
### Detailed Analysis or Content Details
**Introductory Text Transcription:**
"The Ability of Dietary Glycan to Lower Elevated Blood Cholesterol and Triglycerides in Rats.
Ten male rats were fed the control diet (basal + cholesterol and cholic acid) for four weeks and divided into two groups. One group (2A) continued to receive the control diet. The second group (2B) received a test diet of the same composition as the control diet except that 15 parts of Glycan replaced 15 parts of sucrose. Another group (#1) of ten male rats were fed the basal diet for four weeks and then continued on the basal diet. At weekly intervals, whole blood was drawn from the caudal vein for the determination of serum cholesterol. At the end of four weeks, whole blood was drawn by heart puncture."
**Data Table Reconstruction:**
| Group | Cholesterol mg % 0 Wk. | Cholesterol mg % 1 Wk. | Cholesterol mg % 2 Wk. | Cholesterol mg % 3 Wk. | Cholesterol mg % 4 Wk. | Triglyceride mg % 4 Wk. |
|-------|------------------------|------------------------|------------------------|------------------------|------------------------|-------------------------|
| #1 | 96 | 96 | 96 | 96 | 103 | 293 |
| #2 | 157 | 148 | 156 | 148 | 133 | 203 |
| #3 | 157 | 115 | 99 | 115 | 111 | 169 |
### Key Observations
* **Group #1** shows a slight increase in cholesterol levels from week 0 to week 4 (96 to 103 mg%). Triglyceride levels are significantly higher in this group (293 mg%) compared to the other two groups.
* **Group #2** exhibits a decrease in cholesterol levels from week 0 to week 4 (157 to 133 mg%). Triglyceride levels are moderate (203 mg%).
* **Group #3** shows a substantial decrease in cholesterol levels from week 0 to week 4 (157 to 111 mg%). Triglyceride levels are the lowest among the three groups (169 mg%).
* The initial cholesterol levels (week 0) are notably different between the groups, with #1 starting at 96 mg%, while #2 and #3 start at 157 mg%.
### Interpretation
The data suggests that dietary glycan (as implemented in groups #2 and #3) may have a cholesterol-lowering effect in rats. Group #3, which experienced the most significant cholesterol reduction, also had the lowest triglyceride levels at week 4. This indicates a potential link between glycan consumption and improved lipid profiles.
The difference in initial cholesterol levels between the groups suggests that the rats were not randomly assigned to the diets, or that there were pre-existing differences in their cholesterol levels. Group #1, fed the basal diet, had the lowest initial cholesterol but the highest triglyceride levels, indicating that the basal diet may contribute to elevated triglycerides.
The study's design, dividing rats into groups receiving different diets (control, glycan-substituted, and basal), allows for a comparison of the effects of these diets on blood lipid levels. The weekly measurements provide insight into the temporal changes in cholesterol levels, while the final triglyceride measurement offers a snapshot of the overall lipid profile. The source link provided suggests this is part of a larger research effort.
</details>
| Question: | What was the diet fed to the #1 group? |
| --- | --- |
| GT: | (basal diet) |
| MLP: | (basel diet) ✓ |
| Ovis: | (Whole blood) ✗ |
| Align: | (control diet) ✗ |
(d) Negative Example #2
Figure 8: Case Study for Connector Comparison on the DocVQA dataset [Mathew et al., 2021b]. We show four qualitative examples (two correct and two incorrect) comparing Llama-3.2-3B-Align to the same architecture with different connectors, Llama-3.2-3B-MLP and Llama-3.2-3B-Ovis. “GT” denotes the ground truth.
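A note on the ✓/✗ marks: near-matches such as MLP's “(basel diet)” for the ground truth “basal diet” in Figure 8(d) are counted as correct, consistent with the soft string matching conventionally used to score DocVQA (ANLS, Average Normalized Levenshtein Similarity). The sketch below shows standard ANLS with the usual 0.5 threshold; whether the checkmarks in these figures follow this exact rule is our assumption.

```python
# Hedged sketch of ANLS scoring (the standard DocVQA metric); whether the
# checkmarks in these case studies follow this exact rule is an assumption.
def nls(pred: str, gold: str) -> float:
    """Normalized Levenshtein similarity between two strings."""
    pred, gold = pred.lower().strip(), gold.lower().strip()
    m, n = len(pred), len(gold)
    if m == 0 and n == 0:
        return 1.0
    # Single-row dynamic program for edit distance.
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (pred[i - 1] != gold[j - 1]))  # substitution
            prev = cur
    return 1.0 - dp[n] / max(m, n)

def anls(pred: str, golds: list[str], tau: float = 0.5) -> float:
    """Score a prediction against reference answers; below tau scores 0."""
    best = max(nls(pred, g) for g in golds)
    return best if best >= tau else 0.0
```

Under this rule, for example, `nls("basel diet", "basal diet")` is 0.9 (one substitution over ten characters), which clears the 0.5 threshold, while a structurally wrong answer such as “(Whole blood)” scores 0.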
<details>
<summary>figures/case_9.jpg Details</summary>

### Visual Description
## Photograph: Open Envelope and Suitcase
### Overview
The image is a photograph depicting an open envelope with the handwritten phrase "good bye" resting on top of a vintage-style suitcase. The composition suggests a theme of travel, departure, or farewell. The image does not contain charts, diagrams, or data tables. It is a static visual representation.
### Components/Axes
There are no axes or legends present in the image. The key components are:
* **Envelope:** An open, light-blue envelope with red and blue striped borders.
* **Text:** The handwritten phrase "good bye" is visible on the envelope.
* **Suitcase:** A brown, vintage-style suitcase with metal clasps and a handle.
* **Surface:** A wooden surface on which the envelope and suitcase are placed.
* **Fabric:** A patterned fabric is visible at the bottom of the image, partially covering the surface.
### Detailed Analysis or Content Details
The envelope is positioned centrally in the upper half of the image. The phrase "good bye" is written in cursive handwriting, appearing to be in black ink. The suitcase is positioned below the envelope, taking up the lower half of the image. The suitcase has four metal clasps, two on each side, and a brown handle. The surface appears to be made of light-colored wood. The fabric at the bottom has a pattern of white waves on a dark blue background, interspersed with dark dots.
### Key Observations
The image evokes a sense of finality and transition. The open envelope and the word "good bye" clearly indicate a farewell. The suitcase suggests a journey or departure. The vintage style of the suitcase adds a nostalgic element to the image.
### Interpretation
The image likely represents a symbolic farewell, perhaps related to travel or a significant life change. The open envelope suggests vulnerability and openness, while the suitcase implies a readiness to move on. The handwritten message adds a personal touch, emphasizing the emotional weight of the departure. The overall composition and aesthetic create a melancholic and reflective mood. The image doesn't present factual data, but rather a visual narrative. It's a representation of a concept – departure – rather than a measurement or analysis. The image could be used to evoke feelings of sadness, nostalgia, or anticipation.
</details>
| Question: | What greeting is written on the letter? |
| --- | --- |
| GT: | (good bye) |
| MLP: | (good) ✗ |
| Ovis: | (good buy) ✗ |
| Align: | (good bye) ✓ |
(a) Positive Example #1
<details>
<summary>figures/case_10.jpg Details</summary>

### Visual Description
## Digital Clock Display: SkyScan Atomic Clock
### Overview
The image shows a SkyScan Atomic Clock displaying the time, date, indoor temperature, and outdoor temperature. The display is digital and uses a segmented numeric format. The clock is positioned on a wooden surface.
### Components/Axes
The display is divided into sections:
* **Top Section:** Displays the time.
* **Bottom Section:** Displays the date, indoor temperature, and outdoor temperature.
* **Labels:** "TIME", "DATE", "INDOOR", "OUTDOOR", "SkyScan ATOMIC CLOCK".
* **Units:** Degrees Fahrenheit (°F) for temperature.
### Detailed Analysis or Content Details
* **Time:** The time displayed is 12:04. The small antenna symbol above the time indicates atomic clock synchronization.
* **Date:** The date displayed is 1/1 (January 1st).
* **Indoor Temperature:** The indoor temperature is 40°F.
* **Outdoor Temperature:** The outdoor temperature is 6.8°F.
* **Negative Sign:** A negative sign is present before the outdoor temperature, confirming it is below freezing.
* **Clock Brand:** SkyScan.
### Key Observations
The clock is displaying a time of 12:04 PM on January 1st. The indoor temperature is 40°F, while the outdoor temperature is 6.8°F, indicating a cold environment. The clock is an atomic clock, suggesting it automatically synchronizes with a time standard.
### Interpretation
The image provides a snapshot of environmental conditions and time. The significant difference between the indoor and outdoor temperatures suggests the clock is located in a heated environment. The atomic clock feature implies a need for accurate timekeeping. The date indicates the beginning of a new year. The information is factual and descriptive, providing a simple record of time and temperature. There are no apparent anomalies or trends beyond the observation of a cold outdoor temperature.
</details>
| Question: | What indoor temperature is shown? |
| --- | --- |
| GT: | (68.4) |
| MLP: | (68 F) ✗ |
| Ovis: | (40.0) ✗ |
| Align: | (68.4) ✓ |
(b) Positive Example #2
<details>
<summary>figures/case_11.jpg Details</summary>

### Visual Description
## Advertisement: Health Club Promotion
### Overview
The image is a promotional advertisement for a health club named "the club 4U". It features a stylized illustration of a man and a woman, along with text emphasizing the club's attributes and a tagline. The overall aesthetic is vibrant and aimed at attracting a young adult demographic.
### Components/Axes
The advertisement consists of the following textual elements:
* **Header:** "Proffesional" (misspelled "Professional"), "Passionate", "Personal" - arranged vertically in a large, bold, blue font.
* **Central Logo:** "We're the club 4U" - displayed in a stylized, rounded font with a light blue and purple color scheme.
* **Footer:** "topnotch health club" - in a smaller, red font.
* **Tagline:** "dare to be different" - in an even smaller, orange font, positioned below the club name.
### Detailed Analysis or Content Details
The advertisement's text is primarily descriptive and promotional. There are no numerical data points or axes to analyze. The key textual information is as follows:
* **Attributes:** The header highlights three key attributes of the health club: "Proffesional", "Passionate", and "Personal".
* **Club Name:** The central logo clearly identifies the club as "the club 4U".
* **Club Type:** The footer specifies that it is a "topnotch health club".
* **Tagline:** The tagline "dare to be different" suggests a unique and unconventional approach to fitness.
### Key Observations
The advertisement relies heavily on visual appeal and suggestive language. The image of the attractive couple implies a social and lifestyle benefit to joining the club, beyond just fitness. The misspelling of "Professional" is a notable error.
### Interpretation
The advertisement is designed to appeal to individuals seeking a health club that offers more than just exercise facilities. The attributes "Professional", "Passionate", and "Personal" suggest a high-quality, engaging, and individualized experience. The tagline "dare to be different" positions the club as an alternative to mainstream fitness centers. The overall message is that "the club 4U" is a place where individuals can pursue their fitness goals in a stylish, social, and supportive environment. The advertisement is likely targeted towards young adults who value both fitness and social connection. The misspelling of "Professional" detracts from the overall impression of quality and attention to detail.
</details>
| Question: | What type of club is advertised? |
| --- | --- |
| GT: | (health club) |
| MLP: | (topnote health club) ✗ |
| Ovis: | (health club) ✓ |
| Align: | (professional passionate personal) ✗ |
(c) Negative Example #1
<details>
<summary>figures/case_12.jpg Details</summary>

### Visual Description
## Advertisement: Hadiah Plus Rewards Program
### Overview
This image is an advertisement for the "Hadiah Plus" rewards program, promoting the ability to convert points to "BIG Points" for free flights with AirAsia. The advertisement features images of iconic landmarks from several Asian cities.
### Components/Axes
The image is structured with a header, a central row of images representing locations, and a footer with promotional text and images of credit cards.
* **Header:** "Hadiyah Plus" in ornate gold lettering. Below that, "MORE REWARDS WITH EVERY SWIPE" in a smaller, simpler font.
* **Location Row:** Five images, each labeled with a city name: Singapore, Penang, Bangkok, Bali, Hong Kong. These labels are positioned directly above each image.
* **Footer:** "You can now convert your Hadiah Points to BIG Points to redeem for free flights." Below this text are images of four different credit cards. A logo for "AirAsia BIG Loyalty Programme" is present in the bottom-right corner.
### Detailed Analysis or Content Details
The image showcases the following locations:
1. **Singapore:** Image of the Merlion statue with the city skyline in the background.
2. **Penang:** Image of a mosque with a distinctive green and gold dome, set against a blue sky.
3. **Bangkok:** Image of a highly ornate temple structure, likely Wat Arun (Temple of Dawn).
4. **Bali:** Image of a traditional Balinese gate structure, likely a temple entrance.
5. **Hong Kong:** Image of the Hong Kong skyline, featuring numerous skyscrapers.
The credit cards displayed in the footer appear to be from various banks, but the specific banks are not clearly identifiable. The AirAsia BIG Loyalty Programme logo is circular, with "AirAsia BIG" in bold red text and "Loyalty Programme" in smaller white text on a red background.
### Key Observations
The advertisement focuses on travel rewards and the convenience of converting points for flights. The selection of cities suggests a focus on popular tourist destinations in Southeast Asia. The use of iconic landmarks aims to evoke a sense of wanderlust and aspiration.
### Interpretation
The advertisement is designed to appeal to consumers who frequently use credit cards and are interested in travel rewards. The message is straightforward: use your Hadiah Plus card, earn points, and redeem them for free flights with AirAsia. The visual elements reinforce this message by showcasing desirable travel destinations. The advertisement leverages the association between credit card usage and travel benefits to encourage cardholders to engage with the rewards program. The overall tone is aspirational and emphasizes the ease of redeeming points for travel.
The advertisement does not contain any numerical data or charts. It is purely a promotional image designed to create brand awareness and encourage customer engagement.
</details>
| Question: | What credit card is this? |
| --- | --- |
| GT: | (hadiah plus) |
| MLP: | (hadiah plus) ✓ |
| Ovis: | (american big loyalty program) ✗ |
| Align: | (hadia plus) ✗ |
(d) Negative Example #2
Figure 9: Case Study for Connector Comparison on the TextVQA dataset [Singh et al., 2019]. We show four qualitative examples (two correct and two incorrect) comparing Llama-3.2-3B-Align to the same architecture with different connectors, Llama-3.2-3B-MLP and Llama-3.2-3B-Ovis. “GT” denotes the ground truth.