# jina-vlm: Small Multilingual Vision Language Model
**Authors**: Jina AI by Elastic
> Equal contribution. Work done during internship at Jina AI.
Abstract
We present jina-vlm, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. The model achieves leading results on standard VQA benchmarks and multilingual evaluations while preserving competitive text-only performance. Model weights and code are publicly released at https://huggingface.co/jinaai/jina-vlm.
1 Introduction
Vision-language models (VLMs) combine pretrained vision encoders with large language models to tackle tasks requiring joint visual and textual understanding (flamingo; llava). Recent VLMs have achieved strong results on visual question answering (VQA), OCR, and multimodal reasoning. However, two challenges limit their practical deployment. First, multilingual capabilities often degrade during vision adaptation: models that perform well on English benchmarks show uneven results across other languages (vlmsurvey). Second, high-quality VLMs remain computationally expensive to train and deploy, limiting accessibility for researchers and practitioners with constrained resources.
This work introduces jina-vlm, a 2.4B parameter VLM that addresses both challenges. The model aligns a SigLIP2-So400M/14-384 vision encoder (siglip2) with Qwen3-1.7B-Base (qwen3) through an attention-pooling connector, trained with a two-stage pipeline that explicitly incorporates multilingual data. Among open 2B-scale VLMs, jina-vlm achieves state-of-the-art performance on multilingual multimodal benchmarks including MMMB and Multilingual MMBench, demonstrating that small models can excel at cross-lingual visual understanding without sacrificing general capabilities. On standard English benchmarks spanning diagrams, charts, documents, and OCR, jina-vlm achieves the highest average score (72.3) across eight VQA benchmarks among 2B-scale VLMs. These results are enabled by two technical contributions: an efficient arbitrary-resolution pipeline that combines overlapping tiling with attention-based token pooling to cut the visual token count by 4×, and a training recipe that incorporates text-only data to preserve the language understanding of the backbone LLM.
2 Related Work
VLM architecture and training. Modern VLMs follow an architecture introduced by PaLI (pali): a pretrained vision encoder extracts visual features, a connector projects them into the language model’s embedding space, and a decoder-only language model generates text conditioned on these visual tokens. Vision Transformers (ViTs) (vit) produce patch-level representations that the language model processes alongside text embeddings. This design is adopted by LLaVA (llava; llava_1_5; llava_uhd; llava_next_interleave; llava_onevision), QwenVL (qwen_vl; qwen_2_vl; qwen_2_5_vl), InternVL (internvl; internvl_1_5; internvl_2_5; internvl_3; internvl_3_5), and Ovis (ovis; ovis_2_5). Training strategies vary: qwen_2_vl; internvl_2_5 alternate between multimodal instruction tuning and general training; llava_1_5 incorporate academic VQA datasets; molmo, llava_onevision, and cambrian1 curate large-scale, diverse data mixtures.
Efficient resolution-agnostic image processing. Standard ViTs process fixed-resolution images, requiring resizing that discards fine-grained detail. Since visual token count scales with resolution and Transformer computation scales quadratically with sequence length, naive high-resolution processing is prohibitive. Several solutions exist: molmo tile images with overlap; qwen_2_vl introduce Naive Dynamic Resolution with Multimodal Rotary Position Embedding (rope; 2drope); ovis_2_5 use native-resolution ViTs (navit). Orthogonally, images often contain low-information regions (e.g., sky backgrounds), making visual tokens highly redundant. Token compression methods address this (fastv; prumerge; visionzip; pyramiddrop). internvl_1_5 develop Dynamic High-Resolution Tiling, and nvila propose scale-then-compress strategies. Recent work on training-free token budgeting, such as HERO (hero), demonstrates that inference-time pruning can achieve significant speedups while preserving accuracy; our approach differs by learning compact representations during training rather than dropping tokens at inference.
Vision-language connectors. The connector bridging vision encoders and language models significantly impacts both efficiency and performance. BLIP-2 (blip2) introduces Q-Former, a learnable query-based transformer that extracts fixed-length representations from visual features, reducing the number of tokens fed to the LLM. Flamingo (flamingo) uses a Perceiver Resampler with cross-attention to compress visual tokens. Our attention-pooling connector shares the goal of token reduction but operates differently: rather than learning a fixed set of queries, we apply local 2×2 attention pooling that preserves spatial structure while achieving 4× compression, which we found more effective for tasks requiring fine-grained spatial understanding.
Small VLMs. Efficiency has become a central objective. mobilevlmv2 demonstrate competitive performance below 2B parameters. imp combine quantization with aggressive resolution reduction for mobile deployment, matching larger models’ performance. MiniCPM-V (minicpmv) targets edge deployment while maintaining strong OCR and multilingual capabilities. smolvlm systematically explore design parameters to train VLMs as small as 256M parameters.
Multilingual VLMs. Many lightweight VLMs (paligemma; paligemma2; phi3) achieve strong English performance but degrade on other languages. qwen_2_vl and internvl_1_5 address this through targeted multilingual training data. pangea introduce instruction-tuning data spanning 39 languages.
Retaining text-only performance. Multimodal training often degrades text-only capabilities. Mitigation strategies include balanced data mixtures, careful learning rate scheduling (cauldron), and partial backbone freezing (llava_onevision; internvl_3_5).
3 Model Architecture
Figure 1: Architecture of jina-vlm. Images are resized to fit a grid of up to 12 overlapping tiles, plus a global thumbnail. Each tile is a square 378×378 crop; adjacent tiles overlap by 112 pixels, with a stride of 266 pixels between tile origins. A 4×3 grid therefore spans 1176×910 pixels, and images exceeding this effective resolution are downscaled to fit the tile budget. Each tile produces 729 patches via SigLIP2 (siglip2). The VL connector concatenates features from layers 24 and 18 (the third- and ninth-to-last layers), then applies 2×2 attention pooling to reduce 729 tokens to 182 before projecting to the decoder dimension. Visual tokens are combined with text embeddings for the Qwen3 decoder (qwen3).
Figure 1 illustrates the architecture of jina-vlm. The model uses overlapping image tiling following molmo, combined with attention-based token pooling to reduce sequence length while preserving spatial information.
The vision encoder, SigLIP2-So400M/14-384, is a 27-layer Vision Transformer with 400M parameters that processes 378×378 pixel inputs as 27×27 grids of 14×14 patches. To handle arbitrary resolutions, we decompose each image into overlapping tiles of this size and process each tile independently through the encoder. A global thumbnail, the full image resized to 378×378, provides context alongside the tile representations. We use a default of 12 tiles during training; this limit can be raised at inference or during continued training to handle higher resolutions, with memory scaling linearly with tile count. The tiling algorithm is detailed in Appendix A.1.
3.1 Vision-Language Connector
Rather than using the final ViT output, jina-vlm concatenates features from two intermediate layers: the third-to-last and ninth-to-last, corresponding to layers 24 and 18 of the 27-layer encoder. This captures both fine-grained spatial details from earlier layers and high-level semantics from later layers. The connector then applies attention pooling over 2×2 patch neighborhoods, using mean-pooled features as queries. This reduces the token count by 4× while preserving local structure. A SwiGLU projection layer maps the pooled representations to the language model's embedding dimension.
In more formal terms, let $\mathbf{H}^{(\ell)}\in\mathbb{R}^{N\times d_{v}}$ denote the hidden states from ViT layer $\ell$, where $N$ is the number of patches, $d_{v}$ is the vision encoder hidden size, and negative indices count from the final layer (e.g., $\ell=-1$ is the last layer). We concatenate features from two internal layers:
$$
\mathbf{H}_{\text{concat}}=[\mathbf{H}^{(-3)};\mathbf{H}^{(-9)}]\in\mathbb{R}^{N\times 2d_{v}} \tag{1}
$$
For each $2\times 2$ patch neighborhood $\mathcal{N}_{i}$, we compute a query vector as the mean of the neighborhood features:
$$
\mathbf{q}_{i}=\frac{1}{4}\sum_{j\in\mathcal{N}_{i}}\mathbf{h}_{j},\quad\mathbf{Q}=[\mathbf{q}_{1};\dots;\mathbf{q}_{M}]\in\mathbb{R}^{M\times 2d_{v}} \tag{2}
$$
where $\mathcal{N}_{i}$ contains the four patches at positions $(2i_{x},2i_{y})$ , $(2i_{x}+1,2i_{y})$ , $(2i_{x},2i_{y}+1)$ , and $(2i_{x}+1,2i_{y}+1)$ and $M=N/4$ .
Attention pooling is then computed as:
$$
\mathbf{H}_{\text{pooled}}=(\text{softmax}\left(\frac{\mathbf{Q}\mathbf{W}_{Q}(\mathbf{H}_{\text{concat}}\mathbf{W}_{K})^{\top}}{\sqrt{d_{k}}}\right)\mathbf{H}_{\text{concat}}\mathbf{W}_{V})\mathbf{W}_{O}\in\mathbb{R}^{M\times d_{v}} \tag{3}
$$
where $d_{k}=d_{v}$ and $\mathbf{W}_{Q}\in\mathbb{R}^{2d_{v}\times d_{k}}$, $\mathbf{W}_{K}\in\mathbb{R}^{2d_{v}\times d_{k}}$, $\mathbf{W}_{V}\in\mathbb{R}^{2d_{v}\times 2d_{v}}$, and $\mathbf{W}_{O}\in\mathbb{R}^{2d_{v}\times d_{v}}$ are learnable weight matrices. Finally, the pooled visual features are projected to the language model embedding dimension via a SwiGLU (swiglu) layer:
$$
\mathbf{H}_{\text{proj}}=\left(\text{Swish}(\mathbf{H}_{\text{pooled}}\mathbf{W}_{1})\odot(\mathbf{H}_{\text{pooled}}\mathbf{W}_{2})\right)\mathbf{W}_{3}\in\mathbb{R}^{M\times d_{l}} \tag{4}
$$
where $\text{Swish}(x)=x\cdot\sigma(x)$, $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication, $\mathbf{W}_{1},\mathbf{W}_{2}\in\mathbb{R}^{d_{v}\times 3d_{l}}$ and $\mathbf{W}_{3}\in\mathbb{R}^{3d_{l}\times d_{l}}$ are learnable parameters, and $d_{l}$ is the language model embedding size.
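To make Eqs. (1)–(4) concrete, the following NumPy sketch implements the connector for a single tile with random weights and toy dimensions. It is illustrative only: per the text, each query attends over its own 2×2 neighborhood (the span of keys and values in Eq. (3) is otherwise ambiguous), and all sizes besides the 4× token reduction are placeholders, not the model's real dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def swish(x):
    return x / (1.0 + np.exp(-x))

def connector(h_m3, h_m9, W_Q, W_K, W_V, W_O, W1, W2, W3, grid):
    # Eq. (1): concatenate features from the two intermediate ViT layers
    H = np.concatenate([h_m3, h_m9], axis=-1)              # (N, 2*d_v)
    d2 = H.shape[-1]
    # Group the patch grid into 2x2 neighborhoods (grid assumed even here)
    nb = H.reshape(grid // 2, 2, grid // 2, 2, d2)
    nb = nb.transpose(0, 2, 1, 3, 4).reshape(-1, 4, d2)    # (M, 4, 2*d_v)
    # Eq. (2): mean over each neighborhood gives the queries
    Q = nb.mean(axis=1)                                    # (M, 2*d_v)
    # Eq. (3): attention pooling; each query attends over its own 2x2 patch
    d_k = W_Q.shape[1]
    att = softmax(np.einsum("md,mjd->mj", Q @ W_Q, nb @ W_K) / np.sqrt(d_k))
    pooled = np.einsum("mj,mjd->md", att, nb @ W_V) @ W_O  # (M, d_v)
    # Eq. (4): SwiGLU projection into the LLM embedding space
    return (swish(pooled @ W1) * (pooled @ W2)) @ W3       # (M, d_l)

# Toy dimensions (the real model uses a 27x27 patch grid per tile)
d_v, d_l, grid = 8, 16, 4
N = grid * grid
h_m3, h_m9 = rng.normal(size=(2, N, d_v))
W_Q, W_K = rng.normal(size=(2, 2 * d_v, d_v))
W_V = rng.normal(size=(2 * d_v, 2 * d_v))
W_O = rng.normal(size=(2 * d_v, d_v))
W1, W2 = rng.normal(size=(2, d_v, 3 * d_l))
W3 = rng.normal(size=(3 * d_l, d_l))
out = connector(h_m3, h_m9, W_Q, W_K, W_V, W_O, W1, W2, W3, grid)
assert out.shape == (N // 4, d_l)   # 4x fewer tokens, projected to d_l
```

Restricting attention to each neighborhood keeps the pooling strictly local, matching the description of 2×2 attention pooling while preserving the spatial layout of the pooled grid.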
3.2 Language Decoder
The language decoder is initialized from Qwen3-1.7B-Base (https://huggingface.co/Qwen/Qwen3-1.7B-Base), which empirically outperformed the instruction-tuned variant in our setting. We introduce three special tokens to structure visual inputs: <im_start> and <im_end> delimit image and thumbnail sequences, while <im_col> marks row boundaries within the patch grid, where tokens are arranged left-to-right and top-to-bottom. Input and output embedding weights are not tied.
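To make the layout concrete, here is a hypothetical sketch of how one tile's pooled tokens could be wrapped with these special tokens. The exact placement of <im_col> (one marker closing each patch row) is our assumption from the description, not a confirmed implementation detail.

```python
def visual_token_layout(rows, cols):
    """Sketch of wrapping one tile's pooled visual tokens with the special
    tokens from Section 3.2. Assumption: one <im_col> closes each row."""
    seq = ["<im_start>"]
    for r in range(rows):
        # patch tokens in row-major order (left-to-right, top-to-bottom)
        seq += [f"patch_{r}_{c}" for c in range(cols)]
        seq.append("<im_col>")   # row-boundary marker (assumed placement)
    seq.append("<im_end>")
    return seq

layout = visual_token_layout(2, 3)
# 2 rows of 3 patch tokens, each row closed by <im_col>, wrapped start/end
assert len(layout) == 2 + 2 * (3 + 1)
```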
3.3 Efficiency Analysis
Table 1 quantifies the computational benefits of attention pooling. With the default 12-tile configuration (plus thumbnail), the unpooled baseline would produce 9,477 visual tokens per image, while our 2×2 pooling reduces this to 2,366 tokens. Since the ViT processes each tile identically regardless of pooling, the savings apply exclusively to the LLM: we observe a 3.9× reduction in prefill FLOPs and a 4× reduction in KV-cache memory. The overall FLOPs reduction is 2.3× when including the shared ViT cost.
Table 1: Efficiency comparison with and without 2×2 attention pooling for the default 12-tile configuration. FLOPs are computed for LLM prefill; KV-cache assumes fp16 precision.
| | Without pooling | With 2×2 pooling | Reduction |
| --- | --- | --- | --- |
| Visual tokens | 9,477 | 2,366 | 4.0× |
| LLM prefill FLOPs | 27.2 TFLOPs | 6.9 TFLOPs | 3.9× |
| KV-cache memory | 2.12 GB | 0.53 GB | 4.0× |
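The token counts follow directly from the tiling geometry described in Section 3; a quick arithmetic check (12 grid tiles plus the thumbnail, 27×27 patches per tile, 182 tokens after pooling):

```python
tiles = 12 + 1                 # 12 grid tiles plus the global thumbnail
patches_per_tile = 27 * 27     # SigLIP2 patches for one 378x378 tile
pooled_per_tile = 182          # per-tile tokens after 2x2 attention pooling

unpooled = tiles * patches_per_tile   # 9,477 visual tokens
pooled = tiles * pooled_per_tile      # 2,366 visual tokens
assert (unpooled, pooled) == (9477, 2366)
assert round(unpooled / pooled, 1) == 4.0   # the 4x reduction in Table 1
```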
4 Training
Training proceeds in two stages, both updating all model components (encoder, connector, and decoder) without freezing, following molmo. The combined data comprises approximately 5M multimodal samples and 12B text tokens across 30+ languages, with roughly half in English and the remainder spanning high- and moderate-resource languages. Table 2 summarizes hyperparameters for both stages.
4.1 Stage 1: Alignment Training
The first stage focuses on cross-language semantic grounding rather than task-specific objectives. Training data consists primarily of caption datasets (PixmoCap (molmo), PangeaIns (pangea)) spanning diverse visual domains: natural scenes, documents, infographics, and diagrams. We include 15% text-only data from PleiAS/common_corpus (pleias) to mitigate degradation on text-only tasks. The connector uses a higher learning rate and shorter warmup than the encoder and decoder.
4.2 Stage 2: Instruction Fine-tuning
The second stage trains instruction-following for VQA and reasoning tasks. We combine public dataset collections, including LLaVA OneVision (llava_onevision), Cauldron (cauldron), Cambrian (cambrian1), PangeaIns (pangea), and FineVision (finevision), with text-only instruction data from aya. The mixture covers academic VQA, document understanding, OCR, mathematics, and reasoning. Appendix A.2 shows representative examples.
Given the heterogeneity of the instruction data, we found single-source batches more effective early in training: we train for 30K steps with single-source batches, then 30K steps with mixed-source batches.
Table 2: Model training hyperparameters across pre-training and fine-tuning stages.
| Hyperparameter | Pre-Training | Fine-Tuning |
| --- | --- | --- |
| Warmup ViT | 10% | 10% |
| Warmup Con. | 1% | 10% |
| Warmup LLM | 10% | 10% |
| LR ViT | 6e-6 | 5e-6 |
| LR Con. | 2e-4 | 5e-6 |
| LR LLM | 2e-5 | 1e-5 |
| Cosine Decay | 0.1 | 0.1 |
| Eps. | 1e-6 | 1e-6 |
| Betas | 0.9, 0.95 | 0.9, 0.95 |
| Batch Size | 128 | 256 |
| Steps | 25K | 60K |
| Samples | 3.2M | 15.3M |
| Tokens | 10B | 37B |
| GPU Hours | 296 | 1,000 |
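The warmup and decay rows of Table 2 can be read as a standard warmup-then-cosine schedule. The sketch below assumes "Cosine Decay 0.1" means the learning rate decays to 10% of its peak; both that interpretation and the function itself are ours, not code from the paper.

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_frac, floor_frac=0.1):
    """Linear warmup followed by cosine decay to floor_frac * peak_lr."""
    warmup = int(warmup_frac * total_steps)
    if step < warmup:
        return peak_lr * step / max(warmup, 1)      # linear warmup
    t = (step - warmup) / max(total_steps - warmup, 1)
    floor = floor_frac * peak_lr
    return floor + 0.5 * (peak_lr - floor) * (1 + math.cos(math.pi * t))

# Stage-1 connector: peak LR 2e-4, 1% warmup over 25K steps (from Table 2)
assert lr_at(0, 25_000, 2e-4, 0.01) == 0.0                       # start
assert abs(lr_at(250, 25_000, 2e-4, 0.01) - 2e-4) < 1e-12        # peak
assert abs(lr_at(25_000, 25_000, 2e-4, 0.01) - 2e-5) < 1e-12    # 0.1x floor
```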
5 Evaluation
We compare jina-vlm against lightweight VLMs across seven capability areas: general VQA, multimodal comprehension, multi-image reasoning, hallucination control, mathematical reasoning, text-only performance, and multilingual understanding. All evaluations use VLMEvalKit (https://github.com/open-compass/VLMEvalKit) (vlmevalkit) with English prompts matching our training format (e.g., “Return only the letter of the best answer option” for multiple-choice, “Respond very briefly” for open-ended questions).
5.1 General VQA Tasks
Table 3 reports results on eight VQA benchmarks covering diagrams (AI2D (ai2d)), charts (ChartQA (chartqa), CharXiv (charxiv)), scene text (TextVQA (textvqa)), documents (DocVQA (docvqa), InfoVQA (infovqa)), OCR (OCRBench (ocrbench)), and diverse scenes (SEED-Bench-2-Plus (seedbench2plus)). jina-vlm achieves the highest average (72.3), with particularly strong performance on diagram interpretation and text extraction.
Table 3: Comparison of general visual question answering performance.
| Model | AI2D | ChartQA (test avg) | TextVQA (val) | DocVQA (val) | InfoVQA (val) | OCRBench | SEED-Bench-2-Plus | CharXiv (RQ / DQ) | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| jina-vlm | 82.0 | 81.9 | 83.2 | 90.6 | 71.6 | 778 | 67.2 | 32.3 / 63.5 | 72.3 |
| Qwen2-VL-2B | 74.7 | 73.5 | 79.7 | 89.2* | 64.0* | 809 | 62.4 | 23.3 / 55.0* | 66.4 |
| Qwen3-VL-2B | 76.9 | 77.2 | 79.5 | 92.3* | 71.9* | 858 | 67.3* | 28.8 / 62.3 | 71.6 |
| InternVL3-2B | 78.6 | 80.2 | 77.0 | 87.4* | 67.1* | 835 | 64.6 | 28.3 / 54.7 | 69.2 |
| InternVL3.5-2B | 78.8 | 80.7 | 76.5 | 88.5* | 69.3* | 836 | 68.0 | 31.6 / 65.0 | 71.6 |
Results for models other than jina-vlm are from their respective papers (internvl_3_5; internvl_3; qwen_2_vl), except those marked with * which were computed using VLMEvalKit. All scores represent accuracy (%) except OCRBench, which uses a 0–1000 scale; for the overall average, OCRBench scores are divided by 10 to align with the 0–100 scale of the other benchmarks.
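The averaging scheme implied by the numbers treats CharXiv RQ and DQ as separate entries and rescales OCRBench by 10; this reproduces jina-vlm's reported average to within rounding. This is our reading of the table, not a formula stated in the paper.

```python
# jina-vlm row of Table 3, with OCRBench rescaled from 0-1000 to 0-100
scores = {
    "AI2D": 82.0, "ChartQA": 81.9, "TextVQA": 83.2, "DocVQA": 90.6,
    "InfoVQA": 71.6, "OCRBench": 778 / 10,
    "SEED-Bench-2-Plus": 67.2, "CharXiv-RQ": 32.3, "CharXiv-DQ": 63.5,
}
avg = sum(scores.values()) / len(scores)   # nine entries, RQ/DQ separate
assert abs(avg - 72.3) < 0.2               # 72.23 vs. 72.3 reported
```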
5.2 Document and Real-World Understanding
Table 4 shows results on multimodal comprehension (MME (mme), MMB v1.1 (mmbench), MMStar (mmstar)) and real-world understanding (RealWorldQA (realworldqa), MME-RealWorld (mmerealworld), R-Bench (rbench)). jina-vlm scores 67.4 on multimodal tasks and 61.9 on real-world tasks, achieving the best RealWorldQA result (68.2).
Table 4: Comparison of generic multimodal understanding and real-world understanding performance.
| Model | MME | MMB v1.1 | MMStar | Avg | RealWorldQA | MME-RealWorld | R-Bench | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| jina-vlm | 1965.8 | 75.8 | 56.2 | 67.4 | 68.2 | 50.7 | 66.7 | 61.9 |
| Qwen2-VL-2B | 1872.0 | 72.2 | 48.0 | 62.4 | 62.9 | 38.7* | 63.2 | 55.0* |
| Qwen3-VL-2B | 2000.8* | 77.8 | 58.3 | 69.2 | 63.9 | 57.9* | 67.3* | 63.0 |
| InternVL3-2B | 2221.2 | 78.6 | 60.7 | 72.9 | 64.3 | 53.8 | 67.5 | 61.9 |
| InternVL3.5-2B | 2123.3 | 76.6 | 62.7 | 71.7 | 62.0 | 49.7 | 62.4 | 58.0 |
Results for models other than jina-vlm are from their respective papers (internvl_3_5; internvl_3; qwen_2_vl), except those marked with * which are computed using VLMEvalKit. All scores represent accuracy (%) except MME, which uses a 0–2800 scale; for the overall average, MME scores are divided by 28 to align with the 0–100 scale of the other benchmarks.
5.3 Multi-Image Reasoning and Hallucination
Table 5 reports multi-image reasoning (BLINK (blink), MuirBench (muirbench), MMT (mmtbench)) and hallucination benchmarks that measure the tendency to fabricate visual details (HallBench (hallusionbench), POPE (pope)). jina-vlm scores 47.3 on multi-image tasks, which is expected given limited multi-image training data, but achieves the best POPE score (90.3), indicating low hallucination rates.
Table 5: Comparison of multi-image and hallucination performance.
| Model | BLINK | MuirBench | MMT | Avg | HallBench | POPE | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| jina-vlm | 50.1 | 34.7 | 57.2 | 47.3 | 39.1 | 90.3 | 64.7 |
| Qwen2-VL-2B | 44.4 | 25.5* | 55.1 | 41.7 | 41.7 | 87.9* | 64.8 |
| Qwen3-VL-2B | 53.8 | 47.4 | 60.0* | 53.7 | 44.5 | 88.9* | 66.7 |
| InternVL3-2B | 50.3 | 38.8 | 59.5 | 49.5 | 42.5 | 89.6 | 66.1 |
| InternVL3.5-2B | 51.3 | 44.0 | 58.5 | 51.3 | 48.6 | 87.2 | 67.9 |
Results for models other than jina-vlm are from their respective papers, (internvl_3_5; internvl_3; qwen_2_vl), except those marked with * which are computed using VLMEvalKit. All scores represent accuracy (%).
5.4 Mathematical Reasoning
Table 6 reports structured reasoning benchmarks: multidisciplinary comprehension (MMMU (mmmu)), visual mathematics (MathVista (mathvista), MathVision (mathvision), MathVerse (mathverse), WeMath (wemath)), and logical reasoning (LogicVista (logicvista)). jina-vlm performs comparably to InternVL3-2B and outperforms Qwen2-VL-2B.
Table 6: Comparison of multimodal reasoning and mathematical problem-solving performance.
| Model | MMMU | MathVista | MathVision | MathVerse (Vision Only) | WeMath | LogicVista | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| jina-vlm | 45.6 | 59.5 | 19.2 | 23.9 | 17.1 | 33.3 | 33.1 |
| Qwen2-VL-2B | 41.1 | 43.0 | 12.4 | 17.3* | 10.9* | 27.3* | 25.3 |
| Qwen3-VL-2B | 53.4 | 61.3 | 31.6 | 22.7* | 28.0* | 35.4* | 38.7 |
| InternVL3-2B | 48.6 | 57.0 | 21.7 | 25.3 | 22.4 | 36.9 | 35.3 |
| InternVL3.5-2B | 59.0 | 71.8 / 61.5† | 42.8 / 26.5† | 53.4 / 35.3† | 48.5 / 19.1† | 47.7 / 41.4† | 50.7 |
Results for models other than jina-vlm are from their respective papers, (internvl_3_5; internvl_3; qwen_2_vl), except those marked with * which are computed using VLMEvalKit. $\dagger$ indicates scores for InternVL3.5-2B without thinking mode, evaluated using VLMEvalKit. All scores represent accuracy (%).
5.5 Text-Only Performance
Table 7 compares jina-vlm against the backbone Qwen3-1.7B on text-only benchmarks: MMLU (mmlu), MMLU-Pro (mmlupro), GSM-8K (gsm8k), ARC-C (arc), and HellaSwag (hellaswag). Results show mixed preservation of text-only capabilities: jina-vlm matches or exceeds the backbone on commonsense reasoning (ARC-C, HellaSwag) and retains most performance on MMLU and GSM-8K. However, MMLU-Pro shows substantial degradation (46.4 → 30.3), likely because this benchmark emphasizes extended multi-step reasoning that conflicts with our instruction-tuning toward concise visual responses. This suggests a trade-off between optimizing for multimodal tasks and preserving complex text-only reasoning, which future work could address through more balanced data mixtures or curriculum scheduling.
Table 7: Comparison of Text-only benchmarks.
| Model | MMLU | MMLU-Pro | GSM-8K | ARC-C | HellaSwag | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| jina-vlm | 56.1 | 30.3 | 71.3 | 77.3 | 59.4 | 58.9 |
| Qwen3-1.7B | 62.6 | 46.4 | 75.3 | 73.4 | 59.0 | 63.3 |
Results are collected using our evaluation code. All scores represent accuracy (%).
5.6 Multilingual Understanding
Table 8 reports multilingual multimodal benchmarks: MMMB (mmmb), Multilingual MMBench (mmmb), and MTVQA (mtvqa). jina-vlm achieves state-of-the-art multilingual performance among 2B-scale VLMs, with the highest averages on MMMB (78.8) and Multilingual MMBench (74.3).
Table 8: Comparison of multilingual multimodal understanding performance.
| Benchmark | Lang | jina-vlm | Qwen2-VL-2B | Qwen3-VL-2B | InternVL3-2B | InternVL3.5-2B |
| --- | --- | --- | --- | --- | --- | --- |
| MMMB | ar | 80.0 | 68.3 | 72.7* | 68.6 | 68.5 |
| | cn | 76.9 | 74.2 | 75.7* | 78.3 | 77.7 |
| | en | 82.0 | 78.3 | 80.7* | 81.9 | 80.2 |
| | pt | 79.2 | 72.6 | 75.0* | 75.4 | 75.9 |
| | ru | 79.2 | 72.8 | 75.9* | 74.6 | 76.3 |
| | tr | 75.5 | 61.8 | 68.5* | 62.9 | 69.1 |
| | avg | 78.8 | 71.3 | 75.0* | 73.6 | 74.6 |
| Multi. MMBench | ar | 70.0 | 66.7 | 66.2* | 66.4 | 63.7 |
| | cn | 75.9 | 67.0 | 75.7* | 77.8 | 75.9 |
| | en | 78.8 | 71.1 | 77.8* | 81.3 | 78.4 |
| | pt | 74.7 | 72.1 | 71.4* | 75.9 | 73.7 |
| | ru | 75.3 | 69.9 | 75.9* | 70.7 | 71.4 |
| | tr | 71.1 | 69.3 | 67.0* | 59.5 | 62.0 |
| | avg | 74.3 | 69.4 | 72.3* | 71.9 | 70.9 |
| MTVQA | | 25.6 | 20.6 | 27.3* | 26.7 | 28.5 |
| Overall | | 59.6 | 53.8 | 58.2 | 57.4 | 58.0 |
Results for baseline models are derived from their original publications, (internvl_3_5; internvl_3; qwen_2_vl), except those marked with * which are computed using VLMEvalKit. All scores represent accuracy (%).
6 Conclusion
We presented jina-vlm, a 2.4B vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. Our results demonstrate that small VLMs can attain strong cross-lingual visual understanding through careful architectural and training choices: attention-based token pooling reduces visual tokens by 4× while preserving spatial information, and incorporating text-only data during multimodal training mitigates the catastrophic forgetting typically observed in vision-adapted language models. On standard English VQA benchmarks, jina-vlm achieves leading results, demonstrating that multilingual capabilities need not come at the cost of general performance.
The current approach has limitations. Multi-tile processing introduces computational overhead that scales with image resolution, and tiling can fragment global spatial context, potentially impairing performance on tasks requiring holistic scene understanding, such as object counting or precise spatial reasoning across tile boundaries. While the global thumbnail partially mitigates this, native-resolution approaches (navit) may be better suited for such tasks. We have not emphasized safety-critical training or alignment, and multi-image reasoning remains weak due to limited training data in this regime. Future work could explore more efficient resolution handling, pursue targeted improvements for counting and spatial tasks, and investigate whether our multilingual training recipe transfers to larger model scales.
Appendix A Appendix
A.1 Pseudocode for Creating Overlapping Tiles
Input: Image $I$ of size $(h,w)$; base input size $\mathbf{b}=(b_{h},b_{w})=(378,378)$; patch size $p=14$; maximum number of tiles $M$ (12 by default, configurable); overlap margins $(m_{L},m_{R})=(4,4)$ in patches.
Output: List of tiles $\mathcal{C}$ (thumbnail + grid tiles); tiling $(t_{h},t_{w})$ = (number of rows, number of columns).
1. Compute overlap-related sizes
$m_{\text{tot}}← p·(m_{L}+m_{R})$ // Total overlap margin in pixels
$s_{\text{win}}←\bigl(\lfloor b_{h}/p\rfloor-(m_{L}+m_{R})\bigr)· p$ // Tile stride in pixels
2. Select tiling on the margin-reduced image
$(t_{h},t_{w})←\textsc{SelectTilingWithMinimalScaleChange}\bigl(h-m_{\text{tot}},\,w-m_{\text{tot}},\,s_{\text{win}},\,M\bigr)$ ;
3. Resize image to exactly fit the chosen tiling + margins;
$H^{\prime}← t_{h}· s_{\text{win}}+m_{\text{tot}}$ ;
$W^{\prime}← t_{w}· s_{\text{win}}+m_{\text{tot}}$ ;
$I_{\text{grid}}←\textsc{Resize}(I,\;[H^{\prime},W^{\prime}])$ ;
4. Extract overlapping tiles
$\mathcal{G}←\textsc{ExtractTiles}\bigl(I_{\text{grid}},\,(t_{h},t_{w}),\,s_{\text{win}},\,b_{h}\bigr)$ // $b_{h}$ is the tile height, equal to $b_{w}$ here
5. Build thumbnail and final tile list
$T←\textsc{Resize}(I,\;[b_{h},b_{w}])$ // Global thumbnail
$\mathcal{C}←[T]\mathbin{+\!\!+}\mathcal{G}$ // Concatenate thumbnail and tiles
return $(\mathcal{C},(t_{h},t_{w}))$ ;
Algorithm 1 GetAllTilesOverlapAndResize
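Algorithm 1 can be sketched in plain Python as follows. SelectTilingWithMinimalScaleChange is not specified in the paper, so the grid search below (minimizing log-scale distortion subject to the tile budget) is a plausible stand-in, and the function returns tile origins rather than pixel crops.

```python
import math

def get_tiles(h, w, b=378, p=14, M=12, margins=(4, 4)):
    """Sketch of Algorithm 1; the grid-selection heuristic is assumed."""
    mL, mR = margins
    m_tot = p * (mL + mR)               # total overlap margin: 112 px
    s_win = (b // p - (mL + mR)) * p    # stride between tile origins: 266 px
    # Step 2 (assumed heuristic): pick the grid with th*tw <= M that
    # minimizes log-scale distortion of the margin-reduced image
    eff_h, eff_w = max(h - m_tot, 1), max(w - m_tot, 1)
    best, best_cost = (1, 1), float("inf")
    for th in range(1, M + 1):
        for tw in range(1, M // th + 1):
            cost = (abs(math.log(th * s_win / eff_h))
                    + abs(math.log(tw * s_win / eff_w)))
            if cost < best_cost:
                best, best_cost = (th, tw), cost
    th, tw = best
    # Step 3: resize target that exactly fits the grid plus margins
    H, W = th * s_win + m_tot, tw * s_win + m_tot
    # Step 4: origins of the overlapping b x b tiles (a 378x378 thumbnail
    # of the full image would be prepended as tile 0)
    boxes = [(r * s_win, c * s_win) for r in range(th) for c in range(tw)]
    return (th, tw), (H, W), boxes

# A tall 1800x600 image maps to a 6x2 grid, resized to 1708x644
(th, tw), (H, W), boxes = get_tiles(1800, 600)
assert (th, tw) == (6, 2) and (H, W) == (1708, 644) and len(boxes) == 12
```

Note that with these defaults the stride is 266 px and the margin 112 px, matching the tiling geometry described in Figure 1 (e.g., a 3-row by 4-column grid spans 910×1176 pixels).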
A.2 Training Set Examples
Captioning & Instruction
Dataset: VisualWebInstruct visualwebinstruct
Question what is the meeting title? Answer Conflict Resolution Meeting
Figure 2: Question answering over web documents.
Charts & Tables
Dataset: TAT-QA tatqa
<details>
<summary>selected_samples/tatqa.png Details</summary>

### Visual Description
## Data Table: Unrecognized Tax Benefit Rollforward
### Overview
The image presents a data table detailing the rollforward of unrecognized tax benefits for fiscal years ended July 31st, from 2017 to 2019. The table shows the beginning balance, gross increases and decreases related to prior and current period tax positions, and the ending balance of unrecognized tax benefits. All values are expressed in thousands of US dollars.
### Components/Axes
The table has the following structure:
* **Rows:** Represent different components of the unrecognized tax benefit rollforward:
* Unrecognized tax benefit - beginning of period
* Gross increases - prior period tax positions
* Gross decreases - prior period tax positions
* Gross increases - current period tax positions
* Unrecognized tax benefit - end of period
* **Columns:** Represent the fiscal years:
* 2019
* 2018
* 2017
* **Header:** "Fiscal years ended July 31,"
### Detailed Analysis or Content Details
Here's a reconstruction of the data table's content:
| Component | 2019 | 2018 | 2017 |
| ---------------------------------------------- | --------- | --------- | --------- |
| Unrecognized tax benefit - beginning of period | $10,321 | $9,346 | $7,687 |
| Gross increases - prior period tax positions | 98 | 729 | 712 |
| Gross decreases - prior period tax positions | (88) | (878) | (691) |
| Gross increases - current period tax positions | 1,302 | 1,124 | 1,638 |
| Unrecognized tax benefit - end of period | $11,633 | $10,321 | $9,346 |
**Trends:**
* **Beginning Balance:** The unrecognized tax benefit at the beginning of the period increased from $7,687 in 2017 to $9,346 in 2018, and further to $10,321 in 2019.
* **Prior Period Increases:** Gross increases from prior period tax positions rose slightly from $712 in 2017 to $729 in 2018, then fell sharply to $98 in 2019.
* **Prior Period Decreases:** Gross decreases from prior period tax positions grew in absolute value from $691 in 2017 to $878 in 2018, then shrank to $88 in 2019.
* **Current Period Increases:** Gross increases from current period tax positions decreased from $1,638 in 2017 to $1,124 in 2018, and then increased to $1,302 in 2019.
* **Ending Balance:** The unrecognized tax benefit at the end of the period increased from $9,346 in 2017 to $10,321 in 2018, and further to $11,633 in 2019.
### Key Observations
* The ending balance of unrecognized tax benefits consistently increased over the three-year period.
* The largest contributor to the increase in the ending balance appears to be the gross increases from current period tax positions, particularly in 2019.
* The net effect of prior period adjustments (increases minus decreases) was slightly positive in 2017 ($21), negative in 2018 (−$149), and only marginally positive in 2019 ($10).
### Interpretation
The data suggests a growing amount of unrecognized tax benefits over the period 2017-2019. This could indicate an increasing number of tax positions taken that are considered uncertain, or a slower resolution of existing uncertain tax positions. The increase in current period tax positions in 2019 suggests that the company may have taken on new tax positions with uncertain outcomes during that year. The decreasing impact of prior period adjustments could mean that older uncertain tax positions are being resolved, but the overall trend is still towards a larger amount of unrecognized tax benefits. The values are presented in thousands of dollars, indicating a material amount of potential future tax liabilities. The rollforward provides transparency into how the unrecognized tax benefit balance changes over time due to various factors.
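The rollforward arithmetic behind the answer can be checked directly; a minimal sketch using the values from the reconstructed table above:

```python
# Values in thousands of USD, from the reconstructed rollforward table.
beginning = {"2019": 10_321, "2018": 9_346, "2017": 7_687}
prior_increases = {"2019": 98, "2018": 729, "2017": 712}
prior_decreases = {"2019": -88, "2018": -878, "2017": -691}
current_increases = {"2019": 1_302, "2018": 1_124, "2017": 1_638}

# Ending balance = beginning balance + all gross movements for the year.
ending = {
    year: beginning[year]
    + prior_increases[year]
    + prior_decreases[year]
    + current_increases[year]
    for year in beginning
}
# Reproduces the table's ending balances: 11,633 / 10,321 / 9,346.

# Year-over-year increase in 2019: 11,633 - 10,321 = 1,312 thousand,
# i.e. approximately $1.3 million.
increase_2019 = ending["2019"] - beginning["2019"]
```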
</details>
Question Unrecognized Tax Benefits Activity related to unrecognized tax benefits is as follows (in thousands): … As of July 31, 2019, the Company has no income tax audits in progress in the U.S. or foreign jurisdictions. What was the increase in unrecognized tax benefits in 2019? Answer $1.3 million.
Figure 3: Financial table requiring numerical reasoning over text.
Document Understanding & Infographics
Dataset: DocVQA docvqa
<details>
<summary>selected_samples/docvqa.png Details</summary>

### Visual Description
## Form: Response Code Request Form
### Overview
This is a scanned image of a response code request form, likely used for marketing or research purposes. The form requests information related to a direct marketing campaign involving mailers and viewer interest confirmation. It appears to be a document from RJR (Reynolds American) related to a product called "Eclipse".
### Components/Axes
The form is divided into several sections:
* **Header:** Contains "From", "To", "Date", and "Page" information.
* **Main Body:** Contains sections for RJR Brand, Project Name, Description, Distribution, Timing, Data Entry, Agency, Fulfillment, and Comments.
* **Footer:** Contains information about the Marketing Person Responsible for Project, Production Contact, Response Code Requested By, Copies of Confirmation Letter To, and handwritten notes/numbers.
The form includes the following labels and fields:
* **From:** Joyce Bagby
* **To:** Brian Robbins
* **Date:** 6/14/96 12:18:14
* **Page:** 2 of 7
* **RJR Brand:** Eclipse
* **RJR Program #:** 601776
* **Response Code:** U19
* **Project Name:** 2-Free Pack Request - Viewership Interest Confirmation #2
* **Description:** Mail Launch Mailers without an invitation to Sneak Previews to Charlotte and Indianapolis to confirm interest and intent levels. Mailer has a BRC offer for free trial which is this response code. If returned smoker gets two free packs.
* **Distribution:**
* Quantity: 750
* Pct Response: 20.00 %
* # of Responders: 150
* **Distribution Vehicle:** Direct Marketing
* **Timing:**
* DTS: 07/10/96
* Expiration Date: 08/01/96
* Offer Complete: 07/11/96
* Days Offer Open: 1
* **Data Entry:**
* Supplier: M/A/R/C
* **Agency:**
* Agency Name: Mezzina/Brown
* Contact Person: Tami LaManna
* Phone #: (212) 251-7700
* **Incoming mail:** BRC
* **If M/A/R/C is supplier, do you need a M/A/R/C P.O. Box?:** Yes
* **Fulfillment:**
* Company: M/A/R/C
* Job Number:
* Number of Items on Order Form: 1
* P.O. Box:
* City:
* State:
* Zip Code:
* **Comments:** [Blank field]
* **Marketing Person Responsible for Project:** Mike Duffy - 0910
* **Production Contact:** Joe Leinster
* **Response Code Requested By:** Mike Duffy
* **Copies of Confirmation Letter To:**
* Suzi Hicks
* Terri LaMarre
* Vanessa Oakley
* **Handwritten Notes:**
* 834-076
* 359
* 6910-69199
* Teresa Smith
* Jack Bellis
* Nancy Montgomery
* **Source:** https://www.industrydocuments.ucsf.edu/docs/lffw0023
### Detailed Analysis or Content Details
The form is largely filled with pre-printed text and fields for data entry. The key data points are:
* **Quantity of Mailers:** 750
* **Expected Response Rate:** 20.00%
* **Expected Number of Responders:** 150
* **DTS (Date of Test Start):** 07/10/96
* **Expiration Date:** 08/01/96
* **Offer Completion Date:** 07/11/96
* **Offer Open Duration:** 1 day
* **Supplier:** M/A/R/C
* **Agency:** Mezzina/Brown, Contact: Tami LaManna, Phone: (212) 251-7700
* **Marketing Contact:** Mike Duffy (phone number 0910)
* **Production Contact:** Joe Leinster
* **Response Code Requestor:** Mike Duffy
The handwritten notes in the bottom right corner appear to be related to tracking or reference numbers.
### Key Observations
The form indicates a targeted direct marketing campaign for the "Eclipse" product. The campaign involves sending mailers to Charlotte and Indianapolis to gauge interest in a "sneak preview" offer. The offer includes a BRC (likely a barcode or response code) for a free trial, with the incentive of receiving two free packs of cigarettes upon return. The expected response rate is 20%, which translates to 150 responders out of 750 mailers. The short offer duration (1 day) suggests a time-sensitive promotion.
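The response-rate arithmetic stated above can be verified with a one-line check, using the figures taken from the form:

```python
# Campaign figures from the Distribution section of the form.
quantity_mailed = 750
expected_response_rate = 0.20  # "Pct Response: 20.00 %"

# Expected responders = mailers sent x response rate.
expected_responders = round(quantity_mailed * expected_response_rate)
# Matches the "# of Responders: 150" field on the form.
```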
### Interpretation
This document provides insight into the marketing strategies employed by RJR for the "Eclipse" product. The use of a response code and a free trial offer are common marketing tactics to encourage customer engagement and gather data on consumer interest. The form's detailed fields suggest a rigorous tracking and analysis process for the campaign's performance. The handwritten notes indicate internal tracking and communication related to the campaign. The document is a primary source for understanding the marketing and promotional activities of RJR during this period. The reference to "without an invitation to Sneak Previews" suggests a test or control group was also used, where some recipients *did* receive an invitation. This is a common A/B testing methodology. The form's existence on the industrydocuments.ucsf.edu website suggests it is part of a larger collection of internal documents related to the tobacco industry, potentially used in legal or research contexts.
</details>
Question what is the response code? Answer U19
Figure 4: Document image with question about textual fields.
OCR QA (text-centric VQA)
Dataset: TextVQA textvqa
<details>
<summary>selected_samples/textvqa.png Details</summary>

### Visual Description
## Photograph: New York City Taxi
### Overview
The image is a photograph of a New York City taxi cab in motion, captured with a motion blur effect. The background is a blurred cityscape at night, filled with lights. The focus is on the yellow taxi, which is prominently displayed in the center of the frame.
### Components/Axes
There are no axes or components in the traditional sense of a chart or diagram. The key elements are:
* **Taxi Cab:** A yellow New York City taxi.
* **License Plate:** Visible on the front of the taxi, reading "3G54".
* **Taxi Sign:** A sign on the roof of the taxi, with the text "TAXI RATE" visible.
* **Background:** Blurred lights and shapes suggesting buildings and traffic.
* **Road:** The taxi is driving on a road with crosswalk markings.
### Detailed Analysis or Content Details
The photograph depicts a single taxi cab, a bright yellow color, moving from right to left in the frame. The motion blur indicates a relatively high speed. The driver is visible inside the cab. The taxi's license plate number is "3G54". The "TAXI RATE" sign on the roof is partially visible, but the specific rate information is illegible. The background is heavily blurred, with streaks of red, white, and yellow lights, suggesting a busy urban environment at night. The road surface shows crosswalk markings.
### Key Observations
* The image emphasizes motion through the use of motion blur.
* The bright yellow color of the taxi stands out against the darker, blurred background.
* The license plate number provides a unique identifier for the vehicle.
* The "TAXI RATE" sign indicates the vehicle is operating as a metered taxi.
* The image captures a typical scene from a bustling city environment.
### Interpretation
The photograph likely aims to convey the energy and dynamism of New York City. The motion blur suggests a fast-paced lifestyle and constant movement. The iconic yellow taxi cab is a symbol of the city, instantly recognizable and associated with its urban landscape. The blurred background reinforces the sense of a busy, vibrant environment. The image doesn't present any quantifiable data, but rather evokes a feeling or impression of city life. The photograph is a snapshot of a moment in time, capturing the essence of urban transportation and the constant flow of activity within a major city. The image is a visual representation of the city's pulse.
</details>
Question what number is the cab Answer 3G54
Figure 5: Photograph with a question requiring OCR.
General VQA
Dataset: VQAv2 vqav2
<details>
<summary>selected_samples/vqav2.png Details</summary>

### Visual Description
## Photograph: Skateboarder in Mid-Air
### Overview
The image is a photograph capturing a skateboarder performing a trick over a picnic table. The background shows a crowd of people and a partially visible building, suggesting an outdoor public space. The photograph appears to be taken during daylight hours with a slightly overcast sky.
### Components/Axes
There are no axes or formal components like in a chart or diagram. The key elements are:
* **Subject:** A skateboarder in mid-air.
* **Obstacle:** A dark-colored wooden picnic table.
* **Background:** A blurred crowd of people, a white van or truck, and a building.
* **Sky:** A light blue sky with some cloud cover.
* **Watermark:** A watermark in the bottom-right corner.
### Detailed Analysis or Content Details
The skateboarder is wearing a black t-shirt with a colorful graphic, dark pants, and a green baseball cap. They are in a dynamic pose, with one arm raised and the skateboard positioned beneath their feet. The skateboarder is approximately 1.5-2 meters above the picnic table. The picnic table is constructed of dark wood and has a traditional design with attached benches. The crowd in the background appears to be observing the skateboarder. The van/truck is white and partially obscured by the crowd. The building in the background is light-colored and appears to be several stories high.
The watermark in the bottom-right corner reads: “© Heltil Johansen”.
### Key Observations
The photograph captures a moment of action and skill. The blurred background suggests motion and emphasizes the skateboarder as the focal point. The lighting is relatively even, allowing for clear visibility of the subject and surrounding environment. The composition is dynamic, with the skateboarder positioned diagonally across the frame.
### Interpretation
The photograph likely documents a skateboarding event or demonstration in a public space. The presence of a crowd suggests that the skateboarder is performing for an audience. The image conveys a sense of energy, athleticism, and urban culture. The watermark indicates that Heltil Johansen is the copyright holder of the photograph. The image doesn't present quantifiable data, but rather a snapshot of a cultural activity. The choice of location (a public park or plaza) suggests an attempt to integrate skateboarding into the everyday urban landscape. The photograph could be used to promote skateboarding, document a specific event, or simply capture a moment of artistic expression.
</details>
Question Where is he looking? Answer down
Figure 6: General visual question answering on natural images.
Grounding, Spatial & Counting
Dataset: TallyQA tallyqa
<details>
<summary>selected_samples/tallyqa.png Details</summary>

### Visual Description
## Photograph: Motorcycle with Rider and Passenger
### Overview
The image depicts a black motorcycle with a rider and a young passenger. The scene appears to be outdoors, likely a parking lot or a road with other motorcycles visible in the background. The focus is on the rider and passenger, highlighting a potentially unsafe riding situation.
### Components/Axes
There are no axes or legends present in this image. It is a photograph, not a chart or diagram. The key components are:
* Motorcycle: A large, black touring motorcycle.
* Rider: An adult male wearing a black helmet, sunglasses, and a black leather jacket.
* Passenger: A young child wearing a pink helmet and a black jacket.
* Background: Other motorcycles and trees.
* Environment: Appears to be a parking lot or road.
### Detailed Analysis or Content Details
The motorcycle is a large touring model, possibly a Harley-Davidson, based on the styling. The rider is seated and appears to be holding the handlebars. The passenger is seated in front of the rider, on the motorcycle's seat. The passenger is small in stature, suggesting they are a young child. The child's helmet is pink with what appears to be character ears on top. The rider is wearing dark sunglasses and a black leather jacket. The motorcycle has a large windshield and saddlebags. The background shows several other motorcycles, suggesting a group ride or event. The lighting suggests an overcast day.
### Key Observations
* The passenger is very young and appears to be inadequately protected for motorcycle travel.
* The passenger's feet do not reach the footrests, indicating they are not securely positioned.
* The overall scene suggests a potentially unsafe riding situation.
* The presence of other motorcycles suggests a group activity.
### Interpretation
The photograph likely aims to highlight the dangers of allowing young children to ride on motorcycles without proper safety equipment and training. The image evokes concern due to the child's size and apparent lack of secure positioning on the motorcycle. The presence of other motorcycles suggests a social context, potentially indicating a lack of awareness or disregard for safety regulations. The image could be used in a safety campaign or to raise awareness about responsible motorcycle riding practices. The photograph does not contain any quantifiable data, but rather presents a visual narrative about safety concerns. It is a snapshot of a moment, and its interpretation relies on understanding the risks associated with motorcycle travel and the vulnerability of young passengers.
</details>
Question How many more people can ride on the vehicle? Answer 0
Figure 7: Scene requiring counting and spatial reasoning.
Math & Geometry (vision)
Dataset: CLEVR clevr
<details>
<summary>selected_samples/clevr.png Details</summary>

### Visual Description
## Photograph: Arrangement of Geometric Shapes
### Overview
The image is a photograph depicting an arrangement of several geometric shapes on a neutral gray surface. The shapes include cylinders, spheres, and a cuboid. The lighting appears soft and diffused, creating subtle highlights and shadows. There is no explicit data or chart information present; the image is purely visual.
### Components/Axes
There are no axes, scales, or legends present in the image. The components are solely the geometric shapes themselves.
### Detailed Analysis or Content Details
The following shapes are visible:
* **Blue Cylinder:** Located on the left side of the image. It appears to have a radius of approximately 20% of the image width and a height of approximately 30% of the image width.
* **Gray Cuboid:** Positioned slightly above and to the right of the blue cylinder. It is relatively small, with dimensions approximately 10% of the image width.
* **Teal Sphere:** Situated to the right of the gray cuboid. Its diameter is roughly 10% of the image width.
* **Purple Sphere (Small):** Located directly to the right of the teal sphere. It is smaller than the teal sphere, with a diameter of approximately 5% of the image width.
* **Green Cylinder:** Positioned towards the center-right of the image. It has a radius of approximately 15% of the image width and a height of approximately 25% of the image width.
* **Yellow Cylinder:** Located on the right side of the image. It has a radius of approximately 20% of the image width and a height of approximately 20% of the image width.
* **Purple Sphere (Large):** Situated at the bottom center of the image. It has a diameter of approximately 15% of the image width.
The shapes are arranged seemingly at random, with no apparent order or alignment.
### Key Observations
The image does not contain any quantifiable data or trends. The key observation is the variety of shapes and colors present. The lighting creates a sense of depth and dimensionality.
### Interpretation
The image appears to be a simple still life composition, likely intended for artistic or illustrative purposes. The arrangement of shapes does not convey any specific information or message beyond the visual presentation of the objects themselves. The lack of context suggests the image is not intended to represent a dataset or a functional diagram. It could be a rendering created for design or visualization purposes. The choice of colors and shapes may be intended to evoke a particular aesthetic or mood, but without further information, this is speculative. The image does not provide any facts or data, and is purely descriptive.
</details>
Question There is a large shiny object; does it have the same shape as the object right of the large metallic thing? Provide a short and direct response. Answer Yes.
Figure 8: Synthetic shapes testing compositional spatial reasoning.
Screens, Web & GUI
Dataset: ScreenQA screenqa
<details>
<summary>selected_samples/screenqa.png Details</summary>

### Visual Description
## Screenshot: Initial Setup - Period Tracking App
### Overview
This is a screenshot of a mobile application's initial setup screen, likely for a period tracking or menstrual cycle app. The screen prompts the user to input information about their period length, cycle length, and the start date of their last period. The UI uses a simple, clean design with input fields and navigation arrows.
### Components/Axes
The screen is divided into three main sections, each requesting a specific piece of information:
1. **Default period length:** Label: "Default period length". Sub-label: "The length of your period. Usually lasts between 4 and 7 days". Input field displaying the value "5". Left and right arrow buttons for incrementing/decrementing the value.
2. **Default cycle length:** Label: "Default cycle length". Sub-label: "The duration between the start dates of two periods. Usually between 23 and 35 days". Input field displaying the value "30". Left and right arrow buttons for incrementing/decrementing the value.
3. **Start date of your last period:** Label: "Start date of your last period". Three dropdown menus for Day, Month, and Year. Currently selected values are: Day: "10", Month: "Feb", Year: "2017". Downward-facing arrow icons indicate these are dropdown menus.
At the bottom of the screen is a "Continue" button with a right-pointing arrow. The top of the screen displays "Initial Setup" with a gear icon to the left. Standard Android navigation buttons are visible at the very bottom of the screen (back, home, recent apps).
### Detailed Analysis or Content Details
The screen presents three input fields with pre-populated values:
* **Period Length:** The default period length is set to 5 days. The app suggests a typical range of 4-7 days.
* **Cycle Length:** The default cycle length is set to 30 days. The app suggests a typical range of 23-35 days.
* **Last Period Start Date:** The last period started on February 10, 2017.
The input fields are designed to be easily adjustable using the arrow buttons or dropdown menus.
### Key Observations
The app provides suggested ranges for period and cycle length, which is helpful for users who are unsure of their typical values. The use of dropdown menus for the date selection is a standard UI pattern for mobile apps. The pre-populated values suggest a default cycle for the user.
### Interpretation
This screen is a crucial part of the onboarding process for a period tracking app. The information collected here is used to calculate predicted period dates and provide personalized insights. The app aims to simplify the process of tracking menstrual cycles by providing clear instructions and suggested values. The inclusion of typical ranges for period and cycle length demonstrates an understanding of the variability in menstrual cycles and helps to manage user expectations. The choice of February 10, 2017, as the default start date is arbitrary and likely serves as a placeholder for the user to customize. The overall design emphasizes ease of use and a user-friendly experience. The app is designed to be accessible to a wide range of users, regardless of their familiarity with period tracking.
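As a rough illustration of how an app might turn these three inputs into a predicted date (a hypothetical sketch, not this app's actual code):

```python
from datetime import date, timedelta

# Inputs as shown on the setup screen.
last_period_start = date(2017, 2, 10)  # "Start date of your last period"
cycle_length_days = 30                 # "Default cycle length"
period_length_days = 5                 # "Default period length"

# The next period is predicted to start one full cycle after the last start
# date, and to run for the default period length.
next_period_start = last_period_start + timedelta(days=cycle_length_days)
next_period_end = next_period_start + timedelta(days=period_length_days - 1)
# Predicts 2017-03-12 through 2017-03-16.
```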
</details>
Question What is the default period length? Answer 5 days
Figure 9: User interface screenshot with structured textual elements.
Medical (vision)
Dataset: PathVQA pathvqa
<details>
<summary>selected_samples/pathvqa.png Details</summary>

### Visual Description
## Microscopic Image: Brain Tissue with Amyloid Plaques
### Overview
The image is a microscopic view of brain tissue, likely stained to highlight cellular structures and pathological features. The dominant feature is the presence of several amyloid plaques, indicated by darker, irregular shapes, surrounded by numerous smaller, uniformly stained cells (likely neurons and glial cells). Two plaques are specifically pointed to by black arrows. The background is a lighter color, representing the extracellular matrix.
### Components/Axes
There are no explicit axes or scales in this image. The components are:
* **Amyloid Plaques:** Irregularly shaped, dark brown/purple deposits.
* **Cells:** Small, round, blue-stained nuclei.
* **Extracellular Matrix:** Lightly stained background.
* **Arrows:** Black arrows pointing to specific amyloid plaques.
### Detailed Analysis or Content Details
The image shows a high density of cells throughout the field of view. The cells appear relatively uniform in size and staining intensity. The amyloid plaques vary in size and shape.
* **Plaque 1 (Left):** Large, roughly circular plaque with a fragmented, irregular edge. It is approximately 50-75 micrometers in diameter.
* **Plaque 2 (Center-Right):** A cluster of smaller, interconnected plaques, appearing more fragmented and less defined than the first. The combined area is approximately 30-40 micrometers.
* **Plaque 3 (Right):** A smaller, more compact plaque, approximately 15-20 micrometers in diameter, pointed to by an arrow.
* **Cell Density:** The cell density appears relatively consistent across the image, with approximately 20-30 cells visible per 100x100 micrometer square.
* **Arrow Placement:** The arrows are positioned to specifically highlight the presence of amyloid plaques.
### Key Observations
* The presence of amyloid plaques is a hallmark of Alzheimer's disease and other neurodegenerative conditions.
* The plaques are surrounded by cells, suggesting an inflammatory response.
* The variation in plaque size and shape may indicate different stages of plaque development or different types of amyloid deposits.
* The image does not provide quantitative data on plaque load or cell counts.
### Interpretation
This image likely represents brain tissue from a patient with Alzheimer's disease or a related dementia. The amyloid plaques are a key pathological feature of these conditions, and their presence suggests neuronal dysfunction and cognitive decline. The surrounding cellular response indicates an attempt by the brain to clear the plaques, but this process is often ineffective in the long term. The image demonstrates the characteristic neuropathology associated with these diseases. The lack of quantitative data limits the ability to assess the severity of the pathology, but the presence of multiple plaques suggests a significant degree of amyloid deposition. Further analysis, such as immunohistochemistry for specific proteins (e.g., beta-amyloid, tau), would be needed to confirm the diagnosis and characterize the specific type of amyloid pathology. The image is a visual confirmation of a pathological process, but does not provide information on the patient's clinical status or disease progression.
</details>
Question where are liver stem cells (oval cells) located? Answer in the canals of hering
Figure 10: Microscopic pathology image for medical VQA.
Text-only (instruction / code / math / translation)
Dataset: aya_dataset aya
Question Quels président des États-Unis ne s’est jamais marié? ("Which U.S. president never married?") Answer James Buchanan est le seul président qui ne s’est jamais marié. ("James Buchanan is the only president who never married.")
Figure 11: Text-only tasks covering multiple languages.