arXiv:2512.04032
# jina-vlm: Small Multilingual Vision Language Model
**Authors**: Jina AI by Elastic
> Equal contribution. Work done during internship at Jina AI.
## Abstract
We present jina-vlm, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. The model achieves leading results on standard VQA benchmarks and multilingual evaluations while preserving competitive text-only performance. Model weights and code are publicly released at https://huggingface.co/jinaai/jina-vlm.
## 1 Introduction
Vision-language models (VLMs) combine pretrained vision encoders with large language models to tackle tasks requiring joint visual and textual understanding (flamingo; llava). Recent VLMs have achieved strong results on visual question answering (VQA), OCR, and multimodal reasoning. However, two challenges limit their practical deployment. First, multilingual capabilities often degrade during vision adaptation: models that perform well on English benchmarks show uneven results across other languages (vlmsurvey). Second, high-quality VLMs remain computationally expensive to train and deploy, limiting accessibility for researchers and practitioners with constrained resources.
This work introduces jina-vlm, a 2.4B parameter VLM that addresses both challenges. The model aligns a SigLIP2-So400M/14-384 vision encoder (siglip2) with Qwen3-1.7B-Base (qwen3) through an attention-pooling connector, trained with a two-stage pipeline that explicitly incorporates multilingual data. Among open 2B-scale VLMs, jina-vlm achieves state-of-the-art performance on multilingual multimodal benchmarks including MMMB and Multilingual MMBench, demonstrating that small models can excel at cross-lingual visual understanding without sacrificing general capabilities. On standard English benchmarks spanning diagrams, charts, documents, and OCR, jina-vlm achieves the highest average score (72.3) across eight VQA benchmarks among 2B-scale VLMs. These results are enabled by two technical contributions: an efficient arbitrary-resolution pipeline that combines overlapping tiling with attention-based token pooling to reduce visual token count by 4 $\times$ , and a training recipe that incorporates text-only data to preserve the language understanding performance of the backbone LLM.
## 2 Related Work
VLM architecture and training. Modern VLMs follow an architecture introduced by PaLI (pali): a pretrained vision encoder extracts visual features, a connector projects them into the language model’s embedding space, and a decoder-only language model generates text conditioned on these visual tokens. Vision Transformers (ViTs) (vit) produce patch-level representations that the language model processes alongside text embeddings. This design is adopted by LLaVA (llava; llava_1_5; llava_uhd; llava_next_interleave; llava_onevision), QwenVL (qwen_vl; qwen_2_vl; qwen_2_5_vl), InternVL (internvl; internvl_1_5; internvl_2_5; internvl_3; internvl_3_5), and Ovis (ovis; ovis_2_5). Training strategies vary: qwen_2_vl; internvl_2_5 alternate between multimodal instruction tuning and general training; llava_1_5 incorporate academic VQA datasets; molmo, llava_onevision, and cambrian1 curate large-scale, diverse data mixtures.
Efficient resolution-agnostic image processing. Standard ViTs process fixed-resolution images, requiring resizing that discards fine-grained detail. Since visual token count scales with resolution and Transformer computation scales quadratically with sequence length, naive high-resolution processing is prohibitive. Several solutions exist: molmo tile images with overlap; qwen_2_vl introduce Naive Dynamic Resolution with Multimodal Rotary Position Embedding (rope; 2drope); ovis_2_5 use native-resolution ViTs (navit). Orthogonally, images often contain low-information regions (e.g., sky backgrounds), making visual tokens highly redundant. Token compression methods address this (fastv; prumerge; visionzip; pyramiddrop). internvl_1_5 develop Dynamic High-Resolution Tiling, and nvila propose scale-then-compress strategies. Recent work on training-free token budgeting, such as HERO (hero), demonstrates that inference-time pruning can achieve significant speedups while preserving accuracy; our approach differs by learning compact representations during training rather than dropping tokens at inference.
Vision-language connectors. The connector bridging vision encoders and language models significantly impacts both efficiency and performance. BLIP-2 (blip2) introduces Q-Former, a learnable query-based transformer that extracts fixed-length representations from visual features, reducing the number of tokens fed to the LLM. Flamingo (flamingo) uses a Perceiver Resampler with cross-attention to compress visual tokens. Our attention-pooling connector shares the goal of token reduction but operates differently: rather than learning a fixed set of queries, we apply local 2 $\times$ 2 attention pooling that preserves spatial structure while achieving 4 $\times$ compression, which we found more effective for tasks requiring fine-grained spatial understanding.
Small VLMs. Efficiency has become a central objective. mobilevlmv2 demonstrate competitive performance below 2B parameters. imp combine quantization with aggressive resolution reduction for mobile deployment, matching larger models’ performance. MiniCPM-V (minicpmv) targets edge deployment while maintaining strong OCR and multilingual capabilities. smolvlm systematically explore design parameters to train VLMs as small as 256M parameters.
Multilingual VLMs. Many lightweight VLMs (paligemma; paligemma2; phi3) achieve strong English performance but degrade on other languages. qwen_2_vl and internvl_1_5 address this through targeted multilingual training data. pangea introduce instruction-tuning data spanning 39 languages.
Retaining text-only performance. Multimodal training often degrades text-only capabilities. Mitigation strategies include balanced data mixtures, careful learning rate scheduling (cauldron), and partial backbone freezing (llava_onevision; internvl_3_5).
## 3 Model Architecture
(Figure 1, described: a 2728 $\times$ 2846 input image is split into 12 overlapping tiles plus a global thumbnail, producing 2,366 image tokens; combined with the prompt “what is the name of this lady?”, the model answers “The name is Lenna (or Lena).”)
Figure 1: Architecture of jina-vlm. Images are resized to fit a grid of up to 12 overlapping tiles, plus a global thumbnail. Each tile is a square 378 $\times$ 378 crop; adjacent tiles overlap by 112 pixels with a stride of 266 pixels between tile origins. A 4 $\times$ 3 grid therefore spans 1176 $\times$ 910 pixels, and images exceeding this effective resolution are downscaled to fit the tile budget. Each tile produces 729 patches via SigLIP2 (siglip2). The VL connector concatenates features from layers 24 and 18, the third- and ninth-to-last layers, then applies 2 $\times$ 2 attention pooling to reduce 729 tokens to 182 before projecting to the decoder dimension. Visual tokens are combined with text embeddings for the Qwen3 decoder (qwen3).
Figure 1 illustrates the architecture of jina-vlm. The model uses overlapping image tiling following molmo, combined with attention-based token pooling to reduce sequence length while preserving spatial information.
The vision encoder, SigLIP2-So400M/14-384, is a 27-layer Vision Transformer with 400M parameters that processes 378 $\times$ 378 pixel inputs as 27 $\times$ 27 grids of 14 $\times$ 14 patches. To handle arbitrary resolutions, we decompose each image into overlapping tiles of this size and process each tile independently through the encoder. A global thumbnail, the full image resized to 378 $\times$ 378, provides context alongside the tile representations. We use a default of 12 tiles during training; this limit can be increased at inference or during continued training to handle higher resolutions, with memory scaling linearly with tile count. The tiling algorithm is detailed in Appendix A.1.
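The tile-geometry arithmetic above (overlap, stride, grid extent) can be checked in a few lines; the constants are the defaults stated in the paper and in the Figure 1 caption:

```python
# Sketch of the tile-geometry arithmetic (constants from the paper's defaults).
TILE = 378               # square tile side in pixels
PATCH = 14               # ViT patch size
OVERLAP = 112            # overlap between adjacent tiles, in pixels
STRIDE = TILE - OVERLAP  # 266 px between tile origins

def grid_extent(rows: int, cols: int) -> tuple[int, int]:
    """Pixel extent covered by a rows x cols grid of overlapping tiles."""
    height = (rows - 1) * STRIDE + TILE
    width = (cols - 1) * STRIDE + TILE
    return height, width

patches_per_tile = (TILE // PATCH) ** 2  # 27 x 27 = 729

print(grid_extent(3, 4))   # a 4 x 3 grid spans 1176 x 910 pixels -> (910, 1176)
print(patches_per_tile)    # 729
```

Images larger than the extent of the chosen grid are downscaled to fit, so memory grows linearly with the tile budget rather than with raw resolution.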
### 3.1 Vision-Language Connector
Rather than using the final ViT output, jina-vlm concatenates features from two intermediate layers: the third-to-last and ninth-to-last, corresponding to layers 24 and 18 of the 27-layer encoder. This captures both fine-grained spatial details from earlier layers and high-level semantics from later layers. The connector then applies attention pooling over 2 $\times$ 2 patch neighborhoods, using mean-pooled features as queries. This reduces the token count by 4 $\times$ while preserving local structure. A SwiGLU projection layer maps the pooled representations to the language model’s embedding dimension.
In more formal terms, let $\mathbf{H}^{(\ell)}\in\mathbb{R}^{N\times d_{v}}$ denote the hidden states from ViT layer $\ell$ , where $N$ is the number of patches, $d_{v}$ is the vision encoder hidden size, and negative indices count from the final layer (e.g., $\ell=-1$ is the last layer). We concatenate features from two internal layers:
$$
\mathbf{H}_{\text{concat}}=[\mathbf{H}^{(-3)};\mathbf{H}^{(-9)}]\in\mathbb{R}^{N\times 2d_{v}} \tag{1}
$$
For each $2{\times}2$ patch neighborhood $\mathcal{N}_{i}$ , we compute a query vector as the mean of the neighborhood features:
$$
\mathbf{q}_{i}=\frac{1}{4}\sum_{j\in\mathcal{N}_{i}}\mathbf{h}_{j},\quad\mathbf{Q}=[\mathbf{q}_{1};\dots;\mathbf{q}_{M}]\in\mathbb{R}^{M\times 2d_{v}} \tag{2}
$$
where $\mathcal{N}_{i}$ contains the four patches at positions $(2i_{x},2i_{y})$ , $(2i_{x}+1,2i_{y})$ , $(2i_{x},2i_{y}+1)$ , and $(2i_{x}+1,2i_{y}+1)$ and $M=N/4$ .
Attention pooling is then computed as:
$$
\mathbf{H}_{\text{pooled}}=(\text{softmax}\left(\frac{\mathbf{Q}\mathbf{W}_{Q}(\mathbf{H}_{\text{concat}}\mathbf{W}_{K})^{\top}}{\sqrt{d_{k}}}\right)\mathbf{H}_{\text{concat}}\mathbf{W}_{V})\mathbf{W}_{O}\in\mathbb{R}^{M\times d_{v}} \tag{3}
$$
where $d_{k}=d_{v}$ and $\mathbf{W}_{Q}\in\mathbb{R}^{2d_{v}\times d_{k}}$ , $\mathbf{W}_{K}\in\mathbb{R}^{2d_{v}\times d_{k}}$ , $\mathbf{W}_{V}\in\mathbb{R}^{2d_{v}\times 2d_{v}}$ and $\mathbf{W}_{O}\in\mathbb{R}^{2d_{v}\times d_{v}}$ are learnable weight matrices. Finally, the pooled visual features are projected to the language model embedding dimension via a SwiGLU (swiglu) layer:
$$
\mathbf{H}_{\text{proj}}=\left(\text{Swish}(\mathbf{H}_{\text{pooled}}\mathbf{W}_{1})\odot(\mathbf{H}_{\text{pooled}}\mathbf{W}_{2})\right)\mathbf{W}_{3}\in\mathbb{R}^{M\times d_{l}} \tag{4}
$$
where $\text{Swish}(x)=x\cdot\sigma(x)$ , $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication, $\mathbf{W}_{1},\mathbf{W}_{2}\in\mathbb{R}^{d_{v}\times 3d_{l}}$ , $\mathbf{W}_{3}\in\mathbb{R}^{3d_{l}\times d_{l}}$ are learnable parameters, and $d_{l}$ is the language model embedding size.
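A minimal NumPy sketch of Eqs. (1)–(4), with toy dimensions and random stand-ins for the learned weight matrices; following Eq. (3) as written, each pooled query attends over all patch positions:

```python
# Toy NumPy sketch of the VL connector (Eqs. 1-4); dimensions are illustrative
# and weights are random stand-ins for the learned matrices.
import numpy as np

rng = np.random.default_rng(0)
G, d_v, d_l = 4, 8, 16          # G x G patch grid, vision dim, LM dim
N, M = G * G, (G * G) // 4      # patch count and pooled token count

# Eq. (1): concatenate two intermediate ViT layers along the feature axis.
H_l3, H_l9 = rng.standard_normal((N, d_v)), rng.standard_normal((N, d_v))
H_concat = np.concatenate([H_l3, H_l9], axis=-1)            # (N, 2*d_v)

# Eq. (2): one query per 2x2 neighborhood, the mean of its four patches.
grid = H_concat.reshape(G, G, 2 * d_v)
Q = grid.reshape(G // 2, 2, G // 2, 2, -1).mean(axis=(1, 3)).reshape(M, -1)

# Eq. (3): attention pooling with the neighborhood means as queries.
d_k = d_v
W_Q, W_K = rng.standard_normal((2 * d_v, d_k)), rng.standard_normal((2 * d_v, d_k))
W_V, W_O = rng.standard_normal((2 * d_v, 2 * d_v)), rng.standard_normal((2 * d_v, d_v))
scores = (Q @ W_Q) @ (H_concat @ W_K).T / np.sqrt(d_k)      # (M, N)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
H_pooled = (attn @ (H_concat @ W_V)) @ W_O                  # (M, d_v)

# Eq. (4): SwiGLU projection into the language-model embedding space.
W_1, W_2 = rng.standard_normal((d_v, 3 * d_l)), rng.standard_normal((d_v, 3 * d_l))
W_3 = rng.standard_normal((3 * d_l, d_l))
swish = lambda x: x * (1.0 / (1.0 + np.exp(-x)))
H_proj = (swish(H_pooled @ W_1) * (H_pooled @ W_2)) @ W_3   # (M, d_l)

print(H_concat.shape, H_pooled.shape, H_proj.shape)
```

The shapes trace the compression: $N$ patch features enter, $M = N/4$ pooled tokens in the decoder's embedding dimension leave.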
### 3.2 Language Decoder
The language decoder is initialized from Qwen3-1.7B-Base (https://huggingface.co/Qwen/Qwen3-1.7B-Base), which empirically outperformed the instruction-tuned variant in our setting. We introduce three special tokens to structure visual inputs: `<im_start>` and `<im_end>` delimit image and thumbnail sequences, while `<im_col>` marks row boundaries within the patch grid, where tokens are arranged left-to-right and top-to-bottom. Input and output embedding weights are not tied.
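The serialization below is an illustrative sketch, not the released format: the `<patch_r_c>` placeholders stand in for pooled visual embeddings, and the exact interleaving of the `<im_col>` markers is our assumption based on the description above.

```python
# Hypothetical layout of a pooled patch grid using the three special tokens.
def serialize_image(rows: int, cols: int) -> list[str]:
    """Lay out a rows x cols pooled patch grid, left-to-right, top-to-bottom."""
    seq = ["<im_start>"]
    for r in range(rows):
        seq.extend(f"<patch_{r}_{c}>" for c in range(cols))  # one row of patches
        seq.append("<im_col>")                               # row-boundary marker
    seq.append("<im_end>")
    return seq

print(serialize_image(2, 3))
```

The same bracketing would apply to the thumbnail sequence, which precedes or follows the tile sequences in the decoder input.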
### 3.3 Efficiency Analysis
Table 1 quantifies the computational benefits of attention pooling. With the default 12-tile configuration (plus thumbnail), the unpooled baseline would produce 9,477 visual tokens per image, while our 2 $\times$ 2 pooling reduces this to 2,366 tokens. Since the ViT processes each tile identically regardless of pooling, the savings apply exclusively to the LLM: we observe a 3.9 $\times$ reduction in prefill FLOPs and a 4 $\times$ reduction in KV-cache memory. The overall FLOPs reduction is 2.3 $\times$ when including the shared ViT cost.
Table 1: Efficiency comparison with and without 2 $\times$ 2 attention pooling for the default 12-tile configuration. FLOPs are computed for LLM prefill; KV-cache assumes fp16 precision.
| Metric | w/o pooling | w/ 2 $\times$ 2 pooling | Reduction |
| --- | --- | --- | --- |
| Visual tokens | 9,477 | 2,366 | 4.0 $\times$ |
| LLM prefill FLOPs | 27.2 TFLOPs | 6.9 TFLOPs | 3.9 $\times$ |
| KV-cache memory | 2.12 GB | 0.53 GB | 4.0 $\times$ |
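The token arithmetic behind Table 1 follows directly from the defaults stated above:

```python
# Token counts for the default configuration (12 grid tiles + 1 thumbnail).
TILES = 12 + 1                 # grid tiles plus the global thumbnail
TOKENS_PER_TILE = 729          # 27 x 27 patches from SigLIP2
POOLED_PER_TILE = 182          # after 2x2 attention pooling

unpooled = TILES * TOKENS_PER_TILE   # visual tokens without pooling
pooled = TILES * POOLED_PER_TILE     # visual tokens with pooling

print(unpooled, pooled, round(unpooled / pooled, 1))   # 9477 2366 4.0
```

The 3.9 $\times$ prefill-FLOPs figure is slightly below 4.0 $\times$ because attention cost scales quadratically rather than linearly in the (shorter) text portion of the sequence.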
## 4 Training
Training proceeds in two stages, both updating all model components (encoder, connector, and decoder) without freezing, following molmo. The combined data comprises approximately 5M multimodal samples and 12B text tokens across 30+ languages, with roughly half in English and the remainder spanning high- and moderate-resource languages. Table 2 summarizes hyperparameters for both stages.
### 4.1 Stage 1: Alignment Training
The first stage focuses on cross-language semantic grounding rather than task-specific objectives. Training data consists primarily of caption datasets (PixmoCap (molmo), PangeaIns (pangea)) spanning diverse visual domains: natural scenes, documents, infographics, and diagrams. We include 15% text-only data from PleiAS/common_corpus (pleias) to mitigate degradation on text-only tasks. The connector uses a higher learning rate and shorter warmup than the encoder and decoder.
### 4.2 Stage 2: Instruction Fine-tuning
The second stage trains instruction-following for VQA and reasoning tasks. We combine public dataset collections, including LLaVA OneVision (llava_onevision), Cauldron (cauldron), Cambrian (cambrian1), PangeaIns (pangea), and FineVision (finevision), with text-only instruction data from aya. The mixture covers academic VQA, document understanding, OCR, mathematics, and reasoning. Appendix A.2 shows representative examples.
Given the diversity of the instruction data, single-source batches proved more effective early in fine-tuning, as each batch then presents a consistent format and task. We train for 30K steps with single-source batches, followed by 30K steps with mixed-source batches.
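The two-phase batching schedule can be sketched as follows; the dataset names and the sampler itself are illustrative stand-ins, not the released training code:

```python
# Hedged sketch of single-source vs. mixed-source batching.
import random

def make_batches(datasets: dict[str, list], batch_size: int,
                 mixed: bool, steps: int, seed: int = 0):
    """Yield batches drawn from one source (phase 1) or all sources (phase 2)."""
    rng = random.Random(seed)
    names = list(datasets)
    for _ in range(steps):
        if mixed:
            # Mixed-source: each example independently sampled from any dataset.
            yield [rng.choice(datasets[rng.choice(names)]) for _ in range(batch_size)]
        else:
            # Single-source: the entire batch comes from one dataset.
            name = rng.choice(names)
            yield [rng.choice(datasets[name]) for _ in range(batch_size)]

data = {"vqa": list(range(10)), "ocr": list(range(100, 110))}
first = next(make_batches(data, batch_size=4, mixed=False, steps=1))
print(len(first))   # 4
```

In the single-source phase every batch is format-homogeneous; the mixed phase then exposes the model to cross-task gradients within each step.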
Table 2: Model training hyperparameters across pre-training and fine-tuning stages.
| Hyperparameter | Pre-Training | Fine-Tuning |
| --- | --- | --- |
| Warmup ViT | 10% | 10% |
| Warmup Con. | 1% | 10% |
| Warmup LLM | 10% | 10% |
| LR ViT | 6e-6 | 5e-6 |
| LR Con. | 2e-4 | 5e-6 |
| LR LLM | 2e-5 | 1e-5 |
| Cosine Decay | 0.1 | 0.1 |
| Eps. | 1e-6 | 1e-6 |
| Betas | 0.9, 0.95 | 0.9, 0.95 |
| Batch Size | 128 | 256 |
| Steps | 25K | 60K |
| Samples | 3.2M | 15.3M |
| Tokens | 10B | 37B |
| GPU Hours | 296 | 1,000 |
## 5 Evaluation
We compare jina-vlm against lightweight VLMs across seven capability areas: general VQA, multimodal comprehension, multi-image reasoning, hallucination control, mathematical reasoning, text-only performance, and multilingual understanding. All evaluations use VLMEvalKit (https://github.com/open-compass/VLMEvalKit) (vlmevalkit) with English prompts matching our training format (e.g., “Return only the letter of the best answer option” for multiple-choice, “Respond very briefly” for open-ended questions).
### 5.1 General VQA Tasks
Table 3 reports results on eight VQA benchmarks covering diagrams (AI2D (ai2d)), charts (ChartQA (chartqa), CharXiv (charxiv)), scene text (TextVQA (textvqa)), documents (DocVQA (docvqa), InfoVQA (infovqa)), OCR (OCRBench (ocrbench)), and diverse scenes (SEED-Bench-2-Plus (seedbench2plus)). jina-vlm achieves the highest average (72.3), with particularly strong performance on diagram interpretation and text extraction.
Table 3: Comparison of general visual question answering performance.
| Model | AI2D | ChartQA (test avg) | TextVQA (val) | DocVQA (val) | InfoVQA (val) | OCRBench | SEED-Bench-2-Plus | CharXiv (RQ / DQ) | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| jina-vlm | $82.0$ | $81.9$ | $83.2$ | $90.6$ | $71.6$ | $778$ | $67.2$ | 32.3 / 63.5 | $72.3$ |
| Qwen2-VL-2B | $74.7$ | $73.5$ | $79.7$ | $89.2$ * | $64.0$ * | $809$ | $62.4$ | 23.3 / 55.0* | $66.4$ |
| Qwen3-VL-2B | $76.9$ | $77.2$ | $79.5$ | $92.3$ * | $71.9$ * | $858$ | $67.3$ * | 28.8 / 62.3 | $71.6$ |
| InternVL3-2B | $78.6$ | $80.2$ | $77.0$ | $87.4$ * | $67.1$ * | $835$ | $64.6$ | 28.3 / 54.7 | $69.2$ |
| InternVL3.5-2B | $78.8$ | $80.7$ | $76.5$ | $88.5$ * | $69.3$ * | $836$ | $68.0$ | 31.6 / 65.0 | $71.6$ |
Results for models other than jina-vlm are from their respective papers (internvl_3_5; internvl_3; qwen_2_vl), except those marked with * which were computed using VLMEvalKit. All scores represent accuracy (%) except OCRBench, which uses a 0–1000 scale; for the overall average, OCRBench scores are divided by 10 to align with the 0–100 scale of the other benchmarks.
### 5.2 Document and Real-World Understanding
Table 4 shows results on multimodal comprehension (MME (mme), MMB v1.1 (mmbench), MMStar (mmstar)) and real-world understanding (RealWorldQA (realworldqa), MME-RealWorld (mmerealworld), R-Bench (rbench)). jina-vlm scores 67.4 on multimodal tasks and 61.9 on real-world tasks, achieving the best RealWorldQA result (68.2).
Table 4: Comparison of generic multimodal understanding and real-world understanding performance.
| Model | MME | MMB v1.1 | MMStar | Avg | RealWorldQA | MME-RealWorld | R-Bench | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| jina-vlm | $1965.8$ | $75.8$ | $56.2$ | $67.4$ | $68.2$ | $50.7$ | $66.7$ | $61.9$ |
| Qwen2-VL-2B | $1872.0$ | $72.2$ | $48.0$ | $62.4$ | $62.9$ | $38.7$ * | $63.2$ | $55.0$ * |
| Qwen3-VL-2B | $2000.8$ * | $77.8$ | $58.3$ | $69.2$ | $63.9$ | $57.9$ * | $67.3$ * | $63.0$ |
| InternVL3-2B | $2221.2$ | $78.6$ | $60.7$ | $72.9$ | $64.3$ | $53.8$ | $67.5$ | $61.9$ |
| InternVL3.5-2B | $2123.3$ | $76.6$ | $62.7$ | $71.7$ | $62.0$ | $49.7$ | $62.4$ | $58.0$ |
Results for models other than jina-vlm are from their respective papers (internvl_3_5; internvl_3; qwen_2_vl), except those marked with * which are computed using VLMEvalKit. All scores represent accuracy (%) except MME, which uses a 0–2800 scale; for the overall average, MME scores are divided by 28 to align with the 0–100 scale of the other benchmarks.
### 5.3 Multi-Image Reasoning and Hallucination
Table 5 reports multi-image reasoning (BLINK (blink), MuirBench (muirbench), MMT (mmtbench)) and hallucination benchmarks that measure the tendency to fabricate visual details (HallBench (hallusionbench), POPE (pope)). jina-vlm scores 47.3 on multi-image tasks, which is expected given limited multi-image training data, but achieves the best POPE score (90.3), indicating low hallucination rates.
Table 5: Comparison of multi-image and hallucination performance.
| Model | BLINK | MuirBench | MMT | Avg | HallBench | POPE | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| jina-vlm | $50.1$ | $34.7$ | $57.2$ | $47.3$ | $39.1$ | $90.3$ | $64.7$ |
| Qwen2-VL-2B | $44.4$ | $25.5$ * | $55.1$ | $41.7$ | $41.7$ | $87.9$ * | $64.8$ |
| Qwen3-VL-2B | $53.8$ | $47.4$ | $60.0$ * | $53.7$ | $44.5$ | $88.9$ * | $66.7$ |
| InternVL3-2B | $50.3$ | $38.8$ | $59.5$ | $49.5$ | $42.5$ | $89.6$ | $66.1$ |
| InternVL3.5-2B | $51.3$ | $44.0$ | $58.5$ | $51.3$ | $48.6$ | $87.2$ | $67.9$ |
Results for models other than jina-vlm are from their respective papers, (internvl_3_5; internvl_3; qwen_2_vl), except those marked with * which are computed using VLMEvalKit. All scores represent accuracy (%).
### 5.4 Mathematical Reasoning
Table 6 reports structured reasoning benchmarks: multidisciplinary comprehension (MMMU (mmmu)), visual mathematics (MathVista (mathvista), MathVision (mathvision), MathVerse (mathverse), WeMath (wemath)), and logical reasoning (LogicVista (logicvista)). jina-vlm performs comparably to InternVL3-2B and outperforms Qwen2-VL-2B.
Table 6: Comparison of multimodal reasoning and mathematical problem-solving performance.
| Model | MMMU | MathVista | MathVision | MathVerse (Vision Only) | WeMath | LogicVista | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| jina-vlm | $45.6$ | 59.5 | 19.2 | 23.9 | 17.1 | 33.3 | $33.1$ |
| Qwen2-VL-2B | $41.1$ | 43.0 | 12.4 | 17.3* | 10.9* | 27.3* | $25.3$ |
| Qwen3-VL-2B | $53.4$ | 61.3 | 31.6 | 22.7* | 28.0* | 35.4* | $38.7$ |
| InternVL3-2B | $48.6$ | 57.0 | 21.7 | 25.3 | 22.4 | 36.9 | $35.3$ |
| InternVL3.5-2B | $59.0$ | 71.8 / 61.5† | 42.8 / 26.5† | 53.4 / 35.3† | 48.5 / 19.1† | 47.7 / 41.4† | $50.7$ |
Results for models other than jina-vlm are from their respective papers, (internvl_3_5; internvl_3; qwen_2_vl), except those marked with * which are computed using VLMEvalKit. $\dagger$ indicates scores for InternVL3.5-2B without thinking mode, evaluated using VLMEvalKit. All scores represent accuracy (%).
### 5.5 Text-Only Performance
Table 7 compares jina-vlm against the backbone Qwen3-1.7B on text-only benchmarks: MMLU (mmlu), MMLU-Pro (mmlupro), GSM-8K (gsm8k), ARC-C (arc), and HellaSwag (hellaswag). Results show mixed preservation of text-only capabilities: jina-vlm matches or exceeds the backbone on commonsense reasoning (ARC-C, HellaSwag) and retains most performance on MMLU and GSM-8K. However, MMLU-Pro shows substantial degradation (46.4 $\rightarrow$ 30.3), likely because this benchmark emphasizes extended multi-step reasoning that conflicts with our instruction-tuning toward concise visual responses. This suggests a trade-off between optimizing for multimodal tasks and preserving complex text-only reasoning, which future work could address through more balanced data mixtures or curriculum scheduling.
Table 7: Comparison on text-only benchmarks.
| Model | MMLU | MMLU-Pro | GSM-8K | ARC-C | HellaSwag | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| jina-vlm | $56.1$ | $30.3$ | $71.3$ | $77.3$ | $59.4$ | $58.9$ |
| Qwen3-1.7B | $62.6$ | $46.4$ | $75.3$ | $73.4$ | $59.0$ | $63.3$ |
Results are collected using our evaluation code. All scores represent accuracy (%).
### 5.6 Multilingual Understanding
Table 8 reports multilingual multimodal benchmarks: MMMB (mmmb), Multilingual MMBench (mmmb), and MTVQA (mtvqa). jina-vlm achieves state-of-the-art multilingual performance among 2B-scale VLMs, with the highest averages on MMMB (78.8) and Multilingual MMBench (74.3).
Table 8: Comparison of multilingual multimodal understanding performance.
| Benchmark | | jina-vlm | Qwen2-VL-2B | Qwen3-VL-2B | InternVL3-2B | InternVL3.5-2B |
| --- | --- | --- | --- | --- | --- | --- |
| MMMB | ar | $80.0$ | $68.3$ | $72.7$ * | $68.5$ | $68.6$ |
| | cn | $76.9$ | $74.2$ | $75.7$ * | $78.3$ | $77.7$ |
| | en | $82.0$ | $78.3$ | $80.7$ * | $81.9$ | $80.2$ |
| | pt | $79.2$ | $72.6$ | $75.0$ * | $75.4$ | $75.9$ |
| | ru | $79.2$ | $72.8$ | $75.9$ * | $74.6$ | $76.3$ |
| | tr | $75.5$ | $61.8$ | $68.5$ * | $62.9$ | $69.1$ |
| | avg | $78.8$ | $71.3$ | $75.0$ * | $73.6$ | $74.6$ |
| Multi. MMBench | ar | $70.0$ | $66.7$ | $66.2$ * | $66.4$ | $63.7$ |
| | cn | $75.9$ | $67.0$ | $75.7$ * | $77.8$ | $75.9$ |
| | en | $78.8$ | $71.1$ | $77.8$ * | $81.3$ | $78.4$ |
| | pt | $74.7$ | $72.1$ | $71.4$ * | $75.9$ | $73.7$ |
| | ru | $75.3$ | $69.9$ | $75.9$ * | $70.7$ | $71.4$ |
| | tr | $71.1$ | $69.3$ | $67.0$ * | $59.5$ | $62.0$ |
| | avg | $74.3$ | $69.4$ | $72.3$ * | $71.9$ | $70.9$ |
| MTVQA | | $25.6$ | $20.6$ | $27.3$ * | $26.7$ | $28.5$ |
| Overall | | $59.6$ | $53.8$ | $58.2$ | $57.4$ | $58.0$ |
Results for baseline models are derived from their original publications, (internvl_3_5; internvl_3; qwen_2_vl), except those marked with * which are computed using VLMEvalKit. All scores represent accuracy (%).
## 6 Conclusion
We presented jina-vlm, a 2.4B vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. Our results demonstrate that small VLMs can attain strong cross-lingual visual understanding through careful architectural and training choices: attention-based token pooling reduces visual tokens by 4 $\times$ while preserving spatial information, and incorporating text-only data during multimodal training mitigates the catastrophic forgetting typically observed in vision-adapted language models. On standard English VQA benchmarks, jina-vlm achieves leading results, demonstrating that multilingual capabilities need not come at the cost of general performance.
The current approach has limitations. Multi-tile processing introduces computational overhead that scales with image resolution, and tiling can fragment global spatial context, potentially impairing performance on tasks requiring holistic scene understanding such as object counting or precise spatial reasoning across tile boundaries. While the global thumbnail partially mitigates this, native-resolution approaches (navit) may be better suited for such tasks. We have not emphasized safety-critical training or alignment, and multi-image reasoning remains weak due to limited training data in this regime. Future work could explore more efficient resolution handling, targeted improvements for counting and spatial tasks, and investigate whether our multilingual training recipe transfers to larger model scales.
## Appendix A Appendix
### A.1 Pseudocode for Creating Overlapping Tiles
Input: Image $I$ of size $(h,w)$; base input size $\mathbf{b}=(b_{h},b_{w})$, default $(378,378)$; patch size $p$, default $14$; maximum number of tiles $M$, $12$ by default and configurable; overlap margins $(m_{L},m_{R})$ in patches, default $(4,4)$.
Output: List of tiles $\mathcal{C}$ (thumbnail + grid tiles); tiling $(t_{h},t_{w})$ = (number of rows, number of columns).
1. Compute overlap-related sizes
$m_{\text{tot}}\leftarrow p\cdot(m_{L}+m_{R})$ // Total overlap margin in pixels
$s_{\text{win}}\leftarrow\bigl(\lfloor b_{h}/p\rfloor-(m_{L}+m_{R})\bigr)\cdot p$ // Tile stride in pixels
2. Select tiling on the margin-reduced image
$(t_{h},t_{w})\leftarrow\textsc{SelectTilingWithMinimalScaleChange}\bigl(h-m_{\text{tot}},\,w-m_{\text{tot}},\,s_{\text{win}},\,M\bigr)$ ;
3. Resize image to exactly fit the chosen tiling + margins;
$H^{\prime}\leftarrow t_{h}\cdot s_{\text{win}}+m_{\text{tot}}$ ;
$W^{\prime}\leftarrow t_{w}\cdot s_{\text{win}}+m_{\text{tot}}$ ;
$I_{\text{grid}}\leftarrow\textsc{Resize}(I,\;[H^{\prime},W^{\prime}])$ ;
4. Extract overlapping tiles
$\mathcal{G}\leftarrow\textsc{ExtractTiles}\bigl(I_{\text{grid}},\,(t_{h},t_{w}),\,s_{\text{win}},\,b_{h}\bigr)$ // $b_{h}$ is the tile height, equal to $b_{w}$ here
5. Build thumbnail and final tile list
$T\leftarrow\textsc{Resize}(I,\;[b_{h},b_{w}])$ // Global thumbnail
$\mathcal{C}\leftarrow[T]\mathbin{+\!\!+}\mathcal{G}$ // Concatenate thumbnail and tiles
return $(\mathcal{C},(t_{h},t_{w}))$ ;
Algorithm 1 GetAllTilesOverlapAndResize
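Algorithm 1 can be rendered in Python roughly as follows; the nearest-neighbor `resize` and the log-scale cost inside the tiling search are stand-ins for the actual implementation of `Resize` and `SelectTilingWithMinimalScaleChange`:

```python
# Sketch of Algorithm 1 with the paper's defaults (378x378 tiles, patch 14,
# 4-patch margins, at most 12 grid tiles).
import numpy as np

def get_all_tiles_overlap_and_resize(image: np.ndarray, base=(378, 378),
                                     patch=14, max_tiles=12, margins=(4, 4)):
    h, w = image.shape[:2]
    b_h, b_w = base
    m_tot = patch * sum(margins)                     # total overlap in pixels
    s_win = (b_h // patch - sum(margins)) * patch    # tile stride in pixels

    # Step 2: pick the tiling with minimal scale change on the margin-reduced image.
    best, best_cost = (1, 1), float("inf")
    for t_h in range(1, max_tiles + 1):
        for t_w in range(1, max_tiles // t_h + 1):
            scale = max((h - m_tot) / (t_h * s_win), (w - m_tot) / (t_w * s_win))
            cost = abs(np.log(max(scale, 1e-9)))     # prefer a scale close to 1
            if cost < best_cost:
                best, best_cost = (t_h, t_w), cost
    t_h, t_w = best

    # Step 3: resize so the image exactly fits the chosen tiling plus margins.
    H, W = t_h * s_win + m_tot, t_w * s_win + m_tot
    def resize(img, size):   # nearest-neighbor stand-in for a real resampler
        ys = np.arange(size[0]) * img.shape[0] // size[0]
        xs = np.arange(size[1]) * img.shape[1] // size[1]
        return img[ys][:, xs]
    grid_img = resize(image, (H, W))

    # Step 4: extract overlapping b_h x b_w tiles at stride s_win.
    tiles = [grid_img[r * s_win:r * s_win + b_h, c * s_win:c * s_win + b_w]
             for r in range(t_h) for c in range(t_w)]

    # Step 5: prepend the global thumbnail and return tiles + tiling shape.
    return [resize(image, base)] + tiles, (t_h, t_w)

img = np.zeros((800, 1200, 3), dtype=np.uint8)
tiles, tiling = get_all_tiles_overlap_and_resize(img)
print(tiling, len(tiles), tiles[1].shape)   # (3, 4) 13 (378, 378, 3)
```

With these defaults the stride is $266$ pixels and the margin $112$ pixels, so an 800 $\times$ 1200 input selects a 3 $\times$ 4 grid resized to 910 $\times$ 1176 and yields 13 crops including the thumbnail.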
### A.2 Training Set Examples
Captioning & Instruction
Dataset: VisualWebInstruct visualwebinstruct
(Sample image, described: a conflict-resolution documentation framework comprising a meeting-notes template, an official communication template, and a list of best practices for documentation.)
**Question:** what is the meeting title? **Answer:** Conflict Resolution Meeting
Figure 2: Answering questions over web documents.
**Charts & Tables**
Dataset: TAT-QA (tatqa)
<details>
<summary>selected_samples/tatqa.png Details</summary>

### Visual Description
## Table: Unrecognized Tax Benefit and Tax Position Changes (Fiscal Years Ended July 31)
### Overview
The table presents financial data for unrecognized tax benefits and related tax position changes across three fiscal years (2017–2019). It includes beginning/ending balances, gross increases/decreases from prior and current periods.
### Components/Axes
- **Rows**:
1. Unrecognized tax benefit - beginning of period
2. Gross increases - prior period tax positions
3. Gross decreases - prior period tax positions
4. Gross increases - current period tax positions
5. Unrecognized tax benefit - end of period
- **Columns**: Fiscal years ended July 31, 2019, 2018, 2017.
- **Values**: Monetary amounts in USD (e.g., $10,321, 98, -88, etc.).
### Detailed Analysis
| Row Description | 2019 | 2018 | 2017 |
|------------------------------------------|------------|------------|------------|
| Unrecognized tax benefit - beginning | $10,321 | $9,346 | $7,687 |
| Gross increases - prior period tax positions | 98 | 729 | 712 |
| Gross decreases - prior period tax positions | (88) | (878) | (691) |
| Gross increases - current period tax positions | 1,302 | 1,124 | 1,638 |
| Unrecognized tax benefit - end of period | $11,633 | $10,321 | $9,346 |
### Key Observations
1. **Unrecognized Tax Benefit Growth**:
- The ending unrecognized tax benefit increased annually:
- 2017: $9,346 → 2018: $10,321 (+10.4%) → 2019: $11,633 (+12.7%).
2. **Gross Increases/Decreases**:
- **Prior Period Adjustments**:
- Gross increases (2019: $98) were significantly lower than 2018 ($729) and 2017 ($712).
- Gross decreases (2019: -$88) were smaller in magnitude compared to 2018 (-$878) and 2017 (-$691).
- **Current Period Adjustments**:
- Gross increases (2019: $1,302) were lower than 2017 ($1,638) but higher than 2018 ($1,124).
3. **Net Impact**:
- The combination of smaller prior-period adjustments (2019) and moderate current-period increases drove the steady growth in unrecognized tax benefits.
### Interpretation
The data suggests a **consistent increase in unrecognized tax benefits** over the three years, driven by:
- **Reduced reversals of prior-period tax positions** (lower gross decreases in 2019).
- **Moderate additions from current-period tax positions** (2019: $1,302).
This trend may reflect changes in tax regulations, accounting policies, or strategic tax planning. The decline in prior-period gross decreases (2019: -$88 vs. 2018: -$878) indicates fewer adjustments to historical tax positions, while current-period increases remain stable, contributing to the overall growth in liabilities.
</details>
**Question:** Unrecognized Tax Benefits Activity related to unrecognized tax benefits is as follows (in thousands): … As of July 31, 2019, the Company has no income tax audits in progress in the U.S. or foreign jurisdictions. What was the increase in unrecognized tax benefits in 2019? **Answer:** $1.3 million.
Figure 3: Financial table requiring numerical reasoning over text.
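The numerical reasoning behind this answer can be sanity-checked: the end-of-period balance follows from the beginning balance plus the signed adjustments, and the asked-for increase is the difference of the two balances. A minimal sketch (values in thousands, read off the table above):

```python
# Reconcile the fiscal-2019 unrecognized-tax-benefit roll-forward (values in $ thousands).
beginning = 10_321
gross_increases_prior = 98
gross_decreases_prior = -88
gross_increases_current = 1_302

ending = beginning + gross_increases_prior + gross_decreases_prior + gross_increases_current
assert ending == 11_633  # matches the "end of period" row for 2019

# The increase during fiscal 2019, as asked in the question.
increase = ending - beginning
print(f"Increase: ${increase} thousand (~${increase / 1000:.1f} million)")  # ~$1.3 million
```

The roll-forward closes exactly, so the ground-truth answer ($1.3 million) is the rounded difference of the two balance rows.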
**Document Understanding & Infographics**
Dataset: DocVQA (docvqa)
<details>
<summary>selected_samples/docvqa.png Details</summary>

### Visual Description
## Form: Response Code Request Form
### Overview
This is a structured form for requesting a response code for a marketing campaign by RJR Brand Eclipse. It includes details about distribution, timing, agency, fulfillment, and responsible personnel.
### Components/Axes
- **Header**:
- From: Joyce Bagby
- To: Brian Robbins
- Date: 6/14/96
- Time: 12:18:14
- Page: 2 of 7
- Fax: (910) 741-1672
- Attention: Joyce Bagby
- **Project Details**:
- RJR Brand: Eclipse
- RJR Program #: 601778
- Response Code: U19
- Project Name: 2-Free Pack Request - Viewership Interest Confirmation #2
- Description: Mail Launch Mailers without an invitation to Sneak Previews in Charlotte and Indianapolis to confirm interest and intent levels. Mailer includes a BRC offer for a free trial (response code U19). If returned, smoker receives two free packs.
- **Distribution**:
- Quantity: 750
- Pct. Response: 20.00%
- # of Responders: 150
- Distribution Vehicle: Direct Marketing
- **Timing**:
- DTS: 07/10/96
- Expiration Date: 08/01/96
- Offer Complete: 07/11/96
- Days Offer Open: 1
- **Data Entry**:
- Supplier: M/A/R/C
- Incoming mail: BRC
- If M/A/R/C is supplier, do you need a M/A/R/C P.O. Box?: Yes
- **Agency**:
- Agency Name: Mezzina/Brown
- Contact Person: Tami LaManna
- Phone #: (212) 251-7700
- **Fulfillment**:
- Company: M/A/R/C
- Job Number: (blank)
- Number of Items on Order Form: 1
- P.O. Box Title: Eclipse
- P.O. Box, City, State, Zip Code: (blank)
- **Comments**: (blank)
- **Responsibility**:
- Marketing Person Responsible for Project: Mike Duffy - 0910
- Production Contact: Joe Leinster - 359
- Response Code Requested By: Mike Duffy - 69100–69199
- **Copies of Confirmation Letter To**:
- Suzi Hicks
- Tami LaManna
- Vanessa Oakley
- Teresa Smith
- Jack Bellis
- Nancy Montgomery
### Detailed Analysis
- **Distribution Metrics**:
- 750 mailers distributed with a 20% response rate (150 responders).
- Direct Marketing as the distribution channel.
- **Timing Constraints**:
- Campaign starts on 07/10/96, expires on 08/01/96, and concludes on 07/11/96.
- Offer window is 1 day.
- **Agency and Fulfillment**:
- Agency: Mezzina/Brown (contact: Tami LaManna).
- Fulfillment handled by M/A/R/C, requiring a P.O. Box for responses.
- **Response Code**:
- Code U19 tied to a BRC offer for free trial packs.
### Key Observations
1. **High Response Rate**: 20% response rate suggests strong initial interest in the offer.
2. **Tight Timing**: Campaign duration is short (1 day offer window), indicating urgency.
3. **Multiple Stakeholders**: Confirmation letters sent to six individuals, highlighting cross-departmental coordination.
4. **P.O. Box Requirement**: Mandatory for M/A/R/C, emphasizing structured fulfillment processes.
### Interpretation
This form outlines a targeted marketing campaign to gauge consumer interest in Eclipse products via direct mail. The 20% response rate and 1-day offer window suggest a focus on immediate engagement. The involvement of multiple stakeholders (e.g., Mezzina/Brown, M/A/R/C) indicates a collaborative effort, with Mike Duffy overseeing both marketing and response code requests. The BRC offer (free trial packs) is central to the campaign’s incentive structure, aiming to convert interest into tangible customer action. The form’s structured format ensures clarity in roles, timelines, and metrics for tracking success.
</details>
**Question:** what is the response code ? **Answer:** U19
Figure 4: Document image with question about textual fields.
**OCR QA (text-centric VQA)**
Dataset: TextVQA (textvqa)
<details>
<summary>selected_samples/textvqa.png Details</summary>

### Visual Description
## Photograph: Yellow Taxi in Urban Night Setting
### Overview
The image captures a yellow taxi cab in motion on a city street at night. The taxi is the central subject, with motion blur indicating speed. The background features blurred urban elements, including illuminated buildings, streetlights, and other vehicles. The scene is set on a paved road with visible crosswalk markings.
### Components/Axes
- **Primary Subject**: Yellow taxi cab (centered, slightly right of frame).
- **Textual Elements**:
- Roof sign: "3G54" (black text on white background).
- License plate: "3G54" (black text on white background).
- Small black sign on driver’s side door (text illegible due to motion blur).
- **Environmental Elements**:
- Blurred city lights (yellow, white, red, and blue hues).
- Crosswalk stripes (white on dark asphalt).
- Road surface (dark asphalt with reflective markings).
### Detailed Analysis
- **Taxi Features**:
- Bright yellow body with black trim.
- Rectangular roof sign (mounted on the rear window) displaying "3G54" in bold black letters.
- License plate on the front bumper, also reading "3G54".
- Driver visible through the windshield (wearing dark clothing; details indistinct due to motion).
- **Motion Blur**:
- Background elements (buildings, lights, other vehicles) are heavily blurred, suggesting the taxi is moving at high speed.
- Crosswalk stripes and road surface show minimal blur, indicating the camera’s focus on the taxi.
- **Lighting**:
- Artificial lighting dominates, with warm yellow/orange tones from streetlights and cooler white/blue tones from building signage.
- Reflections on the taxi’s surface suggest wet or polished pavement.
### Key Observations
1. **Taxi Identification**: The repeated "3G54" on the roof sign and license plate confirms the vehicle’s unique identifier.
2. **Urban Context**: The blurred background implies a densely populated, active city environment.
3. **Lighting Dynamics**: The interplay of warm and cool tones highlights the contrast between natural and artificial light sources.
4. **Motion Emphasis**: The taxi’s sharp focus against the blurred background creates a sense of urgency and movement.
### Interpretation
- **Functional Role**: The taxi’s design (roof sign, license plate) and motion suggest it is actively transporting passengers in a high-demand urban area.
- **Technical Implications**: The motion blur and focus technique emphasize the taxi’s speed, potentially symbolizing the fast-paced nature of city life.
- **Ambiguity**: The illegible door sign and blurred background elements introduce uncertainty about specific contextual details (e.g., destination, time of day).
- **Cultural Significance**: Yellow taxis are iconic in many cities, often associated with accessibility and urban mobility.
## Notes on Data Limitations
- No numerical data, charts, or diagrams are present.
- Textual elements are limited to the taxi’s identifiers ("3G54") and unreadable signage.
- The image prioritizes visual storytelling over quantitative analysis.
</details>
**Question:** what number is the cab **Answer:** 3G54
Figure 5: Photo with a text-centric question requiring OCR.
**General VQA**
Dataset: VQAv2 (vqav2)
<details>
<summary>selected_samples/vqav2.png Details</summary>

### Visual Description
## Photograph: Skateboarder Performing Trick Over Picnic Table
### Overview
The image captures a dynamic moment of a skateboarder executing a trick over a black picnic table. The skateboarder is airborne, with their body tilted forward, arms extended for balance. A crowd of onlookers is visible in the background, observing the action. The setting appears to be an outdoor skate park or public space with a clear sky, some structures, and lighting poles. A watermark reading "© Ketil Johansen" is visible at the bottom of the image.
### Components/Axes
- **Textual Elements**:
- Watermark: "© Ketil Johansen" (bottom center, black text).
- No other visible text, labels, or axis markers.
### Detailed Analysis
- **Skateboarder**:
- Position: Mid-air, above the picnic table.
- Clothing: Black T-shirt with a colorful graphic (unreadable text), black pants, green-and-white cap.
- Skateboard: Visible beneath the skateboarder’s feet, tilted at an angle.
- **Crowd**:
- Group of 10–15 people standing behind a red-and-white barrier.
- Diverse attire (casual clothing, jackets, hats).
- Focused attention on the skateboarder.
- **Environment**:
- Picnic table: Black, rectangular, positioned horizontally.
- Background: Open sky with scattered clouds, distant buildings, and a tall light pole.
- Lighting: Bright daylight, casting shadows on the ground.
### Key Observations
- The skateboarder’s posture suggests a high level of skill and control during the trick.
- The crowd’s positioning and focus indicate a public demonstration or competition.
- The picnic table serves as a central obstacle, emphasizing the trick’s complexity.
- No numerical data, charts, or diagrams are present in the image.
### Interpretation
This image likely documents a skateboarding event or practice session, highlighting the interplay between the athlete’s technical execution and the audience’s engagement. The absence of textual data or structured elements (e.g., charts) suggests the focus is on capturing the physicality and artistry of skateboarding. The watermark implies the image is attributed to a photographer, possibly for professional or archival purposes. The scene underscores the cultural significance of skateboarding as a form of urban expression and performance.
</details>
**Question:** Where is he looking? **Answer:** down
Figure 6: General visual question answering on natural images.
**Grounding, Spatial & Counting**
Dataset: TallyQA (tallyqa)
<details>
<summary>selected_samples/tallyqa.png Details</summary>

### Visual Description
## Photograph: Man and Child on Motorcycle
### Overview
The image depicts a man and a child seated on a black motorcycle in an outdoor parking area. The man is positioned in the front, operating the motorcycle, while the child sits behind him. Both are wearing helmets, and the motorcycle has chrome accents and a windshield. The background includes other motorcycles, a red car, trees, and a cloudy sky.
### Components/Axes
- **Subjects**:
- **Man**: Wearing a black leather jacket, black helmet, and sunglasses. Seated in the front, gripping the handlebars.
- **Child**: Wearing a pink helmet with a unicorn horn, black hoodie, blue jeans, and orange sneakers. Seated behind the man.
- **Motorcycle**:
- Black with chrome accents (e.g., engine, exhaust, handlebars).
- Features a windshield, side storage compartments, and a rearview mirror.
- **Background**:
- Other motorcycles parked in the distance.
- A red car partially visible.
- Trees and a cloudy sky.
### Detailed Analysis
- **Man**:
- Positioned centrally, facing forward.
- Helmet has a reflective visor.
- Jacket has visible stitching details.
- **Child**:
- Sitting on the man’s lap, facing slightly to the side.
- Helmet has a pink base with a white unicorn horn.
- Hoodie has faint text ("LAVIN" partially visible).
- **Motorcycle**:
- Chrome engine and exhaust pipes.
- Windshield extends over the handlebars.
- Side storage compartments are closed.
- **Background**:
- Other motorcycles are parked in a row, suggesting a gathering or parking lot.
- Red car is parked diagonally in the background.
- Trees are leafless, indicating a cooler season.
### Key Observations
- No textual information (labels, signs, or text) is visible in the image.
- The man and child are the primary focus, with the motorcycle serving as a central object.
- The background elements (other vehicles, trees) provide context but no additional data.
### Interpretation
The image likely captures a casual or recreational moment, possibly a family outing or a motorcycle event. The child’s unicorn-themed helmet suggests a playful or personalized touch, while the man’s attire indicates preparedness for riding. The absence of text or explicit data points limits technical analysis, but the composition emphasizes the relationship between the subjects and their environment. The cloudy sky and leafless trees suggest an overcast day, potentially in late autumn or early spring.
**Note**: No numerical data, charts, or diagrams are present in the image. All descriptions are based on visible elements and contextual inference.
</details>
**Question:** How many more people can ride on the vehicle? **Answer:** 0
Figure 7: Scene requiring accurate counting and spatial reasoning.
**Math & Geometry (vision)**
Dataset: CLEVR (clevr)
<details>
<summary>selected_samples/clevr.png Details</summary>

### Visual Description
## 3D Rendering: Abstract Geometric Composition
### Overview
The image depicts a minimalist 3D scene with six distinct objects arranged on a neutral gray surface. The composition emphasizes contrast in color, material, and form, with no textual annotations or contextual labels.
### Components/Axes
- **Objects**:
1. **Blue Cylinder** (left): Large, matte surface, cylindrical shape.
2. **Green Cylinder** (center-right): Large, glossy reflective surface, cylindrical shape.
3. **Yellow Cylinder** (right): Large, matte surface, cylindrical shape.
4. **Purple Sphere** (foreground): Small, highly reflective metallic surface.
5. **Teal Sphere** (background): Small, matte surface.
6. **Gray Cube** (background-left): Small, matte surface, partially obscured.
- **Environment**: Neutral gray background with soft, diffused lighting casting subtle shadows.
### Detailed Analysis
- **Material Properties**:
- The green cylinder’s glossy finish reflects light and surroundings, suggesting a polished or metallic material.
- The purple sphere’s reflective surface mirrors the environment, indicating a polished metal or glass-like material.
- All other objects (blue, yellow, teal, gray) have matte finishes, absorbing light without reflection.
- **Spatial Arrangement**:
- Cylinders are positioned in a linear progression from left (blue) to right (yellow), with the green cylinder slightly offset forward.
- Spheres are placed asymmetrically: the purple sphere is in the foreground, while the teal sphere is near the center-back.
- The gray cube is partially hidden behind the blue cylinder, suggesting depth.
### Key Observations
1. **Material Contrast**: The interplay between glossy (green cylinder, purple sphere) and matte surfaces creates visual hierarchy.
2. **Color Distribution**: Primary colors (blue, green, yellow) dominate, with secondary colors (purple, teal, gray) adding complexity.
3. **Absence of Text**: No labels, legends, or annotations are present, leaving interpretation open to spatial and material relationships.
### Interpretation
This image likely serves as a study in **material rendering** or **geometric abstraction**, emphasizing how light interacts with different surfaces. The lack of textual context suggests it could be used in technical documentation to demonstrate:
- **Lighting Effects**: How shadows and reflections vary with material properties.
- **Spatial Composition**: The use of color and form to create balance or tension.
- **Material Science**: Visualizing differences between matte and reflective surfaces.
Notably, the absence of scale markers or contextual clues (e.g., units, labels) limits quantitative analysis. The scene prioritizes aesthetic and material exploration over functional data representation.
</details>
**Question:** There is a large shiny object; does it have the same shape as the object right of the large metallic thing? Provide a short and direct response. **Answer:** Yes.
Figure 8: Synthetic shapes testing compositional spatial reasoning.
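The compositional reasoning this question tests can be made concrete with a toy scene encoding. The attributes below are assumptions read off the visual description above (not ground-truth CLEVR annotations), with "shiny" and "metallic" treated as the same property, as the question implies:

```python
# Toy scene: the three large cylinders from the description, ordered left to right.
scene = [
    {"name": "blue cylinder",   "shape": "cylinder", "size": "large", "finish": "matte",  "x": 0},
    {"name": "green cylinder",  "shape": "cylinder", "size": "large", "finish": "glossy", "x": 1},
    {"name": "yellow cylinder", "shape": "cylinder", "size": "large", "finish": "matte",  "x": 2},
]

# "the large shiny object" (glossy finish is read as shiny/metallic here)
shiny = next(o for o in scene if o["size"] == "large" and o["finish"] == "glossy")

# "the object right of the large metallic thing": nearest object with a larger x
right_of = min((o for o in scene if o["x"] > shiny["x"]), key=lambda o: o["x"])

answer = "Yes" if shiny["shape"] == right_of["shape"] else "No"
print(answer)  # Yes
```

Resolving each referring expression to an object and then comparing attributes is exactly the chain the model must perform implicitly; here both referents are cylinders, matching the ground-truth answer.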
**Screens, Web & GUI**
Dataset: ScreenQA (screenqa)
<details>
<summary>selected_samples/screenqa.png Details</summary>

### Visual Description
## Screenshot: Menstrual Cycle Tracking App Initial Setup
### Overview
The image shows a mobile application's initial setup screen for configuring menstrual cycle tracking parameters. The interface includes three configurable settings with numerical inputs and a date selector, followed by a "Continue" button.
### Components/Axes
1. **Header**
- Title: "Initial Setup"
- Icon: Red gear symbol (top-left corner)
2. **Sections**
- **Default Period Length**
- Label: "The length of your period. Usually lasts between 4 and 7 days"
- Input: Number "5" with left/right arrows for adjustment
- Separator: Red dotted line
- **Default Cycle Length**
- Label: "The duration between the start dates of two periods. Usually between 23 and 35 days"
- Input: Number "30" with left/right arrows for adjustment
- Separator: Red dotted line
- **Start Date of Last Period**
- Label: "Start date of your last period"
- Date Picker: "10 Feb 2017" with up/down arrows for month/year adjustment
- Separator: Red dotted line
3. **Footer**
- Button: "Continue" (red text with right arrow)
- Android Navigation Bar: Back triangle, home circle, and recent apps square (black background)
### Detailed Analysis
- **Default Period Length**:
- Label text explains the parameter's purpose.
- Input value: 5 days (within the stated 4-7 day range).
- Arrows allow incremental adjustment.
- **Default Cycle Length**:
- Label text defines the parameter as the interval between period starts.
- Input value: 30 days (within the stated 23-35 day range).
- Arrows enable value modification.
- **Start Date of Last Period**:
- Date displayed as "10 Feb 2017" (day-month-year format).
- Arrows suggest month/year adjustment capability (no day adjustment arrows visible).
### Key Observations
1. All numerical inputs use red arrows for consistency.
2. Date format follows day-month-year convention (common in many regions).
3. Default values align with medical averages for menstrual cycles.
4. No validation messages or error states are visible.
5. The "Continue" button is the only interactive element below the configuration fields.
### Interpretation
This setup screen establishes foundational parameters for period tracking:
- The default values (5-day period, 30-day cycle) reflect typical medical norms.
- The start date input suggests the app will calculate future period predictions based on these parameters.
- The absence of validation rules implies users can input any values, though the app likely validates internally.
- The date selector's lack of day adjustment arrows may indicate the app prioritizes month/year precision for cycle calculations.
- The "Continue" button's placement suggests this is a multi-step setup process.
The interface prioritizes simplicity, using minimal text and clear visual hierarchy to guide users through essential configuration steps for menstrual cycle tracking.
</details>
**Question:** What is the default period length? **Answer:** 5 days
Figure 9: User interface screenshot with structured textual elements.
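The description above notes that the app presumably predicts the next period from these setup values. A minimal illustration of that calculation, under the assumption (not taken from the app) that the next period starts one cycle length after the last start date:

```python
from datetime import date, timedelta

# Values read off the setup screen in the figure.
last_period_start = date(2017, 2, 10)   # "10 Feb 2017"
cycle_length_days = 30                  # "Default Cycle Length"

# Hypothetical prediction rule: next start = last start + one cycle length.
next_period_start = last_period_start + timedelta(days=cycle_length_days)
print(next_period_start.strftime("%d %b %Y"))  # 12 Mar 2017
```

This is only a sketch of the arithmetic the screen's parameters support; the app's actual prediction logic is not visible in the screenshot.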
**Medical (vision)**
Dataset: PathVQA (pathvqa)
<details>
<summary>selected_samples/pathvqa.png Details</summary>

### Visual Description
## Photograph: Histological/Cytological Slide with Stained Cellular Structures
### Overview
The image depicts a microscopic view of a histological or cytological sample, likely stained with hematoxylin and eosin (H&E) or similar differential staining techniques. The background is predominantly light blue, with scattered dark blue and brownish-red structures. Two black arrows highlight specific regions of interest: one pointing to a central dark blue structure and another to a cluster of brownish-red elements.
### Components/Axes
- **No explicit textual labels, axes, legends, or scale bars** are visible in the image.
- **Staining patterns**:
- **Dark blue**: Likely represents nuclei (eosinophilic staining, possibly hematoxylin).
- **Brownish-red**: May indicate cytoplasmic or extracellular matrix components (eosinophilic staining).
- **Arrows**: Two black arrows annotate specific regions but lack accompanying labels or identifiers.
### Detailed Analysis
- **Cellular morphology**:
- The dark blue structures (nuclei) vary in size and shape, with some appearing rounded and others irregular.
- The brownish-red elements are irregularly distributed, with one cluster forming a branching or dendritic pattern.
- **Staining intensity**:
- Dark blue staining is concentrated in the central region, suggesting a high density of nuclei or chromatin.
- Brownish-red staining is localized to the periphery, potentially indicating extracellular material or cytoplasmic inclusions.
- **Arrows**:
- The first arrow points to a large, dark blue structure with a clear boundary, possibly a cell nucleus or apoptotic body.
- The second arrow highlights a cluster of brownish-red elements, which may represent extracellular debris, collagen fibers, or pathological aggregates.
### Key Observations
1. **Lack of scale**: No scale bar is present, making it impossible to quantify cell size or spacing.
2. **Staining ambiguity**: Without a legend, the exact identity of the stains (e.g., hematoxylin, eosin, or other markers) cannot be confirmed.
3. **Pathological features**: The irregular clustering of brownish-red elements and the prominence of dark blue nuclei may suggest pathological conditions (e.g., inflammation, fibrosis, or malignancy), but this requires contextual clinical or experimental data.
### Interpretation
- **Biological significance**:
- The dark blue nuclei likely represent viable cells, while the brownish-red elements could indicate extracellular matrix deposition, cellular debris, or apoptotic material.
- The arrows may highlight areas of interest for further analysis (e.g., tumor cell clusters, inflammatory infiltrates, or tissue damage).
- **Limitations**:
- The absence of textual annotations, legends, or scale bars limits the ability to draw definitive conclusions.
- Differential staining patterns alone cannot confirm specific cell types or pathologies without additional context (e.g., immunohistochemistry markers, clinical history).
- **Next steps**:
- Cross-reference with experimental protocols to identify staining agents and their targets.
- Compare with control samples to assess pathological changes.
**Note**: This image contains no explicit textual data, numerical values, or structured information. The analysis is based solely on visual interpretation of staining patterns and morphological features.
</details>
**Question:** where are liver stem cells (oval cells) located? **Answer:** in the canals of hering
Figure 10: Microscopic pathology image for medical VQA.
**Text-only (instruction / code / math / translation)**
Dataset: aya_dataset (aya)
**Question:** Quel président des États-Unis ne s'est jamais marié? ("Which U.S. president never married?") **Answer:** James Buchanan est le seul président qui ne s'est jamais marié. ("James Buchanan is the only president who never married.")
Figure 11: Text-only tasks covering multiple languages.