# jina-vlm: Small Multilingual Vision Language Model
**Authors**: Jina AI by Elastic
> Equal contribution. Work done during internship at Jina AI.
Abstract
We present jina-vlm, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. The model achieves leading results on standard VQA benchmarks and multilingual evaluations while preserving competitive text-only performance. Model weights and code are publicly released at https://huggingface.co/jinaai/jina-vlm.
1 Introduction
Vision-language models (VLMs) combine pretrained vision encoders with large language models to tackle tasks requiring joint visual and textual understanding (flamingo; llava). Recent VLMs have achieved strong results on visual question answering (VQA), OCR, and multimodal reasoning. However, two challenges limit their practical deployment. First, multilingual capabilities often degrade during vision adaptation: models that perform well on English benchmarks show uneven results across other languages (vlmsurvey). Second, high-quality VLMs remain computationally expensive to train and deploy, limiting accessibility for researchers and practitioners with constrained resources.
This work introduces jina-vlm, a 2.4B parameter VLM that addresses both challenges. The model aligns a SigLIP2-So400M/14-384 vision encoder (siglip2) with Qwen3-1.7B-Base (qwen3) through an attention-pooling connector, trained with a two-stage pipeline that explicitly incorporates multilingual data. Among open 2B-scale VLMs, jina-vlm achieves state-of-the-art performance on multilingual multimodal benchmarks including MMMB and Multilingual MMBench, demonstrating that small models can excel at cross-lingual visual understanding without sacrificing general capabilities. On standard English benchmarks spanning diagrams, charts, documents, and OCR, jina-vlm achieves the highest average score (72.3) across eight VQA benchmarks among 2B-scale VLMs. These results are enabled by two technical contributions: an efficient arbitrary-resolution pipeline that combines overlapping tiling with attention-based token pooling to reduce visual token count by 4 $×$ , and a training recipe that incorporates text-only data to preserve the language understanding performance of the backbone LLM.
2 Related Work
VLM architecture and training. Modern VLMs follow an architecture introduced by PaLI (pali): a pretrained vision encoder extracts visual features, a connector projects them into the language model’s embedding space, and a decoder-only language model generates text conditioned on these visual tokens. Vision Transformers (ViTs) (vit) produce patch-level representations that the language model processes alongside text embeddings. This design is adopted by LLaVA (llava; llava_1_5; llava_uhd; llava_next_interleave; llava_onevision), QwenVL (qwen_vl; qwen_2_vl; qwen_2_5_vl), InternVL (internvl; internvl_1_5; internvl_2_5; internvl_3; internvl_3_5), and Ovis (ovis; ovis_2_5). Training strategies vary: qwen_2_vl; internvl_2_5 alternate between multimodal instruction tuning and general training; llava_1_5 incorporate academic VQA datasets; molmo, llava_onevision, and cambrian1 curate large-scale, diverse data mixtures.
Efficient resolution-agnostic image processing. Standard ViTs process fixed-resolution images, requiring resizing that discards fine-grained detail. Since visual token count scales with resolution and Transformer computation scales quadratically with sequence length, naive high-resolution processing is prohibitive. Several solutions exist: molmo tile images with overlap; qwen_2_vl introduce Naive Dynamic Resolution with Multimodal Rotary Position Embedding (rope; 2drope); ovis_2_5 use native-resolution ViTs (navit). Orthogonally, images often contain low-information regions (e.g., sky backgrounds), making visual tokens highly redundant. Token compression methods address this (fastv; prumerge; visionzip; pyramiddrop). internvl_1_5 develop Dynamic High-Resolution Tiling, and nvila propose scale-then-compress strategies. Recent work on training-free token budgeting, such as HERO (hero), demonstrates that inference-time pruning can achieve significant speedups while preserving accuracy; our approach differs by learning compact representations during training rather than dropping tokens at inference.
Vision-language connectors. The connector bridging vision encoders and language models significantly impacts both efficiency and performance. BLIP-2 (blip2) introduces Q-Former, a learnable query-based transformer that extracts fixed-length representations from visual features, reducing the number of tokens fed to the LLM. Flamingo (flamingo) uses a Perceiver Resampler with cross-attention to compress visual tokens. Our attention-pooling connector shares the goal of token reduction but operates differently: rather than learning a fixed set of queries, we apply local 2 $×$ 2 attention pooling that preserves spatial structure while achieving 4 $×$ compression, which we found more effective for tasks requiring fine-grained spatial understanding.
Small VLMs. Efficiency has become a central objective. mobilevlmv2 demonstrate competitive performance below 2B parameters. imp combine quantization with aggressive resolution reduction for mobile deployment, matching larger models’ performance. MiniCPM-V (minicpmv) targets edge deployment while maintaining strong OCR and multilingual capabilities. smolvlm systematically explore design parameters to train VLMs as small as 256M parameters.
Multilingual VLMs. Many lightweight VLMs (paligemma; paligemma2; phi3) achieve strong English performance but degrade on other languages. qwen_2_vl and internvl_1_5 address this through targeted multilingual training data. pangea introduce instruction-tuning data spanning 39 languages.
Retaining text-only performance. Multimodal training often degrades text-only capabilities. Mitigation strategies include balanced data mixtures, careful learning rate scheduling (cauldron), and partial backbone freezing (llava_onevision; internvl_3_5).
3 Model Architecture
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: jina-vlm Model Architecture
The diagram traces the flow from an input image and prompt to generated text. A 2728 × 2046 input photograph is resized and split into 12 overlapping 378 × 378 tiles plus a global thumbnail (13 crops in total). Each crop passes through the SigLIP2 encoder (~400M parameters, 27 Transformer blocks), with features tapped from layers 18 and 24. The VL-Connector (~50M parameters) applies 2 × 2 attention pooling followed by an MLP projection, producing 182 tokens per crop and 2,366 image tokens in total. These image tokens (red blocks) are combined with the text-prompt tokens (teal blocks; "what is the name of this lady?") and fed to the Qwen3 decoder (~1.7B parameters, 28 Transformer blocks), which generates the output "The name is Lenna (or Lena)."
</details>
Figure 1: Architecture of jina-vlm. Images are resized to fit a grid of up to 12 overlapping tiles, plus a global thumbnail. Each tile is a square 378 $×$ 378 crop; adjacent tiles overlap by 112 pixels with a stride of 266 pixels between tile origins. A 4 $×$ 3 grid therefore spans 1176 $×$ 910 pixels, and images exceeding this effective resolution are downscaled to fit the tile budget. Each tile produces 729 patches via SigLIP2 (siglip2). The VL connector concatenates features from layers 24 and 18, the third- and ninth-to-last layers, then applies 2 $×$ 2 attention pooling to reduce 729 tokens to 182 before projecting to the decoder dimension. Visual tokens are combined with text embeddings for the Qwen3 decoder (qwen3).
Figure 1 illustrates the architecture of jina-vlm. The model uses overlapping image tiling following molmo, combined with attention-based token pooling to reduce sequence length while preserving spatial information.
The vision encoder, SigLIP2-So400M/14-384, is a 27-layer Vision Transformer with 400M parameters that processes 378 $×$ 378 pixel inputs as 27 $×$ 27 grids of 14 $×$ 14 patches. To handle arbitrary resolutions, we decompose each image into overlapping tiles of this size and process each tile independently through the encoder. A global thumbnail, the full image resized to 378 $×$ 378, provides context alongside the tile representations. We use a default of 12 tiles during training; this limit can be increased at inference or during continued training to handle higher resolutions, with memory scaling linearly with tile count. The tiling algorithm is detailed in Appendix A.1.
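The tile geometry implied by these numbers can be checked in a few lines; the constants below are taken from the Figure 1 caption, and `grid_span` is a helper we introduce here for illustration:

```python
PATCH = 14               # SigLIP2 patch size in pixels
TILE = 378               # square tile side (27 x 27 patches)
OVERLAP = 112            # overlap between adjacent tiles, in pixels
STRIDE = TILE - OVERLAP  # 266 px between tile origins

def grid_span(rows: int, cols: int) -> tuple:
    """Pixel extent covered by a rows x cols grid of overlapping tiles."""
    return (STRIDE * (rows - 1) + TILE, STRIDE * (cols - 1) + TILE)

# A 3-row x 4-column grid spans 910 x 1176 pixels, matching the
# effective resolution stated in the Figure 1 caption.
print(grid_span(3, 4))  # (910, 1176)
```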
3.1 Vision-Language Connector
Rather than using the final ViT output, jina-vlm concatenates features from two intermediate layers: the third-to-last and ninth-to-last, corresponding to layers 24 and 18 of the 27-layer encoder. This captures both fine-grained spatial details from earlier layers and high-level semantics from later layers. The connector then applies attention pooling over 2 $×$ 2 patch neighborhoods, using mean-pooled features as queries. This reduces the token count by 4 $×$ while preserving local structure. A SwiGLU projection layer maps the pooled representations to the language model’s embedding dimension.
In more formal terms, let $\mathbf{H}^{(\ell)} \in \mathbb{R}^{N \times d_{v}}$ denote the hidden states from ViT layer $\ell$, where $N$ is the number of patches, $d_{v}$ is the vision encoder hidden size, and negative indices count from the final layer (e.g., $\ell=-1$ is the last layer). We concatenate features from two intermediate layers:
$$
\mathbf{H}_{\text{concat}}=[\mathbf{H}^{(-3)};\mathbf{H}^{(-9)}]\in\mathbb{R}^{N\times 2d_{v}} \tag{1}
$$
For each $2{×}2$ patch neighborhood $\mathcal{N}_{i}$ , we compute a query vector as the mean of the neighborhood features:
$$
\mathbf{q}_{i}=\frac{1}{4}\sum_{j\in\mathcal{N}_{i}}\mathbf{h}_{j},\quad\mathbf{Q}=[\mathbf{q}_{1};\dots;\mathbf{q}_{M}]\in\mathbb{R}^{M\times 2d_{v}} \tag{2}
$$
where $\mathcal{N}_{i}$ contains the four patches at positions $(2i_{x},2i_{y})$ , $(2i_{x}+1,2i_{y})$ , $(2i_{x},2i_{y}+1)$ , and $(2i_{x}+1,2i_{y}+1)$ and $M=N/4$ .
Attention pooling is then computed as:
$$
\mathbf{H}_{\text{pooled}}=(\text{softmax}\left(\frac{\mathbf{Q}\mathbf{W}_{Q}(\mathbf{H}_{\text{concat}}\mathbf{W}_{K})^{\top}}{\sqrt{d_{k}}}\right)\mathbf{H}_{\text{concat}}\mathbf{W}_{V})\mathbf{W}_{O}\in\mathbb{R}^{M\times d_{v}} \tag{3}
$$
where $d_{k}=d_{v}$ and $\mathbf{W}_{Q} \in \mathbb{R}^{2d_{v} \times d_{k}}$, $\mathbf{W}_{K} \in \mathbb{R}^{2d_{v} \times d_{k}}$, $\mathbf{W}_{V} \in \mathbb{R}^{2d_{v} \times 2d_{v}}$, and $\mathbf{W}_{O} \in \mathbb{R}^{2d_{v} \times d_{v}}$ are learnable weight matrices. Finally, the pooled visual features are projected to the language model embedding dimension via a SwiGLU (swiglu) layer:
$$
\mathbf{H}_{\text{proj}}=\left(\text{Swish}(\mathbf{H}_{\text{pooled}}\mathbf{W}_{1})\odot(\mathbf{H}_{\text{pooled}}\mathbf{W}_{2})\right)\mathbf{W}_{3}\in\mathbb{R}^{M\times d_{l}} \tag{4}
$$
where $\text{Swish}(x)=x \cdot \sigma(x)$, $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication, $\mathbf{W}_{1},\mathbf{W}_{2} \in \mathbb{R}^{d_{v} \times 3d_{l}}$ and $\mathbf{W}_{3} \in \mathbb{R}^{3d_{l} \times d_{l}}$ are learnable parameters, and $d_{l}$ is the language model embedding size.
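Equations (1)–(4) can be sketched in NumPy for shape checking. This is a minimal illustration, not the released implementation: weights are random, attention is single-head and restricted to each 2×2 neighborhood as described in the text, and we assume an even patch grid (the real 27×27 grid is odd, so edge neighborhoods need special handling, yielding 182 rather than exactly 729/4 tokens per tile):

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_pool_connector(h_l3, h_l9, d_l, rng):
    """2x2 attention pooling + SwiGLU projection (sketch of Eqs. 1-4).

    h_l3, h_l9: (G, G, d_v) features from ViT layers -3 and -9 on the
    G x G patch grid (G assumed even). Returns (M, d_l) visual tokens.
    """
    G, _, d_v = h_l3.shape
    d_k = d_v
    H = np.concatenate([h_l3, h_l9], axis=-1)           # Eq. 1: (G, G, 2*d_v)
    # Learnable weights (random here, for shape checking only).
    Wq = rng.normal(size=(2 * d_v, d_k))
    Wk = rng.normal(size=(2 * d_v, d_k))
    Wv = rng.normal(size=(2 * d_v, 2 * d_v))
    Wo = rng.normal(size=(2 * d_v, d_v))
    W1 = rng.normal(size=(d_v, 3 * d_l))
    W2 = rng.normal(size=(d_v, 3 * d_l))
    W3 = rng.normal(size=(3 * d_l, d_l))

    pooled = []
    for iy in range(G // 2):
        for ix in range(G // 2):
            nb = H[2 * iy:2 * iy + 2, 2 * ix:2 * ix + 2].reshape(4, 2 * d_v)
            q = nb.mean(axis=0, keepdims=True)           # Eq. 2: mean query
            logits = (q @ Wq) @ (nb @ Wk).T / np.sqrt(d_k)
            attn = np.exp(logits - logits.max())
            attn /= attn.sum()
            pooled.append((attn @ (nb @ Wv)) @ Wo)       # Eq. 3
    Hp = np.concatenate(pooled, axis=0)                  # (M, d_v), M = N/4

    def swish(x):
        return x / (1.0 + np.exp(-x))
    return (swish(Hp @ W1) * (Hp @ W2)) @ W3             # Eq. 4: SwiGLU

# Tiny illustrative sizes; 16 patches are pooled into 4 tokens.
out = attention_pool_connector(
    rng.normal(size=(4, 4, 8)), rng.normal(size=(4, 4, 8)), d_l=16, rng=rng)
print(out.shape)  # (4, 16)
```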
3.2 Language Decoder
The language decoder is initialized from [Qwen3-1.7B-Base](https://huggingface.co/Qwen/Qwen3-1.7B-Base), which empirically outperformed the instruction-tuned variant in our setting. We introduce three special tokens to structure visual inputs: `<im_start>` and `<im_end>` delimit image and thumbnail sequences, while `<im_col>` marks row boundaries within the patch grid, where tokens are arranged left-to-right and top-to-bottom. Input and output embedding weights are not tied.
3.3 Efficiency Analysis
Table 1 quantifies the computational benefits of attention pooling. With the default 12-tile configuration (plus thumbnail), the unpooled baseline would produce 9,477 visual tokens per image, while our 2 $×$ 2 pooling reduces this to 2,366 tokens. Since the ViT processes each tile identically regardless of pooling, the savings apply exclusively to the LLM: we observe a 3.9 $×$ reduction in prefill FLOPs and a 4 $×$ reduction in KV-cache memory. The overall FLOPs reduction is 2.3 $×$ when including the shared ViT cost.
Table 1: Efficiency comparison with and without 2 $×$ 2 attention pooling for the default 12-tile configuration. FLOPs are computed for LLM prefill; KV-cache assumes fp16 precision.
| Metric | Without pooling | With 2×2 pooling | Reduction |
| --- | --- | --- | --- |
| Visual tokens | 9,477 | 2,366 | 4.0× |
| LLM prefill FLOPs | 27.2 TFLOPs | 6.9 TFLOPs | 3.9× |
| KV-cache memory | 2.12 GB | 0.53 GB | 4.0× |
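The token counts above follow directly from the tiling setup (13 crops of 729 patches each, pooled to 182 tokens per crop):

```python
CROPS = 12 + 1                          # 12 grid tiles + 1 global thumbnail
PATCHES_PER_CROP = 27 * 27              # 729 patches per SigLIP2 crop
POOLED_PER_CROP = 182                   # after 2x2 attention pooling

unpooled = CROPS * PATCHES_PER_CROP     # visual tokens without pooling
pooled = CROPS * POOLED_PER_CROP        # visual tokens with pooling
print(unpooled, pooled, round(unpooled / pooled, 1))  # 9477 2366 4.0
```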
4 Training
Training proceeds in two stages, both updating all model components (encoder, connector, and decoder) without freezing, following molmo. The combined data comprises approximately 5M multimodal samples and 12B text tokens across 30+ languages, with roughly half in English and the remainder spanning high- and moderate-resource languages. Table 2 summarizes hyperparameters for both stages.
4.1 Stage 1: Alignment Training
The first stage focuses on cross-language semantic grounding rather than task-specific objectives. Training data consists primarily of caption datasets (PixmoCap (molmo), PangeaIns (pangea)) spanning diverse visual domains: natural scenes, documents, infographics, and diagrams. We include 15% text-only data from PleiAS/common_corpus (pleias) to mitigate degradation on text-only tasks. The connector uses a higher learning rate and shorter warmup than the encoder and decoder.
4.2 Stage 2: Instruction Fine-tuning
The second stage trains instruction-following for VQA and reasoning tasks. We combine public dataset collections, including LLaVA OneVision (llava_onevision), Cauldron (cauldron), Cambrian (cambrian1), PangeaIns (pangea), and FineVision (finevision), with text-only instruction data from aya. The mixture covers academic VQA, document understanding, OCR, mathematics, and reasoning. Appendix A.2 shows representative examples.
Given the heterogeneity of the instruction data, we found single-source batches more effective early in training: we train for 30K steps with single-source batches, then 30K steps with mixed-source batches.
Table 2: Model training hyperparameters across pre-training and fine-tuning stages.
| Hyperparameter | Pre-Training | Fine-Tuning |
| --- | --- | --- |
| Warmup ViT | 10% | 10% |
| Warmup Con. | 1% | 10% |
| Warmup LLM | 10% | 10% |
| LR ViT | 6e-6 | 5e-6 |
| LR Con. | 2e-4 | 5e-6 |
| LR LLM | 2e-5 | 1e-5 |
| Cosine Decay | 0.1 | 0.1 |
| Eps. | 1e-6 | 1e-6 |
| Betas | 0.9, 0.95 | 0.9, 0.95 |
| Batch Size | 128 | 256 |
| Steps | 25K | 60K |
| Samples | 3.2M | 15.3M |
| Tokens | 10B | 37B |
| GPU Hours | 296 | 1,000 |
5 Evaluation
We compare jina-vlm against lightweight VLMs across seven capability areas: general VQA, multimodal comprehension, multi-image reasoning, hallucination control, mathematical reasoning, text-only performance, and multilingual understanding. All evaluations use [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) (vlmevalkit) with English prompts matching our training format (e.g., “Return only the letter of the best answer option” for multiple-choice, “Respond very briefly” for open-ended questions).
5.1 General VQA Tasks
Table 3 reports results on eight VQA benchmarks covering diagrams (AI2D (ai2d)), charts (ChartQA (chartqa), CharXiv (charxiv)), scene text (TextVQA (textvqa)), documents (DocVQA (docvqa), InfoVQA (infovqa)), OCR (OCRBench (ocrbench)), and diverse scenes (SEED-Bench-2-Plus (seedbench2plus)). jina-vlm achieves the highest average (72.3), with particularly strong performance on diagram interpretation and text extraction.
Table 3: Comparison of general visual question answering performance.
| Model | AI2D | ChartQA (test avg) | TextVQA (val) | DocVQA (val) | InfoVQA (val) | OCRBench | SEED-2-Plus | CharXiv (RQ / DQ) | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| jina-vlm | $82.0$ | $81.9$ | $83.2$ | $90.6$ | $71.6$ | $778$ | $67.2$ | 32.3 / 63.5 | $72.3$ |
| Qwen2-VL-2B | $74.7$ | $73.5$ | $79.7$ | $89.2$ * | $64.0$ * | $809$ | $62.4$ | 23.3 / 55.0* | $66.4$ |
| Qwen3-VL-2B | $76.9$ | $77.2$ | $79.5$ | $92.3$ * | $71.9$ * | $858$ | $67.3$ * | 28.8 / 62.3 | $71.6$ |
| InternVL3-2B | $78.6$ | $80.2$ | $77.0$ | $87.4$ * | $67.1$ * | $835$ | $64.6$ | 28.3 / 54.7 | $69.2$ |
| InternVL3.5-2B | $78.8$ | $80.7$ | $76.5$ | $88.5$ * | $69.3$ * | $836$ | $68.0$ | 31.6 / 65.0 | $71.6$ |
Results for models other than jina-vlm are from their respective papers (internvl_3_5; internvl_3; qwen_2_vl), except those marked with *, which were computed using VLMEvalKit. All scores represent accuracy (%) except OCRBench, which uses a 0–1000 scale; for the overall average, OCRBench scores are divided by 10 to align with the 0–100 scale of the other benchmarks.
5.2 Document and Real-World Understanding
Table 4 shows results on multimodal comprehension (MME (mme), MMB v1.1 (mmbench), MMStar (mmstar)) and real-world understanding (RealWorldQA (realworldqa), MME-RealWorld (mmerealworld), R-Bench (rbench)). jina-vlm scores 67.4 on multimodal tasks and 61.9 on real-world tasks, achieving the best RealWorldQA result (68.2).
Table 4: Comparison of generic multimodal understanding and real-world understanding performance.
| Model | MME | MMB v1.1 | MMStar | Avg (MM) | RealWorldQA | MME-RealWorld | R-Bench | Avg (RW) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| jina-vlm | $1965.8$ | $75.8$ | $56.2$ | $67.4$ | $68.2$ | $50.7$ | $66.7$ | $61.9$ |
| Qwen2-VL-2B | $1872.0$ | $72.2$ | $48.0$ | $62.4$ | $62.9$ | $38.7$ * | $63.2$ | $55.0$ * |
| Qwen3-VL-2B | $2000.8$ * | $77.8$ | $58.3$ | $69.2$ | $63.9$ | $57.9$ * | $67.3$ * | $63.0$ |
| InternVL3-2B | $2221.2$ | $78.6$ | $60.7$ | $72.9$ | $64.3$ | $53.8$ | $67.5$ | $61.9$ |
| InternVL3.5-2B | $2123.3$ | $76.6$ | $62.7$ | $71.7$ | $62.0$ | $49.7$ | $62.4$ | $58.0$ |
Results for models other than jina-vlm are from their respective papers (internvl_3_5; internvl_3; qwen_2_vl), except those marked with *, which were computed using VLMEvalKit. All scores represent accuracy (%) except MME, which uses a 0–2800 scale; for the overall average, MME scores are divided by 28 to align with the 0–100 scale of the other benchmarks.
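As a sanity check, the two jina-vlm averages in Table 4 can be reproduced from the per-benchmark scores, with MME rescaled by the factor of 28 noted above:

```python
def avg(scores):
    """Mean rounded to one decimal, as reported in the tables."""
    return round(sum(scores) / len(scores), 1)

# jina-vlm scores from Table 4.
mme, mmb, mmstar = 1965.8, 75.8, 56.2
rwqa, mme_rw, rbench = 68.2, 50.7, 66.7

multimodal = avg([mme / 28, mmb, mmstar])  # MME rescaled to 0-100
realworld = avg([rwqa, mme_rw, rbench])
print(multimodal, realworld)  # 67.4 61.9
```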
5.3 Multi-Image Reasoning and Hallucination
Table 5 reports multi-image reasoning (BLINK (blink), MuirBench (muirbench), MMT (mmtbench)) and hallucination benchmarks that measure the tendency to fabricate visual details (HallBench (hallusionbench), POPE (pope)). jina-vlm scores 47.3 on multi-image tasks, which is expected given limited multi-image training data, but achieves the best POPE score (90.3), indicating low hallucination rates.
Table 5: Comparison of multi-image and hallucination performance.
| Model | BLINK | MuirBench | MMT | Avg (multi-image) | HallBench | POPE | Avg (hallucination) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| jina-vlm | $50.1$ | $34.7$ | $57.2$ | $47.3$ | $39.1$ | $90.3$ | $64.7$ |
| Qwen2-VL-2B | $44.4$ | $25.5$ * | $55.1$ | $41.7$ | $41.7$ | $87.9$ * | $64.8$ |
| Qwen3-VL-2B | $53.8$ | $47.4$ | $60.0$ * | $53.7$ | $44.5$ | $88.9$ * | $66.7$ |
| InternVL3-2B | $50.3$ | $38.8$ | $59.5$ | $49.5$ | $42.5$ | $89.6$ | $66.1$ |
| InternVL3.5-2B | $51.3$ | $44.0$ | $58.5$ | $51.3$ | $48.6$ | $87.2$ | $67.9$ |
Results for models other than jina-vlm are from their respective papers, (internvl_3_5; internvl_3; qwen_2_vl), except those marked with * which are computed using VLMEvalKit. All scores represent accuracy (%).
5.4 Mathematical Reasoning
Table 6 reports structured reasoning benchmarks: multidisciplinary comprehension (MMMU (mmmu)), visual mathematics (MathVista (mathvista), MathVision (mathvision), MathVerse (mathverse), WeMath (wemath)), and logical reasoning (LogicVista (logicvista)). jina-vlm performs comparably to InternVL3-2B and outperforms Qwen2-VL-2B.
Table 6: Comparison of multimodal reasoning and mathematical problem-solving performance.
| Model | MMMU | MathVista | MathVision | MathVerse (Vision Only) | WeMath | LogicVista | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| jina-vlm | $45.6$ | 59.5 | 19.2 | 23.9 | 17.1 | 33.3 | $33.1$ |
| Qwen2-VL-2B | $41.1$ | 43.0 | 12.4 | 17.3* | 10.9* | 27.3* | $25.3$ |
| Qwen3-VL-2B | $53.4$ | 61.3 | 31.6 | 22.7* | 28.0* | 35.4* | $38.7$ |
| InternVL3-2B | $48.6$ | 57.0 | 21.7 | 25.3 | 22.4 | 36.9 | $35.3$ |
| InternVL3.5-2B | $59.0$ | 71.8 / 61.5 † | 42.8 / 26.5 † | 53.4 / 35.3 † | 48.5 / 19.1 † | 47.7 / 41.4 † | $50.7$ |
Results for models other than jina-vlm are from their respective papers, (internvl_3_5; internvl_3; qwen_2_vl), except those marked with * which are computed using VLMEvalKit. $\dagger$ indicates scores for InternVL3.5-2B without thinking mode, evaluated using VLMEvalKit. All scores represent accuracy (%).
5.5 Text-Only Performance
Table 7 compares jina-vlm against the backbone Qwen3-1.7B on text-only benchmarks: MMLU (mmlu), MMLU-Pro (mmlupro), GSM-8K (gsm8k), ARC-C (arc), and HellaSwag (hellaswag). Results show mixed preservation of text-only capabilities: jina-vlm matches or exceeds the backbone on commonsense reasoning (ARC-C, HellaSwag) and retains most performance on MMLU and GSM-8K. However, MMLU-Pro shows substantial degradation (46.4 $→$ 30.3), likely because this benchmark emphasizes extended multi-step reasoning that conflicts with our instruction-tuning toward concise visual responses. This suggests a trade-off between optimizing for multimodal tasks and preserving complex text-only reasoning, which future work could address through more balanced data mixtures or curriculum scheduling.
Table 7: Comparison of Text-only benchmarks.
| Model | MMLU | MMLU-Pro | GSM-8K | ARC-C | HellaSwag | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| jina-vlm | $56.1$ | $30.3$ | $71.3$ | $77.3$ | $59.4$ | $58.9$ |
| Qwen3-1.7B | $62.6$ | $46.4$ | $75.3$ | $73.4$ | $59.0$ | $63.3$ |
Results are collected using our evaluation code. All scores represent accuracy (%).
5.6 Multilingual Understanding
Table 8 reports multilingual multimodal benchmarks: MMMB (mmmb), Multilingual MMBench (mmmb), and MTVQA (mtvqa). jina-vlm achieves state-of-the-art multilingual performance among 2B-scale VLMs, with the highest averages on MMMB (78.8) and Multilingual MMBench (74.3).
Table 8: Comparison of multilingual multimodal understanding performance.
| Benchmark | Lang | jina-vlm | Qwen2-VL-2B | Qwen3-VL-2B | InternVL3-2B | InternVL3.5-2B |
| --- | --- | --- | --- | --- | --- | --- |
| MMMB | ar | $80.0$ | $68.3$ | $72.7$ * | $68.6$ | $68.5$ |
| | cn | $76.9$ | $74.2$ | $75.7$ * | $78.3$ | $77.7$ |
| | en | $82.0$ | $78.3$ | $80.7$ * | $81.9$ | $80.2$ |
| | pt | $79.2$ | $72.6$ | $75.0$ * | $75.4$ | $75.9$ |
| | ru | $79.2$ | $72.8$ | $75.9$ * | $74.6$ | $76.3$ |
| | tr | $75.5$ | $61.8$ | $68.5$ * | $62.9$ | $69.1$ |
| | avg | $78.8$ | $71.3$ | $75.0$ * | $73.6$ | $74.6$ |
| Multi. MMBench | ar | $70.0$ | $66.7$ | $66.2$ * | $66.4$ | $63.7$ |
| | cn | $75.9$ | $67.0$ | $75.7$ * | $77.8$ | $75.9$ |
| | en | $78.8$ | $71.1$ | $77.8$ * | $81.3$ | $78.4$ |
| | pt | $74.7$ | $72.1$ | $71.4$ * | $75.9$ | $73.7$ |
| | ru | $75.3$ | $69.9$ | $75.9$ * | $70.7$ | $71.4$ |
| | tr | $71.1$ | $69.3$ | $67.0$ * | $59.5$ | $62.0$ |
| | avg | $74.3$ | $69.4$ | $72.3$ * | $71.9$ | $70.9$ |
| MTVQA | | $25.6$ | $20.6$ | $27.3$ * | $26.7$ | $28.5$ |
| Overall | | $59.6$ | $53.8$ | $58.2$ | $57.4$ | $58.0$ |
Results for baseline models are derived from their original publications, (internvl_3_5; internvl_3; qwen_2_vl), except those marked with * which are computed using VLMEvalKit. All scores represent accuracy (%).
6 Conclusion
We presented jina-vlm, a 2.4B vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. Our results demonstrate that small VLMs can attain strong cross-lingual visual understanding through careful architectural and training choices: attention-based token pooling reduces visual tokens by 4 $×$ while preserving spatial information, and incorporating text-only data during multimodal training mitigates the catastrophic forgetting typically observed in vision-adapted language models. On standard English VQA benchmarks, jina-vlm achieves leading results, demonstrating that multilingual capabilities need not come at the cost of general performance.
The current approach has limitations. Multi-tile processing introduces computational overhead that scales with image resolution, and tiling can fragment global spatial context, potentially impairing performance on tasks requiring holistic scene understanding such as object counting or precise spatial reasoning across tile boundaries. While the global thumbnail partially mitigates this, native-resolution approaches (navit) may be better suited for such tasks. We have not emphasized safety-critical training or alignment, and multi-image reasoning remains weak due to limited training data in this regime. Future work could explore more efficient resolution handling, pursue targeted improvements for counting and spatial tasks, and investigate whether our multilingual training recipe transfers to larger model scales.
Appendix A Appendix
A.1 Pseudocode for Creating Overlapping Tiles
Input: Image $I$ of size $(h,w)$; base input size $\mathbf{b}=(b_{h},b_{w})=(378,378)$; patch size $p=14$; maximum number of tiles $M$ ($12$ by default, configurable); overlap margins $(m_{L},m_{R})=(4,4)$ in patches.
Output: List of tiles $\mathcal{C}$ (thumbnail + grid tiles); tiling $(t_{h},t_{w})$ = (number of rows, number of columns).
1. Compute overlap-related sizes
$m_{\text{tot}}← p·(m_{L}+m_{R})$ // Total overlap margin in pixels
$s_{\text{win}}←\bigl(\lfloor b_{h}/p\rfloor-(m_{L}+m_{R})\bigr)· p$ // Tile stride in pixels
2. Select tiling on the margin-reduced image
$(t_{h},t_{w})←\textsc{SelectTilingWithMinimalScaleChange}\bigl(h-m_{\text{tot}},\,w-m_{\text{tot}},\,s_{\text{win}},\,M\bigr)$ ;
3. Resize image to exactly fit the chosen tiling + margins;
$H^{\prime}← t_{h}· s_{\text{win}}+m_{\text{tot}}$ ;
$W^{\prime}← t_{w}· s_{\text{win}}+m_{\text{tot}}$ ;
$I_{\text{grid}}←\textsc{Resize}(I,\;[H^{\prime},W^{\prime}])$ ;
4. Extract overlapping tiles
$\mathcal{G}←\textsc{ExtractTiles}\bigl(I_{\text{grid}},\,(t_{h},t_{w}),\,s_{\text{win}},\,b_{h}\bigr)$ // $b_{h}$ is the tile height, equal to $b_{w}$ here
5. Build thumbnail and final tile list
$T←\textsc{Resize}(I,\;[b_{h},b_{w}])$ // Global thumbnail
$\mathcal{C}←[T]\mathbin{+\!\!+}\mathcal{G}$ // Concatenate thumbnail and tiles
return $(\mathcal{C},(t_{h},t_{w}))$ ;
Algorithm 1 GetAllTilesOverlapAndResize
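Algorithm 1 can be sketched in Python as follows. The tiling-selection heuristic is our approximation, since $\textsc{SelectTilingWithMinimalScaleChange}$ is not fully specified here: among tilings fitting the tile budget, we prefer the aspect ratio closest to the image's, breaking ties toward more tiles. Pixel resizing and crop extraction are elided; only the target size and tile origins are computed:

```python
import math

def select_tiling(h, w, s_win, max_tiles):
    """Approximate SelectTilingWithMinimalScaleChange: pick (rows, cols)
    with rows*cols <= max_tiles whose aspect ratio best matches the
    margin-reduced image, preferring more tiles among near-ties."""
    def key(rc):
        rows, cols = rc
        aspect_cost = abs(math.log((rows * s_win) / h) -
                          math.log((cols * s_win) / w))
        return (round(aspect_cost, 2), -(rows * cols))
    candidates = [(r, c) for r in range(1, max_tiles + 1)
                  for c in range(1, max_tiles // r + 1)]
    return min(candidates, key=key)

def get_all_tiles_overlap_and_resize(h, w, base=378, patch=14,
                                     max_tiles=12, margins=(4, 4)):
    """Sketch of Algorithm 1: returns the resized image size, the chosen
    tiling, and the top-left pixel origin of each base x base tile."""
    m_tot = patch * sum(margins)                     # 112 px total margin
    s_win = (base // patch - sum(margins)) * patch   # 266 px tile stride
    rows, cols = select_tiling(h - m_tot, w - m_tot, s_win, max_tiles)
    H = rows * s_win + m_tot                         # exact resized height
    W = cols * s_win + m_tot                         # exact resized width
    origins = [(r * s_win, c * s_win)                # overlapping crops
               for r in range(rows) for c in range(cols)]
    return (H, W), (rows, cols), origins

# The 2728 x 2046 example from Figure 1: a 3 x 4 tiling (12 tiles),
# resized to 910 x 1176 before extracting overlapping 378 x 378 crops.
size, tiling, origins = get_all_tiles_overlap_and_resize(2046, 2728)
print(size, tiling, len(origins))  # (910, 1176) (3, 4) 12
```

The global thumbnail is simply the full image resized to 378 × 378 and prepended to the tile list, as in step 5 of the pseudocode.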
A.2 Training Set Examples
Captioning & Instruction
Dataset: VisualWebInstruct visualwebinstruct
<details>
<summary>selected_samples/visualwebinstruct.png Details</summary>

### Visual Description
## Documentation: Conflict Resolution Templates and Best Practices
### Overview
The image presents a document outlining best practices for conflict documentation, including a meeting notes template, an official communication template, and general guidelines for documentation.
### Components/Axes
**1. Meeting Notes Template:**
* **Header:**
* Meeting Title: Conflict Resolution Meeting
* Date: October 15, 2023
* Time: 3:00 PM - 4:00 PM
* Location: Main Conference Room
* Attendees: Alice Johnson (CEO), Bob Thompson (CTO), Claire Lee (COO), David Smith (Employee Representative)
* **Body:**
* Agenda: Review of conflict, Statements from attendees, Discussion of possible resolutions, Action items and next steps
* Minutes: (1) Introduction by Alice Johnson, (2) Statements from Alice and David, (3) Discussion on solutions, (4) Actions assigned
* **Footer:**
* Prepared by: Claire Lee, Date: October 15, 2023
**2. Official Communication Template:**
* **Header:**
* From: Alice Johnson, CEO
* To: All Employees
* Date: October 16, 2023
* Subject: Update on Recent Conflict Resolution Efforts
* **Body:**
* Introduction
* Summary of Actions
* Commitment to Resolution
* Contact for Concerns
* **Footer:**
* Closing and Signature
**3. Best Practices for Documentation:**
* Timeliness: Document conflicts promptly
* Objectivity: Record factual information without bias
* Consistency: Use standardized templates
* Confidentiality: Ensure secure storage and limited access
* Inclusion of All Perspectives: Document all viewpoints
* Follow-Up: Schedule follow-up meetings and communications
### Detailed Analysis
The document provides structured templates for documenting conflict-related meetings and official communications. The meeting notes template includes sections for the header (meeting details), body (agenda and minutes), and footer (preparer and date). The official communication template follows a similar structure, with fields for sender, recipient, date, subject, body content, and closing.
The "Best Practices for Documentation" section outlines key principles for effective conflict documentation, including timeliness, objectivity, consistency, confidentiality, inclusion of all perspectives, and follow-up.
### Key Observations
* The meeting notes template is designed for a specific conflict resolution meeting held on October 15, 2023.
* The official communication template is intended for an update on recent conflict resolution efforts, dated October 16, 2023.
* The best practices emphasize the importance of thorough, unbiased, and secure documentation of conflicts.
### Interpretation
The document aims to standardize and improve the process of conflict documentation within an organization. By providing templates and guidelines, it promotes consistency, accuracy, and confidentiality in handling conflict-related information. The emphasis on including all perspectives and following up on conflicts suggests a commitment to fair and effective conflict resolution. The document serves as a valuable resource for employees involved in conflict management and documentation.
</details>
Question what is the meeting title? Answer Conflict Resolution Meeting
Figure 2: Question answering over web documents.
Charts & Tables
Dataset: TAT-QA tatqa
<details>
<summary>selected_samples/tatqa.png Details</summary>

### Visual Description
## Data Table: Unrecognized Tax Benefit
### Overview
The image presents a data table showing the unrecognized tax benefit at the beginning and end of the period, along with gross increases and decreases related to prior and current period tax positions. The data is provided for the fiscal years ended July 31, 2019, 2018, and 2017. All values are in thousands of US dollars.
### Components/Axes
* **Title:** Fiscal years ended July 31,
* **Columns (Years):** 2019, 2018, 2017
* **Rows (Tax Benefit Components):**
* Unrecognized tax benefit - beginning of period
* Gross increases - prior period tax positions
* Gross decreases - prior period tax positions
* Gross increases - current period tax positions
* Unrecognized tax benefit - end of period
### Detailed Analysis or Content Details
Here's the data extracted from the table:
| | 2019 | 2018 | 2017 |
| :--------------------------------------------- | :-------- | :-------- | :-------- |
| Unrecognized tax benefit - beginning of period | \$10,321 | \$9,346 | \$7,687 |
| Gross increases - prior period tax positions | 98 | 729 | 712 |
| Gross decreases - prior period tax positions | (88) | (878) | (691) |
| Gross increases - current period tax positions | 1,302 | 1,124 | 1,638 |
| Unrecognized tax benefit - end of period | \$11,633 | \$10,321 | \$9,346 |
### Key Observations
* **Unrecognized Tax Benefit - Beginning of Period:** Increased from \$7,687 in 2017 to \$9,346 in 2018, and further to \$10,321 in 2019.
* **Gross Increases - Prior Period Tax Positions:** Decreased significantly from 712 in 2017 and 729 in 2018 to 98 in 2019.
* **Gross Decreases - Prior Period Tax Positions:** Increased in magnitude from (691) in 2017 to (878) in 2018, then decreased to (88) in 2019.
* **Gross Increases - Current Period Tax Positions:** Decreased from 1,638 in 2017 to 1,124 in 2018, then increased to 1,302 in 2019.
* **Unrecognized Tax Benefit - End of Period:** Increased from \$9,346 in 2017 to \$10,321 in 2018, and further to \$11,633 in 2019.
### Interpretation
The data indicates a general upward trend in the unrecognized tax benefit over the three-year period. The gross increases in prior period tax positions decreased significantly in 2019, while the gross decreases also reduced substantially in magnitude. The gross increases in current period tax positions fluctuated, decreasing in 2018 but increasing in 2019. The overall increase in the unrecognized tax benefit suggests that the company is recognizing more tax benefits over time, potentially due to changes in tax laws or business operations. The large swings in prior period adjustments suggest some volatility in the assessment and resolution of prior year tax matters.
</details>
Question Unrecognized Tax Benefits Activity related to unrecognized tax benefits is as follows (in thousands): … As of July 31, 2019, the Company has no income tax audits in progress in the U.S. or foreign jurisdictions. What was the increase in unrecognized tax benefits in 2019? Answer $1.3 million.
Figure 3: Financial table requiring numerical reasoning over text.
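The ground-truth answer for this sample follows from a one-line computation over the table values above: the 2019 increase is the end-of-period balance minus the beginning-of-period balance. A minimal sketch (variable names are ours, values in thousands of USD as in the table):

```python
# Balances for fiscal year 2019, in thousands of USD (from the table above).
begin_2019 = 10_321
end_2019 = 11_633

increase = end_2019 - begin_2019  # net change during 2019

# Consistency check: the net change equals the sum of the three
# adjustment rows for 2019 (prior-period +98, prior-period -88,
# current-period +1,302).
assert increase == 98 - 88 + 1_302

print(f"Increase in 2019: ${increase:,} thousand (~${increase / 1000:.1f} million)")
```

The result, 1,312 thousand, rounds to the "$1.3 million" given as the reference answer.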
Document Understanding & Infographics
Dataset: DocVQA docvqa
<details>
<summary>selected_samples/docvqa.png Details</summary>

### Visual Description
## Form: Response Code Request Form
### Overview
The image is a scanned form titled "Response Code Request Form". It appears to be an internal document, possibly for a marketing campaign. The form contains fields for various details such as project name, distribution information, timing, agency details, fulfillment information, and contact persons.
### Components/Axes
* **Header**:
* "From: Joyce Bagby To: Brian Robbins" (top-left)
* "Date: 6/14/96 Time: 12:18:14" (top-center)
* "Page 2 of 7" (top-right)
* "Response Code Request Form" (centered, large font)
* **Form Fields**:
* "RJR Brand": "Eclipse"
* "Fax #": "(910) 741-1672", "Attn: Joyce Bagby"
* "RJR Program #": "601778", "Response Code: U19"
* "Project Name": "2-Free Pack Request - Viewership Interest Confirmation #2"
* "Description": "Mail Launch Mailers without an invitation to Sneak Previews to Charlotte and Indianapolis to confirm interest and intent levels. Mailer has a BRC offer for free trial which is this response code. If returned smoker gets two free packs."
* "Distribution":
* "Quantity": "750"
* "Pct. Response": "20.00 %"
* "# of Responders": "150"
* "Distribution Vehicle": "Direct Marketing"
* "Timing":
* "DTS": "07/10/96"
* "Expiration Date": "05/01/96"
* "Offer Complete": "07/11/96"
* "Days Offer Open": "1"
* "Data Entry":
* "Supplier": "M/A/R/C"
* "Incoming mail": "BRC"
* "Agency":
* "Agency Name": "Mezzina/Brown"
* "Contact Person": "Tami LaManna"
* "Phone #": "(212) 251-7700"
* "If M/A/R/C is supplier, do you need a M/A/R/C P.O. Box?": "Yes"
* "P.O. Box Title": "Eclipse"
* "Fulfillment":
* "Company": "M/A/R/C"
* "Job Number": (blank)
* "Number of Items on Order Form": "1"
* "P.O. Box": (blank), "City": (blank), "State": (blank), "Zip Code": (blank)
* "Comments": (blank text area)
* "Marketing Person Responsible for Project": "Mike Duffy - 0910"
* "Production Contact": "Joe Leinster"
* "Response Code Requested By": "Mike Duffy"
* "Copies of Confirmation Letter To":
* "Suzi Hicks"
* "Tami LaManna"
* "Vanessa Oakley"
* "Teresa Smith"
* "Jack Bellis"
* "Nancy Montgomery"
* **Footer**:
* "RJR Form 931 WM $300" (bottom-left)
* "Source: https://www.industrydocuments.ucsf.edu/docs/lflw0023" (bottom-center)
* "834076" (handwritten, right of comments box)
* "359" (handwritten, below 834076)
* "69100-69199" (handwritten, below 359)
* "52250 5639" (handwritten, along right edge)
### Detailed Analysis or Content Details
* **Distribution Details**: The campaign involves a quantity of 750 mailers, with an expected response rate of 20%, resulting in 150 responders. The distribution vehicle is direct marketing.
* **Timing**: The DTS (Date Time Stamp) is 07/10/96, the expiration date is 05/01/96, the offer is complete on 07/11/96, and the offer is open for 1 day.
* **Agency Information**: The agency involved is Mezzina/Brown, with Tami LaManna as the contact person and the phone number (212) 251-7700.
* **Fulfillment**: The company is M/A/R/C, and the number of items on the order form is 1.
* **Handwritten Numbers**: The numbers "834076", "359", "69100-69199", and "52250 5639" are handwritten, possibly representing internal tracking or coding.
### Key Observations
* The form is dated 6/14/96, and many of the dates are in 1996, suggesting this is a historical document.
* The project involves sending mailers to promote sneak previews in Charlotte and Indianapolis.
* The handwritten numbers at the bottom-right suggest internal tracking or coding.
### Interpretation
The "Response Code Request Form" provides a snapshot of a specific marketing campaign undertaken by RJR (likely a tobacco company, given the source URL). The form captures key details about the campaign's scope, timing, target audience, and responsible parties. The handwritten numbers likely represent internal tracking codes or identifiers used by RJR for managing and analyzing the campaign's performance. The document's existence within the UCSF industry documents library suggests its relevance to research or investigations related to the tobacco industry's marketing practices.
</details>
Question what is the response code ? Answer U19
Figure 4: Document image with question about textual fields.
OCR QA (text-centric VQA)
Dataset: TextVQA textvqa
<details>
<summary>selected_samples/textvqa.png Details</summary>

### Visual Description
## Photograph: New York City Taxi
### Overview
The image is a photograph of a New York City taxi in motion at night. The background is blurred, suggesting a long exposure time to capture the movement. The taxi is orange and has a roof-mounted advertisement.
### Components/Axes
* **Taxi:** The primary subject of the image, an orange New York City taxi.
* **Roof Advertisement:** A white box on top of the taxi with an advertisement. The visible text on the advertisement is "SONY" and "ON SALE NOW".
* **Taxi Markings:** The taxi has "N.Y.C. TAXI" and "3G54" printed on the rear quarter panel.
* **Taxi Fare Information:** A sign is visible on the passenger door, displaying taxi fare information. The text "TAXI FARE" is visible.
* **Background:** The background consists of blurred lights and shapes, suggesting a cityscape at night.
### Detailed Analysis or Content Details
* **Taxi Color:** Orange.
* **Taxi Number:** 3G54.
* **Advertisement:** The advertisement on top of the taxi is for "SONY" and indicates "ON SALE NOW".
* **Taxi Fare Sign:** The sign on the passenger door indicates "TAXI FARE".
* **Motion Blur:** The background exhibits significant motion blur, indicating that the taxi is moving.
### Key Observations
* The photograph captures the dynamic environment of New York City at night.
* The motion blur emphasizes the speed of the taxi.
* The visible advertisements and taxi markings provide context about the location and purpose of the vehicle.
### Interpretation
The photograph likely aims to convey the energy and pace of New York City. The blurred background and the moving taxi create a sense of dynamism. The presence of advertisements and taxi markings reinforces the urban setting. The image could be used to illustrate themes of transportation, urban life, or advertising.
</details>
Question what number is the cab Answer 3G54
Figure 5: Natural photo with a question requiring OCR.
General VQA
Dataset: VQAv2 vqav2
<details>
<summary>selected_samples/vqav2.png Details</summary>

### Visual Description
## Photograph: Skateboarder Performing Trick
### Overview
The image is a photograph of a skateboarder performing a trick over a picnic table. A crowd of people watches in the background. The sky is partly cloudy.
### Components/Axes
* **Foreground:** A picnic table with a skateboarder in mid-air above it.
* **Midground:** A crowd of people standing and watching.
* **Background:** A partly cloudy sky, buildings, and a light pole.
* **Text:** "@ Ketil Johansen" in the bottom center. The skateboarder's hat has the number "88" on it.
### Detailed Analysis
* The skateboarder is wearing a dark t-shirt with a colorful design, dark pants, and a green hat with the number "88" on it.
* The skateboarder is in mid-air, with the skateboard positioned beneath their feet.
* The picnic table is made of wood and has a dark finish.
* The crowd of people in the midground appears to be watching the skateboarder.
* The background includes buildings, a light pole, and a partly cloudy sky.
* The text "@ Ketil Johansen" is located in the bottom center of the image, likely indicating the photographer.
### Key Observations
* The skateboarder is the main subject of the photograph.
* The photograph captures a moment of action and skill.
* The crowd in the background adds context and atmosphere to the scene.
### Interpretation
The photograph likely captures a moment from a skateboarding event or demonstration. The skateboarder's skill is highlighted by the mid-air trick, and the crowd's presence suggests an audience or competition setting. The photographer's credit indicates the source of the image. The image suggests a dynamic and exciting scene within the skateboarding subculture.
</details>
Question Where is he looking? Answer down
Figure 6: General visual question answering on natural images.
Grounding, Spatial & Counting
Dataset: TallyQA tallyqa
<details>
<summary>selected_samples/tallyqa.png Details</summary>

### Visual Description
## Photograph: Motorcycle with Rider and Passenger
### Overview
The image is a medium shot of a black Harley Davidson motorcycle with a male rider and a younger passenger. The motorcycle is parked in what appears to be a parking lot, with other motorcycles and vehicles visible in the background. The rider is wearing a black leather jacket, black helmet, and sunglasses. The passenger is wearing a pink and white helmet with bunny ears, a black hoodie with "HARLEY DAVIDSON" printed on it, and jeans.
### Components/Axes
* **Foreground:** The primary subject is the black Harley Davidson motorcycle with the rider and passenger.
* **Midground:** Other motorcycles and vehicles are parked in the background.
* **Background:** Trees and a building structure are visible in the distance.
* **Text on Hoodie:** "HARLEY DAVIDSON"
### Detailed Analysis or Content Details
* **Motorcycle:** The motorcycle is a black Harley Davidson touring model, equipped with a windshield, saddlebags, and a passenger backrest.
* **Rider:** The rider is wearing a black leather jacket, black helmet, sunglasses, and black leather chaps.
* **Passenger:** The passenger is wearing a pink and white helmet with bunny ears, a black hoodie with "HARLEY DAVIDSON" printed on it, jeans, and tan shoes. The passenger is waving.
* **Background Details:** The background shows other motorcycles and vehicles parked in a parking lot. There are trees and a building structure visible in the distance.
### Key Observations
* The image captures a moment of leisure and recreation, with the rider and passenger seemingly ready for a ride.
* The presence of other motorcycles suggests a gathering or event of motorcycle enthusiasts.
* The passenger's attire, particularly the helmet with bunny ears, adds a touch of whimsy to the scene.
### Interpretation
The photograph likely depicts a recreational outing or gathering of motorcycle enthusiasts. The presence of a child passenger suggests a family-oriented aspect to the activity. The image conveys a sense of freedom, adventure, and camaraderie associated with motorcycle culture. The "HARLEY DAVIDSON" text on the passenger's hoodie reinforces the brand association and identity.
</details>
Question How many more people can ride on the vehicle? Answer 0
Figure 7: Scene requiring counting and spatial reasoning.
Math & Geometry (vision)
Dataset: CLEVR clevr
<details>
<summary>selected_samples/clevr.png Details</summary>

### Visual Description
## 3D Scene: Geometric Shapes
### Overview
The image is a 3D rendering of several geometric shapes on a white surface. There are four cylinders of different colors (blue, green, yellow, and gray) and three spheres (two purple and one teal). The scene is lit from an unseen source, casting shadows and highlights on the objects.
### Components/Axes
* **Shapes:** Cylinders and Spheres
* **Colors:** Blue, Green, Yellow, Gray, Purple, Teal
* **Arrangement:** The shapes are arranged seemingly randomly on the surface.
### Detailed Analysis or Content Details
* **Blue Cylinder:** Located on the left side of the image.
* **Green Cylinder:** Located towards the right side of the image, slightly behind the blue cylinder.
* **Yellow Cylinder:** Located on the right side of the image, next to the green cylinder.
* **Gray Cylinder:** Located behind the blue cylinder.
* **Purple Sphere (near):** Located in the lower center of the image.
* **Teal Sphere:** Located between the blue cylinder and the purple sphere in the background.
* **Purple Sphere (far):** Located behind the teal sphere.
### Key Observations
* The cylinders are opaque, while the spheres appear to have a reflective surface.
* The lighting creates highlights on the top surfaces of the cylinders and spheres.
* The shadows indicate a light source coming from the top-left of the image.
### Interpretation
The image appears to be a simple 3D scene created for demonstration or visualization purposes. The arrangement of the shapes does not suggest any specific relationship or meaning. The use of different colors and shapes could be intended to showcase the rendering capabilities of the software used to create the image. There is no data or information being conveyed beyond the visual representation of the shapes and their arrangement.
</details>
Question There is a large shiny object; does it have the same shape as the object right of the large metallic thing? Provide a short and direct response. Answer Yes.
Figure 8: Synthetic shapes testing compositional spatial reasoning.
Screens, Web & GUI
Dataset: ScreenQA screenqa
<details>
<summary>selected_samples/screenqa.png Details</summary>

### Visual Description
## Mobile App Screenshot: Initial Setup Screen
### Overview
The image is a screenshot of a mobile app's "Initial Setup" screen, prompting the user to input information about their menstrual cycle. The screen includes fields for "Default period length," "Default cycle length," and "Start date of your last period." The UI elements are primarily red and white on a light background.
### Components/Axes
* **Header:** "Initial Setup" with a settings gear icon.
* **Default period length:**
* Label: "Default period length"
* Description: "The length of your period. Usually lasts between 4 and 7 days"
* Value: "5" (days) with left and right arrow buttons for adjustment.
* **Default cycle length:**
* Label: "Default cycle length"
* Description: "The duration between the start dates of two periods. Usually between 23 and 35 days."
* Value: "30" (days) with left and right arrow buttons for adjustment.
* **Start date of your last period:**
* Label: "Start date of your last period"
* Date fields: Day (10), Month (Feb), Year (2017), each with up and down arrow buttons for adjustment.
* **Footer:** "Continue >" button.
* **Navigation Bar:** Android navigation bar with triangle (back), circle (home), and square (recent apps) buttons.
### Detailed Analysis or Content Details
* **Default Period Length:** The default period length is set to 5 days. The description indicates a typical range of 4-7 days.
* **Default Cycle Length:** The default cycle length is set to 30 days. The description indicates a typical range of 23-35 days.
* **Start Date:** The start date is set to February 10, 2017.
### Key Observations
* The app is designed to collect data related to menstrual cycles.
* The UI is simple and intuitive, using arrow buttons for easy adjustment of values.
* Default values are provided for period and cycle length, potentially based on common averages.
### Interpretation
The screenshot shows the initial setup screen of a period tracking app. The app aims to personalize the user experience by collecting data about their individual cycle. The default values and ranges provided suggest the app is designed to be user-friendly, even for those who may not know their exact cycle details. The "Continue" button indicates that this is just the first step in the setup process, and more data may be collected on subsequent screens.
</details>
Question What is the default period length? Answer 5 days
Figure 9: User interface screenshot with structured textual elements.
Medical (vision)
Dataset: PathVQA pathvqa
<details>
<summary>selected_samples/pathvqa.png Details</summary>

### Visual Description
## Microscopic Image: Liver Tissue
### Overview
The image is a microscopic view of liver tissue, showing cells and structures stained with immunohistochemistry. Two black arrows point to specific areas of interest. The tissue appears to be stained with two different colors, one blue and one brown.
### Components/Axes
* **Stain Colors:** Blue and Brown
* **Arrows:** Two black arrows pointing to specific structures.
* **Cellular Structures:** Visible cells with nuclei.
### Detailed Analysis or Content Details
The image shows liver cells stained blue. There are areas stained brown, indicated by the arrows. The arrow on the left points to a circular structure stained brown. The arrow on the right points to a cluster of cells stained brown.
### Key Observations
* Two distinct staining patterns are visible: blue and brown.
* The brown staining appears to be localized to specific structures within the tissue.
### Interpretation
The image likely represents an immunohistochemical stain of liver tissue. The blue stain probably highlights the general cellular structure, while the brown stain indicates the presence of a specific protein or antigen. The arrows point to areas where this protein/antigen is concentrated. The circular structure on the left could be a bile duct or blood vessel, while the cluster of cells on the right could be immune cells or hepatocytes expressing the antigen. Further information about the antibody used for the brown stain would be needed to fully interpret the image.
</details>
Question where are liver stem cells (oval cells) located? Answer in the canals of hering
Figure 10: Microscopic pathology image for medical VQA.
Text-only (instruction / code / math / translation)
Dataset: aya_dataset aya
Question Quels président des États-Unis ne s’est jamais marié? Answer James Buchanan est le seul président qui ne s’est jamais marié. (English: Which U.S. president never married? James Buchanan is the only president who never married.)
Figure 11: Text-only tasks covering multiple languages.