# jina-vlm: Small Multilingual Vision Language Model
**Authors**: Jina AI by Elastic
> Equal contribution. Work done during internship at Jina AI.
Abstract
We present jina-vlm, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. The model achieves leading results on standard VQA benchmarks and multilingual evaluations while preserving competitive text-only performance. Model weights and code are publicly released at https://huggingface.co/jinaai/jina-vlm.
1 Introduction
Vision-language models (VLMs) combine pretrained vision encoders with large language models to tackle tasks requiring joint visual and textual understanding (flamingo; llava). Recent VLMs have achieved strong results on visual question answering (VQA), OCR, and multimodal reasoning. However, two challenges limit their practical deployment. First, multilingual capabilities often degrade during vision adaptation: models that perform well on English benchmarks show uneven results across other languages (vlmsurvey). Second, high-quality VLMs remain computationally expensive to train and deploy, limiting accessibility for researchers and practitioners with constrained resources.
This work introduces jina-vlm, a 2.4B parameter VLM that addresses both challenges. The model aligns a SigLIP2-So400M/14-384 vision encoder (siglip2) with Qwen3-1.7B-Base (qwen3) through an attention-pooling connector, trained with a two-stage pipeline that explicitly incorporates multilingual data. Among open 2B-scale VLMs, jina-vlm achieves state-of-the-art performance on multilingual multimodal benchmarks including MMMB and Multilingual MMBench, demonstrating that small models can excel at cross-lingual visual understanding without sacrificing general capabilities. On standard English benchmarks spanning diagrams, charts, documents, and OCR, jina-vlm achieves the highest average score (72.3) across eight VQA benchmarks among 2B-scale VLMs. These results are enabled by two technical contributions: an efficient arbitrary-resolution pipeline that combines overlapping tiling with attention-based token pooling to reduce visual token count by 4 $×$ , and a training recipe that incorporates text-only data to preserve the language understanding performance of the backbone LLM.
2 Related Work
VLM architecture and training. Modern VLMs follow an architecture introduced by PaLI (pali): a pretrained vision encoder extracts visual features, a connector projects them into the language model’s embedding space, and a decoder-only language model generates text conditioned on these visual tokens. Vision Transformers (ViTs) (vit) produce patch-level representations that the language model processes alongside text embeddings. This design is adopted by LLaVA (llava; llava_1_5; llava_uhd; llava_next_interleave; llava_onevision), QwenVL (qwen_vl; qwen_2_vl; qwen_2_5_vl), InternVL (internvl; internvl_1_5; internvl_2_5; internvl_3; internvl_3_5), and Ovis (ovis; ovis_2_5). Training strategies vary: qwen_2_vl; internvl_2_5 alternate between multimodal instruction tuning and general training; llava_1_5 incorporate academic VQA datasets; molmo, llava_onevision, and cambrian1 curate large-scale, diverse data mixtures.
Efficient resolution-agnostic image processing. Standard ViTs process fixed-resolution images, requiring resizing that discards fine-grained detail. Since visual token count scales with resolution and Transformer computation scales quadratically with sequence length, naive high-resolution processing is prohibitive. Several solutions exist: molmo tile images with overlap; qwen_2_vl introduce Naive Dynamic Resolution with Multimodal Rotary Position Embedding (rope; 2drope); ovis_2_5 use native-resolution ViTs (navit). Orthogonally, images often contain low-information regions (e.g., sky backgrounds), making visual tokens highly redundant. Token compression methods address this (fastv; prumerge; visionzip; pyramiddrop). internvl_1_5 develop Dynamic High-Resolution Tiling, and nvila propose scale-then-compress strategies. Recent work on training-free token budgeting, such as HERO (hero), demonstrates that inference-time pruning can achieve significant speedups while preserving accuracy; our approach differs by learning compact representations during training rather than dropping tokens at inference.
Vision-language connectors. The connector bridging vision encoders and language models significantly impacts both efficiency and performance. BLIP-2 (blip2) introduces Q-Former, a learnable query-based transformer that extracts fixed-length representations from visual features, reducing the number of tokens fed to the LLM. Flamingo (flamingo) uses a Perceiver Resampler with cross-attention to compress visual tokens. Our attention-pooling connector shares the goal of token reduction but operates differently: rather than learning a fixed set of queries, we apply local 2 $×$ 2 attention pooling that preserves spatial structure while achieving 4 $×$ compression, which we found more effective for tasks requiring fine-grained spatial understanding.
Small VLMs. Efficiency has become a central objective. mobilevlmv2 demonstrate competitive performance below 2B parameters. imp combine quantization with aggressive resolution reduction for mobile deployment, matching larger models’ performance. MiniCPM-V (minicpmv) targets edge deployment while maintaining strong OCR and multilingual capabilities. smolvlm systematically explore design parameters to train VLMs as small as 256M parameters.
Multilingual VLMs. Many lightweight VLMs (paligemma; paligemma2; phi3) achieve strong English performance but degrade on other languages. qwen_2_vl and internvl_1_5 address this through targeted multilingual training data. pangea introduce instruction-tuning data spanning 39 languages.
Retaining text-only performance. Multimodal training often degrades text-only capabilities. Mitigation strategies include balanced data mixtures, careful learning rate scheduling (cauldron), and partial backbone freezing (llava_onevision; internvl_3_5).
3 Model Architecture
Figure 1: Architecture of jina-vlm. Images are resized to fit a grid of up to 12 overlapping tiles, plus a global thumbnail. Each tile is a square 378 $×$ 378 crop; adjacent tiles overlap by 112 pixels with a stride of 266 pixels between tile origins. A 4 $×$ 3 grid therefore spans 1176 $×$ 910 pixels, and images exceeding this effective resolution are downscaled to fit the tile budget. Each tile produces 729 patches via SigLIP2 (siglip2). The VL connector concatenates features from layers 24 and 18, the third- and ninth-to-last layers, then applies 2 $×$ 2 attention pooling to reduce 729 tokens to 182 before projecting to the decoder dimension. Visual tokens are combined with text embeddings for the Qwen3 decoder (qwen3).
Figure 1 illustrates the architecture of jina-vlm. The model uses overlapping image tiling following molmo, combined with attention-based token pooling to reduce sequence length while preserving spatial information.
The vision encoder, SigLIP2-So400M/14-384, is a 27-layer Vision Transformer with 400M parameters that processes 378 $×$ 378 pixel inputs as 27 $×$ 27 grids of 14 $×$ 14 patches. To handle arbitrary resolutions, we decompose each image into overlapping tiles of this size and process each tile independently through the encoder. A global thumbnail, the full image resized to 378 $×$ 378, provides context alongside the tile representations. We use a default of 12 tiles during training; this limit can be increased at inference or during continued training to handle higher resolutions, with memory scaling linearly with tile count. The tiling algorithm is detailed in Appendix A.1.
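The tile geometry is fully determined by the patch size and overlap margins. A minimal sketch, using the constants stated in Figure 1 and Appendix A.1, checks the numbers:

```python
# Checking the tiling geometry of Figure 1 from the constants in Appendix A.1:
# patch size 14, 27x27 patches per tile, (4, 4) overlap margins.
PATCH = 14                              # SigLIP2 patch size in pixels
TILE = 27 * PATCH                       # 378-pixel square tile
MARGIN_PATCHES = 4 + 4                  # left + right overlap margins, in patches
STRIDE = (27 - MARGIN_PATCHES) * PATCH  # stride between tile origins in pixels

overlap = TILE - STRIDE       # pixels shared by adjacent tiles
grid_w = 3 * STRIDE + TILE    # horizontal span of a 4-column grid
grid_h = 2 * STRIDE + TILE    # vertical span of a 3-row grid
print(STRIDE, overlap, grid_w, grid_h)  # 266 112 1176 910, as in Figure 1
```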
3.1 Vision-Language Connector
Rather than using the final ViT output, jina-vlm concatenates features from two intermediate layers: the third-to-last and ninth-to-last, corresponding to layers 24 and 18 of the 27-layer encoder. This captures both fine-grained spatial details from earlier layers and high-level semantics from later layers. The connector then applies attention pooling over 2 $×$ 2 patch neighborhoods, using mean-pooled features as queries. This reduces the token count by 4 $×$ while preserving local structure. A SwiGLU projection layer maps the pooled representations to the language model’s embedding dimension.
In more formal terms, let $\mathbf{H}^{(\ell)}∈\mathbb{R}^{N× d_{v}}$ denote the hidden states from ViT layer $\ell$ , where $N$ is the number of patches, $d_{v}$ is the vision encoder hidden size, and negative indices count from the final layer (e.g., $\ell=-1$ is the last layer). We concatenate features from two internal layers:
$$
\mathbf{H}_{\text{concat}}=[\mathbf{H}^{(-3)};\mathbf{H}^{(-9)}]\in\mathbb{R}^{N\times 2d_{v}} \tag{1}
$$
For each $2{×}2$ patch neighborhood $\mathcal{N}_{i}$ , we compute a query vector as the mean of the neighborhood features:
$$
\mathbf{q}_{i}=\frac{1}{4}\sum_{j\in\mathcal{N}_{i}}\mathbf{h}_{j},\quad\mathbf{Q}=[\mathbf{q}_{1};\dots;\mathbf{q}_{M}]\in\mathbb{R}^{M\times 2d_{v}} \tag{2}
$$
where $\mathcal{N}_{i}$ contains the four patches at positions $(2i_{x},2i_{y})$ , $(2i_{x}+1,2i_{y})$ , $(2i_{x},2i_{y}+1)$ , and $(2i_{x}+1,2i_{y}+1)$ and $M=N/4$ .
Attention pooling is then computed as:
$$
\mathbf{H}_{\text{pooled}}=(\text{softmax}\left(\frac{\mathbf{Q}\mathbf{W}_{Q}(\mathbf{H}_{\text{concat}}\mathbf{W}_{K})^{\top}}{\sqrt{d_{k}}}\right)\mathbf{H}_{\text{concat}}\mathbf{W}_{V})\mathbf{W}_{O}\in\mathbb{R}^{M\times d_{v}} \tag{3}
$$
where $d_{k}=d_{v}$ and $\mathbf{W}_{Q}∈\mathbb{R}^{2d_{v}× d_{k}}$ , $\mathbf{W}_{K}∈\mathbb{R}^{2d_{v}× d_{k}}$ , $\mathbf{W}_{V}∈\mathbb{R}^{2d_{v}× 2d_{v}}$ and $\mathbf{W}_{O}∈\mathbb{R}^{2d_{v}× d_{v}}$ are learnable weight matrices. Finally, the pooled visual features are projected to the language model embedding dimension via a SwiGLU (swiglu) layer:
$$
\mathbf{H}_{\text{proj}}=\left(\text{Swish}(\mathbf{H}_{\text{pooled}}\mathbf{W}_{1})\odot(\mathbf{H}_{\text{pooled}}\mathbf{W}_{2})\right)\mathbf{W}_{3}\in\mathbb{R}^{M\times d_{l}} \tag{4}
$$
where $\text{Swish}(x)=x·\sigma(x)$ , $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication, $\mathbf{W}_{1},\mathbf{W}_{2}∈\mathbb{R}^{d_{v}× 3d_{l}}$ , $\mathbf{W}_{3}∈\mathbb{R}^{3d_{l}× d_{l}}$ are learnable parameters, and $d_{l}$ is the language model embedding size.
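The connector can be sketched in a few lines of PyTorch. The snippet below follows Eqs. (1)–(4), with one reading choice that is our assumption: each query attends only to the four patches of its own 2 $×$ 2 neighborhood. It also assumes an even patch grid for the reshape; the 27 $×$ 27 SigLIP2 grid needs an uneven split or padding, omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPoolConnector(nn.Module):
    """Sketch of the VL connector (Eqs. 1-4): concatenate two ViT layers,
    attention-pool each 2x2 neighborhood with its mean as the query, then
    project to the LLM dimension with SwiGLU."""

    def __init__(self, d_v: int, d_l: int):
        super().__init__()
        d_in = 2 * d_v  # features from two concatenated ViT layers
        self.d_k = d_v
        self.w_q = nn.Linear(d_in, self.d_k, bias=False)
        self.w_k = nn.Linear(d_in, self.d_k, bias=False)
        self.w_v = nn.Linear(d_in, d_in, bias=False)
        self.w_o = nn.Linear(d_in, d_v, bias=False)
        self.w1 = nn.Linear(d_v, 3 * d_l, bias=False)  # SwiGLU (Eq. 4)
        self.w2 = nn.Linear(d_v, 3 * d_l, bias=False)
        self.w3 = nn.Linear(3 * d_l, d_l, bias=False)

    def forward(self, h_m3: torch.Tensor, h_m9: torch.Tensor, grid: int) -> torch.Tensor:
        # h_m3, h_m9: (N, d_v) hidden states from layers -3 and -9; N = grid * grid
        h = torch.cat([h_m3, h_m9], dim=-1)               # Eq. 1: (N, 2*d_v)
        d = h.shape[-1]
        # Group patches (row-major) into 2x2 neighborhoods: (M, 4, 2*d_v)
        nb = (h.view(grid // 2, 2, grid // 2, 2, d)
               .permute(0, 2, 1, 3, 4).reshape(-1, 4, d))
        q = nb.mean(dim=1, keepdim=True)                  # Eq. 2: (M, 1, 2*d_v)
        # Eq. 3: each mean query attends to its own four patches
        scores = self.w_q(q) @ self.w_k(nb).transpose(1, 2) / self.d_k ** 0.5
        pooled = self.w_o((scores.softmax(dim=-1) @ self.w_v(nb)).squeeze(1))
        return self.w3(F.silu(self.w1(pooled)) * self.w2(pooled))  # Eq. 4: (M, d_l)
```

With $d_v=1152$ (SigLIP2-So400M) and $d_l=2048$ (Qwen3-1.7B), this layout comes to roughly 40M parameters, the same order as the ~50M connector shown in Figure 1; the exact count depends on details not specified here.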
3.2 Language Decoder
The language decoder is initialized from Qwen3-1.7B-Base https://huggingface.co/Qwen/Qwen3-1.7B-Base, which empirically outperformed the instruction-tuned variant in our setting. We introduce three special tokens to structure visual inputs: <im_start> and <im_end> delimit image and thumbnail sequences, while <im_col> marks row boundaries within the patch grid, where tokens are arranged left-to-right and top-to-bottom. Input and output embedding weights are not tied.
3.3 Efficiency Analysis
Table 1 quantifies the computational benefits of attention pooling. With the default 12-tile configuration (plus thumbnail), the unpooled baseline would produce 9,477 visual tokens per image, while our 2 $×$ 2 pooling reduces this to 2,366 tokens. Since the ViT processes each tile identically regardless of pooling, the savings apply exclusively to the LLM: we observe a 3.9 $×$ reduction in prefill FLOPs and a 4 $×$ reduction in KV-cache memory. The overall FLOPs reduction is 2.3 $×$ when including the shared ViT cost.
Table 1: Efficiency comparison with and without 2 $×$ 2 attention pooling for the default 12-tile configuration. FLOPs are computed for LLM prefill; KV-cache assumes fp16 precision.
| | w/o pooling | w/ 2 $×$ 2 pooling | Reduction |
| --- | --- | --- | --- |
| Visual tokens | 9,477 | 2,366 | 4.0 $×$ |
| LLM prefill FLOPs | 27.2 TFLOPs | 6.9 TFLOPs | 3.9 $×$ |
| KV-cache memory | 2.12 GB | 0.53 GB | 4.0 $×$ |
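The token counts above follow directly from the tile configuration; a quick check (the 27 $×$ 27 $→$ 182 pooled-token mapping per tile is taken from the stated per-tile count):

```python
# Reproducing the visual-token counts of Table 1 from the stated configuration.
TILES = 12 + 1              # 12 grid tiles plus the global thumbnail
PATCHES_PER_TILE = 27 * 27  # 729 SigLIP2 patches per 378x378 tile
POOLED_PER_TILE = 182       # per-tile tokens after 2x2 attention pooling

unpooled = TILES * PATCHES_PER_TILE   # visual tokens without pooling
pooled = TILES * POOLED_PER_TILE      # visual tokens with pooling
print(unpooled, pooled, round(unpooled / pooled, 1))  # 9477 2366 4.0
```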
4 Training
Training proceeds in two stages, both updating all model components (encoder, connector, and decoder) without freezing, following molmo. The combined data comprises approximately 5M multimodal samples and 12B text tokens across 30+ languages, with roughly half in English and the remainder spanning high- and moderate-resource languages. Table 2 summarizes hyperparameters for both stages.
4.1 Stage 1: Alignment Training
The first stage focuses on cross-language semantic grounding rather than task-specific objectives. Training data consists primarily of caption datasets (PixmoCap (molmo), PangeaIns (pangea)) spanning diverse visual domains: natural scenes, documents, infographics, and diagrams. We include 15% text-only data from PleiAS/common_corpus (pleias) to mitigate degradation on text-only tasks. The connector uses a higher learning rate and shorter warmup than the encoder and decoder.
4.2 Stage 2: Instruction Fine-tuning
The second stage trains instruction-following for VQA and reasoning tasks. We combine public dataset collections, including LLaVA OneVision (llava_onevision), Cauldron (cauldron), Cambrian (cambrian1), PangeaIns (pangea), and FineVision (finevision), with text-only instruction data from aya. The mixture covers academic VQA, document understanding, OCR, mathematics, and reasoning. Appendix A.2 shows representative examples.
Given the heterogeneity of the instruction data, we found it more effective to begin this stage with single-source batches, likely because each batch then presents a consistent task format. We train for 30K steps with single-source batches, followed by 30K steps with mixed-source batches.
Table 2: Model training hyperparameters across pre-training and fine-tuning stages.
| Hyperparameter | Pre-Training | Fine-Tuning |
| --- | --- | --- |
| Warmup ViT | 10% | 10% |
| Warmup Con. | 1% | 10% |
| Warmup LLM | 10% | 10% |
| LR ViT | 6e-6 | 5e-6 |
| LR Con. | 2e-4 | 5e-6 |
| LR LLM | 2e-5 | 1e-5 |
| Cosine Decay | 0.1 | 0.1 |
| Eps. | 1e-6 | 1e-6 |
| Betas | 0.9, 0.95 | 0.9, 0.95 |
| Batch Size | 128 | 256 |
| Steps | 25K | 60K |
| Samples | 3.2M | 15.3M |
| Tokens | 10B | 37B |
| GPU Hours | 296 | 1,000 |
5 Evaluation
We compare jina-vlm against lightweight VLMs across seven capability areas: general VQA, multimodal comprehension, multi-image reasoning, hallucination control, mathematical reasoning, text-only performance, and multilingual understanding. All evaluations use VLMEvalKit https://github.com/open-compass/VLMEvalKit (vlmevalkit) with English prompts matching our training format (e.g., “Return only the letter of the best answer option” for multiple-choice, “Respond very briefly” for open-ended questions).
5.1 General VQA Tasks
Table 3 reports results on eight VQA benchmarks covering diagrams (AI2D (ai2d)), charts (ChartQA (chartqa), CharXiv (charxiv)), scene text (TextVQA (textvqa)), documents (DocVQA (docvqa), InfoVQA (infovqa)), OCR (OCRBench (ocrbench)), and diverse scenes (SEED-Bench-2-Plus (seedbench2plus)). jina-vlm achieves the highest average (72.3), with particularly strong performance on diagram interpretation and text extraction.
Table 3: Comparison of general visual question answering performance.
| Model | AI2D | ChartQA (test avg) | TextVQA (val) | DocVQA (val) | InfoVQA (val) | OCRBench | SEED-Bench-2-Plus | CharXiv (RQ / DQ) | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| jina-vlm | $82.0$ | $81.9$ | $83.2$ | $90.6$ | $71.6$ | $778$ | $67.2$ | 32.3 / 63.5 | $72.3$ |
| Qwen2-VL-2B | $74.7$ | $73.5$ | $79.7$ | $89.2$ * | $64.0$ * | $809$ | $62.4$ | 23.3 / 55.0* | $66.4$ |
| Qwen3-VL-2B | $76.9$ | $77.2$ | $79.5$ | $92.3$ * | $71.9$ * | $858$ | $67.3$ * | 28.8 / 62.3 | $71.6$ |
| InternVL3-2B | $78.6$ | $80.2$ | $77.0$ | $87.4$ * | $67.1$ * | $835$ | $64.6$ | 28.3 / 54.7 | $69.2$ |
| InternVL3.5-2B | $78.8$ | $80.7$ | $76.5$ | $88.5$ * | $69.3$ * | $836$ | $68.0$ | 31.6 / 65.0 | $71.6$ |
Results for models other than jina-vlm are from their respective papers (internvl_3_5; internvl_3; qwen_2_vl), except those marked with * which were computed using VLMEvalKit. All scores represent accuracy (%) except OCRBench which uses a 0–1000 scale; for overall average computation, OCRBench scores are divided by 10 to align with the 0–100 scale of other benchmarks.
5.2 Document and Real-World Understanding
Table 4 shows results on multimodal comprehension (MME (mme), MMB v1.1 (mmbench), MMStar (mmstar)) and real-world understanding (RealWorldQA (realworldqa), MME-RealWorld (mmerealworld), R-Bench (rbench)). jina-vlm scores 67.4 on multimodal tasks and 61.9 on real-world tasks, achieving the best RealWorldQA result (68.2).
Table 4: Comparison of generic multimodal understanding and real-world understanding performance.
| Model | MME | MMB v1.1 | MMStar | Avg | RealWorldQA | MME-RealWorld | R-Bench | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| jina-vlm | $1965.8$ | $75.8$ | $56.2$ | $67.4$ | $68.2$ | $50.7$ | $66.7$ | $61.9$ |
| Qwen2-VL-2B | $1872.0$ | $72.2$ | $48.0$ | $62.4$ | $62.9$ | $38.7$ * | $63.2$ | $55.0$ * |
| Qwen3-VL-2B | $2000.8$ * | $77.8$ | $58.3$ | $69.2$ | $63.9$ | $57.9$ * | $67.3$ * | $63.0$ |
| InternVL3-2B | $2221.2$ | $78.6$ | $60.7$ | $72.9$ | $64.3$ | $53.8$ | $67.5$ | $61.9$ |
| InternVL3.5-2B | $2123.3$ | $76.6$ | $62.7$ | $71.7$ | $62.0$ | $49.7$ | $62.4$ | $58.0$ |
Results for models other than jina-vlm are from their respective papers (internvl_3_5; internvl_3; qwen_2_vl), except those marked with * which are computed using VLMEvalKit. All scores represent accuracy (%) except MME which uses a 0–2800 scale; for overall average computation, MME scores are divided by 28 to align with the 0–100 scale of other benchmarks.
5.3 Multi-Image Reasoning and Hallucination
Table 5 reports multi-image reasoning (BLINK (blink), MuirBench (muirbench), MMT (mmtbench)) and hallucination benchmarks that measure the tendency to fabricate visual details (HallBench (hallusionbench), POPE (pope)). jina-vlm scores 47.3 on multi-image tasks, which is expected given limited multi-image training data, but achieves the best POPE score (90.3), indicating low hallucination rates.
Table 5: Comparison of multi-image and hallucination performance.
| Model | BLINK | MuirBench | MMT | Avg | HallBench | POPE | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| jina-vlm | $50.1$ | $34.7$ | $57.2$ | $47.3$ | $39.1$ | $90.3$ | $64.7$ |
| Qwen2-VL-2B | $44.4$ | $25.5$ * | $55.1$ | $41.7$ | $41.7$ | $87.9$ * | $64.8$ |
| Qwen3-VL-2B | $53.8$ | $47.4$ | $60.0$ * | $53.7$ | $44.5$ | $88.9$ * | $66.7$ |
| InternVL3-2B | $50.3$ | $38.8$ | $59.5$ | $49.5$ | $42.5$ | $89.6$ | $66.1$ |
| InternVL3.5-2B | $51.3$ | $44.0$ | $58.5$ | $51.3$ | $48.6$ | $87.2$ | $67.9$ |
Results for models other than jina-vlm are from their respective papers, (internvl_3_5; internvl_3; qwen_2_vl), except those marked with * which are computed using VLMEvalKit. All scores represent accuracy (%).
5.4 Mathematical Reasoning
Table 6 reports structured reasoning benchmarks: multidisciplinary comprehension (MMMU (mmmu)), visual mathematics (MathVista (mathvista), MathVision (mathvision), MathVerse (mathverse), WeMath (wemath)), and logical reasoning (LogicVista (logicvista)). jina-vlm performs comparably to InternVL3-2B and outperforms Qwen2-VL-2B.
Table 6: Comparison of multimodal reasoning and mathematical problem-solving performance.
| Model | MMMU | MathVista | MathVision | MathVerse (Vision Only) | WeMath | LogicVista | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| jina-vlm | $45.6$ | 59.5 | 19.2 | 23.9 | 17.1 | 33.3 | $33.1$ |
| Qwen2-VL-2B | $41.1$ | 43.0 | 12.4 | 17.3* | 10.9* | 27.3* | $25.3$ |
| Qwen3-VL-2B | $53.4$ | 61.3 | 31.6 | 22.7* | 28.0* | 35.4* | $38.7$ |
| InternVL3-2B | $48.6$ | 57.0 | 21.7 | 25.3 | 22.4 | 36.9 | $35.3$ |
| InternVL3.5-2B | $59.0$ | 71.8 / 61.5 † | 42.8 / 26.5 † | 53.4 / 35.3 † | 48.5 / 19.1 † | 47.7 / 41.4 † | $50.7$ |
Results for models other than jina-vlm are from their respective papers, (internvl_3_5; internvl_3; qwen_2_vl), except those marked with * which are computed using VLMEvalKit. $\dagger$ indicates scores for InternVL3.5-2B without thinking mode, evaluated using VLMEvalKit. All scores represent accuracy (%).
5.5 Text-Only Performance
Table 7 compares jina-vlm against the backbone Qwen3-1.7B on text-only benchmarks: MMLU (mmlu), MMLU-Pro (mmlupro), GSM-8K (gsm8k), ARC-C (arc), and HellaSwag (hellaswag). Results show mixed preservation of text-only capabilities: jina-vlm matches or exceeds the backbone on commonsense reasoning (ARC-C, HellaSwag) and retains most performance on MMLU and GSM-8K. However, MMLU-Pro shows substantial degradation (46.4 $→$ 30.3), likely because this benchmark emphasizes extended multi-step reasoning that conflicts with our instruction-tuning toward concise visual responses. This suggests a trade-off between optimizing for multimodal tasks and preserving complex text-only reasoning, which future work could address through more balanced data mixtures or curriculum scheduling.
Table 7: Comparison of Text-only benchmarks.
| Model | MMLU | MMLU-Pro | GSM-8K | ARC-C | HellaSwag | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| jina-vlm | $56.1$ | $30.3$ | $71.3$ | $77.3$ | $59.4$ | $58.9$ |
| Qwen3-1.7B | $62.6$ | $46.4$ | $75.3$ | $73.4$ | $59.0$ | $63.3$ |
Results are collected using our evaluation code. All scores represent accuracy (%).
5.6 Multilingual Understanding
Table 8 reports multilingual multimodal benchmarks: MMMB (mmmb), Multilingual MMBench (mmmb), and MTVQA (mtvqa). jina-vlm achieves state-of-the-art multilingual performance among 2B-scale VLMs, with the highest averages on MMMB (78.8) and Multilingual MMBench (74.3).
Table 8: Comparison of multilingual multimodal understanding performance.
| Benchmark | Lang. | jina-vlm | Qwen2-VL-2B | Qwen3-VL-2B | InternVL3-2B | InternVL3.5-2B |
| --- | --- | --- | --- | --- | --- | --- |
| MMMB | ar | $80.0$ | $74.2$ | $75.7$ * | $78.3$ | $77.7$ |
| | cn | $76.9$ | $68.3$ | $72.7$ * | $68.6$ | $68.5$ |
| | en | $82.0$ | $78.3$ | $80.7$ * | $81.9$ | $80.2$ |
| | pt | $79.2$ | $72.6$ | $75.0$ * | $75.4$ | $75.9$ |
| | ru | $79.2$ | $72.8$ | $75.9$ * | $74.6$ | $76.3$ |
| | tr | $75.5$ | $61.8$ | $68.5$ * | $62.9$ | $69.1$ |
| | avg | $78.8$ | $71.3$ | $75.0$ * | $73.6$ | $74.6$ |
| Multi. MMBench | ar | $70.0$ | $66.7$ | $66.2$ * | $66.4$ | $63.7$ |
| | cn | $75.9$ | $67.0$ | $75.7$ * | $77.8$ | $75.9$ |
| | en | $78.8$ | $71.1$ | $77.8$ * | $81.3$ | $78.4$ |
| | pt | $74.7$ | $72.1$ | $71.4$ * | $75.9$ | $73.7$ |
| | ru | $75.3$ | $69.9$ | $75.9$ * | $70.7$ | $71.4$ |
| | tr | $71.1$ | $69.3$ | $67.0$ * | $59.5$ | $62.0$ |
| | avg | $74.3$ | $69.4$ | $72.3$ * | $71.9$ | $70.9$ |
| MTVQA | | $25.6$ | $20.6$ | $27.3$ * | $26.7$ | $28.5$ |
| Overall | | $59.6$ | $53.8$ | $58.2$ | $57.4$ | $58.0$ |
Results for baseline models are derived from their original publications, (internvl_3_5; internvl_3; qwen_2_vl), except those marked with * which are computed using VLMEvalKit. All scores represent accuracy (%).
6 Conclusion
We presented jina-vlm, a 2.4B vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. Our results demonstrate that small VLMs can attain strong cross-lingual visual understanding through careful architectural and training choices: attention-based token pooling reduces visual tokens by 4 $×$ while preserving spatial information, and incorporating text-only data during multimodal training mitigates the catastrophic forgetting typically observed in vision-adapted language models. On standard English VQA benchmarks, jina-vlm achieves leading results, demonstrating that multilingual capabilities need not come at the cost of general performance.
The current approach has limitations. Multi-tile processing introduces computational overhead that scales with image resolution, and tiling can fragment global spatial context, potentially impairing performance on tasks requiring holistic scene understanding such as object counting or precise spatial reasoning across tile boundaries. While the global thumbnail partially mitigates this, native-resolution approaches (navit) may be better suited for such tasks. We have not emphasized safety-critical training or alignment, and multi-image reasoning remains weak due to limited training data in this regime. Future work could explore more efficient resolution handling, targeted improvements for counting and spatial tasks, and whether our multilingual training recipe transfers to larger model scales.
Appendix A
A.1 Pseudocode for Creating Overlapping Tiles
Input: Image $I$ of size $(h,w)$; base input size $\mathbf{b}=(b_{h},b_{w})=(378,378)$; patch size $p=14$; maximum number of tiles $M$ ($12$ by default, configurable); overlap margins $(m_{L},m_{R})=(4,4)$ in patches.
Output: List of tiles $\mathcal{C}$ (thumbnail + grid tiles); tiling $(t_{h},t_{w})$ = (number of rows, number of columns).
1. Compute overlap-related sizes
$m_{\text{tot}}← p·(m_{L}+m_{R})$ // Total overlap margin in pixels
$s_{\text{win}}←\bigl(\lfloor b_{h}/p\rfloor-(m_{L}+m_{R})\bigr)· p$ // Tile stride in pixels
2. Select tiling on the margin-reduced image
$(t_{h},t_{w})←\textsc{SelectTilingWithMinimalScaleChange}\bigl(h-m_{\text{tot}},\,w-m_{\text{tot}},\,s_{\text{win}},\,M\bigr)$ ;
3. Resize image to exactly fit the chosen tiling + margins;
$H^{\prime}← t_{h}· s_{\text{win}}+m_{\text{tot}}$ ;
$W^{\prime}← t_{w}· s_{\text{win}}+m_{\text{tot}}$ ;
$I_{\text{grid}}←\textsc{Resize}(I,\;[H^{\prime},W^{\prime}])$ ;
4. Extract overlapping tiles
$\mathcal{G}←\textsc{ExtractTiles}\bigl(I_{\text{grid}},\,(t_{h},t_{w}),\,s_{\text{win}},\,b_{h}\bigr)$ // $b_{h}$ is the tile height, equal to $b_{w}$ here
5. Build thumbnail and final tile list
$T←\textsc{Resize}(I,\;[b_{h},b_{w}])$ // Global thumbnail
$\mathcal{C}←[T]\mathbin{+\!\!+}\mathcal{G}$ // Concatenate thumbnail and tiles
return $(\mathcal{C},(t_{h},t_{w}))$ ;
Algorithm 1 GetAllTilesOverlapAndResize
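Algorithm 1 can be sketched in Python. SelectTilingWithMinimalScaleChange is not fully specified above; the cost function below (minimizing the rescaling needed to fit the margin-reduced image exactly) is one plausible reading, not the paper's exact implementation.

```python
from math import inf

def select_tiling(h: int, w: int, stride: int, max_tiles: int) -> tuple[int, int]:
    """Our reading of SelectTilingWithMinimalScaleChange: among all grids with
    at most max_tiles tiles, pick the (rows, cols) whose exact-fit canvas
    requires the least rescaling of the margin-reduced image."""
    best, best_cost = (1, 1), inf
    for th in range(1, max_tiles + 1):
        for tw in range(1, max_tiles // th + 1):
            sh, sw = th * stride / h, tw * stride / w  # per-axis scale factors
            cost = abs(1 - sh) + abs(1 - sw)           # deviation from no rescaling
            if cost < best_cost:
                best, best_cost = (th, tw), cost
    return best

def get_all_tiles(h: int, w: int, base: int = 378, patch: int = 14,
                  margins: tuple[int, int] = (4, 4), max_tiles: int = 12):
    """Steps 1-4 of Algorithm 1: returns the exact-fit canvas size, the chosen
    tiling, and the top-left origin of every overlapping tile (the thumbnail,
    step 5, is just the full image resized to base x base)."""
    m_tot = patch * sum(margins)                     # total overlap margin, px
    stride = (base // patch - sum(margins)) * patch  # tile stride, px
    th, tw = select_tiling(h - m_tot, w - m_tot, stride, max_tiles)
    H, W = th * stride + m_tot, tw * stride + m_tot  # resize target
    origins = [(r * stride, c * stride) for r in range(th) for c in range(tw)]
    return (H, W), (th, tw), origins

# A 910 x 1176 image is already an exact fit for the 3 x 4 grid of Figure 1:
print(get_all_tiles(910, 1176)[:2])  # ((910, 1176), (3, 4))
```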
A.2 Training Set Examples
Captioning & Instruction
Dataset: VisualWebInstruct (visualwebinstruct)
<details>
<summary>selected_samples/visualwebinstruct.png Details</summary>

### Visual Description
## Meeting Notes Template
### Overview
The document outlines the structure and content of a meeting notes template for a conflict resolution meeting.
### Components/Axes
- **Header**:
- Meeting Title: Conflict Resolution Meeting
- Date: October 15, 2023
- Time: 3:00 PM - 4:00 PM
- Location: Main Conference Room
- Attendees: Alice Johnson (CEO), Bob Thompson (CTO), Claire Lee (COO), David Smith (Employee Representative)
- **Body**:
- Agenda: Review of conflict, Statements from attendees, Discussion of possible resolutions, Action items and next steps
- Minutes:
1. Introduction by Alice Johnson
2. Statements from Alice and David
3. Discussion on solutions
4. Actions assigned
- **Footer**:
- Prepared by: Claire Lee
- Date: October 15, 2023
## Official Communication Template
### Overview
The document outlines the structure and content of an official communication template for updating employees on conflict resolution efforts.
### Components/Axes
- **Header**:
- From: Alice Johnson, CEO
- To: All Employees
- Date: October 16, 2023
- Subject: Update on Recent Conflict Resolution Efforts
- **Body**:
- Introduction
- Summary of Actions
- Commitment to Resolution
- Contact for Concerns
- **Footer**:
- Closing and Signature
## Best Practices for Documentation
1. **Timeliness**: Document conflicts promptly
2. **Objectivity**: Record factual information without bias
3. **Consistency**: Use standardized templates
4. **Confidentiality**: Ensure secure storage and limited access
5. **Inclusion of All Perspectives**: Document all viewpoints
6. **Follow-Up**: Schedule follow-up meetings and communications
</details>
**Question:** What is the meeting title? **Answer:** Conflict Resolution Meeting
Figure 2: Answer questions given web documents.
Charts & Tables
Dataset: TAT-QA (tatqa)
<details>
<summary>selected_samples/tatqa.png Details</summary>

### Visual Description
## Table: Fiscal Years Ended July 31, 2019, 2018, and 2017
### Overview
The table presents financial data for unrecognized tax benefits over three fiscal years. It includes the beginning of the period, gross increases and decreases in prior period tax positions, gross increases and decreases in current period tax positions, and the end of the period.
### Components/Axes
- **Year**: Fiscal years ended July 31, 2019, 2018, and 2017.
- **Unrecognized tax benefit - beginning of period**: $10,321, $9,346, $7,687.
- **Gross increases - prior period tax positions**: 98, 729, 712.
- **Gross decreases - prior period tax positions**: (88), (878), (691).
- **Gross increases - current period tax positions**: 1,302, 1,124, 1,638.
- **Unrecognized tax benefit - end of period**: $11,633, $10,321, $9,346.
### Detailed Analysis or ### Content Details
- **2019**: The beginning of the period was $10,321, with gross increases of 98 and gross decreases of (88). The end of the period was $11,633, indicating an increase of $1,302.
- **2018**: The beginning of the period was $9,346, with gross increases of 729 and gross decreases of (878). The end of the period was $10,321, indicating an increase of 1,124.
- **2017**: The beginning of the period was $7,687, with gross increases of 712 and gross decreases of (691). The end of the period was $9,346, indicating an increase of 1,638.
### Key Observations
- There is a consistent increase in the unrecognized tax benefit over the three fiscal years.
- The gross increases in current period tax positions are consistently higher than the gross decreases in both prior and current periods.
- The end of the period is consistently higher than the beginning, indicating a net increase in unrecognized tax benefits.
### Interpretation
The data suggests that the company has been recognizing tax benefits over the three fiscal years, with an increasing trend. The gross increases in current period tax positions are higher than the gross decreases, indicating that the company is more likely to recognize tax benefits in the current period than in previous periods. The end of the period is consistently higher than the beginning, indicating a net increase in unrecognized tax benefits. This could be due to various factors such as changes in tax laws, changes in the company's tax position, or changes in the company's financial position.
</details>
**Question**: Unrecognized Tax Benefits Activity related to unrecognized tax benefits is as follows (in thousands): … As of July 31, 2019, the Company has no income tax audits in progress in the U.S. or foreign jurisdictions. What was the increase in unrecognized tax benefits in 2019? **Answer**: $1.3 million.
Figure 3: Financial table requiring numerical reasoning over text.
Document Understanding & Infographics
Dataset: DocVQA docvqa
<details>
<summary>selected_samples/docvqa.png Details</summary>

### Visual Description
## Response Code Request Form
### Overview
This form is a Request for Response Code related to a project named "2-Free Pack Request - Viewership Interest Confirmation #2". The project involves mailing mailers without an invitation to sneak previews to Charlotte and Indianapolis to confirm interest and intent levels. The mailer has a BRC offer for a free trial which is the response code. If returned, the smoker gets two free packs.
### Components/Axes
- **Project Name**: 2-Free Pack Request - Viewership Interest Confirmation #2
- **Description**: Mail Launch Mailers without an invitation to Sneak Previews to Charlotte and Indianapolis to confirm interest and intent levels. Mailer has a BRC offer for free trial which is this response code. If returned, smoker gets two free packs.
- **Distribution**:
- **Quantity**: 750
- **Pot Response**: 20.00%
- **# of Responders**: 150
- **Timing**:
- **DTS**: 07/10/98
- **Expiration Date**: 08/01/98
- **Offer Complete**: 07/11/98
- **Days Offer Open**: 1
- **Data Entry**:
- **Supplier**: M/A/R/C
- **Incoming mail**: BRC
- **Agency**:
- **Agency Name**: Mezzina/Brown
- **Contact Person**: Tami LaManna
- **Phone #**: (212) 251-7700
- **If M/A/R/C is supplier, do you need a M/A/R/C P.O. Box?**: Yes
- **Fulfillment**:
- **Company**: M/A/R/C
- **Job Number**: 1
- **Number of Items on Order Form**: 1
- **P.O. Box**:
- **City**:
- **State**:
- **Zip Code**:
- **Comments**:
- **Marketing Person Responsible for Project**: Mike Duffy - 0910
- **Production Contact**: Joe Leinster - 359
- **Response Code Requested By**: Mike Duffy - 69100-69199
- **Copies of Confirmation Letter To**:
- Suzi Hicks
- Tami LaManna
- Vanessa Oakley
- Teresa Smith
- Jack Bellis
- Nancy Montgomery
### Key Observations
- The project involves mailing mailers without an invitation to sneak previews to Charlotte and Indianapolis to confirm interest and intent levels.
- The mailer has a BRC offer for a free trial which is the response code. If returned, the smoker gets two free packs.
- The distribution is 750 mailers with a 20.00% pot response rate.
- The offer is valid for 1 day.
- The supplier is M/A/R/C.
- The incoming mail is from BRC.
- The agency is Mezzina/Brown with contact person Tami LaManna.
- The fulfillment is handled by M/A/R/C with a job number of 1.
- The response code is requested by Mike Duffy.
- The copies of the confirmation letter are to be sent to Suzi Hicks, Tami LaManna, Vanessa Oakley, Teresa Smith, Jack Bellis, and Nancy Montgomery.
### Interpretation
The data suggests that the project is a marketing campaign to confirm interest and intent levels among potential customers in Charlotte and Indianapolis. The BRC offer for a free trial is a way to incentivize customers to participate in the campaign. The response code is used to track the response rate and the number of responders. The data indicates that the campaign is effective, with a 20.00% pot response rate. The offer is valid for 1 day, and the supplier is M/A/R/C. The incoming mail is from BRC, and the fulfillment is handled by M/A/R/C. The response code is requested by Mike Duffy, and the copies of the confirmation letter are to be sent to the specified individuals.
</details>
**Question**: what is the response code? **Answer**: U19
Figure 4: Document image with a question about textual fields.
OCR QA (text-centric VQA)
Dataset: TextVQA textvqa
<details>
<summary>selected_samples/textvqa.png Details</summary>

### Visual Description
## Image Description
The image depicts a vibrant cityscape at night, with a focus on a bright orange taxi cab in motion. The taxi is prominently positioned in the foreground, with its headlights and taillights clearly visible. The background is a blur of colorful lights and neon signs, suggesting a bustling urban environment. The taxi's license plate reads "3G54," and there is a sign on the side of the taxi that reads "TAXI BAG."
### Components/Axes
- **Labels**: The image does not contain any explicit labels.
- **Axes**: There are no axes present in the image.
- **Legends**: There are no legends visible in the image.
- **Axis Markers**: There are no axis markers present in the image.
### Detailed Analysis
The image is a dynamic snapshot of city life, capturing the movement and energy of a night in New York City. The taxi cab, an iconic symbol of the city, is in sharp focus, while the background is a blur of lights, indicating the high speed at which the taxi is moving. The colors are vivid, with the orange of the taxi contrasting against the darker tones of the night sky and the colorful lights in the background.
### Key Observations
- **Notable Patterns**: The image captures the essence of city life, with the taxi cab representing the movement and energy of the city.
- **Outliers**: There are no outliers in the image, as it is a single, dynamic snapshot of city life.
- **Trends**: The image does not provide any information about trends, as it is a single, static snapshot.
### Interpretation
The image suggests a sense of movement and energy, capturing the dynamic nature of city life. The taxi cab, a symbol of the city, is in motion, while the background is a blur of lights, indicating the high speed at which the taxi is moving. The colors are vivid, with the orange of the taxi contrasting against the darker tones of the night sky and the colorful lights in the background. The image does not provide any information about trends, as it is a single, static snapshot of city life.
</details>
**Question**: what number is the cab? **Answer**: 3G54
Figure 5: Photo where answering requires OCR of scene text.
General VQA
Dataset: VQAv2 vqav2
<details>
<summary>selected_samples/vqav2.png Details</summary>

### Visual Description
## Image Description
### Overview
The image depicts a skateboarder performing a trick in an outdoor skatepark. The skateboarder is airborne, executing a jump over a black bench. The background shows a crowd of spectators watching the performance. The sky is partly cloudy, suggesting a fair weather day.
### Components/Axes
- **Skateboarder**: The central figure, wearing a black t-shirt with colorful graphics, a green cap, and dark pants. The skateboarder is in mid-air, with one leg extended forward and the other back, indicating a dynamic movement.
- **Skateboard**: The skateboard is visible beneath the skateboarder, with its underside facing the ground.
- **Bench**: The black bench is positioned in the foreground, serving as the obstacle for the skateboarder's trick.
- **Crowd**: A group of spectators is visible in the background, standing behind a barrier. They appear to be focused on the skateboarder's performance.
- **Sky**: The sky is partly cloudy, with blue patches visible, indicating fair weather.
### Detailed Analysis
- **Skateboarder's Pose**: The skateboarder's pose suggests a high level of skill and control. The extended leg and forward-leaning posture indicate a powerful jump.
- **Crowd's Reaction**: The spectators' attention is clearly on the skateboarder, suggesting that the trick is impressive and well-executed.
- **Environmental Context**: The skatepark setting is typical for skateboarding, with obstacles and a designated area for tricks.
### Key Observations
- **Skateboarder's Skill**: The skateboarder's ability to perform a trick over a bench demonstrates advanced skills in skateboarding.
- **Crowd's Engagement**: The crowd's presence and focus on the skateboarder indicate a high level of interest and enjoyment in the sport.
- **Weather Conditions**: The partly cloudy sky suggests that the weather is suitable for outdoor activities, which is ideal for skateboarding.
### Interpretation
The image captures a moment of skill and performance in skateboarding. The skateboarder's trick over the bench is a testament to their expertise and the popularity of skateboarding as a sport. The crowd's engagement suggests that skateboarding is a social activity that brings people together. The partly cloudy sky indicates that the weather is favorable for outdoor activities, which is essential for skateboarding. Overall, the image provides a glimpse into the world of skateboarding, showcasing the sport's appeal and the skill of its practitioners.
</details>
**Question**: Where is he looking? **Answer**: down
Figure 6: General visual question answering on natural images.
Grounding, Spatial & Counting
Dataset: TallyQA tallyqa
<details>
<summary>selected_samples/tallyqa.png Details</summary>

### Visual Description
## Image Description
The image depicts a scene of two individuals riding a motorcycle. The motorcycle is black and appears to be a Harley-Davidson, given the distinctive design and branding visible on the front. The rider, who is seated on the right side of the motorcycle, is wearing a black leather jacket, black pants, and a black helmet. The rider is also wearing sunglasses and has a serious expression on their face. The rider is holding the handlebars with both hands and appears to be looking straight ahead.
Seated on the left side of the motorcycle is a child, wearing a black jacket with the word "HARLEY" printed on it, blue jeans, and brown shoes. The child is wearing a pink helmet with a visor and is waving to the camera. The child's posture is relaxed, and they are looking towards the camera with a smile.
In the background, there are other motorcycles and people, suggesting that this is a motorcycle event or gathering. The sky is overcast, and the environment appears to be an outdoor parking lot or a similar area.
### Components/Axes
- **Labels**: None visible in the image.
- **Axes**: None visible in the image.
- **Legends**: None visible in the image.
- **Axis Markers**: None visible in the image.
### Detailed Analysis
The image does not contain any numerical data or specific values. The focus is on the two individuals and the motorcycle. The child's presence on the motorcycle adds a playful element to the scene.
### Key Observations
- The child is wearing a helmet, which is a safety measure for motorcyclists.
- The child's attire is casual, with a jacket and jeans, which is appropriate for a motorcycle event.
- The child's waving gesture suggests that they are enjoying the event and are interacting with the camera.
### Interpretation
The image captures a moment of joy and camaraderie between the two individuals. The child's presence on the motorcycle adds a sense of adventure and freedom. The overcast sky and outdoor setting suggest that the event is taking place in a casual, relaxed environment. The image does not provide any information about the specific event or location, but it does convey a sense of community and shared interest in motorcycles.
</details>
**Question**: How many more people can ride on the vehicle? **Answer**: 0
Figure 7: Scene requiring counting and spatial reasoning.
Math & Geometry (vision)
Dataset: CLEVR clevr
<details>
<summary>selected_samples/clevr.png Details</summary>

### Visual Description
## Image Description
The image depicts a collection of geometric shapes arranged on a flat surface. The shapes include cylinders and spheres, each with a different color and texture. The colors of the shapes are blue, green, yellow, and purple. The textures vary from matte to shiny, with the blue and green shapes having a matte finish, while the yellow and purple shapes have a shiny finish.
The cylinders are of different sizes and heights. The blue cylinder is the tallest and has a matte finish. The green cylinder is shorter and has a shiny finish. The yellow cylinder is the shortest and has a matte finish. The purple cylinder is the smallest and has a shiny finish.
The spheres are also of different sizes and colors. The blue sphere is the largest and has a matte finish. The green sphere is the smallest and has a shiny finish. The yellow sphere is the medium-sized and has a matte finish. The purple sphere is the smallest and has a shiny finish.
The shapes are arranged in a scattered manner, with no particular pattern or order. The background is a neutral gray color, providing a contrast to the colorful shapes.
### Interpretation
The image does not contain any text or data that can be interpreted. The shapes are simply geometric objects with different colors and textures. The image does not provide any information about the shapes' properties or relationships. Therefore, no interpretation can be made from the image.
</details>
**Question**: There is a large shiny object; does it have the same shape as the object right of the large metallic thing? Provide a short and direct response. **Answer**: Yes.
Figure 8: Synthetic shapes testing compositional spatial reasoning.
Screens, Web & GUI
Dataset: ScreenQA screenqa
<details>
<summary>selected_samples/screenqa.png Details</summary>

### Visual Description
## Initial Setup
### Overview
The image displays a user interface for setting up a period tracking application. The interface is divided into three main sections: Default period length, Default cycle length, and Start date of your last period.
### Components/Axes
- **Default period length**: The label is followed by a description stating that the period length usually lasts between 4 and 7 days.
- **Default cycle length**: The label is followed by a description stating that the cycle length usually lasts between 23 and 35 days.
- **Start date of your last period**: The label is followed by a date picker with arrows to adjust the date.
### Detailed Analysis
- **Default period length**: The value is set to 5 days, indicated by the number 5 in the middle of the section.
- **Default cycle length**: The value is set to 30 days, indicated by the number 30 in the middle of the section.
- **Start date of your last period**: The date picker shows the date as February 10, 2017.
### Key Observations
- The default period length is set to 5 days, which is within the specified range.
- The default cycle length is set to 30 days, which is also within the specified range.
- The start date of the last period is set to February 10, 2017.
### Interpretation
The image shows a user interface for setting up a period tracking application. The user has set the default period length to 5 days and the default cycle length to 30 days. The start date of the last period is set to February 10, 2017. The interface is designed to be user-friendly, with clear labels and easy-to-use date picker. The user can adjust the values as needed to fit their specific needs. The interpretation of the data suggests that the user is in the process of setting up the application and has provided the necessary information to start tracking their period.
</details>
**Question**: What is the default period length? **Answer**: 5 days
Figure 9: User interface screenshot with structured textual elements.
Medical (vision)
Dataset: PathVQA pathvqa
<details>
<summary>selected_samples/pathvqa.png Details</summary>

### Visual Description
## Microscopy of Liver Tissue
### Overview
The image displays a microscopic view of liver tissue. The liver is composed of various cell types, including hepatocytes, which are the primary functional cells of the liver. The image shows a section of the liver with a focus on the hepatocytes.
### Components/Axes
- **Labels**: The image does not contain any visible labels.
- **Axes**: There are no axes present in the image.
- **Legends**: There are no legends visible in the image.
- **Axis Markers**: There are no axis markers visible in the image.
### Detailed Analysis
- **Hepatocytes**: The image shows numerous hepatocytes, which are oval-shaped cells with a distinct nucleus. The hepatocytes are scattered throughout the liver tissue.
- **Blue Staining**: The hepatocytes are stained with a blue dye, which is likely a marker for the liver's specific proteins or enzymes.
- **Brown Staining**: There are also brown areas in the image, which could be indicative of other cell types or structures within the liver tissue.
- **Arrows**: Two arrows are pointing to specific areas of the hepatocytes, possibly indicating regions of interest or abnormalities.
### Key Observations
- **Hepatocyte Distribution**: The hepatocytes are distributed throughout the liver tissue, with no apparent clustering or abnormality.
- **Staining Patterns**: The blue and brown staining patterns suggest the presence of different cell types or structures within the liver tissue.
- **No Visible Abnormalities**: The image does not show any visible abnormalities or signs of disease.
### Interpretation
The image provides a clear view of the liver tissue, showing the hepatocytes and their staining patterns. The blue and brown staining could indicate the presence of specific proteins or enzymes within the hepatocytes. The arrows pointing to specific areas of the hepatocytes suggest that these regions may be of particular interest or may require further investigation. The overall appearance of the liver tissue appears normal, with no visible abnormalities or signs of disease.
</details>
**Question**: where are liver stem cells (oval cells) located? **Answer**: in the canals of hering
Figure 10: Microscopic pathology image for medical VQA.
Text-only (instruction / code / math / translation)
Dataset: aya_dataset aya
**Question**: Quel président des États-Unis ne s'est jamais marié? (Which president of the United States never married?) **Answer**: James Buchanan est le seul président qui ne s'est jamais marié. (James Buchanan is the only president who never married.)
Figure 11: Text-only tasks covering multiple languages.
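The short ground-truth answers in the samples above are typically compared to model outputs via normalized exact match. A minimal sketch of such a scorer, assuming common VQA normalization conventions (lowercasing, punctuation stripping, article removal) rather than the exact protocol of any one benchmark:

```python
import string

# Common VQA-style answer normalization: lowercase, strip punctuation,
# drop English articles, collapse whitespace. (Illustrative convention,
# not necessarily the scorer used for the results in this paper.)
_ARTICLES = {"a", "an", "the"}

def normalize_answer(text: str) -> str:
    text = text.lower().strip()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in _ARTICLES]
    return " ".join(tokens)

def exact_match(prediction: str, reference: str) -> bool:
    return normalize_answer(prediction) == normalize_answer(reference)
```

Under these rules, a prediction like "The conflict resolution meeting." would count as matching the reference "Conflict Resolution Meeting".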