## Bar Chart: Multimodal Model Performance Comparison
### Overview
This image presents a bar chart comparing the performance of four large multimodal models, Kimi K2.5, GPT-5.2 (xhigh), Claude Opus 4.5, and Gemini 3 Pro, across ten benchmark tasks. The performance metric is a score expressed as a percentage (%). Each model's performance on each task is represented by a bar in that model's color, divided into four segments, each marked with a distinct icon, presumably representing different evaluation criteria or sub-scores.
### Components/Axes
* **X-axis:** Represents the ten benchmark tasks:
1. Agents Humanity's Last Exam (Full)
2. Agents BrowseComp
3. Agents DeepSearchQA
4. Coding SWE-bench Verified
5. Coding SWE-bench Multilingual
6. Image MMMU Pro
7. Image MathVision
8. Image OmniDocBench 1.5*
9. Video VideoMMU
10. Video LongVideoBench
* **Y-axis:** Implied scale representing the score as a percentage (%). The axis is not explicitly labeled, but the values printed on the bars range from approximately 43% to 89%.
* **Models:** Four models are compared: Kimi K2.5, GPT-5.2 (xhigh), Claude Opus 4.5, and Gemini 3 Pro. Each model has a distinct color:
* Kimi K2.5: Blue
* GPT-5.2 (xhigh): Orange
* Claude Opus 4.5: Green
* Gemini 3 Pro: Purple
* **Legend/Icons:** Each bar is divided into four segments, each with a unique icon:
* "K" icon (likely representing a key metric)
* Star icon
* Diamond icon
* Triangle icon
* **Footer Text:** "* OmniDocBench Score is computed as (1 - normalized Levenshtein distance) x 100, where a higher score denotes superior accuracy."
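The footer formula is straightforward to reproduce in code. Below is a minimal Python sketch of how such a score might be computed for a pair of strings; the `levenshtein` helper, the name `omnidoc_style_score`, and the sample strings are illustrative assumptions, not taken from the benchmark's actual implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def omnidoc_style_score(prediction: str, reference: str) -> float:
    """Score = (1 - normalized Levenshtein distance) * 100; higher means closer to the reference."""
    if not prediction and not reference:
        return 100.0
    distance = levenshtein(prediction, reference)
    normalized = distance / max(len(prediction), len(reference))
    return (1.0 - normalized) * 100.0


# A perfect transcription scores 100; a single-character error out of twenty characters scores 95.
print(omnidoc_style_score("Total revenue: $4.2M", "Total revenue: $4.2M"))  # 100.0
print(omnidoc_style_score("Total revenue: $4.2M", "Total revenue: $42M"))   # 95.0
```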
### Detailed Analysis
Here's a breakdown of the performance for each model on each task, with approximate values based on visual estimation:
**Kimi K2.5 (Blue)**
1. Agents Humanity's Last Exam (Full): 50.2%, 45.5%, 43.2%, 45.8%
2. Agents BrowseComp: 74.9%, 65.8%, 57.8%, 59.2%
3. Agents DeepSearchQA: 77.1%, 71.3%, 61.3%, 76.1%
4. Coding SWE-bench Verified: 76.8%, 80.0%, 80.9%, 76.2%
5. Coding SWE-bench Multilingual: 73.0%, 72.0%, 77.5%, 65.0%
6. Image MMMU Pro: 78.5%, 79.5%, 74.0%, 81.0%
7. Image MathVision: 84.2%, 83.0%, 77.1%, 86.1%
8. Image OmniDocBench 1.5*: 88.8%, 87.7%, 88.5%, 86.1%
9. Video VideoMMU: 86.6%, 85.9%, 84.4%, 87.6%
10. Video LongVideoBench: 79.8%, 78.5%, 67.2%, 77.7%
**GPT-5.2 (xhigh) (Orange), Claude Opus 4.5 (Green), and Gemini 3 Pro (Purple)**
(Per-task values for these models are omitted for brevity; they follow the same format as above, and the trend for each task can be assessed visually from the image. A short summary sketch of the listed estimates follows below.)
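To make the observations below easier to check, the visually estimated values can be tabulated and summarized programmatically. The following Python sketch assumes the four numbers listed per task above are the four bar heights read off the chart; both the numbers and their ordering are rough visual estimates rather than published figures, and the category prefixes (Agents/Coding/Image/Video) are dropped from the keys for brevity.

```python
# Rough visual estimates transcribed from the chart: task -> four bar values (%).
estimates = {
    "Humanity's Last Exam (Full)": [50.2, 45.5, 43.2, 45.8],
    "BrowseComp":                  [74.9, 65.8, 57.8, 59.2],
    "DeepSearchQA":                [77.1, 71.3, 61.3, 76.1],
    "SWE-bench Verified":          [76.8, 80.0, 80.9, 76.2],
    "SWE-bench Multilingual":      [73.0, 72.0, 77.5, 65.0],
    "MMMU Pro":                    [78.5, 79.5, 74.0, 81.0],
    "MathVision":                  [84.2, 83.0, 77.1, 86.1],
    "OmniDocBench 1.5":            [88.8, 87.7, 88.5, 86.1],
    "VideoMMU":                    [86.6, 85.9, 84.4, 87.6],
    "LongVideoBench":              [79.8, 78.5, 67.2, 77.7],
}

# Spread per task: lowest and highest bar, and the gap between them.
for task, values in estimates.items():
    lo, hi = min(values), max(values)
    print(f"{task:30s} min={lo:5.1f}  max={hi:5.1f}  spread={hi - lo:4.1f}")

# Hardest and easiest tasks, judged by the best score any bar reaches on them.
hardest = min(estimates, key=lambda task: max(estimates[task]))
easiest = max(estimates, key=lambda task: max(estimates[task]))
print("Lowest top score: ", hardest)   # Humanity's Last Exam (Full)
print("Highest top score:", easiest)   # OmniDocBench 1.5
```

Under these estimates, the weakest top score falls on Humanity's Last Exam (Full) and the strongest on OmniDocBench 1.5*, which matches the first two observations below.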
### Key Observations
* **OmniDocBench 1.5* consistently shows high scores** across all models, with every listed value above 86% and a top value of approximately 88.8%.
* **Agents Humanity's Last Exam (Full) consistently shows the lowest scores** across all models, with Kimi K2.5 achieving the highest score (approximately 50.2%).
* **Gemini 3 Pro generally performs well**, often achieving the highest or near-highest scores across multiple tasks.
* **Kimi K2.5 shows relatively consistent performance** across tasks, with no exceptionally high or low scores.
* The internal segments within each bar (represented by the icons) show varying contributions to the overall score, suggesting different strengths and weaknesses within each model.
### Interpretation
This chart provides a comparative performance analysis of four leading multimodal models across a diverse set of benchmarks. The benchmarks cover agent capabilities, coding, image understanding, and video processing. Reporting every result as a percentage score allows a rough side-by-side comparison, although each benchmark has its own scale and difficulty, so scores are not directly comparable across tasks.
The consistent high performance on OmniDocBench 1.5* suggests that all models are proficient in document understanding tasks, while the lower scores on Agents Humanity's Last Exam (Full) indicate a challenge in complex reasoning or human-level task completion.
The varying contributions of the internal segments (icons) within each bar suggest that the models excel in different aspects of each task. For example, a model might have a high score for the "K" icon but a lower score for the star icon, indicating strength in a specific evaluation criterion.
The chart highlights Gemini 3 Pro as a generally strong performer, but also reveals that the optimal model choice depends on the specific task. The footer note regarding OmniDocBench clarifies that the score is based on Levenshtein distance, indicating a focus on accuracy in text reproduction or matching. The chart is a valuable tool for researchers and practitioners seeking to understand the capabilities and limitations of these multimodal models.