This image is a technical visualization demonstrating the performance of a model (likely a Vision Language Model) across different visual question-answering (VQA) tasks as the number of tokens is varied. The image is structured into three distinct pairs of examples, each containing a source image, a question, and a performance table.
### **Layout Overview**
The document is organized into three horizontal sections (Left, Middle, Right), each color-coded to represent a performance outcome:
* **Green (Left):** Successful performance across all token counts.
* **Purple (Middle):** Partial success, failing at lower token counts.
* **Red (Right):** Failure across all token counts.
---
### **Section 1: Full Success (Green)**
This section contains two image examples where the model correctly answers the questions regardless of token count.
#### **Example 1.1: Price Tag**
* **Image Content:** A close-up of a shop window or display. A white sign with red text is visible.
* **Transcribed Text on Sign:** "Polos Crazy Bike 9.90€"
* **Question (Q):** "how much is a polos crazy bike?"
* **Performance Table:**
| # Tokens | 577 | 144 | 36 | 9 | 1 |
| :--- | :---: | :---: | :---: | :---: | :---: |
| **Correct?** | ✓ | ✓ | ✓ | ✓ | ✓ |
#### **Example 1.2: Warning Sign**
* **Image Content:** A grassy area in front of a building. A white rectangular sign is mounted on a pole.
* **Transcribed Text on Sign:** "NO TRESPASSING"
* **Question (Q):** "what directive is the sign giving?"
* **Performance Table:**
| # Tokens | 577 | 144 | 36 | 9 | 1 |
| :--- | :---: | :---: | :---: | :---: | :---: |
| **Correct?** | ✓ | ✓ | ✓ | ✓ | ✓ |
---
### **Section 2: Partial Success (Purple)**
This section contains two examples where the model succeeds at high token counts but fails as the resolution/token count decreases.
#### **Example 2.1: Road Signs**
* **Image Content:** A street view with a tall pole holding multiple blue and white highway/route signs.
* **Transcribed Text on Signs:** "TO 15", "TO 201" (The 201 sign is black and white).
* **Question (Q):** "what number is on the black and white sign?"
* **Performance Table:**
| # Tokens | 577 | 144 | 36 | 9 | 1 |
| :--- | :---: | :---: | :---: | :---: | :---: |
| **Correct?** | ✓ | ✓ | ✓ | ✗ | ✗ |
#### **Example 2.2: Liquor Bottles**
* **Image Content:** A row of various liquor and syrup bottles on a bar counter.
* **Transcribed Text on Labels:** "MONIN", "REMY MARTIN", "DEKUYPER APRICOT BRANDY".
* **Question (Q):** "what brand is the apricot brandy?"
* **Performance Table:**
| # Tokens | 577 | 144 | 36 | 9 | 1 |
| :--- | :---: | :---: | :---: | :---: | :---: |
| **Correct?** | ✓ | ✓ | ✓ | ✗ | ✗ |
---
### **Section 3: Failure (Red)**
This section contains two examples where the model fails to answer correctly at any token count, likely due to the small scale of the text.
#### **Example 3.1: Stadium Scoreboard**
* **Image Content:** A night shot of a baseball stadium scoreboard with various advertisements.
* **Transcribed Text (Sponsors):** "Budweiser", "Publix", "Lakeland Regional Health", "Imperial Swan Hotel", "DQ".
* **Question (Q):** "what beer company is a sponsor on the score board?"
* **Performance Table:**
| # Tokens | 577 | 144 | 36 | 9 | 1 |
| :--- | :---: | :---: | :---: | :---: | :---: |
| **Correct?** | ✗ | ✗ | ✗ | ✗ | ✗ |
#### **Example 3.2: Soccer Field Perimeter**
* **Image Content:** A soccer match in progress. The background shows a perimeter fence with advertising banners.
* **Transcribed Text on Banner:** "Andrew Yates Carpenter Joinery Building", "Tel: 07980 570 574 or 01298 773...".
* **Question (Q):** "what is the telephone number of andrew yates?"
* **Performance Table:**
| # Tokens | 577 | 144 | 36 | 9 | 1 |
| :--- | :---: | :---: | :---: | :---: | :---: |
| **Correct?** | ✗ | ✗ | ✗ | ✗ | ✗ |
---
### **Summary of Data Trends**
* **High Token Counts (577, 144, 36):** The model is generally successful at identifying large, clear text (Signs, Price tags) and medium-sized labels (Bottles).
* **Low Token Counts (9, 1):** The model loses the ability to read specific numbers and brand names on smaller objects.
* **Small/Dense Text:** The model consistently fails to extract information from complex, low-resolution areas like stadium scoreboards or distant advertising banners, regardless of the token budget provided in this test.