## Multi-Panel Line Chart: AI Model Translation Performance (chrF Score vs. Number of Shots)
### Overview
The image displays a set of six line charts arranged in a 2x3 grid. Each chart compares the translation performance of three large language models (Gemini 1.5 Pro, Gemini 1.5 Flash, and GPT-4 Turbo) from English to a specific target language. Performance is measured by the chrF score (a character-level F-score for machine translation evaluation) and plotted against the number of "shots" (in-context learning examples) on a base-2 logarithmic x-axis. Scores generally improve with more shots, although GPT-4 Turbo stays roughly flat for Kurdish and Abkhaz, and both the rate of improvement and the absolute scores vary considerably by language and model.
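
To make the metric concrete, here is a minimal sketch of a sentence-level chrF computation. It assumes character n-grams up to order 6 and β = 2 (recall weighted more heavily than precision), which are common defaults. The figure does not state which chrF configuration or tool was used, so the function name `chrf`, its parameters, and the whitespace handling are illustrative assumptions rather than the evaluation behind the charts.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of order n, ignoring spaces (a simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis: str, reference: str, max_order: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF on a 0-100 scale (assumed settings)."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta score over averaged n-gram precision and recall.
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


if __name__ == "__main__":
    print(chrf("hello world", "hello word"))
```

Production implementations differ in details such as whitespace treatment and corpus-level aggregation, so scores from this sketch are not directly comparable to the values read off the charts.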
### Components/Axes
* **Legend:** Positioned at the top center of the entire figure. It defines the three data series:
    * **Gemini 1.5 Pro:** Represented by a solid, light green line with circular markers.
    * **Gemini 1.5 Flash:** Represented by a solid, light blue line with circular markers.
    * **GPT-4 Turbo:** Represented by a solid, light gray line with circular markers.
* **X-Axis (Common to all subplots):** Labeled "Number of Shots (K)". The scale is logarithmic base 2, with tick marks at 2⁰, 2¹, 2², 2³, 2⁴, 2⁵, 2⁶, 2⁷, 2⁸, 2⁹, 2¹⁰, 2¹¹, and 2¹² (representing 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, and 4096 shots, respectively).
* **Y-Axis (Varies by subplot):** Labeled "Test chrF (Flores)" for the top row and "Test chrF (In-house)" for the bottom row. The numerical scale and range differ for each language chart.
* **Subplot Titles:** Each of the six charts has a title indicating the translation task: "Translation: English → [Target Language]". The languages are, from top-left to bottom-right: Bemba, Kurdish, Ewe, Acholi, Abkhaz, and Navajo.
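
The x-axis arithmetic can be sketched in a few lines. The snippet below regenerates the 13 tick labels and their shot counts, and shows why the ticks are evenly spaced on a base-2 logarithmic axis. The names `tick_label`, `shot_counts`, and `positions` are illustrative and not taken from the figure.

```python
import math

# Exponents 0..12 match the 13 x-axis tick marks described for the figure.
EXPONENTS = range(13)
SUPERSCRIPTS = str.maketrans("0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹")


def tick_label(exponent: int) -> str:
    """Format an exponent k as the tick label 2^k using Unicode superscripts."""
    return "2" + str(exponent).translate(SUPERSCRIPTS)


shot_counts = [2**k for k in EXPONENTS]
# On a log-2 axis a tick sits at log2(K), so consecutive ticks are one unit apart.
positions = [math.log2(k) for k in shot_counts]

if __name__ == "__main__":
    for k, shots, pos in zip(EXPONENTS, shot_counts, positions):
        print(f"{tick_label(k):>4} = {shots:>4} shots at axis position {pos:.0f}")
```

Each step to the right on the axis therefore doubles the number of in-context examples, which is worth keeping in mind when reading the slopes in the subplots.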
### Detailed Analysis (Per Subplot)
**1. Top-Left: English → Bemba**
* **Y-Axis Range:** ~25.0 to 50.0.
* **Trends & Key Points:**
    * **Gemini 1.5 Pro (Green):** Starts highest at ~40.5 (2⁰ shots). Shows a steady, strong upward trend, reaching the highest overall score on the chart of ~50.0 at 2¹² shots.
    * **Gemini 1.5 Flash (Blue):** Starts lowest at ~24.0 (2⁰ shots). Exhibits the steepest improvement curve, surpassing GPT-4 Turbo around 2⁶ shots and ending at ~44.0 at 2¹² shots.
    * **GPT-4 Turbo (Gray):** Starts at ~29.5 (2⁰ shots). Shows a moderate, steady increase, ending at ~37.0 at 2¹² shots. It is overtaken by Gemini 1.5 Flash mid-chart.
**2. Top-Center: English → Kurdish**
* **Y-Axis Range:** ~35.0 to 47.0.
* **Trends & Key Points:**
    * **Gemini 1.5 Pro (Green):** Consistently the top performer. Starts at ~45.0 (2⁰ shots) and shows a gentle upward slope, ending at ~46.5 at 2¹² shots.
    * **Gemini 1.5 Flash (Blue):** Starts at ~42.5 (2⁰ shots). Follows a similar gentle upward trend parallel to Gemini 1.5 Pro, ending at ~45.0 at 2¹² shots.
    * **GPT-4 Turbo (Gray):** Starts at ~35.5 (2⁰ shots). Performance is relatively flat with minor fluctuations, ending at ~35.5 at 2¹² shots. It shows minimal improvement with more shots.
**3. Top-Right: English → Ewe**
* **Y-Axis Range:** ~25.0 to 43.0.
* **Trends & Key Points:**
    * **Gemini 1.5 Pro (Green):** Dominant performer. Starts at ~42.0 (2⁰ shots) and remains nearly flat, with a very slight upward trend to ~43.0 at 2¹² shots.
    * **Gemini 1.5 Flash (Blue):** Starts at ~29.5 (2⁰ shots). Shows a strong, consistent upward trend, ending at ~37.5 at 2¹² shots.
    * **GPT-4 Turbo (Gray):** Starts lowest at ~24.0 (2⁰ shots). Shows a steady but slower upward trend compared to Gemini 1.5 Flash, ending at ~27.5 at 2¹² shots.
**4. Bottom-Left: English → Acholi**
* **Y-Axis Range:** ~15.0 to 35.0.
* **Trends & Key Points:**
    * **Gemini 1.5 Pro (Green):** Top performer. Starts at ~33.0 (2⁰ shots) and shows a very gradual increase, ending at ~35.0 at 2¹² shots.
    * **Gemini 1.5 Flash (Blue):** Exhibits a notable dip. Starts at ~17.0 (2⁰ shots), drops to a low of ~13.0 at 2⁵ shots, then recovers sharply, ending at ~29.0 at 2¹² shots.
    * **GPT-4 Turbo (Gray):** Starts at ~21.5 (2⁰ shots). Shows a very slow, steady increase, ending at ~23.0 at 2¹² shots.
**5. Bottom-Center: English → Abkhaz**
* **Y-Axis Range:** ~5.0 to 35.0.
* **Trends & Key Points:**
    * **Gemini 1.5 Pro (Green):** Starts at ~25.5 (2⁰ shots). Performance is flat until 2⁸ shots, then increases sharply, ending at ~34.0 at 2¹² shots.
    * **Gemini 1.5 Flash (Blue):** Starts very low at ~7.0 (2⁰ shots). Remains flat until 2⁶ shots, then begins a dramatic, steep climb, ending at ~28.0 at 2¹² shots.
    * **GPT-4 Turbo (Gray):** Starts at ~29.0 (2⁰ shots), making it the initial leader. Performance is flat with a slight dip, ending at ~28.0 at 2¹² shots. It is overtaken by Gemini 1.5 Pro at higher shot counts and roughly matched by Gemini 1.5 Flash at 2¹² shots.
**6. Bottom-Right: English → Navajo**
* **Y-Axis Range:** ~10.0 to 35.0.
* **Trends & Key Points:**
    * **Gemini 1.5 Pro (Green):** Starts at ~24.5 (2⁰ shots). Shows a gradual increase until 2⁸ shots, then a sharper rise, ending at ~34.0 at 2¹² shots.
    * **Gemini 1.5 Flash (Blue):** Starts at ~11.0 (2⁰ shots). Dips slightly to ~10.0 at 2⁵ shots, then begins a strong, accelerating upward trend, ending at ~28.0 at 2¹² shots.
    * **GPT-4 Turbo (Gray):** Starts at ~18.0 (2⁰ shots). Shows a slow, steady increase, ending at ~23.0 at 2¹² shots.
### Key Observations
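1. **Model Hierarchy:** Gemini 1.5 Pro is the top performer at 2¹² shots in all six translation tasks and starts highest in five of them. The exception is Abkhaz, where GPT-4 Turbo leads at low shot counts (~29.0 vs. ~25.5 at 2⁰ shots) before Gemini 1.5 Pro overtakes it.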
2. **Scaling Behavior:** Gemini 1.5 Flash shows the most dramatic scaling with increased shots, often starting lowest but exhibiting the steepest improvement curves (especially for Bemba, Acholi, Abkhaz, Navajo). Its performance is highly sensitive to the number of shots.
3. **GPT-4 Turbo's Plateau:** GPT-4 Turbo often shows the least improvement with more shots (e.g., Kurdish, Acholi, Abkhaz). Its performance lines are notably flatter, suggesting it may benefit less from additional in-context examples for these specific low-resource language tasks within the tested range.
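4. **Language Difficulty:** The absolute chrF scores vary widely by language. For example, scores for Kurdish are comparatively high (roughly 35-47), while Abkhaz and Navajo span much lower ranges (roughly 5-35 and 10-35, respectively), indicating these are more challenging translation tasks for the models. Because the top row is scored on Flores and the bottom row on an in-house test set, comparisons across rows are only indicative.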
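5. **Anomaly - Acholi Dip:** Gemini 1.5 Flash shows a distinct performance dip at 2⁵ (32) shots for English→Acholi (from ~17.0 down to ~13.0) before recovering. The only comparable pattern is a much smaller dip of about one point for Gemini 1.5 Flash at the same shot count for English→Navajo.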
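
The observations above can be cross-checked against the approximate readings quoted in the per-subplot analysis. The sketch below tabulates the start (2⁰ shots) and end (2¹² shots) values and derives the gain per model. The numbers are the visual estimates from this description, not exact data, and the names `READINGS`, `gains`, and `leader` are illustrative.

```python
# Approximate (1-shot, 4096-shot) chrF readings from the per-subplot analysis.
READINGS = {
    "Bemba":   {"Gemini 1.5 Pro": (40.5, 50.0), "Gemini 1.5 Flash": (24.0, 44.0), "GPT-4 Turbo": (29.5, 37.0)},
    "Kurdish": {"Gemini 1.5 Pro": (45.0, 46.5), "Gemini 1.5 Flash": (42.5, 45.0), "GPT-4 Turbo": (35.5, 35.5)},
    "Ewe":     {"Gemini 1.5 Pro": (42.0, 43.0), "Gemini 1.5 Flash": (29.5, 37.5), "GPT-4 Turbo": (24.0, 27.5)},
    "Acholi":  {"Gemini 1.5 Pro": (33.0, 35.0), "Gemini 1.5 Flash": (17.0, 29.0), "GPT-4 Turbo": (21.5, 23.0)},
    "Abkhaz":  {"Gemini 1.5 Pro": (25.5, 34.0), "Gemini 1.5 Flash": (7.0, 28.0),  "GPT-4 Turbo": (29.0, 28.0)},
    "Navajo":  {"Gemini 1.5 Pro": (24.5, 34.0), "Gemini 1.5 Flash": (11.0, 28.0), "GPT-4 Turbo": (18.0, 23.0)},
}


def gains(language: str) -> dict[str, float]:
    """chrF change from 1 shot to 4096 shots for each model."""
    return {model: end - start for model, (start, end) in READINGS[language].items()}


def leader(language: str, index: int) -> str:
    """Best model at the start (index 0) or the end (index 1) of the curve."""
    scores = READINGS[language]
    return max(scores, key=lambda model: scores[model][index])


if __name__ == "__main__":
    for language in READINGS:
        g = gains(language)
        print(
            f"{language:8s} largest gain: {max(g, key=g.get):17s} "
            f"smallest gain: {min(g, key=g.get):17s} "
            f"leader at 1 shot: {leader(language, 0):15s} "
            f"leader at 4096 shots: {leader(language, 1)}"
        )
```

With these readings, Gemini 1.5 Flash has the largest gain in all six languages, and GPT-4 Turbo has the smallest in five of them; the exception is Ewe, where Gemini 1.5 Pro gains only about one point from an already high start.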
### Interpretation
This data suggests a significant difference in how these advanced AI models utilize in-context learning for machine translation, particularly for lower-resource languages.
* **Gemini 1.5 Pro** appears to have a robust underlying translation capability that requires less "priming" with examples, as evidenced by its high starting points (highest at 2⁰ shots in five of the six languages). It still benefits from more shots, but its gains are smaller than Gemini 1.5 Flash's in every language.
* **Gemini 1.5 Flash** behaves like a model that is highly reliant on in-context learning. Its poor one-shot and few-shot performance, combined with strong scaling, indicates that it uses the provided examples effectively to adapt and improve, making it potentially more flexible but dependent on a sufficiently large demonstration set.
* **GPT-4 Turbo's** flat scaling for several languages could imply a few possibilities: its pre-training data may have included less relevant material for these languages, its context window utilization for this task is less efficient, or it has reached a performance plateau that more examples cannot easily break through.
The stark contrast in scaling behaviors (steep vs. flat lines) shows that "more shots" is not beneficial to the same degree for every model and language. The best model choice may depend on the available inference budget (more shots mean longer prompts and higher cost) and on the specific target language. The charts make a clear visual case that performance on low-resource language translation changes with the amount of provided context, and that this relationship differs substantially between models.