Image a11bf197ad82...

EXPERT: gemini-2.0-flash VERSION 1

RUNTIME: nugit/gemini/gemini-2.0-flash

## Chart: Planning Accuracy vs. Few-Shot Exemplars for Different Tasks

### Overview
The image presents a series of five line charts comparing the planning accuracy of three different models (Gemini 1.5 Flash, Gemini 1.5 Pro, and GPT-4 Turbo 20240409) across five different tasks: BlocksWorld, Logistics, Mini-Grid, Trip Planning, and Calendar Scheduling. The x-axis represents the number of few-shot exemplars (log scale), and the y-axis represents the planning accuracy in percentage. Error bars representing a 70% confidence interval are included for each data point.

### Components/Axes

*   **Title:** Planning Accuracy vs. Few-Shot Exemplars for Different Tasks
*   **X-axis:** Few-shot exemplars (log scale)
    *   Values: 1, 2, 4, 10, 20, 40, 100, 200, 400, 800
*   **Y-axis:** Planning Accuracy (%)
    *   Values: 0, 10, 20, 30, 40, 50, 60, 70, 80
*   **Legend:** Located within each subplot.
    *   Gemini 1.5 Flash (Green)
    *   Gemini 1.5 Pro (Blue)
    *   GPT-4 Turbo 20240409 (Orange)
*   **Error Bars:** Represent a 70% Confidence Interval (CI).
*   **Subplot Titles:** (a) BlocksWorld, (b) Logistics, (c) Mini-Grid, (d) Trip Planning, (e) Calendar Scheduling
*   **Sentence Pieces (log scale):** Displayed above the x-axis on each subplot. The values vary depending on the subplot.
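
The chart does not say how the 70% confidence interval was computed. The following is a minimal sketch of one common choice, assuming each accuracy is a binomial proportion and using a normal-approximation (Wald) interval; the function name `accuracy_ci` and the example counts are hypothetical and not taken from the figure.

```python
from math import sqrt
from statistics import NormalDist


def accuracy_ci(correct: int, total: int, level: float = 0.70) -> tuple[float, float]:
    """Normal-approximation (Wald) interval for an accuracy, in percent."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    p = correct / total
    # Two-sided critical value; roughly 1.04 for a 70% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sqrt(p * (1 - p) / total)
    # Clamp to [0, 1] because the Wald interval can overshoot near the extremes.
    return 100 * max(0.0, p - half_width), 100 * min(1.0, p + half_width)


# Hypothetical counts: 80 solved problems out of 200.
low, high = accuracy_ci(correct=80, total=200)
print(f"accuracy 40.0%, 70% CI: [{low:.1f}%, {high:.1f}%]")
```

A 70% interval is narrower than the more familiar 95% interval, so non-overlapping error bars here are weaker evidence of a real difference between models.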

### Detailed Analysis

#### (a) BlocksWorld

*   **Sentence pieces in 1000 (log scale):** 0.4k, 0.6k, 1.1k, 2.0k, 2.5k, 4.9k, 9.6k, 19.1k, 23.3k, 47.7k
*   **Gemini 1.5 Flash (Green):** Starts at approximately 30% accuracy, dips to around 25% at 10 exemplars, then rises to approximately 40% at 40 exemplars, and decreases to around 30% at 200 exemplars.
*   **Gemini 1.5 Pro (Blue):** Starts at approximately 10% accuracy, rises sharply to approximately 35% at 10 exemplars, then decreases to approximately 25% at 200 exemplars.
*   **GPT-4 Turbo 20240409 (Orange):** Starts at approximately 40% accuracy, increases slightly to approximately 45% at 10 exemplars, and remains relatively stable around 45% until 200 exemplars.

#### (b) Logistics

*   **Sentence pieces in 1000 (log scale):** 0.9k, 1.3k, 2.3k, 4.2k, 5.1k, 9.7k, 19.1k, 37.7k, 93.7k, 187.1k, 373.8k
*   **Gemini 1.5 Flash (Green):** Starts at approximately 20% accuracy and remains relatively flat around 20% across all exemplar values.
*   **Gemini 1.5 Pro (Blue):** Starts at approximately 35% accuracy, decreases to approximately 25% at 4 exemplars, then decreases further to approximately 15% at 40 exemplars, and rises slightly to approximately 25% at 400 exemplars.
*   **GPT-4 Turbo 20240409 (Orange):** Starts at approximately 50% accuracy, increases to approximately 60% at 4 exemplars, and remains relatively stable around 60% across all exemplar values.

#### (c) Mini-Grid

*   **Sentence pieces in 1000 (log scale):** 2.6k, 3.8k, 6.4k, 13.9k, 27.2k, 53.1k, 105.0k, 130.6k, 259.7k, 518.0k
*   **Gemini 1.5 Flash (Green):** Starts at approximately 20% accuracy, increases to approximately 45% at 100 exemplars, and remains relatively stable around 45% until 400 exemplars.
*   **Gemini 1.5 Pro (Blue):** Starts at approximately 20% accuracy, increases to approximately 40% at 100 exemplars, and remains relatively stable around 40% until 400 exemplars.
*   **GPT-4 Turbo 20240409 (Orange):** Starts at approximately 60% accuracy, increases to approximately 80% at 10 exemplars, and remains relatively stable around 80% across all exemplar values.

#### (d) Trip Planning

*   **Sentence pieces in 1000 (log scale):** 0.7k, 1.5k, 3.8k, 8.5k, 17.3k, 45.4k, 89.5k, 178.3k, 355.4k
*   **Gemini 1.5 Flash (Green):** Starts at approximately 5% accuracy, increases to approximately 30% at 20 exemplars, and remains relatively stable around 30% until 400 exemplars.
*   **Gemini 1.5 Pro (Blue):** Starts at approximately 5% accuracy, increases to approximately 30% at 20 exemplars, and remains relatively stable around 30% until 400 exemplars.
*   **GPT-4 Turbo 20240409 (Orange):** Starts at approximately 10% accuracy, increases to approximately 40% at 40 exemplars, and remains relatively stable around 40% until 400 exemplars.

#### (e) Calendar Scheduling

*   **Sentence pieces in 1000 (log scale):** 0.7k, 1.6k, 3.8k, 7.3k, 13.6k, 34.6k, 70.6k, 144.6k
*   **Gemini 1.5 Flash (Green):** Starts at approximately 10% accuracy, increases to approximately 40% at 20 exemplars, and remains relatively stable around 40% until 400 exemplars.
*   **Gemini 1.5 Pro (Blue):** Starts at approximately 10% accuracy, increases to approximately 40% at 20 exemplars, and remains relatively stable around 40% until 400 exemplars.
*   **GPT-4 Turbo 20240409 (Orange):** Starts at approximately 20% accuracy, increases to approximately 50% at 20 exemplars, and remains relatively stable around 50% until 400 exemplars.

### Key Observations

*   GPT-4 Turbo 20240409 generally outperforms Gemini 1.5 Flash and Gemini 1.5 Pro across all tasks.
*   The performance of Gemini 1.5 Flash and Gemini 1.5 Pro is often similar, especially in tasks like Trip Planning and Calendar Scheduling.
*   The impact of increasing the number of few-shot exemplars varies across tasks. In some tasks (e.g., Mini-Grid), performance plateaus after a certain number of exemplars, while in others (e.g., Logistics), performance may even decrease with more exemplars for Gemini 1.5 Pro.
*   The error bars indicate the variability in performance, with some data points having wider confidence intervals than others.

### Interpretation

The data suggests that GPT-4 Turbo 20240409 is a more robust and accurate model for planning tasks than Gemini 1.5 Flash and Gemini 1.5 Pro. The effectiveness of few-shot learning appears to be task-dependent, with some tasks benefiting more from additional exemplars than others. The decrease in performance for Gemini 1.5 Pro on the Logistics task could indicate that additional examples introduce noise or that the model uses very long prompts less effectively. The error bars highlight the uncertainty in the performance estimates, so further experimentation may be needed to draw more definitive conclusions. The "sentence pieces in 1000 (log scale)" values likely give the prompt length in thousands of tokens (sub-word units), providing context for how quickly the prompt grows with each task's exemplars.

EXPERT: gemma-3-27b-it-free VERSION 1

RUNTIME: google-free/gemma-3-27b-it

## Charts: Planning Accuracy vs. Few-shot Exemplars

### Overview
The image presents five charts comparing the planning accuracy of different language models (Gemini 1.5 Flash, Gemini 1.5 Pro, and GPT-4 Turbo 20240409) across five different environments: BlocksWorld, Logistics, Mini-Grid, Trip Planning, and Calendar Scheduling. The x-axis of each chart represents the number of few-shot exemplars (on a logarithmic scale), and the y-axis represents the planning accuracy (in percentage). Error bars representing a 70% confidence interval are also shown. The top of each chart displays the sentence pieces in 1000 (log scale).

### Components/Axes
*   **X-axis (all charts):** Few-shot exemplars (log scale). Markers are at 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048.
*   **Y-axis (all charts):** Planning Accuracy (%). Scale ranges from approximately 10% to 60%.
*   **Models (all charts):**
    *   Gemini 1.5 Flash (represented by a green line)
    *   Gemini 1.5 Pro (represented by a blue line)
    *   GPT-4 Turbo 20240409 (represented by an orange line)
*   **Error Bars (all charts):** Represent a 70% confidence interval.
*   **Sentence Pieces (all charts):** Displayed at the top of each chart, in 1000 (log scale).
*   **Chart Titles:** (a) BlocksWorld, (b) Logistics, (c) Mini-Grid, (d) Trip Planning, (e) Calendar Scheduling.

### Detailed Analysis

**Chart (a) BlocksWorld:**
*   **Gemini 1.5 Flash (Green):** Starts at approximately 12% accuracy with 2 exemplars, rises to around 45% at 16 exemplars, plateaus around 45-50% from 32 to 2048 exemplars.
*   **Gemini 1.5 Pro (Blue):** Starts at approximately 15% accuracy with 2 exemplars, rises to around 55% at 16 exemplars, and plateaus around 55-60% from 32 to 2048 exemplars.
*   **GPT-4 Turbo 20240409 (Orange):** Starts at approximately 18% accuracy with 2 exemplars, rises to around 40% at 16 exemplars, and plateaus around 40-45% from 32 to 2048 exemplars.
*   Sentence Pieces: 0.4k, 1.1k, 2.6k, 4.9k, 9.3k, 19.5k, 47.7k

**Chart (b) Logistics:**
*   **Gemini 1.5 Flash (Green):** Starts at approximately 10% accuracy with 2 exemplars, rises to around 30% at 64 exemplars, and plateaus around 30-35% from 128 to 2048 exemplars.
*   **Gemini 1.5 Pro (Blue):** Starts at approximately 15% accuracy with 2 exemplars, rises to around 50% at 128 exemplars, and plateaus around 50-55% from 256 to 2048 exemplars.
*   **GPT-4 Turbo 20240409 (Orange):** Starts at approximately 20% accuracy with 2 exemplars, rises to around 40% at 64 exemplars, and plateaus around 40-45% from 128 to 2048 exemplars.
*   Sentence Pieces: 0.9k, 2.3k, 4.5k, 9.1k, 19.1k, 38.7k, 78.9k, 157k, 313.8k

**Chart (c) Mini-Grid:**
*   **Gemini 1.5 Flash (Green):** Starts at approximately 15% accuracy with 2 exemplars, rises to around 40% at 32 exemplars, and plateaus around 40-45% from 64 to 2048 exemplars.
*   **Gemini 1.5 Pro (Blue):** Starts at approximately 20% accuracy with 2 exemplars, rises to around 55% at 64 exemplars, and plateaus around 55-60% from 128 to 2048 exemplars.
*   **GPT-4 Turbo 20240409 (Orange):** Starts at approximately 25% accuracy with 2 exemplars, rises to around 45% at 64 exemplars, and plateaus around 45-50% from 128 to 2048 exemplars.
*   Sentence Pieces: 2.6k, 6.4k, 13.9k, 27.2k, 53.1k, 104.6k, 259.7k, 518.9k

**Chart (d) Trip Planning:**
*   **Gemini 1.5 Flash (Green):** Starts at approximately 10% accuracy with 2 exemplars, rises to around 35% at 16 exemplars, and plateaus around 35-40% from 32 to 2048 exemplars.
*   **Gemini 1.5 Pro (Blue):** Starts at approximately 15% accuracy with 2 exemplars, rises to around 50% at 32 exemplars, and plateaus around 50-55% from 64 to 2048 exemplars.
*   **GPT-4 Turbo 20240409 (Orange):** Starts at approximately 20% accuracy with 2 exemplars, rises to around 40% at 32 exemplars, and plateaus around 40-45% from 64 to 2048 exemplars.
*   Sentence Pieces: 0.7k, 1.5k, 3.8k, 8.5k, 17.3k, 45.4k, 89.9k, 178.2k

**Chart (e) Calendar Scheduling:**
*   **Gemini 1.5 Flash (Green):** Starts at approximately 10% accuracy with 2 exemplars, rises to around 30% at 16 exemplars, and plateaus around 30-35% from 32 to 2048 exemplars.
*   **Gemini 1.5 Pro (Blue):** Starts at approximately 15% accuracy with 2 exemplars, rises to around 45% at 32 exemplars, and plateaus around 45-50% from 64 to 2048 exemplars.
*   **GPT-4 Turbo 20240409 (Orange):** Starts at approximately 20% accuracy with 2 exemplars, rises to around 40% at 32 exemplars, and plateaus around 40-45% from 64 to 2048 exemplars.
*   Sentence Pieces: 0.7k, 1.6k, 3.8k, 7.3k, 13.6k, 34.6k, 70.4k, 144.6k

### Key Observations
*   Gemini 1.5 Pro consistently outperforms both Gemini 1.5 Flash and GPT-4 Turbo 20240409 across all environments.
*   GPT-4 Turbo 20240409 generally outperforms Gemini 1.5 Flash, especially at lower numbers of exemplars.
*   In most environments, the accuracy plateaus after a certain number of exemplars (typically between 32 and 128), indicating diminishing returns from adding more examples.
*   The impact of few-shot exemplars varies across environments. Some environments (e.g., Logistics, Mini-Grid) show a more significant improvement with increasing exemplars than others (e.g., BlocksWorld, Calendar Scheduling).

### Interpretation
The data suggests that Gemini 1.5 Pro is the most effective model for planning tasks across the tested environments, followed by GPT-4 Turbo 20240409 and then Gemini 1.5 Flash. The diminishing returns observed with increasing exemplars indicate that there's a limit to how much performance can be improved by simply providing more examples. The varying impact of exemplars across environments suggests that the complexity of the task and the nature of the environment play a role in how effectively few-shot learning can be applied. The sentence pieces data at the top of each chart may indicate the length of the input prompts or the complexity of the language used, but without further context, it's difficult to draw definitive conclusions about its relationship to planning accuracy. The error bars provide a measure of uncertainty, and it's important to consider these when interpreting the differences between models. Overall, the results highlight the importance of model selection and the potential benefits of few-shot learning for planning tasks.

EXPERT: healer-alpha-free VERSION 1

RUNTIME: free/openrouter/healer-alpha

## Line Charts: Planning Accuracy vs. Few-Shot Exemplars for Three AI Models

### Overview
The image contains five line charts, labeled (a) through (e), comparing the "Planning Accuracy (%)" of three large language models across five different planning tasks. The models are Gemini 1.5 Flash (green line), Gemini 1.5 Pro (orange line), and GPT-4 Turbo 20240409 (blue line). Each chart plots accuracy against the number of "Few-shot exemplars" on a logarithmic scale (bottom x-axis). A secondary top x-axis shows the corresponding "Sentence pieces in 1000 (log scale)". Error bars represent a 70% confidence interval (CI).

### Components/Axes
*   **Common Elements (All Charts):**
    *   **Y-axis:** "Planning Accuracy (%)". Scale ranges from 0 to 60 or 70, depending on the chart.
    *   **Bottom X-axis:** "Few-shot exemplars (log scale)". Tick values are roughly log-spaced rather than strict powers of 2 (e.g., 1, 2, 4, 8, 10, 20, 40, 80, 100, 200, 400, 800).
    *   **Top X-axis:** "Sentence pieces in 1000 (log scale)". Values are specific to each task and increase with the number of exemplars.
    *   **Legend:** Located in the top-left corner of each chart. Lists the three models with corresponding line colors and markers:
        *   Green line with circle markers: `Gemini 1.5 Flash`
        *   Orange line with circle markers: `Gemini 1.5 Pro`
        *   Blue line with circle markers: `GPT-4 Turbo 20240409`
    *   **Annotation:** Text in the bottom-right corner of each chart: "Error bars represent a 70% CI".

*   **Individual Chart Titles (Subplot Labels):**
    *   (a) BlocksWorld.
    *   (b) Logistics.
    *   (c) Mini-Grid.
    *   (d) Trip Planning.
    *   (e) Calendar Scheduling.

### Detailed Analysis

#### (a) BlocksWorld
*   **Trend Verification:**
    *   **Gemini 1.5 Pro (Orange):** Slopes upward, peaks around 40-80 exemplars, then slightly declines.
    *   **Gemini 1.5 Flash (Green):** Slopes upward to a peak at ~20 exemplars, then declines sharply.
    *   **GPT-4 Turbo (Blue):** Slopes upward, with a notable dip at 10 exemplars, then continues rising.
*   **Data Points (Approximate Accuracy %):**
    *   **@1 Exemplar:** Pro ~35%, Flash ~25%, GPT-4 ~0%.
    *   **@Peak (Pro):** ~48% at 40 exemplars.
    *   **@Peak (Flash):** ~42% at 20 exemplars.
    *   **@800 Exemplars:** Pro ~42%, Flash ~25%, GPT-4 ~40%.
*   **Top X-axis (Sentence pieces):** Ranges from 0.4k (at 1 exemplar) to 42.7k (at 800 exemplars).

#### (b) Logistics
*   **Trend Verification:**
    *   **Gemini 1.5 Pro (Orange):** Strong, consistent upward trend.
    *   **Gemini 1.5 Flash (Green):** Relatively flat, low performance with slight fluctuations.
    *   **GPT-4 Turbo (Blue):** Erratic. Rises to a peak at 4 exemplars, then declines and fluctuates.
*   **Data Points (Approximate Accuracy %):**
    *   **@1 Exemplar:** Pro ~42%, Flash ~18%, GPT-4 ~20%.
    *   **@Peak (Pro):** ~65% at 800 exemplars.
    *   **@Peak (Flash):** ~20% at 2-4 exemplars.
    *   **@Peak (GPT-4):** ~35% at 4 exemplars.
    *   **@800 Exemplars:** Pro ~65%, Flash ~12%, GPT-4 ~15%.
*   **Top X-axis (Sentence pieces):** Ranges from 0.9k to 373.8k.

#### (c) Mini-Grid
*   **Trend Verification:**
    *   **Gemini 1.5 Pro (Orange):** Very strong, smooth upward trend, plateauing at high exemplar counts.
    *   **Gemini 1.5 Flash (Green):** Steady upward trend, plateauing around 40-80 exemplars.
    *   **GPT-4 Turbo (Blue):** Steady upward trend, closely follows Flash but slightly lower.
*   **Data Points (Approximate Accuracy %):**
    *   **@1 Exemplar:** Pro ~25%, Flash ~15%, GPT-4 ~22%.
    *   **@Peak (Pro):** ~75% at 200-400 exemplars.
    *   **@Peak (Flash):** ~45% at 80-200 exemplars.
    *   **@400 Exemplars:** Pro ~75%, Flash ~45%, GPT-4 ~42%.
*   **Top X-axis (Sentence pieces):** Ranges from 2.6k to 518.0k.

#### (d) Trip Planning
*   **Trend Verification:**
    *   **Gemini 1.5 Pro (Orange):** Upward trend, peaks around 40-100 exemplars, then declines.
    *   **Gemini 1.5 Flash (Green):** Upward trend to a peak at 20-40 exemplars, then declines.
    *   **GPT-4 Turbo (Blue):** Upward trend, peaks around 20-40 exemplars, then declines.
*   **Data Points (Approximate Accuracy %):**
    *   **@1 Exemplar:** Pro ~3%, Flash ~6%, GPT-4 ~14%.
    *   **@Peak (Pro):** ~42% at 40 exemplars.
    *   **@Peak (Flash):** ~27% at 20 exemplars.
    *   **@Peak (GPT-4):** ~32% at 20 exemplars.
    *   **@800 Exemplars:** Pro ~39%, Flash ~20%, GPT-4 ~20%.
*   **Top X-axis (Sentence pieces):** Ranges from 0.7k to 355.4k.

#### (e) Calendar Scheduling
*   **Trend Verification:**
    *   **Gemini 1.5 Pro (Orange):** Upward trend, peaks around 40-100 exemplars, then slightly declines.
    *   **Gemini 1.5 Flash (Green):** Gradual upward trend, peaks around 100-200 exemplars.
    *   **GPT-4 Turbo (Blue):** Upward trend to a peak at 20 exemplars, then declines.
*   **Data Points (Approximate Accuracy %):**
    *   **@1 Exemplar:** Pro ~33%, Flash ~19%, GPT-4 ~9%.
    *   **@Peak (Pro):** ~53% at 40 exemplars.
    *   **@Peak (Flash):** ~34% at 100 exemplars.
    *   **@Peak (GPT-4):** ~40% at 20 exemplars.
    *   **@400 Exemplars:** Pro ~50%, Flash ~29%, GPT-4 ~30%.
*   **Top X-axis (Sentence pieces):** Ranges from 0.7k to 144.6k.

### Key Observations
1.  **Model Performance Hierarchy:** Gemini 1.5 Pro (orange) consistently achieves the highest or near-highest planning accuracy across all five tasks, especially as the number of exemplars increases.
2.  **Task Difficulty:** The tasks show varying levels of difficulty. "Logistics" and "Mini-Grid" allow for higher peak accuracies (up to ~65% and ~75% respectively), while "BlocksWorld" and "Trip Planning" peak at lower accuracies (~48% and ~42%).
3.  **Diminishing Returns:** For most models and tasks, accuracy improves with more few-shot exemplars but eventually plateaus or even declines, suggesting a point of diminishing returns or potential overfitting.
4.  **Gemini 1.5 Flash Variability:** The performance of Gemini 1.5 Flash (green) is more variable. It performs relatively well in "BlocksWorld" and "Mini-Grid" but poorly in "Logistics". It often peaks at a lower number of exemplars than Pro.
5.  **GPT-4 Turbo Performance:** GPT-4 Turbo (blue) is competitive, often performing similarly to or better than Gemini 1.5 Flash, but generally below Gemini 1.5 Pro. Its performance curve is sometimes less smooth (e.g., the dip in BlocksWorld).

### Interpretation
This set of charts provides a comparative benchmark of planning capabilities for three advanced AI models. The data suggests that **Gemini 1.5 Pro is the most robust and capable planner** among the three across diverse domains, benefiting significantly from increased context (more exemplars and sentence pieces). The consistent upward trend for Pro indicates it effectively utilizes in-context learning for planning tasks.

The **dual x-axes reveal a correlation**: tasks requiring more sentence pieces (context) to represent the same number of exemplars (like Logistics and Mini-Grid) also allow for higher potential accuracy, implying that the complexity of the planning problem and the model's capacity to handle long contexts are key factors.
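
The per-task context cost mentioned above can be estimated from the top-axis ranges quoted in the Detailed Analysis. This is a rough sketch only: the numbers are the approximate readings given in this description, and it assumes each range starts at 1 exemplar and ends at the largest exemplar count reported for that task. The helper name `pieces_per_exemplar` is hypothetical.

```python
# (sentence pieces in thousands at the first point, at the last point,
#  exemplar count at the last point), as quoted in the Detailed Analysis.
# Assumption: the first point of every range corresponds to 1 exemplar.
RANGES = {
    "BlocksWorld": (0.4, 42.7, 800),
    "Logistics": (0.9, 373.8, 800),
    "Mini-Grid": (2.6, 518.0, 400),
    "Trip Planning": (0.7, 355.4, 800),
    "Calendar Scheduling": (0.7, 144.6, 400),
}


def pieces_per_exemplar(first_k: float, last_k: float, last_n: int, first_n: int = 1) -> float:
    """Average number of sentence pieces added per extra exemplar."""
    return 1000 * (last_k - first_k) / (last_n - first_n)


per_task = {task: pieces_per_exemplar(*r) for task, r in RANGES.items()}
for task, cost in sorted(per_task.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{task:<20} ~{cost:,.0f} sentence pieces per exemplar")
```

By these readings, Mini-Grid exemplars are the longest and BlocksWorld exemplars the shortest, which is one way to compare prompt growth across tasks on a common footing.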

The **performance gap between Pro and Flash** highlights differences within the same model family, likely due to model size and capacity. The **variability in Flash's and GPT-4 Turbo's performance** across tasks suggests their planning abilities are more sensitive to the specific structure and rules of the domain (e.g., the logistics task appears particularly challenging for Flash).

Overall, the charts demonstrate that state-of-the-art LLMs can perform structured planning tasks with moderate to good accuracy, and their performance is strongly influenced by both the model architecture and the amount of provided in-context examples. The results are a snapshot of model capabilities as of roughly April 2024 (based on the GPT-4 Turbo version date, 20240409).

EXPERT: nemotron-free VERSION 1

RUNTIME: free/nvidia/nemotron-nano-12b-v2-vl:free

## Line Charts: Planning Accuracy vs Few-Shot Exemplars Across Tasks

### Overview
The image contains five line charts comparing the planning accuracy of three AI models (Gemini 1.5 Flash, Gemini 1.5 Pro, GPT-4 Turbo 20240409) across different tasks (BlocksWorld, Logistics, Mini-Grid, Trip Planning, Calendar Scheduling). Each chart plots accuracy (%) against few-shot exemplars (log scale) with 70% confidence interval error bars.

### Components/Axes
- **X-axis**: "Few-shot exemplars (log scale)" ranging from 1 to 800 exemplars (logarithmic scale)
- **Y-axis**: "Planning Accuracy (%)" ranging from 0% to 80% (linear scale)
- **Legends**: Positioned in top-left corner of each chart, with:
  - Green line: Gemini 1.5 Flash
  - Orange line: Gemini 1.5 Pro
  - Blue line: GPT-4 Turbo 20240409
- **Error Bars**: Represent 70% confidence intervals (CI)

### Detailed Analysis
#### (a) BlocksWorld
- **Trend**: Gemini 1.5 Pro (orange) maintains highest accuracy (35-45%), followed by Gemini 1.5 Flash (green: 25-40%), with GPT-4 Turbo (blue) showing erratic performance (10-35%).
- **Key Data Points**:
  - At 10 exemplars: Gemini 1.5 Pro ≈40%, Gemini 1.5 Flash ≈35%, GPT-4 Turbo ≈30%
  - At 200 exemplars: Gemini 1.5 Pro ≈45%, Gemini 1.5 Flash ≈30%, GPT-4 Turbo ≈40%

#### (b) Logistics
- **Trend**: Gemini 1.5 Pro dominates (50-70%), Gemini 1.5 Flash (15-25%), GPT-4 Turbo (20-40%).
- **Key Data Points**:
  - At 10 exemplars: Gemini 1.5 Pro ≈55%, Gemini 1.5 Flash ≈20%, GPT-4 Turbo ≈30%
  - At 200 exemplars: Gemini 1.5 Pro ≈65%, Gemini 1.5 Flash ≈15%, GPT-4 Turbo ≈45%

#### (c) Mini-Grid
- **Trend**: Gemini 1.5 Pro (60-80%), Gemini 1.5 Flash (30-50%), GPT-4 Turbo (40-60%).
- **Key Data Points**:
  - At 10 exemplars: Gemini 1.5 Pro ≈60%, Gemini 1.5 Flash ≈30%, GPT-4 Turbo ≈40%
  - At 200 exemplars: Gemini 1.5 Pro ≈75%, Gemini 1.5 Flash ≈45%, GPT-4 Turbo ≈55%

#### (d) Trip Planning
- **Trend**: Gemini 1.5 Pro (30-50%), Gemini 1.5 Flash (20-40%), GPT-4 Turbo (25-45%).
- **Key Data Points**:
  - At 10 exemplars: Gemini 1.5 Pro ≈35%, Gemini 1.5 Flash ≈25%, GPT-4 Turbo ≈30%
  - At 200 exemplars: Gemini 1.5 Pro ≈45%, Gemini 1.5 Flash ≈35%, GPT-4 Turbo ≈40%

#### (e) Calendar Scheduling
- **Trend**: Gemini 1.5 Pro (40-60%), Gemini 1.5 Flash (20-40%), GPT-4 Turbo (30-50%).
- **Key Data Points**:
  - At 10 exemplars: Gemini 1.5 Pro ≈40%, Gemini 1.5 Flash ≈20%, GPT-4 Turbo ≈30%
  - At 200 exemplars: Gemini 1.5 Pro ≈55%, Gemini 1.5 Flash ≈35%, GPT-4 Turbo ≈45%

### Key Observations
1. **Model Performance**: Gemini 1.5 Pro consistently outperforms other models across all tasks and exemplar counts.
2. **Task Complexity**: Logistics and Mini-Grid show highest absolute accuracy, while BlocksWorld and Trip Planning have lower baselines.
3. **Error Bar Variability**: GPT-4 Turbo exhibits larger error bars (wider 70% CI), indicating less reliable performance.
4. **Scaling Behavior**: Most model-task pairs improve with more exemplars (Gemini 1.5 Flash on Logistics is an exception, dropping from ≈20% to ≈15%), with diminishing returns after ~100 exemplars.

### Interpretation
The data suggests that Gemini 1.5 Pro exhibits superior few-shot learning capabilities across diverse planning tasks, maintaining higher accuracy and reliability (narrower error bars) compared to the other models. The Logistics and Mini-Grid tasks appear more amenable to AI planning, with Gemini 1.5 Pro reaching roughly 65-75% accuracy at 200 exemplars. GPT-4 Turbo's performance is inconsistent, suggesting potential task-specific limitations or training data gaps. The logarithmic scaling of exemplars highlights that most gains occur in the early stages of few-shot learning, with diminishing returns at higher exemplar counts. This pattern suggests that task-specific fine-tuning or hybrid approaches might be necessary for optimal performance in complex planning scenarios.
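
Because the x-axis is logarithmic, one way to quantify the scaling behavior is accuracy gained per doubling of the exemplar count. The sketch below uses only the approximate Gemini 1.5 Pro readings listed under Key Data Points (10 and 200 exemplars); the helper name `gain_per_doubling` is hypothetical, and two points per task are too few to show where a curve actually flattens.

```python
from math import log2

# Approximate Gemini 1.5 Pro accuracy (%) at 10 and 200 exemplars,
# as quoted under Key Data Points for each task.
PRO_READINGS = {
    "BlocksWorld": (40, 45),
    "Logistics": (55, 65),
    "Mini-Grid": (60, 75),
    "Trip Planning": (35, 45),
    "Calendar Scheduling": (40, 55),
}


def gain_per_doubling(acc_lo: float, acc_hi: float, n_lo: int = 10, n_hi: int = 200) -> float:
    """Accuracy points gained per doubling of the exemplar count."""
    return (acc_hi - acc_lo) / log2(n_hi / n_lo)


gains = {task: gain_per_doubling(*accs) for task, accs in PRO_READINGS.items()}
for task, gain in gains.items():
    print(f"{task:<20} {gain:.2f} points per doubling")
```

Going from 10 to 200 exemplars is about 4.3 doublings, so even the largest 15-point gain in these readings works out to only a few points per doubling.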

TECHNICAL ASSET FINGERPRINT

a11bf197ad828dd9756b91f3
