## Grouped Bar Chart: Accuracy by Model and Agent
### Overview
This grouped bar chart compares the accuracy of nine large language models (LLMs) across nine distinct "agent" configurations (prompting strategies). The chart visualizes how accuracy varies both by model and by the agent method applied.
### Components/Axes
* **Title:** "Accuracy by Model and Agent"
* **Y-Axis:** Labeled "Accuracy". Scale ranges from 0.0 to 1.0, with major tick marks at 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.
* **X-Axis:** Labeled "Model". Lists nine distinct models:
1. Llama 2 7B
2. Llama 2 70B
3. GPT-3.5 Turbo
4. Gemini Pro 1.0
5. Command R+
6. Mistral Large
7. Gemini Pro 1.5
8. GPT-4
9. Claude 3 Opus
* **Legend:** Positioned in the bottom-right quadrant, overlapping the bars for the last two models. It defines the "Agent" types by color:
* **Baseline:** Blue
* **Retry:** Orange
* **Keywords:** Green
* **Advice:** Red
* **Instructions:** Purple
* **Explanation:** Brown
* **Solution:** Pink
* **Composite:** Olive/Yellow-Green
* **Unredacted:** Gray
### Detailed Analysis
Below are the approximate accuracy values for each agent within each model group, derived from visual inspection of the bar heights. Values are estimated to the nearest 0.01.
**1. Llama 2 7B**
* **Trend:** Generally low accuracy, with a gradual increase from Baseline to Unredacted.
* **Values:** Baseline (~0.30), Retry (~0.37), Keywords (~0.36), Advice (~0.37), Instructions (~0.38), Explanation (~0.45), Solution (~0.41), Composite (~0.42), Unredacted (~0.50).
**2. Llama 2 70B**
* **Trend:** Significant improvement over the 7B model. A clear upward trend from Baseline to Unredacted, with a notable jump for the Unredacted agent.
* **Values:** Baseline (~0.38), Retry (~0.48), Keywords (~0.56), Advice (~0.59), Instructions (~0.57), Explanation (~0.65), Solution (~0.60), Composite (~0.67), Unredacted (~0.84).
**3. GPT-3.5 Turbo**
* **Trend:** Higher overall accuracy than either Llama model. A steady, step-wise increase across agents, with Unredacted performing best.
* **Values:** Baseline (~0.60), Retry (~0.69), Keywords (~0.69), Advice (~0.70), Instructions (~0.71), Explanation (~0.80), Solution (~0.83), Composite (~0.82), Unredacted (~0.90).
**4. Gemini Pro 1.0**
* **Trend:** Similar pattern to GPT-3.5 Turbo but with slightly lower peak accuracy for Unredacted.
* **Values:** Baseline (~0.61), Retry (~0.72), Keywords (~0.73), Advice (~0.72), Instructions (~0.72), Explanation (~0.75), Solution (~0.77), Composite (~0.78), Unredacted (~0.88).
**5. Command R+**
* **Trend:** Strong performance, with a pronounced peak for the Unredacted agent.
* **Values:** Baseline (~0.64), Retry (~0.74), Keywords (~0.77), Advice (~0.73), Instructions (~0.80), Explanation (~0.77), Solution (~0.84), Composite (~0.87), Unredacted (~0.94).
**6. Mistral Large**
* **Trend:** High and relatively flat performance across most agents, with Unredacted and Composite leading.
* **Values:** Baseline (~0.72), Retry (~0.77), Keywords (~0.79), Advice (~0.80), Instructions (~0.80), Explanation (~0.82), Solution (~0.89), Composite (~0.90), Unredacted (~0.92).
**7. Gemini Pro 1.5**
* **Trend:** Very high accuracy, with most agents clustering above 0.80. Unredacted is the clear outlier at the top.
* **Values:** Baseline (~0.75), Retry (~0.81), Keywords (~0.81), Advice (~0.82), Instructions (~0.81), Explanation (~0.81), Solution (~0.81), Composite (~0.81), Unredacted (~0.97).
**8. GPT-4**
* **Trend:** Consistently high accuracy across all agents, with a gradual increase towards the rightmost agents.
* **Values:** Baseline (~0.79), Retry (~0.83), Keywords (~0.83), Advice (~0.84), Instructions (~0.85), Explanation (~0.88), Solution (~0.93), Composite (~0.93), Unredacted (~0.97).
**9. Claude 3 Opus**
* **Trend:** The highest-performing model overall. All agents score above 0.80, with a tight cluster at the top end.
* **Values:** Baseline (~0.79), Retry (~0.85), Keywords (~0.85), Advice (~0.85), Instructions (~0.85), Explanation (~0.91), Solution (~0.94), Composite (~0.95), Unredacted (~0.97).
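The estimated values above can be used to reconstruct a chart of this form. The sketch below is a minimal matplotlib reproduction; the numbers are the approximate visual readings listed in this section, not exact source data, and the styling (figure size, rotation, colors) is assumed rather than taken from the original figure.

```python
# Sketch: grouped bar chart rebuilt from the approximate readings above.
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
import numpy as np

models = ["Llama 2 7B", "Llama 2 70B", "GPT-3.5 Turbo", "Gemini Pro 1.0",
          "Command R+", "Mistral Large", "Gemini Pro 1.5", "GPT-4",
          "Claude 3 Opus"]
agents = ["Baseline", "Retry", "Keywords", "Advice", "Instructions",
          "Explanation", "Solution", "Composite", "Unredacted"]
# accuracy[model][agent], agents in the order above
accuracy = np.array([
    [0.30, 0.37, 0.36, 0.37, 0.38, 0.45, 0.41, 0.42, 0.50],
    [0.38, 0.48, 0.56, 0.59, 0.57, 0.65, 0.60, 0.67, 0.84],
    [0.60, 0.69, 0.69, 0.70, 0.71, 0.80, 0.83, 0.82, 0.90],
    [0.61, 0.72, 0.73, 0.72, 0.72, 0.75, 0.77, 0.78, 0.88],
    [0.64, 0.74, 0.77, 0.73, 0.80, 0.77, 0.84, 0.87, 0.94],
    [0.72, 0.77, 0.79, 0.80, 0.80, 0.82, 0.89, 0.90, 0.92],
    [0.75, 0.81, 0.81, 0.82, 0.81, 0.81, 0.81, 0.81, 0.97],
    [0.79, 0.83, 0.83, 0.84, 0.85, 0.88, 0.93, 0.93, 0.97],
    [0.79, 0.85, 0.85, 0.85, 0.85, 0.91, 0.94, 0.95, 0.97],
])

fig, ax = plt.subplots(figsize=(14, 6))
x = np.arange(len(models))        # one group per model
width = 0.8 / len(agents)         # bar width within a group
for j, agent in enumerate(agents):
    # offset each agent's bars around the group center
    ax.bar(x + (j - len(agents) / 2 + 0.5) * width,
           accuracy[:, j], width, label=agent)
ax.set_title("Accuracy by Model and Agent")
ax.set_xlabel("Model")
ax.set_ylabel("Accuracy")
ax.set_ylim(0.0, 1.0)
ax.set_xticks(x)
ax.set_xticklabels(models, rotation=30, ha="right")
ax.legend(title="Agent", loc="lower right")
fig.tight_layout()
fig.savefig("accuracy_by_model_and_agent.png")
```

Encoding the data as a 9x9 array makes the two axes of variation explicit: rows index models (the x-axis groups) and columns index agents (the legend entries).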
### Key Observations
1. **Model Performance Hierarchy:** There is a clear progression in overall accuracy from left to right on the x-axis. Llama 2 7B is the lowest-performing model, while Claude 3 Opus and GPT-4 are the highest.
2. **Agent Effect:** Within every single model group, the **Unredacted** agent (gray bar) achieves the highest accuracy. The **Baseline** agent (blue bar) is consistently the lowest or among the lowest.
3. **Performance Clustering:** For the top-performing models (Gemini Pro 1.5, GPT-4, Claude 3 Opus), the accuracy scores for many agents (Retry, Keywords, Advice, Instructions) are very similar, forming a plateau. The major differentiators at the top are the Explanation, Solution, Composite, and especially Unredacted agents.
4. **Non-Linear Improvement:** The jump in accuracy from Llama 2 7B to Llama 2 70B is substantial, particularly for the more advanced agents (Explanation, Composite, Unredacted), indicating that model scale significantly amplifies the benefits of these prompting strategies.
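Observations 2 and 4 can be checked directly against the estimated values. The sketch below uses only the approximate Baseline and Unredacted readings from this section to compute the per-model gain; the numbers are visual estimates, so the gains are indicative, not exact.

```python
# Baseline -> Unredacted gain per model, from the approximate readings above.
models = ["Llama 2 7B", "Llama 2 70B", "GPT-3.5 Turbo", "Gemini Pro 1.0",
          "Command R+", "Mistral Large", "Gemini Pro 1.5", "GPT-4",
          "Claude 3 Opus"]
baseline   = [0.30, 0.38, 0.60, 0.61, 0.64, 0.72, 0.75, 0.79, 0.79]
unredacted = [0.50, 0.84, 0.90, 0.88, 0.94, 0.92, 0.97, 0.97, 0.97]

gains = [u - b for b, u in zip(baseline, unredacted)]
for name, g in zip(models, gains):
    print(f"{name:>15}: +{g:.2f}")

# Llama 2 70B shows the largest absolute gain (~0.46); the gap narrows
# for the strongest models (~0.18), consistent with a ceiling effect.
print("largest gain:", models[gains.index(max(gains))])
```

The shrinking gain for the top models is the "diminishing returns" pattern noted in the interpretation below: as Baseline accuracy rises, there is simply less headroom for any agent strategy to recover.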
### Interpretation
This chart demonstrates two key findings in LLM evaluation:
1. **Model Capability is Foundational:** The base capability of the model (represented by its position on the x-axis) sets the primary ceiling for performance. No agent strategy can elevate a weaker model (e.g., Llama 2 7B) to the level of a stronger one (e.g., Claude 3 Opus).
2. **Agent Strategies Unlock Potential:** The choice of agent or prompting strategy has a profound and consistent impact on accuracy *within* a given model. The "Unredacted" strategy, which likely involves providing the model with full, unfiltered context or information, universally yields the best results. This suggests that performance bottlenecks are often related to information access or framing rather than pure model reasoning. The "Baseline" strategy's poor performance highlights the inadequacy of minimal prompting.
The data implies that for optimal performance, one should use the most capable model available **and** employ advanced agent strategies like "Unredacted," "Composite," or "Solution." The diminishing returns across agents for the top models suggest they are approaching a performance ceiling on this specific task, where further gains require either better base models or fundamentally different approaches.