Image 30da64807531...

EXPERT: gemma-3-27b-it-free VERSION 1

RUNTIME: google-free/gemma-3-27b-it
INTEL_VERIFIED
## Bar Chart: Accuracy of LLM and PRM Combinations on AIME2024

### Overview
This bar chart displays the accuracy of different Large Language Model (LLM) and Program-aided Reasoning Model (PRM) combinations when evaluated on the AIME2024 dataset, using a Best-of-N approach. Accuracy is measured as a percentage. The chart compares four different LLM/PRM combinations.

### Components/Axes
*   **Title:** "Accuracy of each LLM and PRM combination using Best-of-N on AIME2024" (positioned at the top-center)
*   **X-axis:** LLM/PRM combinations: "Qwen2.5-7B-Instruct", "Eurus-2-7B-PRIME", "Phi-4-14B", "Qwen2.5-7B-AIRL-S(Our LLM)" (positioned at the bottom)
*   **Y-axis:** Accuracy (%) - Scale ranges from 10 to 30, with increments of 5. (positioned on the left)
*   **Legend:** Located in the top-left corner, identifying the color-coded LLM/PRM combinations:
    *   "Accuracy@1" (light red)
    *   "Math-Shepherd-Mistral-7B-PRM" (light green)
    *   "EurusPRM-Stage2" (light blue)
    *   "Llama3.1-8B-PRM-Deepseek-Data" (medium green)
    *   "Qwen2.5-AIRL-S-PRM(Ours PRM)" (dark grey)

### Detailed Analysis
The chart consists of four groups of bars, one for each LLM/PRM combination on the x-axis. Each group contains five bars, representing the accuracy of each of the five LLM/PRM combinations.

*   **Qwen2.5-7B-Instruct:**
    *   Accuracy@1: Approximately 16.7%
    *   Math-Shepherd-Mistral-7B-PRM: Approximately 20.0%
    *   EurusPRM-Stage2: Approximately 20.0%
    *   Llama3.1-8B-PRM-Deepseek-Data: Approximately 23.3%
    *   Qwen2.5-AIRL-S-PRM(Ours PRM): Approximately 23.3%
*   **Eurus-2-7B-PRIME:**
    *   Accuracy@1: Approximately 20.0%
    *   Math-Shepherd-Mistral-7B-PRM: Approximately 23.3%
    *   EurusPRM-Stage2: Approximately 23.3%
    *   Llama3.1-8B-PRM-Deepseek-Data: Approximately 23.3%
    *   Qwen2.5-AIRL-S-PRM(Ours PRM): Approximately 20.0%
*   **Phi-4-14B:**
    *   Accuracy@1: Approximately 13.3%
    *   Math-Shepherd-Mistral-7B-PRM: Approximately 16.7%
    *   EurusPRM-Stage2: Approximately 20.0%
    *   Llama3.1-8B-PRM-Deepseek-Data: Approximately 20.0%
    *   Qwen2.5-AIRL-S-PRM(Ours PRM): Approximately 20.0%
*   **Qwen2.5-7B-AIRL-S(Our LLM):**
    *   Accuracy@1: Approximately 26.7%
    *   Math-Shepherd-Mistral-7B-PRM: Approximately 30.0%
    *   EurusPRM-Stage2: Approximately 30.0%
    *   Llama3.1-8B-PRM-Deepseek-Data: Approximately 26.7%
    *   Qwen2.5-AIRL-S-PRM(Ours PRM): Approximately 26.7%

### Key Observations
*   The "Qwen2.5-7B-AIRL-S(Our LLM)" combination consistently achieves the highest accuracy across all PRM models, particularly with "Math-Shepherd-Mistral-7B-PRM" and "EurusPRM-Stage2", reaching 30.0%.
*   "Phi-4-14B" consistently shows the lowest accuracy across all PRM models.
*   The "Llama3.1-8B-PRM-Deepseek-Data" and "Qwen2.5-AIRL-S-PRM(Ours PRM)" combinations often yield similar accuracy scores.
*   The "Accuracy@1" baseline is consistently lower than the accuracy achieved when combined with any of the PRM models.

### Interpretation
The data suggests that combining LLMs with PRM models significantly improves accuracy on the AIME2024 dataset. The "Qwen2.5-7B-AIRL-S" LLM, when paired with "Math-Shepherd-Mistral-7B-PRM" or "EurusPRM-Stage2", demonstrates the highest performance, indicating a strong synergy between these models. The consistently low performance of "Phi-4-14B" suggests it may not be as well-suited for this particular task or dataset. The improvement observed when using PRM models compared to the "Accuracy@1" baseline highlights the benefit of program-aided reasoning in enhancing LLM capabilities. The fact that the "Qwen2.5-AIRL-S-PRM(Ours PRM)" performs well, but not always the best, suggests that while the team's PRM is effective, other PRM models like "Math-Shepherd-Mistral-7B-PRM" and "EurusPRM-Stage2" may offer further improvements when combined with their LLM.
DECODING INTELLIGENCE...
TECHNICAL ASSET FINGERPRINT

30da648075318fa73ec58213

FOUND IN PAPERS

EXPERT: gemma-3-27b-it-free VERSION 1