## [Bar Chart]: Error and Non-Response by Dataset and Model  


### Overview  
The chart shows the **error rate** (solid color) and the **non-response rate** (black cross-hatching) for 14 models (x-axis) across three datasets: *SCLI5* (light blue), *GSM8K-SC* (peach), and *PRM800K-SC* (light green). The y-axis, labeled "Error", ranges from 0.0 to 1.0; the total height of each bar is the combined rate of error (a response containing an error) and non-response (no response).  


### Components/Axes  
- **Title**: "Error and non-response by dataset and model"  
- **Y-axis**: "Error" (scale: 0.0–1.0, increments of 0.2)  
- **X-axis**: "Models" (14 models, left to right):  
  1. Llama-4-Maverick-17B-128E-Instruct-FP8  
  2. DeepSeek-V3-0324  
  3. Qwen2.5-72B-Instruct  
  4. Llama-4-Scout-17B-16E-Instruct-FP8-dynamic  
  5. Llama-3.3-70B-Instruct  
  6. Qwen3-235B-A22B  
  7. Phi-4  
  8. Qwen2.5-7B-Instruct  
  9. Qwen2-7B-Instruct  
  10. Qwen3-14B  
  11. Qwen3-30B-A3B  
  12. Llama-3.1-8B-Instruct  
  13. Qwen3-32B  
  14. Mistral-Small-24B-Instruct-2501  
- **Legend**:  
  - *SCLI5*: Light blue bars (error: response with error)  
  - *GSM8K-SC*: Peach bars (error: response with error)  
  - *PRM800K-SC*: Light green bars (error: response with error)  
  - *Non-Response*: Black cross-hatching (no response, overlaid on bars)  


### Detailed Analysis (Model-by-Model, Dataset-by-Dataset)  
For each model, the table lists approximate values (marked with ~) for the three datasets (SCLI5, GSM8K-SC, PRM800K-SC), split into two components: **solid color** (error: response with error) and **cross-hatching** (non-response: no response). The total height of each bar is the sum of error and non-response.  

| Model                          | SCLI5 (Blue)       | GSM8K-SC (Peach)   | PRM800K-SC (Green) |  
|--------------------------------|--------------------|--------------------|--------------------|  
| **Llama-4-Maverick-17B-128E-Instruct-FP8** | Error: ~0.05; Non-Response: ~0.00 | Error: ~0.58; Non-Response: ~0.00 | Error: ~0.54; Non-Response: ~0.00 |  
| **DeepSeek-V3-0324**           | Error: ~0.17; Non-Response: ~0.00 | Error: ~0.60; Non-Response: ~0.00 | Error: ~0.52; Non-Response: ~0.00 |  
| **Qwen2.5-72B-Instruct**       | Error: ~0.08; Non-Response: ~0.00 | Error: ~0.41; Non-Response: ~0.00 | Error: ~0.47; Non-Response: ~0.00 |  
| **Llama-4-Scout-17B-16E-Instruct-FP8-dynamic** | Error: ~0.02; Non-Response: ~0.00 | Error: ~0.76; Non-Response: ~0.00 | Error: ~0.73; Non-Response: ~0.00 |  
| **Llama-3.3-70B-Instruct**     | Error: ~0.05; Non-Response: ~0.00 | Error: ~0.72; Non-Response: ~0.00 | Error: ~0.75; Non-Response: ~0.00 |  
| **Qwen3-235B-A22B**            | Error: ~0.30; Non-Response: ~0.13 | Error: ~0.35; Non-Response: ~0.57 | Error: ~0.53; Non-Response: ~0.12 |  
| **Phi-4**                      | Error: ~0.19; Non-Response: ~0.01 | Error: ~0.23; Non-Response: ~0.69 | Error: ~0.68; Non-Response: ~0.23 |  
| **Qwen2.5-7B-Instruct**        | Error: ~0.31; Non-Response: ~0.13 | Error: ~0.60; Non-Response: ~0.21 | Error: ~0.68; Non-Response: ~0.18 |  
| **Qwen2-7B-Instruct**          | Error: ~0.27; Non-Response: ~0.13 | Error: ~0.72; Non-Response: ~0.20 | Error: ~0.40; Non-Response: ~0.54 |  
| **Qwen3-14B**                  | Error: ~0.37; Non-Response: ~0.63 | Error: ~0.61; Non-Response: ~0.30 | Error: ~0.63; Non-Response: ~0.12 |  
| **Qwen3-30B-A3B**              | Error: ~0.37; Non-Response: ~0.57 | Error: ~0.68; Non-Response: ~0.26 | Error: ~0.67; Non-Response: ~0.13 |  
| **Llama-3.1-8B-Instruct**      | Error: ~0.12; Non-Response: ~0.74 | Error: ~0.07; Non-Response: ~0.91 | Error: ~0.18; Non-Response: ~0.80 |  
| **Qwen3-32B**                  | Error: ~0.00; Non-Response: ~1.00 | Error: ~0.48; Non-Response: ~0.47 | Error: ~0.55; Non-Response: ~0.37 |  
| **Mistral-Small-24B-Instruct-2501** | Error: ~0.00; Non-Response: ~0.96 | Error: ~0.05; Non-Response: ~0.94 | Error: ~0.20; Non-Response: ~0.79 |  
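
Since each bar's total height is stated to be the sum of error and non-response, the table can be checked mechanically. The following is a minimal sketch, assuming the approximate values above are transcribed as `(error, non_response)` pairs; the names `RATES`, `DATASETS`, and `total_heights` are hypothetical and exist only for illustration.

```python
# Approximate values copied from the table above, in x-axis order.
# Each entry is (error, non_response) for SCLI5, GSM8K-SC, PRM800K-SC.
DATASETS = ("SCLI5", "GSM8K-SC", "PRM800K-SC")

RATES = {
    "Llama-4-Maverick-17B-128E-Instruct-FP8": [(0.05, 0.00), (0.58, 0.00), (0.54, 0.00)],
    "DeepSeek-V3-0324": [(0.17, 0.00), (0.60, 0.00), (0.52, 0.00)],
    "Qwen2.5-72B-Instruct": [(0.08, 0.00), (0.41, 0.00), (0.47, 0.00)],
    "Llama-4-Scout-17B-16E-Instruct-FP8-dynamic": [(0.02, 0.00), (0.76, 0.00), (0.73, 0.00)],
    "Llama-3.3-70B-Instruct": [(0.05, 0.00), (0.72, 0.00), (0.75, 0.00)],
    "Qwen3-235B-A22B": [(0.30, 0.13), (0.35, 0.57), (0.53, 0.12)],
    "Phi-4": [(0.19, 0.01), (0.23, 0.69), (0.68, 0.23)],
    "Qwen2.5-7B-Instruct": [(0.31, 0.13), (0.60, 0.21), (0.68, 0.18)],
    "Qwen2-7B-Instruct": [(0.27, 0.13), (0.72, 0.20), (0.40, 0.54)],
    "Qwen3-14B": [(0.37, 0.63), (0.61, 0.30), (0.63, 0.12)],
    "Qwen3-30B-A3B": [(0.37, 0.57), (0.68, 0.26), (0.67, 0.13)],
    "Llama-3.1-8B-Instruct": [(0.12, 0.74), (0.07, 0.91), (0.18, 0.80)],
    "Qwen3-32B": [(0.00, 1.00), (0.48, 0.47), (0.55, 0.37)],
    "Mistral-Small-24B-Instruct-2501": [(0.00, 0.96), (0.05, 0.94), (0.20, 0.79)],
}


def total_heights(rates):
    """Return {model: {dataset: error + non_response}}, rounded to 2 places."""
    return {
        model: {ds: round(err + nr, 2) for ds, (err, nr) in zip(DATASETS, cells)}
        for model, cells in rates.items()
    }


TOTALS = total_heights(RATES)

if __name__ == "__main__":
    for model, per_dataset in TOTALS.items():
        print(model, per_dataset)
```

A total above 1.0 would indicate a misread bar, so a quick range check on `TOTALS` is a cheap sanity test for the transcription.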


### Key Observations  
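1. **Non-Response Concentrated on the Right of the X-Axis**:  
   - The five leftmost models (Llama-4-Maverick through Llama-3.3-70B-Instruct) show essentially no non-response (~0.00), so their bars consist entirely of solid-color error.  
   - From Qwen3-235B-A22B rightward, cross-hatching appears, and for the rightmost models (e.g., Llama-3.1-8B-Instruct, Mistral-Small-24B-Instruct-2501) it dominates, pushing the total bar height (error + non-response) toward 1.0.  

2. **Dataset-Specific Trends**:  
   - *SCLI5* (blue): Low error (~0.02–0.17) for the five leftmost models, but non-response dominates further right (e.g., ~0.63 for Qwen3-14B, ~1.00 for Qwen3-32B).  
   - *GSM8K-SC* (peach): Moderate-to-high error (~0.41–0.76) for the leftmost models, with non-response rising sharply on the right (e.g., ~0.91 for Llama-3.1-8B-Instruct, ~0.94 for Mistral-Small-24B-Instruct-2501).  
   - *PRM800K-SC* (green): Moderate-to-high error (~0.47–0.75) for the leftmost models; non-response also rises toward the right but is generally lower than on GSM8K-SC (Qwen2-7B-Instruct is an exception, at ~0.54 versus ~0.20).  

3. **Shift in Response Behavior Along the X-Axis**:  
   - Models on the left (e.g., Llama-4-Maverick, DeepSeek-V3) almost always respond, with low error on SCLI5 and moderate-to-high error on the other two datasets.  
   - Models on the right (e.g., Llama-3.1-8B-Instruct, Mistral-Small-24B-Instruct-2501) mostly give no response, which keeps the solid error segment small but the total bar height high. The left-to-right order is the chart's x-axis order, not release order: the Llama-4 models on the left are newer than Llama-3.1-8B-Instruct on the right.  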
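
The pattern of non-response by x-axis position can be made concrete with a short calculation. This is a sketch under stated assumptions: the non-response values are the approximate table readings, the helper names are hypothetical, and the 0.5 cutoff for "mostly non-responding" is an arbitrary choice for illustration, not something the chart defines.

```python
# Non-response rates in x-axis order, copied from the table as
# (SCLI5, GSM8K-SC, PRM800K-SC). Values are approximate chart readings.
NON_RESPONSE = [
    ("Llama-4-Maverick-17B-128E-Instruct-FP8", (0.00, 0.00, 0.00)),
    ("DeepSeek-V3-0324", (0.00, 0.00, 0.00)),
    ("Qwen2.5-72B-Instruct", (0.00, 0.00, 0.00)),
    ("Llama-4-Scout-17B-16E-Instruct-FP8-dynamic", (0.00, 0.00, 0.00)),
    ("Llama-3.3-70B-Instruct", (0.00, 0.00, 0.00)),
    ("Qwen3-235B-A22B", (0.13, 0.57, 0.12)),
    ("Phi-4", (0.01, 0.69, 0.23)),
    ("Qwen2.5-7B-Instruct", (0.13, 0.21, 0.18)),
    ("Qwen2-7B-Instruct", (0.13, 0.20, 0.54)),
    ("Qwen3-14B", (0.63, 0.30, 0.12)),
    ("Qwen3-30B-A3B", (0.57, 0.26, 0.13)),
    ("Llama-3.1-8B-Instruct", (0.74, 0.91, 0.80)),
    ("Qwen3-32B", (1.00, 0.47, 0.37)),
    ("Mistral-Small-24B-Instruct-2501", (0.96, 0.94, 0.79)),
]


def mean_non_response(rows):
    """Mean non-response across the three datasets, per model."""
    return [(name, round(sum(vals) / len(vals), 2)) for name, vals in rows]


def first_position_with_non_response(rows):
    """1-based x-axis position of the first model with any non-response."""
    for pos, (_, vals) in enumerate(rows, start=1):
        if any(v > 0 for v in vals):
            return pos
    return None


MEANS = mean_non_response(NON_RESPONSE)
FIRST = first_position_with_non_response(NON_RESPONSE)
# 0.5 is an arbitrary illustrative cutoff (an assumption, not from the chart).
MOSTLY_SILENT = [name for name, m in MEANS if m > 0.5]

if __name__ == "__main__":
    print("first position with non-response:", FIRST)
    print("mean non-response > 0.5:", MOSTLY_SILENT)
```

With these readings, the mean does not rise strictly monotonically across positions 6 to 14, so the left/right contrast is a tendency rather than a clean ordering.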


### Interpretation  
The chart reveals a **trade-off between error (response with error) and non-response (no response)** across models and datasets:  

- **Models on the Left**: Respond almost always (non-response ~0.00) but are often wrong, especially on *GSM8K-SC* and *PRM800K-SC* (reasoning-heavy datasets).  
- **Models on the Right**: Frequently give no response, so the total bar height is high even where the solid error segment is small (non-response is counted as a failure). The chart alone cannot distinguish a deliberate behavior (e.g., models abstaining rather than answering incorrectly) from a limitation (e.g., models struggling with the dataset and failing to generate a response).  

- **Dataset Complexity**: *GSM8K-SC* (math reasoning) and *PRM800K-SC* (reasoning) appear more challenging, with higher error among the models that do respond. *SCLI5* (possibly simpler) shows low error for the leftmost models but high non-response for several models on the right, so its tall bars there mostly reflect missing responses rather than wrong answers.  
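
One way to compare dataset difficulty without the confound of non-response is to average error over only the five leftmost models, whose non-response is ~0.00 in the table. The sketch below assumes those approximate readings; `LEFT_FIVE_ERROR` and `MEAN_ERROR` are hypothetical names used only for this illustration.

```python
from statistics import mean

# Error rates of the five leftmost models (non-response ~0.00), per dataset,
# in x-axis order. Values are approximate readings from the table.
LEFT_FIVE_ERROR = {
    "SCLI5": [0.05, 0.17, 0.08, 0.02, 0.05],
    "GSM8K-SC": [0.58, 0.60, 0.41, 0.76, 0.72],
    "PRM800K-SC": [0.54, 0.52, 0.47, 0.73, 0.75],
}

# Mean error per dataset, restricted to models that always respond.
MEAN_ERROR = {ds: round(mean(vals), 3) for ds, vals in LEFT_FIVE_ERROR.items()}

if __name__ == "__main__":
    for dataset, value in MEAN_ERROR.items():
        print(f"{dataset}: {value}")
```

Restricting to always-responding models is a design choice: for models with heavy non-response, a small error segment may only mean that few answers were produced at all.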


This analysis highlights how total failure (error + non-response) varies by model and dataset, with non-response the dominant component for the models on the right of the chart.