# Technical Data Extraction: Model Performance vs. Scaled Parameters

This document contains a detailed extraction of data from two scatter plots comparing various model pruning and merging methods across different base models.

## 1. General Metadata and Layout
*   **Image Structure:** Two side-by-side scatter plots with a shared legend on the far right.
*   **X-Axis (Shared):** "Log Scaled Total Parameters (in billions)". The scale is logarithmic, ranging from $10^1$ to $10^3$.
*   **Y-Axis (Left Plot):** "Non-Agentic Code Acc. (%)". Range: 0 to 70.
*   **Y-Axis (Right Plot):** "MC Accuracy (%)". Range: 45 to 80.
*   **Legend Location:** Right-hand side of the image.
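
Because the shared x-axis is logarithmic, equal horizontal distances correspond to equal ratios of parameter counts, not equal differences. A minimal sketch of that mapping follows, assuming the axis spans exactly $10^1$ to $10^3$ billion parameters as read from the tick labels. The function name `log_axis_fraction` is illustrative and does not come from the source.

```python
import math

def log_axis_fraction(params_billion: float, lo: float = 10.0, hi: float = 1000.0) -> float:
    """Return the fractional position (0 = left edge, 1 = right edge)
    of a parameter count along a log10 axis running from lo to hi."""
    if params_billion <= 0:
        raise ValueError("parameter count must be positive")
    span = math.log10(hi) - math.log10(lo)
    return (math.log10(params_billion) - math.log10(lo)) / span

# The 10^2 tick sits at the horizontal midpoint of the axis.
print(log_axis_fraction(100.0))
```

Under this mapping, a 30B model and a 300B model are separated by the same horizontal distance as a 10B model and a 100B model.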

---

## 2. Legend and Categorization

### A. Baseline Models (Black Markers)
These represent the uncompressed/original models.
*   **Circle (●):** ERNIE-4.5-21B-A3B
*   **Square (■):** Qwen3-30B-A3B
*   **Inverted Triangle (▼):** Mixtral-8x7B-Instruct-v0.1
*   **Star (★):** LLaMA-4-Scout-17B-16E-Instruct
*   **Pentagon (⬟):** GLM-4.5-Air
*   **Diamond (◆):** Qwen3-Coder-480B-A35B-Instruct-FP8
*   **Cross (✖):** Kimi-K2-Instruct-W4A16

### B. Pruning Methods (Dashed Lines with Markers)
*   **REAP (ours) [Blue]:** Consistently the highest-performing pruning method across most parameter scales.
*   **EAN [Pink]:** Generally follows REAP but at a slightly lower accuracy level.
*   **Frequency [Green]:** Shows the steepest performance drop as parameters are reduced; often the lowest-performing pruning method.

### C. Merging Methods (Dashed Lines with Markers)
*   **HC-SMoE [Gold/Yellow]:** Mid-tier performance.
*   **M-SMoE [Light Blue]:** Generally the lowest-performing merging method, showing significant accuracy degradation at lower parameter counts.
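
The legend encodes two independent attributes: marker shape identifies the base model, and marker color identifies the compression method. One way to capture this as a lookup structure is sketched below. The dictionary keys and the helper `describe_point` are hypothetical names chosen for illustration; the model and method names are taken from the legend as listed above.

```python
# Marker shape -> base model, as listed in the legend.
MODEL_BY_MARKER = {
    "circle": "ERNIE-4.5-21B-A3B",
    "square": "Qwen3-30B-A3B",
    "inverted_triangle": "Mixtral-8x7B-Instruct-v0.1",
    "star": "LLaMA-4-Scout-17B-16E-Instruct",
    "pentagon": "GLM-4.5-Air",
    "diamond": "Qwen3-Coder-480B-A35B-Instruct-FP8",
    "cross": "Kimi-K2-Instruct-W4A16",
}

# Marker color -> (method, category). Black marks an uncompressed baseline.
METHOD_BY_COLOR = {
    "black": ("Baseline", "uncompressed"),
    "blue": ("REAP", "pruning"),
    "pink": ("EAN", "pruning"),
    "green": ("Frequency", "pruning"),
    "gold": ("HC-SMoE", "merging"),
    "light_blue": ("M-SMoE", "merging"),
}

def describe_point(marker: str, color: str) -> str:
    """Decode a plotted point from its marker shape and color."""
    model = MODEL_BY_MARKER[marker]
    method, category = METHOD_BY_COLOR[color]
    if category == "uncompressed":
        return f"{model} (baseline)"
    return f"{model} compressed with {method} ({category})"

print(describe_point("cross", "green"))
# prints "Kimi-K2-Instruct-W4A16 compressed with Frequency (pruning)"
```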

---

## 3. Data Trends and Analysis

### Left Plot: Non-Agentic Code Acc. (%)
*   **General Trend:** For all methods, accuracy increases with total parameter count. Compressing a larger base model (e.g., the 480B diamond series) yields higher accuracy than compressing a smaller base model to the same total parameter count.
*   **REAP Performance:** In the cluster between $10^1$ and $10^2$ billion parameters, REAP (blue) maintains accuracy closest to the black baseline markers.
*   **Frequency Method Drop-off:** For the 480B model (diamond), the Frequency method (green) drops from ~55% accuracy at ~300B parameters to near 0% accuracy when scaled down to ~250B parameters, indicating high sensitivity to the compression ratio.

### Right Plot: MC Accuracy (%)
*   **General Trend:** Similar to the left plot, accuracy increases with parameter count. The "Pareto front" is defined by the Baseline models (black) and the REAP pruning method (blue).
*   **Method Comparison:**
    *   **REAP (Blue)** consistently stays at the top of each model's cluster.
    *   **EAN (Pink)** and **HC-SMoE (Gold)** occupy the middle ground.
    *   **M-SMoE (Light Blue)** and **Frequency (Green)** consistently show the worst performance retention.
*   **Scaling Observations:** For the Kimi-K2 (cross) and Qwen3-Coder (diamond) models in the $10^2$ to $10^3$ billion-parameter range, REAP maintains >70% MC Accuracy, while Frequency and M-SMoE drop toward 60% or lower as they are compressed.

---

## 4. Component Isolation and Spatial Grounding

| Region | Content Description |
| :--- | :--- |
| **Header** | No formal title text; axes provide the context. |
| **Left Chart** | Focuses on Code Accuracy. Shows 4 distinct clusters of models being compressed. |
| **Right Chart** | Focuses on Multiple Choice (MC) Accuracy. Shows 3 distinct clusters of models being compressed. |
| **Footer** | X-axis labels: $10^1$, $10^2$, $10^3$ (Log scale). |
| **Right Sidebar** | Legend containing 2 categories of methods (Pruning, Merging) and 7 specific model identifiers. |

## 5. Precise Marker Mapping
*   **High-End (~$10^3$B parameters):** The Kimi-K2 (✖) baseline is at ~78% MC Accuracy. The REAP version (blue ✖) is slightly below it, while the Frequency version (green ✖) is significantly lower at ~60%.
*   **Mid-Range (~$10^2$B parameters):** The GLM-4.5-Air (⬟) baseline is at ~60% Code Acc. and ~74% MC Acc.
*   **Low-End (~$10^1$B parameters):** Models compressed to ~15B parameters show a wide spread, with REAP holding ~45% Code Acc. while M-SMoE drops below 10%.
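
The approximate readings above can be turned into a simple retention figure, that is, the fraction of baseline accuracy a compressed model keeps. The sketch below uses only the Kimi-K2 MC Accuracy values quoted in this section (~78% baseline, ~60% for Frequency). Both are visual estimates from the plot, so the result is indicative only, and the helper name `retention` is an assumption for illustration.

```python
def retention(compressed_acc: float, baseline_acc: float) -> float:
    """Fraction of the baseline accuracy kept after compression."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return compressed_acc / baseline_acc

# Approximate values read off the right plot for Kimi-K2 (visual estimates).
baseline_mc = 78.0   # uncompressed baseline, MC Accuracy (%)
frequency_mc = 60.0  # Frequency-pruned variant, MC Accuracy (%)

drop_points = baseline_mc - frequency_mc
kept = retention(frequency_mc, baseline_mc)
print(f"drop: {drop_points:.0f} points, retained: {kept:.1%}")
# prints "drop: 18 points, retained: 76.9%"
```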