## Line Charts: Mean Cumulative Mass Distribution for Edges and Heads (Sparse vs. Non-Sparse)
### Overview
The image consists of two side-by-side line charts comparing "Mean Cumulative Mass" against "Sorted Index" for two different components: "Edges" (left chart) and "Heads" (right chart). Both charts plot two data series distinguished by color (orange and blue), representing "Sparse" and "Non Sparse" configurations, respectively. The charts show how quickly cumulative mass accumulates as the sorted index increases, highlighting how efficiently sparse representations concentrate mass.
**Language Declaration:** All text in the image is in English.
---
### Components/Axes
**Global Elements:**
* **Legend:** Located in the bottom-right corner of the right chart ("Heads"). It applies to both charts based on color consistency.
* **Blue Line:** Non Sparse
* **Orange Line:** Sparse
**Left Chart: "Edges"**
* **Title:** "Edges" (Top-left, bold text).
* **Y-axis:** Label: "Mean Cumulative Mass". Linear scale with visible markers at 0.50, 0.75, and 1.00. The axis extends slightly below 0.50 (approximately to 0.25). Faint horizontal grid lines align with the major ticks.
* **X-axis:** Label: "Sorted Index (log scale)". Logarithmic scale with visible markers at $10^0$, $10^1$, $10^2$, and $10^3$. Faint vertical grid lines align with the major ticks.
**Right Chart: "Heads"**
* **Title:** "Heads" (Top-left, bold text).
* **Y-axis:** Label: "Mean Cumulative Mass". Linear scale with visible markers at 0.50, 0.75, and 1.00. The axis extends slightly below 0.50. Faint horizontal grid lines align with the major ticks.
* **X-axis:** Label: "Sorted Index". Linear scale with visible markers at 25, 50, 75, 100, and 125. Faint vertical grid lines align with the major ticks.
---
### Detailed Analysis
#### Left Chart: Edges (Logarithmic X-Axis)
* **Trend Verification:** Both lines slope upward from left to right, starting at a lower cumulative mass and asymptotically approaching 1.00. The Orange (Sparse) line rises significantly more steeply, and earlier, than the Blue (Non Sparse) line.
* **Orange Line (Sparse):**
* Starts at approximately y = 0.45 at x = $10^0$ (1).
* Crosses y = 0.75 at approximately x = 5.
* Reaches y = 0.90 at approximately x = 15.
* Plateaus at y = 1.00 around x = $10^2$ (100).
* **Blue Line (Non Sparse):**
* Starts below the visible y-axis labels, approximately y = 0.20 at x = $10^0$ (1).
* Crosses y = 0.50 at approximately x = 5.
* Crosses y = 0.75 at approximately x = 30.
* Reaches y = 0.90 at approximately x = 240.
* Approaches y = 1.00 near x = $10^3$ (1000) and beyond.
* **Annotation:** A horizontal dashed black line connects the Orange line to the Blue line at a y-value of approximately 0.90. Above this dashed line is the text **"16.1x"**. This indicates that to reach ~90% of the mean cumulative mass, the Non Sparse model requires an index that is 16.1 times larger than the Sparse model. (e.g., $15 \times 16.1 \approx 241$).
#### Right Chart: Heads (Linear X-Axis)
* **Trend Verification:** Similar to the left chart, both lines slope upward, starting low and plateauing at 1.00. The Orange (Sparse) line rises much faster than the Blue (Non Sparse) line.
* **Orange Line (Sparse):**
* Starts at approximately y = 0.40 near x = 0.
* Crosses y = 0.75 at approximately x = 5.
* Reaches y = 0.90 at approximately x = 10.
* Plateaus at y = 1.00 around x = 30.
* **Blue Line (Non Sparse):**
* Starts at approximately y = 0.25 near x = 0.
* Crosses y = 0.50 at approximately x = 5.
* Crosses y = 0.75 at approximately x = 15.
* Reaches y = 0.90 at approximately x = 34.
* Plateaus at y = 1.00 around x = 100.
* **Annotation:** A horizontal dashed black line connects the Orange line to the Blue line at a y-value of approximately 0.90. Above this dashed line is the text **"3.4x"**. This indicates that to reach ~90% of the mean cumulative mass, the Non Sparse model requires an index that is 3.4 times larger than the Sparse model. (e.g., $10 \times 3.4 = 34$).
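Both annotated multipliers can be reproduced mechanically: find the smallest sorted index at which each curve reaches the 90% threshold, then take the ratio. A minimal sketch with synthetic, illustrative curves (not the plotted data; `index_to_reach` and the curve shapes are assumptions):

```python
import numpy as np

def index_to_reach(curve, threshold=0.90):
    """Smallest 1-based sorted index whose cumulative mass meets threshold."""
    return int(np.argmax(curve >= threshold)) + 1

# Synthetic cumulative-mass curves standing in for the plotted ones;
# the sparse curve concentrates mass earlier (shapes are illustrative).
idx = np.arange(1, 141)
sparse = 1 - np.exp(-idx / 4.0)       # reaches 90% around index 10
non_sparse = 1 - np.exp(-idx / 14.0)  # reaches 90% around index 33

multiplier = index_to_reach(non_sparse) / index_to_reach(sparse)
```

With these toy curves the multiplier comes out near 3.3, close to the "3.4x" annotation; the same procedure applied to the Edges curves would yield the "16.1x" figure.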
---
### Key Observations
1. **Concentration of Mass:** In both "Edges" and "Heads", the "Sparse" (orange) configuration concentrates its mass in a much smaller number of indices compared to the "Non Sparse" (blue) configuration.
2. **Magnitude of Sparsity:** The effect of sparsity is vastly more pronounced in the "Edges" than in the "Heads". The multiplier to reach ~90% mass is 16.1x for Edges, compared to only 3.4x for Heads.
3. **Scale Differences:** The x-axis for Edges is logarithmic, spanning thousands of indices, whereas the x-axis for Heads is linear, spanning only about 140 indices. This suggests there are significantly more "Edges" in the system being measured than there are "Heads".
---
### Interpretation
These charts likely represent an analysis of a neural network architecture, specifically a Transformer model (implied by the terms "Heads" for attention heads and "Edges" for network connections/graph edges).
The "Sorted Index" represents individual components (heads or edges) sorted by their importance or "mass" (likely activation magnitude, attention weight, or parameter value) in descending order. The "Mean Cumulative Mass" shows what percentage of the total network's activity/weight is captured as you add more of these sorted components.
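The quantity described above can be sketched in a few lines; the scores, shapes, and function name below are illustrative assumptions, not the original analysis code:

```python
import numpy as np

# Hypothetical per-sample importance scores (e.g., attention weights or
# edge activations); the shape and distribution are illustrative only.
rng = np.random.default_rng(0)
scores = rng.exponential(size=(8, 128))  # 8 samples x 128 components

def mean_cumulative_mass(scores):
    """Sort each sample's scores in descending order, normalize each row
    to sum to 1, take the running sum, then average across samples."""
    sorted_desc = -np.sort(-scores, axis=1)
    normalized = sorted_desc / sorted_desc.sum(axis=1, keepdims=True)
    return normalized.cumsum(axis=1).mean(axis=0)

curve = mean_cumulative_mass(scores)  # monotone non-decreasing, ends at 1.0
```

Plotting `curve` against its index (linearly or on a log scale) reproduces the kind of curve shown in both panels.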
**Reading between the lines:**
* **The Power of Sparsity:** The data indicates that applying sparse techniques to this model is highly effective. Under the "Sparse" method, the model captures 90% of the total mass using a small fraction of its components.
* **Pruning Potential:** For "Edges", the vast majority of connections (everything past index ~100) could theoretically be pruned from the sparse model while retaining essentially all of the cumulative mass. The non-sparse model would need to keep over 1,000 edges to achieve the same result. The "16.1x" annotation quantifies this efficiency: the sparse edge representation concentrates importance roughly 16 times more compactly.
* **Architectural Insights:** The fact that "Edges" require a logarithmic scale up to $10^3$ while "Heads" only go up to ~140 reflects the structure of the model: there are relatively few attention heads but a massive number of edge connections between nodes/tokens. Sparsifying the edges yields a much larger relative reduction in required components (16.1x) than sparsifying the heads (3.4x), making edge sparsification an especially attractive target for model optimization and compression.
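The pruning idea above can be sketched as a mass-based mask: keep the fewest top-scoring components whose combined score reaches a target fraction of the total. `prune_by_mass` and the toy scores are hypothetical, not from the source analysis:

```python
import numpy as np

def prune_by_mass(scores, keep_mass=0.90):
    """Boolean mask keeping the fewest top-scoring components whose
    combined score accounts for at least `keep_mass` of the total."""
    order = np.argsort(-scores)                   # indices, descending score
    cum = np.cumsum(scores[order]) / scores.sum()
    k = int(np.argmax(cum >= keep_mass)) + 1      # number of components kept
    mask = np.zeros(scores.shape, dtype=bool)
    mask[order[:k]] = True
    return mask

# Toy scores where a few components dominate, as in the sparse curves.
scores = np.array([5.0, 3.0, 1.0, 0.5, 0.3, 0.2])
mask = prune_by_mass(scores)  # keeps the top 3 components (90% of mass)
```

A sharply rising cumulative-mass curve, as in the sparse configurations, means `k` stays small relative to the total number of components.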