## Heatmap: p-value vs. Tokenizer Rank
### Overview
This image is a heatmap of pairwise p-values between tokenizers ordered by rank. Each cell shows the p-value for comparing one tokenizer with another, and the axis label frames these as comparisons between a tokenizer and all lower-ranked tokenizers. Color encodes the p-value: warmer colors (red) indicate higher p-values and cooler colors (blue) indicate lower ones. The grid is 18x18, with one row and one column per tokenizer (ranked 1 to 18), and black outlines separate the cells.
### Components/Axes
* **X-axis:** "Tokenizer Rank" - Ranges from 1 to 18, representing the rank of the tokenizer.
* **Y-axis:** "p-value vs. Lower Ranked Tokenizers" - Ranges from 1 to 18, representing the rank of the tokenizers being compared against.
* **Color Scale (Legend):** Located on the right side of the image. It maps color to p-value, with the following tick labels:
    * 0.00 (Light Blue)
    * 0.01
    * 0.02
    * 0.03
    * 0.04
    * 0.05 (Medium Blue)
    * 0.10
    * 0.20
    * 0.30
    * 0.40
    * 0.50
    * 0.60
    * 0.70
    * 0.80
    * 0.90
    * 1.00 (Dark Red)
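
The legend ticks are unevenly spaced: steps of 0.01 up to 0.05, then a jump to 0.10, then steps of 0.10. A minimal sketch of how a p-value could be placed between two neighboring ticks is shown below. Treating the ticks as bin edges is an assumption, and `legend_bin` is a hypothetical helper, not something taken from the figure's source.

```python
from bisect import bisect_right

# Tick values copied from the legend. Using them as bin edges is an assumption.
LEGEND_TICKS = [0.00, 0.01, 0.02, 0.03, 0.04, 0.05,
                0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1.00]


def legend_bin(p):
    """Return the (low, high) pair of legend ticks that brackets p-value p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    # bisect_right gives the index of the first tick strictly greater than p;
    # clamp it so that p == 1.00 falls into the last bin.
    i = min(bisect_right(LEGEND_TICKS, p), len(LEGEND_TICKS) - 1)
    return LEGEND_TICKS[i - 1], LEGEND_TICKS[i]


print(legend_bin(0.07))  # (0.05, 0.1)
print(legend_bin(0.95))  # (0.9, 1.0)
```

The finer spacing below 0.05 gives the scale more resolution around the conventional significance threshold.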
### Detailed Analysis
The heatmap shows a clear diagonal pattern. The cells along the main diagonal (where the tokenizer rank on the x-axis equals the tokenizer rank on the y-axis) are the warmest in color (dark red), indicating the highest p-values. Moving away from the diagonal, the colors shift through orange to progressively darker blue, indicating lower p-values.
Approximate p-values for ranks 1, 2, and 18, estimated from cell color and position:
* **Rank 1:**
* Rank 1 vs. Rank 1: ~0.95 (Dark Red)
* Rank 1 vs. Rank 2: ~0.85 (Orange-Red)
* Rank 1 vs. Rank 3: ~0.75 (Orange)
* Rank 1 vs. Rank 4: ~0.60 (Orange)
* Rank 1 vs. Rank 5: ~0.50 (Orange)
* Rank 1 vs. Rank 6: ~0.40 (Light Orange)
* Rank 1 vs. Rank 7: ~0.30 (Light Blue)
* Rank 1 vs. Rank 8: ~0.20 (Light Blue)
* Rank 1 vs. Rank 9: ~0.10 (Light Blue)
* Rank 1 vs. Rank 10: ~0.05 (Medium Blue)
* Rank 1 vs. Rank 11: ~0.03 (Dark Blue)
* Rank 1 vs. Rank 12: ~0.02 (Dark Blue)
* Rank 1 vs. Rank 13: ~0.01 (Dark Blue)
* Rank 1 vs. Rank 14: ~0.01 (Dark Blue)
* Rank 1 vs. Rank 15: ~0.01 (Dark Blue)
* Rank 1 vs. Rank 16: ~0.01 (Dark Blue)
* Rank 1 vs. Rank 17: ~0.00 (Dark Blue)
* Rank 1 vs. Rank 18: ~0.00 (Dark Blue)
* **Rank 2:**
* Rank 2 vs. Rank 1: ~0.85 (Orange-Red)
* Rank 2 vs. Rank 2: ~0.90 (Dark Red)
* Rank 2 vs. Rank 3: ~0.70 (Orange)
* Rank 2 vs. Rank 4: ~0.55 (Orange)
* Rank 2 vs. Rank 5: ~0.45 (Light Orange)
* Rank 2 vs. Rank 6: ~0.35 (Light Blue)
* Rank 2 vs. Rank 7: ~0.25 (Light Blue)
* Rank 2 vs. Rank 8: ~0.15 (Light Blue)
* Rank 2 vs. Rank 9: ~0.05 (Medium Blue)
* Rank 2 vs. Rank 10: ~0.03 (Dark Blue)
* Rank 2 vs. Rank 11: ~0.02 (Dark Blue)
* Rank 2 vs. Rank 12: ~0.01 (Dark Blue)
* Rank 2 vs. Rank 13: ~0.01 (Dark Blue)
* Rank 2 vs. Rank 14: ~0.01 (Dark Blue)
* Rank 2 vs. Rank 15: ~0.01 (Dark Blue)
* Rank 2 vs. Rank 16: ~0.01 (Dark Blue)
* Rank 2 vs. Rank 17: ~0.00 (Dark Blue)
* Rank 2 vs. Rank 18: ~0.00 (Dark Blue)
* **Rank 18:**
* Rank 18 vs. Rank 1: ~0.00 (Dark Blue)
* Rank 18 vs. Rank 2: ~0.00 (Dark Blue)
* Rank 18 vs. Rank 3: ~0.00 (Dark Blue)
* Rank 18 vs. Rank 4: ~0.00 (Dark Blue)
* Rank 18 vs. Rank 5: ~0.00 (Dark Blue)
* Rank 18 vs. Rank 6: ~0.00 (Dark Blue)
* Rank 18 vs. Rank 7: ~0.00 (Dark Blue)
* Rank 18 vs. Rank 8: ~0.00 (Dark Blue)
* Rank 18 vs. Rank 9: ~0.00 (Dark Blue)
* Rank 18 vs. Rank 10: ~0.00 (Dark Blue)
* Rank 18 vs. Rank 11: ~0.00 (Dark Blue)
* Rank 18 vs. Rank 12: ~0.00 (Dark Blue)
* Rank 18 vs. Rank 13: ~0.00 (Dark Blue)
* Rank 18 vs. Rank 14: ~0.00 (Dark Blue)
* Rank 18 vs. Rank 15: ~0.00 (Dark Blue)
* Rank 18 vs. Rank 16: ~0.00 (Dark Blue)
* Rank 18 vs. Rank 17: ~0.01 (Dark Blue)
* Rank 18 vs. Rank 18: ~0.95 (Dark Red)
### Key Observations
* The p-values generally decrease as the rank difference between the two tokenizers increases.
* The highest p-values are observed when comparing a tokenizer to itself (diagonal).
* There is a noticeable gradient from red (high p-value) to blue (low p-value) as you move away from the diagonal.
* Off the diagonal, comparisons involving the lowest-ranked tokenizers show very low p-values: the rank 18 row is ~0.00 against ranks 1 to 16 and ~0.01 against rank 17, while its self-comparison is ~0.95.
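
These observations can be checked against the approximate values listed for ranks 1 and 2 in the Detailed Analysis. The sketch below assumes those color-based estimates are accurate enough for a rough check. The helper names and the 0.05 threshold are illustrative choices, not part of the figure.

```python
# Approximate p-values transcribed from the Rank 1 and Rank 2 breakdowns.
# They were read from cell colors, so they are estimates, not exact values.
ROWS = {
    1: [0.95, 0.85, 0.75, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10,
        0.05, 0.03, 0.02, 0.01, 0.01, 0.01, 0.01, 0.00, 0.00],
    2: [0.85, 0.90, 0.70, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05,
        0.03, 0.02, 0.01, 0.01, 0.01, 0.01, 0.01, 0.00, 0.00],
}


def diagonal_is_row_max(rank, row):
    """True if the self-comparison cell holds the row's largest p-value."""
    return row[rank - 1] == max(row)


def non_increasing_from_diagonal(rank, row):
    """True if p-values never rise while moving from the diagonal toward rank 18."""
    tail = row[rank - 1:]
    return all(a >= b for a, b in zip(tail, tail[1:]))


def first_rank_below(row, alpha=0.05):
    """1-based rank of the first column with a p-value strictly below alpha, or None."""
    for col, p in enumerate(row, start=1):
        if p < alpha:
            return col
    return None


for rank, row in ROWS.items():
    print(rank,
          diagonal_is_row_max(rank, row),
          non_increasing_from_diagonal(rank, row),
          first_rank_below(row))
# 1 True True 11
# 2 True True 10
```

With these estimates, rank 1 first drops below 0.05 at rank 11 and rank 2 at rank 10, so the closest ranks are not separated at that threshold.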
### Interpretation
This heatmap suggests that tokenizers far apart in rank differ significantly, while tokenizers close in rank do not. In the values listed for rank 1, the p-value stays at or above 0.05 through rank 10 and falls below it only from rank 11 onward, so the ranking separates distant tokenizers more reliably than adjacent ones. The high p-values along the diagonal are expected, since each tokenizer is being compared with itself. The near-zero p-values in the rank 18 row suggest that the lowest-ranked tokenizer differs significantly from every other tokenizer, which could indicate that it is substantially less effective or behaves differently on some underlying performance metric. Overall, the heatmap gives a quick visual summary of which differences between tokenizers are statistically significant.