# Technical Document Extraction: Entity Frequency Distribution Chart
## Chart Overview
The image is a **line chart** comparing the entity frequency distributions of four models against a ground truth reference. The x-axis represents **Entity ID's Sorted by Frequency** (ranging from 0 to 800, ordered from most to least frequent), and the y-axis represents **Entity Frequency** on a **logarithmic scale** (10⁻⁴ to 10⁻¹).
---
## Axis Labels and Scales
- **X-Axis**:
- Title: *"Entity ID's Sorted by Frequency"*
- Range: 0 to 800 (linear scale).
- **Y-Axis**:
- Title: *"Entity Frequency"*
- Scale: Logarithmic (10⁻⁴ to 10⁻¹).
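As a rough illustration of how the two axes relate, the sketch below builds a rank-frequency curve from a list of entity mentions. It assumes that "Entity Frequency" means each entity's share of all mentions, which the chart does not state explicitly; the function name `rank_frequency` and the toy mentions are hypothetical.

```python
from collections import Counter


def rank_frequency(mentions):
    """Return relative entity frequencies sorted from most to least common.

    The list index plays the role of the x-axis (entity ID sorted by
    frequency); the value is the y-axis entity frequency.
    Assumption: frequency = entity count / total mention count.
    """
    counts = Counter(mentions)
    total = sum(counts.values())
    if total == 0:
        return []
    # most_common() returns (entity, count) pairs in descending count order.
    return [count / total for _, count in counts.most_common()]


# Hypothetical toy corpus of extracted entity mentions.
mentions = ["aspirin", "fever", "aspirin", "cough", "aspirin", "fever"]
curve = rank_frequency(mentions)
print(curve)
```

Plotting `curve` against its indices with a logarithmic y-axis would give one line of the kind shown in the chart.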
---
## Legend and Model Comparisons
The chart includes five lines, each representing a distinct model or reference:
1. **ZeroGen** (Blue line):
- Starts with the highest initial frequency (~10⁻¹) but drops sharply.
- Deviates significantly from the Ground Truth beyond Entity ID ~100.
2. **DemoGen** (Orange line):
- Similar initial frequency to ZeroGen but declines more gradually.
- Diverges from the Ground Truth beyond Entity ID ~200.
3. **ClinGen w/KG** (Green line):
- Closely follows the Ground Truth curve.
- Slightly higher frequency than Ground Truth for Entity IDs < 300.
4. **ClinGen w/LLM** (Red line):
- Nearly overlaps with ClinGen w/KG.
- Slightly lower frequency than Ground Truth for Entity IDs > 500.
5. **Ground Truth** (Purple line):
- Smooth, gradual decline across all Entity IDs.
- Serves as the reference baseline.
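The divergence points quoted above (~100 and ~200) are visual estimates. One possible way to make such a reading reproducible is sketched below: it reports the first rank at which a model curve sits more than a chosen distance from the reference on a log scale. The function name, the threshold of 0.5 decades, and all numbers are hypothetical and are not read from the chart.

```python
import math


def first_divergence_rank(model, reference, max_log10_gap=0.5):
    """Return the first rank where the curves differ by more than the gap.

    The gap is |log10(model) - log10(reference)|, i.e. the vertical
    distance on a log-scaled y-axis. Returns None if the curves never
    separate by that much over the ranks they share.
    """
    for rank, (m, r) in enumerate(zip(model, reference)):
        if m <= 0 or r <= 0:
            # A zero frequency cannot be placed on a log axis;
            # this sketch treats it as a divergence.
            return rank
        if abs(math.log10(m) - math.log10(r)) > max_log10_gap:
            return rank
    return None


# Hypothetical frequencies, NOT values read from the chart.
reference = [0.1, 0.05, 0.02, 0.01, 0.005]
steep = [0.2, 0.05, 0.01, 0.001, 0.0001]
close = [0.11, 0.05, 0.02, 0.009, 0.005]

print(first_divergence_rank(steep, reference))  # 3
print(first_divergence_rank(close, reference))  # None
```

The threshold is a design choice: a smaller value flags divergence earlier, so any reported rank is only meaningful together with the threshold used.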
---
## Key Trends and Observations
1. **Initial Frequency**:
- All models exhibit high frequencies for low Entity IDs (0–100), with ZeroGen and DemoGen showing the steepest declines.
2. **Convergence with Ground Truth**:
- ClinGen w/KG and ClinGen w/LLM align most closely with the Ground Truth, particularly for Entity IDs > 300.
3. **Divergence**:
- ZeroGen and DemoGen fall below the Ground Truth for lower-frequency entities (Entity ID > 200), declining more sharply than the reference.
4. **Logarithmic Scale Impact**:
- The logarithmic y-axis stretches out small values, so differences between curves at low magnitudes (10⁻³ to 10⁻⁴) remain visible instead of being flattened as they would be on a linear axis.
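The effect of the logarithmic scale can be checked with a small calculation. The sketch below compares the gap between two frequencies on a linear axis and on a log axis, using hypothetical values chosen only to lie within the chart's y-range.

```python
import math


def linear_gap(a, b):
    """Vertical distance between two values on a linear axis."""
    return abs(a - b)


def log_gap(a, b):
    """Vertical distance between two values on a log10 axis, in decades."""
    return abs(math.log10(a) - math.log10(b))


# Hypothetical pair of frequencies in the tail of the distribution.
low_a, low_b = 1e-3, 1e-4
# Hypothetical pair near the head of the distribution.
high_a, high_b = 1e-1, 9e-2

print(linear_gap(low_a, low_b), log_gap(low_a, low_b))
print(linear_gap(high_a, high_b), log_gap(high_a, high_b))
```

On a linear axis the head pair is further apart than the tail pair, while on a log axis the tail pair is a full decade apart and the head pair is nearly indistinguishable. This is why tail disagreements between models stand out in the chart.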
---
## Data Point Highlights
- **ZeroGen**:
- Peaks at ~10⁻¹ for Entity ID 0, dropping to ~10⁻³ by ID 200.
- **DemoGen**:
- Peaks at ~10⁻¹ for Entity ID 0, declining to ~10⁻² by ID 200.
- **ClinGen w/KG** and **ClinGen w/LLM**:
- Maintain frequencies between ~10⁻² and 10⁻³ for Entity IDs 0–800.
- **Ground Truth**:
- Declines from ~10⁻¹ (ID 0) to ~10⁻³ (ID 800) with minimal fluctuations.
---
## Conclusion
The chart shows that **ClinGen w/KG** and **ClinGen w/LLM** best approximate the Ground Truth distribution, while **ZeroGen** and **DemoGen** deviate significantly, particularly for lower-frequency entities (higher Entity IDs). The logarithmic y-axis makes these disparities in the tail of the distribution visible.