## Charts: Scaling Performance Analysis & Scaling Efficiency Analysis
### Overview
The image presents four charts comparing the performance of different models on various tasks as a function of model size and complexity. The first three charts focus on task accuracy versus model size for Complex Reasoning, Math Reasoning, and Question-Answering tasks. The fourth chart shows task runtime versus complexity for Neuro-symbolic and KL-based reasoning models.
### Components/Axes
**Chart 1: Complex Reasoning Tasks**
* **X-axis:** Model Size (billions of parameters) - Values: 7B, 8B, 13B, 70B, GPT
* **Y-axis:** Task Accuracy (%) - Scale: 20 to 100, increments of 10.
* **Legend:**
* TextEdit (C) - Orange Squares
* CLUTRR (C) - Red Circles
* ProofWriter (C) - Blue Triangles
* TextEdit (M) - Light Orange Squares
* CLUTRR (M) - Light Red Circles
* ProofWriter (M) - Light Blue Triangles
**Chart 2: Math Reasoning Tasks**
* **X-axis:** Model Size (billions of parameters) - Values: 7B, 8B, 13B, 70B, GPT
* **Y-axis:** Task Accuracy (%) - Scale: 20 to 100, increments of 10.
* **Legend:**
* GSM8K (C) - Orange Squares
* SVAMP (C) - Red Circles
* TabMWP (C) - Blue Triangles
* In-Domain GSM8K (C) - Light Orange Squares
* In-Domain SVAMP (C) - Light Red Circles
* In-Domain MATH (C) - Light Blue Triangles
**Chart 3: Question-Answering Tasks**
* **X-axis:** Model Size (billions of parameters) - Values: 7B, 8B, 13B, 70B, GPT
* **Y-axis:** Task Accuracy (%) - Scale: 30 to 100, increments of 10.
* **Legend:**
* AmbiguityQA (C) - Orange Squares
* TriviaQA (C) - Red Circles
* HotpotQA (C) - Blue Triangles
* AmbiguityQA (M) - Light Orange Squares
* TriviaQA (M) - Light Red Circles
* HotpotQA (M) - Light Blue Triangles
**Chart 4: Scaling Efficiency Analysis**
* **X-axis:** Complexity (apparently International Mathematical Olympiad reasoning problems, labeled by year/problem) - Values as printed: P1, 08, P6, 04, P12, P5, 20, P9, P6
* **Y-axis:** Task runtime (min) - Scale: 0 to 30, increments of 5.
* **Legend:**
* Neuro-symbolic models (AlphaGeometry) - Gray Circles
* KL-based reasoning models - Blue Triangles
### Detailed Analysis or Content Details
**Chart 1: Complex Reasoning Tasks**
* TextEdit (C): rises from approximately 40% at 7B to around 60% at 8B and 70% at 13B, reaching approximately 90% at 70B.
* CLUTRR (C): rises from approximately 30% at 7B to around 50% at 8B and 60% at 13B, reaching approximately 80% at 70B.
* ProofWriter (C): rises from approximately 20% at 7B to around 40% at 8B and 50% at 13B, reaching approximately 70% at 70B.
* TextEdit (M): rises from approximately 30% at 7B to around 50% at 8B and 60% at 13B, reaching approximately 80% at 70B.
* CLUTRR (M): rises from approximately 20% at 7B to around 40% at 8B and 50% at 13B, reaching approximately 70% at 70B.
* ProofWriter (M): rises from approximately 10% at 7B to around 30% at 8B and 40% at 13B, reaching approximately 60% at 70B.
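The diminishing-returns pattern in these curves becomes concrete if each accuracy gain is normalized by the number of parameter doublings it spans (a fair way to compare the small 7B→8B hop against the large 13B→70B hop). A minimal sketch using the approximate TextEdit (C) readings above; all values are visual estimates, not exact benchmark numbers:

```python
import math

# Approximate Chart 1 readings for TextEdit (C) - visual estimates only.
sizes_b = [7, 8, 13, 70]      # model size, billions of parameters
accuracy = [40, 60, 70, 90]   # task accuracy, percent

# Divide each step's accuracy gain by the doublings it spans; the
# resulting gain-per-doubling shrinks monotonically with model size.
for (s0, s1), (a0, a1) in zip(zip(sizes_b, sizes_b[1:]),
                              zip(accuracy, accuracy[1:])):
    doublings = math.log2(s1 / s0)
    print(f"{s0}B -> {s1}B: {(a1 - a0) / doublings:.1f} pts/doubling")
```

Even though the raw 13B→70B jump (~20 points) matches the 7B→8B jump, it is spread over far more doublings, so the per-doubling return still falls sharply.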
**Chart 2: Math Reasoning Tasks**
* GSM8K (C): rises from approximately 30% at 7B to around 50% at 8B and 60% at 13B, reaching approximately 80% at 70B.
* SVAMP (C): rises from approximately 20% at 7B to around 40% at 8B and 50% at 13B, reaching approximately 70% at 70B.
* TabMWP (C): rises from approximately 20% at 7B to around 40% at 8B and 50% at 13B, reaching approximately 70% at 70B.
* In-Domain GSM8K (C): rises from approximately 60% at 7B to around 70% at 8B and 80% at 13B, reaching approximately 90% at 70B.
* In-Domain SVAMP (C): rises from approximately 50% at 7B to around 60% at 8B and 70% at 13B, reaching approximately 80% at 70B.
* In-Domain MATH (C): rises from approximately 40% at 7B to around 60% at 8B and 70% at 13B, reaching approximately 80% at 70B.
**Chart 3: Question-Answering Tasks**
* AmbiguityQA (C): rises from approximately 40% at 7B to around 60% at 8B and 70% at 13B, reaching approximately 90% at 70B.
* TriviaQA (C): rises from approximately 30% at 7B to around 50% at 8B and 60% at 13B, reaching approximately 80% at 70B.
* HotpotQA (C): rises from approximately 20% at 7B to around 40% at 8B and 50% at 13B, reaching approximately 70% at 70B.
* AmbiguityQA (M): rises from approximately 30% at 7B to around 50% at 8B and 60% at 13B, reaching approximately 80% at 70B.
* TriviaQA (M): rises from approximately 20% at 7B to around 40% at 8B and 50% at 13B, reaching approximately 70% at 70B.
* HotpotQA (M): rises from approximately 10% at 7B to around 30% at 8B and 40% at 13B, reaching approximately 60% at 70B.
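One pattern worth making explicit: in Chart 3, each "(M)" series sits a near-constant distance below its "(C)" counterpart at every model size. A quick check using the approximate readings above (visual estimates only):

```python
# Approximate Chart 3 readings (%) at 7B/8B/13B/70B - visual estimates.
chart3 = {
    "AmbiguityQA": {"C": [40, 60, 70, 90], "M": [30, 50, 60, 80]},
    "TriviaQA":    {"C": [30, 50, 60, 80], "M": [20, 40, 50, 70]},
    "HotpotQA":    {"C": [20, 40, 50, 70], "M": [10, 30, 40, 60]},
}

# Per-size gap between the (C) and (M) variant of each task:
# a constant ~10-point offset, independent of model size.
for task, series in chart3.items():
    gaps = [c - m for c, m in zip(series["C"], series["M"])]
    print(task, gaps)
```

A size-independent offset suggests the (C)/(M) distinction shifts difficulty uniformly rather than interacting with scale.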
**Chart 4: Scaling Efficiency Analysis**
* Neuro-symbolic models: start at approximately 5 minutes at the first complexity level, increase roughly linearly to around 25 minutes at P9, and reach approximately 30 minutes at the final tick.
* KL-based models: start at approximately 10 minutes at the first complexity level, rise to around 20 minutes at P9, and plateau near 20 minutes through the final tick.
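The contrast between the two runtime curves can be summarized from the three readings given for each series. A small sketch (values are visual estimates, sampled only at the first tick, P9, and the final tick):

```python
# Approximate Chart 4 readings (minutes) at three sampled positions:
# first tick, P9, final tick. Visual estimates only.
neuro_symbolic = [5, 25, 30]
kl_based = [10, 20, 20]

def step_increases(runtimes):
    """Runtime added between consecutive sampled complexity levels."""
    return [b - a for a, b in zip(runtimes, runtimes[1:])]

print(step_increases(neuro_symbolic))  # [20, 5]  -> still climbing
print(step_increases(kl_based))        # [10, 0]  -> flat at the end
```

The zero final step for the KL-based series is what the plateau observation below rests on; the neuro-symbolic series keeps accruing runtime at every sampled level.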
### Key Observations
* Across all three accuracy charts, increasing model size generally improves task accuracy, though the gain per parameter doubling shrinks at larger sizes.
* In Chart 2, the "In-Domain" series consistently outperform their standard counterparts, underscoring a sizable in-domain advantage.
* The Neuro-symbolic models exhibit a steeper runtime increase with complexity compared to the KL-based models.
* The runtime of KL-based models appears to plateau after a certain level of complexity.
### Interpretation
The data suggests that scaling model size is an effective strategy for improving performance on complex reasoning, math reasoning, and question-answering tasks. However, the gains shrink as model size grows, indicating a limit to the benefits of simply adding parameters. The performance gap between in-domain tasks and the others highlights the importance of training data distribution.

The runtime analysis reveals a trade-off between problem complexity and computational efficiency: the neuro-symbolic models (AlphaGeometry) grow more expensive with complexity than the KL-based models, whose runtime plateaus, suggesting the latter may scale better to highly complex problems. Together, the charts illustrate ongoing work on balancing accuracy and efficiency in large language models.

The "(C)" and "(M)" suffixes likely denote different training methodologies or evaluation splits; since the "(M)" series consistently trail their "(C)" counterparts, "(M)" may mark a harder or less favorable setting. The x-axis labels in the final chart are garbled ("P1", "08", etc.) but appear to index International Mathematical Olympiad problems by year and problem number.