## Line Charts: Performance of Different Methods on Logic Puzzles
### Overview
The image displays three side-by-side line charts comparing the accuracy of four methods (Few-Shot, Logic-LM, PAL, SymPro-LM) on three types of logic puzzles (Sudoku, Sujiko, Magic-Square) as board size increases. SymPro-LM maintains perfect accuracy across all tested sizes, while the other methods generally degrade as puzzle complexity grows.
### Components/Axes
* **Chart Titles (Top Center):** "Sudoku", "Sujiko", "Magic-Square".
* **Y-Axis (Left Side, Shared):** Labeled "Accuracy (%)". Scale runs from 0 to 100 in increments of 20.
* **X-Axis (Bottom of Each Chart):** Labeled "Board Size".
  * **Sudoku Chart:** Categories are "4x4", "9x9", "16x16", "25x25".
  * **Sujiko Chart:** Categories are "3x3", "4x4", "5x5".
  * **Magic-Square Chart:** Categories are "3x3", "4x4", "5x5".
* **Legend (Top-Right of Magic-Square Chart):**
  * **Few-Shot:** Orange line with circle markers.
  * **Logic-LM:** Purple dashed line with square markers.
  * **PAL:** Blue dash-dot line with triangle markers.
  * **SymPro-LM:** Green dotted line with diamond markers.
### Detailed Analysis
**1. Sudoku Chart (Left Panel)**
* **SymPro-LM (Green Diamond):** Maintains a constant 100% accuracy across all board sizes (4x4, 9x9, 16x16, 25x25). The line is perfectly horizontal at the top of the chart.
* **PAL (Blue Triangle):** Starts at 100% for 4x4, 9x9, and 16x16 boards. Shows a significant drop to approximately 60% accuracy for the 25x25 board.
* **Logic-LM (Purple Square):** Starts at approximately 60% for the 4x4 board. Accuracy drops sharply to near 0% for the 9x9 board and remains at approximately 0% for 16x16 and 25x25.
* **Few-Shot (Orange Circle):** Follows a nearly identical path to Logic-LM. Starts at approximately 60% for 4x4, drops to 0% for 9x9, and remains at 0% for larger sizes.
**2. Sujiko Chart (Middle Panel)**
* **SymPro-LM (Green Diamond):** Maintains 100% accuracy across all board sizes (3x3, 4x4, 5x5).
* **PAL (Blue Triangle):** Starts at 100% for 3x3 and 4x4 boards. Accuracy decreases to approximately 80% for the 5x5 board.
* **Logic-LM (Purple Square):** Starts at approximately 65% for 3x3. Accuracy declines to approximately 55% for 4x4 and further to approximately 15% for 5x5.
* **Few-Shot (Orange Circle):** Starts at approximately 45% for 3x3. Accuracy declines to approximately 20% for 4x4 and drops to 0% for 5x5.
**3. Magic-Square Chart (Right Panel)**
* **SymPro-LM (Green Diamond):** Maintains 100% accuracy across all board sizes (3x3, 4x4, 5x5).
* **PAL (Blue Triangle):** Starts at 100% for the 3x3 board. Accuracy plummets to 0% for both the 4x4 and 5x5 boards.
* **Logic-LM (Purple Square):** Starts at approximately 65% for 3x3. Accuracy declines to approximately 30% for 4x4 and further to approximately 5% for 5x5.
* **Few-Shot (Orange Circle):** Starts at approximately 20% for 3x3. Accuracy drops to 0% for both the 4x4 and 5x5 boards.
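The per-chart values above can be collected into a small data structure to make cross-method comparisons concrete. The numbers below are approximate readings transcribed from the description, not exact source data:

```python
# Approximate accuracies (%) per method, as read from the three charts.
# Lists follow each chart's board-size order given above.
accuracy = {
    "Sudoku": {  # 4x4, 9x9, 16x16, 25x25
        "Few-Shot":  [60, 0, 0, 0],
        "Logic-LM":  [60, 0, 0, 0],
        "PAL":       [100, 100, 100, 60],
        "SymPro-LM": [100, 100, 100, 100],
    },
    "Sujiko": {  # 3x3, 4x4, 5x5
        "Few-Shot":  [45, 20, 0],
        "Logic-LM":  [65, 55, 15],
        "PAL":       [100, 100, 80],
        "SymPro-LM": [100, 100, 100],
    },
    "Magic-Square": {  # 3x3, 4x4, 5x5
        "Few-Shot":  [20, 0, 0],
        "Logic-LM":  [65, 30, 5],
        "PAL":       [100, 0, 0],
        "SymPro-LM": [100, 100, 100],
    },
}

def worst_drop(method):
    """Largest accuracy loss between adjacent board sizes, over all charts."""
    return max(
        prev - cur
        for chart in accuracy.values()
        for prev, cur in zip(chart[method], chart[method][1:])
    )

for method in ("Few-Shot", "Logic-LM", "PAL", "SymPro-LM"):
    print(f"{method}: worst single-step drop = {worst_drop(method)} points")
```

With these estimates, `worst_drop` returns 0 for SymPro-LM and 100 for PAL (the Magic-Square cliff), quantifying the key observations that follow.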
### Key Observations
1. **Dominant Performance:** The SymPro-LM method achieves perfect (100%) accuracy on every puzzle type and board size tested, showing no degradation with increased complexity.
2. **Critical Failure Point for PAL:** The PAL method performs perfectly on smaller boards but degrades at scale. The drop is moderate for Sudoku (to roughly 60% at 25x25) and Sujiko (to roughly 80% at 5x5), but catastrophic for Magic-Square, where accuracy collapses from 100% at 3x3 to 0% at 4x4.
3. **Rapid Degradation of Baselines:** Both the Few-Shot and Logic-LM methods show poor scalability. Their accuracy is moderate on the smallest puzzles but drops to near zero as the board size increases by just one step in most cases.
4. **Puzzle Difficulty Hierarchy:** For the non-perfect methods, Magic-Square appears to be the most challenging puzzle, producing the steepest and earliest accuracy drops (e.g., PAL collapses to 0% on the 4x4 Magic-Square but still retains roughly 80% on the 5x5 Sujiko).
### Interpretation
The data strongly suggests that the **SymPro-LM** method possesses a fundamental architectural or algorithmic advantage for solving constraint-based logic puzzles, as it is completely unaffected by the scaling of problem size within the tested range. This implies it may be using a form of symbolic reasoning or program synthesis that generalizes perfectly.
In contrast, the other methods (**PAL, Logic-LM, Few-Shot**) likely rely on pattern recognition or in-context learning that breaks down when the puzzle's combinatorial complexity exceeds a certain threshold. The sharp "cliff-edge" failure of PAL (e.g., from 100% to 0% in Magic-Square) is particularly notable, suggesting it hits a hard limit rather than a gradual decline.
The charts collectively demonstrate a clear performance dichotomy: one method (SymPro-LM) is robust and scalable, while the others are fragile and limited to simpler instances. This has significant implications for applying large language models to formal reasoning tasks, highlighting the need for specialized techniques like symbolic programming integration to achieve reliable performance on complex problems.
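To illustrate the kind of symbolic, constraint-based solving that the interpretation attributes to SymPro-LM (the charts do not show SymPro-LM's actual pipeline, so this is only a sketch of the general idea), here is a minimal exhaustive constraint checker for the 3x3 Magic-Square, the smallest case in the charts. Unlike pattern matching, it verifies every hard constraint explicitly, which is why such approaches do not silently degrade:

```python
from itertools import permutations

def solve_magic_square_3x3():
    """Find a 3x3 magic square over 1..9 by exhaustive constraint checking.

    Every row, column, and diagonal must sum to the magic constant 15.
    Returns the square as three row tuples, or None if no arrangement works.
    """
    target = 15
    for p in permutations(range(1, 10)):
        rows = [p[0:3], p[3:6], p[6:9]]
        cols = list(zip(*rows))
        diags = [(p[0], p[4], p[8]), (p[2], p[4], p[6])]
        if all(sum(line) == target for line in rows + cols + diags):
            return rows
    return None

square = solve_magic_square_3x3()
for row in square:
    print(row)
```

Brute force is only viable at 3x3; a real symbolic system would hand these same constraints to a constraint or SAT/SMT solver, which is what lets it scale to the larger boards where the LLM-only baselines fail.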