## Box Plot: Normalized MSE on ODE/PDE Systems
### Overview
The image displays two side-by-side box plots comparing the performance of three symbolic regression methods—PySR, LLM-SR, and KeplerAgent—on ordinary differential equation (ODE) and partial differential equation (PDE) systems. Performance is measured by the base-10 logarithm of the Normalized Mean Squared Error (MSE). The left panel shows results on "Clean data," and the right panel shows results on "Noisy data."
### Components/Axes
* **Title:** "Normalized MSE on ODE/PDE Systems" (centered at the top).
* **Y-axis (Shared Concept):** Label is "log₁₀(Normalized MSE)". Values are log₁₀-transformed and plotted on a linear axis, so each unit corresponds to one order of magnitude of error.
* **Left Panel (Clean data) Y-axis:** Ranges from approximately -14 to +1. Major tick marks are at -14, -12, -4, -3, -2, -1, 0, 1.
* **Right Panel (Noisy data) Y-axis:** Ranges from approximately -2.0 to +1.5. Major tick marks are at -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5.
* **X-axis (Both Panels):** Lists the three methods: "PySR", "LLM-SR", "KeplerAgent". The panel labels "Clean data" and "Noisy data" are centered below their respective x-axes.
* **Legend:** Located in the top-right corner of the "Noisy data" panel.
    * Blue box: "PySR"
    * Orange box: "LLM-SR"
    * Green box: "KeplerAgent"
* **Data Annotations:** Each box plot has its median value annotated inside the box in scientific notation.
### Detailed Analysis
**Panel 1: Clean data (Left)**
* **PySR (Blue):** The lower whisker reaches about -14 and the top of the box sits at about -1.2. The median line is annotated as **1.98×10⁻⁴** (log₁₀ ≈ -3.70). The upper whisker extends to approximately -0.4.
* **LLM-SR (Orange):** The lower whisker reaches about -11.8 and the top of the box sits at about -1.5. The median line is annotated as **8.24×10⁻⁴** (log₁₀ ≈ -3.08). The upper whisker extends to approximately +0.6.
* **KeplerAgent (Green):** The median sits extremely low on the y-axis and is annotated as **9.81×10⁻¹⁴** (log₁₀ ≈ -13.01). The box body extends from approximately -14.2 to -1.5, and the upper whisker extends to approximately -0.8.
**Panel 2: Noisy data (Right)**
* **PySR (Blue):** The lower whisker reaches about -2.05 and the top of the box sits at about -0.1. The median line is annotated as **3.42×10⁻¹** (log₁₀ ≈ -0.47). The upper whisker extends to approximately +1.3.
* **LLM-SR (Orange):** The lower whisker reaches about -2.05 and the top of the box sits at about -0.1. The median line is annotated as **1.75×10⁻¹** (log₁₀ ≈ -0.76). The upper whisker extends to approximately +1.0.
* **KeplerAgent (Green):** The lower whisker reaches about -1.8 and the top of the box sits at about -0.6. The median line is annotated as **7.41×10⁻²** (log₁₀ ≈ -1.13). The upper whisker extends to approximately -0.4.
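The log₁₀ positions quoted next to each annotated median can be checked with a few lines of Python. This is a minimal sketch: the dictionary `MEDIANS` and the helper `log10_medians` are hypothetical names introduced for illustration, and the numbers are simply the median annotations transcribed from the figure.

```python
import math

# Median values as annotated in the figure (normalized MSE, linear scale).
# The dictionary layout is an assumption made for this sketch.
MEDIANS = {
    "clean": {"PySR": 1.98e-4, "LLM-SR": 8.24e-4, "KeplerAgent": 9.81e-14},
    "noisy": {"PySR": 3.42e-1, "LLM-SR": 1.75e-1, "KeplerAgent": 7.41e-2},
}


def log10_medians(medians):
    """Map each annotated median to its position on the log10 y-axis."""
    return {
        panel: {method: math.log10(value) for method, value in methods.items()}
        for panel, methods in medians.items()
    }


if __name__ == "__main__":
    for panel, methods in log10_medians(MEDIANS).items():
        for method, log_value in methods.items():
            print(f"{panel:>5}  {method:<12} log10(median) = {log_value:.2f}")
```

Rounded to two decimals, the results agree with the log₁₀ values listed for each method in both panels.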
### Key Observations
1. **Massive Performance Gap on Clean Data:** KeplerAgent's median error (9.81×10⁻¹⁴) is approximately **9 orders of magnitude lower** than PySR's and **10 orders of magnitude lower** than LLM-SR's on clean data. Its median line sits far below the other two, although the tops of all three boxes lie at similar levels (about -1.2 to -1.5).
2. **Performance Degradation with Noise:** All three methods show higher median errors (worse performance) on the "Noisy data" panel. KeplerAgent's median rises by roughly 12 orders of magnitude (from 9.81×10⁻¹⁴ to 7.41×10⁻²), while PySR's and LLM-SR's medians rise by roughly 3.2 and 2.3 orders of magnitude, respectively.
3. **Relative Ranking:** Based on the median values, KeplerAgent ranks first in both panels. The order of the other two methods flips: PySR (1.98×10⁻⁴) is ahead of LLM-SR (8.24×10⁻⁴) on clean data, while LLM-SR (1.75×10⁻¹) is ahead of PySR (3.42×10⁻¹) on noisy data.
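4. **Spread Under Noise:** The boxes and whiskers look tall in the "Noisy data" panel because its y-axis covers only about 3.5 log units. Measured in log units, the spread is narrower than on clean data: the reported extents cover roughly 1.4 (KeplerAgent) to 3.4 (PySR) orders of magnitude with noise, versus roughly 12 to 14 orders of magnitude on clean data.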
5. **Axis Break in Clean Data Plot:** There is a hatched gray band across the "Clean data" plot between y = -4 and y = -12, likely an axis break used to visually compress the large empty space between the medians of the high-error methods and KeplerAgent's extremely low median.
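The order-of-magnitude comparisons in these observations can be recomputed from the annotated medians alone. The sketch below is illustrative: `MEDIANS`, `orders_of_magnitude`, `ranking`, and `noise_shift` are hypothetical names, and the inputs are the six median annotations transcribed from the figure.

```python
import math

# Medians as annotated in the figure (normalized MSE, linear scale).
# The dictionary layout is an assumption made for this sketch.
MEDIANS = {
    "clean": {"PySR": 1.98e-4, "LLM-SR": 8.24e-4, "KeplerAgent": 9.81e-14},
    "noisy": {"PySR": 3.42e-1, "LLM-SR": 1.75e-1, "KeplerAgent": 7.41e-2},
}


def orders_of_magnitude(larger, smaller):
    """Number of powers of ten separating two positive error values."""
    return math.log10(larger / smaller)


def ranking(panel):
    """Methods ordered from lowest to highest median error in one panel."""
    return sorted(MEDIANS[panel], key=MEDIANS[panel].get)


def noise_shift(method):
    """Orders of magnitude by which a method's median rises from clean to noisy."""
    return orders_of_magnitude(MEDIANS["noisy"][method], MEDIANS["clean"][method])


if __name__ == "__main__":
    clean = MEDIANS["clean"]
    for rival in ("PySR", "LLM-SR"):
        gap = orders_of_magnitude(clean[rival], clean["KeplerAgent"])
        print(f"clean gap, {rival} vs KeplerAgent: {gap:.1f} orders")
    for method in clean:
        print(f"noise shift, {method}: {noise_shift(method):.1f} orders")
    for panel in MEDIANS:
        print(f"{panel} ranking (best first): {ranking(panel)}")
```

With these inputs the clean-data gaps come out at about 9.3 and 9.9 orders of magnitude, KeplerAgent's clean-to-noisy shift at about 11.9, and the median-based order of PySR and LLM-SR differs between the two panels.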
### Interpretation
This chart indicates that KeplerAgent reaches lower median error than PySR and LLM-SR when discovering governing equations of ODE/PDE systems, on both clean and noisy data.
* **On clean, ideal data,** KeplerAgent's median error is about 10⁻¹³, which is near-perfect reconstruction and suggests it often identifies the exact or near-exact underlying mathematical forms. The other methods have median errors between 10⁻⁴ and 10⁻³, indicating they typically find good but not perfect approximations.
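* **The introduction of noise** challenges all methods, raising median errors by roughly 2 to 3 orders of magnitude for PySR and LLM-SR and by about 12 for KeplerAgent. KeplerAgent keeps the lowest median, but by a much smaller margin (about 0.4 orders of magnitude below LLM-SR), so its approach appears somewhat more resilient to imperfect data rather than immune to it. The whiskers for PySR and LLM-SR still span about three orders of magnitude under noise, so results vary considerably from system to system.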
* KeplerAgent's **first place in both panels** is consistent with an algorithmic advantage for this class of problems, although the chart alone cannot establish the cause, and the order of PySR and LLM-SR changes between panels. The chart argues that, for symbolic regression on dynamical systems, KeplerAgent is the most accurate of the compared methods by median error, with exceptional precision on clean data and a smaller lead under noise. The visual design, especially the broken axis in the clean data plot, emphasizes the magnitude of KeplerAgent's clean-data advantage.