2602.10576
# LLM-Based Scientific Equation Discovery via Physics-Informed Token-Regularized Policy Optimization
**Authors**: Boxiao Wang, Kai Li, Tianyi Liu, Chen Li, Junzhe Wang, Yifan Zhang, Jian Cheng
> wangboxiao22@mails.ucas.ac.cn (ORCID 0009-0008-2970-7575), Institute of Automation, Chinese Academy of Sciences, Beijing, China
> kai.li@ia.ac.cn, Institute of Automation, Chinese Academy of Sciences, Beijing, China
> franktyliu@outlook.com, State Key Laboratory of Aerodynamics, Mianyang, Sichuan, China
> lichen@skla.cardc.cn, State Key Laboratory of Aerodynamics, Mianyang, Sichuan, China
> wangjunzhe21@mails.ucas.ac.cn, School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing, China
> yifan.zhang@ia.ac.cn, Institute of Automation, Chinese Academy of Sciences, Beijing, China
> jian.cheng@ia.ac.cn, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Abstract.
Symbolic regression aims to distill mathematical equations from observational data. Recent approaches have successfully leveraged Large Language Models (LLMs) to generate equation hypotheses, capitalizing on their vast pre-trained scientific priors. However, existing frameworks predominantly treat the LLM as a static generator, relying on prompt-level guidance to steer exploration. This paradigm fails to update the model's internal representations based on search feedback, often yielding physically inconsistent or mathematically redundant expressions. In this work, we propose PiT-PO (Physics-informed Token-regularized Policy Optimization), a unified framework that evolves the LLM into an adaptive generator via reinforcement learning. Central to PiT-PO is a dual-constraint mechanism that rigorously enforces hierarchical physical validity while simultaneously applying fine-grained, token-level penalties to suppress redundant structures. Consequently, PiT-PO aligns the LLM to produce equations that are both scientifically consistent and structurally parsimonious. Empirically, PiT-PO achieves state-of-the-art performance on standard benchmarks and successfully discovers novel turbulence models for challenging fluid dynamics problems. We also demonstrate that PiT-PO empowers small-scale models to outperform closed-source giants, democratizing access to high-performance scientific discovery.
1. Introduction
Symbolic Regression (SR) (Makke and Chawla, 2024b) stands as a cornerstone of data-driven scientific discovery, uniquely capable of distilling interpretable mathematical equations from observational data. Unlike black-box models that prioritize mere prediction, SR elucidates the fundamental mechanisms governing system behavior, proving instrumental in uncovering physical laws (Makke and Chawla, 2024a; Reuter et al., 2023), modeling chemical kinetics (Chen et al., 2025; Deng et al., 2023), and analyzing complex biological dynamics (Wahlquist et al., 2024; Shi et al., 2024).
However, the search for exact governing equations represents a formidable challenge, formally classified as an NP-hard problem (Virgolin and Pissis, 2022). To navigate this vast search space, algorithmic strategies have evolved from Genetic Programming (GP) (Schmidt and Lipson, 2009; Cranmer, 2023) and Reinforcement Learning (RL) (Petersen et al., 2021) to Transformer-based architectures that map numerical data directly to symbolic equations (Biggio et al., 2021; Kamienny et al., 2022; Zhang et al., 2025). Most recently, the advent of Large Language Models (LLMs) has introduced a new paradigm. Methods such as LLM-SR (Shojaee et al., 2025a) and LaSR (Grayeli et al., 2024) leverage the pre-trained scientific priors and in-context learning capabilities of LLMs to generate equation hypotheses. These approaches typically employ an evolutionary search paradigm, where candidates are evaluated, and high-performing solutions are fed back via prompt-level conditioning to steer subsequent generation.
Despite encouraging progress, existing LLM-based SR methods remain constrained by severe limitations. First, most approaches treat the LLM as a static generator, relying primarily on prompt-level, verbal guidance to steer the evolutionary search (Shojaee et al., 2025a; Grayeli et al., 2024). This "frozen" paradigm inherently neglects the opportunity to adapt and enhance the generative capability of the LLM itself based on evaluation signals, preventing the model from internalizing feedback and adjusting its generation strategies to the specific problem. Second, they typically operate in a physics-agnostic manner, prioritizing syntactic correctness over physical validity (Shojaee et al., 2025a; Grayeli et al., 2024). Without rigorous constraints, LLMs often generate equations that fit the data numerically but violate fundamental physical principles, rendering them prone to overfitting and practically unusable.
In this work, we propose to fundamentally shift the role of the LLM in SR from a static proposer to an adaptive generator. We establish a dynamic feedback loop in which evolutionary exploration and parametric learning reinforce each other: evolutionary search uncovers diverse candidate equations and generates informative evaluation signals, while parametric adaptation enables the LLM to consolidate effective symbolic patterns and guide subsequent exploration more efficiently. By employing in-search fine-tuning, i.e., updating the LLM parameters during the evolutionary search process, we move beyond purely verbal, prompt-level guidance and introduce numerical guidance that allows feedback to be directly internalized into the model parameters, progressively aligning the LLM with the intrinsic properties of the target system.
To realize this vision, we introduce PiT-PO (Physics-informed Token-regularized Policy Optimization), a unified framework that bridges LLM-driven evolutionary exploration with rigorous verification. PiT-PO is built upon two technical components. 1) In-Search LLM Evolution. We implement the numerical guidance via reinforcement learning, which efficiently updates the LLM's parameters during the search process. Instead of relying on static pre-trained knowledge, this in-search policy optimization enables the LLM to dynamically align its generative distribution with the structural characteristics of the specific task, effectively transforming general scientific priors into domain-specific expertise on the fly. 2) Dual Constraints as Search Guidance. We enforce hierarchical physical constraints to ensure scientific validity, and uniquely, we incorporate a fine-grained regularization based on our proposed Support Exclusion Theorem. This theorem allows us to identify mathematically redundant terms and translate them into token-level penalties, effectively pruning the search space and guiding the LLM toward physically meaningful and structurally parsimonious equations.
Comprehensive experimental results demonstrate that PiT-PO achieves state-of-the-art performance across standard SR benchmarks, including the LLM-SR Suite (Shojaee et al., 2025a) and LLM-SRBench (Shojaee et al., 2025b), and recovers the largest number of ground-truth equations among all evaluated methods. In addition to synthetic benchmarks, the effectiveness of PiT-PO is validated on an application-driven turbulence modeling task involving flow over periodic hills. PiT-PO improves upon traditional Reynolds-Averaged Navier-Stokes (RANS) approaches by producing anisotropic Reynolds stresses closer to Direct Numerical Simulation (DNS) references. The learned method shows enhanced physical consistency, with reduced non-physical extremes and better flow field predictions. Finally, PiT-PO maintains robust performance even when using resource-constrained small models, such as a quantized version of Llama-8B, and remains efficient under strict wall-clock time budgets, establishing a practical methodology for automated scientific discovery.
Figure 1. The overall framework of PiT-PO. PiT-PO transforms the LLM from a static proposer into an adaptive generator via a closed-loop evolutionary process. The framework integrates dual-constraint evaluation, comprising physical constraints and theoretical constraints, to generate fine-grained token-level learning signals. These signals guide the LLM policy update via reinforcement learning, ensuring the discovery of parsimonious, physically consistent equations.
2. Preliminaries
2.1. Problem Setup
In SR, a dataset of inputāoutput observations is given:
$$
D=\{(x_{i},y_{i})\}_{i=1}^{n},\;x_{i}\in\mathbb{R}^{d},\;y_{i}\in\mathbb{R}. \tag{1}
$$
The objective is to identify a compact and interpretable function $f\in\mathcal{F}$ such that $f(x_{i})\approx y_{i}$ for the observed samples, while retaining the ability to generalize to unseen inputs.
2.2. LLM-based SR Methods
Contemporary LLM-based approaches reformulate SR as an iterative program synthesis task. In this paradigm, typified by frameworks such as LLM-SR (Shojaee et al., 2025a), the discovery process is decoupled into two phases: structure proposal and parameter estimation. Specifically, the LLM functions as a symbolic generator, emitting functional skeletons with placeholders for learnable coefficients. A numerical optimizer (e.g., BFGS (Fletcher, 1987)) subsequently fits these constants to the observed data. To navigate the combinatorial search space, these methods employ an evolutionary-style feedback loop: high-fitness equations are maintained in a pool to serve as in-context examples, prompting the LLM to refine subsequent generations. Our work leverages this architecture as a backbone, but fundamentally redefines the LLM's role from a static proposer to an adaptive generator.
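A minimal sketch of this two-phase decomposition, assuming a toy quadratic skeleton and SciPy's BFGS optimizer (the function names and the skeleton itself are illustrative, not the authors' implementation):

```python
import numpy as np
from scipy.optimize import minimize

def skeleton(x, params):
    """A hypothetical LLM-emitted skeleton: y = c0*x + c1*x**2."""
    c0, c1 = params
    return c0 * x + c1 * x**2

def fit_constants(fn, x, y, n_params=2):
    """Second phase: fit the placeholder coefficients numerically (BFGS)."""
    loss = lambda p: np.mean((fn(x, p) - y) ** 2)
    res = minimize(loss, x0=np.ones(n_params), method="BFGS")
    return res.x, res.fun

x = np.linspace(0.0, 4.0, 200)
y = 2.0 * x + 0.5 * x**2           # synthetic "observations"
params, mse = fit_constants(skeleton, x, y)
print(params, mse)                 # mse is near zero for the correct skeleton
```

The fitted MSE then serves as the fitness signal fed back into the evolutionary loop.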
2.3. Group Relative Policy Optimization
Group Relative Policy Optimization (GRPO) (Shao et al., 2024) is an RL algorithm tailored for optimizing LLMs on reasoning tasks, characterized by its efficient baseline estimation without the need for a separate value network. In the context of LLM-based SR, the generation process is modeled as a Markov Decision Process (MDP), where the LLM functions as a policy $\pi_{\theta}$ that generates a sequence of tokens $o=(t_{1},...,t_{L})$ given a prompt $q$ . For each $q$ , GRPO samples a group of $G$ outputs $\{o_{1},...,o_{G}\}$ from the sampling policy $\pi_{\theta_{old}}$ . GRPO maximizes the following surrogate loss function:
$$
\mathcal{J}_{GRPO}(\theta)=\mathbb{E}_{q\sim P(Q),\{o_{i}\}\sim\pi_{\theta_{old}}}\left[\frac{1}{G}\sum_{i=1}^{G}\left(\frac{1}{L_{i}}\sum_{k=1}^{L_{i}}\mathcal{L}^{clip}_{i,k}(\theta)-\beta\mathbb{D}_{KL}(\pi_{\theta}||\pi_{ref})\right)\right], \tag{2}
$$
where $\pi_{ref}$ is the reference policy to prevent excessive deviation, and $\beta$ controls the KL-divergence penalty. The clipping term $\mathcal{L}^{clip}_{i,k}(\theta)$ ensures trust region updates:
$$
\mathcal{L}^{clip}_{i,k}(\theta)=\min\left(\frac{\pi_{\theta}(t_{i,k}|q,o_{i,<k})}{\pi_{\theta_{old}}(t_{i,k}|q,o_{i,<k})}\hat{A}_{i},\;\text{clip}\left(\frac{\pi_{\theta}(t_{i,k}|q,o_{i,<k})}{\pi_{\theta_{old}}(t_{i,k}|q,o_{i,<k})},1-\epsilon,1+\epsilon\right)\hat{A}_{i}\right). \tag{3}
$$
Here, $\epsilon$ is the clipping coefficient. A distinctive feature of GRPO lies in its advantage estimation: it computes the advantage $\hat{A}_{i}$ by standardizing the reward $R(o_{i})$ relative to the group:
$$
\hat{A}_{i}=\frac{R(o_{i})-\text{mean}(\{R(o_{j})\})}{\text{std}(\{R(o_{j})\})}. \tag{4}
$$
Consequently, every token within the sequence $o_{i}$ is assigned the exact same feedback signal. This coarse granularity treats valid and redundant terms indistinguishably, a limitation that our work addresses by introducing fine-grained, token-level regularization.
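The group-relative advantage of Equation (4) is a one-line computation; note how the resulting scalar is shared by every token of its sequence (a small epsilon is added here to guard against a zero standard deviation):

```python
import numpy as np

def group_advantages(rewards):
    """GRPO baseline: standardize each sampled equation's reward within
    its group; every token of sequence i then shares the same scalar."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

adv = group_advantages([1.0, 2.0, 3.0, 4.0])
print(adv)  # centered at zero; the best sample gets the largest advantage
```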
3. Method
We propose PiT-PO (Physics-informed Token-regularized Policy Optimization), a framework that evolves the LLM into an adaptive, physics-aware generator. PiT-PO establishes a closed-loop evolutionary process driven by two synergistic mechanisms: (1) a dual-constraint evaluation system that rigorously assesses candidates through hierarchical physical verification and theorem-guided redundancy pruning; and (2) a novel policy optimization strategy that updates the LLM using fine-grained, token-level feedback derived from these constraints. This combination effectively aligns the LLM with the intrinsic structure of the problem, guiding the LLM toward solutions that are not only numerically accurate but also structurally parsimonious and scientifically consistent.
3.1. Dual-Constraint Learning Signals
Navigating the combinatorial space of symbolic equations requires rigorous guidance. We employ a reward system driven by dual constraints: physical constraints delineate the scientifically valid region, while theoretical constraints drive the search toward simpler equations by identifying and pruning redundant terms.
3.1.1. Hierarchical Physical Constraints.
To ensure scientific validity, we construct a hierarchical filter that categorizes constraints into two levels: general properties and domain-specific priors.
General-Level Constraints. We enforce fundamental physical properties applicable across scientific disciplines. To prune physically impossible structures (e.g., adding terms with mismatched units), we assign penalty-based rewards for Dimensional Homogeneity ( $P_{dim}$ ) and Differentiability ( $P_{diff}$ ). The former strictly penalizes equations with unit inconsistencies, while the latter enforces smoothness on the data-defined domain.
Domain-Specific Constraints. To tackle specialized tasks, we inject expert knowledge as inductive biases. We define the domain-specific penalty $P^{(j)}_{domain}$ to penalize candidate equations that violate the $j$ -th domain-specific constraint. Taking the turbulence modeling task (detailed in Appendix E.4) as a representative instantiation, we enforce four rigorous constraints: (1) Realizability (Pope, 2000), ensuring the Reynolds stress tensor has positive eigenvalues; (2) Boundary Condition Consistency (Monkewitz, 2021), requiring stresses to decay to zero at the wall; (3) Asymptotic Scaling (Tennekes and Lumley, 1972; Wang et al., 2019), enforcing the cubic relationship between stress and wall distance in the viscous sublayer; and (4) Energy Consistency (Pope, 2000; Mochizuki and Osaka, 2000), aligning predicted stress with turbulent kinetic energy production.
This hierarchical design effectively embeds physical consistency as a hard constraint in the reward function, prioritizing scientific validity over mere empirical fitting.
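One illustrative way to implement the dimensional-homogeneity check from the general-level constraints is to track unit-exponent vectors per variable and flag any sum whose terms disagree; the variable table and penalty value below are assumptions for the sketch, not the authors' implementation:

```python
import numpy as np

# Hypothetical unit bookkeeping: exponents over the base units (m, s, kg).
UNITS = {"x": np.array([1, 0, 0]),    # position: m
         "v": np.array([1, -1, 0]),   # velocity: m/s
         "t": np.array([0, 1, 0])}    # time: s

def dim_penalty(terms, p_dim=1.0):
    """Return P_dim > 0 if the summed terms have inconsistent units.
    Each term is a list of (variable, exponent) factors."""
    def term_units(term):
        return sum(UNITS[v] * e for v, e in term)
    dims = [tuple(term_units(t)) for t in terms]
    return 0.0 if len(set(dims)) == 1 else p_dim

# x + v*t is homogeneous (both terms are metres); x + v is not.
print(dim_penalty([[("x", 1)], [("v", 1), ("t", 1)]]))  # 0.0
print(dim_penalty([[("x", 1)], [("v", 1)]]))            # 1.0
```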
3.1.2. Theorem-Guided Mathematical Constraints
While physical constraints ensure validity, they do not prevent mathematical redundancy. To rigorously distinguish between essential terms and redundant artifacts, we introduce the Support Exclusion Theorem.
Let $\mathcal{S}$ denote the full support set containing all candidate basis functions $\{\phi_{j}\}$ . The ground truth equation is $f^{*}=\sum_{j\in\mathcal{S}^{\prime}}a_{j}\phi_{j}$ , where $\mathcal{S}^{\prime}\subseteq\mathcal{S}$ is the true support set (i.e., the indices of basis functions that truly appear in the governing equation), and $\{a_{j}\}_{j\in\mathcal{S}^{\prime}}$ are the corresponding true coefficients. Consider a candidate equation $f=\sum_{j\in\mathcal{K}}b_{j}\phi_{j}$ , where $\mathcal{K}\subseteq\mathcal{S}$ represents the current support set (i.e., the selected terms in the skeleton), and $\mathbf{b}=\{b_{j}\}_{j\in\mathcal{K}}$ are the optimized coefficients derived from the data. We define the empirical Gram matrix of these basis functions as $G\in\mathbb{R}^{|\mathcal{S}|\times|\mathcal{S}|}$ and the corresponding Projection Matrix as $T$ , where $T_{ij}:=G_{ji}/G_{ii}$ .
**Theorem 3.1 (Support Exclusion Theorem).**
*Assume the ground-truth support is finite and satisfies $|\mathcal{S}^{\prime}|\le M$ , and let the true function coefficients be bounded by $A\le|a_{j}|\le B$ for all $j\in\mathcal{S}^{\prime}$ . A term $\phi_{i}$ ( $i\in\mathcal{K}$ ) is theoretically guaranteed to be a false discovery (not in the true support $\mathcal{S}^{\prime}$ ) if its fitted coefficient magnitude satisfies:
$$
|b_{i}|<A-\left(\underbrace{\sum_{j\in\mathcal{K},j\neq i}(B+|b_{j}|)|T_{ij}|}_{\text{Internal Interference}}+\underbrace{B\sum_{k=1}^{m}s_{(k)}}_{\text{External Interference}}\right). \tag{5}
$$
$s_{(k)}$ denotes the $k$ -th largest value in $\{|T_{i\ell}|:\ell\in\mathcal{S}\setminus\mathcal{K}\}$ , and $m:=\min\!\big(M-1,\;|\mathcal{S}\setminus\mathcal{K}|\big)$.*
Detailed definitions of all notations and the rigorous proof of Theorem 3.1 are provided in Appendix B. This theorem formalizes the intuition that coefficients of redundant terms (absent from the true support $\mathcal{S}^{\prime}$ ) have significantly smaller magnitudes than those of valid components.
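The exclusion bound of Theorem 3.1 is directly computable once the basis functions have been evaluated on the data; below is a minimal sketch (function and argument names are illustrative, assuming `Phi` stores one column per basis function in $\mathcal{S}$):

```python
import numpy as np

def exclusion_bound(Phi, K, i, b, A, B, M):
    """Right-hand side of the bound in Theorem 3.1. If |b[i]| falls
    below the returned value, phi_i is a guaranteed false discovery."""
    G = Phi.T @ Phi                        # empirical Gram matrix
    T = G.T / np.diag(G)[:, None]          # T[i, j] = G[j, i] / G[i, i]
    # Internal interference: other terms currently in the skeleton.
    internal = sum((B + abs(b[j])) * abs(T[i, j]) for j in K if j != i)
    # External interference: the m largest projections onto unselected terms.
    outside = sorted((abs(T[i, l]) for l in range(Phi.shape[1]) if l not in K),
                     reverse=True)
    m = min(M - 1, len(outside))
    external = B * sum(outside[:m])
    return A - (internal + external)

# With an orthogonal basis all interference terms vanish, so the bound is A.
Phi = np.eye(4)
bound = exclusion_bound(Phi, K=[0, 1], i=1, b={0: 1.0, 1: 1e-4},
                        A=0.5, B=2.0, M=2)
print(bound)   # 0.5, so the tiny fitted |b_1| = 1e-4 is a false discovery
```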
Specifically, after fitting $\mathbf{b}$ , we compute the normalized coefficient ratio $\tau_{i}=|b_{i}|/(\sum_{j}|b_{j}|+\epsilon)$ . We introduce a threshold $\rho\in(0,1)$ to identify potentially redundant terms. Terms satisfying $\tau_{i}>\rho$ incur no penalty, while components with $\tau_{i}\le\rho$ are considered redundant. To suppress these redundancies, we define a token penalty for each token in redundant term $i$ :
$$
P_{tok}=p\cdot\max\left(0,-\log\left(|b_{i}|+\epsilon\right)\right), \tag{6}
$$
where $p>0$ is a scaling coefficient. We use a logarithmic scale to impose stronger penalties on terms with smaller coefficients.
By integrating this penalty into the policy optimization, we guide the LLM to reduce the probability of generating redundant terms, thereby steering the optimization toward parsimonious equations.
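A sketch of the redundancy test and the token penalty of Equation (6), with illustrative hyperparameter values for $\rho$ and $p$:

```python
import numpy as np

def token_penalties(b, rho=0.05, p=0.1, eps=1e-12):
    """Flag terms whose normalized ratio tau_i <= rho as redundant and
    assign them the penalty p * max(0, -log(|b_i| + eps)); every token
    of a redundant term shares its penalty."""
    b = np.abs(np.asarray(b, dtype=float))
    tau = b / (b.sum() + eps)
    pen = np.where(tau <= rho, p * np.maximum(0.0, -np.log(b + eps)), 0.0)
    return tau, pen

tau, pen = token_penalties([3.0, 1.2, 1e-6])
print(tau, pen)   # only the vanishing third coefficient is penalized
```

The logarithmic form makes the penalty grow as the fitted coefficient shrinks, so near-zero terms are suppressed most strongly.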
3.2. Token-Aware Policy Update
Our proposed PiT-PO effectively operationalizes the hierarchical constraints and theoretical insights derived in Section 3.1. Unlike standard GRPO, which assigns a uniform scalar reward to the entire generated sequence, our method transitions the learning process from coarse-grained sequence scoring to fine-grained token-level credit assignment. This ensures that the policy not only learns to generate physically valid equations but also explicitly suppresses theoretically redundant terms.
3.2.1. Global Reward with Gated Constraints
The optimization is driven by a composite global reward, $R_{global}$ , which balances fitting accuracy, structural parsimony, and physical consistency. Formally, for a sampled equation $o_{i}$ , the rewards are defined as follows:
Fitting Accuracy ( $R_{fit}$ ). We use a scaled negative log-MSE to encourage precise data fitting:
$$
R_{fit}=-\alpha\log(\text{MSE}+\epsilon), \tag{7}
$$
where $\text{MSE}=\frac{1}{n}\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}$ .
Complexity Penalty ( $P_{cplx}$ ). Adhering to Occamās Razor, we penalize structural complexity based on the Abstract Syntax Tree (AST) (Neamtiu et al., 2005) node count:
$$
P_{cplx}=\lambda_{len}\cdot\text{Length}(\text{AST}). \tag{8}
$$
In PiT-PO, each equation generated by the LLM is represented as a Python function and parsed into an AST, where each node corresponds to a variable or operator. The total node count provides a meaningful estimate of structural complexity.
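Because each candidate is Python source, its node count can be obtained with the standard `ast` module; a minimal sketch (the helper names are illustrative):

```python
import ast

def ast_complexity(equation_src: str) -> int:
    """Count every AST node (variables, constants, operators) of an
    equation expressed as Python source."""
    return sum(1 for _ in ast.walk(ast.parse(equation_src)))

def p_cplx(equation_src: str, lam_len: float = 0.01) -> float:
    """Complexity penalty of Eq. (8): lambda_len * Length(AST)."""
    return lam_len * ast_complexity(equation_src)

# A longer expression yields a strictly larger penalty.
print(ast_complexity("c0 * x"), ast_complexity("c0 * x + c1 * x**2"))
```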
Gated Physical Penalty ( $P_{phy}$ ). Imposing strict physical constraints too early can hinder exploration, causing the model to discard potentially promising functional forms. We therefore activate physical penalties only after the candidate equation reaches a baseline fitting accuracy threshold ( $\delta_{gate}$ ). Specifically, we define
$$
P_{phy}(o_{i})=\mathbb{1}\big(R_{fit}(o_{i})\geq\delta_{gate}\big)\cdot\Big(P_{dim}+P_{diff}+\sum_{j}P^{(j)}_{domain}\Big), \tag{9}
$$
where $\mathbb{1}(\cdot)$ is the indicator function. This mechanism effectively creates a soft curriculum: it allows "free" exploration in the early stages and enforces strict physical compliance only after the solution enters a plausible region.
The total reward $R_{global}$ is then formulated as:
$$
R_{global}(o_{i})=R_{fit}(o_{i})-P_{cplx}(o_{i})-P_{phy}(o_{i}). \tag{10}
$$
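A sketch of the composite reward, assuming (as described in the text) an indicator gate that activates the physical penalty once the fitting reward clears $\delta_{gate}$; the constant values are illustrative:

```python
import numpy as np

def global_reward(mse, ast_len, phys_violation, alpha=0.1, lam_len=0.01,
                  delta_gate=0.0, eps=1e-12):
    """R_global = R_fit - P_cplx - P_phy, with P_phy gated on only
    after the fitting reward clears delta_gate (a soft curriculum)."""
    r_fit = -alpha * np.log(mse + eps)            # Eq. (7)
    p_cplx = lam_len * ast_len                    # Eq. (8)
    gate = 1.0 if r_fit >= delta_gate else 0.0    # indicator gate
    return r_fit - p_cplx - gate * phys_violation

# A poorly fitting candidate is not yet punished for physics violations:
bad_fit = global_reward(mse=10.0, ast_len=12, phys_violation=5.0)
good_fit = global_reward(mse=1e-6, ast_len=12, phys_violation=5.0)
print(bad_fit, good_fit)
```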
3.2.2. Fine-Grained Advantage Estimation
Standard GRPO applies a uniform advantage across all tokens in a sequence. We refine this by synthesizing the global reward with the token-level penalty $P_{tok}$ (Equation 6). Specifically, we define the token-aware advantage $\hat{A}_{i,k}$ for the $k$ -th token in the $i$ -th sampled equation as:
$$
\hat{A}_{i,k}=\underbrace{\frac{R_{global}(o_{i})-\mu_{group}}{\sigma_{group}}}_{\text{Global Standardization}}-\underbrace{P_{i,k}}_{\text{Local Pruning}}. \tag{11}
$$
Here, the first term standardizes the global reward against the group statistics ( $\mu_{group},\sigma_{group}$ ), reinforcing equations that satisfy multi-objective criteria relative to their peers. The second term, $P_{i,k}$ , applies a targeted penalty to suppress redundancy. Specifically, we set $P_{i,k}=0$ if token $k$ belongs to a non-redundant term, and $P_{i,k}=P_{tok}$ otherwise. This ensures that penalties are applied exclusively to tokens contributing to mathematically redundant structures, while valid terms remain unaffected.
Substituting this token-aware advantage into the GRPO objective, the policy gradient update of our PiT-PO becomes:
$$
\nabla\mathcal{J}_{PiT-PO}\propto\sum_{i,k}\hat{A}_{i,k}\nabla\log\pi_{\theta}(t_{i,k}|o_{i,<k}). \tag{12}
$$
This creates a dual-pressure optimization landscape: global rewards guide the policy toward physically consistent and accurate equations, while local penalties surgically excise redundant terms. This ensures the final output aligns with the sparse, underlying physical laws rather than merely overfitting numerical data.
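The token-aware advantage of Equation (11) can be sketched as follows (a minimal illustration, not the authors' implementation; a small epsilon guards the group standard deviation):

```python
import numpy as np

def token_aware_advantages(rewards, token_pens):
    """Eq. (11): standardize R_global across the group, then subtract
    each token's local penalty P_{i,k} from its shared base advantage."""
    r = np.asarray(rewards, dtype=float)
    base = (r - r.mean()) / (r.std() + 1e-8)
    return [base[i] - np.asarray(p, dtype=float)
            for i, p in enumerate(token_pens)]

rewards = [1.0, 3.0]
token_pens = [[0.0, 0.0, 0.5],   # third token of sample 0 is redundant
              [0.0, 0.0, 0.0]]   # sample 1 has no redundant terms
adv = token_aware_advantages(rewards, token_pens)
print(adv)
```

Only the redundant token's advantage is pushed down; all other tokens of the same equation keep the group-standardized value.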
Input: Dataset $D=\{(x_{i},y_{i})\}_{i=1}^{n}$ ; LLM $\pi_{\theta}$ ; number of islands $N$ ; group size $G$ ; iterations $T$ .
Output: Best equation $o^{*}$ .
1: Initialize $o^{*}$ , $s^{*}$ , and buffers $\mathcal{B}_{j}\leftarrow\emptyset$ for $j=1,\ldots,N$
2: for $t\leftarrow 1$ to $T$ do
3: // Stage 1: Island-Based Exploration
4: for $j\leftarrow 1$ to $N$ do
5: $q_{j}\leftarrow\textsc{BuildPrompt}(D,\mathcal{B}_{j})$ // in-context rule
6: $\{o_{i}\}_{i=1}^{G}\sim\pi_{\theta}(\cdot\mid q_{j})$ // sample a group
7: for $i\leftarrow 1$ to $G$ do
8: $(R_{i},\{P_{i,k}\})\leftarrow\textsc{DualConstraintEval}(o_{i},D)$ // $R_{i}=R_{\mathrm{global}}(o_{i})$
9: $\mathcal{B}_{j}\leftarrow\mathcal{B}_{j}\cup\{(q_{j},o_{i},R_{i},\{P_{i,k}\})\}$
10: if $R_{i}>s^{*}$ then $o^{*}\leftarrow o_{i}$ ; $s^{*}\leftarrow R_{i}$
11: // Stage 2: In-Search LLM Evolution
12: $\theta\leftarrow\textsc{PiT-PO\_Update}(\theta,\{\mathcal{B}_{j}\}_{j=1}^{N},\pi_{\theta})$
13: // Stage 3: Hierarchical Selection
14: $\{\mathcal{B}_{j}\}_{j=1}^{N}\leftarrow\textsc{SelectAndReset}(\{\mathcal{B}_{j}\}_{j=1}^{N})$
15: return $o^{*}$
Algorithm 1. PiT-PO Overall Training Pipeline
3.3. Overall Training Pipeline
We orchestrate the PiT-PO framework through a closed-loop evolutionary RL cycle. As illustrated in Algorithm 1, the training process iterates through three synergistic phases:
Phase 1: Island-Based Exploration (Data Generation). To prevent premature convergence to local optima, a common pitfall in SR, we employ a standard multi-island topology ( $N$ islands) to structurally enforce search diversity (Cranmer, 2023; Romera-Paredes et al., 2024). Each island $j$ maintains an isolated experience buffer $\mathcal{B}_{j}$ , evolving its own lineage of equations. This information isolation allows distinct islands to cultivate diverse functional forms independently.
Phase 2: In-Search LLM Evolution (Policy Update). This phase transforms the collected data into parametric knowledge. We aggregate the trajectories from all $N$ islands into a global batch to perform policy optimization using PiT-PO. By minimizing the loss, the model explicitly lowers the probability of generating mathematically redundant tokens and physically inconsistent structures. To ensure computational efficiency during this iterative search, we implement the update using Low-Rank Adaptation (LoRA) (Hu et al., 2021).
Phase 3: Hierarchical Selection (Population Management). We apply standard survival-of-the-fittest mechanisms to maintain population quality. Local buffers $\mathcal{B}_{j}$ are updated by retaining only top-performing candidates, while underperforming islands are periodically reset with high-fitness seeds to escape local optima.
This cycle establishes a reciprocal reinforcement mechanism: the island-based exploration maintains search diversity, while policy update consolidates these findings into the model weights, progressively transforming the LLM into a domain-specialized scientific discoverer.
4. Experiments
4.1. Setup
Benchmarks. To provide a comprehensive evaluation of PiT-PO, we adopt two widely used benchmarks to compare against state-of-the-art baselines:
LLM-SR Suite (Shojaee et al., 2025a). This suite comprises four tasks spanning multiple scientific domains: Oscillation 1 & 2 (Nonlinear Oscillatory Systems) feature dissipative couplings and non-polynomial nonlinearities with explicit forcing, making recovery of the correct interaction terms from trajectory data non-trivial; E. coli Growth (Monod, 1949; Rosso et al., 1995) models multivariate population dynamics with strongly coupled, multiplicative effects from nutrients, temperature, and acidity; and Stress-Strain (Aakash et al., 2019) uses experimental measurements of Aluminum 6061-T651 and exhibits temperature-dependent, piecewise non-linear deformation behavior. Detailed information about these tasks is provided in Appendix D.1.
LLM-SRBench (Shojaee et al., 2025b). To evaluate generalization beyond canonical forms, we adopt the comprehensive LLM-SRBench benchmark, which contains 239 tasks organized into two complementary subsets, LSR-Transform and LSR-Synth. LSR-Transform changes the prediction target, rewriting well-known physics equations into less common yet analytically equivalent forms and producing 111 transformed tasks. This design reduces reliance on direct memorization of canonical templates and tests whether a method can recover the same physical law under non-trivial variable reparameterizations. Complementarily, LSR-Synth composes equations from both known scientific terms and synthetic but plausible terms to further assess discovery beyond memorized templates: candidate terms are proposed by an LLM under domain context, assembled into full equations, and then filtered through multiple checks, including numerical solvability, contextual novelty, and expert plausibility, yielding 128 synthetic tasks. Further details are given in Appendix D.2.
Baselines.
We compare PiT-PO against representative baselines spanning both classical and LLM-based SR methods. For the four tasks in the LLM-SR Suite, we include GPlearn, a genetic programming-based SR approach; PySR (Cranmer, 2023), which couples evolutionary search with symbolic simplification; uDSR (Landajuela et al., 2022), which replaces the RNN policy in DSR with a pretrained Transformer and employs neural-guided decoding; RAG-SR (Zhang et al., 2025), which incorporates structure retrieval to assist equation generation; and LLM-SR (Shojaee et al., 2025a). For the broader LLM-SRBench benchmark, we further compare against leading LLM-based SR methods, including SGA (Ma et al., 2024), which integrates LLM-driven hypothesis proposal with physics-informed parameter optimization in a bilevel search framework, and LaSR (Grayeli et al., 2024), which leverages abstract symbolic concepts distilled from prior equations to guide hybrid LLM-evolutionary generation.
Evaluation metrics. We evaluate methods using Accuracy to Tolerance and Normalized Mean Squared Error (NMSE). For a tolerance $\tau$ , we report $\mathrm{Acc}_{\mathrm{all}}(\tau)$ (Biggio et al., 2021) and $\mathrm{Acc}_{\mathrm{avg}}(\tau)$ based on relative error: $\mathrm{Acc}_{\mathrm{all}}(\tau)=\mathbbm{1}\!\bigl(\max_{1\le i\le N_{\mathrm{test}}}\bigl|\tfrac{\hat{y}_{i}-y_{i}}{y_{i}}\bigr|\le\tau\bigr)$ and $\mathrm{Acc}_{\mathrm{avg}}(\tau)=\tfrac{1}{N_{\mathrm{test}}}\sum_{i=1}^{N_{\mathrm{test}}}\mathbbm{1}\!\bigl(\bigl|\tfrac{\hat{y}_{i}-y_{i}}{y_{i}}\bigr|\le\tau\bigr)$ , where $\hat{y}_{i}$ and $y_{i}$ denote the predicted and ground-truth values at the $i$ -th test point, respectively. We additionally report $\mathrm{NMSE}=\frac{1}{N_{\mathrm{test}}}\sum_{i=1}^{N_{\mathrm{test}}}\frac{(\hat{y}_{i}-y_{i})^{2}}{\mathrm{Var}(y)}$ to assess overall numerical accuracy. We also adopt the Symbolic Accuracy (SA) metric (Shojaee et al., 2025b), which directly measures whether the discovered equation recovers the correct symbolic form (i.e., whether it is mathematically equivalent to the ground-truth equation up to fitted constants).
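The three numeric metrics follow directly from their definitions; a minimal sketch (assuming no ground-truth value is exactly zero, so the relative error is well defined):

```python
import numpy as np

def acc_all(y_hat, y, tau):
    """1 iff every test point's relative error is within tolerance tau."""
    rel = np.abs((y_hat - y) / y)
    return float(rel.max() <= tau)

def acc_avg(y_hat, y, tau):
    """Fraction of test points whose relative error is within tau."""
    rel = np.abs((y_hat - y) / y)
    return float((rel <= tau).mean())

def nmse(y_hat, y):
    """Normalized mean squared error."""
    return float(np.mean((y_hat - y) ** 2) / np.var(y))

y = np.array([1.0, 2.0, 4.0])
y_hat = np.array([1.05, 2.0, 4.0])      # 5% error on the first point only
print(acc_all(y_hat, y, 0.01), acc_avg(y_hat, y, 0.01), nmse(y_hat, y))
```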
Hyperparameter Configurations. All experiments were run for 2,500 search iterations. To ensure a fair comparison, all hyperparameters related to LLM generation and search were kept consistent with the default configuration of LLM-SR. For the in-search policy optimization specific to PiT-PO, we use a learning rate of $1\times 10^{-6}$, a group size of $G=4$, and a multi-island setting of $N=4$, resulting in an effective per-device batch size of $G\times N=16$. The coefficient of the KL regularization term was set to 0.01, and the LoRA rank to $r=16$. Experiments were conducted on a single NVIDIA RTX 3090 using 4-bit quantized Llama-3.2-1B-Instruct, Llama-3.2-3B-Instruct, and Llama-3.1-8B-Instruct (Kassianik et al., 2025) to evaluate the training stability and performance transferability of PiT-PO across different parameter scales under constrained compute and memory budgets. More details are in Appendix A.
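The reported settings can be collected into a single configuration sketch (field names are our own shorthand, not taken from the released implementation):

```python
# Hypothetical configuration mirroring the reported PiT-PO training settings.
PITPO_CONFIG = {
    "search_iterations": 2500,
    "learning_rate": 1e-6,
    "group_size": 4,         # G: sampled equation candidates per group
    "num_islands": 4,        # N: parallel multi-island populations
    "kl_coeff": 0.01,        # weight of the KL regularization term
    "lora_rank": 16,         # rank r of the LoRA adapters
    "quantization_bits": 4,  # 4-bit weights to fit a single RTX 3090
}

# The effective per-device batch size is the product G x N.
effective_batch = PITPO_CONFIG["group_size"] * PITPO_CONFIG["num_islands"]
print(effective_batch)  # 16
```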
| Method | Oscillation 1 Acc | Oscillation 1 NMSE | Oscillation 2 Acc | Oscillation 2 NMSE | E. coli Growth Acc | E. coli Growth NMSE | Stress-Strain Acc | Stress-Strain NMSE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPlearn | 0.11 | 0.0972 | 0.05 | 0.2000 | 0.76 | 1.0023 | 28.43 | 0.3496 |
| uDSR | 1.78 | 0.0002 | 0.36 | 0.0856 | 1.12 | 0.5059 | 59.15 | 0.0639 |
| PySR | 3.80 | 0.0003 | 7.02 | 0.0002 | 2.80 | 0.4068 | 70.60 | 0.0347 |
| RAG-SR | 39.47 | 1.49e-6 | 0.43 | 0.0282 | 2.04 | 0.2754 | 76.28 | 0.0282 |
| LLM-SR (Mixtral) | 100.00 | 1.32e-11 | 99.98 | 1.18e-11 | 2.88 | 0.0596 | 71.44 | 0.0276 |
| LLM-SR (4o-mini) | 99.92 | 8.84e-12 | 99.97 | 8.70e-10 | 5.52 | 0.0453 | 85.33 | 0.0245 |
| LLM-SR (Llama-3.1-8B) | 59.58 | 1.17e-6 | 99.96 | 9.66e-10 | 4.86 | 0.0555 | 77.74 | 0.0246 |
| LLM-SR (Llama-3.2-3B) | 39.45 | 1.76e-6 | 66.34 | 6.99e-7 | 1.08 | 0.3671 | 74.78 | 0.0324 |
| LLM-SR (Llama-3.2-1B) | 3.28 | 4.47e-4 | 7.81 | 0.0002 | 1.70 | 0.5801 | 30.35 | 0.3801 |
| PiT-PO (Llama-3.1-8B) | 100.00 | 6.41e-31 | 99.99 | 2.11e-13 | 10.42 | 0.0090 | 84.45 | 0.0136 |
| PiT-PO (Llama-3.2-3B) | 100.00 | 7.58e-31 | 99.97 | 9.77e-10 | 7.01 | 0.0248 | 84.54 | 0.0156 |
| PiT-PO (Llama-3.2-1B) | 99.95 | 1.34e-11 | 99.97 | 1.70e-8 | 4.76 | 0.0240 | 76.91 | 0.1767 |
Table 1. Overall performance on LLM-SR Suite.
4.2. PiT-PO Demonstrates Superior Equation Discovery Capability
As evidenced in Table 1, PiT-PO establishes a new state of the art on the LLM-SR Suite. It consistently outperforms baseline methods across all metrics, achieving the highest accuracy while maintaining the lowest NMSE in nearly all test cases. Crucially, when controlling for the LLM backbone, PiT-PO yields a substantial performance margin over LLM-SR, validating the effectiveness of our in-search policy optimization framework. Notably, PiT-PO is the only approach that identifies the exact ground-truth equation for Oscillation 1. This structural precision extends to the larger-scale LLM-SRBench (Table 2), where PiT-PO achieves the highest symbolic accuracy across all categories. These results collectively demonstrate that PiT-PO not only fits data numerically but also excels at uncovering the true underlying equations.
These quantitative gains are not accidental but stem from PiT-PO's structural awareness. Analysis of the iterative trajectories of LLM-SR and PiT-PO in Appendix C.4 corroborates this conclusion: LLM-SR's iterations remain persistently influenced by clearly incorrect terms that violate physical meaning despite providing strong numerical fits, as well as by additional nuisance terms. Consequently, LLM-SR's search often stagnates in a low-MSE regime without reaching the correct structure. In contrast, once PiT-PO enters the same regime, the dual constraints rapidly eliminate terms that improve fitting performance but are structurally incorrect. This behavior highlights the central advantage of the proposed dual-constraint learning signals: the physical constraints and token-level penalties provide data-driven signals for precise structural correction, guiding the LLM toward the true underlying equation rather than mere numerical overfitting.
Figure 2. NMSE trajectories (log scale) over search iterations for LLM-SR and PiT-PO (Llama-3.1-8B) on the LLM-SR Suite. Lines denote the median over seeds, and shaded regions indicate the min–max range. The iteration curves for the smaller backbones (3B and 1B) are deferred to Appendix C.1.
| Method | SA | Acc | NMSE | SA | Acc | NMSE | SA | Acc | NMSE | SA | Acc | NMSE | SA | Acc | NMSE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Direct Prompting | 3.61 | 1.801 | 0.3697 | 0.00 | 0.00 | 0.0644 | 0.00 | 0.00 | 0.5481 | 0.00 | 0.00 | 0.0459 | 0.00 | 0.00 | 0.0826 |
| SGA | 2.70 | 0.909 | 0.3519 | 0.00 | 8.33 | 0.0458 | 0.00 | 0.00 | 0.2416 | 0.00 | 2.27 | 0.1549 | 0.00 | 12.12 | 0.0435 |
| LaSR | 5.41 | 45.94 | 0.0021 | 0.00 | 27.77 | 2.77e-4 | 4.16 | 16.66 | 2.73e-4 | 4.54 | 25.02 | 0.0018 | 8.21 | 64.22 | 7.44e-5 |
| LLM-SR | 30.63 | 38.55 | 0.0101 | 8.33 | 66.66 | 8.01e-6 | 25.30 | 58.33 | 1.04e-6 | 6.97 | 34.09 | 1.23e-4 | 4.10 | 88.12 | 1.15e-7 |
| PiT-PO | 34.23 | 46.84 | 0.0056 | 13.89 | 77.78 | 4.13e-7 | 29.17 | 70.83 | 9.37e-8 | 11.36 | 40.91 | 6.57e-5 | 12.00 | 92.00 | 1.18e-8 |
Table 2. Overall performance on LLM-SRBench (Llama-3.1-8B-Instruct). Each group of three columns reports SA, Acc, and NMSE for one of the five LLM-SRBench problem categories.
4.3. PiT-PO Empowers Lightweight Backbones to Rival Large Models
As shown in Table 1, the performance of PiT-PO with the Llama-3.1-8B, Llama-3.2-3B, and Llama-3.2-1B backbones is competitive with, and often exceeds, the performance of LLM-SR that relies on substantially larger or proprietary models, including Mixtral $8\times 7$B and 4o-mini.
These results indicate that PiT-PO effectively bridges the capability gap between lightweight open-source models and large-scale commercial systems. From a practical standpoint, this reduces the barrier to entry for scientific discovery: by delivering state-of-the-art performance on consumer-grade hardware (even maintaining competitiveness with a 1B backbone), PiT-PO eliminates the dependence on massive compute and closed-source APIs, thereby democratizing access to powerful SR tools.
Figure 3. Ablation results of PiT-PO and its variants.
4.4. PiT-PO Enhances Search Efficiency and Breaks Stagnation
Figure 2 shows that PiT-PO achieves superior search efficiency in discovering accurate equations. In the early search stage, the red and blue curves are close across all four tasks: both methods primarily rely on the fitting signal (MSE) and therefore exhibit comparable per-iteration progress. As NMSE enters a lower regime, the trajectories consistently separate: PiT-PO exhibits abrupt step-wise drops while LLM-SR tends to plateau, yielding a clear red–blue gap in every subplot. Concretely, once the search reaches these lower-error regions, PiT-PO repeatedly exits stagnation and transitions to the next accuracy phase with orders-of-magnitude NMSE reductions (most prominently in Oscillation 1 and Oscillation 2, and also evident in E. coli Growth and Stress-Strain), whereas LLM-SR often remains trapped near its current error floor. This behavior confirms that the proposed dual-constraint mechanism activates precisely when naive MSE feedback becomes insufficient. By penalizing physical inconsistencies and structural redundancy, PiT-PO forces the LLM to exit stagnation and transition toward the correct functional form.
While the in-search fine-tuning introduces a computational overhead, this cost is decisively outweighed by the substantial gains in performance. As detailed in Appendix C.2, PiT-PO maintains a significant performance edge even when evaluated under equivalent wall-clock time, demonstrating that the accelerated convergence speed effectively compensates for the additional training time.
4.5. Ablation Study
To rigorously validate the contribution of each algorithmic component, we conduct an ablation study across three settings: w/o Phy, which excludes the physics-consistency penalty $P_{\text{phy}}$ ; w/o TokenReg, which removes the redundancy-aware token-level regularization; and the full PiT-PO framework. As shown in Figure 3, removing any single component leads to a substantial deterioration of NMSE and a larger generalization gap between In-Distribution (ID) and Out-Of-Distribution (OOD) data. These empirical results underscore the necessity of the complete framework, demonstrating that the proposed dual constraints are indispensable for ensuring both search stability and robust generalization.
Figure 4. Schematic of the geometries for periodic hills.
4.6. Case Study: Turbulence Modeling
To validate the practical utility of PiT-PO in high-fidelity scientific discovery, we select the Flow over Periodic Hills (Xiao et al., 2020) (Figure 4) as a testbed. This problem is widely recognized in Computational Fluid Dynamics (CFD) (Pope, 2000) as a benchmark for Separated Turbulent Flows, presenting complex features such as strong adverse pressure gradients, massive flow detachment, and reattachment.
Problem Definition and Physics: The geometry consists of a sequence of polynomially shaped hills arranged periodically in the streamwise direction. The flow is driven by a constant body force at a bulk Reynolds number of $Re_{b}=5600$ (based on hill height $H$ and bulk velocity $U_{b}$ ). The domain height is fixed at $L_{y}/H=3.036$ , while the streamwise length $L_{x}$ varies with the slope factor $\alpha$ according to $L_{x}/H=3.858\alpha+5.142$ . Periodic boundary conditions are applied in the streamwise direction, with no-slip conditions on the walls.
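The geometric relation above can be sketched directly (the function name is our own; the coefficients are those stated in the text):

```python
def streamwise_length(alpha, H=1.0):
    """Streamwise domain length L_x for the periodic-hill geometry,
    following L_x/H = 3.858*alpha + 5.142 (domain height fixed at L_y/H = 3.036)."""
    return (3.858 * alpha + 5.142) * H

# Steeper-to-flatter hill profiles as the slope factor alpha grows.
for alpha in (0.5, 0.8, 1.0):
    print(alpha, streamwise_length(alpha))
```

For the baseline slope factor $\alpha = 1.0$, this gives $L_x/H = 9.0$, consistent with the streamwise extent shown in Figure 4.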
The Scientific Challenge: The challenge lies in the Separation Bubble (Pope, 2000), a region where turbulence exhibits strong anisotropy due to streamline curvature. Traditional Linear Eddy Viscosity Models (LEVM) (Pope, 2000), such as the $k$-$\omega$ SST model (Menter, 1994; Menter et al., 2003), rely on the Boussinesq hypothesis, which assumes isotropic turbulence. Consequently, they systematically fail to predict key flow features, such as the separation bubble size and the reattachment location.
Discovery Objective: Instead of fitting a simple curve, our goal is to discover a Non-linear Constitutive Relation for the Reynolds stress anisotropy tensor $a_{ij}$ and the dimensionless Reynolds stress anisotropy tensor $b_{ij}$ . By learning the Reynolds stress tensor $\tau_{ij}$ from high-fidelity Direct Numerical Simulation (DNS) (Pope, 2000) data, PiT-PO aims to formulate a symbolic correction term that captures the anisotropic physics missed by linear models.
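The target tensors follow the standard turbulence-modeling definitions: with turbulent kinetic energy $k=\tfrac12\tau_{ii}$, the anisotropy tensor is $a_{ij}=\tau_{ij}-\tfrac23 k\,\delta_{ij}$ and its dimensionless form is $b_{ij}=a_{ij}/(2k)$. A minimal NumPy sketch (the helper name is ours; the SR model supplies the symbolic closure itself):

```python
import numpy as np

def anisotropy_tensors(tau):
    """Split a (3, 3) symmetric Reynolds stress tensor into its anisotropic parts.

    Returns (a, b): the dimensional and dimensionless anisotropy tensors.
    """
    k = 0.5 * np.trace(tau)                # turbulent kinetic energy
    a = tau - (2.0 / 3.0) * k * np.eye(3)  # deviatoric (anisotropic) part
    b = a / (2.0 * k)                      # dimensionless anisotropy
    return a, b

# Sanity check: isotropic turbulence has zero anisotropy.
tau_iso = (2.0 / 3.0) * 1.5 * np.eye(3)    # k = 1.5
a, b = anisotropy_tensors(tau_iso)
print(np.allclose(a, 0.0), np.allclose(b, 0.0))  # True True
```

By construction both tensors are traceless, which is the property the linear Boussinesq closure preserves but whose off-diagonal structure it misrepresents in separated flows.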
Baselines: We follow standard turbulence modeling protocols and compare primarily against the standard $k$-$\omega$ SST RANS model. We also include LLM-SR and DSRRANS (Tang et al., 2023a), a strong SR-based method specifically designed for turbulence modeling.
Figure 5. Comparison of the four anisotropic Reynolds stress components for periodic hill training flow using RANS, DSRRANS, LLM-SR, PiT-PO and DNS, respectively.
We cast turbulence closure modeling as an SR problem (Tang et al., 2023b) (see Appendix E for details). After obtaining the final symbolic equation, we embed it into the OpenFOAM (Weller et al., 1998) RANS solver and run CFD simulations on the periodic-hill configuration. We compare the resulting Reynolds-stress components, mean-velocity fields, and skin-friction profiles against DNS references. Figures 5–7 visualize these quantities, enabling a direct assessment of physical fidelity and flow-field prediction quality.
Based on the comparative analysis of the anisotropic Reynolds stress contours (Figure 5), DSRRANS and PiT-PO both improve on the traditional RANS approach. Among them, PiT-PO performs best: its contours match the DNS reference most closely, with smaller errors than DSRRANS and LLM-SR and less severe non-physical extremes.
The stream-wise velocity contours illustrate the correction of the bubble size, a region of reversed flow that forms when fluid detaches from a surface. In Figure 6, PiT-PO most accurately captures the extent and shape of the recirculation zone (the region where fluid circulates within the separated flow), remaining closely consistent with the DNS data throughout the domain, particularly within the separation region and the recovery layer, where the flow re-attaches to the surface.
The skin friction coefficient (Figure 7), defined as the ratio of the wall shear stress to the dynamic pressure of the flow along the bottom wall, is a sensitive indicator of flow separation. The $k$-$\omega$ SST RANS model underestimates the magnitude of the skin friction and predicts a delayed reattachment location relative to the DNS. The learned PiT-PO model improves the prediction, aligning more closely with the DNS profile.
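This definition amounts to $C_f = \tau_w / (\tfrac12 \rho U_b^2)$, with $\rho$ the fluid density and $U_b$ the bulk velocity; a one-line sketch (function name is illustrative):

```python
def skin_friction_coefficient(tau_wall, rho, u_bulk):
    """C_f = tau_w / (0.5 * rho * U_b^2): wall shear stress over dynamic pressure."""
    return tau_wall / (0.5 * rho * u_bulk ** 2)

# Negative wall shear stress marks reversed (separated) flow along the bottom
# wall; the reattachment point is where C_f crosses back through zero.
print(skin_friction_coefficient(0.01, 1.0, 1.0))  # 0.02
```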
These results demonstrate that PiT-PO can generate symbolic equations tailored to turbulence modeling and that, under a posteriori CFD evaluation, the resulting predictions match DNS references more closely, increasing the practical value of LLM-based SR in real scientific and engineering workflows. With the proposed dual constraints, PiT-PO provides targeted search and learning signals that enable the internalization of turbulence priors during equation discovery, thereby steering the model toward physically consistent and domain-relevant structures.
Figure 6. Non-dimensional stream-wise velocity contours obtained by the learned model and the standard $k$-$\omega$ SST model of RANS, compared with DNS data.
5. Related Work
Traditional SR has been studied along several lines, including genetic programming (Koza, 1990), reinforcement learning (Petersen et al., 2021), and transformer-based generation (Biggio et al., 2021). Genetic programming casts equation discovery as an evolutionary search over tree-structured programs, where candidate expressions are iteratively refined via mutation and crossover. Reinforcement learning-based SR, introduced by Petersen et al. (Petersen et al., 2021), has developed into a family of policy-optimization frameworks (Mundhenk et al., 2021; Landajuela et al., 2021; Crochepierre et al., 2022; Du et al., 2023) that formulate SR as a sequential decision-making process. More recently, transformer-based models (Valipour et al., 2021; Vastl et al., 2024; Kamienny et al., 2022; Li et al., 2023; Zhang et al., 2025) have been adopted for SR, using large-scale pretraining to map numerical data directly to equations. However, these methods typically fail to incorporate scientific prior knowledge.
Recent progress in natural language processing has further enabled LLM-based SR methods, including LLM-SR (Shojaee et al., 2025a), LaSR (Grayeli et al., 2024), ICSR (Merler et al., 2024), CoEvo (Guo et al., 2025), and SR-Scientist (Xia et al., 2025). LLM-SR exploits scientific priors that are implicitly captured by LLMs to propose plausible functional forms, followed by data-driven parameter estimation. LaSR augments SR with abstract concept generation to guide hypothesis formation, while ICSR reformulates training examples as in-context prompts to elicit function generation. However, a unifying limitation across these methods is their reliance on the LLM as a frozen generator, which precludes incorporating search feedback to update the generation strategy and consequently restricts their ability to adapt to complex problems.
While some recent works, such as SOAR (Pourcel et al., 2025) and CALM (Huang et al., 2025), have begun to explore adaptive in-search tuning, they primarily focus on algorithm discovery or combinatorial optimization problems, whereas our method is specifically tailored for SR. By integrating hierarchical physical constraints and theorem-guided token regularization, PiT-PO establishes an adaptive framework capable of discovering accurate and physically consistent equations.
Figure 7. Skin friction distribution along the bottom obtained by the learned model and the standard $k$ - $\omega$ SST model of RANS, compared with DNS data.
6. Conclusion
In this work, we introduced PiT-PO, a unified framework that fundamentally transforms LLMs from static equation proposers into adaptive, physics-aware generators for SR. By integrating in-search policy optimization with a novel dual-constraint evaluation mechanism, PiT-PO rigorously enforces hierarchical physical validity while leveraging theorem-guided, token-level penalties to eliminate structural redundancy. This synergistic design aligns generation with numerical fitness, scientific consistency, and parsimony, establishing new state-of-the-art performance on SR benchmarks. Beyond synthetic tasks, PiT-PO demonstrates significant practical utility in turbulence modeling, where the discovered symbolic corrections improve Reynolds stress and flow-field predictions. Notably, PiT-PO achieves these results using small open-source backbones, making it a practical and accessible tool for scientific communities with limited computational resources. Looking forward, we plan to extend PiT-PO to broader scientific and engineering domains by enriching the library of domain-specific constraints and validating it across more complex, real-world systems. Moreover, we anticipate that integrating PiT-PO with larger-scale multi-modal foundation models could further unlock its potential in processing heterogeneous scientific data.
References
- Aakash et al. (2019) B. S. Aakash, JohnPatrick Connors, and Michael D Shields. 2019. Stress-strain data for aluminum 6061-T651 from 9 lots at 6 temperatures under uniaxial and plane strain tension. Data in Brief 25 (Aug 2019), 104085. doi: 10.1016/j.dib.2019.104085
- Biggio et al. (2021) Luca Biggio, Tommaso Bendinelli, Alexander Neitz, Aurelien Lucchi, and Giambattista Parascandolo. 2021. Neural Symbolic Regression that Scales. arXiv:2106.06427 [cs.LG] https://arxiv.org/abs/2106.06427
- Biggio* et al. (2021) L. Biggio*, T. Bendinelli*, A. Neitz, A. Lucchi, and G. Parascandolo. 2021. Neural Symbolic Regression that Scales. In Proceedings of the 38th International Conference on Machine Learning (ICML 2021) (Proceedings of Machine Learning Research, Vol. 139). PMLR, 936–945. https://proceedings.mlr.press/v139/biggio21a.html *Equal contribution.
- Chen et al. (2025) Jindou Chen, Jidong Tian, Liang Wu, ChenXinWei, Xiaokang Yang, Yaohui Jin, and Yanyan Xu. 2025. KinFormer: Generalizable Dynamical Symbolic Regression for Catalytic Organic Reaction Kinetics. In International Conference on Representation Learning, Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu (Eds.), Vol. 2025. 67058–67080. https://proceedings.iclr.cc/paper_files/paper/2025/file/a76b693f36916a5ed84d6e5b39a0dc03-Paper-Conference.pdf
- Cranmer (2023) Miles Cranmer. 2023. Interpretable Machine Learning for Science with PySR and SymbolicRegression.jl. arXiv:2305.01582 [astro-ph.IM] https://arxiv.org/abs/2305.01582
- Crochepierre et al. (2022) Laure Crochepierre, Lydia Boudjeloud-Assala, and Vincent Barbesant. 2022. Interactive Reinforcement Learning for Symbolic Regression from Multi-Format Human-Preference Feedbacks. In IJCAI 2022- 31st International Joint Conference on Artificial Intelligence. Vienne, Austria. https://hal.science/hal-03695471
- Deng et al. (2023) Song Deng, Junjie Wang, Li Tao, Su Zhang, and Hongwei Sun. 2023. EV charging load forecasting model mining algorithm based on hybrid intelligence. Computers and Electrical Engineering 112 (2023), 109010. doi: 10.1016/j.compeleceng.2023.109010
- Du et al. (2023) Mengge Du, Yuntian Chen, and Dongxiao Zhang. 2023. DISCOVER: Deep identification of symbolically concise open-form PDEs via enhanced reinforcement-learning. arXiv:2210.02181 [cs.LG] https://arxiv.org/abs/2210.02181
- Fletcher (1987) Roger Fletcher. 1987. Practical Methods of Optimization (2nd ed.). John Wiley & Sons, Chichester, New York.
- Grayeli et al. (2024) Arya Grayeli, Atharva Sehgal, Omar Costilla-Reyes, Miles Cranmer, and Swarat Chaudhuri. 2024. Symbolic Regression with a Learned Concept Library. arXiv:2409.09359 [cs.LG] https://arxiv.org/abs/2409.09359
- Guo et al. (2025) Ping Guo, Qingfu Zhang, and Xi Lin. 2025. CoEvo: Continual Evolution of Symbolic Solutions Using Large Language Models. arXiv:2412.18890 [cs.AI] https://arxiv.org/abs/2412.18890
- Hu et al. (2021) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685 [cs.CL] https://arxiv.org/abs/2106.09685
- Huang et al. (2025) Ziyao Huang, Weiwei Wu, Kui Wu, Jianping Wang, and Wei-Bin Lee. 2025. CALM: Co-evolution of Algorithms and Language Model for Automatic Heuristic Design. arXiv:2505.12285 [cs.NE] https://arxiv.org/abs/2505.12285
- Kamienny et al. (2022) Pierre-Alexandre Kamienny, Stéphane d'Ascoli, Guillaume Lample, and François Charton. 2022. End-to-end symbolic regression with transformers. arXiv:2204.10532 [cs.LG] https://arxiv.org/abs/2204.10532
- Kassianik et al. (2025) Paul Kassianik, Baturay Saglam, Alexander Chen, Blaine Nelson, Anu Vellore, Massimo Aufiero, Fraser Burch, Dhruv Kedia, Avi Zohary, Sajana Weerawardhena, Aman Priyanshu, Adam Swanda, Amy Chang, Hyrum Anderson, Kojin Oshiba, Omar Santos, Yaron Singer, and Amin Karbasi. 2025. Llama-3.1-FoundationAI-SecurityLLM-Base-8B Technical Report. arXiv:2504.21039 [cs.CR] https://arxiv.org/abs/2504.21039
- Koza (1990) J.R. Koza. 1990. Genetically breeding populations of computer programs to solve problems in artificial intelligence. In Proceedings of the 2nd International IEEE Conference on Tools for Artificial Intelligence. 819–827. doi: 10.1109/TAI.1990.130444
- Landajuela et al. (2022) Mikel Landajuela, Chak Shing Lee, Jiachen Yang, Ruben Glatt, Claudio P Santiago, Ignacio Aravena, Terrell Mundhenk, Garrett Mulcahy, and Brenden K Petersen. 2022. A Unified Framework for Deep Symbolic Regression. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., 33985–33998. https://proceedings.neurips.cc/paper_files/paper/2022/file/dbca58f35bddc6e4003b2dd80e42f838-Paper-Conference.pdf
- Landajuela et al. (2021) Mikel Landajuela, Brenden K. Petersen, Soo K. Kim, Claudio P. Santiago, Ruben Glatt, T. Nathan Mundhenk, Jacob F. Pettit, and Daniel M. Faissol. 2021. Improving exploration in policy gradient search: Application to symbolic optimization. arXiv:2107.09158 [cs.LG] https://arxiv.org/abs/2107.09158
- Li et al. (2023) Wenqiang Li, Weijun Li, Linjun Sun, Min Wu, Lina Yu, Jingyi Liu, Yanjie Li, and Song Tian. 2023. Transformer-based model for symbolic regression via joint supervised learning. In International Conference on Learning Representations. https://api.semanticscholar.org/CorpusID:259298765
- Ma et al. (2024) Pingchuan Ma, Tsun-Hsuan Wang, Minghao Guo, Zhiqing Sun, Joshua B. Tenenbaum, Daniela Rus, Chuang Gan, and Wojciech Matusik. 2024. LLM and Simulation as Bilevel Optimizers: A New Paradigm to Advance Physical Scientific Discovery. In Proceedings of the 41st International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 235), Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp (Eds.). PMLR, 33940–33962. https://proceedings.mlr.press/v235/ma24m.html
- Makke and Chawla (2024a) Nour Makke and Sanjay Chawla. 2024a. Data-driven discovery of Tsallis-like distribution using symbolic regression in high-energy physics. PNAS Nexus 3, 11 (10 2024), pgae467. arXiv:https://academic.oup.com/pnasnexus/article-pdf/3/11/pgae467/60816181/pgae467.pdf doi: 10.1093/pnasnexus/pgae467
- Makke and Chawla (2024b) Nour Makke and Sanjay Chawla. 2024b. Interpretable scientific discovery with symbolic regression: a review. Artificial Intelligence Review 57 (01 2024). doi: 10.1007/s10462-023-10622-0
- Menter (1994) Florian R Menter. 1994. Two-equation eddy-viscosity turbulence models for engineering applications. AIAA Journal 32, 8 (1994), 1598–1605.
- Menter et al. (2003) Florian R Menter, Martin Kuntz, Robin Langtry, et al. 2003. Ten years of industrial experience with the SST turbulence model. Turbulence, Heat and Mass Transfer 4, 1 (2003), 625–632.
- Merler et al. (2024) Matteo Merler, Katsiaryna Haitsiukevich, Nicola Dainese, and Pekka Marttinen. 2024. In-Context Symbolic Regression: Leveraging Large Language Models for Function Discovery. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop). Association for Computational Linguistics, 589–606. doi: 10.18653/v1/2024.acl-srw.49
- Mochizuki and Osaka (2000) Shinsuke Mochizuki and Hideo Osaka. 2000. Management of a Stronger Wall Jet by a Pair of Streamwise Vortices: Reynolds Stress Tensor and Production Terms. Transactions of the Japan Society of Mechanical Engineers Series B 66, 646 (2000), 1309–1317. doi: 10.1299/kikaib.66.646_1309
- Monkewitz (2021) Peter A. Monkewitz. 2021. Asymptotics of streamwise Reynolds stress in wall turbulence. Journal of Fluid Mechanics 931 (Nov. 2021). doi: 10.1017/jfm.2021.924
- Monod (1949) Jacques Monod. 1949. The Growth of Bacterial Cultures. Annual Review of Microbiology 3 (1949), 371–394. doi: 10.1146/annurev.mi.03.100149.002103
- Mundhenk et al. (2021) T. Nathan Mundhenk, Mikel Landajuela, Ruben Glatt, Claudio P. Santiago, Daniel M. Faissol, and Brenden K. Petersen. 2021. Symbolic Regression via Neural-Guided Genetic Programming Population Seeding. arXiv:2111.00053 [cs.NE] https://arxiv.org/abs/2111.00053
- Neamtiu et al. (2005) Iulian Neamtiu, Jeffrey S. Foster, and Michael Hicks. 2005. Understanding source code evolution using abstract syntax tree matching. In Proceedings of the 2005 International Workshop on Mining Software Repositories (St. Louis, Missouri) (MSR '05). Association for Computing Machinery, New York, NY, USA, 1–5. doi: 10.1145/1083142.1083143
- Petersen et al. (2021) Brenden K. Petersen, Mikel Landajuela, T. Nathan Mundhenk, Claudio P. Santiago, Soo K. Kim, and Joanne T. Kim. 2021. Deep symbolic regression: Recovering mathematical expressions from data via risk-seeking policy gradients. arXiv:1912.04871 [cs.LG] https://arxiv.org/abs/1912.04871
- Pope (2000) Stephen B. Pope. 2000. Turbulent Flows. Cambridge University Press. doi: 10.1017/cbo9780511840531
- Pourcel et al. (2025) Julien Pourcel, Cédric Colas, and Pierre-Yves Oudeyer. 2025. Self-Improving Language Models for Evolutionary Program Synthesis: A Case Study on ARC-AGI. arXiv:2507.14172 [cs.LG] https://arxiv.org/abs/2507.14172
- Reuter et al. (2023) Julia Reuter, Hani Elmestikawy, Fabien Evrard, Sanaz Mostaghim, and Berend van Wachem. 2023. Graph Networks as Inductive Bias for Genetic Programming: Symbolic Models for Particle-Laden Flows. In Genetic Programming, Gisele Pappa, Mario Giacobini, and Zdenek Vasicek (Eds.). Springer Nature Switzerland, Cham, 36ā51.
- Romera-Paredes et al. (2024) Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. 2024. Mathematical discoveries from program search with large language models. Nature 625, 7995 (2024), 468–475.
- Rosso et al. (1995) L Rosso, J. R. Lobry, S Bajard, and J. P. Flandrois. 1995. Convenient Model To Describe the Combined Effects of Temperature and pH on Microbial Growth. Applied and Environmental Microbiology 61, 2 (Feb 1995), 610–616. doi: 10.1128/aem.61.2.610-616.1995
- Schmidt and Lipson (2009) Michael Schmidt and Hod Lipson. 2009. Distilling Free-Form Natural Laws from Experimental Data. Science 324, 5923 (2009), 81–85. arXiv:https://www.science.org/doi/pdf/10.1126/science.1165893 doi: 10.1126/science.1165893
- Shao et al. (2024) Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300 [cs.CL] https://arxiv.org/abs/2402.03300
- Shi et al. (2024) Hao Shi, Weili Song, Xinting Zhang, Jiahe Shi, Cuicui Luo, Xiang Ao, Hamid Arian, and Luis Seco. 2024. AlphaForge: A Framework to Mine and Dynamically Combine Formulaic Alpha Factors. arXiv:2406.18394 [q-fin.CP] https://arxiv.org/abs/2406.18394
- Shojaee et al. (2025a) Parshin Shojaee, Kazem Meidani, Shashank Gupta, Amir Barati Farimani, and Chandan K Reddy. 2025a. LLM-SR: Scientific Equation Discovery via Programming with Large Language Models. arXiv:2404.18400 [cs.LG] https://arxiv.org/abs/2404.18400
- Shojaee et al. (2025b) Parshin Shojaee, Ngoc-Hieu Nguyen, Kazem Meidani, Amir Barati Farimani, Khoa D Doan, and Chandan K Reddy. 2025b. LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models. arXiv:2504.10415 [cs.CL] https://arxiv.org/abs/2504.10415
- Tang et al. (2023a) Hongwei Tang, Yan Wang, Tongguang Wang, and Linlin Tian. 2023a. Discovering explicit Reynolds-averaged turbulence closures for turbulent separated flows through deep learning-based symbolic regression with non-linear corrections. Physics of Fluids 35, 2 (2023), 025118. arXiv:https://doi.org/10.1063/5.0135638 doi: 10.1063/5.0135638
- Tang et al. (2023b) Hongwei Tang, Yan Wang, Tongguang Wang, and Linlin Tian. 2023b. Discovering explicit Reynolds-averaged turbulence closures for turbulent separated flows through deep learning-based symbolic regression with non-linear corrections. Physics of Fluids 35, 2 (Feb. 2023). doi: 10.1063/5.0135638
- Tennekes and Lumley (1972) Henk Tennekes and John L. Lumley. 1972. A First Course in Turbulence. The MIT Press. doi: 10.7551/mitpress/3014.001.0001
- Valipour et al. (2021) Mojtaba Valipour, Bowen You, Maysum Panju, and Ali Ghodsi. 2021. SymbolicGPT: A Generative Transformer Model for Symbolic Regression. arXiv:2106.14131 [cs.LG] https://arxiv.org/abs/2106.14131
- Vastl et al. (2024) Martin Vastl, Jonáš Kulhánek, Jiří Kubalík, Erik Derner, and Robert Babuška. 2024. SymFormer: End-to-End Symbolic Regression Using Transformer-Based Architecture. IEEE Access 12 (2024), 37840–37849. doi: 10.1109/ACCESS.2024.3374649
- Virgolin and Pissis (2022) Marco Virgolin and Solon P. Pissis. 2022. Symbolic Regression is NP-hard. arXiv:2207.01018 [cs.NE] https://arxiv.org/abs/2207.01018
- Wahlquist et al. (2024) Ylva Wahlquist, Jesper Sundell, and Kristian Soltesz. 2024. Learning pharmacometric covariate model structures with symbolic regression networks. Journal of Pharmacokinetics and Pharmacodynamics 51, 2 (2024), 155–167. doi: 10.1007/s10928-023-09887-3
- Wang et al. (2019) Wei Wang, Yang Zhang, and Lili Chen. 2019. An application of the shear stress transport low-Reynolds-number k-ε turbulence model on turbulent flows. Acta Aerodynamica Sinica 37, 3 (2019), 419–425. doi: 10.7638/kqdlxxb-2016.0158
- Weller et al. (1998) H.G. Weller, Gavin Tabor, Hrvoje Jasak, and Christer Fureby. 1998. A Tensorial Approach to Computational Continuum Mechanics Using Object Orientated Techniques. Computers in Physics 12 (11 1998), 620–631. doi: 10.1063/1.168744
- Xia et al. (2025) Shijie Xia, Yuhan Sun, and Pengfei Liu. 2025. SR-Scientist: Scientific Equation Discovery With Agentic AI. arXiv:2510.11661 [cs.AI] https://arxiv.org/abs/2510.11661
- Xiao et al. (2020) Heng Xiao, Jin-Long Wu, Sylvain Laizet, and Lian Duan. 2020. Flows over periodic hills of parameterized geometries: A dataset for data-driven turbulence modeling from direct simulations. Computers & Fluids 200 (2020), 104431. doi: 10.1016/j.compfluid.2020.104431
- Zhang et al. (2025) Hengzhe Zhang, Qi Chen, Bing XUE, Wolfgang Banzhaf, and Mengjie Zhang. 2025. RAG-SR: Retrieval-Augmented Generation for Neural Symbolic Regression. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=NdHka08uWn
| Hyperparameter | Symbol | Value | Description |
| --- | --- | --- | --- |
| Fitting log-MSE scale | $\alpha$ | 1.0 | Scaling in $R_{\mathrm{fit}}=-\alpha\log(\mathrm{MSE}+\epsilon)$. |
| Complexity penalty weight | $\lambda_{\mathrm{len}}$ | 5e-3 | Weight in $P_{\mathrm{cplx}}=\lambda_{\mathrm{len}}\cdot\mathrm{Length}(\mathrm{AST})$. |
| Dimensional homogeneity penalty weight | $P_{\mathrm{dim}}$ | 1.0 | Penalty/reward weight for unit inconsistency (general-level constraint). |
| Differentiability penalty weight | $P_{\mathrm{diff}}$ | 0.5 | Penalty/reward weight for non-smoothness on the data-defined domain (general-level constraint). |
| Domain-specific constraint weight ($j$-th) | $P_{\mathrm{domain}}^{(j)}$ | 0.5 | Weight for the $j$-th domain prior (e.g., realizability, wall BC, asymptotic scaling, energy consistency). |
| Gating threshold for physics constraints | $\delta_{\mathrm{gate}}$ | $10^{-3}\cdot\mathrm{MSE}_{\mathrm{initial}}$ | Activate physics penalties only after fitting reaches a sufficiently low-error regime. |
| Numerical stability constant in $\tau_{i}$ | $\epsilon$ | 1e-50 | Used in $\tau_{i}=\frac{|b_{i}|}{\sum_{j}|b_{j}|+\epsilon}$ to avoid division by zero. |
| Redundancy threshold | $\rho$ | 1e-2 | A term is considered redundant if $\tau_{i}\leq\rho$. |
| Token penalty scale | $p$ | 0.5 | Scaling coefficient in $P_{\mathrm{tok}}=p\cdot\max(0,-\log(|b_{i}|+\epsilon))$. |
Table 3. Hyperparameters for dual-constraint evaluation and token-level signal synthesis.
Appendix A Additional Hyperparameters of PiT-PO
Table 3 lists the additional hyperparameters of PiT-PO, including (i) general-level physical penalties (dimensional homogeneity and differentiability), (ii) domain-specific constraint penalties, (iii) the gated activation threshold for physical constraints, and (iv) theorem-guided redundancy detection and token-level penalization.
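As a minimal illustration of how these hyperparameters enter the reward, the following sketch composes the fitting reward, complexity penalty, and theorem-guided token penalty using the values in Table 3. This is a hedged sketch: the function names and interfaces are illustrative, not PiT-PO's actual implementation.

```python
import numpy as np

# Values from Table 3; names mirror the paper's symbols, but the
# function interfaces below are illustrative only.
ALPHA = 1.0        # alpha in R_fit = -alpha * log(MSE + eps)
LAMBDA_LEN = 5e-3  # lambda_len in P_cplx = lambda_len * Length(AST)
P_SCALE = 0.5      # token penalty scale p
EPS = 1e-50        # numerical stability constant epsilon
RHO = 1e-2         # redundancy threshold rho on tau_i

def fitting_reward(mse: float) -> float:
    """R_fit = -alpha * log(MSE + eps)."""
    return -ALPHA * np.log(mse + EPS)

def complexity_penalty(ast_length: int) -> float:
    """P_cplx = lambda_len * Length(AST)."""
    return LAMBDA_LEN * ast_length

def token_penalties(coeffs: np.ndarray) -> np.ndarray:
    """Token-level penalties for terms flagged redundant by tau_i <= rho."""
    abs_b = np.abs(coeffs)
    tau = abs_b / (abs_b.sum() + EPS)                        # tau_i
    p_tok = P_SCALE * np.maximum(0.0, -np.log(abs_b + EPS))  # P_tok per term
    return np.where(tau <= RHO, p_tok, 0.0)                  # only redundant terms
```

For example, a fitted term with coefficient $10^{-6}$ next to an $O(1)$ term yields $\tau_{i}\approx 10^{-6}\leq\rho$ and receives a positive penalty, while the dominant term is left untouched.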
Appendix B Support Exclusion Theorem: Theory and Practice
B.1. Preliminaries: Empirical Function Space and Orthogonality
**Definition B.1 (Basis functions / dictionary)**
*Let $\mathcal{X}\subseteq\mathbb{R}^{d}$ be the domain, and let $\{\phi_{j}:\mathcal{X}\to\mathbb{R}\}_{j\in\mathcal{S}}$ be a collection of candidate basis (dictionary) functions indexed by $\mathcal{S}$. For any subset $\mathcal{K}\subseteq\mathcal{S}$, denote the corresponding model subspace by
$$
\mathrm{span}\{\phi_{j}:j\in\mathcal{K}\}.
$$*
**Definition B.2 (Target function $f^{*}$)**
*Assume the ground-truth target function admits a sparse expansion over the dictionary:
$$
f^{*}(x)=\sum_{j\in\mathcal{S}^{\prime}}a_{j}\,\phi_{j}(x),
$$
where $\mathcal{S}^{\prime}\subseteq\mathcal{S}$ is the true support set (indices of nonzero terms) and satisfies $|\mathcal{S}^{\prime}|\leq M$. Moreover, the true coefficients are bounded as
$$
A\leq|a_{j}|\leq B,\qquad\forall j\in\mathcal{S}^{\prime}.
$$*
**Definition B.3 (Empirical inner product and empirical norm)**
*Given a finite dataset $D=\{x_{n}\}_{n=1}^{N}\subseteq\mathcal{X}$, for any two functions $f,g:\mathcal{X}\to\mathbb{R}$, define the empirical inner product by
$$
\langle f,g\rangle_{D}:=\frac{1}{N}\sum_{n=1}^{N}f(x_{n})\,g(x_{n}),
$$
and the induced empirical norm by
$$
\|f\|_{D}:=\sqrt{\langle f,f\rangle_{D}}.
$$
The empirical function space
$$
\mathcal{F}_{D}:=\{(f(x_{1}),\ldots,f(x_{N})):\ f:\mathcal{X}\to\mathbb{R}\}
$$
is a finite-dimensional inner-product space under $\langle\cdot,\cdot\rangle_{D}$.*
**Definition B.4 (Empirical Orthogonality)**
*Two functions $f,g\in\mathcal{F}_{D}$ are empirically orthogonal if $\langle f,g\rangle_{D}=0$.*
Convention. Unless stated otherwise, throughout this appendix we write $\langle\cdot,\cdot\rangle$ and $\|\cdot\|$ to mean the empirical inner product $\langle\cdot,\cdot\rangle_{D}$ and the empirical norm $\|\cdot\|_{D}$.
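Definitions B.3 and B.4 can be made concrete with a short numerical sketch (NumPy assumed; the function names are illustrative, not from the paper's code):

```python
import numpy as np

def empirical_inner(f_vals: np.ndarray, g_vals: np.ndarray) -> float:
    """<f, g>_D = (1/N) * sum_n f(x_n) g(x_n)  (Definition B.3)."""
    return float(np.mean(f_vals * g_vals))

def empirical_norm(f_vals: np.ndarray) -> float:
    """||f||_D = sqrt(<f, f>_D)."""
    return float(np.sqrt(empirical_inner(f_vals, f_vals)))

# Example: on a uniform grid covering one full period, sin and cos are
# empirically orthogonal (Definition B.4) up to floating-point error,
# and ||sin||_D is close to sqrt(1/2).
x = np.linspace(0.0, 2 * np.pi, 1000, endpoint=False)
print(empirical_inner(np.sin(x), np.cos(x)))
print(empirical_norm(np.sin(x)))
```

Note that empirical orthogonality depends on the dataset $D$: the same pair of functions may lose this property on a non-uniform or truncated grid.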
B.2. Proof of the Support Exclusion Theorem
**Theorem B.5 (Support Exclusion Theorem)**
*Let $\mathcal{K}\subseteq\mathcal{S}$ be a candidate skeleton (active set). Consider the empirical least-squares fit over $\mathcal{K}$:
$$
b\in\arg\min_{\{b_{j}\}_{j\in\mathcal{K}}}\left\|f^{*}-\sum_{j\in\mathcal{K}}b_{j}\phi_{j}\right\|^{2}.
$$
Define the empirical Gram matrix $G\in\mathbb{R}^{|\mathcal{S}|\times|\mathcal{S}|}$ by
$$
G_{ij}:=\langle\phi_{i},\phi_{j}\rangle,\qquad i,j\in\mathcal{S},
$$
and assume $G_{ii}>0$. For any $i\in\mathcal{S}$, define
$$
T_{ij}:=\frac{G_{ji}}{G_{ii}},\qquad j\in\mathcal{S}.
$$
Further define the external-candidate vector
$$
u_{i}:=\big(|T_{i\ell}|\big)_{\ell\in\mathcal{S}\setminus\mathcal{K}}\in\mathbb{R}^{|\mathcal{S}\setminus\mathcal{K}|},
$$
and let $s(1)\geq s(2)\geq\cdots\geq s(|\mathcal{S}\setminus\mathcal{K}|)$ be the entries of $u_{i}$ sorted in non-increasing order. If for some $i\in\mathcal{K}$,
$$
|b_{i}|<A-\left(\underbrace{\sum_{j\in\mathcal{K},\,j\neq i}(B+|b_{j}|)\,|T_{ij}|}_{\textnormal{Internal Interference}}+\underbrace{B\sum_{k=1}^{|\mathcal{S}\setminus\mathcal{K}|}s(k)}_{\textnormal{External Interference}}\right),
$$
then $i\notin\mathcal{S}^{\prime}$.*
*Proof.*
Step 1: Empirical orthogonality (first-order optimality of least squares). Let
$$
g(x):=\sum_{j\in\mathcal{K}}b_{j}\phi_{j}(x),\qquad r(x):=f^{*}(x)-g(x).
$$
The objective is $\|r\|^{2}=\langle r,r\rangle$. For any $i\in\mathcal{K}$, the first-order optimality condition yields
$$
0=\frac{\partial}{\partial b_{i}}\|r\|^{2}=-2\langle r,\phi_{i}\rangle\quad\Longrightarrow\quad\langle r,\phi_{i}\rangle=0.
$$
Hence, the residual $r$ is empirically orthogonal to $\mathrm{span}\{\phi_{i}:i\in\mathcal{K}\}$.

Step 2: Set decomposition. Define three disjoint sets
$$
\mathcal{T}:=\mathcal{K}\cap\mathcal{S}^{\prime},\qquad\mathcal{F}:=\mathcal{K}\setminus\mathcal{S}^{\prime},\qquad\mathcal{R}:=\mathcal{S}^{\prime}\setminus\mathcal{K}.
$$
Then $\mathcal{K}=\mathcal{T}\cup\mathcal{F}$ and $\mathcal{S}^{\prime}=\mathcal{T}\cup\mathcal{R}$ . Expanding the residual gives
$$
r=\sum_{j\in\mathcal{T}}(a_{j}-b_{j})\phi_{j}+\sum_{j\in\mathcal{R}}a_{j}\phi_{j}-\sum_{j\in\mathcal{F}}b_{j}\phi_{j}.
$$

Step 3: An exact coefficient identity for $i\in\mathcal{T}$. Fix any $i\in\mathcal{T}\subseteq\mathcal{K}$. Using $\langle r,\phi_{i}\rangle=0$ and writing $G_{ji}=\langle\phi_{j},\phi_{i}\rangle$, we obtain
$$
(a_{i}-b_{i})G_{ii}+\sum_{j\in\mathcal{T},\,j\neq i}(a_{j}-b_{j})G_{ji}+\sum_{j\in\mathcal{R}}a_{j}G_{ji}-\sum_{j\in\mathcal{F}}b_{j}G_{ji}=0.
$$
Dividing by $G_{ii}>0$ and using $T_{ij}=G_{ji}/G_{ii}$ yields
$$
a_{i}-b_{i}=-\sum_{j\in\mathcal{T},\,j\neq i}(a_{j}-b_{j})T_{ij}-\sum_{j\in\mathcal{R}}a_{j}T_{ij}+\sum_{j\in\mathcal{F}}b_{j}T_{ij}.
$$

Step 4: Upper bound via internal and external interference. Taking absolute values and applying the triangle inequality gives
$$
|a_{i}-b_{i}|\leq\sum_{j\in\mathcal{T},\,j\neq i}|a_{j}-b_{j}|\,|T_{ij}|+\sum_{j\in\mathcal{R}}|a_{j}|\,|T_{ij}|+\sum_{j\in\mathcal{F}}|b_{j}|\,|T_{ij}|.
$$
For $j\in\mathcal{T}\subseteq\mathcal{S}^{\prime}$, we have $|a_{j}|\leq B$ and thus
$$
|a_{j}-b_{j}|\leq|a_{j}|+|b_{j}|\leq B+|b_{j}|.
$$
Therefore,
$$
|a_{i}-b_{i}|\leq\sum_{j\in\mathcal{T},\,j\neq i}(B+|b_{j}|)\,|T_{ij}|+\sum_{j\in\mathcal{F}}|b_{j}|\,|T_{ij}|+B\sum_{j\in\mathcal{R}}|T_{ij}|.
$$
Since $|b_{j}|\leq B+|b_{j}|$ for all $j\in\mathcal{F}$, we further have
$$
\sum_{j\in\mathcal{F}}|b_{j}|\,|T_{ij}|\leq\sum_{j\in\mathcal{F}}(B+|b_{j}|)\,|T_{ij}|.
$$
Combining $\mathcal{T}\setminus\{i\}$ and $\mathcal{F}$ yields the internal-interference term:
$$
|a_{i}-b_{i}|\leq\underbrace{\sum_{j\in\mathcal{K},\,j\neq i}(B+|b_{j}|)\,|T_{ij}|}_{\textnormal{Internal Interference}}+B\sum_{j\in\mathcal{R}}|T_{ij}|.
$$
For the external term, note that $\mathcal{R}\subseteq\mathcal{S}\setminus\mathcal{K}$, and the entries $\{|T_{i\ell}|\}_{\ell\in\mathcal{S}\setminus\mathcal{K}}$ sorted in non-increasing order are $s(1)\geq\cdots\geq s(|\mathcal{S}\setminus\mathcal{K}|)$. Hence, for any subset $\mathcal{R}$,
$$
\sum_{j\in\mathcal{R}}|T_{ij}|\leq\sum_{k=1}^{|\mathcal{R}|}s(k).
$$
Assuming $|\mathcal{S}^{\prime}|\leq M$ and that the current support is partially correct, i.e., $\mathcal{K}\cap\mathcal{S}^{\prime}\neq\emptyset$, we have $|\mathcal{R}|=|\mathcal{S}^{\prime}\setminus\mathcal{K}|\leq|\mathcal{S}^{\prime}|-1\leq M-1$. Define $m:=\min\!\big(M-1,\ |\mathcal{S}\setminus\mathcal{K}|\big)$. Since also $|\mathcal{R}|\leq|\mathcal{S}\setminus\mathcal{K}|$, it follows that $|\mathcal{R}|\leq m$, and thus
$$
\sum_{j\in\mathcal{R}}|T_{ij}|\leq\sum_{k=1}^{|\mathcal{R}|}s(k)\leq\sum_{k=1}^{m}s(k).
$$
Thus,
$$
|a_{i}-b_{i}|\leq\sum_{j\in\mathcal{K},\,j\neq i}(B+|b_{j}|)\,|T_{ij}|+B\sum_{k=1}^{m}s(k).
$$

Step 5: Lower bound when $i\in\mathcal{S}^{\prime}$. If $i\in\mathcal{S}^{\prime}$, then $|a_{i}|\geq A$. By the reverse triangle inequality,
$$
|a_{i}-b_{i}|\geq\big||a_{i}|-|b_{i}|\big|\geq A-|b_{i}|.
$$

Step 6: Contradiction. Assume, for contradiction, that there exists $i\in\mathcal{K}\cap\mathcal{S}^{\prime}$ satisfying
$$
|b_{i}|<A-\left(\sum_{j\in\mathcal{K},\,j\neq i}(B+|b_{j}|)\,|T_{ij}|+B\sum_{k=1}^{m}s(k)\right).
$$
Rearranging yields
$$
A-|b_{i}|>\sum_{j\in\mathcal{K},\,j\neq i}(B+|b_{j}|)\,|T_{ij}|+B\sum_{k=1}^{m}s(k).
$$
However, Step 5 implies $|a_{i}-b_{i}|\geq A-|b_{i}|$, while the bound above implies
$$
|a_{i}-b_{i}|\leq\sum_{j\in\mathcal{K},\,j\neq i}(B+|b_{j}|)\,|T_{ij}|+B\sum_{k=1}^{m}s(k),
$$
which is a contradiction. Therefore such an $i$ cannot belong to $\mathcal{S}^{\prime}$, i.e., $i\notin\mathcal{S}^{\prime}$. ∎
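For intuition, the exclusion test of Theorem B.5 can be evaluated numerically. The sketch below (NumPy assumed; all names are illustrative and not part of PiT-PO's released code) fits $b$ by least squares over the active set $\mathcal{K}$, forms the empirical Gram matrix, and checks the bound for each active term, using the tighter $m$-term external sum from Step 4 of the proof:

```python
import numpy as np

def exclusion_certificate(Phi, f_vals, K, A, B, M):
    """Check the Support Exclusion bound for each index i in the active set K.

    Phi:    (N, |S|) matrix with Phi[:, j] = phi_j evaluated on the data.
    f_vals: (N,) target values f*(x_n).
    K:      list of active column indices (candidate skeleton).
    A, B:   lower/upper bounds on |a_j| over the true support.
    M:      sparsity bound |S'| <= M.
    Returns {i: True} when the theorem certifies i is not in the true support.
    """
    N, S = Phi.shape
    # Empirical least-squares fit over K (definition of b in the theorem).
    b, *_ = np.linalg.lstsq(Phi[:, K], f_vals, rcond=None)
    G = Phi.T @ Phi / N  # empirical Gram matrix G_ij = <phi_i, phi_j>_D
    others = [l for l in range(S) if l not in K]
    m = min(M - 1, len(others))  # m = min(M - 1, |S \ K|)
    out = {}
    for pos, i in enumerate(K):
        T = G[:, i] / G[i, i]  # T_ij = G_ji / G_ii
        internal = sum((B + abs(b[q])) * abs(T[j])
                       for q, j in enumerate(K) if j != i)
        s = np.sort(np.abs(T[others]))[::-1]  # external entries, non-increasing
        external = B * s[:m].sum()
        out[i] = bool(abs(b[pos]) < A - (internal + external))
    return out
```

With a nearly orthogonal dictionary (e.g., low-order Fourier modes on a uniform grid) and $f^{*}=2\sin x$, the spurious $\cos x$ term in $\mathcal{K}$ is certified excluded while the true $\sin x$ term is not, matching the theorem's intent.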
Appendix C Supplementary Diagnostics and Analyses
C.1. Remaining Iteration Curves on Smaller Backbones
In the main text, we report iteration curves for the 8B backbone. Here we provide the remaining curves for 3B and 1B (Figure 8), which corroborate the observations in Section 4.4.
(a) Llama-3B
<details>
<summary>x9.png Details</summary>

Four line charts (2x2 grid) compare LLM-SR (blue) and PiT-PO (red) on Oscillation 1, Oscillation 2, E. coli Growth, and Stress-Strain, plotting NMSE (log scale) against iteration (0-2500); shaded bands indicate variability. PiT-PO attains lower final NMSE through step-wise decreases, whereas LLM-SR stays comparatively flat.
</details>
(b) Llama-1B
Figure 8. NMSE versus iteration on smaller backbones, consistent with the trends in the 8B setting.
<details>
<summary>x10.png Details</summary>

Four line charts (2x2 grid) compare LLM-SR (blue) and PiT-PO (red) on Oscillation 1, Oscillation 2, E. coli Growth, and Stress-Strain, plotting best-so-far NMSE (log scale) against wall-clock time (0-6 hours); shaded bands indicate variability. PiT-PO reaches much lower NMSE on the two oscillator tasks, with the two methods closer on the remaining tasks.
</details>
(a) Llama-3.1-8B
<details>
<summary>x11.png Details</summary>

Four line charts (2x2 grid) compare LLM-SR (blue) and PIT-PO (red) on Oscillation 1, Oscillation 2, E. coli Growth, and Stress-Strain, plotting NMSE (log scale) against wall-clock time (0-6 hours); shaded bands indicate variability. PIT-PO decreases step-wise to substantially lower NMSE on all four tasks, most notably on Oscillation 1.
</details>
(b) Llama-3.2-3B
<details>
<summary>x12.png Details</summary>

Four line charts (2x2 grid) compare LLM-SR (blue) and PIT-PO (red) on Oscillation 1, Oscillation 2, E. coli Growth, and Stress-Strain, plotting NMSE (log scale) against wall-clock time (0-6 hours); shaded bands indicate variability. LLM-SR remains roughly flat while PIT-PO decreases step-wise to lower final NMSE on all four tasks.
</details>
(c) Llama-3.2-1B
Figure 9. Wall-clock efficiency under a fixed 25,000-second budget.
C.2. Time-Cost Analysis under a Fixed Wall-Clock Budget
To complement the iteration-based efficiency analysis in Section 4.4, we evaluate search efficiency under a strict wall-clock constraint. We fix the total runtime of each method to 25,000 seconds ($\approx$ 6.9 hours) and track the best-so-far NMSE as a function of elapsed time (Figure 9). This protocol directly answers a practical question: given the same compute time, which method returns a more accurate recovered equation?
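The tracking protocol can be sketched as follows, assuming the common definition of NMSE as MSE normalized by the target variance; `propose_and_fit` is a hypothetical stand-in for one search iteration (equation proposal plus coefficient fitting), not part of the paper's released code:

```python
import time
import numpy as np

def nmse(y_true, y_pred):
    """Normalized mean squared error: MSE scaled by the variance of the targets."""
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

def run_with_budget(propose_and_fit, y_true, budget_s=25_000):
    """Record the best-so-far NMSE as a function of elapsed wall-clock time.

    `propose_and_fit` is an illustrative callable returning the predictions
    of one search iteration; the loop stops once the time budget is spent.
    """
    start, best, trajectory = time.monotonic(), np.inf, []
    while time.monotonic() - start < budget_s:
        y_pred = propose_and_fit()
        best = min(best, nmse(y_true, y_pred))
        trajectory.append((time.monotonic() - start, best))
    return trajectory
```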
Across all three backbones (Llama-3.1-8B, Llama-3.2-3B, and Llama-3.2-1B), PiT-PO consistently yields lower NMSE trajectories than LLM-SR under the same 25,000-second budget. In line with the behavior discussed in Section 4.4, the curves exhibit a characteristic divergence in the low-error regime: LLM-SR often plateaus after an initial reduction, while PiT-PO continues to realize step-wise decreases over time, indicating more effective late-stage credit assignment and continued progress rather than stagnation.
The advantage is most pronounced on the two oscillator systems. For 8B and 3B, PiT-PO achieves rapid early improvements and maintains additional phase transitions later in the run, reaching substantially lower NMSE within the same wall-clock budget. For the 1B model, both methods converge more slowly overall, yet PiT-PO still more reliably escapes stagnation and attains markedly lower final NMSE, suggesting that the proposed in-search tuning remains effective even in the small-model regime.
Similar trends hold for the E. coli Growth and Stress–Strain tasks, where progress is typically more gradual. Under equal wall-clock time, PiT-PO achieves a lower final NMSE and exhibits clearer late-stage improvements, whereas LLM-SR tends to flatten earlier. Importantly, these gains are obtained without increasing runtime: despite the additional overhead introduced by in-search policy optimization, PiT-PO delivers consistently better return-on-compute within a fixed time budget, validating the practical efficiency of the proposed approach.
| Method | Oscillation 1 | Oscillation 2 | E. coli Growth | Stress–Strain |
| --- | --- | --- | --- | --- |
| LLM-SR (Llama-3.1-8B) | 0.0114 | 8.05e-10 | 0.1676 | 0.1026 |
| LLM-SR (Llama-3.2-3B) | 0.0191 | 5.56e-5 | 809.94 | 0.0264 |
| LLM-SR (Llama-3.2-1B) | 0.1182 | 7.57e-3 | 1888.5 | 0.2689 |
| PiT-PO (Llama-3.1-8B) | 1.63e-30 | 1.36e-11 | 0.1163 | 0.0163 |
| PiT-PO (Llama-3.2-3B) | 2.63e-30 | 8.52e-10 | 0.0237 | 0.0144 |
| PiT-PO (Llama-3.2-1B) | 5.99e-5 | 8.58e-10 | 0.3462 | 0.0124 |
Table 4. OOD evaluation: NMSE on the OOD split.
C.3. OOD Evaluation (NMSE ${}_{\text{OOD}}$ )
We report out-of-distribution performance using NMSE on the OOD split (NMSE ${}_{\text{OOD}}$ ), summarized in Table 4. Across all three backbones (8B/3B/1B) and all four tasks, PiT-PO achieves lower NMSE ${}_{\text{OOD}}$ than LLM-SR under the same setting. In particular, PiT-PO maintains extremely small OOD errors on the two oscillator systems, and yields substantially reduced OOD NMSE on E. coli growth and Stress–Strain, where LLM-SR exhibits noticeably larger errors. These results provide consistent evidence that the improvements observed in the main text persist under OOD evaluation.
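Split-wise evaluation of a fitted equation can be sketched as below; the dict-of-splits layout and the split names are illustrative assumptions, not the benchmark's actual data format:

```python
import numpy as np

def nmse(y_true, y_pred):
    """Normalized mean squared error: MSE scaled by the variance of the targets."""
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

def evaluate_splits(equation, splits):
    """Evaluate a fitted equation on named data splits (e.g. 'id' and 'ood').

    `equation` maps an input array X to predictions; `splits` maps a split
    name to an (X, y) pair. Both conventions are illustrative.
    """
    return {name: nmse(y, equation(X)) for name, (X, y) in splits.items()}
```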
C.4. Equation Iterative Trajectories
<details>
<summary>x13.png Details</summary>

Line chart of best-so-far NMSE (log scale, down to roughly 10^-28) versus iteration (0-2500) for PiT-PO on Oscillator 1, annotated with equation snapshots along the trajectory; step-wise NMSE drops coincide with transitions to cleaner symbolic structures, with a final sharp drop near iteration 1500.
</details>
(a) PiT-PO.
<details>
<summary>x14.png Details</summary>

Line chart of best-so-far NMSE (log scale) versus iteration (0-2500) comparing LLM-SR (solid red) with the PiT-PO trajectory (dashed blue) for reference on Oscillator 1, annotated with equation snapshots; LLM-SR plateaus around 10^-7 while PiT-PO reaches roughly 10^-28.
</details>
(b) LLM-SR.
Figure 10. Iterative trajectories of PiT-PO and LLM-SR on Oscillator1 with Llama-3.1-8B-Instruct. The NMSE curves (log scale) report the best-so-far value over iterations. Annotated equation snapshots illustrate how the discovered structure evolves along the search trajectory. The green-shaded terms are correct, whereas the red-shaded terms are erroneous.
Figure 10 visualizes how symbolic structures evolve along the iterative search. For PiT-PO (Figure 10(a)), the trajectory reveals a clear process of progressively excluding spurious terms. Under the theorem-guided mathematical constraints and the token-aware policy update, tokens marked as redundant are assigned token-level penalties; as this penalty signal repeatedly accumulates across many sampled equations during optimization, such spurious components become progressively less preferred and are unlikely to reappear in later generations. As a result, the search does not merely fit numerically; instead, it repeatedly transitions to cleaner structures, and each step-wise NMSE drop aligns with the removal of nuisance terms and a move toward a more parsimonious functional form. This "error-term elimination" behavior effectively shrinks the search space and substantially strengthens the exploration capability.
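As a minimal sketch of how such token-level penalties could enter a policy-gradient update (assuming a REINFORCE-style objective; the function names, the additive combination, and the fixed penalty magnitude are illustrative, not the exact PiT-PO formulation):

```python
import numpy as np

def token_penalized_advantages(seq_reward, redundant_mask, penalty=0.1):
    """Per-token advantages for one sampled equation (illustrative sketch).

    Every token shares the sequence-level reward (e.g. negative NMSE),
    while tokens flagged redundant by the constraint check receive an
    extra negative penalty, making them less likely in later generations.
    """
    mask = np.asarray(redundant_mask, dtype=float)
    return seq_reward - penalty * mask  # reward broadcast over tokens

def policy_gradient_loss(logprobs, advantages):
    """REINFORCE-style surrogate: weight each token log-prob by its advantage."""
    return -np.sum(np.asarray(logprobs) * advantages)
```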
In contrast, LLM-SR (Figure 10(b)) exhibits much weaker structural correction. Since its guidance is primarily prompt-level rather than parameter-level, terms that appear early, even if structurally incorrect, tend to persist and continue influencing subsequent proposals. Consequently, the search frequently enters a plateau where NMSE stops improving meaningfully: the method keeps revisiting or retaining earlier spurious components instead of systematically pruning them, indicating a stagnation regime with limited ability to reach the correct structure.
Appendix D Datasets
D.1. LLM-SR Suite
We use the LLM-SR Suite, which consists of four standard scientific equation discovery tasks spanning physics, biology, and materials science. The suite includes two nonlinear oscillator systems (Oscillation 1/2), an E. coli growth model (E. coli Growth), and a temperature-dependent stress–strain dataset (Stress–Strain). The first three tasks are dynamical systems and are simulated over a fixed time horizon, while the last task is a static constitutive relation evaluated on experimental measurements.
D.1.1. Oscillation 1 (Nonlinear Oscillator)
Nonlinear oscillators with dissipative and non-polynomial couplings provide a demanding test for recovering interacting terms from trajectories. We define the state as $x(t)$ and $v(t)=\dot{x}(t)$ , and simulate the following system:
$$
\dot{x}=v,
$$
We set $F=0.8$ , $\alpha=0.5$ , $\beta=0.2$ , $\gamma=0.5$ , and $\omega=1.0$ , with initial conditions $x(0)=0.5$ , $v(0)=0.5$ , and simulate over $t\in[0,50]$ .
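The simulation protocol can be sketched with SciPy's `solve_ivp`; since the benchmark defines the $\dot{v}$ right-hand side, it is taken here as a caller-supplied placeholder `accel` rather than reproduced:

```python
import numpy as np
from scipy.integrate import solve_ivp

def simulate(accel, x0=0.5, v0=0.5, t_span=(0.0, 50.0), n=1000):
    """Integrate the first-order system dx/dt = v, dv/dt = accel(t, x, v).

    `accel` is a placeholder for the benchmark's acceleration equation,
    which is not reproduced here; the initial conditions and horizon
    match the stated setup.
    """
    def rhs(t, y):
        x, v = y
        return [v, accel(t, x, v)]

    t_eval = np.linspace(*t_span, n)
    sol = solve_ivp(rhs, t_span, [x0, v0], t_eval=t_eval, rtol=1e-9, atol=1e-9)
    return sol.t, sol.y[0], sol.y[1]
```

For instance, passing `accel=lambda t, x, v: -x` integrates a simple harmonic oscillator over the same horizon.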
D.1.2. Oscillation 2 (Nonlinear Oscillator)
The second oscillator introduces explicit time forcing and an exponential nonlinearity, leading to a different coupling pattern between $x$ and $v$ :
$$
\dot{x}=v,
$$
We use $F=0.3$, $\alpha=0.5$, $\beta=1.0$, $\delta=5.0$, $\gamma=0.5$, and $\omega=1.0$. The initial conditions and simulation window are the same as for Oscillation 1: $x(0)=0.5$, $v(0)=0.5$, $t \in [0,50]$.
D.1.3. E. coli Growth (Bacterial Growth)
This task models the growth rate of E. coli as a product of multiplicative factors capturing distinct physiological effects from population size, nutrient availability, temperature, and acidity:
$$
\frac{dB}{dt}=f_{B}(B)\cdot f_{S}(S)\cdot f_{T}(T)\cdot f_{\mathrm{pH}}(\mathrm{pH}),
$$
where $B$ is bacterial density, $S$ is nutrient concentration, $T$ is temperature, and $\mathrm{pH}$ denotes acidity. In our benchmark, the explicit functional form combines saturation kinetics, sharp temperature modulation, and a structured pH response, yielding a challenging multivariate nonlinear discovery setting.
D.1.4. Stress–Strain (Material Stress Behavior)
To assess performance on experimental measurements, we consider tensile stress–strain data for Aluminum 6061-T651 collected at multiple temperatures (from room temperature up to $300^{\circ}\text{C}$). While no single exact governing equation is provided, a commonly used phenomenological approximation for temperature-dependent hardening is:
$$
\sigma=\left(A+B\varepsilon^{n}\right)\left(1-\left(\frac{T-T_{r}}{T_{m}-T_{r}}\right)^{m}\right),
$$
where $\sigma$ is stress, $\varepsilon$ is strain, $T$ is temperature, $T_{r}$ is a reference temperature, and $T_{m}$ is the melting temperature. The coefficients $A$ , $B$ , $n$ , and $m$ are fitted from data for the given alloy.
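Fitting the coefficients $A$, $B$, $n$, and $m$ is a routine nonlinear least-squares problem. The sketch below uses `scipy.optimize.curve_fit` on synthetic data standing in for the alloy measurements; the reference and melting temperatures and the "true" coefficient values are illustrative assumptions, not the dataset's:

```python
import numpy as np
from scipy.optimize import curve_fit

# Assumed reference/melting temperatures (deg C), for illustration only.
T_r, T_m = 25.0, 652.0

def stress_model(X, A, B, n, m):
    eps, T = X
    theta = (T - T_r) / (T_m - T_r)          # homologous temperature in [0, 1)
    return (A + B * eps**n) * (1.0 - theta**m)

rng = np.random.default_rng(0)
eps = rng.uniform(0.01, 0.2, 200)            # strain samples
T = rng.uniform(25.0, 300.0, 200)            # temperature samples (deg C)
# Synthetic measurements from assumed "true" coefficients plus noise.
sigma = stress_model((eps, T), 324.0, 114.0, 0.42, 1.34) + rng.normal(0.0, 1.0, 200)

popt, _ = curve_fit(stress_model, (eps, T), sigma, p0=[300.0, 100.0, 0.5, 1.0])
A, B, n, m = popt
```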
D.2. LLM-SRBench
We additionally report results on LLM-SRBench, a benchmark specifically constructed to evaluate data-driven scientific equation discovery with LLMs while reducing the risk of trivial memorization of canonical textbook formulas. It contains 239 problems spanning four scientific domains (chemistry, biology, physics, and material science), and each task is packaged with (i) a short scientific context describing variables and the target quantity and (ii) tabular numerical samples used for discovery.
Dataset composition
LLM-SRBench is organized into two complementary subsets.
(1) LSR-Transform. This subset converts well-known physics equations into less-common but analytically equivalent forms by changing the prediction target. Concretely, starting from the original equation, one input variable is chosen as a pivot and promoted to the new target; the equation is then symbolically solved (e.g., via SymPy) for that pivot variable, yielding an alternative representation of the same underlying physical law. Only transformations that admit an analytic solution are retained, and the original datapoints are filtered to satisfy the valid domain of the transformed expression (e.g., avoiding invalid denominators or square-root domains). Finally, a new natural-language problem statement is generated to match the transformed input–output specification. This pipeline produces 111 transformed problems.
(2) LSR-Synth. This subset is designed to test discovery beyond memorized templates by composing equations from both known scientific terms and synthetic (novel yet plausible) terms. Candidate known/synthetic terms are proposed with an LLM given the domain context, combined into full equations, and then filtered through multiple checks: numerical solvability (e.g., via standard ODE solvers for dynamical systems), novelty assessment in context (using an LLM-based novelty evaluator), and expert plausibility validation. After passing these filters, datapoints are generated for each approved equation, yielding 128 synthetic problems across the same four domains.
Symbolic Accuracy (SA)
Besides numeric fit metrics (e.g., $\mathrm{NMSE}$ and $\mathrm{Acc}_{all}(\tau)$ ), LLM-SRBench emphasizes symbolic correctness. Symbolic Accuracy (SA) is computed using an LLM-based equivalence judge (GPT-4o): given a predicted hypothesis and the ground-truth expression, the evaluator first normalizes both by removing extraneous text (e.g., comments) and replacing explicit numeric constants with placeholder parameters, then decides whether there exist parameter values that make the hypothesis mathematically equivalent to the ground truth. SA is the fraction (percentage) of problems judged equivalent under this protocol. The benchmark authors validated this evaluation by comparing against human judgments on 130 sampled problems and reported 94.6% agreement between GPT-4o and human evaluators.
Appendix E Turbulence Task Setup: Periodic Hill Flow
E.1. Dataset
We consider the periodic hill flow configuration with streamwise periodic boundary conditions and no-slip walls.
In this work, the features selected for the symbolic regression process are computed from the mean strain-rate ( $S$ ) and rotation-rate ( $R$ ) tensors. Following established practice, these tensors are normalized using local turbulence scales prior to invariant calculation to ensure well-conditioned inputs:
$$
S=\frac{1}{2}\,\frac{1}{\beta^{*}\omega}\Bigl[\nabla\boldsymbol{u}+(\nabla\boldsymbol{u})^{\mathsf{T}}\Bigr],\qquad R=\frac{1}{2}\,\frac{1}{\beta^{*}\omega}\Bigl[\nabla\boldsymbol{u}-(\nabla\boldsymbol{u})^{\mathsf{T}}\Bigr], \tag{13}
$$
where $(\cdot)^{\mathsf{T}}$ denotes matrix transposition. The invariants are then computed as $I_{1}=\mathrm{Tr}(S^{2})$ and $I_{2}=\mathrm{Tr}(R^{2})$.
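For a single sampling point, the normalization and invariant computation above can be sketched directly in numpy; the velocity gradient, $\beta^{*}$, and $\omega$ values below are illustrative stand-ins:

```python
import numpy as np

beta_star, omega_t = 0.09, 50.0  # assumed k-omega constant and local omega

# Example mean velocity gradient du_i/dx_j at one sampling point.
grad_u = np.array([[0.0, 10.0, 0.0],
                   [2.0,  0.0, 0.0],
                   [0.0,  0.0, 0.0]])

scale = 1.0 / (beta_star * omega_t)
S = 0.5 * scale * (grad_u + grad_u.T)  # normalized mean strain-rate tensor
R = 0.5 * scale * (grad_u - grad_u.T)  # normalized mean rotation-rate tensor

I1 = np.trace(S @ S)  # I1 = Tr(S^2) >= 0
I2 = np.trace(R @ R)  # I2 = Tr(R^2) <= 0 for antisymmetric R
```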
The dataset provides (i) two invariant features $(I_{1},I_{2})$ (stored in the Lambda columns), (ii) tensor bases $\{T^{m}\}_{m=1}^{3}$ (stored in the first three $3\times 3$ tensors in the Tensor columns), and (iii) the target anisotropy tensor field $b$ (and its nonlinear correction form used for training, denoted by $b^{\perp}$), together with auxiliary fields such as turbulent kinetic energy $k$ when needed by the evaluator.
E.1.1. Data Collection
DNS source.
The DNS data are obtained from the publicly available NASA turbulence modeling resource (Turbulence Modeling Resource / TMBWG) for periodic hills with parameterized geometries: https://tmbwg.github.io/turbmodels/Other_DNS_Data/parameterized_periodic_hills.html
RANS baseline generation (for non-DNS methods).
For all non-DNS methods, we compute baseline solutions using OpenFOAM with the same initial and boundary conditions as the DNS. Specifically, we employ the incompressible steady-state solver simpleFoam and apply the turbulence models considered in this paper (all belonging to RANS and its SST variants). The solver outputs, at each sampling point and converged iteration/time-step, the required physical quantities for subsequent post-processing.
DNS–RANS grid alignment and target construction.
To construct training targets consistent with the RANS discretization, we interpolate the DNS fields onto the RANS grid points. The interpolated DNS quantities are then post-processed in the CFD workflow (i.e., as offline preprocessing prior to training) to compute the Reynolds stress tensor, which is packaged into the dataset files as the supervision signal for training.
Feature & basis construction.
Following Popeās tensor-basis expansion hypothesis, the Reynolds stress decomposition involves (i) tensor bases and (ii) scalar coefficient functions of invariant features. Since our final turbulence closure is embedded as a correction to the baseline kOmegaSST model, we use the physical fields produced by the baseline kOmegaSST computation for post-processing to obtain the input features and the tensor bases. These are stored in separate files (features and bases), while the learning target is defined to match the Reynolds-stress-related quantity described in Section E.2.
E.1.2. Pre-processing
We normalize the two invariant inputs prior to feeding them into the learned symbolic expressions:
$$
\tilde{I}_{1}=\tanh\!\left(\frac{I_{1}}{2.0}\right),\qquad\tilde{I}_{2}=\tanh\!\left(\frac{I_{2}}{2.0}\right), \tag{15}
$$
which maps the raw values into approximately $[-1,1]$ and improves numerical stability during parameter fitting.
E.2. Casting Turbulence Closure as a Symbolic Regression Problem
Predicting scalar coefficients instead of tensor entries.
Rather than directly searching over tensor-valued symbolic forms, we let the LLM predict three scalar coefficient functions $(G^{1},G^{2},G^{3})$ , each parameterized as a symbolic expression of only two variables $(I_{1},I_{2})$ . The evaluator then reconstructs the predicted tensor via the (fixed) tensor bases:
$$
\hat{b}\;=\;G^{1}(I_{1},I_{2})\,T^{1}\;+\;G^{2}(I_{1},I_{2})\,T^{2}\;+\;G^{3}(I_{1},I_{2})\,T^{3}. \tag{16}
$$
This design reduces the number of input variables exposed to the LLM to two, significantly shrinking the symbolic search space.
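Given per-sample coefficients and bases, the reconstruction in Eq. (16) is a single tensor contraction. A minimal sketch with random stand-in data:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 4
G = rng.normal(size=(N, 3))        # G^1..G^3 evaluated at each sample's (I1, I2)
T = rng.normal(size=(N, 3, 3, 3))  # tensor bases T^1..T^3 per sample (stand-ins)

# b_hat[n] = sum_m G[n, m] * T[n, m], i.e. Eq. (16) for all samples at once.
b_hat = np.einsum('nm,nmij->nij', G, T)
```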
Scoring by MSE in the evaluator.
Given a candidate hypothesis, we compute the fitting score by the mean squared error between the target tensor and the reconstructed tensor. In the final evaluator, we additionally scale both prediction and target by the sample-wise $k$ before error aggregation:
$$
\hat{\tau}_{ij}(x_{n})=k(x_{n})\,\hat{b}_{ij}(x_{n}),\qquad\tau_{ij}(x_{n})=k(x_{n})\,b_{ij}(x_{n}). \tag{17}
$$
We compute the MSE only over a selected subset of tensor components. The Reynolds-stress (and anisotropy) tensor is symmetric, so only six components are independent; moreover, in the present periodic hill setting, the two off-plane shear components are typically several orders of magnitude smaller and are commonly neglected in practical turbulence modeling. Following this standard practice, we evaluate the error only on
$$
\Omega=\{(0,0),(0,1),(1,1),(2,2)\}, \tag{18}
$$
$$
\mathrm{MSE}=\frac{1}{|\Omega|}\sum_{(i,j)\in\Omega}\frac{1}{N}\sum_{n=1}^{N}\Bigl(\tau_{ij}(x_{n})-\hat{\tau}_{ij}(x_{n})\Bigr)^{2}. \tag{19}
$$
The resulting MSE is used as the numerical fitting score for the symbolic regression loop.
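Eqs. (17)-(19) amount to scaling both tensors by the sample-wise $k$ and averaging squared errors over the four retained components. A minimal sketch:

```python
import numpy as np

OMEGA = [(0, 0), (0, 1), (1, 1), (2, 2)]  # retained components, Eq. (18)

def masked_mse(b_pred, b_true, k):
    """Eq. (19) after the sample-wise k-scaling of Eq. (17).
    b_pred, b_true: (N, 3, 3) anisotropy tensors; k: (N,) turbulent kinetic energy."""
    tau_pred = k[:, None, None] * b_pred
    tau_true = k[:, None, None] * b_true
    errs = [np.mean((tau_true[:, i, j] - tau_pred[:, i, j]) ** 2) for i, j in OMEGA]
    return float(np.mean(errs))
```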
Prompt and fairness across methods.
We provide the same problem specification text (task description, variable definitions, and operator constraints) to both LLM-SR and PiT-PO to ensure a fair comparison. An example prompt is shown in Figures 11-13. In our current experiments, we also explicitly restrict the hypothesis space by asking the LLM to avoid trigonometric functions.
<details>
<summary>x15.png Details</summary>

### Visual Description
## Problem Specification: Symbolic Regression for Periodic Hill Turbulence
The prompt specifies the task, the flow case, the evaluation rule, and the tensor bases.
1. **Task Definition:** learn three scalar functions G1(I1, I2), G2(I1, I2), G3(I1, I2) of the invariants I1 and I2.
2. **Flow Case Context:** the bottom wall profile is y(x) = a[1 + cos(πx / L)] for |x| ≤ L and y(x) = 0 for |x| > L, where a is the hill height (characteristic length h) and α = L/h controls hill steepness (training case: α = 0.8). The streamwise (x) direction is periodic, the top and bottom walls are no-slip, and the Reynolds number is Re_h = U_b h / ν = 5600.
3. **Evaluation Rule:** given the tensor bases T1, T2, T3 (each 3×3), the predicted anisotropy is b_hat = G1 \* T1 + G2 \* T2 + G3 \* T3, and the evaluator computes the MSE between b_hat and the target b.
4. **Tensor Bases** (provided by the evaluator, not computed by the LLM; each symmetric and traceless, constructed from the non-dimensionalized mean strain-rate tensor S and rotation tensor R):
   * T1 = S (linear strain basis): the Boussinesq/linear eddy-viscosity direction; dominates in simple shear flows and serves as the baseline anisotropy response aligned with the mean strain.
   * T2 = S @ R - R @ S (strain-rotation coupling basis): captures the interaction between mean strain and mean rotation; crucial in separated, reattaching, and swirling/shear-layer regions where linear eddy-viscosity assumptions tend to fail.
   * T3 = S @ S - (1/3) \* tr(S @ S) \* I (quadratic strain nonlinearity basis): introduces nonlinear strain effects and normal-stress anisotropy beyond linear models, improving predictions in strongly anisotropic regions.
</details>
Figure 11. Problem Specification.
<details>
<summary>x16.png Details</summary>

### Visual Description
## Evaluator Program
The figure shows the Python evaluator. The function `evaluate(data, equation)` extracts the invariants `I1`, `I2` from `data['inputs']`, the tensor bases `T1`, `T2`, `T3` from `data['tensors']`, and the target `b_true` from `data['outputs']`. An inner function `loss(params)` calls `equation(I1, I2, params)` to obtain the predicted coefficients `G_pred` of shape `(N, 3)`, reconstructs the predicted anisotropy tensor `b_hat` by broadcasting `G_pred` against the tensor bases (using `None` indexing to expand dimensions), and returns the mean squared error `np.mean((b_hat - b_true) ** 2)`. The parameters are initialized as `initial_params = np.ones(30)` and optimized with `minimize(loss, initial_params, method='BFGS')` from `scipy.optimize`; the evaluator returns `result.fun`, the optimized loss value.
</details>
Figure 12. Evaluation and Optimization.
<details>
<summary>x17.png Details</summary>

### Visual Description
## Equation Program Skeleton
The figure shows an example equation program, `def equation(I1: np.ndarray, I2: np.ndarray, params: np.ndarray) -> np.ndarray`, which predicts the three scalar coefficients G1, G2, G3 for the tensor bases. The inputs `I1` and `I2` are invariant arrays of shape `(N,)`, and `params` is an array of free constants partitioned into three blocks: `params[0:10]` for G1, `params[10:20]` for G2, and `params[20:30]` for G3. In the skeleton, each coefficient is a linear form in the invariants, e.g. `g1 = params[0] * I1 + params[1] * I2 + params[2]`, with `g2` and `g3` structurally identical but drawing on their own parameter blocks. The function returns `np.stack([g1, g2, g3], axis=1)`, an array of shape `(N, 3)` whose columns are G1, G2, G3.
</details>
Figure 13. Equation Program Example.
E.3. Back to Turbulence Modeling
From discovered expressions to a deployable closure.
After obtaining an explicit expression for the Reynolds-stress-related term from symbolic regression, we rewrite the discovered formula into an OpenFOAM-readable form.
Embedding into OpenFOAM.
We embed the resulting closure into the existing kOmegaSST implementation by replacing the routine that evaluates the Reynolds stress (or its modeled contribution) with the discovered expression, thereby forming an optimized turbulence model within the RANS framework.
Verification against DNS.
We then run RANS simulations with the modified model under the same configuration as the baseline and compare the resulting flow statistics against the DNS reference data. This completes the end-to-end turbulence modeling and validation pipeline considered in this work.
E.4. Domain-Specific Constraints for Turbulence Modeling
To encode expert knowledge as inductive biases, we augment the reward with a domain-specific penalty term:
$$
P_{\text{domain}}\;=\;\sum_{j=1}^{4}w_{j}\,P_{\text{domain}}^{(j)}, \tag{20}
$$
where each $P_{\text{domain}}^{(j)} \ge 0$ quantifies violations of the $j$-th turbulence constraint.
(1) Realizability.
A physically admissible Reynolds stress tensor must be symmetric and positive semi-definite (PSD), which implies that all eigenvalues are nonnegative. Non-realizable predictions are penalized by
$$
P_{\text{domain}}^{(1)}\;=\;\frac{1}{N}\sum_{n=1}^{N}\max\Bigl(0,\,-\lambda_{\min}(\tau(x_{n}))\Bigr), \tag{21}
$$
where $\lambda_{\min}(\cdot)$ denotes the smallest eigenvalue. This PSD requirement is a standard realizability condition for Reynolds stresses (Pope, 2000).
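The penalty in Eq. (21) can be sketched directly with a batched symmetric eigensolve:

```python
import numpy as np

def realizability_penalty(tau):
    """Eq. (21): mean hinge on the most negative eigenvalue of each stress tensor.
    tau: (N, 3, 3) symmetric tensors."""
    lam_min = np.linalg.eigvalsh(tau)[:, 0]  # eigvalsh sorts eigenvalues ascending
    return float(np.mean(np.maximum(0.0, -lam_min)))
```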
(2) Boundary-condition consistency.
At a no-slip wall, the Reynolds stress tensor must decay to zero. Accordingly, non-vanishing stresses in a near-wall band $\mathcal{W}=\{x:y(x)<y_{0}\}$ are penalized as
$$
P_{\text{domain}}^{(2)}\;=\;\frac{1}{|\mathcal{W}|}\sum_{x_{n}\in\mathcal{W}}\|\tau(x_{n})\|_{F}, \tag{22}
$$
where $y(x)$ is the wall-distance field. In practice, $y(x)$ can be computed and exported using standard CFD post-processing tools (OpenFOAM wall-distance utilities such as wallDist and checkMesh -writeAllFields) (Monkewitz, 2021).
(3) Asymptotic scaling in the viscous sublayer.
Near-wall Taylor expansions under the no-slip constraint imply that $u^{\prime}=O(y)$ and, for incompressible flow, $v^{\prime}=O(y^{2})$. Consequently, the Reynolds shear stress $-\overline{u^{\prime}v^{\prime}}$ exhibits cubic leading-order scaling, $O(y^{3})$, in the viscous sublayer (Tennekes and Lumley, 1972). This constraint is enforced by evaluating the slope of $\log|\tau_{xy}^{+}|$ with respect to $\log(y^{+})$ over the viscous sublayer range $\mathcal{V}=\{x:y^{+}(x)\in[y_{1}^{+},y_{2}^{+}]\}$:
$$
p\;=\;\mathrm{slope}\bigl(\log|\tau_{xy}^{+}|\;\text{vs}\;\log(y^{+})\bigr),\qquad P_{\text{domain}}^{(3)}\;=\;|p-3|. \tag{23}
$$
Here $y^{+}=y\,u_{\tau}/\nu$ is the wall unit, with friction velocity $u_{\tau}=\sqrt{\tau_{w}/\rho}$ . In CFD, $y^{+}$ and $\tau_{w}$ can be obtained through standard post-processing utilities (OpenFOAM yPlus and wallShearStress).
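The slope evaluation in Eq. (23) is a log-log linear fit over the sublayer band; a minimal sketch, with illustrative band bounds $y_{1}^{+}=1$ and $y_{2}^{+}=5$:

```python
import numpy as np

def sublayer_scaling_penalty(y_plus, tau_xy_plus, y1=1.0, y2=5.0):
    """Eq. (23): fit log|tau_xy+| against log(y+) over the sublayer band
    [y1, y2] (illustrative bounds) and penalize deviation of the slope from 3."""
    mask = (y_plus >= y1) & (y_plus <= y2)
    slope = np.polyfit(np.log(y_plus[mask]), np.log(np.abs(tau_xy_plus[mask])), 1)[0]
    return abs(slope - 3.0)
```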
(4) Energy consistency.
Energy consistency is enforced by constraining the turbulent kinetic energy production implied by the predicted stress:
$$
P_{k}(x)\;=\;-\tau_{ij}(x)\,\frac{\partial U_{i}}{\partial x_{j}}(x), \tag{24}
$$
and penalizing mismatches with the reference production $P_{k}^{\text{ref}}$ (computed from DNS-aligned fields) via
$$
P_{\text{domain}}^{(4)}\;=\;\mathrm{MSE}\bigl(P_{k},\,P_{k}^{\text{ref}}\bigr). \tag{25}
$$
The definition in Eq. (24) is the standard production term in the TKE budget (Pope, 2000).
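The production term of Eq. (24) is again a per-sample contraction of the stress with the mean velocity gradient:

```python
import numpy as np

def tke_production(tau, grad_U):
    """Eq. (24): P_k(x_n) = -tau_ij dU_i/dx_j at each sample.
    tau, grad_U: (N, 3, 3)."""
    return -np.einsum('nij,nij->n', tau, grad_U)
```

The penalty of Eq. (25) is then the MSE between this array and the reference production computed from DNS-aligned fields.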
Efficient evaluation via elite-CFD refinement (multi-fidelity constraint scheduling).
Constraints (2)–(4) require additional CFD-derived fields (wall distance $y$, wall shear stress $\tau_{w}$ and the associated $y^{+}$, and mean velocity gradients $\nabla U$), which can be expensive to obtain at each candidate-evaluation step. To keep the symbolic regression loop efficient, we adopt a two-stage evaluation schedule.
Stage-A (inexpensive screening, applied to all candidates) evaluates the data-fitting score (MSE) together with realizability $P_{\text{domain}}^{(1)}$ using only the predicted stress tensor.
Stage-B (expensive physics checks, triggered by an improving best MSE) is activated when a candidate in the current batch achieves a new best MSE relative to all previously evaluated candidates. In this case, the candidate is treated as an elite solution, and CFD-based post-processing is performed to obtain the required auxiliary fields, compute $P_{\text{domain}}^{(2)}$–$P_{\text{domain}}^{(4)}$, and conduct an additional online fine-tuning step using the full domain-specific penalty. All non-elite candidates are trained only with the realizability constraint, whereas elite candidates receive the full turbulence constraint suite after CFD evaluation. This design enables rigorous physics enforcement without incurring prohibitive CFD costs during exploration.
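The schedule above can be sketched as a small wrapper around two evaluators; `cheap_eval` and `cfd_eval` are placeholders for the Stage-A and Stage-B scoring routines, and the dictionary keys are illustrative names:

```python
def evaluate_candidate(candidate, best_mse, cheap_eval, cfd_eval):
    """Two-stage scoring. cheap_eval(candidate) -> (mse, realizability_penalty),
    computed from the predicted stress alone; cfd_eval(candidate) -> (p2, p3, p4),
    the CFD-dependent penalties. Both callables stand in for the paper's evaluators."""
    mse, p1 = cheap_eval(candidate)        # Stage-A: applied to every candidate
    penalties = {'realizability': p1}
    if mse < best_mse:                     # new best MSE => elite candidate
        p2, p3, p4 = cfd_eval(candidate)   # Stage-B: CFD post-processing, elites only
        penalties.update(wall=p2, sublayer=p3, energy=p4)
        best_mse = mse
    return mse, penalties, best_mse
```

Non-elite candidates thus never pay the CFD cost, while every new incumbent is immediately subjected to the full physics suite.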