# RPM-MCTS: Knowledge-Retrieval as Process Reward Model with Monte Carlo Tree Search for Code Generation
**Authors**: Yuanyuan Lin, Xiangyu Ouyang, Teng Zhang, Kaixin Sui
## Abstract
Tree search-based methods have made significant progress in enhancing the code generation capabilities of large language models. However, due to the difficulty in effectively evaluating intermediate algorithmic steps and the inability to locate and timely correct erroneous steps, these methods often generate incorrect code and incur increased computational costs. To tackle these problems, we propose RPM-MCTS, an effective method that utilizes Knowledge-Retrieval as Process Reward Model based on Monte Carlo Tree Search to evaluate intermediate algorithmic steps. By utilizing knowledge base retrieval, RPM-MCTS avoids the complex training of process reward models. During the expansion phase, similarity filtering is employed to remove redundant nodes, ensuring diversity in reasoning paths. Furthermore, our method utilizes sandbox execution feedback to locate erroneous algorithmic steps during generation, enabling timely and targeted corrections. Extensive experiments on four public code generation benchmarks demonstrate that RPM-MCTS outperforms current state-of-the-art methods while achieving an approximately 15% reduction in token consumption. Furthermore, full fine-tuning of the base model using the data constructed by RPM-MCTS significantly enhances its code capabilities.
## Introduction
Code generation aims at understanding problem descriptions in natural language and generating the corresponding code snippets. In recent years, large language models (LLMs) have demonstrated remarkable performance in code generation tasks (zhang2024unifying). For code generation, the early methods involve dividing code planning and synthesis into two phases using chain-of-thought or tree structures (wei2022chain; jiang2024self; zelikman2023parsel). wang2024planning have demonstrated that providing LLMs with a correct solution can significantly enhance model performance, even when these solutions consist of incomplete plans, i.e., for solutions, correctness is preferred over completeness, and the key to enhancing the code generation capability of LLMs lies in generating correct plans.
Programming languages possess their own inherent logical structures and tightly interconnected knowledge, which makes it essential not to overlook long-range dependencies within the code. Previous work has shown that rotary position embedding does not always lead to attention weights decaying with relative distance (barbero2024round). Concurrently, through exploring the attention distribution between tokens, we have experimentally demonstrated that selecting algorithmic steps as the basic units is a superior choice. Therefore, our objective focuses on how to accurately generate intermediate algorithmic steps.
However, a limitation of previous methods lies in the lack of an evaluation and correction mechanism for intermediate algorithmic steps, which fails to guarantee the correctness of these steps (lu2025lsr; li2025structured). One way to tackle this issue is to use a value function or reward model to verify reasoning traces for correctness, which then serves as a learning signal for self-training (lightman2023let; wang2023math). However, training a reliable reward model to verify every step in a reasoning trace generally depends on dense human-generated annotations per reasoning step (lightman2023let), which does not scale well.
Unlike other reasoning tasks, code generation benefits from the homogeneity of algorithmic workflows across different problem categories. This allows us to leverage historical experience from a knowledge base containing numerous correct algorithmic steps to evaluate the process reward of expansion steps. Additionally, code generation typically benefits from detailed feedback provided by compilers. Consequently, in this paper, we propose RPM-MCTS, which optimizes the Monte Carlo Tree Search (MCTS) algorithm using external information feedback. Our method utilizes the knowledge base for intermediate algorithmic step-level evaluation and employs sandbox feedback for result-level assessment of complete code. Specifically, the root node of the Monte Carlo tree represents the coding problem, while all other nodes represent individual algorithmic steps. During each iteration, multiple distinct potential next steps are generated based on the current reasoning path. Node selection is guided by historical experience from the knowledge base, enabling faster discovery of high-value search paths. In the simulation phase, complete code is generated and evaluated using sandbox and model feedback to update node values. Notably, during simulation, we localize erroneous steps within the full algorithmic workflow and incorporate newly generated correct steps into the tree, thereby reducing token consumption. After multiple iterations, the highest-scoring path from root to leaf is selected, ultimately yielding a complete solution alongside its corresponding code. The contributions are summarized as follows:
- We propose RPM-MCTS, which leverages knowledge base retrieval scores to evaluate intermediate algorithmic steps, steering LLMs to explore high-value reasoning paths more effectively.
- We leverage sandbox feedback during the simulation phase to evaluate code generated from reasoning steps, localize errors, and truncate simulations, thereby reducing computational costs.
- We conduct extensive experiments and show that RPM-MCTS is superior to state-of-the-art methods. Moreover, we verify that base models fine-tuned with data generated by RPM-MCTS enjoy greater code capabilities.
## Related Work
#### Monte Carlo Tree Search.
As the extension of Chain-of-Thought (CoT) (wei2022chain), Tree-of-Thought (ToT) (yao2023tree) enhances the reasoning and planning capabilities of LLMs by exploring different thought paths within a tree structure. Subsequently, Monte Carlo Tree Search has served as a search algorithm to more effectively guide LLMs in exploring intermediate sub-steps (zhao2023large; hao2023reasoning; zhou2023language; ding2023everything). ReST-MCTS* (zhang2024rest) combines process reward guidance with Monte Carlo Tree Search to collect high-quality reasoning trajectories and step-by-step values for training strategy and reward models. SRA-MCTS (xu2024sra) further extends this to the field of code generation, using Monte Carlo Tree Search to generate intermediate reasoning steps and conducting iterative self-evaluation to synthesize training data for supervised fine-tuning. However, relying solely on model self-evaluation introduces biases and hallucinations, and small-scale LLMs exhibit limited instruction-following capabilities. RethinkMCTS (li2025rethinkmcts) is another prior work that also uses execution feedback but employs a patching strategy. If this patch fails, the search may proceed on an incorrect path, making it less suitable for generating high-quality SFT data.
#### Process Evaluation.
In heuristic search, a robust reasoning process needs to have self-evaluation capabilities, and the evaluation results are further used to guide the search. Early work mainly focused on outcome-level evaluation (cobbe2021training), that is, evaluating the complete solution after the reasoning is completed. Outcome-level evaluation is simple to implement, but a finer-grained assessment is often required. Step-level evaluation (lightman2023let; wang2023math; gao2024llm) emphasizes the assessment of individual reasoning steps. In tree search algorithms, process evaluation is widely used to guide search trajectories. Logic-RL (xie2025logic) optimizes path selection by implementing state scoring in beam search. Furthermore, step-level evaluation has proven its effectiveness in both error correction and the summarization of reasoning steps. zheng2024makes developed a method capable of accurately locating inaccuracies in specific reasoning steps, thereby providing more precise and actionable feedback for comprehensive evaluation.
## Method
In this section, we elaborate on the proposed modified MCTS that incorporates the knowledge base as a process reward model. The methodology comprises three key components: knowledge base construction, RPM-MCTS, and code generation. First, knowledge base retrieval scores circumvent random selection during node expansion. Then, in the expansion phase, nodes are filtered based on similarity metrics to eliminate redundant candidates. Finally, during the simulation phase, the algorithm performs error reflection and retains nodes with verified correct reasoning. These collective strategies enable faster exploration of higher-quality algorithmic steps.
### Knowledge Base Construction
In this section, we introduce the construction of a retrievable global knowledge base designed to mitigate hallucination during the planning process. Due to the homogeneity of algorithms within the same category, where fundamental principles and methods are relatively similar, we utilize a knowledge base containing numerous correct algorithms across diverse categories. This serves as the evaluation model for intermediate algorithmic steps in RPM-MCTS, eliminating the need to train a separate process reward model.
We use the training set data from APPS (hendrycks2021measuring) and CodeContests (li2022competition), which contain coding problems paired with correctly implemented solutions. We utilize Claude Sonnet 3.7 to generate the correct algorithmic steps corresponding to the correct code and decompose them step by step. We then roll out each problem along its algorithmic steps, sequentially concatenating the problem with every step prefix. Specifically, for problem $p_i$ with $n_i$ algorithmic steps, where $a_i^{(j)}$ denotes the $j$-th step, we have
$$
\displaystyle K_i=\{\mathrm{concat}(p_i,a_i^{(1)},\ldots,a_i^{(j)}),~j=1,2,\ldots,n_i\}, \tag{1}
$$
and $K=\biguplus_{i=1}^{n}K_i$ is the knowledge base, where $n$ denotes the number of problems.
To enhance retrieval efficiency and improve retrieval precision by distinguishing between problems with similar descriptions but different algorithmic solutions, we organize the knowledge base into 14 distinct algorithm categories and store them as a vector database using the BGE (xiao2024c) embedding model.
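The rollout of Eqn. (1) can be sketched as follows. Here `build_entries` is a hypothetical helper for illustration only: plain string concatenation stands in for the actual BGE embedding and vector-database storage.

```python
# Sketch of knowledge-base construction per Eqn. (1): for each problem,
# every prefix of its algorithmic steps becomes one retrievable entry.
def build_entries(problem: str, steps: list[str]) -> list[str]:
    """Return the K_i entries: the problem concatenated with each step prefix."""
    entries = []
    for j in range(1, len(steps) + 1):
        # concat(p_i, a_i^(1), ..., a_i^(j))
        entries.append("\n".join([problem] + steps[:j]))
    return entries

problem = "Compute n! for a given n."
steps = ["Input the value of n",
         "Initialize p = 1",
         "Multiply p by i for i in 1..n",
         "Output p"]
K_i = build_entries(problem, steps)  # 4 entries, one per step prefix
```

In the real pipeline each entry would then be embedded and indexed per algorithm category; the prefix structure is what lets a partial reasoning path match against historically validated solutions.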
### RPM-MCTS
We propose an enhanced MCTS method, named RPM-MCTS. In this method, the root node represents the problem, while all other nodes represent an algorithmic step. Specifically, the method comprises four distinct phases: Selection, Expansion, Evaluation and Reflection, and Backpropagation, as shown in Figure 1. These phases are performed on a search tree composed of tree nodes and are iterated multiple times, with each iteration generating a concrete algorithmic step.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Iterative Tree Search Process with Evaluation and Backpropagation
### Overview
The image is a technical diagram illustrating a four-stage iterative process for tree-based search or optimization, likely within a computational or machine learning context. The process is cyclical, indicated by a "Rollout X times" loop connecting the final stage back to the first. Each stage is represented by a tree structure with colored nodes, and specific actions or components are annotated below each tree.
### Components/Axes
The diagram is organized into four main vertical columns, each corresponding to a stage in the process. A horizontal arrow at the top connects the stages in sequence and loops back, labeled **"Rollout X times"**.
**Stage Labels (Top, Left to Right):**
1. **Selection**
2. **Expansion**
3. **Evaluation**
4. **Backpropagation**
**Tree Structures:**
Each stage features a hierarchical tree diagram with a root node (yellow) and multiple levels of child nodes. The nodes are colored circles: yellow (root), pink, green, light blue, and purple. The connections (edges) between nodes are black lines, with some highlighted in red or orange to indicate active paths or selections.
**Additional Components (Below Trees):**
* **Below "Selection":** A box labeled **"UCB"** containing three icons: a code symbol (`</>`), a database symbol, and a multi-colored sphere (representing an LLM). Below these icons are the labels **"Sandbox"**, **"Knowledge"**, and **"LLM"** respectively, connected by plus signs (`+`).
* **Below "Expansion":** A box labeled **"Value"** containing two icons: a database symbol and a multi-colored sphere. Below these are the labels **"Knowledge"** and **"LLM"**, connected by a plus sign (`+`). A dashed red box highlights a section of the tree above, showing two child nodes with annotations **"Value: 8"** and **"Value: 9"**, and a red **"X"** next to the second node.
* **Below "Evaluation":** A box containing a Python logo (🐍), an arrow (→), and a code symbol (`</>`). Below these are the labels **"Code"** and **"Sandbox"**. The tree above has a path highlighted in orange, ending at a node with a green checkmark (✓). A separate, lower node is marked with a red **"X"** and a small icon resembling a person or agent.
* **Below "Backpropagation":** No additional component box is present. The tree above shows multiple paths highlighted in red, with arrows indicating upward flow from leaf nodes back toward the root.
### Detailed Analysis
**Process Flow:**
1. **Selection:** The process begins here. The tree shows a path from the root (yellow) down to a specific leaf node (green), indicated by a dashed blue arrow pointing to the "UCB" component box. This suggests the selection of a node based on a formula (UCB likely stands for Upper Confidence Bound) that combines sandbox execution, knowledge, and an LLM.
2. **Expansion:** The selected node from the previous stage is expanded, generating new child nodes. The dashed red box focuses on this expansion, showing two new nodes with assigned values (8 and 9). The red "X" next to "Value: 9" may indicate a rejected or poor-value expansion.
3. **Evaluation:** A path through the tree (highlighted in orange) is evaluated. The evaluation involves executing code in a sandbox (as shown by the component box: Python code → Sandbox). The outcome is binary: a green checkmark (✓) for a successful/valid path endpoint and a red "X" for a failed/invalid one.
4. **Backpropagation:** The results from the evaluation (the success/failure signals) are propagated back up the tree along the highlighted red paths. Arrows on these paths point upward, indicating that value or reward information is being updated from the leaf nodes back to the root.
**Spatial Grounding & Element Relationships:**
* The **"Rollout X times"** loop is positioned at the very top, spanning the entire width of the diagram, indicating the entire four-stage process is repeated multiple times.
* The **component boxes** are consistently placed directly below their corresponding tree, creating a clear visual association between the abstract tree operation and the concrete tools/knowledge sources (Sandbox, Knowledge, LLM, Code) used to perform it.
* The **color-coding of nodes** (yellow, pink, green, blue, purple) is consistent across all four trees, allowing the viewer to track the same conceptual nodes through different stages of the process.
* The **highlighting of paths** changes per stage: a single dashed blue line in Selection, a red box in Expansion, a solid orange path in Evaluation, and multiple red upward arrows in Backpropagation. This visually distinguishes the primary action of each stage.
### Key Observations
* **Hybrid System:** The process integrates traditional algorithmic components (UCB, tree search, code execution in a sandbox) with modern AI components (Knowledge base, Large Language Model - LLM).
* **Value-Driven Expansion:** The "Expansion" stage explicitly assigns numerical values to new nodes, and one is rejected (marked with an X), suggesting a pruning or selection mechanism based on these values.
* **Outcome-Based Learning:** The "Evaluation" stage produces a clear binary outcome (success/failure), which is the critical signal used in the "Backpropagation" stage to update the tree's knowledge or value estimates.
* **Iterative Refinement:** The "Rollout X times" loop emphasizes that this is not a one-pass algorithm but an iterative process of search, evaluation, and learning, designed to improve performance over multiple cycles.
### Interpretation
This diagram depicts a sophisticated **hybrid AI planning or reasoning system**. It combines the structured, explainable search of a tree-based algorithm (like Monte Carlo Tree Search - MCTS) with the generative and knowledge-retrieval capabilities of LLMs and external knowledge bases.
* **What it demonstrates:** The system is designed to solve complex problems by exploring a space of possibilities (the tree). It uses an LLM and knowledge to guide the search (Selection/Expansion), validates potential solutions by executing code (Evaluation), and learns from the results to make better future choices (Backpropagation). The "Sandbox" is crucial for safe, verifiable execution of generated code or actions.
* **Relationships:** The LLM and Knowledge base are not passive; they are active components integrated into each decision point. The flow shows a tight coupling between high-level reasoning (LLM), factual grounding (Knowledge), and concrete verification (Code/Sandbox).
* **Notable Implications:** This architecture aims to overcome key limitations of standalone LLMs: hallucination (by grounding in knowledge and sandbox execution), lack of deep planning (via tree search), and inability to learn from trial-and-error (through backpropagation). The "Value" assignment in expansion suggests it may be optimizing for a specific metric. The entire process is a form of **deliberate, verifiable, and iterative problem-solving**.
</details>
Figure 1: Overview of RPM-MCTS. (a) Selection: Select a leaf node according to Eqn. (2). (b) Expansion: After selecting a node, expand multiple child nodes, and use knowledge base retrieval scores and LLM evaluation to select nodes for simulation. The node color represents similarity magnitude. (c) Evaluation: Generate complete reasoning steps for the selected node, generate code strictly in accordance with these reasoning steps, and use a sandbox for information feedback. (d) Backpropagation: Propagate the reward scores backward. The yellow root node represents the problem, and the remaining nodes represent each reasoning step.
#### Selection.
In the selection phase, a leaf node is selected from the current tree for further expansion according to the selection score, which is defined as a weighted combination of the Upper Confidence Bound (UCB) (silver2017mastering) and the knowledge base retrieval score:
$$
\displaystyle \mathrm{SelectionScore}(s,a)=\mathrm{UCB}(s,a)+\alpha\,K(s,a), \tag{2}
$$
where $(s,a)$ denotes a state-action pair with $s$ containing the description of the problem and previously generated algorithmic steps and $a$ representing the new step at the current node. The parameter $\alpha$ balances the two terms.
UCB is a classical multi-armed bandit algorithm that performs well in addressing the exploration-exploitation trade-off. UCB selects actions by computing an upper confidence estimate of each action's potential reward:
$$
\displaystyle \mathrm{UCB}(s,a)=Q(s,a)+\beta\sqrt{\frac{\log N(s)}{1+N(s,a)}}, \tag{3}
$$
where $Q(s,a)$ represents the empirical mean cumulative reward after taking action $a$ from state $s$ , $N(s)$ is the number of times state $s$ has been explored in the current context, and $N(s,a)$ is the number of times action $a$ has been taken in state $s$ . The parameter $\beta$ is for trading off the exploitation (the former term) and exploration (the latter term).
The knowledge base retrieval score $K(s,a)$ is obtained by retrieving the concatenated $(s,a)$ pair from the knowledge base. Specifically, let $f$ denote the embedding model that maps $(s,a)$ to a vector with the same dimension as the knowledge base. Given the preceding reasoning path, the knowledge base retrieval score for the current node is calculated as follows:
$$
\displaystyle K(s,a)=\max\left(0,\,\max_{k\in K}\frac{f((s,a))\cdot k}{\|f((s,a))\|\cdot\|k\|}\right). \tag{4}
$$
The knowledge base similarity score $K(s,a)$ enables acquisition of step-wise assessments prior to the evaluation phase. In other words, when newly generated nodes remain unexplored, we prioritize leveraging historically validated solutions through knowledge base retrieval scores to identify higher-value nodes.
Starting from the root node, we recursively select the child node with the maximum $SelectionScore$ value at each branching point. Selection ties are resolved stochastically. Each iteration advances to the highest-scoring child node until reaching a leaf node.
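The selection phase can be sketched as below. This is a minimal illustration, not the authors' implementation: nodes are plain dictionaries with visit counts `N`, value estimates `Q`, and precomputed retrieval scores `K`, and the default weights for $\alpha$ and $\beta$ are placeholder values.

```python
import math

def ucb(q, n_s, n_sa, beta=1.0):
    """Eqn. (3): empirical mean value plus an exploration bonus."""
    return q + beta * math.sqrt(math.log(n_s) / (1 + n_sa))

def selection_score(q, n_s, n_sa, k_score, alpha=0.5, beta=1.0):
    """Eqn. (2): UCB plus the weighted knowledge-base retrieval score."""
    return ucb(q, n_s, n_sa, beta) + alpha * k_score

def select_leaf(root):
    """Descend from the root, at each branching point taking the child
    with the highest SelectionScore, until a leaf node is reached."""
    node = root
    while node["children"]:
        node = max(
            node["children"],
            key=lambda c: selection_score(c["Q"], node["N"], c["N"], c["K"]),
        )
    return node
```

Note how the retrieval score shifts selection toward unexplored nodes that resemble historically validated solutions: a child with a low visit count but a high `K` can outrank a more frequently visited sibling.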
<details>
<summary>x2.png Details</summary>

### Visual Description
## Heatmap and Code Analysis: Python Factorial Calculation
### Overview
The image is a composite technical figure illustrating the analysis of a Python code snippet that calculates the factorial of 5 (5!). It consists of three main components: (a) an attention or activation heatmap, (b) the corresponding Python code with annotated steps, and (c) a series of visual feature maps. The overall purpose appears to be visualizing how a model (likely a neural network or attention mechanism) processes and represents the code.
### Components/Axes
**Part (a): Heatmap**
* **Type:** Square heatmap with a triangular mask (lower-left triangle filled, upper-right is solid dark purple).
* **Title/Problem Statement:** "Problem: Use Python to calculate 5! and output the result." (Located at the top center of the entire figure).
* **X-Axis (Bottom):** Labeled with Python code tokens. From left to right: `n`, `=`, `5`, `\n`, `p`, `=`, `1`, `\n`, `for`, `i`, `in`, `range`, `(`, `1`, `,`, `n`, `+`, `1`, `)`, `:`, `\n`, `p`, `=`, `p`, `*`, `i`, `\n`, `print`, `(`, `p`, `)`.
* **Y-Axis (Left):** Labeled with the same sequence of Python code tokens, read from top to bottom.
* **Color Bar (Right of Heatmap):** Vertical scale indicating values from 0.0 (dark purple) to 0.8 (bright yellow). The gradient passes through blue (~0.2-0.3), teal/green (~0.4-0.5), and yellow-green (~0.6-0.7).
* **Highlighted Regions:** Three red rectangular boxes are drawn on the heatmap to draw attention to specific cell clusters.
1. **Left Box:** Spans the column for token `5` and rows for tokens `p`, `=`, `1`, `\n`, `for`, `i`, `in`, `range`. Contains a bright yellow cell at the intersection of column `5` and row `i`.
2. **Middle Box:** Spans the column for token `i` (within the `for` loop line) and rows for tokens `p`, `=`, `p`, `*`, `i`. Contains a bright yellow cell at the intersection of column `i` and row `*`.
3. **Right Box:** Spans the column for token `p` (within the `print` statement) and rows for tokens `p`, `)`. Contains a teal/green cell.
**Part (b): Code Snippet**
* **Content:** A numbered list of steps and the corresponding Python code.
* **Step List:**
① Input the value of n
② Iterate from 1 to n
③ Calculate the factorial
④ Output the result
* **Code Block (with line numbers and step annotations):**
1. `n = 5` ①
2. `p = 1`
3. `for i in range(1, n + 1):` ②
4. `    p = p * i` ③
5. `print(p)` ④
**Part (c): Feature Visualizations**
* **Structure:** Two horizontal rows of small, rectangular image patches.
* **Top Row:** Approximately 20 patches. Each shows a pink/magenta triangular shape against a black background. The triangles vary in texture; some are solid, while others have internal vertical striations or patterns.
* **Bottom Row:** Approximately 20 patches, aligned vertically with the top row. These are grayscale versions of the patches above them, showing similar triangular shapes and internal patterns in shades of gray.
* **Arrangement:** The patches are tightly packed in a grid-like strip, suggesting they represent sequential steps, layers, or attention heads from a model's processing pipeline.
### Detailed Analysis
**Heatmap (a) Data Points & Trends:**
* **General Trend:** The heatmap shows a strong diagonal pattern (from top-left to bottom-right), indicating high values where a token is compared to itself. The off-diagonal values are generally low (dark purple/blue), with specific, isolated exceptions highlighted by the red boxes.
* **Highlighted Cell Values (Approximate from Color Bar):**
* **Left Box (Column `5`, Row `i`):** Bright yellow. Value ≈ 0.75-0.80.
* **Left Box (Column `5`, Row `p`):** Teal/green. Value ≈ 0.45-0.50.
* **Middle Box (Column `i`, Row `*`):** Bright yellow. Value ≈ 0.75-0.80.
* **Middle Box (Column `i`, Row `p`):** Blue. Value ≈ 0.25-0.30.
* **Right Box (Column `p`, Row `p`):** Teal/green. Value ≈ 0.40-0.45.
* **Spatial Grounding:** The highest values (yellow) are not on the main diagonal. They form specific, meaningful connections: the input value `5` strongly attends to the loop variable `i`, and the loop variable `i` strongly attends to the multiplication operator `*`.
**Code (b) Content:**
The code is a standard iterative factorial calculation. It initializes `n=5` and `p=1`, then loops from 1 to `n`, multiplying `p` by the loop index `i` each time, and finally prints the result `p` (which will be 120).
### Key Observations
1. **Selective High Attention:** The model's attention (or activation) is not uniformly distributed. It focuses intensely on specific token relationships that are crucial for the computation: the link between the input number and the loop variable, and the link between the loop variable and the arithmetic operation.
2. **Functional Grouping:** The red boxes seem to group tokens involved in distinct functional phases: initialization (left box), the core iterative operation (middle box), and the output (right box).
3. **Feature Map Correspondence:** The visualizations in (c) likely correspond to internal representations of the code tokens or the attention patterns. The consistent triangular shape might be a learned feature detector for code structure or syntax. The variation in internal texture (striations) suggests different features are being extracted at different stages or by different model components.
### Interpretation
This figure demonstrates a **Peircean investigative** reading of how an AI model "understands" a simple program. It moves beyond the surface text to expose the underlying **relational structure** the model identifies as important.
* **What the Data Suggests:** The heatmap reveals that the model does not process code as a flat sequence. Instead, it builds a **relational graph** where the significance of a token is defined by its connections to other tokens. The strongest connections (`5`→`i` and `i`→`*`) highlight the model's focus on the **data flow** (the number 5 entering the loop) and the **core operation** (multiplication within the loop).
* **How Elements Relate:** The code (b) provides the ground truth. The heatmap (a) visualizes the model's internal "focus" or "association strength" when processing that code. The feature maps (c) show the visual patterns the model uses internally to represent these concepts. The red boxes in (a) explicitly link the abstract numerical values in the heatmap back to the logical steps of the algorithm.
* **Notable Anomalies/Insights:** The most striking insight is the **off-diagonal dominance**. The model's strongest signals are not self-associations but specific, meaningful cross-token relationships. This suggests the model has learned a functional, almost semantic, representation of the code's purpose (tracking the variable `n` into the loop and emphasizing the multiplication step) rather than just memorizing syntactic patterns. The visualization in (c) implies this understanding is built upon consistent, shape-based visual features extracted from the code's representation.
</details>
Figure 2: (a) Token-level attention heatmap for code corresponding to the programming problem. (b) Algorithmic steps and corresponding code for the programming problem. (c) Attention sink phenomenon.
#### Expansion.
Upon selecting a leaf node during the selection phase, the expansion phase aims to generate remaining child nodes, thereby expanding the search scope of the entire tree. Since the attention weights between tokens do not always decay with relative distance (barbero2024round), we conduct an in-depth study on the attention mechanism between tokens in LLMs during code generation to reveal the influencing factors among tokens. As shown in Figure 2, it can be observed that certain tokens have a profound impact on subsequent code generation. It can thus be inferred that these key tokens can summarize and interpret the information of previously generated tokens, and have higher reference value for subsequent token generation. Meanwhile, relevant studies (barbero2025llms; xiao2023efficient) have shown that modern LLMs exhibit the phenomenon of "attention sink". Specifically, numerous attention heads allocate a disproportionate share of weights, such as exceeding 30% or even 80%, to the beginning-of-sequence token ⟨bos⟩, despite its primary function as a sequence delimiter with minimal semantic content. Therefore, to facilitate our examination of inter-token dependencies in code generation tasks, we selectively visualize token attention mechanisms at designated layers. Figure 2 (c) shows that attention not only sinks to ⟨bos⟩ but also peaks at algorithmic step boundaries, justifying that algorithmic step blocks are more effective basic processing units in code generation tasks. Therefore, we select algorithmic steps as the basic units for expansion.
To ensure diversity in generated steps during the expansion phase, we implement a sampling decoding strategy that sequentially generates each child node. Specifically, to prevent repetitive generation by the LLM, we iteratively provide all previously generated steps as context when producing each new step. The input for the LLM is
$$
\displaystyle \mathrm{concat}(s,a_1,\ldots,a_i,g),~i=1,2,\ldots,b, \tag{5}
$$
where $g$ represents the reflection in the simulation phase, and $b$ denotes the maximum number of branches each node can expand.
After expanding $b$ nodes, we employ cosine similarity for filtering to reduce computational costs by avoiding simulations on redundant nodes. Specifically, we map the reasoning steps of the $b$ nodes to vectors using the embedding model $E$ and calculate the cosine similarities between these $b$ nodes. When the similarity exceeds a predetermined threshold, the node is identified as redundant and filtered out. This method effectively reduces the search space and enhances algorithmic efficiency while maintaining diversity.
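The redundancy filter can be sketched as follows, assuming step embeddings are already available as plain vectors. The greedy keep-first strategy and the threshold value are illustrative assumptions; the paper does not specify the exact filtering order.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def filter_redundant(vectors, threshold=0.9):
    """Return indices of nodes to keep: a node survives only if it is not
    too similar to any node already kept."""
    kept = []
    for i, v in enumerate(vectors):
        if all(cosine(v, vectors[j]) <= threshold for j in kept):
            kept.append(i)
    return kept
```

For example, two near-duplicate step embeddings collapse to one surviving node, while an orthogonal step is preserved, maintaining path diversity at reduced simulation cost.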
#### Evaluation and Reflection.
During the evaluation phase, simulation and evaluation are performed for the selected leaf nodes. We provide the LLM with the algorithmic steps $s$ already generated for the node and its ancestor nodes, enabling the LLM to strictly follow the generated steps and continue simulating to complete all remaining steps. We search over the thoughts and evaluate them via the code generated by following those thoughts.
The generated code undergoes sandbox evaluation using public test cases. However, since public test cases only cover a subset of possible scenarios, the code may fail on unseen cases, such as boundary conditions or performance issues. We therefore employ the LLM to analyze the complete algorithmic steps based on sandbox feedback.
We assess the steps generated during the expansion phase through two components, which are the pass rate on public test cases and LLM evaluation. The final evaluation score is obtained by weighted summation of these two scores. The formula is as follows:
$$
\displaystyle Q(s,a)=\gamma\cdot r_{\mathrm{exec}}+(1-\gamma)\cdot r_{\mathrm{LLM}}, \tag{6}
$$
where $r_{\mathrm{exec}}$ denotes the pass rate on public test cases, $r_{\mathrm{LLM}}$ represents the score from LLM evaluation based on the sandbox feedback and the complete steps provided to the LLM, and $\gamma$ is the weight balancing the two scores.
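As a one-line sketch of Eqn. (6); the default $\gamma$ below is an illustrative placeholder, not the paper's tuned weight.

```python
def reward(r_exec, r_llm, gamma=0.7):
    """Eqn. (6): weighted combination of the sandbox pass rate (r_exec)
    and the LLM evaluation score (r_llm). gamma=0.7 is an assumed default."""
    return gamma * r_exec + (1 - gamma) * r_llm

# e.g. 80% of public tests pass and the LLM rates the full steps 0.5
score = reward(0.8, 0.5)
```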
For code that fails on public test cases, we isolate erroneous algorithmic steps by decomposing the code into blocks and sequentially debugging each block via LLM analysis with public test inputs (zhong2024debug). We retain all correct steps generated during the simulation phase, truncating before the first erroneous step. These validated steps are then incorporated into the MCTS tree as expanded nodes.
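The truncation logic can be sketched as below. It is a simplification under stated assumptions: `check` stands in for the LDB-style per-block debugging verdict, and the one-to-one pairing of steps and code blocks is assumed for illustration.

```python
def truncate_at_error(steps, blocks, check):
    """Keep the prefix of simulated steps whose code blocks pass debugging.

    `steps` are algorithmic steps, `blocks` the code blocks they map to,
    and `check(block)` a stand-in for LDB-style block debugging that
    returns False at the first faulty block.
    """
    valid = []
    for step, block in zip(steps, blocks):
        if not check(block):
            break  # first erroneous step found: truncate here
        valid.append(step)
    return valid  # re-inserted into the MCTS tree as expanded nodes
```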
The entire RPM-MCTS process is terminated when the solution passes all public test cases and achieves a high LLM evaluation score. Otherwise, node updates and reflection are performed, and the RPM-MCTS process proceeds until the maximum iteration count is reached.
#### Backpropagation.
The objective of backpropagation is to update node reward values once state-value evaluation completes. We propagate reward values backward from leaf nodes to the root, updating the state estimates of all nodes along the path; newly generated nodes from the expansion phase collectively update their parent. As the number of simulations increases, these value estimates become increasingly accurate. This process repeats until the preset maximum simulation count is reached, ultimately yielding a search tree that records the state value and visit count of each node.
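The update along the path can be sketched as follows. The running-mean update is one common MCTS choice, shown here for illustration rather than as the exact rule in Algorithm 1.

```python
class Node:
    """Minimal tree node holding the two statistics backpropagation updates."""
    def __init__(self, parent=None):
        self.parent = parent
        self.value = 0.0   # running mean of rewards (state-value estimate)
        self.visits = 0

def backpropagate(leaf, reward):
    """Propagate a leaf reward to the root, updating running means."""
    node = leaf
    while node is not None:
        node.visits += 1
        # incremental mean: value <- value + (reward - value) / visits
        node.value += (reward - node.value) / node.visits
        node = node.parent
```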
### Generate Code
Termination of the RPM-MCTS process occurs under two conditions: 1) If all public test cases are passed and LLM analysis confirms robustness to unseen edge cases before reaching maximum iterations, the code generated during the simulation phase is retained. 2) When maximum iterations are reached without meeting termination criteria, the leaf node with the highest state value is selected, its ancestral path is traced, and the LLM is instructed to generate code by rigorously adhering to the algorithmic steps assembled from this path.
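Termination condition 2) above can be sketched as follows: find the leaf with the highest state value and trace its ancestral path to assemble the steps handed to the LLM. `StepNode` and `best_path` are illustrative names, not identifiers from Algorithm 1.

```python
class StepNode:
    """Minimal node: an algorithmic step with a state value and tree links."""
    def __init__(self, step, value=0.0, parent=None):
        self.step, self.value, self.parent = step, value, parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

def best_path(root):
    """Trace the ancestral path of the highest-value leaf, root first."""
    leaves, stack = [], [root]
    while stack:
        n = stack.pop()
        if n.children:
            stack.extend(n.children)
        else:
            leaves.append(n)
    best = max(leaves, key=lambda n: n.value)
    path = []
    while best is not None:  # walk back to the root, collecting steps
        path.append(best.step)
        best = best.parent
    return path[::-1]
```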
## Experiments
### Experimental Settings
#### Datasets.
For the construction of the knowledge base, we use the train set splits of APPS (hendrycks2021measuring) and CodeContests (li2022competition) as data sources. After validation and filtering, we obtained 11,038 samples with a total of 82,923 steps. For benchmarking, we used the test set splits of APPS and CodeContests, as well as HumanEval+ (liu2023your) and MBPP+ (liu2023your). The APPS dataset contains three difficulty levels: introductory, interview, and competition. We selected 150 validated samples from each difficulty level. The CodeContests dataset consists of competitive programming problems collected from contest websites such as Codeforces. Additionally, HumanEval (chen2021evaluating) and MBPP (austin2021program) are widely recognized benchmarks in the code generation domain, while HumanEval+ and MBPP+ introduce a larger number of test cases to enable more accurate evaluations. We utilized Claude Sonnet 3.7 to convert all datasets into a unified format, which primarily includes the problem statement, public test cases, private test cases, and standard solution. To facilitate sandbox execution, we transformed datasets with standard input-output problems into function definitions with docstrings. For datasets without public test cases, we selected the first two private test cases as public test cases.
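To make the unified format concrete, a hypothetical sample is shown below; the field names are illustrative, not the exact schema. The last lines mirror the fallback rule for datasets without public test cases.

```python
# Illustrative instance of the unified format (field names are assumptions):
# problem statement as a function stub with docstring, public/private tests,
# and the standard solution.
sample = {
    "problem": 'def add(a: int, b: int) -> int:\n    """Return the sum of a and b."""',
    "public_tests": [{"input": [1, 2], "output": 3}],
    "private_tests": [{"input": [-1, 1], "output": 0}, {"input": [0, 0], "output": 0}],
    "solution": "def add(a, b):\n    return a + b",
}

# Datasets lacking public tests borrow their first two private cases:
if not sample["public_tests"]:
    sample["public_tests"] = sample["private_tests"][:2]
```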
#### Baselines.
We selected the following methods as baselines for comparison. Base LLM refers to directly prompting the LLM to output solution code using the problem statement and public test cases as input. LDB (zhong2024debug) leverages the LLM to track intermediate variables during code execution to iteratively improve the code. ToT (yao2023tree) performs a search of thought steps using DFS or BFS before generating the final code. SRA-MCTS (xu2024sra) combines LLM with MCTS to explore intermediate reasoning steps. The complete steps obtained by SRA-MCTS are used as input to prompt the LLM to directly infer and output the solution code for evaluation.
#### Implementation Details.
We use two large-parameter backbone models, Qwen3-235B-A22B (yang2025qwen3) and Claude Sonnet 3.7, alongside a smaller-parameter model, Qwen3-8B. In the code generation domain, pass@k (chen2024survey) is a widely used metric, and we adopt pass@1 as the evaluation metric. The rollout, i.e., the maximum number of iterations, was set to 5 for all methods. The branching factor $b$ for tree-based methods was set to 3. The exploration constant $\beta$ for UCB was set to 0.5. In RPM-MCTS, the weight of the knowledge base retrieval score $ι$ was set to 0.5, and the similarity filtering threshold was set to 0.85.
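The UCB selection with $\beta=0.5$ can be sketched as below. This shows only the standard UCB term; the full selection rule additionally mixes in the knowledge-base retrieval score, which is omitted here. `SNode` is a minimal illustrative node type.

```python
import math

class SNode:
    """Minimal node carrying the statistics selection needs."""
    def __init__(self, value=0.0, visits=0):
        self.value, self.visits = value, visits
        self.children = []

def ucb_score(child, parent_visits, beta=0.5):
    # beta = 0.5 matches the exploration constant in our experiments.
    if child.visits == 0:
        return float("inf")  # always try unvisited children first
    exploit = child.value
    explore = beta * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore

def select_child(parent):
    """Pick the child maximizing exploitation plus exploration bonus."""
    return max(parent.children, key=lambda c: ucb_score(c, parent.visits))
```

With few visits the exploration bonus dominates, steering the search toward under-explored steps.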
### Main Results
| Method | APPS-Intro. | APPS-Interv. | APPS-Comp. | CodeContests | HumanEval+ | MBPP+ | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-8B | | | | | | | |
| Base LLM | 56.7 | 35.3 | 29.3 | 10.7 | 75.6 | 72.2 | 52.1 |
| LDB | 64.0 (+7.3) | 42.0 (+6.7) | 28.0 (-1.3) | 11.3 (+0.7) | 78.1 (+2.4) | 70.1 (-2.1) | 53.5 (+1.4) |
| ToT | 69.3 (+12.7) | 54.0 (+18.7) | 41.3 (+12.0) | 17.3 (+6.7) | 82.3 (+6.7) | 70.4 (-1.9) | 59.0 (+6.9) |
| SRA-MCTS | 67.3 (+10.7) | 42.7 (+7.3) | 29.3 (+0.0) | 16.0 (+5.3) | 73.8 (-1.8) | 65.9 (-6.4) | 52.8 (+0.7) |
| Ours w/o KB | 76.7 (+20.0) | 56.7 (+21.3) | 40.7 (+11.3) | 22.3 (+11.6) | 82.3 (+6.7) | 78.3 (+6.1) | 63.5 (+11.4) |
| Ours | 77.3 (+20.7) | 60.0 (+24.7) | 43.6 (+14.3) | 23.0 (+12.3) | 83.5 (+7.9) | 76.2 (+4.0) | 64.0 (+11.9) |
| Qwen3-235B-A22B | | | | | | | |
| Base LLM | 78.0 | 54.7 | 42.7 | 24.0 | 86.0 | 78.8 | 64.6 |
| LDB | 78.7 (+0.7) | 61.3 (+6.7) | 47.3 (+4.7) | 25.3 (+1.3) | 86.0 (+0.0) | 78.8 (+0.0) | 66.4 (+1.8) |
| ToT | 84.7 (+6.7) | 62.7 (+8.0) | 57.3 (+14.7) | 27.3 (+3.3) | 85.4 (-0.6) | 75.4 (-3.4) | 67.7 (+3.1) |
| SRA-MCTS | 76.0 (-2.0) | 52.7 (-2.0) | 44.0 (+1.3) | 24.7 (+0.7) | 85.4 (-0.6) | 70.9 (-7.9) | 61.7 (-3.0) |
| Ours w/o KB | 88.0 (+10.0) | 72.0 (+17.3) | 52.0 (+9.3) | 34.7 (+10.7) | 86.6 (+0.6) | 79.9 (+1.1) | 71.3 (+6.7) |
| Ours | 86.7 (+8.7) | 67.3 (+12.7) | 59.3 (+16.7) | 36.7 (+12.7) | 87.8 (+1.8) | 81.2 (+2.4) | 72.3 (+7.7) |
| Claude Sonnet 3.7 | | | | | | | |
| Base LLM | 78.7 | 56.0 | 59.3 | 31.3 | 82.9 | 77.8 | 67.3 |
| LDB | 82.0 (+3.3) | 64.7 (+8.7) | 73.3 (+14.0) | 33.3 (+2.0) | 88.4 (+5.5) | 77.0 (-0.8) | 71.5 (+4.2) |
| ToT | 84.0 (+5.3) | 68.0 (+12.0) | 66.0 (+6.7) | 39.3 (+8.0) | 86.0 (+3.1) | 74.6 (-3.2) | 70.8 (+3.6) |
| SRA-MCTS | 83.3 (+4.7) | 63.3 (+7.3) | 62.0 (+2.7) | 36.0 (+4.7) | 81.1 (-1.8) | 74.3 (-3.4) | 68.4 (+1.1) |
| Ours w/o KB | 92.0 (+13.3) | 73.3 (+17.3) | 78.0 (+18.7) | 42.7 (+11.3) | 86.6 (+3.7) | 79.1 (+1.3) | 76.2 (+8.9) |
| Ours | 92.0 (+13.3) | 74.0 (+18.0) | 81.3 (+22.0) | 46.0 (+14.7) | 89.0 (+6.1) | 81.0 (+3.2) | 78.1 (+10.9) |
Table 1: Performance comparison of all methods across different backbone models on code generation benchmarks. Values in parentheses indicate the improvement over the base LLM.
Our method achieves the most significant improvements across different backbone models and datasets. As shown in Table 1, Qwen3-8B achieves an average improvement of 11.90%, Qwen3-235B-A22B of 7.71%, and Claude Sonnet 3.7 of 10.86%. Since base Qwen3-8B performs worse than the two larger base LLMs across all datasets, especially on the simpler ones, it shows the most significant improvement when using RPM-MCTS. On the two more challenging datasets, APPS-Competition and CodeContests, Qwen3-8B achieves an average improvement of 13.3%, Qwen3-235B-A22B of 14.67%, and Claude Sonnet 3.7 of 18.34%. Because Qwen3-8B has weaker evaluation scoring capabilities while the larger LLMs evaluate more reliably, the larger models obtain greater gains on these hard problems. This demonstrates that the more difficult the task, the more accurate the evaluation of intermediate algorithmic steps must be.
LDB achieves greater improvements on simpler datasets than on more challenging ones. We found that this is because, for more difficult problems, across multiple rounds of execution feedback, LLMs often merely tweak conditions to pass the public test cases rather than revising the actual logic of the code. SRA-MCTS shows performance improvements on more challenging datasets but declines on simpler ones. The reason is that for simple problems, LLM evaluation scores are always perfect or near-perfect, prematurely ending the search for steps and resulting in incomplete or lower-quality reasoning steps.
Comparing the results across three different difficulty levels in the APPS dataset, it can be observed that for the two larger LLMs, as the difficulty increases, our method brings more significant performance improvements. The higher the difficulty of the problem, the more guidance the LLM needs to avoid getting lost in complex reasoning chains. This demonstrates the effectiveness of our method in evaluating intermediate steps, helping LLMs enhance their evaluation capabilities and further unlocking the vast potential code knowledge and reasoning abilities inherent in LLMs.
For fair comparison, even without using knowledge base retrieval scores as rewards, our method outperforms other baselines. Experimental results show that overall, especially on the two most challenging datasets, incorporating the knowledge base further stabilizes and improves performance. The reason is that LLM evaluation of intermediate steps in complex problems is unreliable, and random exploration struggles to find the correct solution path. Therefore, leveraging the knowledge base to use the reasoning patterns of historically similar problems as guidance helps direct the search. This demonstrates the effectiveness of using knowledge base retrieval scores as rewards for intermediate process evaluation.
On a few simpler datasets, performance slightly improves when knowledge base retrieval scores are not used. We analyze that this is because, in simple tasks, LLMs can already accurately evaluate the quality of generated paths. In this case, introducing knowledge base rewards, while aiming to provide additional prior information, may retrieve historical cases that are textually similar but logically different in their solutions, introducing noise into MCTS node selection. In contrast, for complex tasks, LLM evaluation capabilities for intermediate steps are limited, the search space is vast, and solutions are sparse. The structured priors provided by the knowledge base effectively guide the search direction, significantly improving success rates. This phenomenon indicates that the effectiveness of knowledge base rewards depends on the balance between task difficulty and LLM evaluation confidence.
### Ablation Study
We conduct ablation experiments using Qwen3-235B-A22B as the backbone model, with results shown in Figure 3.
w/o KB indicates that only LLM evaluation is used in selection, without knowledge base retrieval. Compared to the complete method, the overall performance slightly decreased, with an average drop of 1.05%. The decline was most significant on the two more challenging datasets, with an average drop of 4.67%. This indicates that large models still face challenges with complex problems. By introducing a knowledge base to compare the generated reasoning steps with the correct reasoning steps of similar problems in the knowledge base, the self-assessment capability for complex problems can be improved.
w/o ER means that the execution rewards of public test cases in the sandbox are not used during the simulation phase. This resulted in the largest overall performance drop, highlighting that the core of RPM-MCTS reflection lies in the detailed feedback provided by the code execution environment. In fact, previous research (huang2023large) has already pointed out that without external feedback, LLMs lack the ability to self-correct their reasoning processes.
<details>
<summary>x3.png Details</summary>

Grouped bar chart comparing RPM-MCTS ("Ours") with four ablated variants (w/o KB, w/o ER, w/o SF, w/o LDB) across six benchmarks. Bar values (%):

| Benchmark | Ours | w/o KB | w/o ER | w/o SF | w/o LDB |
| --- | --- | --- | --- | --- | --- |
| APPS-Intro. | 86.7 | 88.0 | 78.0 | 86.0 | 88.0 |
| APPS-Interv. | 67.3 | 72.0 | 63.3 | 63.3 | 70.7 |
| APPS-Comp. | 59.3 | 52.0 | 46.0 | 56.7 | 60.7 |
| CodeContests | 36.7 | 34.7 | 28.9 | 31.1 | 30.4 |
| HumanEval+ | 87.8 | 86.6 | 86.0 | 87.8 | 87.2 |
| MBPP+ | 81.2 | 79.9 | 77.2 | 77.5 | 79.6 |
</details>
Figure 3: Ablation study results on different benchmarks.
w/o SF refers to the removal of similarity filtering, i.e., not discarding similar child nodes during the expansion phase. The results show that filtering out repeated intermediate algorithmic steps based on similarity allows resources to be better allocated to exploring new steps, thereby improving performance while reducing computational costs.
w/o LDB denotes not using LDB to locate erroneous steps in our method. The average performance drop was minimal, indicating that removing LDB has little impact on our method. With execution feedback, LLMs are already capable of accurately locating errors. However, in a few cases, LDB still helps in pinpointing erroneous steps.
### Performance vs. Rollout
We explore the results of different values of the hyperparameter rollout on Qwen3-235B-A22B, as shown in Figure 4. Since SRA-MCTS is prone to premature termination due to self-overestimation by the model, we set its end-gate value to exceed the maximum possible score, allowing it to reach the maximum number of iterations whenever possible. We denote this variant as SRA-MCTS_no_eg. The results show that in the early stages, all methods exhibit significant performance improvements as the rollout increases, after which performance gradually stabilizes. Notably, RPM-MCTS exhibits better performance even with a rollout of 1. This is because it enjoys two advantages in its first rollout: proactive guidance via its knowledge base during the selection phase, and wrong-step truncation with rethink-based regeneration during the simulation phase. This allows it to perform at least one round of verification and reflection and generate complete code. Moreover, for simpler problems, RPM-MCTS can often arrive at the correct answer with only a single simulation, whereas traditional tree search methods tend to require multiple unnecessary expansions even for straightforward tasks.
<details>
<summary>x4.png Details</summary>

Line chart of performance (%) versus maximum rollout (1 to 10) for Ours, ToT, LDB, SRA-MCTS_no_eg, and SRA-MCTS. Approximate values read from the chart:

| Max Rollout | Ours | ToT | LDB | SRA-MCTS_no_eg | SRA-MCTS |
| --- | --- | --- | --- | --- | --- |
| 1 | 54 | 41.5 | 40 | 42 | 37 |
| 2 | 53.5 | 47 | 44 | 43 | 39 |
| 3 | 57 | 55 | 41 | 45 | 45 |
| 4 | 61 | 53 | 47 | 43 | 43 |
| 5 | 59.5 | 57.5 | 47.5 | 45.5 | 44 |
| 6 | 58 | 56 | 50.5 | 46 | 46 |
| 7 | 62 | 56 | 49 | 46 | 43.5 |
| 8 | 61 | 53 | 50.5 | 41.5 | 44 |
| 9 | 61 | 55.5 | 50 | 45.5 | 41.5 |
| 10 | 63 | 55 | 50 | 45 | 41.5 |

"Ours" leads at every rollout budget; ToT plateaus after an early jump, and SRA-MCTS declines beyond rollout 6.
</details>
Figure 4: Performance comparison across different maximum rollout values.
### Token Efficiency Analysis
Figure 5 shows the average token usage of different methods across all benchmark datasets. Our method reduces token consumption by approximately 15% compared to the previous MCTS method on both Qwen3-235B-A22B and Claude Sonnet 3.7. This improvement is attributed to: 1) The knowledge base retrieval scoring prioritizes more correct nodes, avoiding exploration of invalid branches. 2) Similarity filtering eliminates duplicate intermediate reasoning steps, enabling dynamic pruning of the Monte Carlo tree and reducing redundant path generation. 3) The simulation phase leverages sandbox feedback to pinpoint erroneous steps while retaining the verified correct ones. Overall, RPM-MCTS achieves enhanced search efficiency and generation quality through knowledge base guidance and execution feedback.
<details>
<summary>x5.png Details</summary>

Scatter plot of performance (%) versus average token usage for ToT, SRA-MCTS, and Ours on three backbone models. Approximate coordinates (tokens, performance):

| Method | qwen3-8b | qwen3-235b-a22b | claude37_sonnet |
| --- | --- | --- | --- |
| Ours | (7500, 63.5) | (8500, 71) | (9500, 76) |
| SRA-MCTS | (2500, 52.5) | (10000, 61.5) | (11500, 68.5) |
| ToT | (9500, 59) | (12500, 67.5) | (18000, 70.5) |

For every backbone, "Ours" attains the highest performance while using fewer tokens than ToT, whose claude37_sonnet configuration is the clear token-usage outlier.
</details>
Figure 5: Comparison of token consumption and performance across different methods and models.
### Reasoning Steps for Data Distillation
Our method is training-free, yet the synthesized algorithmic reasoning steps it produces can also be used for supervised fine-tuning. Based on the Doubao-1.5-pro-32K model, we utilize RPM-MCTS for data distillation and construct a dataset of 2.4k code generation samples with reasoning steps. This distilled data is then combined with a foundational dataset of 170k samples to perform full fine-tuning of Doubao-1.5-pro-32K. Benchmark results in Table 2 demonstrate that the training data generated using RPM-MCTS significantly enhances the code capabilities of the base model.
| Benchmark | Base Model | Fine-tuned | Δ |
| --- | --- | --- | --- |
| SWE-Bench (jimenez2024swebench) | 37.6 | 38.5 | +0.9 |
| MBPP+ (liu2023your) | 75.4 | 76.7 | +1.3 |
| LiveCodeBench (jain2024livecodebench) | 46.2 | 50.5 | +4.3 |
| Aider (aider2024polyglot) | 17.3 | 22.2 | +4.9 |
| McEval (mceval) | 57.5 | 61.2 | +3.7 |
Table 2: Supervised fine-tuning results on Doubao-1.5-pro-32K model using RPM-MCTS synthesized data.
## Conclusion
In this paper, we propose RPM-MCTS, which leverages a knowledge base and external sandbox feedback to directly obtain accurate reward values without additional training of a process reward model. During the search, errors are identified and promptly corrected. Experimental results demonstrate that RPM-MCTS outperforms current state-of-the-art methods under more constrained search budgets. Additionally, we construct training data using RPM-MCTS and perform full fine-tuning of the base model, which significantly enhances its code capabilities.
A limitation of RPM-MCTS is that code solvable in a single line may be divided into multiple lines due to the step-by-step approach, without impacting correctness. In the future, during the evaluation phase of MCTS, we can dynamically adjust the weights of external rewards from the knowledge base and sandbox, based on the uncertainty of LLMs.
## Acknowledgments
This work was supported by the National Science and Technology Major Project (2022ZD0114803), the Natural Science Foundation of Wuhan (2023010201020229), the Fundamental Research Funds for the Central Universities (NO.NJ2023032), and the Major Program (JD) of Hubei Province (2023BAA024).
## Appendix A Topic Categories
Table 3 presents the classification of algorithms in our knowledge base, which is divided into 14 categories. These categories were derived from the most common algorithmic tags on popular programming websites. The final knowledge base comprises a total of 82,923 items.
| Category | Count |
| --- | --- |
| Data Structures | 750 |
| Algorithm Strategies | 1218 |
| String Processing | 1676 |
| Sorting and Searching | 542 |
| Graph Theory | 977 |
| Bit Manipulation | 403 |
| Mathematics and Number Theory | 1658 |
| Computational Geometry | 611 |
| Optimization Problems | 1310 |
| Two-Pointer Techniques | 213 |
| Dynamic Programming | 836 |
| Recursion and Backtracking | 226 |
| Hashing Techniques | 316 |
| Other | 302 |
Table 3: Knowledge base data categories and statistics.
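As a hedged sketch of how such a category-tagged knowledge base might be queried to score a reasoning step, the snippet below uses a simple bag-of-words cosine similarity; the function names (`retrieval_reward`, `bow`) and the bag-of-words representation are our own illustrative assumptions, not the paper's actual retrieval implementation.

```python
from collections import Counter
import math

# Hypothetical sketch (not the authors' system): score a candidate
# reasoning step by its best cosine similarity to stored knowledge items,
# using a plain bag-of-words representation.

def bow(text):
    return Counter(text.lower().split())

def cosine(c1, c2):
    num = sum(c1[w] * c2[w] for w in set(c1) & set(c2))
    den = math.sqrt(sum(v * v for v in c1.values())) * \
          math.sqrt(sum(v * v for v in c2.values()))
    return num / den if den else 0.0

def retrieval_reward(step, knowledge_items):
    # Reward is the similarity to the closest knowledge-base entry.
    return max(cosine(bow(step), bow(item)) for item in knowledge_items)


kb = [
    "use a monotonic stack to find the previous and next smaller elements",
    "compute prefix sums to answer range sum queries in constant time",
]
score = retrieval_reward("find next smaller elements with a monotonic stack", kb)
```

A production system would of course use learned embeddings rather than word counts; the sketch only shows the shape of the reward signal.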
## Appendix B Dataset Details
As detailed in Table 4, our knowledge base was built from the CodeContests-Train and APPS-Train datasets, which together provide 11,038 training samples. The model's performance was then benchmarked against a test set consisting of six standard benchmarks: the APPS test splits (by difficulty), the CodeContests test split, HumanEval+, and MBPP+.
| | Dataset | Samples |
| --- | --- | --- |
| Knowledge Base | CodeContests-Train | 7368 |
| | APPS-Train | 3670 |
| Test Set | APPS-Test-Introductory | 150 |
| | APPS-Test-Interview | 150 |
| | APPS-Test-Competition | 150 |
| | CodeContests-Test | 150 |
| | HumanEval+ | 164 |
| | MBPP+ | 378 |
Table 4: Dataset statistics.
## Appendix C Prompts
Figures 6–9 show selected key prompts used in our method. The prompt in Figure 6 generates the next step. The prompt in Figure 7 completes the full steps. The prompt in Figure 8 analyzes execution errors to locate erroneous steps. The prompt in Figure 9 translates the full steps into code.
## Appendix D Algorithm Details
Algorithm 1 provides the detailed pseudocode for our proposed RPM-MCTS. The algorithm follows the canonical MCTS structure, iteratively performing four phases: Selection, Expansion, Evaluation, and Backpropagation. A key component is the EVALUATE_NODE function: instead of a traditional random rollout, it generates a complete code solution from the current path, executes it in a sandboxed environment, and assesses its correctness. If execution fails, the algorithm activates a reflection mechanism that identifies the erroneous step, prunes the incorrect sub-path, and records the reflection to guide future searches.
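To make the sandbox-based evaluation concrete, the sketch below (our own simplification, not the paper's implementation; the function name `evaluate_candidate` and in-process `exec` are illustrative stand-ins for a real isolated sandbox) runs a candidate solution against test cases and returns a pass-rate reward plus the first failing case, which would feed the reflection step:

```python
# Simplified sketch of sandbox-style evaluation (not the authors' code):
# execute a candidate solution against test cases, return a pass-rate
# reward and the first failing case for the reflection mechanism.

def evaluate_candidate(code: str, func_name: str, test_cases):
    """Execute `code`, then call `func_name` on each (args, expected) pair."""
    namespace = {}
    try:
        exec(code, namespace)  # a real system would use an isolated sandbox
    except Exception as e:
        return 0.0, f"compile/runtime error: {e}"
    func = namespace.get(func_name)
    if func is None:
        return 0.0, f"function {func_name!r} not defined"
    passed, first_failure = 0, None
    for args, expected in test_cases:
        try:
            if func(*args) == expected:
                passed += 1
            elif first_failure is None:
                first_failure = (args, expected)
        except Exception as e:
            if first_failure is None:
                first_failure = (args, repr(e))
    return passed / len(test_cases), first_failure


candidate = "def add(a, b):\n    return a + b\n"
reward, failure = evaluate_candidate(candidate, "add", [((1, 2), 3), ((0, 0), 0)])
```

A correct candidate yields a reward of 1.0 with no failure; a buggy one yields a lower reward and the first failing test case.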
## Appendix E Case Study
We present a case study of our method in Figure 10. This figure provides a visualization of the final state of the Monte Carlo Tree after the search process concludes. In the tree, the root node represents the problem statement, while subsequent nodes correspond to individual reasoning steps. Furthermore, each node is annotated with its access sequence, visit count, and value. After four search iterations, our method successfully identifies the correct reasoning path to the solution, which corresponds to the leftmost path in the figure.
Prompt for Generating Next Step
Your task is to provide the correct next step based on the previous incorrect code used to solve the problem and a reflection, for a given programming problem and its existing solution steps (which are incomplete). Let's think step by step. But you only generate one step at a time. We aim to decompose complex problems into a series of simpler subproblems and sequentially generate the corresponding steps to solve each subproblem. All the substeps should be combined in a way that avoids contradictions, forming a coherent solution to the original complex problem.
Input format (n steps):
- Problem: {problem}
- Existing steps: {existing_steps}
- Analysis: {reflection}
- History: {history}

The historical content is the solution proposed earlier. To ensure the diversity of solutions, please do not generate ideas identical to those in the historical content.
Guidelines:
- The steps you generate will be passed to a code generation model, so they should be structured in a way that is easy for the model to understand.
- Keep each step concise and focused, avoiding the inclusion of too much information at once. Ensure clear organization and logical progression in your reasoning.
- Important: You can use very little code as detailed explanations in your answers, but you cannot just write code.
- If your answer includes code, it will cause unforeseen losses!
- Your answer should be based on the given analysis. Only if the analysis is wrong can you answer it in your own way.
- If no existing steps are provided, you should output the first step based on the given analysis.
- If there are existing steps, output the next step (Step n+1) that logically follows the provided analysis and the previous steps.
Output format:
- Next step: …

Your response should only generate solutions to the problem, without any extra words.
Figure 6: Prompt for Generating Next Step.
Prompt for Generating Full Steps
Your task is to take a programming problem and incomplete solution steps (not a full answer), then continue from the provided steps to complete all remaining steps and generate the complete final solution. Let's think step by step. We aim to decompose complex problems into a series of simpler subproblems and sequentially generate the corresponding steps to solve each subproblem. All the substeps should be combined in a way that avoids contradictions, forming a coherent solution to the original complex problem. Note: Do not modify the existing solution steps.
Input format (n steps):
- Problem: {problem}
- Existing steps: {existing_steps}
Guidelines:
- If n is equal to 0, you need to start from scratch and analyze the solution idea briefly, and then output the complete answer.
- Otherwise, you need to output the complete answer that you think is correct following the train of thought of the existing steps.
- Each step generated should be concise and focused, addressing only a small part of the solution. Avoid making the steps too complex or combining multiple ideas into one.
- The complete solution should consist of at least three steps, so don't skip any essential steps.
- Your output should be clear and systematic, with each step described one at a time to ensure logical progression.
- Note: You are only allowed to describe the reasoning steps in natural language. Do not output any code.
Output format:
- Step 1: …
- Step 2: …
- …
- Step n: …
- Step n + 1: …
- …

Among them, Step 1 to Step n are consistent with the existing steps. Continue to generate based on the existing steps to obtain a complete answer. The following is the input. Please output according to the specified output format, do not output unnecessary information, and do not repeat the question. Note: Your output should start from Step 1 and include all the steps, not just the next step.
Figure 7: Prompt for Generating Full Steps.
Prompt for Code Debugging and Analysis
The following is a Python code problem, which includes the thoughts and code for solving the problem, as well as the return results of debugging for a failed test case.
Input:
- Python code problem: {problem}
- Thoughts: {solution}
- Code: {code}
- Test case debug information: {exec_result}

The debugging process is to first split the code into block-level code according to the AST. If the block-level code is correct after debugging analysis, the "correct" field is True, otherwise it is False. The "explanation" field is the analysis of the block-level code debugging.
Guidelines: Your task is to determine which specific step is written incorrectly based on the debug return results and conduct an analysis and summary. The correctly generated code and corresponding thought processes will be retained, while the incorrect code and corresponding thought processes will be discarded. You need to analyze and summarize the points to note so that subsequent thought processes can be generated based on the correct thought processes to correct the previous errors.
Output format: Your output consists of two parts:
1. Which specific step went wrong. Wrap it with the <step_n>x</step_n> XML tag, where x represents the specific number of the first erroneous step. If there are multiple erroneous steps in the thought process, only output the number of the first erroneous step. Do not output any extra content.
2. Analyze and summarize the points to note.

The final output should look like this: <step_n>x</step_n>..., where ... represents the generated analysis.
Figure 8: Prompt for analyzing debugging results and identifying errors.
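The block-level splitting described in the prompt of Figure 8 can be approximated with Python's standard `ast` module. The sketch below (our own illustration, not the authors' implementation; the helper name `split_into_blocks` is hypothetical) splits a function body into top-level statement blocks together with their source line ranges, so execution feedback can be mapped back to a block:

```python
import ast

# Illustrative sketch (not the authors' code): split a code snippet into
# block-level units via the AST, recording each top-level statement's
# source line range.

def split_into_blocks(source: str):
    tree = ast.parse(source)
    blocks = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            for stmt in node.body:
                # (node type, first line, last line) of each block
                blocks.append((type(stmt).__name__, stmt.lineno, stmt.end_lineno))
    return blocks


code = """
def solve(a):
    best = float('inf')
    for i in range(len(a) - 1):
        best = min(best, max(a[i], a[i + 1]))
    return best
"""
print(split_into_blocks(code))
```

Each block could then be annotated with the "correct"/"explanation" fields the prompt expects.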
Prompt for Code Implementation
You will play the role of a code implementer, writing complete code based on the given problem and the step-by-step analysis of the problem. Your code must strictly follow the analysis steps provided and should not include your own opinions.
Rules:
- Import function libraries (like: import math) and output function code only, without a main function, so that I can call your generated functions directly.
- The output code should be wrapped with code blocks (like ```python). Example: ```python\ndef add(a, b):\n    return a + b\n```.
Input:
- question: {question}
- analysis: {analysis}
Figure 9: Prompt for generating code based on steps.
Algorithm 1 The RPM-MCTS Algorithm for Code Generation
Input: Problem description $P$, total iterations $I$, branching factor $B$, success threshold $\theta_{succ}$
Output: The best generated code solution $C_{best}$
1: $v_{root} \leftarrow CREATE\_NODE(P)$
2: for $i \leftarrow 1$ to $I$ do
3: // 1. Selection
4: $v_l \leftarrow v_{root}$
5: while $v_l$ is fully expanded do
6: $v_l \leftarrow SELECT\_BEST\_CHILD\_UCB(v_l)$
7: end while
8:
9: // 2. Expansion
10: if $v_l$ is not a terminal node then
11: $S_{gen} \leftarrow \emptyset$, $H_{hist} \leftarrow \emptyset$
12: for $j \leftarrow 1$ to $B$ do
13: $s_{new} \leftarrow GENERATE\_NEXT\_STEP(P, path(v_l), H_{hist})$ {Generate diverse steps}
14: Add $s_{new}$ to $S_{gen}$ and $H_{hist}$
15: end for
16: $S_{unique} \leftarrow FILTER\_SIMILAR\_STEPS(S_{gen})$ {Filter semantic duplicates}
17: for each step $s$ in $S_{unique}$ do
18: $v_c \leftarrow ADD\_CHILD(v_l, s)$
19: $v_c.value \leftarrow GET\_INITIAL\_VALUE(path(v_c))$ {Score from knowledge base & LLM}
20: end for
21: Mark $v_l$ as fully expanded
22: end if
23:
24: // 3. Evaluation (Simulation)
25: $v_r \leftarrow SELECT\_BEST\_CHILD\_UCB(v_l)$
26: if $v_r$ is not NULL then
27: $is\_solved, Q_{final} \leftarrow EVALUATE\_NODE(v_r, P, \theta_{succ})$
28: if $is\_solved$ then
29: return $GET\_SOLUTION(v_r)$ {Optimal solution found, terminate early}
30: end if
31: else
32: $Q_{final} \leftarrow v_l.value$ {Use parent value if no children to evaluate}
33: end if
34:
35: // 4. Backpropagation
36: $BACKPROPAGATE(v_r, Q_{final})$
37: end for
38: return $GET\_BEST\_SOLUTION(v_{root})$ {Return best solution after all iterations}
39:
40: Function EVALUATE_NODE($v, P, \theta_{succ}$)
41: $\tau_s \leftarrow path(v)$
42: $S_{full}, C \leftarrow GENERATE\_FULL\_SOLUTION(\tau_s)$
43: $r_{exec}, res_{sb} \leftarrow EXECUTE\_CODE(C)$ {Evaluate in sandbox}
44: $r_{llm}, f \leftarrow EVALUATE\_WITH\_LLM(S_{full}, C, res_{sb})$ {$f$ is reflection}
45: $Q_{comb} \leftarrow \gamma \cdot r_{exec} + (1 - \gamma) \cdot r_{llm}$ {Weighted combined value}
46: if $r_{exec}$ is SUCCESS and $r_{llm} \geq \theta_{succ}$ then
47: $ADD\_SOLUTION\_TO\_TREE(v, S_{full})$
48: return true, $Q_{comb}$
49: end if
50: if $r_{exec}$ is FAILURE then
51: $idx_{err} \leftarrow LOCATE\_ERROR\_STEP(S_{full}, res_{sb}, f)$
52: $S_{pruned} \leftarrow PRUNE\_STEPS(S_{full}, idx_{err})$
53: $ADD\_SOLUTION\_TO\_TREE(v, S_{pruned})$ {Add correct partial path}
54: $UPDATE\_REFLECTION\_IN\_TREE(v, f)$
55: end if
56: return false, $Q_{comb}$ {Return failure if not solved}
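The FILTER_SIMILAR_STEPS routine in Algorithm 1 removes semantically duplicate expansion candidates. As a hedged sketch, the snippet below approximates semantic similarity with a character-level ratio from the standard-library `difflib`; the function name, the greedy keep-first strategy, and the 0.8 threshold are our own assumptions, not the paper's implementation:

```python
import difflib

# Illustrative sketch of FILTER_SIMILAR_STEPS (not the authors' code):
# keep a step only if it is sufficiently dissimilar to every step
# already kept, approximating semantic similarity with difflib.

def filter_similar_steps(steps, threshold=0.8):
    unique = []
    for step in steps:
        if all(difflib.SequenceMatcher(None, step, kept).ratio() < threshold
               for kept in unique):
            unique.append(step)
    return unique


steps = [
    "Compute prefix sums of the array.",
    "Compute the prefix sums of the array.",  # near-duplicate, dropped
    "Use a monotonic stack to find ranges.",
]
print(filter_similar_steps(steps))
```

An embedding-based similarity would serve the same role; only the deduplication logic matters here.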
[Figure 10, visual description: the final Monte Carlo tree for a problem that asks, given an array $a$ of $n$ integers, for the pair $(i, j)$ with $i < j$ minimizing $g(i, j) = (j - i) \cdot \max(a[i..j])$. The root node holds the problem statement, and three branches explore (i) a brute-force approach evaluating all pairs in $O(n^3)$, (ii) a monotonic-stack approach that finds, for each element, the range where it is the maximum, in $O(n)$, and (iii) a greedy approach that proves the minimum is always attained at an adjacent pair, reducing the answer to $\min_i \max(a[i], a[i+1])$ in $O(n)$. Each node is annotated with its access sequence, visit count, and value, and all branches verify the answers 3 and 2 on the example arrays $[1, 3, 1]$ and $[1, 2]$.]
Figure 10: Case study.
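The greedy approach analyzed in Figure 10 reduces the problem to the minimum over adjacent-pair maxima. A minimal sketch of that solution (our illustration of the derivation shown in the figure):

```python
# Greedy solution from the case study: for g(i, j) = (j - i) * max(a[i..j]),
# the minimum is attained at an adjacent pair, so the answer is
# min over i of max(a[i], a[i + 1]), computed in O(n) time.

def min_score(a):
    return min(max(a[i], a[i + 1]) for i in range(len(a) - 1))


print(min_score([1, 3, 1]))  # 3
print(min_score([1, 2]))     # 2
```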