# SACTOR: LLM-Driven Correct and Idiomatic C to Rust Translation with Static Analysis and FFI-Based Verification
**Authors**: Tianyang Zhou, Ziyi Zhang, Haowen Lin, Somesh Jha, Mihai Christodorescu, Kirill Levchenko, Varun Chandrasekaran
> University of Illinois Urbana-Champaign
> University of Wisconsin–Madison
> Google
## Abstract
Translating software written in C to Rust has significant benefits in improving memory safety. However, manual translation is cumbersome, error-prone, and often produces unidiomatic code. Large language models (LLMs) have demonstrated promise in producing idiomatic translations, but offer no correctness guarantees. We propose SACTOR, an LLM-driven C-to-Rust translation tool that employs a two-step process: an initial "unidiomatic" translation to preserve the interface, followed by an "idiomatic" refinement to align with Rust standards. To validate the correctness of our function-wise incremental translation that mixes C and Rust, we use end-to-end testing via the foreign function interface. We evaluate SACTOR on 200 programs from two public datasets and on two more complex scenarios (a 50-sample subset of CRust-Bench and the libogg library), comparing multiple LLMs. Across datasets, SACTOR delivers high end-to-end correctness and produces safe, idiomatic Rust with up to 7× fewer Clippy warnings. On CRust-Bench, SACTOR achieves an average (across samples) of 85% unidiomatic and 52% idiomatic success, and on libogg it attains full unidiomatic and up to 78% idiomatic coverage with GPT-5.
Keywords: Software Engineering · Static Analysis · C · Rust · Large Language Models · Machine Learning
## 1 Introduction
C is widely used due to its ability to directly manipulate memory and hardware (love2013linux). However, manual memory management leads to vulnerabilities such as buffer overflows, dangling pointers, and memory leaks (bigvul). Rust addresses these issues by enforcing memory safety through a strict ownership model without garbage collection (matsakis2014rust), and has been adopted in projects like the Linux kernel (https://github.com/Rust-for-Linux/linux) and Mozilla Firefox. Translating legacy C code into idiomatic Rust improves safety and maintainability, but manual translation is error-prone, slow, and requires expertise in both languages.
Automatic tools such as C2Rust (c2rust) generate Rust by analyzing C ASTs, but rule-based or static approaches (crown; c2rust; emre2021translating; hong2024don; ling2022rust) typically yield unidiomatic code with heavy use of unsafe. Given semantic differences between C and Rust, idiomatic translations are crucial for compiler-enforced safety, readability, and maintainability.
Large language models (LLMs) show potential for capturing syntax and semantics (pan2023understanding), but they hallucinate and often generate incorrect or unsafe code (perry2023users). In C-to-Rust translation, naive prompting produces unsafe or semantically misaligned outputs. Prior work has explored prompting strategies (syzygy; c2saferrust; shiraishi2024context) and verification methods such as fuzzing and symbolic execution (vert; flourine). While these improve correctness, they struggle with complex programs and rarely yield idiomatic Rust. For example, Vert (vert) fails on programs with complex data structures, and C2SaferRust (c2saferrust) still produces Rust with numerous unsafe blocks.
In this paper, we introduce SACTOR, a structure-aware, LLM-driven C-to-Rust translator (Figure 1). SACTOR follows a two-stage pipeline:
- C → Unidiomatic Rust: an interface-preserving translation that may use unsafe for low-level operations.
- Unidiomatic → Idiomatic Rust: a behavior-preserving refinement that adopts Rust idioms, eliminating unsafe and migrating C API patterns to Rust equivalents.
Static analysis of C code (pointer semantics, dependencies) guides both stages. To verify correctness, we embed the translated Rust with the original C via the Foreign Function Interface (FFI), enabling end-to-end testing of both stages; a stage is accepted only when all end-to-end tests pass. This decomposition separates syntax from semantics, simplifies the LLM task, and yields more idiomatic, memory-safe Rust. SACTOR's code is available at https://github.com/qsdrqs/sactor and the datasets at https://github.com/qsdrqs/sactor-datasets. An example of the SACTOR translation process is given in Appendix E.
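To make the two stages concrete, here is a minimal sketch of our own (the `sum` function and both Rust versions are illustrative examples, not drawn from the paper's appendices):

```rust
// Original C, for reference:
//   int sum(const int *xs, size_t n) {
//       int acc = 0;
//       for (size_t i = 0; i < n; i++) acc += xs[i];
//       return acc;
//   }

// Stage 1: unidiomatic Rust. It keeps the C interface (raw pointer +
// length) and uses `unsafe` for pointer arithmetic, so it can be linked
// back into the C build via FFI for end-to-end testing.
#[no_mangle]
pub unsafe extern "C" fn sum(xs: *const i32, n: usize) -> i32 {
    let mut acc = 0;
    for i in 0..n {
        acc += *xs.add(i);
    }
    acc
}

// Stage 2: idiomatic Rust. The pointer/length pair becomes a slice and
// the loop becomes an iterator; no `unsafe` remains.
pub fn sum_idiomatic(xs: &[i32]) -> i32 {
    xs.iter().sum()
}
```

The second stage is only attempted after the first has passed the end-to-end tests, so idiomatic refinement starts from code whose behavior has already been validated.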
LLM orchestration. SACTOR places the LLM inside a neuro-symbolic feedback loop. Static analysis and a machine-readable interface specification guide prompting; compiler diagnostics and end-to-end tests provide structured feedback. In the idiomatic verification phase, a rule-based harness generator with an LLM fallback completes the feedback loop. This design first ensures semantic correctness in unidiomatic Rust, then refines it into idiomatic Rust, with both stages verifiable in a unified two-step process.
Our contributions are as follows:
- Method: An LLM-orchestrated, structure-aware two-phase pipeline that separates semantic preservation from idiomatic refinement, guided by static analysis (§ 4).
- Verification: SACTOR verifies both unidiomatic and idiomatic translations via FFI-based testing. During idiomatic verification, it uses a co-produced interface specification to synthesize C/Rust harnesses, with an LLM fallback for missing patterns; compiler and test feedback are structured into targeted prompt repairs (§ 4.3).
- Evaluation: Across two datasets (200 programs) and five LLMs, SACTOR reaches 93% / 84% end-to-end correctness (DeepSeek-R1) and improves idiomaticity (§ 6.2). On CRust-Bench (50 samples), unidiomatic translation averages an 85% function-level success rate across all samples (82% aggregated across functions), with 32/50 samples fully translated; idiomatic success is computed on those 32 samples and averages 52% (43% aggregated; 8/32 fully idiomatic). On libogg (77 functions), the function-level success rate is 100% for unidiomatic translation and 53% and 78% for idiomatic translation with GPT-4o and GPT-5, respectively (§ 6.3).
- Diagnostics: We analyze efficiency, feedback, temperature sensitivity, and failure cases: GPT-4o is the most token-efficient, compilation/testing feedback boosts weaker models by 17%, temperature has little effect, and reasoning models like DeepSeek-R1 excel on complex bugs such as format-string and array errors (Appendix H).
Figure 1: Overview of the SACTOR methodology.
## 2 Background
Primer on C and Rust: C is a low-level language that provides direct access to memory and hardware through pointers and thin abstractions over machine-level instructions (tiobe). While this makes it efficient, it suffers from memory vulnerabilities (sbufferoverflow; hbufferoverflow; uaf; memoryleak). Rust, in contrast, provides memory safety without an additional performance penalty and retains the same ability to access low-level hardware as C; it enforces strict compile-time memory safety through ownership, borrowing, and lifetimes to eliminate memory vulnerabilities (matsakis2014rust; jung2017rustbelt).
Challenges in Code Translation: Despite its advantages, and since Rust is relatively new, many widely used system-level programs remain in C. It is desirable to translate such programs to Rust, but the process is challenging due to fundamental language differences. Figure 3 in Appendix A shows a simple C program and its Rust equivalent to illustrate the differences between the two languages in terms of memory management and error handling. While Rust permits unsafe blocks for C-like pointer operations, their use is discouraged due to the absence of compiler guarantees and their non-idiomatic nature, which complicates further maintenance. Other differences include string representation, pointer usage, array handling, reference lifetimes, and error propagation; a non-exhaustive summary appears in Appendix A.
## 3 Related Work
LLMs for C-to-Rust Translation: Vert (vert) combines LLM-generated candidates with fuzz testing and symbolic execution to ensure equivalence, but this strict verification struggles with scalability and complex C features. Flourine (flourine) incorporates error feedback and fuzzing, using data type serialization to mitigate mismatches, yet serialization issues still account for nearly half of errors. shiraishi2024context decompose C programs into sub-tasks (e.g., macros) and translate them with predefined Rust idioms, but evaluate only compilation success without functional correctness. syzygy employ dynamic analysis to capture runtime behavior as translation guidance, but coverage limits hinder generalization across execution paths. c2saferrust refine C2Rust outputs with LLMs to reduce unidiomatic constructs (unsafe, libc), but remain constrained by C2Rust's preprocessing, which strips comments and directives (§ 4.2) and reduces context for idiomatic translation.
Non-LLM Approaches for C-to-Rust Translation: C2Rust (c2rust) translates by converting C ASTs into Rust ASTs and applying rule-based transformations. While syntactically correct, the results are structural translations that rely heavily on unsafe blocks and explicit type conversions, yielding low readability. Crown (crown) introduces static ownership tracking to reduce pointer usage in generated Rust code. hong2024don focus on handling return values in translation, while ling2022rust rely on rules and heuristics. Although these methods reduce some unsafe usage compared to C2Rust, the resulting code remains largely unidiomatic.
## 4 SACTOR Methodology
We propose SACTOR, an LLM-driven C-to-Rust translation tool using a two-step translation methodology. As Rust and C differ substantially in semantics (§ 2), SACTOR augments the LLM with static-analysis-derived "hints" that capture semantic information in the C code. The four main stages of SACTOR are outlined below.
### 4.1 Task Division
We begin by dividing the program into smaller parts that can be processed by the LLM independently. This enables the LLM to focus on a narrower scope for each translation task and ensures the program fits within its context window. This strategy is supported by studies showing that LLM performance degrades on long-context understanding and generation tasks (liu2024longgenbench; li2024long). By breaking the program into smaller pieces, we can mitigate these limitations and improve performance on each individual task. To facilitate task division and extract relevant language information, such as definitions, declarations, and dependencies, from C code, we developed a static analysis tool called C Parser based on libclang (a library that provides a C compiler interface, allowing access to semantic information of the code).
Our C Parser analyzes the input program and splits the program into fragments consisting of a single type, global variable, or function definition. This step also extracts semantic dependencies between each part (e.g., a function definition depending on a prior type definition). We then process each program fragment in dependency order: all dependencies of a code fragment are processed before the fragment. Concretely, C Parser constructs a directed dependency graph whose nodes are types, global variables, and functions, and whose edges point from each item to the items it directly depends on. We compute a translation order by repeatedly selecting items whose dependencies have already been processed. If the dependency graph contains a cycle, SACTOR currently treats this as an unsupported case and terminates with an explicit error. In addition, to support real-world C projects, SACTOR makes use of the C project compile commands generated by the make tool and performs preprocessing on the C source files. In Appendix B, we provide more details on how we preprocess source files and divide programs.
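The dependency-ordered scheduling described above can be sketched as follows. This is a simplified illustration of ours: SACTOR's actual C Parser is built on libclang and works over real ASTs, and the fragment names in the example are hypothetical.

```rust
use std::collections::HashMap;

// Dependency-ordered scheduling, simplified: nodes are named fragments
// (types, globals, functions); `deps[item]` lists what `item` directly
// depends on. We repeatedly emit items whose dependencies have all been
// processed, and report an error when only a cycle remains.
pub fn translation_order(deps: &HashMap<&str, Vec<&str>>) -> Result<Vec<String>, String> {
    let mut done: Vec<String> = Vec::new();
    let mut remaining: Vec<&str> = deps.keys().copied().collect();
    remaining.sort(); // deterministic tie-breaking for this sketch
    while !remaining.is_empty() {
        let ready: Vec<&str> = remaining
            .iter()
            .copied()
            .filter(|item| deps[item].iter().all(|d| done.iter().any(|x| x == d)))
            .collect();
        if ready.is_empty() {
            // every remaining item waits on another remaining item
            return Err(format!("dependency cycle among: {:?}", remaining));
        }
        for item in &ready {
            done.push(item.to_string());
            remaining.retain(|x| x != item);
        }
    }
    Ok(done)
}
```

For example, a `dist` function depending on a `point_t` type and an `origin` global would be scheduled after both; a mutual dependency between two items would surface as the explicit cycle error SACTOR reports.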
### 4.2 Translation
To ensure that each program fragment is translated only after its dependencies have been processed, we begin by translating data types, as they form the foundational elements for functions. This is followed by global variables and functions. We divide the translation process into two steps.
Step 1. Unidiomatic Rust Translation: We aim to produce interface-equivalent Rust code from the original C code, allowing the use of unsafe blocks for pointer manipulation and C standard library functions while keeping the same interface as the original C code. For data type translation, we leverage information from C2Rust (c2rust) to help the conversion. While C2Rust provides reliable data type translation, it struggles with function translation due to its compiler-based approach, which omits source-level details like comments, macros, and other elements. These omissions significantly reduce the readability and usability of the generated Rust code. Thus, we use C2Rust only for data type translation, and use an LLM to translate global variables and functions. For functions, we rely on our C Parser to automatically extract dependencies (e.g., function signatures, data types, and global variables) and reference the corresponding Rust code. This approach guides the LLM to accurately translate functions by leveraging the previously translated components and directly reusing or invoking them as needed.
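As an illustrative sketch of ours (not SACTOR's actual output): a C struct kept layout-compatible in the style of C2Rust, plus a function translated unidiomatically against the original C interface, might look like this.

```rust
// C original, for reference:
//   typedef struct { double x; double y; } point_t;
//   double dot(const point_t *a, const point_t *b);

// Data type: kept layout-compatible with C (the role C2Rust plays here).
#[repr(C)]
pub struct Point {
    pub x: f64,
    pub y: f64,
}

// Function: translated against the original C interface (raw pointers,
// same symbol name), so it can replace the C definition when linked
// back into the C build for end-to-end testing.
#[no_mangle]
pub unsafe extern "C" fn dot(a: *const Point, b: *const Point) -> f64 {
    (*a).x * (*b).x + (*a).y * (*b).y
}
```

Because the struct layout and symbol are preserved, the C callers of `dot` need no changes; only the linked definition moves to Rust.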
Step 2. Idiomatic Rust Translation: The goal of this step is to refine unidiomatic Rust into idiomatic Rust by removing unsafe blocks and following Rust idioms. This stage focuses on rewriting behaviorally equivalent but low-level constructs into type-safe abstractions while preserving the behavior verified in the previous step. Handling pointers from C code is a key challenge, as they are considered unsafe in Rust. Unsafe pointers should be replaced with Rust types such as references, arrays, or owned types. To address this, we use Crown (crown) to facilitate the translation by analyzing pointer mutability, fatness (e.g., arrays), and ownership. This information provided by Crown helps the LLM assign appropriate Rust types to pointers. Owned pointers are translated to Box, while borrowed pointers use references or smart pointers. Crown assists in translating data types like struct and union, which are processed first as they are often dependencies for functions. For function translations, Crown analyzes parameters and return pointers, while local variable pointers are inferred by the LLM. The idiomatic code, produced together with an interface transformation specification, forms the input to the verification step in § 4.3.
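Continuing with an illustrative sketch of ours: given Crown-style hints that an array parameter is a borrowed "fat" pointer and that the result is caller-owned, the refinement replaces the pointer/length pair with a slice and the returned buffer with an owned `Vec`, removing all `unsafe`.

```rust
// Unidiomatic form, for reference:
//   unsafe extern "C" fn scale(xs: *const f64, n: usize, k: f64) -> *mut f64;

// Idiomatic refinement: the (pointer, length) pair becomes a slice
// (hint: borrowed, array-shaped pointer) and the heap-allocated result
// becomes an owned Vec, so no `unsafe` remains.
pub fn scale(xs: &[f64], k: f64) -> Vec<f64> {
    xs.iter().map(|x| x * k).collect()
}
```

Had the hint instead marked the input as an owned single-element pointer, a `Box<f64>` rather than a slice would be the natural target type.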
### 4.3 Verification
To verify the equivalence between source and target languages, prior work has relied on symbolic execution and fuzz testing, which are impractical for real-world C-to-Rust translation (details in Appendix C). We instead validate correctness through soft equivalence: ensuring functional equivalence of the entire program via end-to-end (E2E) tests. This avoids the complexity of generating specific inputs or constraints for individual functions and is well suited to real-world programs, where such E2E tests are often available and reusable. Correctness confidence in this framework depends on the code coverage of the E2E tests: the broader the coverage, the stronger the assurance of equivalence.
Verifying Unidiomatic Rust Code. This is straightforward: the unidiomatic code is semantically equivalent to the original C code and maintains compatible function signatures and data types, which ensures a consistent Application Binary Interface (ABI) between the two languages and enables direct use of the FFI for cross-language linking. The verification process involves two main steps. First, the unidiomatic Rust code is compiled with the Rust compiler to check for successful compilation. Then, the original C code is recompiled with the Rust translation linked as a shared library. This setup ensures that when the C code calls the target function, it invokes the Rust translation instead. To verify correctness, E2E tests are run on the entire program, comparing the outputs of the original C code and the unidiomatic Rust translation. If all tests pass, the target function is considered verified.
Verifying Idiomatic Rust Code. Idiomatic Rust diverges from the original C program in both types and function signatures, producing an ABI mismatch that prevents direct linking into the C build. We therefore verify it via a synthesized, C-compatible test harness together with E2E tests.
During idiomatic translation, SACTOR co-produces a small, machine-readable specification (SPEC) for each function/struct. The SPEC captures, in a compact form, how C-facing values map to idiomatic Rust, including the expected pointer shape (slice / cstring / ref), where lengths come from (a sibling field or a constant), and basic nullability and return conventions; it also allows marking fields that should be compared in self-checks. A rule-based generator consumes the SPEC to synthesize a C-compatible harness that bridges from the C ABI to idiomatic code and back. Figure 9 shows the schematic, and Table 12 summarizes the currently supported patterns; Appendix L presents a detailed exposition of the SPEC-driven harness generation technique (rules and design choices), and Appendix D provides a concrete example of the generated harness. For structs, the SPEC defines bidirectional converters between the C-facing and idiomatic layouts, validated by a lightweight roundtrip test that checks the fields marked as comparable for consistency after conversion. When the SPEC includes a pattern the generator does not yet implement (e.g., aliasing/offset views or unsupported pointer kinds or types), we emit a localized TODO and use an LLM guided by the SPEC to fill in only the missing conversions. Finally, we compile the idiomatic crate and the generated harness, link them into the original C build via FFI, and run the program's existing E2E tests; passing tests validate the idiomatic translation under the coverage of those tests, while failures trigger the feedback procedure in § 4.3.
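As a hedged sketch (the SPEC fields and names below are our invention; the paper's actual format is described in its Appendix L), a rule-generated harness for a slice-shaped pointer whose length comes from a sibling parameter might look like this.

```rust
// Idiomatic target produced by Step 2 (illustrative).
pub fn checksum(data: &[u8]) -> u32 {
    data.iter().map(|&b| b as u32).sum()
}

// Rule-generated harness (illustrative): bridges the C ABI to the
// idiomatic function for a hypothetical SPEC entry such as
//   { shape: "slice", len_from: "param:n", nullable: true }.
// It keeps a C-callable symbol so the original E2E tests still link.
#[no_mangle]
pub unsafe extern "C" fn checksum_c(data: *const u8, n: usize) -> u32 {
    let slice: &[u8] = if data.is_null() {
        &[] // SPEC marks the pointer nullable: treat NULL as empty input
    } else {
        std::slice::from_raw_parts(data, n)
    };
    checksum(slice)
}
```

The harness concentrates all remaining `unsafe` at the ABI boundary; the idiomatic crate itself stays safe, which is what makes the existing C-side E2E tests reusable for verification.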
Feedback Mechanism. For failures, we feed structured signals back to translation: compiler errors guide fixes for build breaks; for E2E failures, we use a Rust procedural macro to automatically instrument the target to log salient inputs/outputs, re-run the tests, and return the traces to the translator for refinement.
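As a simplified illustration of this idea (the paper uses a procedural macro; this hand-written wrapper of a hypothetical `gcd` target only mimics its effect):

```rust
// Stand-in for the procedural-macro instrumentation: wrap the target,
// record its arguments and return value, and hand the trace back to the
// translator when an E2E test fails.
pub fn traced_gcd(a: u64, b: u64, trace: &mut Vec<String>) -> u64 {
    trace.push(format!("gcd input: a={}, b={}", a, b));
    let (mut a, mut b) = (a, b);
    while b != 0 {
        let t = a % b;
        a = b;
        b = t;
    }
    trace.push(format!("gcd output: {}", a));
    a
}
```

The recorded input/output pairs localize where the translated function first diverges from the C behavior, which is far more actionable for the LLM than a bare test failure.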
### 4.4 Code Combination
By translating and verifying all functions and data types, we integrate them into a unified Rust codebase. We first collect the translated Rust code from each subtask and remove duplicate definitions and other redundancies required only for standalone compilation. The cleaned code is then organized into a well-structured Rust implementation of the original C program. Finally, we run end-to-end tests on the combined program to verify the correctness of the final Rust output. If all tests pass, the translation is considered successful.
## 5 Experimental Setup
### 5.1 Datasets Used
For the selection of datasets for evaluation, we consider the following criteria:
- Sufficient Number: The dataset should contain a substantial number of C programs to ensure a robust evaluation of the approach's performance across a diverse set of examples.
- Presence of Non-Trivial C Features: The dataset should include C programs with advanced features such as multiple functions, `struct`s, and other non-trivial constructs, as this enables the evaluation to assess the approach's ability to handle complex features of C.
- Availability of E2E Tests: The dataset should either include E2E tests or make it easy to generate them as this is essential for accurately evaluating the correctness of the translated code.
Based on the above criteria, we evaluate on two widely used program suites in the translation literature: TransCoder-IR (transcoderir) and Project CodeNet (codenet). Complete details for these datasets are in Appendix F. For TransCoder-IR and CodeNet, we randomly sample 100 C programs from each (for CodeNet, among programs with external inputs) to ensure computational feasibility while maintaining statistical significance.
To better reflect the language features of real-world C codebases and allow test reuse (§ 6.3), we also evaluate on two targets: (i) a 50-sample subset of CRust-Bench (khatry2025crust) and (ii) the libogg multimedia container library (libogg). In CRust-Bench, we exclude entries outside our pipeline's scope (e.g., circular dependencies or compiler-specific intrinsics). libogg is a real-world C project of about 2,000 lines of code with 77 functions involving non-trivial `struct`s, buffers, and pointer manipulation. Both benchmarks reuse their upstream end-to-end tests to verify the translated code.
### 5.2 Evaluation Metrics
Success Rate: This is defined as the fraction of programs that (a) are successfully translated to Rust and (b) pass the E2E tests, for both the unidiomatic and idiomatic translation phases. To enable the LLMs to utilize feedback from previous failed attempts, we allow the LLM up to 6 attempts for each translation process.
Idiomaticity: To evaluate the idiomaticity of the translated code, we use three metrics:
- Lint Alert Count is measured by running Rust-Clippy (clippy), a tool that provides lints on unidiomatic Rust (including improper use of unsafe code and other common style issues). By collecting the warnings and errors generated by Rust-Clippy for the translated code, we can assess its idiomaticity: fewer alerts indicate more idiomaticity. Previous translation works (vert; flourine) have also used Rust-Clippy.
- Unsafe Code Fraction, inspired by shiraishi2024context, is defined as the ratio of tokens inside unsafe code blocks or functions to total tokens for a single program. High usage of unsafe is considered unidiomatic, as it bypasses compiler safety checks, introduces potential memory safety issues, and reduces code readability.
- Unsafe Free Fraction indicates the percentage of translated programs in a dataset that do not contain any unsafe code. Since unsafe code represents potential points where the compiler cannot guarantee safety, this metric helps determine the fraction of results that can be achieved without relying on unsafe code.
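As a rough sketch of how the Unsafe Code Fraction could be approximated (our own heuristic; the paper does not specify its tokenizer, and a real implementation would use a proper Rust lexer):

```rust
// Crude approximation of the Unsafe Code Fraction: the share of
// whitespace-separated tokens falling inside `unsafe { ... }` blocks
// (closing brace included), found by naive brace matching. A real
// implementation would lex properly (e.g. with the syn crate) and also
// handle `unsafe fn` bodies, strings, and comments; this sketch assumes
// rustfmt-style spacing around braces.
pub fn unsafe_fraction(src: &str) -> f64 {
    let tokens: Vec<&str> = src.split_whitespace().collect();
    if tokens.is_empty() {
        return 0.0;
    }
    let mut unsafe_tokens = 0usize;
    let mut depth = 0u32;    // brace depth inside the current unsafe block
    let mut pending = false; // saw `unsafe`, waiting for its opening `{`
    for tok in &tokens {
        if depth > 0 {
            unsafe_tokens += 1; // token lies inside an unsafe block
        }
        for c in tok.chars() {
            if pending && c == '{' {
                depth += 1;
                pending = false;
            } else if depth > 0 {
                if c == '{' {
                    depth += 1;
                } else if c == '}' {
                    depth -= 1;
                }
            }
        }
        if depth == 0 && *tok == "unsafe" {
            pending = true;
        }
    }
    unsafe_tokens as f64 / tokens.len() as f64
}
```

An entirely unsafe-free program scores 0, so the Unsafe Free Fraction above is simply the share of programs for which this metric is exactly zero.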
### 5.3 LLMs Used
We evaluate 6 models across different experiments. On the two datasets (TransCoder-IR and CodeNet) we use four non-reasoning models: GPT-4o (OpenAI), Claude 3.5 Sonnet (Anthropic), Gemini 2.0 Flash (Google), and Llama 3.3 70B Instruct (Meta); and one reasoning model, DeepSeek-R1 (DeepSeek). For real-world codebases, we run GPT-4o on CRust-Bench and both GPT-4o and GPT-5 on libogg. Model configurations appear in Appendix G.
## 6 Evaluation
Through our evaluation, we answer: (1) How successful is SACTOR in generating idiomatic Rust code using different LLM models?; (2) How idiomatic is the Rust code produced by SACTOR compared to existing approaches?; and (3) How well does SACTOR generalize to real-world C codebases?
Our results show that: (1) DeepSeek-R1 achieves the highest success rates with SACTOR on both TransCoder-IR (93%) and Project CodeNet (84%) (§ 6.1), while failure reasons vary between datasets and models (Appendix H); (2) SACTOR's idiomatic translations outperform all previous baselines, producing Rust code with fewer Clippy warnings and 100% unsafe-free translations (§ 6.2); and (3) for real-world codebases (§ 6.3), SACTOR attains strong unidiomatic success and moderate idiomatic success: on CRust-Bench, unidiomatic translation averages 85% across 50 samples (82% aggregated across 966 functions; 32/50 fully translated) and idiomatic translation averages 52% across the 32 samples that were fully translated into unidiomatic Rust (43% aggregated across 580 functions; 8/32 fully translated); on libogg, unidiomatic translation reaches 100% and idiomatic spans 53% and 78% for GPT-4o and GPT-5, respectively. Failures concentrate at ABI/type boundaries and in harness synthesis (pointer/slice shape, length sources, lifetime or mutability), with additional cases from unsupported features and borrow/ownership pitfalls. Overall, improving the model itself alleviates a subset of failure modes; for a fixed model, strengthening the framework and interface rules also improves outcomes but remains limited when confronted with previously unseen patterns.
We also evaluate the computational cost of SACTOR (Appendix I), the impact of the feedback mechanism (Appendix J), and temperature settings (Appendix K). GPT-4o and Gemini 2.0 achieve the best cost-performance balance, while Llama 3.3 consumes the most tokens among non-reasoning models. DeepSeek-R1 uses 3–7× more tokens than the others. The feedback mechanism boosts Llama 3.3's success rate by 17%, but has little effect on GPT-4o, suggesting it benefits lower-performing models more. Temperature has minimal impact.
### 6.1 Success Rate Evaluation
(a) TransCoder-IR SR
(b) CodeNet SR
Figure 2: Success rates (SR) across different LLM models for the TransCoder-IR and CodeNet datasets. SR 1-6 represent the number of attempts made to achieve a successful translation. Unid. and Idiom. denote unidiomatic and idiomatic translation steps, respectively.
We evaluate the success rate (as defined in § 5.2) for the two datasets on different models. For idiomatic translation, we also plot how many attempts are needed.
(1) TransCoder-IR (Figure 2(a)): DeepSeek-R1 achieves the highest success rate (SR) in both the unidiomatic (94%) and idiomatic (93%) steps, with only a 1% drop in the idiomatic step, demonstrating strong consistency in code translation. GPT-4o follows with 84% in the unidiomatic step and 80% in the idiomatic step. Gemini 2.0 comes next with 78% and 75%, respectively. Claude 3.5 struggles in the unidiomatic step (55%) but shows little degradation when converting unidiomatic Rust to idiomatic Rust (54%, only a 1% drop); it is nevertheless the weakest model overall. Llama 3.3 performs well in the unidiomatic step (76%) but drops significantly in the idiomatic step (64%), requiring more attempts to reach correctness.
(2) Project CodeNet (Figure 2(b)): DeepSeek-R1 again leads with 86% in the unidiomatic step and 84% in the idiomatic step, a drop of only 2%. Claude 3.5 follows closely with 86% in the unidiomatic step and 83% in the idiomatic step. GPT-4o performs consistently well in the unidiomatic step (84%) but drops to 79% in the idiomatic step, a 5% decline between the two steps. Gemini 2.0 follows with 78% and 74%, respectively, showing consistent performance across the two datasets. Llama 3.3 again exhibits a significant drop (83% to 76%) between the two steps and finishes last in the idiomatic step.
The results demonstrate that DeepSeek-R1's SRs remain high and consistent (94%/93% unidiomatic/idiomatic on TransCoder-IR versus 86%/84% on CodeNet), while other models exhibit notable performance drops when moving to TransCoder-IR. This suggests that models with reasoning capabilities may be better at handling complex code logic and data manipulation.
### 6.2 Measuring Idiomaticity
We compare our approach with four baselines: C2Rust (c2rust), Crown (crown), C2SaferRust (c2saferrust) and Vert (vert). Of these baselines, C2Rust is the most versatile (versatility here refers to an approach's applicability to diverse C programs), supporting most C programs, while Crown is also broad but lacks support for some language features. C2SaferRust focuses on refining the unsafe code produced by C2Rust, allowing it to handle a wide range of C programs. In contrast, Vert targets a specific subset of simpler C programs. We assess the idiomaticity of Rust code generated by C2Rust, Crown, and C2SaferRust on both datasets. Since Vert produced Rust code only for TransCoder-IR, we evaluate it solely on this dataset. All experiments use GPT-4o as the LLM for both the baselines and our approach, with at most six attempts per translation.
Results: Figure LABEL:fig:idiomaticity presents the lint alert count (the sum of Clippy warnings and errors for a single program) across all approaches. C2Rust consistently exhibits many Clippy issues, and Crown shows little improvement over C2Rust, indicating that both struggle to generate idiomatic Rust. C2SaferRust reduces Clippy issues but still retains a significant number of warnings and errors. Notably, even the unidiomatic output of SACTOR surpasses all three of these baselines, underscoring the advantage of LLMs over rule-based methods. While Vert improves idiomaticity, SACTOR's idiomatic phase yields fewer Clippy issues, outperforming some existing LLM-based approaches.
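To make the lint metric concrete, here is a small illustration of our own (not drawn from the datasets) of the kind of C-style Rust that accumulates Clippy alerts, next to an idiomatic rewrite that produces none:

```rust
// C-style translation: Clippy flags the `&Vec` parameter (clippy::ptr_arg)
// and the index-based loop (clippy::needless_range_loop).
fn sum_unidiomatic(v: &Vec<i32>) -> i32 {
    let mut total = 0;
    for i in 0..v.len() {
        total += v[i];
    }
    total
}

// Idiomatic rewrite: slice parameter and iterator, no Clippy alerts.
fn sum_idiomatic(v: &[i32]) -> i32 {
    v.iter().sum()
}

fn main() {
    let data = vec![1, 2, 3, 4];
    // Both versions compute the same result; only idiomaticity differs.
    assert_eq!(sum_unidiomatic(&data), sum_idiomatic(&data));
    println!("both sums: {}", sum_idiomatic(&data)); // prints "both sums: 10"
}
```

Both functions are behaviorally equivalent, which is exactly why lint counts (rather than test outcomes) are needed to distinguish them.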
Table LABEL:tab:unsafe_stats summarizes unsafe code statistics. Unsafe-Free indicates the percentage of programs without unsafe code, while Avg. Unsafe is the average proportion of unsafe code across all translations. C2Rust and Crown generate unsafe code in every program, with a high average unsafe percentage. C2SaferRust reduces unsafe code and can generate unsafe-free programs in some cases (45.6% in TransCoder-IR), but cannot sufficiently reduce unsafe usage on the CodeNet dataset. Vert has a higher success rate than SACTOR but occasionally introduces unsafe code. SACTOR's unidiomatic phase retains C semantics, leading to a high unsafe percentage. However, its idiomatic phase eliminates all unsafe code, achieving a 100% Unsafe-Free rate.
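As an illustration of our own of why a C-preserving translation scores a near-total unsafe fraction while the idiomatic refinement is unsafe-free: raw-pointer manipulation carried over from C is only legal inside `unsafe`, whereas the refinement uses owned containers throughout:

```rust
// C-preserving style: raw-pointer writes require `unsafe`, so essentially
// the whole function body counts toward the unsafe fraction.
unsafe fn fill_unidiomatic(arr: *mut i32, n: i32) {
    for i in 0..n {
        // Raw-pointer write, as a direct transliteration of `arr[i] = i` in C.
        unsafe { *arr.add(i as usize) = i };
    }
}

// Idiomatic refinement: owned Vec, no unsafe anywhere (100% unsafe-free).
fn fill_idiomatic(n: i32) -> Vec<i32> {
    (0..n).collect()
}

fn main() {
    let mut buf = vec![0i32; 4];
    unsafe { fill_unidiomatic(buf.as_mut_ptr(), 4) };
    // Both styles produce the same contents.
    assert_eq!(buf, fill_idiomatic(4));
}
```

The two versions are again observationally equivalent; the unsafe fraction is what separates them.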
### 6.3 Real-world Code-bases
To evaluate SACTOR's performance on two real-world code-bases, we run the translation process up to three times per sample, with SACTOR making at most six attempts per function, struct, and global variable in each run. For libogg, we also experiment with both GPT-4o and GPT-5 to compare their performance.
CRust-Bench.
Measured at the function level, the mean per-sample translation success rate is 85.15%. Aggregated across the 50 samples, SACTOR translates 788 of 966 functions (81.57% combined). 32 samples achieve 100% function-level translation, i.e., the entire C codebase for the sample is translated to unidiomatic Rust. For idiomatic translation, we evaluate only the 32 samples whose unidiomatic stage reached 100% function-level translation. On these samples, the mean per-sample function translation rate is 51.85%. Aggregated across them, SACTOR translates 249 of 580 functions (42.93% combined); 8 samples achieve 100% function-level idiomatic translation, i.e., the entire C codebase is translated to idiomatic Rust.
| Stage | Samples | SR (avg. per-sample) | Functions translated | Fully translated samples | Avg. lint / function |
| --- | --- | --- | --- | --- | --- |
| Unidiomatic | 50 | 85.15% | 788 / 966 (81.57%) | 32 / 50 (64.00%) | 2.96 |
| Idiomatic | 32 † | 51.85% | 249 / 580 (42.93%) | 8 / 32 (25.00%) | 0.28 |
Table 1: CRust-Bench function-level translation results. Success rate (SR) is averaged per-sample; † the idiomatic stage is evaluated only on samples whose unidiomatic pass fully translated all functions.
Table 1 summarizes stage-level outcomes.
Observations and failure modes. We organize failures into five main categories. (1) Interface/name drift: symbol casing or exact-name mismatches (e.g., CamelCase vs. snake_case). (2) Semantic mapping errors: mistakes in translating C constructs to idiomatic Rust (e.g., pointer-to-pointer vs. Vec, shape drift, lifetime or mutability issues). (3) C-specific features: incomplete handling of features such as function pointers and C variadics. (4) Borrowing and resource-model violations: compile-time borrow-checker errors in idiomatic Rust bodies (e.g., overlapping borrows in updates). (5) Harness/runtime faults: faulty translation of test harnesses (e.g., buffer mis-sizing, out-of-bounds access). Other minor cases include unsupported intrinsics (SIMD) and global-state divergence (shadowed globals). Table LABEL:tab:crust_failures (in Appendix M.1) summarizes each sample's outcome and its primary cause.
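As a concrete instance of category (4) (our illustration, not taken from CRust-Bench): C code freely aliases a buffer while appending to it, but a direct Rust transcription of that pattern is rejected at compile time:

```rust
// Direct transcription of an aliasing C update fails with E0502, because the
// immutable borrow of `v[0]` is still live when `push` needs `v` mutably:
//
// fn append_first(v: &mut Vec<i32>) {
//     let first = &v[0];
//     v.push(*first); // error[E0502]: cannot borrow `v` as mutable
// }

// Repaired translation: copy the value out before mutating.
fn append_first(v: &mut Vec<i32>) {
    let first = v[0]; // i32 is Copy, so no borrow outlives this line
    v.push(first);
}

fn main() {
    let mut v = vec![7, 8];
    append_first(&mut v);
    assert_eq!(v, vec![7, 8, 7]);
}
```

Fixes of this kind are semantically trivial but require the model to restructure code rather than translate it line by line, which is why this category persists into the idiomatic stage.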
Idiomaticity. Unidiomatic outputs exhibit many lint alerts and heavy reliance on unsafe: the mean Clippy alert sum is 50.14 per sample (2.96 per function); the mean unsafe fraction is 97.86% with an unsafe-free rate of 0%. Idiomatic outputs reverse this profile: the mean Clippy alert sum drops to only 2.27 per sample (0.28 per function); the mean unsafe fraction is 0% with a 100% unsafe-free rate.
Libogg.
| Step (model) | SR (%) | Avg. lint / function | Avg. attempts |
| --- | --- | --- | --- |
| Unid. (GPT-4o) | 100 | 1.45 | 1.52 |
| Idiom. (GPT-4o) | 53 | 0.28 | 2.00 |
| Unid. (GPT-5) | 100 | 1.45 | 1.04 |
| Idiom. (GPT-5) | 78 | 0.23 | 1.25 |
Table 2: Evaluation of SACTOR's function translation on libogg. "Unid."/"Idiom." denotes unidiomatic/idiomatic translation. "SR" is the success rate of translating functions. "Avg. lint"/"Avg. attempts" is the average lint alert count/average number of attempts, for functions that both LLM models succeed in translating.
The unidiomatic and idiomatic translations of all structs and global variables succeed with each LLM model. For functions, the results are summarized in Table 2. SACTOR succeeds in all functions' unidiomatic translations. For idiomatic translations with GPT-4o, SACTOR's success rate is 53%, taking 2.00 attempts on average to produce a correct translation. With GPT-5, performance is significantly better: a success rate of 78% and 1.25 attempts on average.
Observations and failure modes. The most significant causes of failed idiomatic translations are: (1) failure to pass tests due to mistakes in translating pointer manipulation and heap memory management; (2) compile errors in translated functions, especially from violations of Rust safety rules on lifetimes, borrowing, and mutability; (3) failure to generate compilable test harnesses for data types with pointers and arrays. GPT-5 performs significantly better than GPT-4o: it has only one failure caused by a compile error in a translated function, in contrast to six such failures with GPT-4o, demonstrating GPT-5's progress in understanding Rust syntax and fixing compile errors. More details can be found in Appendix M.2.
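For context on the harness failures: the end-to-end tests call translated functions across the FFI boundary, so the unidiomatic stage keeps the C symbol names and ABI. A minimal sketch of our own (not SACTOR's actual generated output, and glossing over allocator compatibility with C's `free`) of what such an exported function looks like:

```rust
// Unidiomatic stage: keep the C signature and ABI so untranslated C code and
// the existing end-to-end tests can keep calling this symbol unchanged.
#[no_mangle]
pub extern "C" fn create_sequence(n: i32) -> *mut i32 {
    if n <= 0 {
        return std::ptr::null_mut();
    }
    let mut v: Vec<i32> = (0..n).collect();
    let ptr = v.as_mut_ptr();
    std::mem::forget(v); // hand the buffer to the caller, C-style
    ptr
}

fn main() {
    // Exercise the exported function the way a C-side test harness would.
    let p = create_sequence(3);
    let s = unsafe { std::slice::from_raw_parts(p, 3) };
    assert_eq!(s, &[0, 1, 2]);
    assert!(create_sequence(0).is_null());
    // Reclaim the buffer on the Rust side for this in-process check.
    unsafe { drop(Vec::from_raw_parts(p, 3, 3)) };
}
```

Harnesses must size and marshal such pointer-and-length data correctly on both sides of the boundary, which is precisely where failure mode (3) arises.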
Idiomaticity. SACTOR's unidiomatic translations cause lint alerts largely due to the use of unsafe code, while its idiomatic translations produce very few lint alerts, i.e., fewer than 0.3 alerts per function on average (Table 2). With each model, the unidiomatic translations consist entirely of unsafe code while the idiomatic translations are entirely safe. As a result, the idiomatic translations have an average unsafe fraction of 0% and an unsafe-free fraction of 100%; the unidiomatic translations are the opposite.
## 7 Conclusions
Translating C to Rust enhances memory safety but remains error-prone and often unidiomatic. While LLMs improve translation, they still lack correctness guarantees and struggle with semantic gaps. SACTOR addresses these through a two-stage pipeline: first preserving the C interface (ABI), then refining to idiomatic Rust. Guided by static analysis and validated via FFI-based testing, SACTOR achieves high correctness and idiomaticity across multiple benchmarks, surpassing prior tools. Remaining challenges include stronger correctness assurance, richer C-feature coverage, and improved scalability and efficiency (see § 8). Example prompts appear in Appendix N.
## 8 Limitations
While SACTOR is effective in producing correct, idiomatic Rust, several limitations remain:
- Test coverage dependence. Our soft-equivalence checks rely on existing end-to-end tests; shallow or incomplete coverage can miss subtle semantic errors. Integrating fuzzing or test generation could raise coverage and catch corner cases.
- Model variance. Translation quality depends on the underlying LLM. Although GPT-4o and DeepSeek-R1 perform well in our study, other models show lower accuracy and stability.
- Unsupported C features. Complex macros, pervasive function pointers, global state, C variadics and inline assembly are only partially handled, limiting applicability to such codebases (see § 6.3).
- Static analysis precision. Current analysis may under-specify aliasing, ownership, and pointer shapes in challenging code, leading to adapter/spec errors. Stronger analyses could improve mapping and reduce retries.
- Harness generation stability. The rule-based generator with LLM fallback can still emit incomplete or brittle adapters on complex patterns (e.g., unusual pointer shapes or length expressions), causing otherwise-correct translations to fail verification. Hardening rules and reducing reliance on the fallback should improve robustness and reproducibility.
- Cost and latency. Multi-stage prompting, compilation, and test loops incur non-trivial token and time costs, which matter for large-scale migrations.
## Appendix A Differences Between C and Rust
### A.1 Code Snippets
Here is a code example to demonstrate the differences between C and Rust. The example shows a simple C program and its equivalent Rust program. The create_sequence function takes an integer n as input and returns an array containing a sequence of integers. In C, the function allocates memory for the array using malloc and returns a pointer to the allocated memory. If the size is invalid or the allocation fails, it returns NULL. The caller is responsible for freeing the memory using free when done with the array, to prevent memory leaks.
C Code:
```c
int* create_sequence(int n) {
    if (n <= 0) {
        return NULL;
    }
    int* arr = malloc(n * sizeof(int));
    if (!arr) {
        return NULL;
    }
    for (int i = 0; i < n; i++) {
        arr[i] = i;
    }
    return arr;
}
```
Example usage:
```c
int* sequence = create_sequence(5);
if (sequence == NULL) {
    ...
}
...
free(sequence); // Need to free the memory when done
```
Rust Code:
<details>
<summary>x6.png Details</summary>

### Visual Description
\n
## Code Snippet: Rust Function for Sequence Generation
### Overview
The image displays a code snippet written in the Rust programming language. It defines a function `create_sequence` that generates a vector of integers from 0 to n-1, wrapped in an `Option` type for safe error handling. The snippet also includes an example of how to call this function and handle its result using a `match` expression. The code is presented with syntax highlighting on a light beige background.
### Components/Axes
* **Language:** Rust
* **Primary Function:** `create_sequence(n: i32) -> Option<Vec<i32>>`
* **Key Syntax Elements:**
* Function definition (`fn`)
* Conditional check (`if n <= 0`)
* Vector initialization (`Vec::with_capacity`)
* Loop (`for i in 0..n`)
* Option enum (`Some`, `None`)
* Pattern matching (`match`)
* **Syntax Highlighting Colors (Approximate):**
* Keywords (`fn`, `let`, `mut`, `for`, `in`, `match`, `return`): Blue
* Types (`i32`, `Option`, `Vec`, `usize`): Teal/Green
* Function/Method names (`create_sequence`, `with_capacity`, `push`): Dark Blue/Black
* Variables (`n`, `arr`, `i`, `sequence`): Black
* Literals (`0`, `5`): Purple
* Operators (`->`, `<=`, `..`): Black
* Braces/Punctuation (`{`, `}`, `(`, `)`, `;`): Black
* Comment (`// Does not need to free the memory`): Gray
### Detailed Analysis / Content Details
**Transcription of Code:**
```rust
fn create_sequence(n: i32) -> Option<Vec<i32>> {
    if n <= 0 {
        return None;
    }
    let mut arr = Vec::with_capacity(n as usize);
    for i in 0..n {
        arr.push(i);
    }
    Some(arr)
}

match create_sequence(5) {
    Some(sequence) => {
        ... // Does not need to free the memory
    }
    None => {
        ...
    }
}
```
**Code Logic Flow:**
1. **Function Definition (`create_sequence`):**
* **Input:** Takes a single parameter `n` of type `i32` (32-bit signed integer).
* **Return Type:** `Option<Vec<i32>>`. This means it will return either `Some(vector)` containing a list of integers on success, or `None` on failure.
* **Guard Clause:** Checks if `n <= 0`. If true, it immediately returns `None`.
* **Vector Creation:** If `n > 0`, it creates a mutable vector `arr` with an initial capacity equal to `n` (cast to `usize` for memory sizing).
* **Population Loop:** Iterates from `i = 0` up to (but not including) `n`, pushing each value of `i` into the vector `arr`.
* **Successful Return:** Wraps the populated vector `arr` in `Some()` and returns it.
2. **Example Usage (`match` block):**
* Calls `create_sequence` with the argument `5`.
* Uses a `match` expression to handle the two possible outcomes of the `Option`:
* `Some(sequence)`: The success case. The generated vector is bound to the variable `sequence`. The comment `// Does not need to free the memory` indicates that Rust's ownership system will automatically deallocate the vector when it goes out of scope.
* `None`: The failure case (if `n` had been <= 0). The code block is represented by `...`, indicating omitted logic.
### Key Observations
* **Error Handling Pattern:** The function uses Rust's `Option` type for explicit, safe error handling instead of panicking or returning null pointers.
* **Memory Efficiency:** The vector is initialized with `Vec::with_capacity(n as usize)`, which pre-allocates memory to avoid reallocations during the loop, making the function efficient.
* **Ownership & Safety:** The comment in the `Some` branch highlights a core Rust principle: memory is managed automatically via ownership, eliminating the need for manual `free` or `delete` calls and preventing memory leaks.
* **Syntax Highlighting:** The color scheme is consistent with common IDE themes (e.g., similar to "Solarized Light" or "GitHub Light"), aiding readability by distinguishing language constructs.
### Interpretation
This code snippet is a pedagogical example demonstrating several fundamental Rust concepts:
1. **Type Safety & Expressiveness:** The function signature `-> Option<Vec<i32>>` clearly communicates both the potential failure mode (`None`) and the success data type (`Vec<i32>`), making the API's contract explicit at compile time.
2. **Idiomatic Control Flow:** It showcases the common Rust pattern of using a guard clause (`if n <= 0 { return None; }`) for early exit, followed by the main logic.
3. **Resource Management:** The example implicitly teaches Rust's ownership model. The vector `arr` is owned by the function and moved into the `Some` variant upon return. The caller (`match` block) then owns the `sequence`, and the compiler ensures it is dropped (and its memory freed) automatically when `sequence` goes out of scope. This is the meaning behind the comment.
4. **Practical Utility:** While simple, the function is a useful utility for generating a sequence of numbers, a common task in programming. The use of `0..n` creates a half-open range, which is a Rust convention.
The snippet effectively serves as a mini-tutorial on writing safe, efficient, and idiomatic Rust code for sequence generation and result handling.
</details>
Figure 3: Example of a simple C program and its equivalent Rust program, both hand-written for illustration.
### A.2 Tabular Summary
Here, we present a non-exhaustive list of differences between C and Rust in Table 3, highlighting the key features that make translating code from C to Rust challenging. While the list is not comprehensive, it provides insights into the fundamental distinctions between the two languages, which can help developers understand the challenges of migrating C code to Rust.
| Feature | C | Rust |
| --- | --- | --- |
| Memory Management | Manual (through malloc/free) | Automatic (through ownership and borrowing) |
| Pointers | Raw pointers like *p | Safe references like &p/&mut p, Box, and Rc |
| Lifetime Management | Manual freeing of memory | Lifetime annotations and borrow checker |
| Error Handling | Error codes and manual checks | Explicit handling with Result and Option types |
| Null Safety | Null pointers allowed (e.g., NULL) | No null pointers; uses Option for nullable values |
| Concurrency | No built-in protections for data races | Enforces safe concurrency with ownership rules |
| Type Conversion | Implicit conversions allowed and common | Strongly typed; no implicit conversions |
| Standard Library | C standard library with direct system calls | Rust standard library with utilities for strings, collections, and I/O |
| Language Features | Procedure-oriented with minimal abstractions | Modern features like pattern matching, generics, and traits |
Table 3: Key Differences Between C and Rust
## Appendix B Preprocessing and Task Division
### B.1 Preprocessing of C Files
To support real-world C projects, SACTOR parses the compile commands generated by the make tool, extracting the flags relevant to preprocessing, parsing, compilation, linking, and the use of third-party tools.
C source files usually contain preprocessing directives, such as #include, #define, #ifdef, and #endif, which we need to resolve before parsing C files. For #include, we recursively copy and expand non-system headers while keeping #include directives for system headers intact: non-system headers contain project-specific definitions, such as structs and enums, that the LLM has not seen, whereas system headers' contents are already known to the LLM, and expanding them would introduce unnecessary noise. For other directives, we pass the relevant compile flags of the C project to GCC's preprocessor to resolve them.
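The non-system-header expansion described above can be sketched as follows. This is a simplified illustration (a hypothetical helper with no include guards or cycle detection), not SACTOR's actual preprocessor:

```rust
use std::fs;
use std::path::Path;

/// Recursively inline `#include "..."` (non-system headers) while leaving
/// `#include <...>` (system headers) untouched. `dir` is the directory used
/// to resolve relative header paths.
fn expand_includes(source: &str, dir: &Path) -> String {
    let mut out = String::new();
    for line in source.lines() {
        let trimmed = line.trim_start();
        if let Some(rest) = trimmed.strip_prefix("#include") {
            let rest = rest.trim_start();
            // Only quoted includes are project-local; `<...>` stays intact.
            if let Some(name) = rest.strip_prefix('"').and_then(|r| r.split('"').next()) {
                let path = dir.join(name);
                if let Ok(header) = fs::read_to_string(&path) {
                    // Headers may themselves include other project headers.
                    out.push_str(&expand_includes(&header, dir));
                    continue;
                }
            }
        }
        out.push_str(line);
        out.push('\n');
    }
    out
}
```

A real preprocessor would additionally honor include search paths (`-I` flags) and include guards; this sketch only shows the system/non-system distinction.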
### B.2 Algorithm for Task Division
The task division algorithm is used to determine the order in which the items should be translated. The algorithm is shown in Algorithm 1.
Algorithm 1 Translation Task Order Determination
1: $L_i$: List of items to be translated
2: $dep(a)$: Function to get dependencies of item $a$
3: $L_{sorted}$: List of groups resolving dependencies
4: $L_{sorted} \leftarrow \emptyset$ $\triangleright$ Empty list
5: while not all items of $L_i$ are in $L_{sorted}$ do
6: $L_{processed} \leftarrow \emptyset$
7: for $a \in L_i$ do
8: if $a \notin L_{sorted}$ and $dep(a) \subseteq L_{sorted}$ then
9: $L_{sorted} \leftarrow L_{sorted} + a$ $\triangleright$ Add to sorted list
10: $L_{processed} \leftarrow L_{processed} \cup \{a\}$
11: end if
12: end for
13: if $L_{processed} = \emptyset$ then
14: $L_{circular} \leftarrow DFS(L_i, dep)$ $\triangleright$ Circular dependencies
15: $L_{sorted} \leftarrow L_{sorted} + L_{circular}$ $\triangleright$ Add a group to sorted list
16: end if
17: end while
18: return $L_{sorted}$
In the algorithm, $L_i$ is the list of items to be translated, and $dep(a)$ is a function that returns the dependencies of item $a$. The algorithm returns a list $L_{sorted}$ that contains the items in the order in which they should be translated. $DFS(L_i, dep)$ is a depth-first search function that returns a list of items involved in a circular dependency. The algorithm begins by collecting all items (e.g., functions, structs) to be translated and their respective dependencies (on both functions and data types). Items with no unresolved dependencies are pushed into the translation order list first and are then removed from the dependency lists of the remaining items. This process continues until all items are pushed into the list or a circular dependency is detected. If circular dependencies are detected, we resolve them through a depth-first search strategy, ensuring that all items involved in a circular dependency are grouped together and handled as a single unit.
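The ordering scheme can be sketched in Rust as follows. This is illustrative only: the fallback branch here lumps all remaining items into one group, whereas SACTOR's DFS extracts the actual cycle:

```rust
use std::collections::{HashMap, HashSet};

/// Determine a translation order: items whose dependencies are already
/// translated are emitted first (each as its own group); when no item is
/// ready, the remaining items are mutually dependent and are emitted as a
/// single group, mirroring Algorithm 1's circular-dependency handling.
fn translation_order(deps: &HashMap<&str, Vec<&str>>) -> Vec<Vec<String>> {
    let mut sorted: Vec<Vec<String>> = Vec::new();
    let mut done: HashSet<&str> = HashSet::new();
    while done.len() < deps.len() {
        // Collect items whose dependencies have all been translated.
        let mut ready: Vec<&str> = Vec::new();
        for (&item, ds) in deps {
            if !done.contains(item) && ds.iter().all(|d| done.contains(d)) {
                ready.push(item);
            }
        }
        if ready.is_empty() {
            // No progress: remaining items form a circular dependency.
            let remaining: Vec<&str> =
                deps.keys().copied().filter(|k| !done.contains(k)).collect();
            sorted.push(remaining.iter().map(|s| s.to_string()).collect());
            done.extend(remaining);
        } else {
            for item in ready {
                sorted.push(vec![item.to_string()]);
                done.insert(item);
            }
        }
    }
    sorted
}
```

For the atoi example of Appendix E, `atoi` has no dependencies and is emitted first, followed by `main`.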
## Appendix C Equivalence Testing Details in Prior Literature
### C.1 Symbolic Execution-Based Equivalence
Symbolic execution explores all potential execution paths of a program by using symbolic inputs to generate constraints [king1976symbolic, baldoni2018survey, coward1988symbolic]. While theoretically powerful, this method is impractical for verifying C-to-Rust equivalence due to differences in language features. For instance, Rustβs RAII (Resource Acquisition Is Initialization) pattern automatically inserts destructors for memory management, while C relies on explicit malloc and free calls. These differences cause mismatches in compiled code, making it difficult for symbolic execution engines to prove equivalence. Additionally, Rustβs compiler adds safety checks (e.g., array boundary checks), which further complicate equivalence verification.
### C.2 Fuzz Testing-Based Equivalence
Fuzz testing generates random or mutated inputs to test whether program outputs match expected results [zhu2022fuzzing, miller1990empirical, liang2018fuzzing]. While more practical than symbolic execution, fuzz testing faces challenges in constructing meaningful inputs for real-world programs. For example, testing a URL parsing function requires generating valid URLs with specific formats, which is non-trivial. For large C programs, this difficulty scales, making it infeasible to produce high-quality test cases for every translated Rust function.
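The core of such a differential check is simple to sketch; the hard part, as noted above, is generating meaningful inputs. Below, `reference` and `candidate` are hypothetical stand-ins for the original C function (called via FFI) and its Rust translation:

```rust
/// Differential-testing sketch: run the same inputs through a reference
/// implementation and a candidate translation, returning the inputs on
/// which their outputs diverge.
fn differential_test<F, G>(reference: F, candidate: G, inputs: &[String]) -> Vec<String>
where
    F: Fn(&str) -> i32,
    G: Fn(&str) -> i32,
{
    inputs
        .iter()
        .filter(|s| reference(s.as_str()) != candidate(s.as_str()))
        .cloned()
        .collect()
}
```

The scheme only reports a divergence when a generated input actually exercises the differing behavior, which is why input quality, not the comparison itself, is the bottleneck.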
## Appendix D An Example of the Test Harness
Here, we provide an example of the test harness used to verify the correctness of the translated code in Figure 4, which verifies the idiomatic Rust code. In this example, the concat_str_idiomatic function is the idiomatic translation under test, while the concat_str function is the test harness that can be linked back to the original C code: a string and an integer are passed as input, and an owned string is returned. Input strings are converted from C's char* to Rust's &str, and output strings are converted from Rust's String back to C's char*.
<details>
<summary>x7.png Details</summary>

### Visual Description
## Code Snippet: Rust String Concatenation Functions
### Overview
The image displays two Rust functions for string concatenation. The first is an idiomatic Rust implementation, while the second is a C-compatible wrapper function designed to work with C-style strings and raw pointers. The code is presented with syntax highlighting on a light background.
### Components/Axes
The image contains two distinct function definitions with associated comments and syntax elements.
**Function 1: `concat_str_idiomatic`**
* **Signature:** `fn concat_str_idiomatic(orig: &str, num: i32) -> String`
* **Parameters:**
* `orig`: A string slice (`&str`).
* `num`: A 32-bit signed integer (`i32`).
* **Return Type:** An owned `String`.
* **Body:** A single line using the `format!` macro: `format!("{}{}", orig, num)`.
**Function 2: `concat_str`**
* **Signature:** `fn concat_str(orig: *const c_char, num: c_int) -> *const c_char`
* **Parameters:**
* `orig`: A raw pointer to a C-style string (`*const c_char`).
* `num`: A C-style integer (`c_int`).
* **Return Type:** A raw pointer to a C-style string (`*const c_char`).
* **Body:** Contains several steps with explanatory comments:
1. **Comment:** `// convert input`
2. **Code:** `let orig_str = CStr::from_ptr(orig).to_str().expect("Invalid UTF-8 string");`
3. **Comment:** `// call target function`
4. **Code:** `let out = concat_str_idiomatic(orig_str, num as i32);`
5. **Comment:** `// convert output`
6. **Code:** `let out_str = CString::new(out).unwrap();`
7. **Comment:** `// \`into_raw\` transfers ownership to the caller`
8. **Code:** `out_str.into_raw()`
### Detailed Analysis
The code demonstrates a common pattern for creating a Rust-friendly API that interfaces with C.
1. **Idiomatic Core (`concat_str_idiomatic`):** This function contains the core logic using safe, idiomatic Rust types (`&str`, `i32`, `String`). It performs the concatenation simply via string formatting.
2. **C-Compatible Wrapper (`concat_str`):** This function acts as a bridge.
* **Input Conversion:** It takes a raw C string pointer (`*const c_char`). It uses `CStr::from_ptr` to create a borrowed C string wrapper, then `.to_str()` to convert it to a Rust `&str`, with `.expect()` handling potential UTF-8 conversion errors.
* **Delegation:** It casts the `c_int` to an `i32` and calls the idiomatic `concat_str_idiomatic` function.
* **Output Conversion:** The resulting Rust `String` (`out`) is converted into a C-compatible, null-terminated `CString` using `CString::new()`. The `.unwrap()` here would panic if the string contained an interior null byte.
* **Ownership Transfer:** The critical step is `out_str.into_raw()`. This consumes the `CString`, deallocates its Rust-managed memory, and returns a raw pointer (`*const c_char`) to the caller. The comment explicitly notes this transfers ownership, meaning the caller (presumably C code) becomes responsible for freeing this memory later using the appropriate C function (e.g., `free`).
### Key Observations
* **Safety Boundary:** The wrapper function (`concat_str`) is the point where safe Rust code interacts with the unsafe, raw world of C pointers. The `unsafe` block is not explicitly shown but is implied by the use of `CStr::from_ptr` and `into_raw`, which are `unsafe` operations.
* **Error Handling:** The input conversion uses `.expect()`, which will panic on invalid UTF-8. The output conversion uses `.unwrap()`, which will panic on interior nulls. This is a simple error strategy; a production API might return a result code or use out-parameters for errors.
* **Memory Management:** The pattern of `into_raw()` is standard for returning owned data to C. It is the counterpart to `from_raw()` which would be used to take ownership back from C.
### Interpretation
This code snippet is a technical illustration of **Foreign Function Interface (FFI) bridging** in Rust. It shows the necessary "glue code" required to expose a safe, high-level Rust function to a C codebase.
The primary relationship is one of **delegation and translation**. The `concat_str` function does not contain business logic; its sole purpose is to translate types across the language boundary (`c_char*` to `&str`, `c_int` to `i32`, `String` to `c_char*`) and manage the associated memory ownership semantics.
The notable pattern is the **ownership hand-off**. The Rust function creates a `String`, converts it to a `CString`, and then deliberately leaks the Rust-owned memory by turning it into a raw pointer. This is not a bug but a deliberate protocol, placing the burden of deallocation on the C caller. This pattern is fundamental to creating interoperable libraries but requires careful documentation to prevent memory leaks on the C side. The code serves as a concise template for this common systems programming task.
</details>
Figure 4: Test harness used for verifying concat_str translation
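The harness in Figure 4 can be reconstructed as the following sketch using std's FFI types. Here `concat_str_idiomatic` is implemented locally for illustration, whereas in SACTOR the harness links against the original C code:

```rust
use std::ffi::{CStr, CString};
use std::os::raw::{c_char, c_int};

/// Idiomatic translation under test: concatenate a string and an integer.
fn concat_str_idiomatic(orig: &str, num: i32) -> String {
    format!("{}{}", orig, num)
}

/// Test-harness wrapper with a C-compatible signature.
///
/// # Safety
/// `orig` must point to a valid, NUL-terminated C string.
pub unsafe fn concat_str(orig: *const c_char, num: c_int) -> *const c_char {
    // convert input: C string -> &str
    let orig_str = CStr::from_ptr(orig).to_str().expect("Invalid UTF-8 string");
    // call target function
    let out = concat_str_idiomatic(orig_str, num as i32);
    // convert output: String -> NUL-terminated CString;
    // `into_raw` transfers ownership of the buffer to the caller
    CString::new(out).unwrap().into_raw()
}
```

Because `into_raw` hands ownership across the FFI boundary, the caller is responsible for freeing the buffer (by passing the pointer back to `CString::from_raw` on the Rust side).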
## Appendix E An Example of SACTOR Translation Process
To demonstrate the translation process of SACTOR, we present a straightforward example of translating a C function to Rust. The C program includes an atoi function that converts a string to an integer, and a main function that parses command-line arguments and calls the atoi function. The C code is shown in Figure 5(a).
<details>
<summary>x8.png Details</summary>

### Visual Description
## Code Snippet: Custom `atoi` Implementation in C
### Overview
The image displays a complete C program containing a custom implementation of the `atoi` (ASCII to integer) function and a `main` function to demonstrate its usage. The code is presented in a monospaced font with syntax highlighting on a light gray background. The language is **C**.
### Components/Axes
The code is structured into two primary components:
1. **`atoi` Function:** A function that converts a string to an integer.
2. **`main` Function:** A driver program that accepts a command-line argument, passes it to `atoi`, and prints the result.
### Detailed Analysis / Content Details
Below is the precise transcription of the code text.
```c
#include <stdio.h>
int atoi(char *str) {
    int result = 0;
    int sign = 1;
    while (*str == ' ' || *str == '\t' || *str == '\n' ||
           *str == '\r' || *str == '\v' || *str == '\f') {
        str++;
    }
    if (*str == '+' || *str == '-') {
        if (*str == '-') {
            sign = -1;
        }
        str++;
    }
    while (*str >= '0' && *str <= '9') {
        result = result * 10 + (*str - '0');
        str++;
    }
    return sign * result;
}

int main(int argc, char *argv[]) {
    if (argc != 2) {
        printf("Usage: %s <number>\n", argv[0]);
        return 1;
    }
    int value = atoi(argv[1]);
    printf("Parsed integer: %d\n", value);
    return 0;
}
```
**Code Logic Breakdown:**
1. **Header Inclusion:** `#include <stdio.h>` - Includes the standard input/output library for `printf`.
2. **`atoi` Function:**
* Initializes `result` to 0 and `sign` to 1 (positive).
* **Whitespace Skipping Loop:** Advances the string pointer `str` past any leading whitespace characters (space, tab, newline, carriage return, vertical tab, form feed).
* **Sign Handling:** Checks for an optional leading '+' or '-' sign. If a '-' is found, `sign` is set to -1. The pointer is then advanced.
* **Digit Conversion Loop:** Iterates through consecutive digit characters ('0'-'9'). For each digit, it updates `result` using the formula: `result = result * 10 + (current_digit_char - '0')`. This effectively builds the integer value.
* **Return:** Returns the final integer value multiplied by the determined sign.
3. **`main` Function:**
* **Argument Check:** Verifies that exactly one command-line argument (`argc == 2`) is provided. If not, it prints a usage message and returns 1 (error).
* **Conversion & Output:** Calls `atoi` with the provided argument (`argv[1]`), stores the result in `value`, and prints it using `printf`.
### Key Observations
* **Error Handling:** The `main` function checks for the correct number of arguments. However, the `atoi` function itself does not perform robust error checking. It will stop conversion at the first non-digit character (after the sign) and return the value parsed up to that point. It does not distinguish between a valid "0" and an invalid string with no digits.
* **Whitespace Handling:** The implementation correctly handles a comprehensive set of C whitespace characters.
* **Return Type:** The function returns an `int`. There is no protection against integer overflow if the parsed number exceeds the range of `int`.
* **Syntax Highlighting:** The code uses color to distinguish elements: blue for keywords (`int`, `while`, `if`, `return`), green for string literals, and brown/orange for numeric literals and character constants.
### Interpretation
This code provides a foundational, from-scratch implementation of a common C library function. It demonstrates core programming concepts: string manipulation via pointers, character arithmetic (converting a digit character to its numeric value by subtracting `'0'`), conditional logic, and basic command-line interface interaction.
The implementation is pedagogically clear but lacks the robustness of a production-grade `atoi`. A production version would typically include:
1. **Overflow/Underflow Checking:** To handle numbers outside the `INT_MIN`/`INT_MAX` range.
2. **Error Reporting:** A way to indicate if no conversion was performed (e.g., returning 0 and setting an error flag, or using a different return pattern).
3. **Locale Awareness:** The standard `atoi` behavior is affected by the program's locale; this custom version is not.
The `main` function serves as a simple test harness, making the code immediately runnable and verifiable from the command line. The overall purpose is likely educationalβto illustrate how string-to-integer conversion works at a low level.
</details>
(a) C implementation of atoi
<details>
<summary>x9.png Details</summary>

### Visual Description
## Code Screenshot: Rust Implementation of `atoi` Function
### Overview
The image displays a screenshot of Rust source code implementing a custom `atoi` (ASCII to integer) function and a `main` function to demonstrate its usage. The code is presented in a text editor or IDE with syntax highlighting. The primary language is **Rust**.
### Components/Axes
This is not a chart or diagram with axes. The components are the code's syntactic and logical elements:
- **Import Statements**: Lines 1-4 (`use libc::c_char;`, `use std::env;`, `use std::ffi::CString;`, `use std::process;`).
- **Function `atoi`**: An `unsafe` public function (lines 5-40) that converts a C-style string (`*const c_char`) to a 32-bit signed integer (`i32`).
- **Function `main`**: The program's entry point (lines 42-60).
- **Variables & Logic**: Includes mutable variables (`result`, `sign`, `ptr`), loops, conditional checks, and arithmetic operations.
- **Error Handling**: Uses `match` for `CString` creation and `process::exit` for fatal errors.
### Detailed Analysis
**Full Code Transcription:**
```rust
use libc::c_char;
use std::env;
use std::ffi::CString;
use std::process;
pub unsafe fn atoi(str: *const c_char) -> i32 {
    let mut result: i32 = 0;
    let mut sign: i32 = 1;
    let mut ptr = str;
    while *ptr == ' ' as c_char
        || *ptr == '\t' as c_char
        || *ptr == '\n' as c_char
        || *ptr == '\r' as c_char
        || *ptr == '\x0B' as c_char
        || *ptr == '\x0C' as c_char
    {
        ptr = ptr.add(1);
    }
    if *ptr == '+' as c_char || *ptr == '-' as c_char {
        if *ptr == '-' as c_char {
            sign = -1;
        }
        ptr = ptr.add(1);
    }
    while *ptr >= '0' as c_char && *ptr <= '9' as c_char {
        let digit = (*ptr - '0' as c_char) as i32;
        if let Some(new_result) = result.checked_mul(10).and_then(
            |r| r.checked_add(digit),
        ) {
            result = new_result;
        } else {
            return if sign == 1 { i32::MAX } else { i32::MIN };
        }
        ptr = ptr.add(1);
    }
    sign * result
}

pub fn main() {
    let args: Vec<String> = env::args().collect();
    if args.len() != 2 {
        println!("Usage: {} <number>", args[0]);
        process::exit(1);
    }
    let c_str = match CString::new(args[1].as_str()) {
        Ok(cstring) => cstring,
        Err(_) => {
            eprintln!("Failed to create CString from input");
            process::exit(1);
        }
    };
    let value = unsafe { atoi(c_str.as_ptr() as *const c_char) };
    println!("Parsed integer: {}", value);
}
```
**Logical Flow of `atoi` Function:**
1. **Initialization**: Sets `result` to 0, `sign` to 1, and a pointer `ptr` to the input string.
2. **Whitespace Skipping**: A `while` loop advances `ptr` past any leading whitespace characters (space, tab, newline, carriage return, vertical tab, form feed).
3. **Sign Handling**: Checks for an optional leading '+' or '-'. If '-', sets `sign` to -1. Advances the pointer past the sign character.
4. **Digit Conversion**: A `while` loop processes consecutive digit characters ('0'-'9').
* Converts the character to its integer value (`digit`).
* Uses `checked_mul(10)` and `checked_add(digit)` to safely accumulate the result, preventing overflow.
* If an overflow occurs (the `checked_*` methods return `None`), it immediately returns `i32::MAX` for a positive sign or `i32::MIN` for a negative sign.
* Otherwise, updates `result` and advances the pointer.
5. **Return**: Returns the final `result` multiplied by the `sign`.
**Logical Flow of `main` Function:**
1. **Argument Collection**: Collects command-line arguments into a `Vec<String>`.
2. **Argument Check**: Verifies exactly one argument is provided (plus the program name). Prints a usage message and exits with code 1 if not.
3. **CString Conversion**: Attempts to convert the input argument to a null-terminated `CString`. Prints an error to stderr and exits on failure.
4. **Function Call**: Unsafely calls `atoi` with a raw pointer to the `CString`'s content.
5. **Output**: Prints the parsed integer value.
### Key Observations
1. **Safety**: The `atoi` function is marked `unsafe` because it dereferences a raw pointer (`*const c_char`). The `main` function encapsulates this unsafety in a single, controlled call.
2. **Robustness**: The implementation includes explicit checks for integer overflow using Rust's safe checked arithmetic methods (`checked_mul`, `checked_add`), returning the maximum or minimum `i32` value on overflow, mimicking common C library behavior.
3. **Whitespace Handling**: It correctly handles a comprehensive set of ASCII whitespace characters, not just spaces.
4. **Error Handling in `main`**: The `main` function provides basic user feedback for incorrect usage and internal errors (CString creation failure).
5. **Syntax Highlighting**: The code uses color to distinguish elements: purple for keywords (`use`, `pub`, `unsafe`, `fn`, `let`, `mut`, `while`, `if`, `return`, `match`), orange for types (`i32`, `String`, `Vec`), green for string literals, and blue for function/variable names.
### Interpretation
This code is a faithful Rust reimplementation of the classic C library function `atoi`. Its primary purpose is educational or for FFI (Foreign Function Interface) contexts where a Rust program needs to parse C-style strings.
* **What it demonstrates**: It showcases how to work with raw pointers and C-style strings in Rust (`unsafe` blocks, `*const c_char`), while still leveraging Rust's safety features where possible (checked arithmetic, `Result`/`Option` for error handling in `main`). It also illustrates command-line argument processing.
* **Relationship between components**: The `atoi` function is a pure, low-level conversion routine. The `main` function acts as a driver, handling the higher-level concerns of program execution, argument validation, and error reporting before passing the cleaned data to the unsafe core function.
* **Notable patterns/anomalies**:
* The overflow behavior (returning `i32::MAX`/`MIN`) is a specific design choice that matches some, but not all, C library implementations. Others might have undefined behavior on overflow.
* The function stops parsing at the first non-digit character after the sign, which is standard `atoi` behavior. It does not report errors for non-numeric trailing characters (e.g., `"123abc"` would parse as `123`).
* The code is a complete, compilable Rust program, not just a function snippet.
</details>
(b) Unidiomatic Rust translation from C
<details>
<summary>x10.png Details</summary>

### Visual Description
## Code Snippet: Rust `atoi` Implementation
### Overview
The image displays a complete Rust source code file containing two public functions: `atoi` (ASCII to integer) and `main`. The code implements a custom string-to-`i32` integer parser with robust error handling for overflow and invalid input, designed to be used as a command-line utility.
### Components/Axes
The code is structured into two primary components:
1. **`atoi` function**: The core parsing logic.
2. **`main` function**: The command-line interface that utilizes `atoi`.
**Imports (Lines 1-2):**
* `use std::env;`
* `use std::process;`
### Detailed Analysis
#### **`atoi` Function (Lines 3-44)**
* **Signature**: `pub fn atoi(input: &str) -> i32`
* **Purpose**: Converts a string slice (`&str`) into a 32-bit signed integer (`i32`).
* **Logic Flow**:
1. **Initialization (Lines 4-6)**: Initializes `result` to 0, `sign` to 1, and creates a peekable character iterator from the input string.
2. **Whitespace Skipping (Lines 7-13)**: A `while` loop consumes and discards leading whitespace characters.
3. **Sign Detection (Lines 14-22)**: An `if let` block checks the next character for a '+' or '-' sign. If a '-' is found, `sign` is set to -1. The sign character is consumed.
4. **Digit Parsing & Accumulation (Lines 23-39)**: A `for` loop iterates over the remaining characters.
* For each character, it attempts to convert it to a digit (base 10) using `c.to_digit(10)`.
* If successful, it performs a checked multiplication of the current `result` by 10, followed by a checked addition of the new digit. This prevents integer overflow.
* **Overflow Handling (Lines 30-32)**: If either the multiplication or addition operation would overflow, the function immediately returns `i32::MAX` if the sign was positive, or `i32::MIN` if the sign was negative.
* If the character is not a digit, the loop breaks (Line 38).
5. **Return (Line 41)**: The final result is computed as `sign * result`.
#### **`main` Function (Lines 43-53)**
* **Signature**: `pub fn main()`
* **Purpose**: Serves as the program entry point, handling command-line arguments.
* **Logic Flow**:
1. **Argument Collection (Line 44)**: Collects all command-line arguments into a `Vec<String>`.
2. **Argument Check (Lines 45-49)**: Checks if exactly two arguments are present (the program name and one input number). If not, it prints a usage message (`"Usage: {} <number>"`) and exits with status code 1.
3. **Parsing and Output (Lines 50-52)**: Takes the second argument (`args[1]`), passes it to the `atoi` function, and prints the result in the format `"Parsed integer: {}"`.
### Key Observations
1. **Robustness**: The code explicitly handles leading whitespace, optional signs, and non-digit characters by terminating parsing gracefully.
2. **Overflow Safety**: The use of `checked_mul` and `checked_add` with `and_then` is an idiomatic Rust pattern to safely handle arithmetic that could overflow, saturating at the maximum or minimum `i32` value (mirroring `strtol`-style clamping; the C standard leaves overflow in `atoi` undefined).
3. **Error Handling**: The `main` function provides basic user feedback for incorrect argument counts but does not propagate or handle parsing errors from `atoi` (e.g., an empty string or a string with only a sign would return 0).
4. **Code Style**: The code uses Rust's pattern matching (`if let`, `while let`) and iterator methods effectively. The indentation and brace style are consistent.
### Interpretation
This code snippet is a self-contained, educational implementation of a fundamental string parsing function. It demonstrates several important Rust concepts: ownership and borrowing (`&str`), iterators, pattern matching for control flow, and safe arithmetic to prevent undefined behavior from overflow.
The `atoi` function's behavior mirrors the classic C library function of the same name, including its specific overflow semantics (saturating at `i32::MAX`/`MIN`). This makes it predictable for developers familiar with the C standard. The `main` function wraps it into a simple CLI tool, though a production version would likely include more comprehensive error reporting (e.g., distinguishing between "no digits found" and "valid number parsed").
The primary utility of this code is as a clear example of how to implement a stateful parser in Rust with safety as a primary concern. It serves as a building block that could be extended to handle different integer types, bases (hexadecimal, octal), or more sophisticated error types.
</details>
(c) Idiomatic Rust translation from unidiomatic Rust
Figure 5: SACTOR translation process for atoi program
We assume that there are numerous end-to-end tests for the C code, allowing SACTOR to use them for verifying the correctness of the translated Rust code.
First, the divider splits the C code into two parts, the atoi function and the main function, and determines the translation order: atoi first, then main, since atoi is a dependency of main and is a pure function.
Next, SACTOR proceeds with the unidiomatic translation, converting both functions into unidiomatic Rust code. This generated code will keep the semantics of the original C code while using Rust syntax. Once the translation is complete, the unidiomatic verifier executes the end-to-end tests to ensure the correctness of the translated function. If the verifier passes all tests, SACTOR considers the unidiomatic translation accurate and progresses to the next function. If any test fails, SACTOR will retry the translation process using the feedback information collected from the verifier, as described in Β§ 4.3. After translating all sections of the C code, SACTOR will combine the unidiomatic Rust code segments to form the final unidiomatic Rust code. The unidiomatic Rust code is shown in Figure 5(b).
Then, the SACTOR will start the idiomatic translation process and translate the unidiomatic Rust code into idiomatic Rust code. The idiomatic translator requests the LLM to adapt the C semantics into idiomatic Rust, eliminating any unsafe and non-idiomatic constructs, as detailed in Β§ 4.2. Based on the same order, the SACTOR will translate two functions accordingly, and using the idiomatic verifier to verify and provide the feedback to the LLM if the verification fails. After all parts of the Rust code are translated into idiomatic Rust, verified, and combined, the SACTOR will produces the final idiomatic Rust code. The idiomatic Rust code is shown in Figure 5(c), representing the final output of SACTOR.
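To make the FFI-based verification concrete, the sketch below exposes a translated function through the C ABI so that the original C end-to-end tests could link against it. The wrapper name `c_atoi` and its body are our illustration, not SACTOR's generated code.

```rust
use std::ffi::CStr;
use std::os::raw::{c_char, c_int};

// A (simplified) translated Rust function; a real translation would
// follow the C semantics more closely.
fn atoi_rust(s: &str) -> i32 {
    s.trim().parse().unwrap_or(0)
}

// `#[no_mangle]` plus `extern "C"` give the symbol a stable C ABI, so the
// original end-to-end tests (still in C) can call it through the FFI.
#[no_mangle]
pub extern "C" fn c_atoi(s: *const c_char) -> c_int {
    // SAFETY: the C caller must pass a valid NUL-terminated string.
    let cstr = unsafe { CStr::from_ptr(s) };
    atoi_rust(cstr.to_str().unwrap_or(""))
}
```

This is the pattern that lets SACTOR mix translated Rust with not-yet-translated C during the function-wise incremental translation: the C side is unaware the callee has changed languages.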
## Appendix F Dataset Details
| Dataset | # Samples | Preprocessing | End-to-End Tests | Coverage |
| --- | --- | --- | --- | --- |
| TransCoder-IR [transcoderir] | 100 | Removed buggy programs (compilation/memory errors) and entries with existing Rust | Present | 97.97% / 99.5% |
| Project CodeNet [codenet] | 100 | Filtered for external-input programs (argc/argv); auto-generated tests | Generated | 94.37% / 100% |
| CRust-Bench [khatry2025crust] | 50 | Excluded unsupported patterns; combined each sample's code into a single lib.c | Present | 76.18% / 80.98% |
| libogg [libogg] | 1 | None; each component of the library is contained within a single C file | Present | 83.3% / 75.3% |
Table 4: Summary of datasets and real-world code-bases used for evaluation; coverage audited with gcov on the tests exercised in our pipeline.
### F.1 TransCoder-IR Dataset [transcoderir]
The TransCoder-IR dataset is used to evaluate the TransCoder-IR model and consists of solutions to coding challenges in various programming languages. For evaluation, we focus on the 698 C programs available in this dataset. First, we filter out programs that already have corresponding Rust code. We then remove buggy C programs by checking whether they compile, and use valgrind to identify and discard programs that exhibit memory errors during the end-to-end tests. Finally, we select the 100 programs with the most lines of code for our experiments.
### F.2 Project CodeNet [codenet]
Project CodeNet is a large-scale dataset for code understanding and translation, containing 14 million code samples in over 50 programming languages collected from online judge websites. From this dataset, which includes more than 750,000 C programs, we target only those that accept external input. Specifically, we filter programs using argc and argv, which process input from the command line. As the end-to-end tests are not available for this dataset, we develop the SACTOR test generator to automatically generate end-to-end tests for these programs based on the source code. For evaluation, we select 200 programs and refine the dataset to include 100 programs that successfully generate end-to-end tests.
### F.3 CRust-Bench [khatry2025crust]
CRust-Bench is a repository-level benchmark for C-to-safe-Rust transpilation. It collects 100 real-world C repositories (the CBench suite) and pairs each with a manually written, safe Rust interface and a set of tests that assert functional correctness. By evaluating full repositories rather than isolated functions, CRust-Bench surfaces challenges common in practice, such as complex, pointer-rich APIs. In our evaluation, we use a 50-sample subset of CRust-Bench, excluding entries that are out of scope for our pipeline (e.g., circular type or function dependencies and compiler-specific intrinsics that do not map cleanly). For each selected sample, we reuse the upstream end-to-end tests and relink them so that calls exercise our translated code; build environments and link flags follow the sample's configuration.
### F.4 libogg [libogg]
libogg is the reference implementation of the Ogg multimedia container. Ogg is a stream-oriented format that frames, timestamps, and multiplexes compressed media bitstreams (e.g., audio/video) into a robust, seekable stream. The libogg distribution contains only the Ogg container library (codecs such as Vorbis or Theora are hosted separately). In our case study, the codebase comprises roughly 2,041 lines of code (excluding tests), six struct definitions, three global variables, and 77 exported functions. We use the projectβs upstream tests and build scripts. This single-project evaluation complements the CRust-Bench subset by focusing on non-trivial structs, buffers, and pointer manipulation in a real-world C library.
## Appendix G LLM Configurations
Table 5 shows our configurations for different LLMs in evaluation. All other hyperparameters (e.g., Top-P, Top-K) use provider defaults. As GPT-5 does not support temperature setting, we use its default temperature.
| Model | Version | Temperature |
| --- | --- | --- |
| GPT-4o | gpt-4o-2024-08-06 | 0 |
| Claude 3.5 Sonnet | claude-3-5-sonnet-20241022 | 0 |
| Gemini 2.0 Flash | gemini-2.0-flash-exp | 0 |
| Llama 3.3 Instruct 70B | Llama 3.3 Instruct 70B¹ | 0 |
| DeepSeek-R1 | DeepSeek-R1 671B² | 0 |
| GPT-5 | gpt-5-2025-08-07 | default |
1. https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct
2. https://huggingface.co/deepseek-ai/DeepSeek-R1
Table 5: Configurations of Different LLMs in Evaluation
## Appendix H Failure Analysis in Evaluating SACTOR
(a) TransCoder-IR
| R1 | Memory safety violations in array operations due to improper bounds checking |
| --- | --- |
| R2 | Mismatched data type translations |
| R3 | Incorrect array sizing and memory layout translations |
| R4 | Incorrect string representation conversion between C and Rust |
| R5 | Failure to handle Cβs undefined behavior with Rustβs safety mechanisms |
| R6 | Use of C-specific functions in Rust without proper Rust wrappers |
(b) Project CodeNet
| S1 | Improper translation of command-line argument handling or attempt to fix wrong handling |
| --- | --- |
| S2 | Function naming mismatches between C and Rust |
| S3 | Format string directive mistranslation causing output inconsistencies |
| S4 | Original code contains random number generation |
| S5 | SACTOR unable to translate mutable global state variables |
| S6 | Mismatched data type translations |
| S7 | Incorrect control flow or loop boundary condition translations |
Table 6: Failure reason categories for translating TransCoder-IR and Project CodeNet datasets.
<details>
<summary>x12.png Details</summary>

### Visual Description
Grouped bar chart (one color-coded bar per model; no legend): number of failing files (0-25) per failure category R1-R6. The dominant feature is one model's R4 count at roughly 25 files; all other bars fall between 0 and 8.
</details>
(a) TransCoder-IR
<details>
<summary>x13.png Details</summary>

### Visual Description
Grouped bar chart (one color-coded bar per model; no legend): number of failing files (0-7) per failure category S1-S7. The tallest bar is roughly 7 files in S6; S2 and S4 show a uniform single file per model.
</details>
(b) Project CodeNet
Figure 6: Failure reasons across different LLM models for both datasets.
Here, we analyze the failure cases of SACTOR in translating C code to Rust from Section 6.1, as cases where SACTOR fails offer valuable insights into areas that require refinement. For each failure case in the two datasets, we determine the primary cause of translation failure by leveraging DeepSeek-R1 to identify potential reasons (prompts available in Appendix N.5), followed by manual verification to ensure correctness. We focus only on the translation from C to unidiomatic Rust because: (1) it is the most challenging step, and (2) it better reflects a model's ability to bridge the syntactic and semantic differences between the two languages. Table 6 summarizes the categories of failure reasons, and Figures 6(a) and 6(b) illustrate failure reasons (FRs) across models.
(1) TransCoder-IR (Table 6(a), Figure 6(a)): Based on the analysis, we observe that different models exhibit varying failure reasons. Claude 3.5 shows a particularly high incidence of string representation conversion errors (R4), with 25 out of 45 total failures in the unidiomatic translation step. In contrast, GPT-4o has only 1 out of 17 failures in this category. Llama 3.3 demonstrates consistent challenges with both R3 (incorrect array sizing and memory layout translations) and R6 (using C-specific functions without proper Rust wrappers), with 10 files in each category. GPT-4o shows a more balanced distribution of errors, with its highest count in R3. All models except GPT-4o struggle with string handling (R4) to varying degrees, suggesting this is one of the most challenging aspects of the translation process. For R6 (use of C-specific functions in Rust), which primarily manifests as a compilation failure, only Llama 3.3 and Gemini 2.0 fail to resolve the issue in some cases; all other models successfully handle these compilation errors through feedback and avoid failures in this category. DeepSeek-R1 has the fewest overall errors across categories, with failures only in R1 (1 file), R3 (2 files), and R4 (3 files), while completely avoiding errors in R2, R5, and R6.
(2) Project CodeNet (Table 6(b), Figure 6(b)): As with TransCoder-IR, different models exhibit varying failure reasons on Project CodeNet. Most notably, S6 (mismatched data type translations) presents a significant barrier for Llama 3.3 and Gemini 2.0 (7 files each), while GPT-4o and Claude 3.5 completely avoid this issue. Input argument handling (S1) and format string mistranslations (S3) emerge as common challenges across all models, suggesting fundamental difficulties in translating these language features regardless of model architecture. Only Llama 3.3 and DeepSeek-R1 encounter control flow translation failures (S7), with 2 files each. S4 (random number generation) and S5 (mutable global state variables) cannot be translated because the current SACTOR implementation does not support these features.
Compared to the results in TransCoder-IR, string representation conversion (R4 in TransCoder-IR, S3 in CodeNet) remains a consistent challenge across both datasets for all models, though the issue is significantly more severe in TransCoder-IR, particularly for Claude 3.5 (24 files). This also suggests that reasoning models like DeepSeek-R1 are better at handling complex code logic and string/array manipulation, as they exhibit fewer failures in these areas, demonstrating the potential of reasoning models to address complex translation tasks.
## Appendix I SACTOR Cost Analysis
| Model | Dataset | Avg. Token Count | Avg. LLM Queries |
| --- | --- | --- | --- |
| Claude 3.5 | TransCoder-IR | 4595.33 | 5.15 |
| Claude 3.5 | CodeNet | 3080.28 | 3.15 |
| Gemini 2.0 | TransCoder-IR | 3343.12 | 4.24 |
| Gemini 2.0 | CodeNet | 2209.38 | 2.39 |
| Llama 3.3 | TransCoder-IR | 4622.80 | 5.39 |
| Llama 3.3 | CodeNet | 4456.84 | 3.80 |
| GPT-4o | TransCoder-IR | 2651.21 | 4.24 |
| GPT-4o | CodeNet | 2565.36 | 2.95 |
| DeepSeek-R1 | TransCoder-IR | 17895.52 | 4.77 |
| DeepSeek-R1 | CodeNet | 13592.61 | 3.11 |
Table 7: Average cost comparison of different LLMs across the two datasets. (In the original figure, color intensity encodes the relative cost of each metric per dataset.)
Here, we conduct a cost analysis of SACTOR for the experiments in § 6.1 to evaluate the efficiency of different LLMs in generating idiomatic Rust code. We measure (1) Total LLM Queries, the number of LLM queries made during translation and verification for a single test case, and (2) Total Token Count, the total number of tokens processed by the LLM for a single test case. To ensure a fair comparison across models, we use the same tokenizer (tiktoken) and encoding (o200k_base).
To better understand costs, we analyze only programs that successfully produced idiomatic Rust code, excluding failed attempts (which always reach the maximum retry limit and do not contribute meaningfully to the cost analysis). We evaluate the combined cost of both translation phases to assess overall efficiency. Table 7 compares the average cost of different LLMs across the two datasets, measured in token usage and query count per successful idiomatic Rust translation, as mentioned in § 5.2.
Results: Gemini 2.0 and GPT-4o are the most efficient models, requiring the fewest tokens and queries. GPT-4o maintains a low token cost (2651.21 on TransCoder-IR, 2565.36 on CodeNet) with 4.24 and 2.95 average queries, respectively. Gemini 2.0 is similarly efficient, especially on CodeNet, with the lowest token usage (2209.38) and only 2.39 queries on average. Claude 3.5, despite its strong performance on CodeNet, incurs higher costs on TransCoder-IR (4595.33 tokens, 5.15 queries), likely due to additional translation steps. Llama 3.3 is the least efficient of the non-reasoning models (GPT-4o, Claude 3.5, Gemini 2.0), consuming the most tokens (4622.80 and 4456.84 on TransCoder-IR and CodeNet, respectively) and requiring the most queries (5.39 and 3.80, respectively), indicating significant resource demands.
As a reasoning model, DeepSeek-R1 consumes significantly more tokens (17,895.52 vs. 13,592.61) than non-reasoning modelsβ5-7 times higher than GPT-4oβdespite having a similar average query count (4.77 vs. 3.11) for generating idiomatic Rust code. This high token usage comes from the βreasoning processβ required before code generation.
## Appendix J Ablation Study on SACTOR Designs
This appendix reports additional ablations that evaluate key design choices in SACTOR. All experiments in this section use GPT-4o with the same configuration as Table 5.
### J.1 Feedback Mechanism
To evaluate the effectiveness of the feedback mechanism proposed in § 4.3, we conduct an ablation study that removes the mechanism and compares the model's performance with and without it. We consider two experimental groups: (1) with the feedback mechanism enabled, and (2) without it. In the latter setting, if any part of the translation fails, the system simply restarts the translation attempt from the original prompt, without providing any feedback from the failure.
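The two retry policies compared here can be sketched as follows; the types, names, and control flow are ours, not SACTOR's actual implementation.

```rust
// Hypothetical sketch of the ablated retry loop: with feedback, the
// verifier's error is carried into the next prompt; without it, each
// retry restarts from the original prompt.
enum Feedback {
    None,
    FromVerifier(String),
}

fn translate_with_retries(
    max_retries: usize,
    use_feedback: bool,
    mut attempt: impl FnMut(&Feedback) -> Result<String, String>,
) -> Option<String> {
    let mut fb = Feedback::None;
    for _ in 0..max_retries {
        match attempt(&fb) {
            Ok(code) => return Some(code),
            // Feedback enabled: forward the verifier's error.
            Err(err) if use_feedback => fb = Feedback::FromVerifier(err),
            // Feedback disabled: retry from scratch.
            Err(_) => fb = Feedback::None,
        }
    }
    None
}
```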
We use the same datasets and evaluation metrics described in § 5, and focus our evaluation on two models: GPT-4o and Llama 3.3 70B. We choose these models because GPT-4o achieved among the highest success rates and Llama 3.3 70B the lowest in our earlier experiments. By comparing the success rates between the two groups, we assess whether the feedback mechanism improves translation performance across models of different capabilities.
The results are shown in Figure 7.
<details>
<summary>x14.png Details</summary>

### Visual Description
Legend for Figure 7: patterned blue entries for "Unidiomatic SR 1" through "Unidiomatic SR 6" plus a solid green "Unidiomatic (-FBK)" entry, and patterned orange/red entries for "Idiomatic SR 1" through "Idiomatic SR 6" plus a solid red "Idiomatic (-FBK)" entry.
</details>
<details>
<summary>x15.png Details</summary>

### Visual Description
Grouped bar chart comparing success counts (out of 100 tasks) for Llama 3.3 70B and GPT-4o.
</details>
(a) TransCoder-IR With/Without Feedback
<details>
<summary>x16.png Details</summary>

[Figure 7(b): grouped bar chart of successful translations (out of 100 tasks) on Project CodeNet for Llama 3.3 70B and GPT-4o, with and without the feedback mechanism, in the unidiomatic and idiomatic settings.]
</details>
(b) CodeNet With/Without Feedback
Figure 7: Ablation study on the feedback mechanism. The success rates of the models with and without the feedback (marked as -FBK) mechanism are shown for both TransCoder-IR and CodeNet datasets.
(1) TransCoder-IR (Figure 7(a)): Incorporating the feedback mechanism increased the number of successful translations for Llama 3.3 70B from 57 to 76 in the unidiomatic setting and from 46 to 64 in the idiomatic setting. In contrast, GPT-4o performed slightly worse with feedback, decreasing from 87 to 84 (unidiomatic) and from 83 to 80 (idiomatic).
(2) Project CodeNet (Figure 7(b)): A similar trend is observed where Llama 3.3 70B improved from 62 to 83 (unidiomatic) and from 59 to 76 (idiomatic), corresponding to gains of 21 and 17 percentage points, respectively. GPT-4o, however, showed only marginal improvements: from 82 to 84 in the unidiomatic setting and from 77 to 79 in the idiomatic setting.
These results suggest that the feedback mechanism is particularly effective for lower-capability models like Llama 3.3 70B, substantially improving their translation success rates. In contrast, higher-capability models such as GPT-4o already perform near-optimally with simple random sampling, leaving little room for improvement.
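The ablated feedback loop has a simple shape: translate, check (compile plus E2E tests), and feed any error text into the next attempt. A minimal sketch (ours), where `try_translate` and `check` are hypothetical stand-ins for the LLM call and the verifier:

```rust
// Sketch of a compile/test feedback loop like the one ablated above.
// `try_translate` and `check` are hypothetical stand-ins for the LLM call
// and for compilation plus end-to-end testing.
pub fn repair_loop<F, C>(mut try_translate: F, check: C, budget: usize) -> Option<String>
where
    F: FnMut(Option<&str>) -> String,  // prior error feedback -> new candidate
    C: Fn(&str) -> Result<(), String>, // Ok if the candidate compiles and passes tests
{
    let mut feedback: Option<String> = None;
    for _ in 0..budget {
        let candidate = try_translate(feedback.as_deref());
        match check(&candidate) {
            Ok(()) => return Some(candidate),
            Err(err) => feedback = Some(err), // error text seeds the next attempt
        }
    }
    None // budget exhausted without a passing translation
}

fn main() {
    // Toy "model": produces a passing candidate only after seeing feedback once.
    let mut attempts = 0;
    let result = repair_loop(
        |fb| {
            attempts += 1;
            if fb.is_some() { "fn main() {}".to_string() } else { "int main(void)".to_string() }
        },
        |c| if c.contains("fn main") { Ok(()) } else { Err("not valid Rust".to_string()) },
        6,
    );
    assert_eq!(result.as_deref(), Some("fn main() {}"));
    assert_eq!(attempts, 2); // one failing attempt, one repaired attempt
    println!("ok");
}
```

Without feedback (the -FBK condition), the same loop would call `try_translate(None)` on every iteration, i.e., independent resampling.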
### J.2 Plain LLM Translation vs. SACTOR
We compare SACTOR against a trivial baseline where GPT-4o directly translates each CRust-Bench sample from C to Rust in a single step. We reuse the same end-to-end (E2E) test harness as SACTOR, and give the trivial baseline more budget: up to 10 repair attempts with compiler/test feedback (vs. 6 attempts in SACTOR). We study two prompts: (i) a minimal one ("translate the following C code to Rust"); and (ii) an interface-preserving one that explicitly asks the model to preserve pointer arithmetic, memory layout, and integer type semantics (thereby encouraging unsafe). We report function success as the fraction of functions whose Rust translation passes all tests, and sample success as the fraction of samples where all translated functions pass.
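The two metrics can be stated precisely; a minimal sketch (ours), over hypothetical per-sample lists of per-function E2E pass/fail results:

```rust
// Sketch of the two success metrics defined above, over hypothetical data.
pub fn function_success(results: &[Vec<bool>]) -> f64 {
    // fraction of all translated functions (pooled across samples) that pass
    let total: usize = results.iter().map(|s| s.len()).sum();
    let passed: usize = results.iter().flatten().filter(|&&ok| ok).count();
    passed as f64 / total as f64
}

pub fn sample_success(results: &[Vec<bool>]) -> f64 {
    // fraction of samples in which every translated function passes
    let passing = results.iter().filter(|s| s.iter().all(|&ok| ok)).count();
    passing as f64 / results.len() as f64
}

fn main() {
    // 3 samples, 7 functions in total; one function in sample 2 fails.
    let results = vec![vec![true, true], vec![true, false], vec![true, true, true]];
    assert!((function_success(&results) - 6.0 / 7.0).abs() < 1e-12);
    assert!((sample_success(&results) - 2.0 / 3.0).abs() < 1e-12);
    println!("ok");
}
```

Note the asymmetry: a single failing function leaves most of a sample's function-level credit intact but zeroes out its sample-level credit, which is why the two columns in Table 8 can diverge.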
| Method | Max attempts | Function success | Sample success | Avg. Clippy alerts/function |
| --- | --- | --- | --- | --- |
| SACTOR unidiomatic | 6 | 788/966 (81.57%) | 32/50 (64.00%) | 2.96 |
| SACTOR idiomatic † | 6 | 249/580 (42.93%) | 8/32 (25.00%) | 0.28 |
| Trivial (1-step) | 10 | 77/966 (7.97%) | 12/50 (24.00%) | 1.60 |
| Trivial (1-step, encourage unsafe) | 10 | 207/966 (21.43%) | 20/50 (40.00%) | 1.90 |
Table 8: Plain LLM translation vs. SACTOR on CRust-Bench (GPT-4o). The trivial baselines directly translate each sample in one step with up to 10 repair attempts. † The idiomatic stage is evaluated only on samples whose unidiomatic stage fully translated all functions.
Results on CRust-Bench. Even with 10 attempts and an "encourage unsafe" prompt, the trivial baseline reaches only 21.43% function success and 40.00% sample success. Its sample-level performance exceeds SACTOR's idiomatic stage (40.00% vs. 25.00%) because preserving C-style pointer logic in unsafe Rust is substantially easier than performing an idiomatic rewrite. However, SACTOR achieves much higher function-level correctness and produces significantly more idiomatic code (e.g., 0.28 vs. 1.90 average Clippy alerts per function).
Results on libogg. Under the same E2E tests and attempt budget as SACTOR, both trivial prompts fail to produce any test-passing translations, whereas SACTOR achieves 100% unidiomatic and 53% idiomatic success with GPT-4o (Table 2). This indicates that plain one-shot translation collapses on pointer-heavy libraries, while SACTOR remains effective.
### J.3 Effect of Crown in the Idiomatic Stage
We ablate Crown's contribution to idiomatic translation (§ 4.2) on libogg, using the same setup as § 6.3 and keeping all other components unchanged. Table 9 reports idiomatic function success with and without Crown.
| Configuration | Idiomatic functions passed | Success rate | Relative change |
| --- | --- | --- | --- |
| SACTOR | 41 | 53% | n/a |
| SACTOR w/o Crown | 34 | 44% | −17% |
Table 9: Ablating Crown on libogg (GPT-4o).
Results and representative failure patterns. Turning off Crown reduces idiomatic success from 41 to 34 functions, and the failures follow consistent patterns. Two representative ones:
```rust
// Without Crown (shape lost):
pub struct OggPackBuffer { pub ptr: usize }
// With Crown (shape preserved):
pub struct OggPackBuffer { pub ptr: Vec<u8> }

// Without Crown (ownership misclassified as owned):
pub struct OggIovec { pub iov_base: Vec<u8> }
// With Crown (ownership made explicit):
pub struct OggIovec<'a> { pub iov_base: &'a [u8] }
```
Once a buffer pointer is collapsed into a scalar index, the harness cannot reconstruct a valid C-facing view of the struct, so pointer arithmetic and buffer access fail together. Similarly, if a non-owning pointer (e.g., unsigned char *iov_base) is misclassified as owned storage (Vec<u8>), Rust ends up "owning" memory that C actually controls, making safe round-tripping infeasible without inventing allocation/free rules that do not exist.
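To make the round-tripping point concrete, here is a minimal sketch (ours; `CIovec` and its fields are illustrative, not SACTOR's actual harness code) of how the Crown-guided borrowed struct converts back to a C-facing (pointer, length) view without taking ownership:

```rust
// C-facing layout the E2E harness must be able to rebuild (illustrative).
#[repr(C)]
pub struct CIovec {
    pub iov_base: *const u8,
    pub iov_len: usize,
}

// Crown-guided translation: a non-owning slice whose lifetime is tied to
// the buffer that the C side continues to own.
pub struct OggIovec<'a> {
    pub iov_base: &'a [u8],
}

impl<'a> OggIovec<'a> {
    // Round-tripping is a pure reinterpretation: no allocation and no
    // ownership transfer, so C's allocation/free rules are untouched.
    pub fn as_c(&self) -> CIovec {
        CIovec {
            iov_base: self.iov_base.as_ptr(),
            iov_len: self.iov_base.len(),
        }
    }
}

fn main() {
    let backing = [1u8, 2, 3]; // stand-in for a buffer owned by the C side
    let view = OggIovec { iov_base: &backing };
    let c_view = view.as_c();
    assert_eq!(c_view.iov_len, 3);
    assert_eq!(unsafe { *c_view.iov_base }, 1); // same bytes, same memory
    println!("ok");
}
```

With the shape-collapsed `ptr: usize` variant no such `as_c` can be written, since there is no buffer to point at; with an owned `Vec<u8>`, building the C view would force a decision about who frees the memory, which is exactly the invented allocation/free rule described above.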
Interpretation. These failures do not indicate model weakness but an information-theoretic limitation: local C syntax does not encode pointer fatness or ownership. For a declaration such as char *iov_base, both Vec<u8> and &mut u8 are locally plausible. Even an idealized oracle model cannot uniquely infer the correct Rust type without global information about ownership and fatness. Crown supplies these semantics via whole-program static analysis; removing it makes idiomatic translation of pointer-heavy code underdetermined and explains the observed drop.
### J.4 Prompting about unsafe in Stage 1
We ablate the stage-1 (unidiomatic translation) prompt line that says "the model may use unsafe if needed." All experiments in this subsection are conducted on libogg, using exactly the same setup as in § 6.3.
#### J.4.1 Removing "may use unsafe if needed"
We compare the original stage-1 prompt with a variant that deletes this line, keeping everything else unchanged.
| Stage-1 prompt | Success rate | `unsafe` blocks | `unsafe fn`s | `not_unsafe_ptr_arg_deref` | Unsafe lines |
| --- | --- | --- | --- | --- | --- |
| Baseline (may use unsafe) | 100% | 108 | 76 | 1 | 8704/8705 (99.99%) |
| Remove "may use unsafe" | 100% | 224 | 37 | 146 | 8100/8219 (98.55%) |
Table 10: Removing explicit permission to use unsafe in stage 1 on libogg (GPT-4o).
Two observations follow. (1) Overall unsafety hardly changes: the unsafe fraction drops only from 99.99% to 98.55%. (2) The safety profile becomes worse: `clippy::not_unsafe_ptr_arg_deref` jumps from 1 to 146. That is, the model keeps APIs safe-looking but dereferences raw pointer arguments inside function bodies, pushing unsafety from explicit `unsafe fn` signatures into hidden dereferences inside safe-looking public functions.
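A minimal sketch (ours) of the two placements of unsafety: both functions compile, but Clippy's `not_unsafe_ptr_arg_deref` flags only the first, because its safe signature hides the validity obligation on `p`:

```rust
// Hidden unsafety: a safe-looking public API whose body dereferences the
// raw-pointer argument. This is the pattern `not_unsafe_ptr_arg_deref` flags.
pub fn read_first_hidden(p: *const u8) -> u8 {
    unsafe { *p } // caller must still guarantee `p` is valid, but nothing says so
}

// Explicit unsafety: the obligation is visible in the signature, so callers
// must write `unsafe { ... }` and acknowledge the contract.
pub unsafe fn read_first_explicit(p: *const u8) -> u8 {
    unsafe { *p }
}

fn main() {
    let buf = [7u8, 8, 9];
    assert_eq!(read_first_hidden(buf.as_ptr()), 7);
    assert_eq!(unsafe { read_first_explicit(buf.as_ptr()) }, 7);
    println!("ok");
}
```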
#### J.4.2 Replacing with "AVOID using unsafe"
We replace "may use unsafe if needed" with a stronger directive: "AVOID using unsafe whenever possible".
| Stage-1 prompt | Functions passed | Unidiomatic success | Relative change |
| --- | --- | --- | --- |
| Baseline ("may use unsafe") | 77/77 | 100% | n/a |
| Replace with "AVOID unsafe" | 66/77 | 85% | −15% |
Table 11: Discouraging unsafe in stage 1 harms unidiomatic success on libogg (GPT-4o).
Under "AVOID unsafe", the model often attempts premature "safe Rust" rewrites of pointer-heavy C code (changing buffer layouts, index arithmetic, and integer types), which increases logic and type errors and breaks translations. Together, these two prompt variants show that discouraging unsafe in stage 1 harms correctness and produces a worse safety profile, supporting our design choice: allow necessary unsafe in the syntactic first stage, then systematically remove it in the idiomatic refinement stage.
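One concrete instance of the integer-semantics breakage (our illustration, not taken from libogg): C unsigned arithmetic wraps by definition, so a faithful stage-1 port must opt into wrapping explicitly rather than "fix" the type:

```rust
// C guarantees `(uint32_t)0 - 1 == 4294967295`. A premature "safe" rewrite
// that changes the type, or relies on default arithmetic (which panics on
// overflow in debug builds), silently alters the program's logic;
// `wrapping_sub` preserves the C semantics.
fn main() {
    let x: u32 = 0;
    assert_eq!(x.wrapping_sub(1), u32::MAX);             // faithful to C
    assert_eq!(x.wrapping_sub(1) as i64, 4_294_967_295); // the value C computes
    println!("ok");
}
```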
## Appendix K SACTOR Performance with Different Temperatures
In § 6, all experiments are conducted with the temperature set to default values, as explained in Appendix G. To investigate how temperature affects the performance of SACTOR, we conduct additional experiments with different temperature settings (0.0, 0.5, 1.0) for GPT-4o on both TransCoder-IR and Project CodeNet datasets, as shown in Figure 8. Through preliminary experiments and discussions on OpenAI's community forum https://community.openai.com/t/cheat-sheet-mastering-temperature-and-top-p-in-chatgpt-api/172683, we find that setting the temperature above 1 tends to produce more random and less relevant outputs, which is unsuitable for our task.
<details>
<summary>x17.png Details</summary>

[Legend: hatch patterns identifying the Unidiomatic SR and Idiomatic SR series, numbered 1 through 6.]
</details>
<details>
<summary>x18.png Details</summary>

[Figure 8(a): grouped bar chart of success rate (%) on the TransCoder-IR dataset for GPT-4o at temperatures t=0, t=0.5, and t=1.]
</details>
(a) Success Rate on TransCoder-IR
<details>
<summary>x19.png Details</summary>

[Figure 8(b): grouped bar chart of success rate (%) on the Project CodeNet dataset for GPT-4o at temperatures t=0, t=0.5, and t=1.]
</details>
(b) Success Rate on Project CodeNet
Figure 8: Success Rate of SACTOR with different temperature settings for GPT-4o on TransCoder-IR and Project CodeNet datasets.
(1) TransCoder-IR (Figure 8(a)): Setting the decoder to a deterministic temperature of $t=0$ resulted in 83 successful translations (83%), while both $t=0.5$ and $t=1.0$ yielded 80 successes (80%) each, a slight improvement of 3 additional correct translations under the deterministic setting.
(2) Project CodeNet (Figure 8(b)): Temperature does not have a significant impact: the model produced 79, 81, and 79 successful outputs at $t=0$, $t=0.5$, and $t=1.0$ respectively (79-81%), indicating no clear trend across the temperature settings.
The results on both datasets suggest that lowering the temperature to zero can offer a slight reliability boost in some cases, but it does not significantly affect the overall performance of SACTOR.
## Appendix L Spec-driven Harness Rules
<details>
<summary>x20.png Details</summary>

### Visual Description
\n
## Diagram: FFI-based E2E Verifier System Architecture
### Overview
The image is a technical flowchart or system architecture diagram illustrating a process for transforming and verifying Rust code. The system uses a combination of Large Language Models (LLMs) and rule-based methods to convert "Unidiomatic Rust" into "Verified Idiomatic Rust" through an end-to-end verification process. The diagram shows two parallel pathways converging into a final verification step.
### Components/Axes
The diagram is composed of labeled boxes, file icons, directional arrows, and a central processing unit containing AI model logos. There are no traditional chart axes.
**Key Components & Labels (Spatially Organized):**
* **Top Center:** A large rectangular box labeled **"FFI-based E2E Verifier"**. This is the central verification component.
* **Center Left:** A rounded rectangle containing three logos/icons:
* A spiral logo (resembling the OpenAI logo).
* An infinity symbol (β).
* The text **"Gemini"** next to a blue "G" logo.
* The text **"AI"** next to a brown circular icon.
* **Left Side:** A file icon labeled **"RS"** (Rust source file) with the text **"Unidiomatic Rust"** below it. An arrow points from this file to the central AI box.
* **Center:** A file icon labeled **"RS"** with the text **"Idiomatic Rust"** below it. An arrow points from the central AI box to this file. Another arrow points from this file up to the "FFI-based E2E Verifier".
* **Right Side:** A file icon labeled **"RS"** with the text **"Verified Idiomatic Rust"** below it. An arrow points from the "FFI-based E2E Verifier" down to this file.
* **Bottom Left:** A file icon labeled **"JSON"** with the text **"SPEC"** below it. An arrow points from the central AI box down to this file.
* **Bottom Center:** A file icon labeled **"RS"** with the text **"Test harness With TODO"** below it. An arrow labeled **"Rule based"** (in blue text) points from the "SPEC" file to this file.
* **Bottom Right:** A file icon labeled **"RS"** with the text **"Test harness"** below it. An arrow labeled **"LLM driven"** (in blue text) points from the "Test harness With TODO" file to this file. A large, thick arrow points from this "Test harness" file up to the "FFI-based E2E Verifier".
### Detailed Analysis
The diagram outlines a multi-stage pipeline:
1. **Input:** The process begins with **"Unidiomatic Rust"** code.
2. **AI-Powered Transformation:** This code is processed by a system represented by the central box containing **Gemini** and other AI logos. This step produces two outputs:
* **"Idiomatic Rust"** code.
* A **"SPEC"** (Specification) in JSON format.
3. **Test Harness Generation (Dual Path):**
* **Path A (Rule-based):** The JSON **SPEC** is used via a **"Rule based"** process to generate an initial **"Test harness With TODO"**.
* **Path B (LLM-driven):** The initial test harness is then refined or completed via an **"LLM driven"** process to produce a final **"Test harness"**.
4. **End-to-End Verification:** The **"FFI-based E2E Verifier"** takes two primary inputs:
* The **"Idiomatic Rust"** code (from step 2).
* The final **"Test harness"** (from step 3).
5. **Output:** The verifier's successful output is the final **"Verified Idiomatic Rust"**.
### Key Observations
* **Hybrid Approach:** The system explicitly combines **"Rule based"** and **"LLM driven"** techniques for test generation, suggesting a strategy to leverage the strengths of both methods (determinism and flexibility).
* **Central AI Role:** The AI component (Gemini et al.) is pivotal, responsible for both code transformation (to idiomatic style) and specification generation.
* **Verification Focus:** The ultimate goal is not just transformation but *verification*, as emphasized by the final output being "Verified" and the central role of the "E2E Verifier".
* **Flow Direction:** The primary data flow is from left to right (Unidiomatic -> Idiomatic -> Verified). A secondary, supporting flow runs along the bottom for test generation (SPEC -> Test harnesses), which then feeds upward into the verifier.
</details>
Figure 9: Spec-driven harness generation and verification loop. The idiomatic translator co-produces idiomatic Rust and a machine-readable SPEC. A rule-based generator synthesizes a C-compatible harness from the SPEC; unsupported mappings trigger a localized LLM fallback. Harness and idiomatic code are linked via FFI for end-to-end tests.
Figure 9 illustrates the co-production timing and dataflow among artifacts (idiomatic code, SPEC, harness) and the verifier. Table 12 summarizes the SPEC patterns our rule-based generator currently supports.
| Pattern | SPEC keys | Mapping (U → I) | Notes |
| --- | --- | --- | --- |
| Scalars | shape: "scalar" | scalar → scalar | Common libc types are cast with as when needed; default compare is by value in the roundtrip self-test. |
| C string | ptr.kind: "cstring", ptr.null | *const/*mut c_char → String / &str / Option<String> | NULL handling via ptr.null or Option<...>; uses CStr / CString with lossless fallback. Return strings are converted back to *mut c_char. |
| Slices | ptr.kind: "slice", len_from / len_const | *const/*mut T + length → Vec<T>, &[T], or Option<...> | Requires a length source; empty or NULL produces None or empty according to the spec; writes back the length on I → U when a paired length field exists. |
| Single-element ref | ptr.kind: "ref" | *const/*mut T → Box<T> / Option<Box<T>> | For struct T, the generator calls the auto-generated struct converters CT_to_T_mut / T_to_CT_mut. |
| Derived length path | idiomatic path ending with .len | len field → vec.len | Recognizes the idiomatic data.len path and reuses the same U-side length field on roundtrip. |
| Nullability | ptr.null: nullable\|forbidden | C pointers → field with/without Option | nullable maps to Option<...> or tolerant empty handling. |
| &mut struct params | ownership: transient | *mut CStruct → &mut Struct or Option<&mut Struct> | Copies mutated values back after the call using the generated struct converters. |
| Return mapping | field with i_field.name = "ret" | idiomatic return → U output(s) | Scalars: direct or via *mut T. Strings: to *mut c_char. Slices: pointer + length writeback. Structs: via struct converters. |
| Comparison hints | compare: by_value\|by_slice\|skip | self-test behavior | Optional per-field checks after the U → I1 → U → I2 roundtrip, comparing I1 and I2. |
| Unsupported paths | all SPEC key pairs other than the supported paths | fallback | The generator emits localized TODOs for LLM completion; schema validation rejects malformed SPECs. |
Table 12: SPEC-driven harness coverage. U denotes the unidiomatic C-facing representation; I denotes the idiomatic Rust side.
Harness construction details.
The generator consumes a per-item SPEC (JSON) produced alongside the idiomatic code and synthesizes: (i) a C-compatible shim that matches the original ABI, and (ii) idiomatic adapters that convert to/from Rust types. Pointer shapes (scalar, cstring, slice, ref) determine how memory is borrowed or owned; length sources come from sibling fields or constants; nullability and ownership hints select Option<...> or strict checks. Return values are mapped back to U form, writing lengths when needed. This bridging resolves the ABI mismatch introduced by idiomatic function signatures.
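As a minimal sketch of what such a shim looks like (hypothetical names, not SACTOR's actual generated code, and assuming a `cstring` arg with a nullable null policy plus a `slice` arg with `len_from = "n"`):

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

// Idiomatic side: safe signature, as the idiomatic translation would produce.
// (Hypothetical function: length of the optional name plus the sum of data.)
fn greet_len(name: Option<&str>, data: &[i32]) -> i32 {
    name.map_or(0, |s| s.len() as i32) + data.iter().sum::<i32>()
}

// C-compatible shim matching the original ABI. The harness converts raw
// pointers into idiomatic types according to the SPEC shapes.
pub unsafe extern "C" fn greet_len_shim(name: *const c_char, xs: *const i32, n: usize) -> i32 {
    // cstring, ptr.null = nullable: NULL maps to None.
    let name_opt = if name.is_null() {
        None
    } else {
        Some(unsafe { CStr::from_ptr(name) }.to_str().expect("valid UTF-8"))
    };
    // slice, len_from = "n": NULL or zero length maps to an empty slice.
    let data: &[i32] = if xs.is_null() || n == 0 {
        &[]
    } else {
        unsafe { std::slice::from_raw_parts(xs, n) }
    };
    greet_len(name_opt, data)
}

fn main() {
    let name = std::ffi::CString::new("abc").unwrap();
    let xs = [1, 2, 3];
    let r = unsafe { greet_len_shim(name.as_ptr(), xs.as_ptr(), xs.len()) };
    println!("{}", r); // 3 + 6 = 9
    let r0 = unsafe { greet_len_shim(std::ptr::null(), std::ptr::null(), 0) };
    println!("{}", r0); // 0
}
```

The shim keeps the unsafe conversions at the ABI boundary so end-to-end tests written against the C signature can exercise the idiomatic function unchanged.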
Struct mappings and self-check.
For structs, the SPEC defines bidirectional converters between unidiomatic and idiomatic layouts. We validate adapter consistency with a minimal roundtrip: Unidiomatic → Idiomatic(1) → Unidiomatic → Idiomatic(2). The self-check compares Idiomatic(1) and Idiomatic(2) field by field according to compare hints: by_value requires exact equality on scalar fields; by_slice compares slice contents using the SPEC-recorded length source; skip omits fields that are aliasing views or externally owned to avoid false positives. Seed unidiomatic values are synthesized by an LLM guided by the SPEC so that nullability, ownership, and length sources are populated consistently.
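A minimal sketch of this roundtrip self-check, with a hypothetical CPoint/Point pair and hand-written converters standing in for the SPEC-generated ones:

```rust
// Hypothetical unidiomatic (C-layout) struct: scalars plus a slice field
// whose SPEC length source is `label_len`.
#[repr(C)]
#[derive(Clone, Copy)]
struct CPoint {
    x: i32,
    y: i32,
    label: *const u8,
    label_len: usize,
}

// Hypothetical idiomatic counterpart.
#[derive(Debug, PartialEq)]
struct Point {
    x: i32,
    y: i32,
    label: Vec<u8>,
}

// Bidirectional converters as the SPEC would define them.
unsafe fn cpoint_to_point(c: &CPoint) -> Point {
    let label = if c.label.is_null() || c.label_len == 0 {
        Vec::new()
    } else {
        unsafe { std::slice::from_raw_parts(c.label, c.label_len) }.to_vec()
    };
    Point { x: c.x, y: c.y, label }
}

fn point_to_cpoint(p: &Point) -> CPoint {
    CPoint { x: p.x, y: p.y, label: p.label.as_ptr(), label_len: p.label.len() }
}

fn main() {
    // Seed unidiomatic value (in SACTOR, an LLM synthesizes this from the SPEC).
    let buf = b"hi";
    let u = CPoint { x: 1, y: 2, label: buf.as_ptr(), label_len: buf.len() };
    // Roundtrip: Unidiomatic -> Idiomatic(1) -> Unidiomatic -> Idiomatic(2).
    let i1 = unsafe { cpoint_to_point(&u) };
    let u2 = point_to_cpoint(&i1);
    let i2 = unsafe { cpoint_to_point(&u2) };
    // Self-check: by_value on x/y, by_slice on label (folded into PartialEq here).
    assert_eq!(i1, i2);
    println!("roundtrip ok");
}
```

If a converter drops, truncates, or mis-measures a field, Idiomatic(1) and Idiomatic(2) diverge and the self-check fails before any end-to-end test runs.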
Fallback and verification loop.
When a SPEC uses patterns not yet implemented (e.g., pointer kinds outside cstring / slice / ref; non-trivial len_from expressions; string args whose spec.kind ≠ cstring), the generator emits a localized TODO that is completed by an LLM using the same SPEC as guidance; the resulting harness is then validated as usual. End-to-end tests run against the linked harness and idiomatic crate; passing tests provide confidence under their coverage, while failures trigger the paper's feedback procedure for regeneration and refinement.
### SPEC rule reference
This section explains the rule families the SPEC uses to describe how unidiomatic, C-facing values become idiomatic Rust and back. The schema has two top-level forms: a struct description and a function description. Both are expressed as small collections of field mappings from the unidiomatic side to idiomatic paths; a function return is just another mapping whose idiomatic path is the special name ret. This uniform treatment keeps the generator simple and makes the SPEC readable by humans and machines alike.
Pointer handling is captured by a compact notion of shape. A field is either a scalar or one of three pointer shapes: a byte string that follows C conventions, a slice that pairs a pointer with a length, or a single-object reference. Slices record where their length comes from (either a sibling field or a constant). Each pointer also carries a null policy that distinguishes admissible NULL from forbidden NULL, which in turn selects idiomatic options versus strict checks in the generated adapters.
Two lightweight hints influence how the harness allocates and how the roundtrip self-check behaves. An ownership hint (owning vs transient) signals whether the idiomatic side should materialize owned data or borrow it for the duration of the call. A comparison hint (by value, by slice, or skip) declares how roundtrip checks should assert equality, so that aliasing views or externally owned buffers can be skipped without producing spurious failures.
Finally, the schema enforces well-formedness and defines a safe escape hatch. Invalid combinations are rejected early by validation. Patterns that are valid but not yet implemented by the generator, such as complex dotted paths or unusual pointer views, are localized and handed to the LLM fallback described earlier; the SPEC itself remains the single source of truth for the intended mapping.
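A sketch of how the null policy plays out in generated adapters (hypothetical helper names; the real generator emits inline conversions rather than named helpers):

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

// ptr.null = "nullable": NULL is admissible and maps to Option::None.
unsafe fn cstring_nullable(p: *const c_char) -> Option<String> {
    if p.is_null() {
        None
    } else {
        Some(unsafe { CStr::from_ptr(p) }.to_string_lossy().into_owned())
    }
}

// ptr.null = "forbidden": NULL is a contract violation, so the adapter
// performs a strict check instead of wrapping in Option.
unsafe fn cstring_forbidden(p: *const c_char) -> String {
    assert!(!p.is_null(), "SPEC declares ptr.null = forbidden, but got NULL");
    unsafe { CStr::from_ptr(p) }.to_string_lossy().into_owned()
}

fn main() {
    let s = std::ffi::CString::new("ogg").unwrap();
    assert_eq!(unsafe { cstring_nullable(std::ptr::null()) }, None);
    assert_eq!(unsafe { cstring_nullable(s.as_ptr()) }.as_deref(), Some("ogg"));
    assert_eq!(unsafe { cstring_forbidden(s.as_ptr()) }, "ogg");
    println!("null-policy adapters ok");
}
```

The same split applies to slice and ref shapes: nullable selects Option-wrapped idiomatic types, while forbidden selects plain types guarded by a check at the boundary.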
## Appendix M Real-world Codebase Evaluation Details
### M.1 CRust-Bench Per-sample Outcomes
Table 13 lists, for each of the 50 samples, the function-level translation status and a concise failure analysis. Status is reported as per-sample function-level percentages in separate columns for the unidiomatic (Unid.) and idiomatic (Id.) stages.
### M.2 libogg Outcomes
(1) Using GPT-4o. 36 functions cannot be translated idiomatically. Nine of the translation failures are caused by translated functions not passing libogg's test cases. Six failures are due to compile errors in the translations, five of which result from the LLM violating Rust's safety rules on lifetimes, borrowing, and mutability. For example, the translation of the function _os_lacing_expand fails because it sets the value of a function parameter to a reference to the function's local variable vec, producing the error "`vec` does not live long enough." Two failures are due to SACTOR being unable to generate compilable test harnesses. If a function calls another function that SACTOR cannot translate, then the caller cannot be translated either; this accounts for the remaining 13 failures.
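The borrow error above follows a common pattern. A minimal sketch (not the actual libogg translation): the commented version fails to compile for exactly this reason, while returning owned data compiles:

```rust
// Failing shape (does not compile): storing a reference to a local in an
// out-parameter, so the borrow outlives the local.
//
// fn expand(out: &mut &[i32]) {
//     let vec = vec![0; 8];
//     *out = &vec; // error: `vec` does not live long enough
// }

// A compiling alternative returns owned data instead of a borrowed view.
fn expand(n: usize) -> Vec<i32> {
    vec![0; n]
}

fn main() {
    let lacing = expand(4);
    assert_eq!(lacing.len(), 4);
    println!("{:?}", lacing);
}
```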
(2) Using GPT-5. 17 functions cannot be translated idiomatically. Among them, three fail because the generated functions cannot pass the test cases, and three are due to failure to generate compilable test harnesses. Only one is caused by a compile error in the translated function, which shows GPT-5's progress in understanding Rust grammar and fixing compile errors. The remaining failures result from those functions' callees being untranslatable.
Table 13: CRust-Bench per-sample outcomes (function-level). Translation Status columns report per-sample function-level success rates for unidiomatic (Unid.) and idiomatic (Id.) stages.
| Sample | Unid. | Id. | Failure Analysis | Category |
| --- | --- | --- | --- | --- |
| 2DPartInt | 100.0% | 100.0% | — | — |
| 42-Kocaeli-Printf | 75.0% | — | C variadics require unstable c_variadic; unresolved va_list import blocks build. | Unidiomatic compile (C varargs/unstable feature) |
| CircularBuffer | 100.0% | 54.6% | CamelCase-to-snake_case renaming breaks signature lookup; later run panics under no-unwind context. | Idiomatic compile (symbol/name mapping) |
| FastHamming | 100.0% | 60.0% | Output buffer sized to input length in harness; bounds-check panic at runtime. | Harness runtime (buffer/length) |
| Holdem-Odds | 100.0% | 6.9% | Off-by-one rank yields out-of-bounds bucket index; SIGSEGV under tests. | Runtime fault (boundary/indexing) |
| Linear-Algebra-C | 100.0% | 44.8% | Pointer vs reference semantics mismatch (nullable C pointers vs Rust references); harness compile errors. | Harness compile (pointer/ref semantics) |
| NandC | 100.0% | 100.0% | — | — |
| Phills_DHT | 75.0% | — | Shadowed global hash_table keeps dht_is_initialised() false; assertion in tests. | Runtime fault (global state divergence) |
| Simple-Sparsehash | 100.0% | 40.0% | CamelCase-to-snake_case renaming causes signature/type mismatches; harness does not compile. | Idiomatic compile (symbol/name mapping) |
| SimpleXML | 83.3% | — | Missing ParseState and CamelCase-to-snake_case renaming breaks signatures; unidiomatic stalls. | Idiomatic compile (symbol/name mapping) |
| aes128-SIMD | 85.7% | — | Array-shape mismatch (expects 4x4 refs; passes row pointer); plus intrinsics/typedef noise. | Unidiomatic compile (array shape; intrinsics/types) |
| amp | 80.0% | — | Returned C string from amp_decode_arg is not NULL-terminated; strcmp reads past allocation and trips invalid read under tests. | Runtime fault (C string NULL termination) |
| approxidate | 85.7% | — | match_alpha references anonymous enum C2RustUnnamed that is never defined, causing cascaded missing-type errors across retries. | Unidiomatic compile (types/aliases) |
| avalanche | 100.0% | 75.0% | Capturing closure passed where fn pointer required; FILE*/Rust File bridging mis-modeled; compile fails. | Harness runtime (I/O/resource model mismatch) |
| bhshell | 88.2% | — | Many parser errors (enum lacks PartialEq, missing consts, u64 to usize drift, duplicates). | Unidiomatic compile (types/aliases) |
| bitset | 100.0% | 50.0% | Treats bit count as byte count in converter; overreads and panics under tests. | Harness runtime (buffer/length) |
| bostree | 52.4% | — | Function-pointer typedefs and pointer-shape drift break callback bridging. | Unidiomatic compile (function-pointer types/deps) |
| btree-map | 100.0% | 26.3% | Trace/instrumentation proc macro requires Debug on opaque C type node; harness compilation fails for get_node_count. | Harness compile (instrumentation bound) |
| c-aces | 100.0% | 3.9% | Struct converter mismatch (Vec<CMatrix2D> vs Vec<Matrix2D>) in generated harness; compile fails after retries. | Harness compile (struct converter/shape) |
| c-string | 100.0% | 29.4% | Size vs capacity mismatch in StringT constructor; empty buffer returned, C asserts. | Runtime fault (size/capacity mismatch) |
| carrays | 100.0% | 68.5% | Trace macro imposes Debug on generic T and callback; harness fails to compile (e.g., gca_lsearch). | Harness compile (instrumentation bound) |
| cfsm | 50.0% | — | Missing typedefs for C function-pointer callbacks; harness lacks nullable extern signatures, compile fails. | Unidiomatic compile (function-pointer types/deps) |
| chtrie | 100.0% | 0.0% | Pointer-of-pointers vs Vec adapter mismatch for struct chtrie | Harness compile (struct converter/shape) |
| cissy | 100.0% | 19.1% | Anonymous C types that c2rust renamed cannot be fetched correctly as a dependency | Unidiomatic compile (types/aliases) |
| clog | 31.6% | — | Variadic logging APIs and duplicate globals; unresolved vfprintf / c_variadic; compile fails. | Unidiomatic compile (C varargs/unstable feature) |
| cset | 100.0% | 25.0% | Translator renames XXH_readLE64 to xxh_read_le64; SPEC/harness require exact C name; exhausts six attempts. | Idiomatic compile (symbol/name mapping) |
| csyncmers | 66.7% | — | Unsigned underflow in compute_closed_syncmers (i - S + 1 without guard) triggers overflow panic; prior __uint128_t typedef issues. | Runtime fault (arithmetic underflow) |
| dict | 17.7% | — | Fn-pointer fields modeled non-optional (need Option<extern "C" fn>); plus va_list requires nightly c_variadic; compile fails. | Unidiomatic compile (function-pointer types/deps) |
| emlang | 16.3% | — | Anonymous-union alias (C2RustUnnamed) misuse; duplicate program_new; assertion bridging (__assert_fail) mis-modeled. | Unidiomatic compile (types/aliases) |
| expr | 33.3% | — | Missing C2RustUnnamed alias; C varargs in trace_eval; strncmp len type mismatch. | Unidiomatic compile (types/aliases) |
| file2str | 100.0% | 100.0% | — | — |
| fs_c | 100.0% | 60.0% | Idiomatic I/O wrappers mismatch C expectations (closed fd/OwnedFd abort; Err(NotFound) leads to C-side segfault). | Harness runtime (I/O/resource model mismatch) |
| geofence | 100.0% | 100.0% | — | — |
| gfc | 100.0% | 54.6% | Converter overread + ownership misuse; later compile errors. | Harness runtime (converter/ownership) |
| gorilla-paper-encode | 100.0% | 9.1% | Missing adapters + lifetimes (Cbitwriter_s / Cbitreader_s vs BitWriter / BitReader<'a>). | Harness compile (lifetimes/struct adapters) |
| hydra | 100.0% | 50.0% | Borrow overlap in list update; name mapping for FindCommand. | Idiomatic compile (borrow/lifetime; symbol mapping) |
| inversion_list | 17.0% | — | C allows NULL comparator/function pointers; wrapper unwraps and panics. | Runtime fault (function-pointer nullability) |
| jccc | 88.7% | — | Missing C2RustUnnamed alias and duplicate Expression / Lexer types; compile fails. | Unidiomatic compile (types/aliases) |
| leftpad | 100.0% | 100.0% | — | — |
| lib2bit | 100.0% | 13.6% | Non-clonable std::fs::File in harness (C FILE* vs Rust File I/O handle mismatch) | Harness runtime (I/O/resource model mismatch) |
| libbase122 | 100.0% | 37.5% | Reader cursor/buffer not preserved across calls; writer shape mismatch; tests fail. | Harness runtime (converter/ownership) |
| libbeaufort | 100.0% | 66.7% | Returns reference to temporary tableau; matrix parameter shape drift (char** vs Vec<Option<String>>); compile fails. | Idiomatic compile (borrow/lifetime) |
| libwecan | 100.0% | 100.0% | — | — |
| morton | 100.0% | 100.0% | — | — |
| murmurhash_c | 100.0% | 100.0% | — | — |
| razz_simulation | 33.3% | — | Type-name drift; node shape; ptr/ref API mismatch. | Harness compile (type/name drift; API mismatch) |
| rhbloom | 100.0% | 33.3% | Pointer/ref misuse; bit-length as bytes; overreads/panics. | Harness runtime (pointer/ref; length units) |
| totp | 77.8% | — | Anonymous C types that c2rust renamed cannot be fetched correctly as a dependency; plus duplicate helpers (pack32 / unpack64 / hmac_sha1); compile fails. | Unidiomatic compile (types/aliases) |
| utf8 | 100.0% | 30.8% | NULL deref + unchecked indices; SIGSEGV in tests. | Runtime fault (NULL deref/out-of-bounds) |
| vec | 100.0% | 0.0% | Idiomatic rewrite uses a bounds-checked copy; out-of-range panic under tests. | Runtime fault (boundary/indexing) |
## Appendix N Examples of Prompts Used in SACTOR
The following prompts are used to guide the LLM in C-to-Rust translation and verification tasks. The prompts may vary slightly to accommodate different translation tasks, as SACTOR leverages static analysis to fetch the necessary information for the LLM.
### N.1 Unidiomatic Translation
Figure 10 shows the prompt for translating unidiomatic C code to Rust.
````text
Translate the following C function to Rust. Try to keep the **equivalence** as much as possible.
`libc` will be included as the **only** dependency you can use. To keep the equivalence, you can use `unsafe` if you want.
The function is:
```c
{C_FUNCTION}
```
// Specific for main function
The function is the `main` function, which is the entry point of the program. The function signature should be: `pub fn main() -> ()`.
For `return 0;`, you can directly `return;` in Rust or ignore it if it's the last statement.
For other return values, you can use `std::process::exit()` to return the value.
For `argc` and `argv`, you can use `std::env::args()` to get the arguments.
The function uses some of the following stdio file descriptors: stdin, which will be included as
```rust
extern "C" {
    static mut stdin: *mut libc::FILE;
}
```
You should **NOT** include them in your translation, as the system will automatically include them.
The function uses the following functions, which are already translated as (you should **NOT** include them in your translation, as the system will automatically include them):
```rust
{DEPENDENCIES}
```
Output the translated function into this format (wrap with the following tags):
---- FUNCTION ----
```rust
// Your translated function here
```
---- END FUNCTION ----
````
Figure 10: Unidiomatic Translation Prompt
### N.2 Unidiomatic Translation with Feedback
Figure 11 shows the prompt for translating unidiomatic C code to Rust with feedback from the previous incorrect translation and error message.
````text
Translate the following C function to Rust. Try to keep the **equivalence** as much as possible.
`libc` will be included as the **only** dependency you can use. To keep the equivalence, you can use `unsafe` if you want.
The function is:
```c
{C_FUNCTION}
```
// Specific for main function
The function is the `main` function, which is the entry point of the program. The function signature should be: `pub fn main() -> ()`.
For `return 0;`, you can directly `return;` in Rust or ignore it if it's the last statement.
For other return values, you can use `std::process::exit()` to return the value.
For `argc` and `argv`, you can use `std::env::args()` to get the arguments.
The function uses some of the following stdio file descriptors: stdin, which will be included as
```rust
extern "C" {
    static mut stdin: *mut libc::FILE;
}
```
You should **NOT** include them in your translation, as the system will automatically include them.
The function uses the following functions, which are already translated as (you should **NOT** include them in your translation, as the system will automatically include them):
```rust
fn atoi(str: *const c_char) -> c_int;
```
Output the translated function into this format (wrap with the following tags):
---- FUNCTION ----
```rust
// Your translated function here
```
---- END FUNCTION ----
Lastly, the function is translated as:
```rust
{COUNTER_EXAMPLE}
```
It failed to compile with the following error message:
```
{ERROR_MESSAGE}
```
Analyzing the error messages, think about the possible reasons, and try to avoid this error.
````
Figure 11: Unidiomatic Translation with Feedback Prompt
### N.3 Idiomatic Translation
Figure 12 shows the prompt for translating unidiomatic Rust code to idiomatic Rust. Crown output is used to give the LLM hints about the ownership, mutability, and fatness of pointers.
````text
Translate the following unidiomatic Rust function into idiomatic Rust. Try to remove all the `unsafe` blocks and only use safe Rust code, or use `unsafe` blocks only when necessary.
Before translating, analyze the unsafe blocks one by one and how to convert them into safe Rust code.
**libc may not be provided in the idiomatic code, so try to avoid using libc functions and types, and avoid using the `std::ffi` module.**
```rust
{RUST_FUNCTION}
```
"Crown" is a pointer analysis tool that can help to identify the ownership, mutability and fatness of pointers. Following are the possible annotations for pointers:
```
fatness:
- `Ptr`: Single pointer
- `Arr`: Pointer is an array
mutability:
- `Mut`: Mutable pointer
- `Imm`: Immutable pointer
ownership:
- `Owning`: Owns the pointer
- `Transient`: Does not own the pointer
```
The following is the output of Crown for this function:
```
{CROWN_RESULT}
```
Analyze the Crown output first, then translate the pointers in function arguments and return values with the help of the Crown output.
Try to avoid using pointers in the function arguments and return values if possible.
Output the translated function into this format (wrap with the following tags):
---- FUNCTION ----
```rust
// Your translated function here
```
---- END FUNCTION ----
Also output a minimal JSON spec that maps the unidiomatic Rust layout to the idiomatic Rust for the function arguments and return value.
Full JSON Schema for the SPEC (do not output the schema; output only an instance that conforms to it):
```json
{_schema_text}
```
---- SPEC ----
```json
{{
  "function_name": "{function.name}",
  "fields": [
    {{
      "u_field": {{
        "name": "...",
        "type": "...",
        "shape": "scalar" | {{"ptr": {{"kind": "slice|cstring|ref", "len_from": "?", "len_const": 1}}}}
      }},
      "i_field": {{
        "name": "...",
        "type": "..."
      }}
    }}
  ]
}}
```
---- END SPEC ----
Few-shot examples (each with unidiomatic Rust signature, idiomatic Rust signature, and the SPEC):
Example F1 (slice arg):
Unidiomatic Rust:
```rust
pub unsafe extern "C" fn sum(xs: *const i32, n: usize) -> i32;
```
Idiomatic Rust:
```rust
pub fn sum(xs: &[i32]) -> i32;
```
---- SPEC ----
```json
{{
  "function_name": "sum",
  "fields": [
    {{ "u_field": {{"name": "xs", "type": "*const i32", "shape": {{ "ptr": {{ "kind": "slice", "len_from": "n" }} }} }},
       "i_field": {{"name": "xs", "type": "&[i32]" }} }},
    {{ "u_field": {{"name": "n", "type": "usize", "shape": "scalar" }},
       "i_field": {{"name": "xs.len", "type": "usize" }} }}
  ]
}}
```
---- END SPEC ----
Example F2 (ref out):
Unidiomatic Rust:
```rust
pub unsafe extern "C" fn get_value(out_value: *mut i32);
```
Idiomatic Rust:
```rust
pub fn get_value() -> i32;
```
---- SPEC ----
```json
{{
  "function_name": "get_value",
  "fields": [
    {{ "u_field": {{"name": "out_value", "type": "*mut i32", "shape": {{ "ptr": {{ "kind": "ref" }} }} }},
       "i_field": {{"name": "ret", "type": "i32" }} }}
  ]
}}
```
---- END SPEC ----
Example F3 (nullable cstring maps to Option):
Unidiomatic Rust:
```rust
pub unsafe extern "C" fn set_name(name: *const libc::c_char);
```
Idiomatic Rust:
```rust
pub fn set_name(name: Option<&str>);
```
---- SPEC ----
```json
{{
  "function_name": "set_name",
  "fields": [
    {{ "u_field": {{"name": "name", "type": "*const c_char", "shape": {{ "ptr": {{ "kind": "cstring", "null": "nullable" }} }} }},
       "i_field": {{"name": "name", "type": "Option<&str>" }} }}
  ]
}}
```
---- END SPEC ----
````
Figure 12: Idiomatic Translation Prompt
### N.4 Idiomatic Verification
Idiomatic verification is the process of verifying the correctness of the translated idiomatic Rust code by generating a test harness. The prompt for idiomatic verification is shown in Figure 13.
````text
We have an initial spec-driven harness with TODOs. Finish all TODOs and ensure it compiles.
Idiomatic signature:
```rust
pub fn compute_idiomatic(
    x: i32,
    name: &str,
    data: &[u8],
    meta: HashMap<String, String>,
) -> i32;
```
Unidiomatic signature:
```rust
pub unsafe extern "C" fn compute(x: i32, name: *const libc::c_char, data: *const u8, len: usize, meta: *const libc::c_char) -> i32;
```
Current harness:
```rust
pub unsafe extern "C" fn compute(x: i32, name: *const libc::c_char, data: *const u8, len: usize, meta: *const libc::c_char) -> i32
{
    // Arg `name`: borrowed C string at name
    let name_str = if !name.is_null() {
        unsafe { std::ffi::CStr::from_ptr(name) }.to_string_lossy().into_owned()
    } else {
        String::new()
    };
    // Arg `data`: slice from data with len `len` as usize
    let data_len = len as usize;
    let data_len_non_null = if data.is_null() { 0 } else { data_len };
    let data: &[u8] = if data_len_non_null == 0 {
        &[]
    } else {
        unsafe { std::slice::from_raw_parts(data as *const u8, data_len_non_null) }
    };
    // TODO: param meta of type HashMap<String, String>: unsupported mapping
    let __ret = compute_idiomatic(x, &name_str, data, /* TODO param meta */);
    return __ret;
}
```
Output only the final function in this format:
---- FUNCTION ----
```rust
// Your translated function here
```
---- END FUNCTION ----
````
Figure 13: Idiomatic Verification Prompt
### N.5 Failure Reason Analysis
Figure 14 shows the prompt for analyzing the reasons for the failure of the translation.
````text
Given the following C code:
```c
{original_code}
```
The following code is generated by a tool that translates C code to Rust code. The tool has a bug that causes it to generate incorrect Rust code. The bug is related to the following error message:
```json
{json_data}
```
Please analyze the error message and provide a reason why the tool generated incorrect Rust code.
1. Append a new reason to the list of reasons.
2. Select a reason from the list of reasons that best describes the error message.
Please provide a reason why the tool generated incorrect Rust code **FUNDAMENTALLY**.
List of reasons:
{all_current_reasons}
Please provide the analysis output in the following format:
```json
{
  "action": "append", // or "select" to select a reason from the list of reasons
  "reason": "Format string differences between C and Rust", // the reason for the error message, if action is "append"
  "selection": 1 // the index of the reason from the list of reasons, if action is "select"
  // "reason" and "selection" are mutually exclusive; you should only provide one of them
}
```
Please **make sure** to provide a general reason that can be applied to multiple cases, not a specific reason that only applies to the current case.
Please provide a reason why the tool generated incorrect Rust code **FUNDAMENTALLY** (NOTE that the reason for the first failure is always NOT the fundamental reason).
````
Figure 14: Failure Reason Analysis Prompt