# SACTOR: LLM-Driven Correct and Idiomatic C to Rust Translation with Static Analysis and FFI-Based Verification
**Authors**: Tianyang Zhou, Ziyi Zhang, Haowen Lin, Somesh Jha, Mihai Christodorescu, Kirill Levchenko, Varun Chandrasekaran
> University of Illinois Urbana-Champaign
> University of Wisconsin–Madison
> Google
## Abstract
Translating software written in C to Rust has significant benefits for memory safety. However, manual translation is cumbersome, error-prone, and often produces unidiomatic code. Large language models (LLMs) have demonstrated promise in producing idiomatic translations, but offer no correctness guarantees. We propose SACTOR, an LLM-driven C-to-Rust translation tool that employs a two-step process: an initial “unidiomatic” translation that preserves the interface, followed by an “idiomatic” refinement that aligns with Rust standards. To validate the correctness of our function-wise incremental translation, which mixes C and Rust, we use end-to-end testing via the foreign function interface. We evaluate SACTOR on $200$ programs from two public datasets and on two more complex scenarios (a 50-sample subset of CRust-Bench and the libogg library), comparing multiple LLMs. Across datasets, SACTOR delivers high end-to-end correctness and produces safe, idiomatic Rust with up to 7 $\times$ fewer Clippy warnings; on CRust-Bench, SACTOR achieves an average (across samples) of 85% unidiomatic and 52% idiomatic success, and on libogg it attains full unidiomatic and up to 78% idiomatic coverage with GPT-5.
Keywords Software Engineering $\cdot$ Static Analysis $\cdot$ C $\cdot$ Rust $\cdot$ Large Language Models $\cdot$ Machine Learning
## 1 Introduction
C is widely used due to its ability to directly manipulate memory and hardware (love2013linux). However, manual memory management leads to vulnerabilities such as buffer overflows, dangling pointers, and memory leaks (bigvul). Rust addresses these issues by enforcing memory safety through a strict ownership model without garbage collection (matsakis2014rust), and has been adopted in projects like the Linux kernel (https://github.com/Rust-for-Linux/linux) and Mozilla Firefox. Translating legacy C code into idiomatic Rust improves safety and maintainability, but manual translation is error-prone, slow, and requires expertise in both languages.
Automatic tools such as C2Rust (c2rust) generate Rust by analyzing C ASTs, but rule-based or static approaches (crown; c2rust; emre2021translating; hong2024don; ling2022rust) typically yield unidiomatic code with heavy use of unsafe. Given semantic differences between C and Rust, idiomatic translations are crucial for compiler-enforced safety, readability, and maintainability.
Large language models (LLMs) show potential for capturing syntax and semantics (pan2023understanding), but they hallucinate and often generate incorrect or unsafe code (perry2023users). In C-to-Rust translation, naive prompting produces unsafe or semantically misaligned outputs. Prior work has explored prompting strategies (syzygy; c2saferrust; shiraishi2024context) and verification methods such as fuzzing and symbolic execution (vert; flourine). While these improve correctness, they struggle with complex programs and rarely yield idiomatic Rust. For example, Vert (vert) fails on programs with complex data structures, and C2SaferRust (c2saferrust) still produces Rust with numerous unsafe blocks.
In this paper, we introduce SACTOR, a structure-aware, LLM-driven C-to-Rust translator (Figure 1). SACTOR follows a two-stage pipeline:
- C $\to$ Unidiomatic Rust: Interface-preserving translation that may use unsafe for low-level operations.
- Unidiomatic $\to$ Idiomatic Rust: Behaviorally-equivalent translation that refines to Rust idioms, eliminating unsafe and migrating C API patterns to Rust equivalents.
Static analysis of C code (pointer semantics, dependencies) guides both stages. To verify correctness, we embed the translated Rust with the original C via the Foreign Function Interface (FFI), enabling end-to-end testing at both stages; a stage is accepted when all end-to-end tests pass. This decomposition separates syntax from semantics, simplifies the LLM task, and ensures more idiomatic, memory-safe Rust. SACTOR's code is available at https://github.com/qsdrqs/sactor and the datasets at https://github.com/qsdrqs/sactor-datasets. An example of the SACTOR translation process is in Appendix E.
LLM orchestration. SACTOR places the LLM inside a neuro-symbolic feedback loop. Static analysis and a machine-readable interface specification guide prompting; compiler diagnostics and end-to-end tests provide structured feedback. In the idiomatic verification phase, a rule-based harness generator with an LLM fallback completes the feedback loop. This design first ensures semantic correctness in unidiomatic Rust, then refines it into idiomatic Rust, with both stages verifiable in a unified two-step process.
Our contributions are as follows:
- Method: An LLM-orchestrated, structure-aware two-phase pipeline that separates semantic preservation from idiomatic refinement, guided by static analysis (§ 4)
- Verification: SACTOR verifies both unidiomatic and idiomatic translations via FFI-based testing. During idiomatic verification, it uses a co-produced interface specification to synthesize C/Rust harnesses with an LLM fallback for missing patterns; compiler and test feedback are structured into targeted prompt repairs (§ 4.3).
- Evaluation: Across two datasets (200 programs) and five LLMs, SACTOR reaches 93% / 84% end-to-end correctness (DeepSeek-R1) and improves idiomaticity (§ 6.2). On CRust-Bench (50 samples), unidiomatic translation averages 85% function-level success across all samples (82% aggregated across functions), with 32/50 samples fully translated; idiomatic success is computed on those 32 samples and averages 52% (43% aggregated; 8/32 fully idiomatic). On libogg (77 functions), the function-level success rate is 100% for unidiomatic translation and 53% and 78% for idiomatic translation with GPT-4o and GPT-5, respectively (§ 6.3).
- Diagnostics: We analyze efficiency, feedback, temperature sensitivity, and failure cases: GPT-4o is the most token-efficient, compilation/testing feedback boosts weaker models by 17%, temperature has little effect, and reasoning models like DeepSeek-R1 excel on complex bugs such as format-string and array errors (Appendix H).
Figure 1: Overview of the SACTOR methodology.
## 2 Background
Primer on C and Rust: C is a low-level language that provides direct access to memory and hardware through pointers and thin abstractions over machine-level instructions (tiobe). While this makes it efficient, it suffers from memory vulnerabilities (sbufferoverflow; hbufferoverflow; uaf; memoryleak). Rust, in contrast, provides memory safety without an additional performance penalty and retains C's ability to access low-level hardware; it enforces strict compile-time memory safety through ownership, borrowing, and lifetimes to eliminate memory vulnerabilities (matsakis2014rust; jung2017rustbelt).
Challenges in Code Translation: Despite these advantages, and since Rust is relatively new, many widely used system-level programs remain in C. It is desirable to translate such programs to Rust, but the process is challenging due to fundamental language differences. Figure 3 in Appendix A shows an example of a simple C program and its Rust equivalent to illustrate the differences between the two languages in memory management and error handling. While Rust permits unsafe blocks for C-like pointer operations, their use is discouraged due to the absence of compiler guarantees and their non-idiomatic nature for further maintenance. Other differences include string representation, pointer usage, array handling, reference lifetimes, and error propagation; a non-exhaustive summary appears in Appendix A.
## 3 Related Work
LLMs for C-to-Rust Translation: Vert (vert) combines LLM-generated candidates with fuzz testing and symbolic execution to ensure equivalence, but this strict verification struggles with scalability and complex C features. Flourine (flourine) incorporates error feedback and fuzzing, using data type serialization to mitigate mismatches, yet serialization issues still account for nearly half of errors. shiraishi2024context decompose C programs into sub-tasks (e.g., macros) and translate them with predefined Rust idioms, but evaluate only compilation success without functional correctness. syzygy employ dynamic analysis to capture runtime behavior as translation guidance, but coverage limits hinder generalization across execution paths. c2saferrust refine C2Rust outputs with LLMs to reduce unidiomatic constructs (unsafe, libc), but remain constrained by C2Rust’s preprocessing, which strips comments and directives (§ 4.2) and reduces context for idiomatic translation.
Non-LLM Approaches for C-to-Rust Translation: C2Rust (c2rust) translates by converting C ASTs into Rust ASTs and applying rule-based transformations. While syntactically correct, the results are structural translations that rely heavily on unsafe blocks and explicit type conversions, yielding low readability. Crown (crown) introduces static ownership tracking to reduce pointer usage in generated Rust code. hong2024don focus on handling return values in translation, while ling2022rust rely on rules and heuristics. Although these methods reduce some unsafe usage compared to C2Rust, the resulting code remains largely unidiomatic.
## 4 SACTOR Methodology
We propose SACTOR, an LLM-driven C-to-Rust translation tool using a two-step translation methodology. As Rust and C differ substantially in semantics (§ 2), SACTOR augments the LLM with static-analysis-derived “hints” that capture semantic information in the C code. The four main stages of SACTOR are outlined below.
### 4.1 Task Division
We begin by dividing the program into smaller parts that can be processed by the LLM independently. This enables the LLM to focus on a narrower scope for each translation task and ensures the program fits within its context window. This strategy is supported by studies showing that LLM performance degrades on long-context understanding and generation tasks (liu2024longgenbench; li2024long). By breaking the program into smaller pieces, we can mitigate these limitations and improve performance on each individual task. To facilitate task division and extract relevant language information – such as definitions, declarations, and dependencies – from C code, we developed a static analysis tool called C Parser based on libclang (a library that provides a C compiler interface, allowing access to semantic information of the code).
Our C Parser analyzes the input program and splits the program into fragments consisting of a single type, global variable, or function definition. This step also extracts semantic dependencies between each part (e.g., a function definition depending on a prior type definition). We then process each program fragment in dependency order: all dependencies of a code fragment are processed before the fragment. Concretely, C Parser constructs a directed dependency graph whose nodes are types, global variables, and functions, and whose edges point from each item to the items it directly depends on. We compute a translation order by repeatedly selecting items whose dependencies have already been processed. If the dependency graph contains a cycle, SACTOR currently treats this as an unsupported case and terminates with an explicit error. In addition, to support real-world C projects, SACTOR makes use of the C project compile commands generated by the make tool and performs preprocessing on the C source files. In Appendix B, we provide more details on how we preprocess source files and divide programs.
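As a running illustration (hypothetical code, not taken from the datasets), consider a small fragment and the items and dependency edges C Parser would extract from it:

```c
/* Hypothetical input fragment. C Parser splits it into three items and
 * records the dependency edges noted in the comments. */

struct Point { int x; int y; };           /* item 1: type (no dependencies) */

struct Point ORIGIN = {0, 0};             /* item 2: global, depends on Point */

int manhattan(const struct Point *p) {    /* item 3: function, depends on Point */
    int dx = p->x < 0 ? -p->x : p->x;
    int dy = p->y < 0 ? -p->y : p->y;
    return dx + dy;
}
/* Resulting translation order: Point -> ORIGIN -> manhattan. */
```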
### 4.2 Translation
To ensure that each program fragment is translated only after its dependencies have been processed, we begin by translating data types, as they form the foundational elements for functions. This is followed by global variables and functions. We divide the translation process into two steps.
Step 1. Unidiomatic Rust Translation: We aim to produce interface-equivalent Rust code from the original C code, allowing the use of unsafe blocks for pointer manipulation and C standard library functions while keeping the same interface as the original C code. For data type translation, we leverage information from C2Rust (c2rust) to help the conversion. While C2Rust provides reliable data type translation, it struggles with function translation due to its compiler-based approach, which omits source-level details like comments, macros, and other elements. These omissions significantly reduce the readability and usability of the generated Rust code. Thus, we use C2Rust only for data type translation, and use an LLM to translate global variables and functions. For functions, we rely on our C Parser to automatically extract dependencies (e.g., function signatures, data types, and global variables) and reference the corresponding Rust code. This guides the LLM to accurately translate functions by leveraging the previously translated components and directly reusing or invoking them as needed.
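For the hypothetical fragment above, a Step-1 output could look as follows (a sketch for illustration, not SACTOR's verbatim output): names, layout, and calling convention match the C original exactly, and pointer work stays inside unsafe.

```rust
#[repr(C)]
pub struct Point {
    pub x: i32,
    pub y: i32,
}

#[no_mangle]
pub static mut ORIGIN: Point = Point { x: 0, y: 0 };

// Same signature and ABI as the C function; a direct, line-by-line port
// that still dereferences the raw pointer.
#[no_mangle]
pub extern "C" fn manhattan(p: *const Point) -> i32 {
    unsafe {
        let dx = if (*p).x < 0 { -(*p).x } else { (*p).x };
        let dy = if (*p).y < 0 { -(*p).y } else { (*p).y };
        dx + dy
    }
}
```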
Step 2. Idiomatic Rust Translation: The goal of this step is to refine unidiomatic Rust into idiomatic Rust by removing unsafe blocks and following Rust idioms. This stage focuses on rewriting behaviorally equivalent but low-level constructs into type-safe abstractions while preserving the behavior verified in the previous step. Handling pointers from C code is a key challenge, as they are considered unsafe in Rust. Unsafe pointers should be replaced with Rust types such as references, arrays, or owned types. To address this, we use Crown (crown) to facilitate the translation by analyzing pointer mutability, fatness (e.g., arrays), and ownership. This information provided by Crown helps the LLM assign appropriate Rust types to pointers. Owned pointers are translated to Box, while borrowed pointers use references or smart pointers. Crown assists in translating data types like struct and union, which are processed first as they are often dependencies for functions. For function translations, Crown analyzes parameters and return pointers, while local variable pointers are inferred by the LLM. Dependencies are extracted using our C Parser to guide accurate function translation. The idiomatic code, produced together with an interface transformation specification, forms the input to the verification step in § 4.3.
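Continuing the sketch, Step 2 rewrites the same function against the pointer hints: Crown would classify the parameter as a borrowed, non-owning pointer, so it becomes a shared reference and the unsafe block disappears.

```rust
pub struct Point {
    pub x: i32,
    pub y: i32,
}

// Idiomatic refinement of the Step-1 function: the raw pointer becomes
// a safe reference; behavior is unchanged.
pub fn manhattan(p: &Point) -> i32 {
    p.x.abs() + p.y.abs()
}
```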
### 4.3 Verification
To verify equivalence between source and target languages, prior work has relied on symbolic execution and fuzz testing, both of which are impractical for real-world C-to-Rust translation (details in Appendix C). We instead validate correctness through soft equivalence: ensuring functional equivalence of the entire program via end-to-end (E2E) tests. This avoids the complexity of generating specific inputs or constraints for individual functions and is well-suited for real-world programs, where such E2E tests are often available and reusable. Correctness confidence in this framework depends on the code coverage of the E2E tests: the broader the coverage, the stronger the assurance of equivalence.
Verifying Unidiomatic Rust Code. This is straightforward: the unidiomatic Rust is semantically equivalent to the original C code and maintains compatible function signatures and data types, which ensures a consistent Application Binary Interface (ABI) between the two languages and enables direct use of the FFI for cross-language linking. The verification process involves two main steps. First, the unidiomatic Rust code is compiled with the Rust compiler to check for successful compilation. Then, the original C code is recompiled with the Rust translation linked as a shared library. This setup ensures that when the C code calls the target function, it invokes the Rust translation instead. To verify correctness, E2E tests are run on the entire program, comparing the outputs of the original C code and the unidiomatic Rust translation. If all tests pass, the target function is considered verified.
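Concretely, a minimal sketch of this link-and-test setup for the running example (file and library names assumed) looks as follows:

```rust
// Sketch of the FFI verification step (assumed names, Linux toolchain).
// 1. Compile the unidiomatic translation as a C-compatible shared library:
//      rustc --crate-type=cdylib manhattan.rs        # -> libmanhattan.so
// 2. Relink the original C program against it, with the C definition of
//    manhattan removed from the build:
//      gcc main.c -L. -lmanhattan -o main
// 3. Re-run the program's E2E tests; calls to manhattan() now execute
//    the Rust translation.
#[repr(C)]
pub struct Point {
    pub x: i32,
    pub y: i32,
}

#[no_mangle]
pub extern "C" fn manhattan(p: *const Point) -> i32 {
    unsafe { (*p).x.abs() + (*p).y.abs() }
}
```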
Verifying Idiomatic Rust Code. Idiomatic Rust diverges from the original C program in both types and function signatures, producing an ABI mismatch that prevents direct linking into the C build. We therefore verify it via a synthesized, C-compatible test harness together with E2E tests.
During idiomatic translation, SACTOR co-produces a small, machine-readable specification (SPEC) for each function/struct. The SPEC captures, in a compact form, how C-facing values map to idiomatic Rust, including the expected pointer shape (slice / cstring / ref), where lengths come from (a sibling field or a constant), and basic nullability and return conventions; it also allows marking fields that should be compared in self-checks. A rule-based generator consumes the SPEC to synthesize a C-compatible harness that bridges from the C ABI to idiomatic code and back. Figure 9 shows the schematic, and Table 12 summarizes currently supported patterns; Appendix L presents a detailed exposition of the SPEC-driven harness generation technique (rules and design choices), and Appendix D provides a concrete example of the generated harness. For structs, the SPEC defines bidirectional converters between the C-facing and idiomatic layouts, validated by a lightweight roundtrip test that checks the fields marked as comparable for consistency after conversion. When the SPEC includes a pattern the generator does not yet implement (e.g., aliasing/offset views or unsupported pointer kinds or types), we emit a localized TODO and use an LLM guided by the SPEC to fill only the missing conversions. Finally, we compile the idiomatic crate and the generated harness, link them into the original C build via FFI, and run the program's existing E2E tests; passing tests validate the idiomatic translation under the coverage of those tests, while failures trigger the feedback procedure described next.
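Appendix L defines the actual SPEC format; purely for illustration, a hypothetical entry (all field names invented here) for a C function `size_t sum(const int *data, size_t len)` whose idiomatic form is `fn sum(data: &[i32]) -> usize` might record:

```yaml
# Hypothetical SPEC entry; field names are invented for illustration.
function: sum
params:
  - name: data
    pointer_shape: slice          # C pointer maps to a Rust slice
    length_from: { param: len }   # slice length comes from the sibling parameter
    nullable: false
returns:
  convention: value               # plain value return, no out-parameter
```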
Feedback Mechanism. For failures, we feed structured signals back to the translator: compiler errors guide fixes for build breaks; for E2E failures, we use a Rust procedural macro to automatically instrument the target to log salient inputs/outputs, re-run the tests, and return the traces to the translator for refinement.
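As an illustration of the E2E-failure path, the instrumentation conceptually rewrites the target as below (shown by hand here; SACTOR applies it via the procedural macro):

```rust
pub struct Point { pub x: i32, pub y: i32 }

// Conceptual effect of the instrumentation: salient inputs and outputs
// are logged, the E2E tests are re-run, and the captured traces are
// returned to the translator for refinement.
pub fn manhattan(p: &Point) -> i32 {
    eprintln!("[trace] manhattan: input = ({}, {})", p.x, p.y);
    let result = p.x.abs() + p.y.abs();
    eprintln!("[trace] manhattan: output = {}", result);
    result
}
```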
### 4.4 Code Combination
By translating and verifying all functions and data types, we integrate them into a unified Rust codebase. We first collect the translated Rust code from each subtask and remove duplicate definitions and other redundancies required only for standalone compilation. The cleaned code is then organized into a well-structured Rust implementation of the original C program. Finally, we run end-to-end tests on the combined program to verify the correctness of the final Rust output. If all tests pass, the translation is considered successful.
## 5 Experimental Setup
### 5.1 Datasets Used
For the selection of datasets for evaluation, we consider the following criteria:
- Sufficient Number: The dataset should contain a substantial number of C programs to ensure a robust evaluation of the approach’s performance across a diverse set of examples.
- Presence of Non-Trivial C Features: The dataset should include C programs with advanced features such as multiple functions, structs, and other non-trivial constructs, enabling the evaluation to assess the approach's ability to handle complex C features.
- Availability of E2E Tests: The dataset should either include E2E tests or make them easy to generate, as this is essential for accurately evaluating the correctness of the translated code.
Based on the above criteria, we evaluate on two widely used program suites in the translation literature: TransCoder-IR (transcoderir) and Project CodeNet (codenet). Complete details for these datasets are in Appendix F. For TransCoder-IR and CodeNet, we randomly sample 100 C programs from each (for CodeNet, among programs with external inputs) to ensure computational feasibility while maintaining statistical significance.
To better reflect the language features of real-world C codebases and allow test reuse (§ 6.3), we also evaluate on two targets: (i) a 50-sample subset of CRust-Bench (khatry2025crust) and (ii) the libogg multimedia container library (libogg). In CRust-Bench, we exclude entries outside our pipeline's scope (e.g., circular dependencies or compiler-specific intrinsics). libogg is a real-world C project of about 2,000 lines of code with 77 functions involving non-trivial structs, buffers, and pointer manipulation. Both benchmarks reuse their upstream end-to-end tests to verify the translated code.
### 5.2 Evaluation Metrics
Success Rate: This is defined as the fraction of programs that (a) are successfully translated to Rust and (b) pass the E2E tests, for both the unidiomatic and idiomatic translation phases. To enable the LLMs to utilize feedback from previous failed attempts, we allow up to 6 attempts for each translation process.
Idiomaticity: To evaluate the idiomaticity of the translated code, we use three metrics:
- Lint Alert Count is measured by running Rust-Clippy (clippy), a tool that provides lints on unidiomatic Rust (including improper use of unsafe code and other common style issues). By collecting the warnings and errors generated by Rust-Clippy for the translated code, we can assess its idiomaticity: fewer alerts indicate more idiomaticity. Previous translation works (vert; flourine) have also used Rust-Clippy.
- Unsafe Code Fraction, inspired by shiraishi2024context, is defined as the ratio of tokens inside unsafe code blocks or functions to the total tokens of a single program (formalized after this list). Heavy use of unsafe is considered unidiomatic, as it bypasses compiler safety checks, introduces potential memory safety issues, and reduces code readability.
- Unsafe-Free Fraction indicates the percentage of translated programs in a dataset that do not contain any unsafe code. Since unsafe code marks points where the compiler cannot guarantee safety, this metric captures the fraction of translations achieved without relying on unsafe code at all.
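Writing $\mathrm{tok}(p)$ for the token sequence of a translated program $p$ and $P$ for the set of translated programs in a dataset, the two fractions can be written as:

$$\mathrm{UnsafeFrac}(p)=\frac{\bigl|\{\,t\in\mathrm{tok}(p) : t\ \text{lies inside an unsafe block or function}\,\}\bigr|}{\bigl|\mathrm{tok}(p)\bigr|},\qquad \mathrm{UnsafeFree}(P)=\frac{\bigl|\{\,p\in P : \mathrm{UnsafeFrac}(p)=0\,\}\bigr|}{\bigl|P\bigr|}.$$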
### 5.3 LLMs Used
We evaluate 6 models across different experiments. On the two datasets (TransCoder-IR and CodeNet) we use four non-reasoning models, GPT-4o (OpenAI), Claude 3.5 Sonnet (Anthropic), Gemini 2.0 Flash (Google), and Llama 3.3 70B Instruct (Meta), and one reasoning model, DeepSeek-R1 (DeepSeek). For real-world codebases, we run GPT-4o on CRust-Bench and both GPT-4o and GPT-5 on libogg. Model configurations appear in Appendix G.
## 6 Evaluation
Through our evaluation, we answer: (1) How successful is SACTOR in generating idiomatic Rust code using different LLM models?; (2) How idiomatic is the Rust code produced by SACTOR compared to existing approaches?; and (3) How well does SACTOR generalize to real-world C codebases?
Our results show that: (1) DeepSeek-R1 achieves the highest success rates with SACTOR on both TransCoder-IR (93%) and Project CodeNet (84%) (§ 6.1), while failure reasons vary between datasets and models (Appendix H); (2) SACTOR's idiomatic translations outperform all previous baselines, producing Rust code with fewer Clippy warnings and 100% unsafe-free translations (§ 6.2); and (3) for real-world codebases (§ 6.3), SACTOR attains strong unidiomatic success and moderate idiomatic success: on CRust-Bench, unidiomatic translation averages 85% across 50 samples (82% aggregated across 966 functions; 32/50 fully translated) and idiomatic translation averages 52% across the 32 samples fully translated to unidiomatic Rust (43% aggregated across 580 functions; 8/32 fully translated); on libogg, unidiomatic translation reaches 100% and idiomatic translation spans 53% and 78% for GPT-4o and GPT-5, respectively. Failures concentrate at ABI/type boundaries and harness synthesis (pointer/slice shape, length sources, lifetime or mutability), with additional cases from unsupported features and borrow/ownership pitfalls. Overall, improving the model itself alleviates a subset of failure modes; for a fixed model, strengthening the framework and interface rules also improves outcomes but remains limited when confronted with previously unseen patterns.
We also evaluate the computational cost of SACTOR (Appendix I), the impact of the feedback mechanism (Appendix J), and temperature settings (Appendix K). GPT-4o and Gemini 2.0 achieve the best cost-performance balance, while Llama 3.3 consumes the most tokens among non-reasoning models; DeepSeek-R1 uses 3-7 $\times$ more tokens than the others. The feedback mechanism boosts Llama 3.3's success rate by 17% but has little effect on GPT-4o, suggesting it benefits lower-performing models more. Temperature has minimal impact.
### 6.1 Success Rate Evaluation
(a) TransCoder-IR SR
(b) CodeNet SR
Figure 2: Success rates (SR) across different LLM models for the TransCoder-IR and CodeNet datasets. SR 1-6 represent the number of attempts made to achieve a successful translation. Unid. and Idiom. denote unidiomatic and idiomatic translation steps, respectively.
We evaluate the success rate (as defined in § 5.2) for the two datasets on different models. For idiomatic translation, we also plot how many attempts are needed.
(1) TransCoder-IR (Figure 2(a)): DeepSeek-R1 achieves the highest success rate (SR) in both the unidiomatic (94%) and idiomatic (93%) steps, with only a 1% drop in the idiomatic step, demonstrating strong consistency in code translation. GPT-4o follows with 84% in the unidiomatic step and 80% in the idiomatic step. Gemini 2.0 comes next with 78% and 75%, respectively. Claude 3.5 struggles in the unidiomatic step (55%) and shows little further degradation when converting unidiomatic to idiomatic Rust (54%, a 1% drop), but remains the weakest model overall. Llama 3.3 performs well in the unidiomatic step (76%) but drops significantly in the idiomatic step (64%), requiring more attempts for correctness.
(2) Project CodeNet (Figure 2(b)): DeepSeek-R1 again leads with 86% in the unidiomatic step and 84% in the idiomatic step, a 2% drop. Claude 3.5 follows closely with 86% in the unidiomatic step and 83% in the idiomatic step. GPT-4o performs consistently well in the unidiomatic step (84%) but drops to 79% in the idiomatic step, a 5% gap between the two steps. Gemini 2.0 follows with 78% in the unidiomatic step and 74% in the idiomatic step, showing consistent performance across the two datasets. Llama 3.3 again exhibits a significant drop (83% to 76%) between the two steps and finishes last in the idiomatic step.
The results demonstrate that DeepSeek-R1's SRs remain high and consistent (94%/93% unidiomatic/idiomatic on TransCoder-IR versus 86%/84% on CodeNet), while other models exhibit notable performance drops when moving to TransCoder-IR. This suggests that models with reasoning capabilities may be better at handling complex code logic and data manipulation.
### 6.2 Measuring Idiomaticity
We compare our approach with four baselines: C2Rust (c2rust), Crown (crown), C2SaferRust (c2saferrust), and Vert (vert). Of these baselines, C2Rust is the most versatile (versatility here refers to an approach's applicability to diverse C programs), supporting most C programs, while Crown is also broad but lacks support for some language features. C2SaferRust focuses on refining the unsafe code produced by C2Rust, allowing it to handle a wide range of C programs. In contrast, Vert targets a specific subset of simpler C programs. We assess the idiomaticity of Rust code generated by C2Rust, Crown, and C2SaferRust on both datasets. Since Vert produced Rust code only for TransCoder-IR, we evaluate it solely on this dataset. All experiments use GPT-4o as the LLM for the baselines and our approach, with at most 6 attempts per translation.
Results: Figure LABEL:fig:idiomaticity presents the lint alert count (the sum of Clippy warnings and errors for a single program) across all approaches. C2Rust consistently exhibits a high number of Clippy issues, and Crown shows little improvement over C2Rust, indicating that both struggle to generate idiomatic Rust. C2SaferRust reduces Clippy issues, but still retains a significant number of warnings and errors. Notably, even the unidiomatic output of SACTOR surpasses all three, underscoring the advantage of LLMs over rule-based methods. While Vert improves idiomaticity, SACTOR's idiomatic phase yields fewer Clippy issues, outperforming some existing LLM-based approaches.
Table LABEL:tab:unsafe_stats summarizes unsafe code statistics. Unsafe-Free indicates the percentage of programs without unsafe code, while Avg. Unsafe represents the average proportion of unsafe code across all translations. C2Rust and Crown generate unsafe code in all programs, with a high average unsafe percentage. C2SaferRust can reduce unsafe code and generates unsafe-free programs in some cases (45.6% in TransCoder-IR), but cannot sufficiently reduce unsafe usage on the CodeNet dataset. Vert has a higher success rate than SACTOR but occasionally introduces unsafe code. SACTOR's unidiomatic phase retains C semantics, leading to a high unsafe percentage. However, its idiomatic phase eliminates all unsafe code, achieving a 100% Unsafe-Free rate.
### 6.3 Real-world Code-bases
To evaluate SACTOR's performance on two real-world codebases, we run the translation process up to three times per sample, with SACTOR making at most six attempts to translate each function, struct, and global variable in each run. For libogg, we also experiment with both GPT-4o and GPT-5 to compare their performance.
CRust-Bench.
Measured at the function level, the mean per-sample translation success rate is 85.15%. Aggregated across the 50 samples, SACTOR translates 788 of 966 functions (81.57% combined). 32 samples achieve 100% function-level translation, i.e., the entire C codebase for the sample is translated to unidiomatic Rust. For idiomatic translation, we evaluate only the 32 samples whose unidiomatic stage reached 100% function-level translation. On these samples, the mean per-sample function translation rate is 51.85%. Aggregated across them, SACTOR translates 249 of 580 functions (42.93% combined); 8 samples achieve 100% function-level idiomatic translation, i.e., the entire C codebase is translated to idiomatic Rust.
| Step | Samples | Avg. per-sample SR | Functions translated | Fully translated samples | Avg. lint / function |
| --- | --- | --- | --- | --- | --- |
| Unidi. | 50 | 85.15% | 788 / 966 (81.57%) | 32 / 50 (64.00%) | 2.96 |
| Idiom. | 32 $\dagger$ | 51.85% | 249 / 580 (42.93%) | 8 / 32 (25.00%) | 0.28 |
Table 1: CRust-Bench function-level translation results. Success rate (SR) is averaged per-sample; $\dagger$ idiomatic stage is evaluated only on samples whose unidiomatic pass fully translated all functions.
Table 1 summarizes stage-level outcomes.
Observations and failure modes. We organize failures into five main categories. (1) Interface/name drift: symbol casing or exact-name mismatches (e.g., CamelCase vs. snake_case). (2) Semantic mapping errors: mistakes in translating C constructs to idiomatic Rust (e.g., pointer-to-pointer vs. Vec, shape drift, lifetime or mutability issues). (3) C-specific features: incomplete handling of features like function pointers and C variadics. (4) Borrowing and resource-model violations: compile-time borrow-checker errors in idiomatic Rust bodies (e.g., overlapping borrows in updates). (5) Harness/runtime faults: faulty test-harness translation (e.g., buffer mis-sizing, out-of-bounds access). Other minor cases include unsupported intrinsics (SIMD) and global-state divergence (shadowed globals). Table LABEL:tab:crust_failures (in Appendix M.1) summarizes each sample's outcome and its primary cause.
Idiomaticity. Unidiomatic outputs exhibit many lint alerts and heavy reliance on unsafe: the mean Clippy alert sum is 50.14 per sample (2.96 per function); the mean unsafe fraction is 97.86% with an unsafe-free rate of 0%. Idiomatic outputs reverse this profile: the mean Clippy alert sum drops to only 2.27 per sample (0.28 per function); the mean unsafe fraction is 0% with a 100% unsafe-free rate.
Libogg.
| Step (model) | SR (%) | Avg. lint / function | Avg. attempts |
| --- | --- | --- | --- |
| Unid. (GPT-4o) | 100 | 1.45 | 1.52 |
| Idiom. (GPT-4o) | 53 | 0.28 | 2.00 |
| Unid. (GPT-5) | 100 | 1.45 | 1.04 |
| Idiom. (GPT-5) | 78 | 0.23 | 1.25 |
Table 2: Evaluation of SACTOR ’s function translation on libogg. “Unid.”/“Idiom.” denotes unidiomatic/idiomatic translation. “SR” is the success rate of translating functions. “Avg. lint”/“Avg. attempt” is the average lint alert count/average number of attempts, for functions that both LLM models succeed in translating.
The unidiomatic and idiomatic translations of all structs and global variables succeed with each LLM model. For functions, the results are summarized in Table 2. SACTOR succeeds in all functions' unidiomatic translations. For idiomatic translations, SACTOR's success rate with GPT-4o is 53%, taking 2.00 attempts on average to produce a correct translation. With GPT-5, performance is significantly better, with a success rate of 78% and an average of 1.25 attempts.
Observations and failure modes. The most significant causes of failed idiomatic translations are: (1) failing tests due to mistakes in translating pointer manipulation and heap memory management; (2) compile errors in translated functions, especially from violations of Rust's safety rules on lifetimes, borrowing, and mutability; (3) failure to generate compilable test harnesses for data types with pointers and arrays. GPT-5 performs significantly better than GPT-4o: it has only one failure caused by a compile error in the translated function, in contrast to six such failures with GPT-4o, which shows GPT-5's progress in understanding Rust's rules and fixing compile errors. More details can be found in Appendix M.2.
Idiomaticity. SACTOR's unidiomatic translations cause lint alerts largely due to the use of unsafe code, while idiomatic translations lead to very few lint alerts, i.e., fewer than 0.3 alerts per function on average (Table 2). With each model, the unidiomatic translations all contain unsafe code, whereas the idiomatic translations are entirely safe code. As a result, the idiomatic translations have an average unsafe fraction of 0% and an unsafe-free fraction of 100%; the unidiomatic translations are the opposite.
## 7 Conclusions
Translating C to Rust enhances memory safety but remains error-prone and often unidiomatic. While LLMs improve translation, they still lack correctness guarantees and struggle with semantic gaps. SACTOR addresses these through a two-stage pipeline: preserving ABI interface first, then refining to idiomatic Rust. Guided by static analysis and validated via FFI-based testing, SACTOR achieves high correctness and idiomaticity across multiple benchmarks, surpassing prior tools. Remaining challenges include stronger correctness assurance, richer C-feature coverage, and improved scalability and efficiency (see § 8). Example prompts appear in Appendix N.
## 8 Limitations
While SACTOR is effective in producing correct, idiomatic Rust, several limitations remain:
- Test coverage dependence. Our soft-equivalence checks rely on existing end-to-end tests; shallow or incomplete coverage can miss subtle semantic errors. Integrating fuzzing or test generation could raise coverage and catch corner cases.
- Model variance. Translation quality depends on the underlying LLM. Although GPT-4o and DeepSeek-R1 perform well in our study, other models show lower accuracy and stability.
- Unsupported C features. Complex macros, pervasive function pointers, global state, C variadics and inline assembly are only partially handled, limiting applicability to such codebases (see § 6.3).
- Static analysis precision. Current analysis may under-specify aliasing, ownership, and pointer shapes in challenging code, leading to adapter/spec errors. Stronger analyses could improve mapping and reduce retries.
- Harness generation stability. The rule-based generator with LLM fallback can still emit incomplete or brittle adapters on complex patterns (e.g., unusual pointer shapes or length expressions), causing otherwise-correct translations to fail verification. Hardening rules and reducing reliance on the fallback should improve robustness and reproducibility.
- Cost and latency. Multi-stage prompting, compilation, and test loops incur non-trivial token and time costs, which matter for large-scale migrations.
## Appendix A Differences Between C and Rust
### A.1 Code Snippets
Here is a code example to demonstrate the differences between C and Rust. The example shows a simple C program and its equivalent Rust program. The create_sequence function takes an integer n as input and returns an array with a sequence of integers. In C, the function needs to allocate memory for the array using malloc and will return the pointer to the allocated memory as an array. If the size is invalid, or the allocation fails, the function will return NULL. The caller of the function is responsible for freeing the memory using free when it is done with the array to prevent memory leaks.
C Code:
```c
#include <stdlib.h>

int* create_sequence(int n) {
    if (n <= 0) {
        return NULL;                    /* invalid size */
    }
    int* arr = malloc(n * sizeof(int));
    if (!arr) {
        return NULL;                    /* allocation failed */
    }
    for (int i = 0; i < n; i++) {
        arr[i] = i;                     /* fill with 0 .. n-1 */
    }
    return arr;
}

int main(void) {
    int* sequence = create_sequence(5);
    if (sequence != NULL) {
        /* ... use the sequence ... */
        free(sequence);                 /* caller must free the memory */
    }
    return 0;
}
```
Rust Code:
```rust
fn create_sequence(n: i32) -> Option<Vec<i32>> {
    if n <= 0 {
        return None; // invalid size
    }
    let mut arr = Vec::with_capacity(n as usize);
    for i in 0..n {
        arr.push(i); // fill with 0 .. n-1
    }
    Some(arr)
}

fn main() {
    match create_sequence(5) {
        Some(sequence) => {
            // Does not need to free the memory
            println!("{:?}", sequence);
        }
        None => {
            // Handle the invalid-size case
        }
    }
}
```
Figure 3: Example of a simple C program and its equivalent Rust program, both hand-written for illustration.
### A.2 Tabular Summary
Here, we present a non-exhaustive list of differences between C and Rust in Table 3, highlighting the key features that make translating code from C to Rust challenging. While the list is not comprehensive, it provides insights into the fundamental distinctions between the two languages, which can help developers understand the challenges of migrating C code to Rust.
| Feature | C | Rust |
| --- | --- | --- |
| Memory Management | Manual (through malloc/free) | Automatic (through ownership and borrowing) |
| Pointers | Raw pointers like *p | Safe references like &p/&mut p, Box and Rc |
| Lifetime Management | Manual freeing of memory | Lifetime annotations and borrow checker |
| Error Handling | Error codes and manual checks | Explicit handling with Result and Option types |
| Null Safety | Null pointers allowed (e.g., NULL) | No null pointers; uses Option for nullable values |
| Concurrency | No built-in protections for data races | Enforces safe concurrency with ownership rules |
| Type Conversion | Implicit conversions allowed and common | Strongly typed; no implicit conversions |
| Standard Library | C standard library with direct system calls | Rust standard library with utilities for strings, collections, and I/O |
| Language Features | Procedure-oriented with minimal abstractions | Modern features like pattern matching, generics, and traits |
Table 3: Key Differences Between C and Rust
## Appendix B Preprocessing and Task Division
### B.1 Preprocessing of C Files
To support real-world C projects, SACTOR parses the compile commands generated by the make tool, extracting relevant flags for preprocessing, parsing, compilation, linking, and third-party tools’ use.
C source files usually contain preprocessing directives, such as #include, #define, #ifdef, #endif, etc., which we need to resolve before parsing C files. For #include, we copy and expand non-system headers recursively while keeping #include of system headers intact: non-system headers contain project-specific definitions such as structs and enums that the LLM has not seen, while system headers' contents are known to the LLM and expanding them would unnecessarily introduce noise. For other directives, we pass the relevant C project compile flags to GCC's C preprocessor to resolve them.
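For example (hypothetical file contents), the include policy transforms an input as follows:

```c
/* Before preprocessing (hypothetical input): */
#include <stdio.h>     /* system header: kept as-is; its contents are
                          already known to the LLM */
#include "point.h"     /* project header: expanded in place */

/* After preprocessing: */
#include <stdio.h>
struct Point { int x; int y; };   /* body of point.h copied in, so the LLM
                                     sees the project-specific definitions */
```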
### B.2 Algorithm for Task Division
The task division algorithm is used to determine the order in which the items should be translated. The algorithm is shown in Algorithm 1.
Algorithm 1 Translation Task Order Determination
1: $L_{i}$ : List of items to be translated
2: $dep(a)$ : Function to get dependencies of item $a$
3: $L_{sorted}$ : List of groups resolving dependencies
4: $L_{sorted}\leftarrow\emptyset$ $\triangleright$ Empty list
5: while $|L_{sorted}|<|L_{i}|$ do
6: $L_{processed}\leftarrow\emptyset$
7: for $a\in L_{i}$ do
8: if $a\notin L_{sorted}$ and $dep(a)\subseteq L_{sorted}$ then
9: $L_{sorted}\leftarrow L_{sorted}+a$ $\triangleright$ Add to sorted list
10: $L_{processed}\leftarrow L_{processed}\cup\{a\}$
11: end if
12: end for
13: if $L_{processed}=\emptyset$ then
14: $L_{circular}\leftarrow DFS(L_{i},dep)$ $\triangleright$ Circular dependencies
15: $L_{sorted}\leftarrow L_{sorted}+L_{circular}$ $\triangleright$ Add a group to sorted list
16: end if
17: end while
18: return $L_{sorted}$
In the algorithm, $L_{i}$ is the list of items to be translated, and $dep(a)$ is a function that returns the dependencies of item $a$. The algorithm returns a list $L_{sorted}$ containing the items in the order in which they should be translated; $DFS(L_{i},dep)$ is a depth-first search that returns the items involved in a circular dependency. The algorithm begins by collecting all items (e.g., functions, structs) to be translated, along with their dependencies (on both functions and data types). Items with no unresolved dependencies are pushed into the translation order first; once an item is pushed, it is removed from the dependency lists of the remaining items. This process continues until all items are pushed into the list or circular dependencies are detected. When circular dependencies are detected, we resolve them through a depth-first search, grouping all items involved in a cycle together and handling them as a single unit.
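The following Rust sketch captures the behavior Algorithm 1 describes, under one simplifying assumption: when no progress can be made, all remaining unresolved items are grouped together as a single circular unit, rather than extracting each strongly connected component with a full DFS as the actual implementation does.

```rust
use std::collections::{HashMap, HashSet};

/// Order items so that each appears after its dependencies; items on a
/// dependency cycle are grouped and translated as a single unit.
fn task_order<'a>(deps: &HashMap<&'a str, Vec<&'a str>>) -> Vec<Vec<&'a str>> {
    let mut sorted: Vec<Vec<&'a str>> = Vec::new();
    let mut done: HashSet<&'a str> = HashSet::new();
    while done.len() < deps.len() {
        // Items whose dependencies are all resolved can be translated next.
        let ready: Vec<&str> = deps
            .iter()
            .filter(|(item, ds)| {
                !done.contains(*item) && ds.iter().all(|d| done.contains(d))
            })
            .map(|(item, _)| *item)
            .collect();
        if ready.is_empty() {
            // No progress: the remaining items form circular dependencies,
            // so group them and handle them as one unit.
            let cycle: Vec<&str> =
                deps.keys().filter(|i| !done.contains(*i)).copied().collect();
            done.extend(cycle.iter().copied());
            sorted.push(cycle);
        } else {
            for item in ready {
                done.insert(item);
                sorted.push(vec![item]);
            }
        }
    }
    sorted
}

fn main() {
    // `main` depends on `atoi`; `atoi` has no dependencies (cf. Appendix E).
    let deps = HashMap::from([("atoi", vec![]), ("main", vec!["atoi"])]);
    println!("{:?}", task_order(&deps)); // [["atoi"], ["main"]]
}
```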
## Appendix C Equivalence Testing Details in Prior Literature
### C.1 Symbolic Execution-Based Equivalence
Symbolic execution explores all potential execution paths of a program by using symbolic inputs to generate constraints [king1976symbolic, baldoni2018survey, coward1988symbolic]. While theoretically powerful, this method is impractical for verifying C-to-Rust equivalence due to differences in language features. For instance, Rust’s RAII (Resource Acquisition Is Initialization) pattern automatically inserts destructors for memory management, while C relies on explicit malloc and free calls. These differences cause mismatches in compiled code, making it difficult for symbolic execution engines to prove equivalence. Additionally, Rust’s compiler adds safety checks (e.g., array boundary checks), which further complicate equivalence verification.
### C.2 Fuzz Testing-Based Equivalence
Fuzz testing generates random or mutated inputs to test whether program outputs match expected results [zhu2022fuzzing, miller1990empirical, liang2018fuzzing]. While more practical than symbolic execution, fuzz testing faces challenges in constructing meaningful inputs for real-world programs. For example, testing a URL parsing function requires generating valid URLs with specific formats, which is non-trivial. For large C programs, this difficulty scales, making it infeasible to produce high-quality test cases for every translated Rust function.
## Appendix D An Example of the Test Harness
Here, we provide an example of the test harness used to verify the correctness of the translated code, shown in Figure 4; this harness verifies the idiomatic Rust code. In this example, the concat_str_idiomatic function is the idiomatic translation under test, while the concat_str_c function is the test-harness wrapper that can be linked back to the original C code: a string and an integer are passed as input, and an owned string is returned. Input strings are converted from C’s char* to Rust’s &str, and output strings are converted from Rust’s String back to C’s char*.
<details>
<summary>x7.png Details</summary>

### Visual Description
## Screenshot: Rust Code Implementation for String Concatenation
### Overview
The image shows a Rust code snippet implementing two functions for string concatenation: `concat_str_idiomatic` and `concat_str`. The code includes syntax highlighting with color-coded elements (blue for keywords, orange for types, green for identifiers) and comments explaining the logic.
### Components/Axes
- **Function Definitions**:
- `fn concat_str_idiomatic(orig: &str, num: i32) -> String`
- `fn concat_str(orig: *const c_char, num: c_int) -> *const c_char`
- **Syntax Highlighting**:
- Keywords (e.g., `fn`, `let`, `expect`) in blue
- Types (e.g., `String`, `CString`, `CStr`) in orange
- Identifiers (e.g., `orig`, `num`, `out_str`) in green
- Comments in gray
- **Error Handling**:
- `expect("Invalid UTF-8 string")` for UTF-8 validation
- `unwrap()` for optional value conversion
### Detailed Analysis
1. **`concat_str_idiomatic` Function**:
- Parameters: `orig` (string slice), `num` (32-bit integer)
- Returns: `String` (heap-allocated string)
- Logic:
- Uses `format!` macro to concatenate `orig` and `num` into a formatted string.
- Example: `format!("{}{}", orig, num)` creates a string like `"hello123"` if `orig="hello"` and `num=123`.
2. **`concat_str` Function**:
- Parameters: `orig` (pointer to constant C string), `num` (C integer)
- Returns: Pointer to constant C string (`*const c_char`)
- Logic:
- Converts `orig` to a Rust `&str` via `CStr::from_ptr(orig).to_str().expect("Invalid UTF-8 string")`.
- Calls `concat_str_idiomatic` with the converted string and `num` (cast to `i32`).
- Wraps the result in a new `CString` and returns its raw pointer via `unwrap()` and `into_raw()`.
3. **Ownership Transfer**:
- `out_str.into_raw()` transfers ownership of the `CString` to the caller, requiring manual deallocation.
### Key Observations
- **Idiomatic vs. Low-Level**: `concat_str_idiomatic` uses Rust's high-level string formatting, while `concat_str` bridges Rust and C via raw pointers.
- **Error Propagation**: The `expect` call enforces UTF-8 validity, panicking on invalid input.
- **Memory Management**: The `unwrap()` and `into_raw()` combination assumes valid input, risking panics or memory leaks if misused.
### Interpretation
- **Purpose**: The code demonstrates safe interoperability between Rust and C strings, with `concat_str_idiomatic` providing a safe abstraction and `concat_str` enabling low-level C compatibility.
- **Trade-offs**: The `concat_str` function sacrifices Rust's safety guarantees (e.g., raw pointers, `unwrap()`) for C compatibility, requiring careful error handling.
- **Design Insight**: The separation of concerns—`concat_str_idiomatic` for Rust-centric use and `concat_str` for FFI (Foreign Function Interface)—highlights Rust's emphasis on safety without sacrificing performance.
## Additional Notes
- **Language**: Rust (no other languages detected).
- **Color Significance**: Syntax highlighting aids readability but does not affect code semantics.
- **Missing Context**: The code lacks error recovery mechanisms (e.g., `?` operator) for the `expect` call, which could improve robustness.
</details>
Figure 4: Test harness used for verifying concat_str translation
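Since Figure 4 is rendered as an image, the sketch below reconstructs a harness consistent with the description above. Names follow the prose (`concat_str_idiomatic`, `concat_str_c`); the exact code in the figure may differ.

```rust
use std::ffi::{c_char, c_int, CStr, CString};

// Idiomatic translation under test: a string slice and an integer in,
// an owned String out.
fn concat_str_idiomatic(orig: &str, num: i32) -> String {
    format!("{}{}", orig, num)
}

// Test-harness wrapper with a C-compatible signature, so the end-to-end
// tests can exercise the idiomatic Rust through the original C interface.
#[no_mangle]
pub unsafe extern "C" fn concat_str_c(orig: *const c_char, num: c_int) -> *const c_char {
    // C char* -> Rust &str (assumes a valid, NUL-terminated UTF-8 input).
    let orig = CStr::from_ptr(orig).to_str().expect("Invalid UTF-8 string");
    let out = concat_str_idiomatic(orig, num as i32);
    // Rust String -> C char*; ownership is transferred to the caller.
    CString::new(out).unwrap().into_raw()
}

fn main() {
    let input = CString::new("hello").unwrap();
    let out = unsafe { concat_str_c(input.as_ptr(), 123) };
    let s = unsafe { CStr::from_ptr(out) }.to_str().unwrap();
    assert_eq!(s, "hello123");
    // Reclaim ownership so the allocation is freed (mirrors a C-side free).
    drop(unsafe { CString::from_raw(out as *mut c_char) });
}
```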
## Appendix E An Example of SACTOR Translation Process
To demonstrate the translation process of SACTOR, we present a straightforward example of translating a C function to Rust. The C program includes an atoi function that converts a string to an integer, and a main function that parses command-line arguments and calls the atoi function. The C code is shown in Figure 5(a).
<details>
<summary>x8.png Details</summary>

### Visual Description
## C Code Screenshot: Custom atoi Implementation
### Overview
The image shows a C programming language implementation of a custom `atoi` function (ASCII to integer converter) along with a `main` function demonstrating its usage. The code includes syntax highlighting with color-coded elements.
### Components/Axes
- **Function Definitions**:
- `int atoi(char *str)`: Custom string-to-integer conversion function
- `int main(int argc, char *argv[])`: Program entry point
- **Variables**:
- `int result = 0`: Accumulates numeric value
- `int sign = 1`: Tracks positive/negative sign
- `char *str`: Input string pointer
- **Control Structures**:
- `while` loops for whitespace skipping and digit processing
- `if` statements for sign handling
- **Syntax Highlighting Colors**:
- Blue: `#include`, `while`, `if`, `return`
- Purple: `int`, `atoi`, `main`, `printf`, `print`
- Red: Function return types (`int`)
- Orange: Numeric literals (`0`, `1`, `10`)
- Green: String literals (`"Usage: %s <number>\\n"`, `"Parsed integer: %d\\n"`)
- Gray: Comments (`//`)
### Detailed Analysis
1. **atoi Function Logic**:
- **Whitespace Handling**: Skips leading spaces, tabs, newlines, carriage returns, vertical tabs, and form feeds
- **Sign Detection**: Checks for `+` or `-` to set sign multiplier
- **Digit Conversion**: Processes digits 0-9 using ASCII arithmetic (`*str - '0'`)
- **Termination**: Stops at first non-digit character
- **Return**: `sign * result`
2. **Main Function**:
- **Argument Check**: Requires exactly 2 command-line arguments (program name + number string)
- **Usage Message**: Prints `"Usage: %s <number>\\n"` if arguments are invalid
- **Conversion & Output**: Calls `atoi(argv[1])` and prints result with `"Parsed integer: %d\\n"`
### Key Observations
- **Color Consistency**: Syntax elements maintain consistent coloring throughout (e.g., all `while` keywords in blue)
- **Edge Case Handling**: Explicitly skips multiple whitespace characters before processing digits
- **Error Prevention**: Returns 0 for empty strings after whitespace skipping
- **ASCII Arithmetic**: Uses `*str - '0'` for digit conversion without numeric constants
### Interpretation
This implementation demonstrates fundamental string processing techniques in C:
1. **Robust Parsing**: Handles various whitespace characters and optional signs
2. **Efficiency**: Processes input in a single pass with O(n) complexity
3. **Safety**: Returns 0 for invalid inputs rather than crashing
4. **Educational Value**: Shows manual string-to-integer conversion without library functions
The color coding enhances code readability by visually separating:
- Keywords (blue)
- Types (purple)
- Literals (orange/green)
- Comments (gray)
This multi-colored approach aids in quickly identifying code structure and logic flow.
</details>
(a) C implementation of atoi
<details>
<summary>x9.png Details</summary>

### Visual Description
## Rust Code Screenshot: Custom `atoi` Implementation
### Overview
The image shows a Rust code snippet implementing a custom `atoi` function to convert C-style strings to integers, along with a `main` function for command-line argument parsing. The code uses unsafe FFI patterns and manual memory management.
### Components/Axes
- **Function Definitions**:
- `pub unsafe fn atoi(str: *const c_char) -> i32`: Converts C strings to integers.
- `pub fn main()`: Handles command-line arguments and input validation.
- **Variables**:
- `result`, `sign`, `ptr`: Integer accumulator and pointer variables.
- `c_str`: C string created from command-line input.
- `value`: Parsed integer output.
- **Error Handling**:
- `Ok(cstring) => cstring`
- `Err(_) => "Failed to create CString from input"`
- **Key Constructs**:
- `while *ptr != '\0'`: Null-terminated string traversal.
- `ptr.add(1)`: Pointer arithmetic for character iteration.
- `i32::checked_add(digit)`: Overflow-safe digit accumulation.
### Detailed Analysis
1. **`atoi` Function Logic**:
- Initializes `result = 0` and `sign = 1`.
- Skips leading `+`/`-` characters to determine sign.
- Iterates through digits using pointer arithmetic (`ptr.add(1)`).
- Converts ASCII digits to integers via `digit - '0'`.
- Uses `i32::checked_add` to prevent overflow, returning `i32::MAX`/`MIN` on error.
- Returns `sign * result` after processing all digits.
2. **`main` Function Flow**:
- Collects command-line arguments into `Vec<String>`.
- Validates argument count (requires exactly 2 arguments).
- Prints usage message for invalid input: `"Usage: <number>"`.
- Converts second argument to `CString` with error handling.
- Parses `CString` using `atoi` and prints result.
### Key Observations
- **Unsafe Usage**: The `atoi` function is marked `unsafe` due to direct pointer manipulation and FFI interactions.
- **Manual Parsing**: Implements digit conversion without standard library helpers (e.g., `parse()`).
- **Error Propagation**: Uses Rust's `Result` type for error handling in `CString` creation.
- **Pointer Safety**: Employs `ptr.add(1)` for character iteration, bounded by the null-terminator check.
### Interpretation
This code demonstrates Rust's FFI capabilities while emphasizing safety through:
1. **Explicit Error Handling**: All potential failure points (e.g., invalid input, overflow) are explicitly checked.
2. **Memory Safety**: Uses `CString` for owned C-style strings and bounds-checked pointer arithmetic.
3. **Overflow Prevention**: Leverages `i32::checked_add` to avoid integer overflow vulnerabilities.
4. **CLI Integration**: Provides clear usage instructions and input validation for command-line tools.
The implementation highlights Rust's balance between low-level control and memory safety, particularly in systems programming contexts requiring C interoperability.
</details>
(b) Unidiomatic Rust translation from C
<details>
<summary>x10.png Details</summary>

### Visual Description
## Rust Code: Integer Parsing and Command-Line Argument Handling
### Overview
This code implements a command-line utility that parses an integer from user input, handling edge cases like overflow, invalid characters, and sign detection. It uses Rust's standard libraries for environment interaction and error handling.
### Components/Axes
- **Standard Libraries**:
- `std::env` for command-line argument access
- `std::process` for process exit functionality
- **Key Functions**:
- `atoi(input: &str) -> i32`: Custom string-to-integer conversion
- `main()`: Entry point handling argument validation and parsing
### Detailed Analysis
1. **Argument Validation**:
- Checks if exactly 2 arguments exist (`args.len() != 2`)
- Prints usage message `"Usage: <number>"` on invalid input count
- Exits with code 1 on argument count mismatch
2. **Integer Parsing Logic**:
- **Sign Detection**:
- Scans for leading `+` or `-` to set `sign` (1 or -1)
- Skips whitespace before sign detection
- **Digit Processing**:
- Converts characters to digits using `to_digit(10)`
- Uses `checked_add` for safe arithmetic to prevent overflow
- Returns `i32::MAX`/`i32::MIN` on overflow detection
- **Error Handling**:
- Returns `None` for non-digit characters
- Breaks loop on invalid input during parsing
3. **Output**:
- Prints parsed integer with `"Parsed integer: {value}"`
- Uses `println!` macro for formatted output
### Key Observations
- **Overflow Protection**: Explicit checks using `checked_add` and bounds (`i32::MAX`/`i32::MIN`)
- **Whitespace Handling**: Skips leading whitespace before sign detection
- **Command-Line Interface**: Strict argument count validation (requires exactly one input)
- **Rust Safety Features**: Leverages `checked_*` methods for memory-safe operations
### Interpretation
This code demonstrates Rust's emphasis on safety and explicit error handling. The custom `atoi` implementation:
1. **Prevents Buffer Overflow**: By validating input length and character validity
2. **Handles Edge Cases**:
- Leading/trailing whitespace
- Positive/negative signs
- 32-bit integer overflow
3. **Integrates with CLI**: Follows Unix-style argument parsing conventions
The code's structure reflects Rust's ownership model through:
- Immutable references (`&str`)
- Explicit error propagation via `Option` types
- Safe arithmetic operations
Notable design choices include:
- Returning `i32::MAX`/`i32::MIN` for overflow instead of panicking
- Using `checked_add` to avoid undefined behavior
- Clear separation of sign handling and digit accumulation
</details>
(c) Idiomatic Rust translation from unidiomatic Rust
Figure 5: SACTOR translation process for atoi program
We assume that there are numerous end-to-end tests for the C code, allowing SACTOR to use them for verifying the correctness of the translated Rust code.
First, the divider splits the C code into two parts, the atoi function and the main function, and determines the translation order: atoi first, then main, since atoi is a dependency of main and is itself a pure function with no further dependencies.
Next, SACTOR proceeds with the unidiomatic translation, converting both functions into unidiomatic Rust code. The generated code preserves the semantics of the original C code while using Rust syntax. Once the translation is complete, the unidiomatic verifier executes the end-to-end tests to check the correctness of the translated function. If all tests pass, SACTOR considers the unidiomatic translation accurate and progresses to the next function. If any test fails, SACTOR retries the translation using the feedback collected from the verifier, as described in § 4.3. After translating all sections of the C code, SACTOR combines the unidiomatic Rust segments to form the final unidiomatic Rust code, shown in Figure 5(b).
Then, SACTOR starts the idiomatic translation process, translating the unidiomatic Rust code into idiomatic Rust. The idiomatic translator requests the LLM to adapt the C semantics into idiomatic Rust, eliminating unsafe and non-idiomatic constructs, as detailed in § 4.2. Following the same order, SACTOR translates the two functions and uses the idiomatic verifier to check each translation, providing feedback to the LLM if verification fails. After all parts of the Rust code are translated into idiomatic Rust, verified, and combined, SACTOR produces the final idiomatic Rust code, shown in Figure 5(c), which represents the final output of SACTOR.
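For concreteness, the following is a minimal sketch consistent with the description of the idiomatic output in Figure 5(c): sign handling, digit accumulation with checked arithmetic, and saturation to `i32::MAX`/`i32::MIN` on overflow. The exact code SACTOR generates may differ.

```rust
use std::env;
use std::process;

/// Idiomatic atoi: parses an optional sign followed by digits, stopping at
/// the first non-digit and saturating on 32-bit overflow.
fn atoi(input: &str) -> i32 {
    let trimmed = input.trim_start();
    let (sign, digits) = match trimmed.strip_prefix('-') {
        Some(rest) => (-1, rest),
        None => (1, trimmed.strip_prefix('+').unwrap_or(trimmed)),
    };
    let mut result: i32 = 0;
    for c in digits.chars() {
        let Some(d) = c.to_digit(10) else { break }; // stop at non-digit
        result = match result
            .checked_mul(10)
            .and_then(|r| r.checked_add(d as i32))
        {
            Some(r) => r,
            // Saturate instead of panicking on overflow.
            None => return if sign == 1 { i32::MAX } else { i32::MIN },
        };
    }
    sign * result
}

fn main() {
    let args: Vec<String> = env::args().collect();
    if args.len() != 2 {
        eprintln!("Usage: {} <number>", args[0]);
        process::exit(1);
    }
    println!("Parsed integer: {}", atoi(&args[1]));
}
```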
## Appendix F Dataset Details
| Dataset | Samples | Preprocessing | End-to-End Tests | Coverage |
| --- | --- | --- | --- | --- |
| TransCoder-IR [transcoderir] | 100 | Removed buggy programs (compilation/memory errors) and entries with existing Rust | Present | 97.97% / 99.5% |
| Project CodeNet [codenet] | 100 | Filtered for external-input programs (argc/argv); auto-generated tests | Generated | 94.37% / 100% |
| CRust-Bench [khatry2025crust] | 50 | Excluded unsupported patterns; combined each sample's code into a single lib.c | Present | 76.18% / 80.98% |
| libogg [libogg] | 1 | None. Each component of the library is contained within a single C file. | Present | 83.3% / 75.3% |
Table 4: Summary of datasets and real-world code-bases used for evaluation; coverage audited with gcov on the tests exercised in our pipeline.
### F.1 TransCoder-IR Dataset [transcoderir]
The TransCoder-IR dataset is used to evaluate the TransCoder-IR model and consists of solutions to coding challenges in various programming languages. For evaluation, we focus on the 698 C programs available in this dataset. First, we filter out programs that already have corresponding Rust code. Several C programs in the dataset contain bugs; we remove these by checking whether they compile. We then use valgrind to identify and discard programs that exhibit memory errors during the end-to-end tests. Finally, we select the 100 programs with the most lines of code for our experiments.
### F.2 Project CodeNet [codenet]
Project CodeNet is a large-scale dataset for code understanding and translation, containing 14 million code samples in over 50 programming languages collected from online judge websites. From this dataset, which includes more than 750,000 C programs, we target only those that accept external input. Specifically, we filter programs using argc and argv, which process input from the command line. As the end-to-end tests are not available for this dataset, we develop the SACTOR test generator to automatically generate end-to-end tests for these programs based on the source code. For evaluation, we select 200 programs and refine the dataset to include 100 programs that successfully generate end-to-end tests.
### F.3 CRust-Bench [khatry2025crust]
CRust-Bench is a repository-level benchmark for C-to-safe-Rust transpilation. It collects 100 real-world C repositories (the CBench suite) and pairs each with a manually written, safe Rust interface and a set of tests that assert functional correctness. By evaluating full repositories rather than isolated functions, CRust-Bench surfaces challenges common in practice, such as complex, pointer-rich APIs. In our evaluation, we use a 50-sample subset of CRust-Bench, which excludes entries that are out of scope for our pipeline (e.g., circular type or function dependencies and compiler-specific intrinsics that do not map cleanly). For each selected sample, we reuse the upstream end-to-end tests and relink them so that calls exercise our translated code; build environments and link flags follow each sample’s configuration.
### F.4 libogg [libogg]
libogg is the reference implementation of the Ogg multimedia container. Ogg is a stream-oriented format that frames, timestamps, and multiplexes compressed media bitstreams (e.g., audio/video) into a robust, seekable stream. The libogg distribution contains only the Ogg container library (codecs such as Vorbis or Theora are hosted separately). In our case study, the codebase comprises roughly 2,041 lines of code (excluding tests), six struct definitions, three global variables, and 77 exported functions. We use the project’s upstream tests and build scripts. This single-project evaluation complements the CRust-Bench subset by focusing on non-trivial structs, buffers, and pointer manipulation in a real-world C library.
## Appendix G LLM Configurations
Table 5 shows our configurations for the different LLMs used in evaluation. All other hyperparameters (e.g., Top-P, Top-K) use provider defaults. As GPT-5 does not support setting the temperature, we use its default value.
| Model | Version | Temperature |
| --- | --- | --- |
| GPT-4o | gpt-4o-2024-08-06 | 0 |
| Claude 3.5 Sonnet | claude-3-5-sonnet-20241022 | 0 |
| Gemini 2.0 Flash | gemini-2.0-flash-exp | 0 |
| Llama 3.3 Instruct 70B | Llama 3.3 Instruct 70B (https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) | 0 |
| DeepSeek-R1 | DeepSeek-R1 671B (https://huggingface.co/deepseek-ai/DeepSeek-R1) | 0 |
| GPT-5 | gpt-5-2025-08-07 | default |
Table 5: Configurations of Different LLMs in Evaluation
## Appendix H Failure Analysis in Evaluating SACTOR
(a) TransCoder-IR
| ID | Failure Reason |
| --- | --- |
| R1 | Memory safety violations in array operations due to improper bounds checking |
| R2 | Mismatched data type translations |
| R3 | Incorrect array sizing and memory layout translations |
| R4 | Incorrect string representation conversion between C and Rust |
| R5 | Failure to handle C’s undefined behavior with Rust’s safety mechanisms |
| R6 | Use of C-specific functions in Rust without proper Rust wrappers |
(b) Project CodeNet
| ID | Failure Reason |
| --- | --- |
| S1 | Improper translation of command-line argument handling or attempts to fix wrong handling |
| S2 | Function naming mismatches between C and Rust |
| S3 | Format string directive mistranslation causing output inconsistencies |
| S4 | Original code contains random number generation |
| S5 | SACTOR unable to translate mutable global state variables |
| S6 | Mismatched data type translations |
| S7 | Incorrect control flow or loop boundary condition translations |
Table 6: Failure reason categories for translating TransCoder-IR and Project CodeNet datasets.
<details>
<summary>x12.png Details</summary>

### Visual Description
## Bar Chart: Number of Files Across Categories
### Overview
The chart displays a grouped bar visualization comparing the number of files across six categories (R1–R6). Each category contains five bars in distinct colors (blue, orange, green, red, purple), with a vertical legend on the right mapping colors to categories. The y-axis ranges from 0 to 25, labeled "Number of Files."
### Components/Axes
- **X-axis**: Labeled "Categories," with discrete labels R1–R6.
- **Y-axis**: Labeled "Number of Files," scaled from 0 to 25 in increments of 5.
- **Legend**: Vertical legend on the right, associating colors with categories:
- Blue
- Orange
- Green
- Red
- Purple
- **Bars**: Each category (R1–R6) has five bars, one for each color in the legend.
### Detailed Analysis
- **R1**:
- Blue: ~4
- Orange: ~3
- Green: ~2
- Red: ~2
- Purple: ~1
- **R2**:
- Blue: ~4
- Orange: ~5
- Green: ~4
- Red: ~3
- Purple: ~2
- **R3**:
- Blue: ~5
- Orange: ~8
- Green: ~7
- Red: ~4
- Purple: ~2
- **R4**:
- Blue: ~1
- Orange: ~25 (outlier)
- Green: ~3
- Red: ~6
- Purple: ~3
- **R5**:
- Blue: ~3
- Orange: ~4
- Green: ~3
- Red: ~4
- Purple: ~0
- **R6**:
- Blue: ~0
- Orange: ~0
- Green: ~5
- Red: ~3
- Purple: ~0
### Key Observations
1. **R4 Outlier**: The orange bar in R4 (~25) is significantly higher than all other bars, suggesting an anomaly or exceptional case.
2. **Purple Consistency**: Purple bars are consistently the lowest across all categories, often near zero.
3. **Green Peaks**: Green bars peak in R3 (~7) and R6 (~5), indicating higher values in these categories.
4. **Blue Trends**: Blue bars are highest in R3 (~5) and R1/R2 (~4), with a sharp drop in R4 (~1).
### Interpretation
The chart highlights variability in file counts across categories and data series. The R4 outlier (orange bar) warrants investigation, as it deviates drastically from other values. The purple series appears negligible, possibly indicating a minor or inactive metric. Green and blue series show moderate consistency, with R3 and R6 having notable green values. The data suggests potential imbalances or errors in R4, while other categories exhibit more uniform distributions.
</details>
(a) TransCoder-IR
<details>
<summary>x13.png Details</summary>

### Visual Description
## Bar Chart: Number of Files Across Categories
### Overview
The chart displays a grouped bar visualization comparing the number of files across seven categories (S1–S7). Each category is represented by five colored bars (blue, orange, green, red, purple), with values ranging from 0 to 7 on the y-axis. The legend on the right maps colors to categories, though no explicit labels are provided for the colors themselves.
### Components/Axes
- **X-axis**: Labeled "Categories," with discrete labels S1–S7.
- **Y-axis**: Labeled "Number of Files," scaled from 0 to 7 in increments of 1.
- **Legend**: Positioned on the right, associating colors with categories (e.g., blue = S1, orange = S2, etc.). No explicit textual labels for colors are visible.
- **Bars**: Grouped vertically for each category, with approximate heights corresponding to the y-axis values.
### Detailed Analysis
- **S1**:
- Blue: ~4
- Orange: ~5
- Green: ~2
- Red: ~4
- Purple: ~3
- **S2**:
- All bars: ~1
- **S3**:
- Blue: ~2
- Orange: ~4
- Green: ~4
- Red: ~5
- Purple: ~3
- **S4**:
- All bars: ~1
- **S5**:
- Blue: ~2
- Orange: ~2
- Green: ~1
- Red: ~2
- Purple: ~2
- **S6**:
- Green: ~4
- Red: ~7
- Purple: ~2
- **S7**:
- Green: ~4
- Purple: ~2
### Key Observations
1. **S6** has the tallest bar (red, ~7), indicating the highest number of files in this category.
2. **S7** shows equal heights for green (~4) and purple (~2) bars, suggesting a tie or deliberate grouping.
3. **S2** and **S4** have uniformly low values (~1) across all colors, indicating minimal file counts.
4. **S3** has the second-highest red bar (~5), followed by S1 (~4).
5. **Green** and **purple** bars dominate S7, while **orange** and **red** dominate S1 and S3.
### Interpretation
The data suggests significant variability in file counts across categories. S6 stands out as an outlier with the highest value (~7), potentially indicating a critical or high-priority category. The uniformity in S2 and S4 (~1) may reflect baseline or control groups. The equal heights of green and purple bars in S7 could imply balanced contributions or a design choice. The legend’s placement on the right ensures clarity, though explicit color-to-category labels would enhance interpretability. The chart emphasizes categorical differences rather than trends, as no temporal or ordinal relationships are implied.
</details>
(b) Project CodeNet
Figure 6: Failure reasons across different LLM models for both datasets.
Here, we analyze the failure cases of SACTOR in translating C code to Rust from the experiments in Section 6.1, as cases where SACTOR fails offer valuable insights into areas that require refinement. For each failure case in the two datasets, we conduct an analysis to determine the primary cause of translation failure. This process involves leveraging DeepSeek-R1 to identify potential reasons (prompts available in Appendix N.5), followed by manual verification to ensure correctness. We focus only on the translation from C to unidiomatic Rust because: (1) it is the most challenging step, and (2) it better reflects a model’s ability to bridge the syntactic and semantic differences between the two languages. Table 6 summarizes the categories of failure reasons, and Figures 6(a) and 6(b) illustrate failure reasons (FRs) across models.
(1) TransCoder-IR (Table 6(a), Figure 6(a)): Based on the analysis, we observe that different models exhibit varying failure reasons. Claude 3.5 shows a particularly high incidence of string representation conversion errors (R4), with 25 out of 45 total failures in the unidiomatic translation step. In contrast, GPT-4o has only 1 out of 17 failures in this category. Llama 3.3 demonstrates consistent challenges with both R3 (incorrect array sizing and memory layout translations) and R6 (using C-specific functions without proper Rust wrappers), with 10 files in each category. GPT-4o shows a more balanced distribution of errors, with its highest count in R3. All models except GPT-4o struggle with string handling (R4) to varying degrees, suggesting this is one of the most challenging aspects of the translation process. For R6 (use of C-specific functions in Rust), which primarily manifests as compilation failures, only Llama 3.3 and Gemini 2.0 fail to resolve the issue in some cases, while all other models successfully handle the compilation errors through feedback and avoid failures in this category. DeepSeek-R1 has the fewest overall errors across categories, with failures only in R1 (1 file), R3 (2 files), and R4 (3 files), while completely avoiding errors in R2, R5, and R6.
(2) Project CodeNet (Table 6(b), Figure 6(b)): Similar to the TransCoder-IR dataset, different models in Project CodeNet demonstrate varying failure reasons. Most notably, S6 (mismatched data type translations) presents a significant barrier for Llama 3.3 and Gemini 2.0 (7 files each), while GPT-4o and Claude 3.5 completely avoid this issue. Input argument handling (S1) and format string mistranslations (S3) emerge as common challenges across all models in CodeNet, suggesting fundamental difficulties in translating these language features regardless of model architecture. Only Llama 3.3 and DeepSeek-R1 encounter control flow translation failures (S7), with 2 files each. S4 (random number generation) and S5 (mutable global state variables) cannot be handled because the current SACTOR implementation does not support these features.
Compared to the results in TransCoder-IR, string representation conversion (R4 in TransCoder-IR, S3 in CodeNet) remains a consistent challenge across both datasets for all models, though the issue is significantly more severe in TransCoder-IR, particularly for Claude 3.5 (24 files). This also suggests that reasoning models like DeepSeek-R1 are better at handling complex code logic and string/array manipulation, as they exhibit fewer failures in these areas, demonstrating the potential of reasoning models to address complex translation tasks.
## Appendix I SACTOR Cost Analysis
| Model | Dataset | Total Token Count | Total LLM Queries |
| --- | --- | --- | --- |
| Claude 3.5 | TransCoder-IR | 4595.33 | 5.15 |
| Claude 3.5 | CodeNet | 3080.28 | 3.15 |
| Gemini 2.0 | TransCoder-IR | 3343.12 | 4.24 |
| Gemini 2.0 | CodeNet | 2209.38 | 2.39 |
| Llama 3.3 | TransCoder-IR | 4622.80 | 5.39 |
| Llama 3.3 | CodeNet | 4456.84 | 3.80 |
| GPT-4o | TransCoder-IR | 2651.21 | 4.24 |
| GPT-4o | CodeNet | 2565.36 | 2.95 |
| DeepSeek-R1 | TransCoder-IR | 17895.52 | 4.77 |
| DeepSeek-R1 | CodeNet | 13592.61 | 3.11 |
Table 7: Average cost (total token count and total LLM queries per successful idiomatic translation) of different LLMs across the two datasets.
Here, we conduct a cost analysis of SACTOR for the experiments in § 6.1 to evaluate the efficiency of different LLMs in generating idiomatic Rust code. We measure (1) Total LLM Queries, the number of LLM queries made during translation and verification for a single test case in each dataset, and (2) Total Token Count, the total number of tokens processed by the LLM for a single test case in each dataset. To ensure a fair comparison across models, we use the same tokenizer (tiktoken) and encoding (o200k_base).
In order to better understand costs, we only analyze programs that successfully generate idiomatic Rust code, excluding failed attempts (as they always reach the maximum retry limit and do not contribute meaningfully to the cost analysis). We evaluate the combined cost of both translation phases to assess overall efficiency. Table 7 compares the average cost of different LLMs across two datasets, measured in token usage and query count per successful idiomatic Rust translation as mentioned in § 5.2.
Results: Gemini 2.0 and GPT-4o are the most efficient models, requiring the fewest tokens and queries. GPT-4o maintains a low token cost (2651.21 on TransCoder-IR, 2565.36 on CodeNet) with 4.24 and 2.95 average queries, respectively. Gemini 2.0 is similarly efficient, especially on CodeNet, with the lowest token usage (2209.38) and only 2.39 queries on average. Claude 3.5, despite its strong performance on CodeNet, incurs higher costs on TransCoder-IR (4595.33 tokens, 5.15 queries), likely due to additional translation steps. Llama 3.3 is the least efficient of the non-reasoning models (GPT-4o, Claude 3.5, Gemini 2.0), consuming the most tokens (4622.80 and 4456.84 on TransCoder-IR and CodeNet, respectively) and requiring the most queries (5.39 and 3.80, respectively), indicating significant resource demands.
As a reasoning model, DeepSeek-R1 consumes significantly more tokens than the non-reasoning models (17,895.52 on TransCoder-IR and 13,592.61 on CodeNet, 5 to 7 times more than GPT-4o), despite a similar average query count (4.77 and 3.11, respectively) for generating idiomatic Rust code. This high token usage comes from the "reasoning process" the model performs before code generation.
## Appendix J Ablation Study on SACTOR Designs
This appendix reports additional ablations that evaluate key design choices in SACTOR. All experiments in this section use GPT-4o with the same configuration as Table 5.
### J.1 Feedback Mechanism
To evaluate the effectiveness of the feedback mechanism proposed in § 4.3, we conduct an ablation study by removing the mechanism and comparing the model’s performance with and without it. We consider two experimental groups: (1) with the feedback mechanism enabled, and (2) without the feedback mechanism. In the latter setting, if any part of the translation fails, the system simply restarts the translation attempt using the original prompt, without providing any feedback from the failure.
We use the same dataset and evaluation metrics described in § 5, and focus our evaluation on two models: GPT-4o and Llama 3.3 70B. We choose these models because GPT-4o achieved among the highest success rates and Llama 3.3 70B the lowest in our earlier experiments. By comparing the success rates between the two groups, we assess whether the feedback mechanism improves translation performance across models of different capabilities.
The results are shown in Figure 7.
<details>
<summary>x14.png Details</summary>

### Visual Description
## Legend for Figure 7
Color-coded legend mapping the bar series in Figure 7 to experimental conditions: blue-hatched entries denote unidiomatic success rates (SR) and orange-hatched entries idiomatic success rates, while the solid entries marked "-FBK" denote the corresponding runs without the feedback mechanism (see the Figure 7 caption).
</details>
<details>
<summary>x15.png Details</summary>

### Visual Description
## Bar Chart: TransCoder-IR Success With/Without Feedback
Grouped bar chart reporting the number of successful translations (out of 100 programs) for Llama 3.3 70B and GPT-4o on TransCoder-IR, with separate bars for the unidiomatic and idiomatic stages, each with and without the feedback mechanism (-FBK). Exact counts are given in the discussion below.
</details>
(a) TransCoder-IR With/Without Feedback
<details>
<summary>x16.png Details</summary>

### Visual Description
## Bar Chart: Project CodeNet Success With/Without Feedback
Grouped bar chart reporting the number of successful translations (out of 100 programs) for Llama 3.3 70B and GPT-4o on Project CodeNet, with separate bars for the unidiomatic and idiomatic stages, each with and without the feedback mechanism (-FBK). Exact counts are given in the discussion below.
</details>
(b) CodeNet With/Without Feedback
Figure 7: Ablation study on the feedback mechanism. The success rates of the models with and without the feedback (marked as -FBK) mechanism are shown for both TransCoder-IR and CodeNet datasets.
(1) TransCoder-IR (Figure 7(a)): Incorporating the feedback mechanism increased the number of successful translations for Llama 3.3 70B from 57 to 76 in the unidiomatic setting and from 46 to 64 in the idiomatic setting. In contrast, GPT-4o performed slightly worse with feedback, decreasing from 87 to 84 (unidiomatic) and from 83 to 80 (idiomatic).
(2) Project CodeNet (Figure 7(b)): A similar trend is observed where Llama 3.3 70B improved from 62 to 83 (unidiomatic) and from 59 to 76 (idiomatic), corresponding to gains of 21 and 17 percentage points, respectively. GPT-4o, however, showed only marginal improvements: from 82 to 84 in the unidiomatic setting and from 77 to 79 in the idiomatic setting.
These results suggest that the feedback mechanism is particularly effective for lower-capability models like Llama 3.3, substantially improving their translation success rates. In contrast, higher-capability models such as GPT-4o already perform near-optimally with simple resampling, leaving little room for improvement. The feedback mechanism is thus most beneficial for less capable models, which can leverage the feedback to improve their overall performance.
### J.2 Plain LLM Translation vs. SACTOR
We compare SACTOR against a trivial baseline where GPT-4o directly translates each CRust-Bench sample from C to Rust in a single step. We reuse the same end-to-end (E2E) test harness as SACTOR, and give the trivial baseline more budget: up to 10 repair attempts with compiler/test feedback (vs. 6 attempts in SACTOR). We study two prompts: (i) a minimal one (“translate the following C code to Rust”); and (ii) an interface-preserving one that explicitly asks the model to preserve pointer arithmetic, memory layout, and integer type semantics (thereby encouraging unsafe). We report function success as the fraction of functions whose Rust translation passes all tests, and sample success as the fraction of samples where all translated functions pass.
| Method | Max attempts | Function success | Sample success | Avg. Clippy alerts/function |
| --- | --- | --- | --- | --- |
| SACTOR unidiomatic | 6 | 788/966 (81.57%) | 32/50 (64.00%) | 2.96 |
| SACTOR idiomatic † | 6 | 249/580 (42.93%) | 8/32 (25.00%) | 0.28 |
| Trivial (1-step) | 10 | 77/966 (7.97%) | 12/50 (24.00%) | 1.60 |
| Trivial (1-step, encourage unsafe) | 10 | 207/966 (21.43%) | 20/50 (40.00%) | 1.90 |
Table 8: Plain LLM translation vs. SACTOR on CRust-Bench (GPT-4o). The trivial baselines directly translate each sample in one step with up to 10 repair attempts. $\dagger$ The idiomatic stage is evaluated only on samples whose unidiomatic stage fully translated all functions.
Results on CRust-Bench. Even with 10 attempts and an "encourage unsafe" prompt, the trivial baseline reaches only 21.43% function success and 40.00% sample success. Its sample-level performance exceeds SACTOR's idiomatic stage (40.00% vs. 25.00%) because preserving C-style pointer logic in unsafe Rust is substantially easier than performing an idiomatic rewrite. However, SACTOR achieves much higher function-level correctness and produces significantly more idiomatic code (e.g., 0.28 vs. 1.90 average Clippy alerts per function).
Results on libogg. Under the same E2E tests and attempt budget as SACTOR, both trivial prompts fail to produce any test-passing translations, whereas SACTOR achieves 100% unidiomatic and 53% idiomatic success with GPT-4o (Table 2). This indicates that plain one-shot translation collapses on pointer-heavy libraries, while SACTOR remains effective.
### J.3 Effect of Crown in the Idiomatic Stage
We ablate Crown’s contribution to idiomatic translation (§ 4.2) on libogg, using the same setup as § 6.3 and keeping all other components unchanged. Table 9 reports idiomatic function success with and without Crown.
| Configuration | Functions passing | Success rate | Relative drop |
| --- | --- | --- | --- |
| SACTOR | 41 | 53% | – |
| SACTOR w/o Crown | 34 | 44% | 17% |
Table 9: Ablating Crown on libogg (GPT-4o).
Results and representative failure patterns. Turning off Crown reduces idiomatic success from 41 to 34 functions. The failures follow clear patterns; two representative ones are shown below:
```rust
// Without Crown (shape lost):
pub struct OggPackBuffer { pub ptr: usize }
// With Crown (shape preserved):
pub struct OggPackBuffer { pub ptr: Vec<u8> }

// Without Crown (ownership misclassified as owned):
pub struct OggIovec { pub iov_base: Vec<u8> }
// With Crown (ownership made explicit):
pub struct OggIovec<'a> { pub iov_base: &'a [u8] }
```
Once a buffer pointer is collapsed into a scalar index, the harness cannot reconstruct a valid C-facing view of the struct, so pointer arithmetic and buffer access fail together. Similarly, if a non-owning pointer (e.g., unsigned char *iov_base) is misclassified as owned storage (Vec<u8>), Rust ends up “owning” memory that C actually controls, making safe round-tripping infeasible without inventing allocation/free rules that do not exist.
Interpretation. These failures do not indicate model weakness but an information-theoretic limitation: local C syntax does not encode pointer fatness or ownership. For a declaration such as char *iov_base, both Vec<u8> and &mut u8 are locally plausible. Even an idealized oracle model cannot uniquely infer the correct Rust type without global information about ownership and fatness. Crown supplies these semantics via whole-program static analysis; removing it makes idiomatic translation of pointer-heavy code underdetermined and explains the observed drop.
### J.4 Prompting about unsafe in Stage 1
We ablate the stage-1 (unidiomatic translation) prompt line that says “the model may use unsafe if needed.” All experiments in this subsection are conducted on libogg, using exactly the same setup as in § 6.3.
#### J.4.1 Removing “may use unsafe if needed”
We compare the original stage-1 prompt with a variant that deletes this line, keeping everything else unchanged.
| Stage-1 prompt | Unidiomatic success | Clippy warnings | `unsafe` functions | `not_unsafe_ptr_arg_deref` | `unsafe` fraction (LoC) |
| --- | --- | --- | --- | --- | --- |
| Baseline (may use unsafe) | 100% | 108 | 76 | 1 | 8704/8705 (99.99%) |
| Remove "may use unsafe" | 100% | 224 | 37 | 146 | 8100/8219 (98.55%) |
Table 10: Removing explicit permission to use unsafe in stage 1 on libogg (GPT-4o).
Two observations follow. (1) Overall unsafety hardly changes: the unsafe fraction drops only from 99.99% to 98.55%. (2) The safety profile becomes worse: clippy::not_unsafe_ptr_arg_deref jumps from 1 to 146. That is, the model keeps APIs safe-looking but dereferences raw pointer arguments inside function bodies, pushing unsafety from explicit unsafe fn signatures into hidden dereferences inside safe-looking public functions.
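To make the shift concrete, the hypothetical pair below contrasts the two safety profiles: the first function declares its contract in the signature, while the second looks safe but dereferences a raw pointer argument internally, which is exactly the pattern `clippy::not_unsafe_ptr_arg_deref` flags.

```rust
use std::os::raw::c_long;

// Explicit unsafety: the signature advertises the contract, and callers
// must opt in with an `unsafe` block.
pub unsafe fn peek_explicit(p: *const c_long) -> c_long {
    *p
}

// Hidden unsafety: the public function looks safe, yet it dereferences a
// raw pointer argument internally; Clippy reports this pattern as
// `clippy::not_unsafe_ptr_arg_deref`.
pub fn peek_hidden(p: *const c_long) -> c_long {
    unsafe { *p }
}

fn main() {
    let x: c_long = 7;
    let a = unsafe { peek_explicit(&x) }; // unsafety visible at the call site
    let b = peek_hidden(&x); // compiles as an ordinary safe call
    assert_eq!(a, b);
}
```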
#### J.4.2 Replacing With “AVOID using unsafe ”
We replace “may use unsafe if needed” with a stronger directive: “AVOID using unsafe whenever possible”.
| Stage-1 prompt | Functions passing | Success rate | Relative drop |
| --- | --- | --- | --- |
| Baseline stage 1 | 77/77 | 100% | – |
| Replace with "AVOID unsafe" | 66/77 | 85% | 15% |
Table 11: Discouraging unsafe in stage 1 harms unidiomatic success on libogg (GPT-4o).
Under "AVOID unsafe", the model often attempts premature "safe Rust" rewrites of pointer-heavy C code (changing buffer layouts, index arithmetic, and integer types), which increases logic and type errors and breaks translations. Together, these two prompt variants show that discouraging unsafe in stage 1 harms correctness and produces a worse safety profile, supporting our design choice: allow necessary unsafe in the syntactic first stage, then systematically remove it in the idiomatic refinement stage.
## Appendix K SACTOR Performance with Different Temperatures
In § 6, all experiments are conducted with the temperature set to the default values explained in Appendix G. To investigate how temperature affects the performance of SACTOR, we conduct additional experiments with different temperature settings (0.0, 0.5, 1.0) for GPT-4o on both the TransCoder-IR and Project CodeNet datasets, as shown in Figure 8. Through preliminary experiments and discussions on OpenAI's community forum (https://community.openai.com/t/cheat-sheet-mastering-temperature-and-top-p-in-chatgpt-api/172683), we find that setting the temperature above 1 tends to generate more random and less relevant outputs, which is unsuitable for our task.
<details>
<summary>x17.png Details</summary>

### Visual Description
## Legend: Unidiomatic and Idiomatic SR Categories
### Overview
The image displays a legend categorizing six distinct "SR" (success rate) series into two groups: **Unidiomatic SR** (left column) and **Idiomatic SR** (right column). Each group contains three subcategories (SR 1–3), differentiated by unique visual patterns and color coding.
### Components/Axes
- **Legend Structure**:
- **Left Column (Unidiomatic SR)**:
- **Unidiomatic SR 1**: Diagonal blue stripes.
- **Unidiomatic SR 2**: Blue dotted pattern.
- **Unidiomatic SR 3**: Blue crosshatch pattern.
- **Right Column (Idiomatic SR)**:
- **Idiomatic SR 1**: Orange diagonal stripes.
- **Idiomatic SR 2**: Orange dotted pattern.
- **Idiomatic SR 3**: Orange crosshatch pattern.
- **Visual Elements**:
- **Colors**: Blue for Unidiomatic SR, orange for Idiomatic SR.
- **Patterns**: Diagonal lines, dots, and crosshatch for differentiation.
- **Text**: Labels in black font, centered within each colored square.
- **Positioning**:
- Legend spans the full width of the image, centered.
- Left column (Unidiomatic) on the left, right column (Idiomatic) on the right.
### Detailed Analysis
- **Unidiomatic SR**:
- **SR 1**: Diagonal blue stripes (top-left to bottom-right).
- **SR 2**: Blue dots arranged in a grid.
- **SR 3**: Blue crosshatch (intersecting lines forming squares).
- **Idiomatic SR**:
- **SR 1**: Orange diagonal stripes (mirroring Unidiomatic SR 1).
- **SR 2**: Orange dots (same density as Unidiomatic SR 2).
- **SR 3**: Orange crosshatch (same density as Unidiomatic SR 3).
### Key Observations
1. **Pattern Consistency**: Each SR type (1–3) uses the same pattern across both Unidiomatic and Idiomatic groups, with only color differentiation.
2. **Color Coding**: Blue and orange are used exclusively for Unidiomatic and Idiomatic SR, respectively.
3. **Symmetry**: The legend’s layout is symmetrical, with identical patterns mirrored between the two columns.
### Interpretation
This legend serves as a visual key distinguishing **Unidiomatic** from **Idiomatic** success-rate (SR) series in Figure 8, with the three subcategories (SR 1–3) most plausibly corresponding to the three temperature settings. The consistent use of patterns (diagonal, dots, crosshatch) ensures clarity in distinguishing the subcategories within each group, while the color coding (blue vs. orange) reinforces the primary categorization; the mirrored patterns emphasize the structural equivalence between the two groups.
No numerical data, trends, or anomalies are present, as the image is purely a categorical legend. The absence of additional context (e.g., charts, graphs) limits interpretation to the structural and symbolic relationships described above.
</details>
<details>
<summary>x18.png Details</summary>

### Visual Description
## Bar Chart: Success Rate on TransCoder-IR by Temperature
Bar chart of SACTOR's success rate (%) with GPT-4o on the TransCoder-IR dataset at temperatures t=0, t=0.5, and t=1, with separate unidiomatic (blue) and idiomatic (orange) series. Exact figures appear in the discussion below.
</details>
(a) Success Rate on TransCoder-IR
(b) Success Rate on Project CodeNet
Figure 8: Success Rate of SACTOR with different temperature settings for GPT-4o on TransCoder-IR and Project CodeNet datasets.
(1) TransCoder-IR (Figure 8(a)): Setting the temperature to the deterministic value $t=0$ resulted in 83 successful translations (83%), while $t=0.5$ and $t=1.0$ each yielded 80 successes (80%). This represents a slight improvement of 3 additional correct translations under the deterministic setting.
(2) Project CodeNet (Figure 8(b)): Temperature has no significant impact: the model produced 79, 81, and 79 successful outputs at $t=0$, $t=0.5$, and $t=1.0$, respectively (79–81%), indicating no clear trend across the temperature settings.
The results on both datasets suggest that lowering the temperature to zero can offer a slight reliability boost in some cases, but temperature does not significantly affect the overall performance of SACTOR.
## Appendix L Spec-driven Harness Rules
Figure 9: Spec-driven harness generation and verification loop. The idiomatic translator co-produces idiomatic Rust and a machine-readable SPEC. A rule-based generator synthesizes a C-compatible harness from the SPEC; unsupported mappings trigger a localized LLM fallback. Harness and idiomatic code are linked via FFI for end-to-end tests.
Figure 9 illustrates the co-production timing and dataflow among artifacts (idiomatic code, SPEC, harness) and the verifier. Table 12 summarizes the SPEC patterns our rule-based generator currently supports.
| Pattern | SPEC trigger | Type mapping (U $\rightarrow$ I) | Notes |
| --- | --- | --- | --- |
| Scalars | shape: "scalar" | scalar $\rightarrow$ scalar | Common libc types are cast with as when needed; the default comparison is by value in the roundtrip self-test. |
| C string | ptr.kind: "cstring", ptr.null | *const/*mut c_char $\rightarrow$ String / &str / Option<String> | NULL handling via ptr.null or Option<...>; uses CStr / CString with lossless fallback. Return strings are converted back to *mut c_char. |
| Slices | ptr.kind: "slice", len_from \| len_const | *const/*mut T + length $\rightarrow$ Vec<T>, &[T], or Option<...> | Requires a length source; empty or NULL produces None or empty according to the spec; writes the length back on I $\rightarrow$ U when a paired length field exists. |
| Single-element ref | ptr.kind: "ref" | *const/*mut T $\rightarrow$ Box<T> / Option<Box<T>> | For a struct T, the generator calls the auto-generated struct converters CT_to_T_mut / T_to_CT_mut. |
| Derived length path | idiomatic path ending with .len | len field $\leftrightarrow$ vec.len | Recognizes the idiomatic data.len and reuses the same U-side length field on roundtrip. |
| Nullability | ptr.null: nullable \| forbidden | C pointers $\rightarrow$ field with/without Option | nullable maps to Option<...> or tolerant empty handling. |
| &mut struct params | ownership: transient | *mut CStruct $\rightarrow$ &mut Struct or Option<&mut Struct> | Copies mutated values back after the call using the generated struct converters. |
| Return mapping | field with i_field.name = "ret" | idiomatic return $\rightarrow$ U output(s) | Scalars: direct or via *mut T. Strings: to *mut c_char. Slices: pointer + length writeback. Structs: via struct converters. |
| Comparison hints | compare: by_value \| by_slice \| skip | self-test behavior | Optional per-field checks after the U $\rightarrow$ I1 $\rightarrow$ U $\rightarrow$ I2 roundtrip, comparing I1 and I2. |
| Unsupported paths | any SPEC key pair outside the supported patterns | fallback | The generator emits localized TODOs for LLM completion; schema validation rejects malformed SPECs. |
Table 12: SPEC-driven harness coverage. U denotes the unidiomatic C-facing representation; I denotes the idiomatic Rust side.
**Harness construction details.**
The generator consumes a per-item SPEC (JSON) produced alongside idiomatic code and synthesizes: (i) a C-compatible shim that matches the original ABI, and (ii) idiomatic adapters that convert to/from Rust types. Pointer shapes (scalar, cstring, slice, ref) determine how memory is borrowed or owned; length sources come from sibling fields or constants; nullability and ownership hints select Option<...> or strict checks. Return values are mapped back to U form, writing lengths when needed. This bridging resolves the ABI mismatch introduced by idiomatic function signatures.
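To make the bridging concrete, the following is a minimal sketch of the kind of shim the generator produces for a hypothetical function whose SPEC declares a nullable cstring argument and a slice argument with len_from: "n". The names (process, process_shim) and bodies are illustrative assumptions, not actual SACTOR output:
```rust
use std::ffi::CStr;
use std::os::raw::c_char;

// Hypothetical idiomatic function standing in for the translator's output.
fn process(name: Option<&str>, vals: &[i32]) -> i32 {
    name.map_or(0, |s| s.len() as i32) + vals.iter().sum::<i32>()
}

// Sketch of a C-compatible shim with the original ABI: the pointer shapes
// recorded in the SPEC decide how each argument is converted before the call.
#[no_mangle]
pub unsafe extern "C" fn process_shim(
    name: *const c_char, // SPEC: ptr.kind = "cstring", ptr.null = "nullable"
    vals: *const i32,    // SPEC: ptr.kind = "slice", len_from = "n"
    n: usize,
) -> i32 {
    // Nullable C string -> Option<&str> (lossy/owned fallback elided here).
    let name = if name.is_null() {
        None
    } else {
        unsafe { CStr::from_ptr(name) }.to_str().ok()
    };
    // Pointer + length -> &[i32]; NULL or zero length becomes an empty slice.
    let vals = if vals.is_null() || n == 0 {
        &[][..]
    } else {
        unsafe { std::slice::from_raw_parts(vals, n) }
    };
    // Return shape "scalar": mapped back directly.
    process(name, vals)
}
```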
**Struct mappings and self-check.**
For structs, the SPEC defines bidirectional converters between unidiomatic and idiomatic layouts. We validate adapter consistency with a minimal roundtrip: Unidiomatic $\rightarrow$ Idiomatic(1) $\rightarrow$ Unidiomatic $\rightarrow$ Idiomatic(2). The self-check compares Idiomatic(1) and Idiomatic(2) field-by-field according to compare hints: by_value requires exact equality on scalar fields; by_slice compares slice contents using the SPEC-recorded length source; skip omits fields that are aliasing views or externally owned to avoid false positives. Seed unidiomatic values are synthesized by an LLM guided by the SPEC so that nullability, ownership, and length sources are populated consistently.
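The sketch below illustrates this self-check on an assumed pair of layouts; CPoint/Point and their converters stand in for the SPEC-generated ones, with compare hints by_value on label and by_slice on xs:
```rust
#[repr(C)]
struct CPoint { xs: *const f64, n: usize, label: i32 } // unidiomatic (U) layout

struct Point { xs: Vec<f64>, label: i32 }              // idiomatic (I) layout

// Assumed stand-ins for the SPEC-generated struct converters.
unsafe fn cpoint_to_point(c: &CPoint) -> Point {
    let xs = if c.xs.is_null() {
        Vec::new()
    } else {
        unsafe { std::slice::from_raw_parts(c.xs, c.n) }.to_vec()
    };
    Point { xs, label: c.label }
}

fn point_to_cpoint(p: &Point) -> CPoint {
    CPoint { xs: p.xs.as_ptr(), n: p.xs.len(), label: p.label }
}

// Minimal roundtrip self-check: U -> I(1) -> U -> I(2), then a field-by-field
// comparison per the compare hints (label: by_value, xs: by_slice).
unsafe fn roundtrip_check(seed: &CPoint) -> bool {
    let i1 = unsafe { cpoint_to_point(seed) };
    let u = point_to_cpoint(&i1); // borrows i1's buffer for the roundtrip
    let i2 = unsafe { cpoint_to_point(&u) };
    i1.label == i2.label && i1.xs == i2.xs
}
```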
**Fallback and verification loop.**
When a SPEC uses patterns not yet implemented (e.g., pointer kinds outside cstring / slice / ref; non-trivial len_from expressions; string args whose spec.kind $\neq$ cstring), the generator emits a localized TODO that is completed by an LLM using the same SPEC as guidance; the resulting harness is then validated as usual. End-to-end tests run against the linked harness and idiomatic crate; passing tests provide confidence under their coverage, while failures trigger the paper’s feedback procedure for regeneration and refinement.
### SPEC rule reference
This section explains the rule families the SPEC uses to describe how unidiomatic, C-facing values become idiomatic Rust and back. The schema has two top-level forms: a struct description and a function description. Both are expressed as small collections of field mappings from the unidiomatic side to idiomatic paths; a function return is just another mapping whose idiomatic path is the special name ret. This uniform treatment keeps the generator simple and makes the SPEC readable by humans and machines alike.
Pointer handling is captured by a compact notion of shape. A field is either a scalar or one of three pointer shapes: a byte string that follows C conventions, a slice that pairs a pointer with a length, or a single-object reference. Slices record where their length comes from (either a sibling field or a constant). Each pointer also carries a null policy that distinguishes admissible NULL from forbidden NULL, which in turn selects idiomatic options versus strict checks in the generated adapters.
Two lightweight hints influence how the harness allocates and how the roundtrip self-check behaves. An ownership hint (owning vs transient) signals whether the idiomatic side should materialize owned data or borrow it for the duration of the call. A comparison hint (by value, by slice, or skip) declares how roundtrip checks should assert equality, so that aliasing views or externally owned buffers can be skipped without producing spurious failures.
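Viewed as types, the rule families described above can be sketched as follows; this Rust model is purely illustrative of the SPEC's vocabulary and is not SACTOR's actual schema code:
```rust
// Illustrative model of the SPEC rule families, mirroring the prose above.
enum Shape {
    Scalar,
    Ptr { kind: PtrKind, null: NullPolicy },
}

enum PtrKind {
    CString,                       // byte string following C conventions
    Slice { len_from: LenSource }, // pointer paired with a length
    Ref,                           // single-object reference
}

enum LenSource {
    Field(String), // length read from a sibling field
    Const(usize),  // fixed length
}

enum NullPolicy { Nullable, Forbidden } // Option<_> vs strict check

enum Ownership { Owning, Transient }    // materialize owned data vs borrow

enum Compare { ByValue, BySlice, Skip } // roundtrip self-check behavior
```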
Finally, the schema enforces well-formedness and defines a safe escape hatch. Invalid combinations are rejected early by validation. Patterns that are valid but not yet implemented by the generator, such as complex dotted paths or unusual pointer views, are localized and handed to the LLM fallback described earlier; the SPEC itself remains the single source of truth for the intended mapping.
## Appendix M Real-world Codebase Evaluation Details
### M.1 CRust-Bench Per-sample Outcomes
Table 13 lists, for each of the 50 samples, the function-level translation status and a concise failure analysis. Status is reported as per-sample function-level percentages in separate columns for the unidiomatic (Unid.) and idiomatic (Id.) stages.
### M.2 libogg Outcomes
(1) Using GPT-4o. 36 functions cannot be translated idiomatically. Nine of the failures are caused by translated functions not passing libogg's test cases. Six are due to compile errors in the translations, five of which result from the LLM violating Rust's rules on lifetimes, borrowing, and mutability. For example, the translation of _os_lacing_expand fails because it assigns a reference to the function's local variable vec to a function parameter, producing the error "`vec` does not live long enough". Two failures occur because SACTOR cannot generate compilable test harnesses. The remaining 13 failures cascade from untranslatable callees: if a function calls another function that SACTOR cannot translate, the caller cannot be translated either.
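For illustration, the lifetime violation described above reduces to the following pattern (an assumed minimal reproduction, not the actual _os_lacing_expand translation); the snippet intentionally fails to compile:
```rust
// Writing a reference to a local into an out-parameter that must outlive
// the call is rejected by the borrow checker.
fn expand(out: &mut &[i64]) {
    let vec = vec![0i64; 16];
    *out = &vec; // error[E0597]: `vec` does not live long enough
}
```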
(2) Using GPT-5. 17 functions cannot be translated idiomatically. Among them, three fail because the generated functions cannot pass the test cases, and three because SACTOR fails to generate compilable test harnesses. Only one is caused by a compile error in the translated function, which shows the progress GPT-5 has made in understanding Rust's grammar and fixing compile errors. The remaining failures result from the callee functions of those functions being untranslatable.
Table 13: CRust-Bench per-sample outcomes (function-level). Translation Status columns report per-sample function-level success rates for unidiomatic (Unid.) and idiomatic (Id.) stages.
| Sample | Unid. | Id. | Failure Analysis | Failure Category |
| --- | --- | --- | --- | --- |
| 2DPartInt | 100.0% | 100.0% | – | – |
| 42-Kocaeli-Printf | 75.0% | – | C variadics require unstable c_variadic; unresolved va_list import blocks build. | Unidiomatic compile (C varargs/unstable feature) |
| CircularBuffer | 100.0% | 54.6% | CamelCase-to-snake_case renaming breaks signature lookup; later run panics under no-unwind context. | Idiomatic compile (symbol/name mapping) |
| FastHamming | 100.0% | 60.0% | Output buffer sized to input length in harness; bounds-check panic at runtime. | Harness runtime (buffer/length) |
| Holdem-Odds | 100.0% | 6.9% | Off-by-one rank yields out-of-bounds bucket index; SIGSEGV under tests. | Runtime fault (boundary/indexing) |
| Linear-Algebra-C | 100.0% | 44.8% | Pointer vs reference semantics mismatch (nullable C pointers vs Rust references); harness compile errors. | Harness compile (pointer/ref semantics) |
| NandC | 100.0% | 100.0% | – | – |
| Phills_DHT | 75.0% | – | Shadowed global hash_table keeps dht_is_initialised() false; assertion in tests. | Runtime fault (global state divergence) |
| Simple-Sparsehash | 100.0% | 40.0% | CamelCase-to-snake_case renaming causes signature/type mismatches; harness does not compile. | Idiomatic compile (symbol/name mapping) |
| SimpleXML | 83.3% | – | Missing ParseState and CamelCase-to-snake_case renaming breaks signatures; unidiomatic stalls. | Idiomatic compile (symbol/name mapping) |
| aes128-SIMD | 85.7% | – | Array-shape mismatch (expects 4x4 refs; passes row pointer); plus intrinsics/typedef noise. | Unidiomatic compile (array shape; intrinsics/types) |
| amp | 80.0% | – | Returned C string from amp_decode_arg is not NULL-terminated; strcmp reads past allocation and trips invalid read under tests. | Runtime fault (C string NULL termination) |
| approxidate | 85.7% | – | match_alpha references anonymous enum C2RustUnnamed that is never defined, causing cascaded missing-type errors across retries. | Unidiomatic compile (types/aliases) |
| avalanche | 100.0% | 75.0% | Capturing closure passed where fn pointer required; FILE*/Rust File bridging mis-modeled; compile fails. | Harness runtime (I/O/resource model mismatch) |
| bhshell | 88.2% | – | Many parser errors (enum lacks PartialEq, missing consts, u64 to usize drift, duplicates). | Unidiomatic compile (types/aliases) |
| bitset | 100.0% | 50.0% | Treats bit count as byte count in converter; overreads and panics under tests. | Harness runtime (buffer/length) |
| bostree | 52.4% | – | Function-pointer typedefs and pointer-shape drift break callback bridging. | Unidiomatic compile (function-pointer types/deps) |
| btree-map | 100.0% | 26.3% | Trace/instrumentation proc macro requires Debug on opaque C type node; harness compilation fails for get_node_count. | Harness compile (instrumentation bound) |
| c-aces | 100.0% | 3.9% | Struct converter mismatch (Vec<CMatrix2D> vs Vec<Matrix2D>) in generated harness; compile fails after retries. | Harness compile (struct converter/shape) |
| c-string | 100.0% | 29.4% | Size vs capacity mismatch in StringT constructor; empty buffer returned, C asserts. | Runtime fault (size/capacity mismatch) |
| carrays | 100.0% | 68.5% | Trace macro imposes Debug on generic T and callback; harness fails to compile (e.g., gca_lsearch). | Harness compile (instrumentation bound) |
| cfsm | 50.0% | – | Missing typedefs for C function-pointer callbacks; harness lacks nullable extern signatures, compile fails. | Unidiomatic compile (function-pointer types/deps) |
| chtrie | 100.0% | 0.0% | Pointer-of-pointers vs Vec adapter mismatch for struct chtrie | Harness compile (struct converter/shape) |
| cissy | 100.0% | 19.1% | Anonymous C types that c2rust renamed cannot be fetched correctly as a dependency | Unidiomatic compile (types/aliases) |
| clog | 31.6% | – | Variadic logging APIs and duplicate globals; unresolved vfprintf / c_variadic; compile fails. | Unidiomatic compile (C varargs/unstable feature) |
| cset | 100.0% | 25.0% | Translator renames XXH_readLE64 to xxh_read_le64; SPEC/harness require exact C name; exhausts six attempts. | Idiomatic compile (symbol/name mapping) |
| csyncmers | 66.7% | – | Unsigned underflow in compute_closed_syncmers (i - S + 1 without guard) triggers overflow panic; prior __uint128_t typedef issues. | Runtime fault (arithmetic underflow) |
| dict | 17.7% | – | Fn-pointer fields modeled non-optional (need Option<extern "C" fn>); plus va_list requires nightly c_variadic; compile fails. | Unidiomatic compile (function-pointer types/deps) |
| emlang | 16.3% | – | Anonymous-union alias (C2RustUnnamed) misuse; duplicate program_new; assertion bridging (__assert_fail) mis-modeled. | Unidiomatic compile (types/aliases) |
| expr | 33.3% | – | Missing C2RustUnnamed alias; C varargs in trace_eval; strncmp len type mismatch. | Unidiomatic compile (types/aliases) |
| file2str | 100.0% | 100.0% | – | – |
| fs_c | 100.0% | 60.0% | Idiomatic I/O wrappers mismatch C expectations (closed fd/OwnedFd abort; Err(NotFound) leads to C-side segfault). | Harness runtime (I/O/resource model mismatch) |
| geofence | 100.0% | 100.0% | – | – |
| gfc | 100.0% | 54.6% | Converter overread + ownership misuse; later compile errors. | Harness runtime (converter/ownership) |
| gorilla-paper-encode | 100.0% | 9.1% | Missing adapters + lifetimes (Cbitwriter_s / Cbitreader_s vs BitWriter / BitReader<'a>). | Harness compile (lifetimes/struct adapters) |
| hydra | 100.0% | 50.0% | Borrow overlap in list update; name mapping for FindCommand. | Idiomatic compile (borrow/lifetime; symbol mapping) |
| inversion_list | 17.0% | – | C allows NULL comparator/function pointers; wrapper unwraps and panics. | Runtime fault (function-pointer nullability) |
| jccc | 88.7% | – | Missing C2RustUnnamed alias and duplicate Expression / Lexer types; compile fails. | Unidiomatic compile (types/aliases) |
| leftpad | 100.0% | 100.0% | – | – |
| lib2bit | 100.0% | 13.6% | Non-clonable std::fs::File in harness (C FILE* vs Rust File I/O handle mismatch) | Harness runtime (I/O/resource model mismatch) |
| libbase122 | 100.0% | 37.5% | Reader cursor/buffer not preserved across calls; writer shape mismatch; tests fail. | Harness runtime (converter/ownership) |
| libbeaufort | 100.0% | 66.7% | Returns reference to temporary tableau; matrix parameter shape drift (char** vs Vec<Option<String>>); compile fails. | Idiomatic compile (borrow/lifetime) |
| libwecan | 100.0% | 100.0% | – | – |
| morton | 100.0% | 100.0% | – | – |
| murmurhash_c | 100.0% | 100.0% | – | – |
| razz_simulation | 33.3% | – | Type-name drift; node shape; ptr/ref API mismatch. | Harness compile (type/name drift; API mismatch) |
| rhbloom | 100.0% | 33.3% | Pointer/ref misuse; bit-length as bytes; overreads/panics. | Harness runtime (pointer/ref; length units) |
| totp | 77.8% | – | Anonymous C types that c2rust renamed cannot be fetched correctly as a dependency; plus duplicate helpers (pack32 / unpack64 / hmac_sha1); compile fails. | Unidiomatic compile (types/aliases) |
| utf8 | 100.0% | 30.8% | NULL deref + unchecked indices; SIGSEGV in tests. | Runtime fault (NULL deref/out-of-bounds) |
| vec | 100.0% | 0.0% | Idiomatic rewrite uses a bounds-checked copy; out-of-range panic under tests. | Runtime fault (boundary/indexing) |
## Appendix N Examples of Prompts Used in SACTOR
The following prompts are used to guide the LLM in C-to-Rust translation and verification tasks. The prompts may vary slightly to accommodate different translation tasks, as SACTOR leverages static analysis to fetch the necessary information for the LLM.
### N.1 Unidiomatic Translation
Figure 10 shows the prompt for translating unidiomatic C code to Rust.
Translate the following C function to Rust. Try to keep the **equivalence** as much as possible.
`libc` will be included as the **only** dependency you can use. To keep the equivalence, you can use `unsafe` if you want.
The function is:
```c
{C_FUNCTION}
```
// Specific for main function
The function is the `main` function, which is the entry point of the program. The function signature should be: `pub fn main() -> ()`.
For `return 0;`, you can directly `return;` in Rust or ignore it if it's the last statement.
For other return values, you can use `std::process::exit()` to return the value.
For `argc` and `argv`, you can use `std::env::args()` to get the arguments.
The function uses some of the following stdio file descriptors: stdin. Which will be included as
```rust
extern "C" {
    static mut stdin: *mut libc::FILE;
}
```
You should **NOT** include them in your translation, as the system will automatically include them.
The function uses the following functions, which are already translated as (you should **NOT** include them in your translation, as the system will automatically include them):
```rust
{DEPENDENCIES}
```
Output the translated function into this format (wrap with the following tags):
---- FUNCTION ----
```rust
// Your translated function here
```
---- END FUNCTION ----
Figure 10: Unidiomatic Translation Prompt
### N.2 Unidiomatic Translation with Feedback
Figure 11 shows the prompt for translating unidiomatic C code to Rust with feedback from the previous incorrect translation and error message.
Translate the following C function to Rust. Try to keep the **equivalence** as much as possible.
`libc` will be included as the **only** dependency you can use. To keep the equivalence, you can use `unsafe` if you want.
The function is:
```c
{C_FUNCTION}
```
// Specific for main function
The function is the `main` function, which is the entry point of the program. The function signature should be: `pub fn main() -> ()`.
For `return 0;`, you can directly `return;` in Rust or ignore it if it's the last statement.
For other return values, you can use `std::process::exit()` to return the value.
For `argc` and `argv`, you can use `std::env::args()` to get the arguments.
The function uses some of the following stdio file descriptors: stdin. Which will be included as
```rust
extern "C" {
    static mut stdin: *mut libc::FILE;
}
```
You should **NOT** include them in your translation, as the system will automatically include them.
The function uses the following functions, which are already translated as (you should **NOT** include them in your translation, as the system will automatically include them):
```rust
fn atoi(str: *const c_char) -> c_int;
```
Output the translated function into this format (wrap with the following tags):
---- FUNCTION ----
```rust
// Your translated function here
```
---- END FUNCTION ----
Lastly, the function is translated as:
```rust
{COUNTER_EXAMPLE}
```
It failed to compile with the following error message:
```
{ERROR_MESSAGE}
```
Analyzing the error messages, think about the possible reasons, and try to avoid this error.
Figure 11: Unidiomatic Translation with Feedback Prompt
### N.3 Idiomatic Translation
Figure 12 shows the prompt for translating unidiomatic Rust code to idiomatic Rust. Crown output is provided to give the LLM hints about the ownership, mutability, and fatness of pointers.
Translate the following unidiomatic Rust function into idiomatic Rust. Try to remove all the `unsafe` blocks and only use the safe Rust code or use the `unsafe` blocks only when necessary.
Before translating, analyze the unsafe blocks one by one and how to convert them into safe Rust code.
**libc may not be provided in the idiomatic code, so try to avoid using libc functions and types, and avoid using `std::ffi` module.**
```rust
{RUST_FUNCTION}
```
"Crown" is a pointer analysis tool that can help to identify the ownership, mutability and fatness of pointers. Following are the possible annotations for pointers:
```
fatness:
- `Ptr`: Single pointer
- `Arr`: Pointer is an array
mutability:
- `Mut`: Mutable pointer
- `Imm`: Immutable pointer
ownership:
- `Owning`: Owns the pointer
- `Transient`: Not owns the pointer
```
The following is the output of Crown for this function:
```
{CROWN_RESULT}
```
Analyze the Crown output firstly, then translate the pointers in function arguments and return values with the help of the Crown output.
Try to avoid using pointers in the function arguments and return values if possible.
Output the translated function into this format (wrap with the following tags):
---- FUNCTION ----
```rust
// Your translated function here
```
---- END FUNCTION ----
Also output a minimal JSON spec that maps the unidiomatic Rust layout to the idiomatic Rust for the function arguments and return value.
Full JSON Schema for the SPEC (do not output the schema; output only an instance that conforms to it):
```json
{_schema_text}
```
---- SPEC ----
```json
{{
  "function_name": "{function.name}",
  "fields": [
    {{
      "u_field": {{
        "name": "...",
        "type": "...",
        "shape": "scalar" | {{"ptr": {{"kind": "slice|cstring|ref", "len_from": "?", "len_const": 1}}}}
      }},
      "i_field": {{
        "name": "...",
        "type": "..."
      }}
    }}
  ]
}}
```
---- END SPEC ----
Few-shot examples (each with unidiomatic Rust signature, idiomatic Rust signature, and the SPEC):
Example F1 (slice arg):
Unidiomatic Rust:
```rust
pub unsafe extern "C" fn sum(xs: *const i32, n: usize) -> i32;
```
Idiomatic Rust:
```rust
pub fn sum(xs: &[i32]) -> i32;
```
---- SPEC ----
```json
{{
  "function_name": "sum",
  "fields": [
    {{ "u_field": {{"name": "xs", "type": "*const i32", "shape": {{ "ptr": {{ "kind": "slice", "len_from": "n" }} }} }},
       "i_field": {{"name": "xs", "type": "&[i32]" }} }},
    {{ "u_field": {{"name": "n", "type": "usize", "shape": "scalar" }},
       "i_field": {{"name": "xs.len", "type": "usize" }} }}
  ]
}}
```
---- END SPEC ----
Example F2 (ref out):
Unidiomatic Rust:
```rust
pub unsafe extern "C" fn get_value(out_value: *mut i32);
```
Idiomatic Rust:
```rust
pub fn get_value() -> i32;
```
---- SPEC ----
```json
{{
  "function_name": "get_value",
  "fields": [
    {{ "u_field": {{"name": "out_value", "type": "*mut i32", "shape": {{ "ptr": {{ "kind": "ref" }} }} }},
       "i_field": {{"name": "ret", "type": "i32" }} }}
  ]
}}
```
---- END SPEC ----
Example F3 (nullable cstring maps to Option):
Unidiomatic Rust:
```rust
pub unsafe extern "C" fn set_name(name: *const libc::c_char);
```
Idiomatic Rust:
```rust
pub fn set_name(name: Option<&str>);
```
---- SPEC ----
```json
{{
  "function_name": "set_name",
  "fields": [
    {{ "u_field": {{"name": "name", "type": "*const c_char", "shape": {{ "ptr": {{ "kind": "cstring", "null": "nullable" }} }} }},
       "i_field": {{"name": "name", "type": "Option<&str>" }} }}
  ]
}}
```
---- END SPEC ----
Figure 12: Idiomatic Translation Prompt
### N.4 Idiomatic Verification
Idiomatic verification is the process of verifying the correctness of the translated idiomatic Rust code by generating a test harness. The prompt for idiomatic verification is shown in Figure 13.
We have an initial spec-driven harness with TODOs. Finish all TODOs and ensure it compiles.
Idiomatic signature:
```rust
pub fn compute_idiomatic(
    x: i32,
    name: &str,
    data: &[u8],
    meta: HashMap<String, String>,
) -> i32;
```
Unidiomatic signature:
```rust
pub unsafe extern "C" fn compute(x: i32, name: *const libc::c_char, data: *const u8, len: usize, meta: *const libc::c_char) -> i32;
```
Current harness:
```rust
pub unsafe extern "C" fn compute(x: i32, name: *const libc::c_char, data: *const u8, len: usize, meta: *const libc::c_char) -> i32
{
    // Arg 'name': borrowed C string at name
    let name_str = if !name.is_null() {
        unsafe { std::ffi::CStr::from_ptr(name) }.to_string_lossy().into_owned()
    } else {
        String::new()
    };
    // Arg 'data': slice from data with len len as usize
    let data_len = len as usize;
    let data_len_non_null = if data.is_null() { 0 } else { data_len };
    let data: &[u8] = if data_len_non_null == 0 {
        &[]
    } else {
        unsafe { std::slice::from_raw_parts(data as *const u8, data_len_non_null) }
    };
    // TODO: param meta of type HashMap<String, String>: unsupported mapping
    let __ret = compute_idiomatic(x, &name_str, data, /* TODO param meta */);
    return __ret;
}
```
Output only the final function in this format:
---- FUNCTION ----
```rust
// Your translated function here
```
---- END FUNCTION ----
Figure 13: Idiomatic Verification Prompt
### N.5 Failure Reason Analysis
Figure 14 shows the prompt for analyzing the reasons for the failure of the translation.
Given the following C code:
```c
{original_code}
```
The following code is generated by a tool that translates C code to Rust code. The tool has a bug that causes it to generate incorrect Rust code. The bug is related to the following error message:
```json
{json_data}
```
Please analyze the error message and provide a reason why the tool generated incorrect Rust code.
1. Append a new reason to the list of reasons.
2. Select a reason from the list of reasons that best describes the error message.
Please provide a reason why the tool generated incorrect Rust code **FUNDAMENTALLY**.
List of reasons:
{all_current_reasons}
Please provide the analysis output in the following format:
```json
{
  "action": "append", // or "select" to select a reason from the list of reasons
  "reason": "Format string differences between C and Rust", // the reason for the error message, if action is "append"
  "selection": 1 // the index of the reason from the list of reasons, if action is "select"
  // "reason" and "selection" are mutually exclusive, you should only provide one of them
}
```
Please **make sure** to provide a general reason that can be applied to multiple cases, not a specific reason that only applies to the current case.
Please provide a reason why the tool generated incorrect Rust code **FUNDAMENTALLY** (NOTE that the reason of first failure is always NOT the fundamental reason).
Figure 14: Failure Reason Analysis Prompt