## Imandra CodeLogician: Neuro-Symbolic Reasoning for Precise Analysis of Software Logic
Hongyu Lin 1, Samer Abdallah 1, Makar Valentinov 1, Paul Brennan 1, Elijah Kagan 1, Christoph M. Wintersteiger 1, Denis Ignatovich 1, and Grant Passmore ∗1,2
1 Imandra Inc.
2 Clare Hall, University of Cambridge
{hongyu,samer,makar,paul,elijah,christoph,denis,grant}@imandra.ai
February 10, 2026
## Abstract
Large Language Models (LLMs) have shown strong performance on code understanding and software engineering tasks, yet they fundamentally lack the ability to perform precise, exhaustive mathematical reasoning about program behavior. Existing benchmarks either focus on mathematical proof automation, largely disconnected from real-world software, or on engineering tasks that do not require semantic rigor.
We present CodeLogician 1 , a neuro-symbolic agent and framework for precise analysis of software logic, and its integration with ImandraX 2 , an industrial automated reasoning engine successfully deployed in financial markets, safety-critical systems, and government applications. Unlike prior approaches that use formal methods primarily to validate or filter LLM outputs, CodeLogician uses LLMs to help construct explicit formal models ('mental models') of software systems, enabling automated reasoning to answer rich semantic questions beyond binary verification outcomes. While the current implementation builds on ImandraX and benefits from its unique features designed for large-scale industrial formal verification and software understanding, the framework is explicitly designed to be reasoner-agnostic and can support additional formal reasoning backends.
To rigorously evaluate mathematical reasoning about software logic, we introduce a new benchmark dataset code-logic-bench 3 targeting the previously unaddressed middle ground between theorem proving and software engineering benchmarks. The benchmark measures correctness and efficacy of reasoning about program state spaces, control flow, coverage constraints, decision boundaries, and edge cases, with ground truth defined via formal modeling and automated decomposition.
Using this benchmark set, we compare LLM-only reasoning against LLMs augmented with CodeLogician. Across all evaluated models, formal augmentation with CodeLogician yields substantial and consistent improvements, closing a 41-47 percentage point gap in reasoning accuracy and achieving complete coverage, while LLM-only approaches remain approximate or incorrect. These results demonstrate that the neuro-symbolic integration of LLMs with formal reasoning engines is essential for scaling program analysis beyond heuristic reasoning toward rigorous, autonomous software understanding and formal verification.
∗ The author would like to thank the Isaac Newton Institute for Mathematical Sciences, Cambridge, for support and hospitality during the programme Big Proof: Formalizing Mathematics at Scale, where some of the work on this paper was undertaken. This work was supported by EPSRC grant EP/Z000580/1.
1 https://www.codelogician.dev
2 https://www.imandra.ai/core
3 https://github.com/imandra-ai/code-logic-bench
## Contents

- 1 Introduction
  - 1.1 Beyond Verification-as-Filtering
  - 1.2 Neuro-Symbolic Complementarity
  - 1.3 ImandraX and Reasoner-Agnostic Design
  - 1.4 The Missing Benchmark: Mathematical Reasoning About Software
  - 1.5 Contributions
- 2 CodeLogician overview
  - 2.1 Positioning and Scope
  - 2.2 Extensible Reasoner-Oriented Design
  - 2.3 Modes of Operation
  - 2.4 Governance and Feedback
  - 2.5 Relationship to the CodeLogician IML Agent
  - 2.6 Summary
- 3 Overview of ImandraX
  - 3.1 Formal Models in the Imandra Modeling Language
  - 3.2 Verification Goals
  - 3.3 Region Decomposition
  - 3.4 Focusing Decomposition: Side Conditions
  - 3.5 Test Generation from Regions
  - 3.6 Summary
- 4 Autoformalization of source code
  - 4.3 Abstraction Levels
  - 4.4 Dealing with Unknowns
    - 4.4.1 Opaque Functions
    - 4.4.2 Axioms and Assumptions
    - 4.4.3 Approximation Functions
  - 4.5 Summary
- 5 CodeLogician IML Agent
  - 5.1 From Source Code to Formal Models
  - 5.2 Verification Goals and Logical Properties
  - 5.3 Beyond Yes/No Verification
  - 5.4 Agent Architecture and State
  - 5.5 Model Refinement and Executability
  - 5.6 Assessing Model Correctness
  - 5.7 Region decomposition, test case generation, and refinement
  - 5.8 Interfaces
    - 5.8.1 Programmatic Access via Python RemoteGraph
    - 5.8.2 VS Code Extension
  - 5.9 Summary
- 6 CodeLogician Server
  - 6.1 Server Responsibilities
  - 6.2 Strategies: Project-Wide Formalization
  - 6.3 Metamodel-Based Formalization
  - 6.4 Core Definitions
  - 6.5 Formalization Levels
  - 6.6 Conditions Triggering Re-Formalization
  - 6.7 Autoformalization Workflow
  - 6.8 Minimal Re-Formalization Strategy
  - 6.9 Summary and Error Resilience
- 7 Performance
  - 7.1 Dataset
  - 7.2 Metrics
    - 7.2.1 State Space Estimation Accuracy
    - 7.2.2 Outcome Precision
    - 7.2.3 Direction Accuracy
    - 7.2.4 Coverage Completeness
    - 7.2.5 Control Flow Understanding
    - 7.2.6 Edge Case Detection
    - 7.2.7 Decision Boundary Clarity
  - 7.3 Evaluation
    - 7.3.1 Answer Generation
    - 7.3.2 Answer Evaluation (LLM-as-a-Judge)
  - 7.4 Results
    - 7.4.1 Overall Performance
    - 7.4.2 Model Comparison by Metric
- 8 Case Studies
  - 8.1 LSE GTT Order Expiry
    - 8.1.1 Verification
  - 8.2 London Stock Exchange Fee Schedule
    - 8.2.1 Verified Invariants
    - 8.2.2 Region Decomposition
  - 8.3 Multilateral Netting Engine
    - 8.3.1 Input Validation
    - 8.3.2 Floating Point Precision
- 9 Conclusion and Future Work
- A LSE GTT Order Expiry
  - A.1 Python model
  - A.2 Sequence diagram
- B Multilateral Netting Engine
  - B.1 Python model
  - B.2 Evaluation Example: Transportation Risk Manager
    - B.2.1 The Three Questions
    - B.2.2 Question 1: State Space Estimation
    - B.2.3 Question 2: Conditional Analysis
    - B.2.4 Question 3: Property Verification
    - B.2.5 Summary
## 1 Introduction
The rapid adoption of Large Language Models (LLMs) in software development has brought the notion of 'reasoning about code' into the mainstream. LLMs are now widely used for tasks such as code completion, refactoring, documentation, and test generation, and have demonstrated strong capabilities in understanding control flow and translating between programming languages. However, despite these advances, LLMs remain fundamentally limited in their ability to perform precise, exhaustive mathematical reasoning about program behavior.
In parallel, the field of automated reasoning, often referred to as symbolic AI or formal methods, has long focused on mathematically precise analysis of programs expressed as formal models. Decades of research and industrial deployment have produced powerful techniques for verifying correctness, exhaustively exploring program behaviors, and identifying subtle edge cases. Yet these techniques have historically required specialized expertise, limiting their accessibility and integration into everyday software development workflows.
This paper introduces CodeLogician, a neuro-symbolic framework that bridges these two paradigms. CodeLogician combines LLM-driven agents with ImandraX, an automated reasoning engine that has been successfully applied in financial markets, safety-critical systems, and government agencies. With CodeLogician, LLMs are used to translate source code into executable formal models and to interpret reasoning results, while automated reasoning engines perform exhaustive mathematical analysis of those models.
## 1.1 Beyond Verification-as-Filtering
Much of the existing work that combines LLMs with formal methods treats symbolic systems as external 'validators'. In these approaches, LLMs generate code, specifications, or proofs, and automated reasoning tools are applied post hoc to check consistency or correctness, typically yielding a binary yes/no outcome. While such verification-as-filtering is valuable for enforcing correctness constraints, it does not fundamentally expand the reasoning capabilities of LLMs.
In particular, verification-as-filtering does not help LLMs construct accurate internal models of program behavior, reason about large or infinite state spaces, or answer rich semantic questions involving constraints, decision boundaries, and edge cases. As a result, LLMs remain confined to approximate and heuristic reasoning, even when formal tools are present in the pipeline.
CodeLogician adopts a different perspective. Rather than using formal methods to police LLM outputs, we use LLMs to construct explicit formal mental models of software systems expressed in mathematical logic. Automated reasoning engines are then applied directly to these models to answer questions that go far beyond satisfiability or validity, including exhaustive behavioral coverage, precise boundary identification, and systematic test generation. In this setting, LLMs act as translators and orchestrators, while mathematical reasoning is delegated to tools designed explicitly for that purpose.
## 1.2 Neuro-Symbolic Complementarity
The strengths and weaknesses of LLMs closely mirror those of symbolic reasoning systems. LLMs excel at translation, abstraction, and working with incomplete information, but struggle with precise logical reasoning. Symbolic systems, by contrast, require strict formalization and domain discipline, but excel at exhaustive and exact reasoning once a model is available.
CodeLogician embraces this neuro-symbolic complementarity. LLMs bridge the gap between informal source code and formal logic, while automated reasoning engines provide mathematical guarantees about program behavior. This separation of concerns enables a scalable and principled approach to reasoning about real-world software systems.
## 1.3 ImandraX and Reasoner-Agnostic Design
At the core of CodeLogician lies ImandraX, an automated reasoning engine operating over the Imandra Modeling Language (IML), a pure functional language based on a subset of OCaml with extensions for specification and verification [14]. Beyond traditional theorem proving and formal verification, ImandraX supports advanced analysis techniques such as region decomposition, enabling systematic exploration of program state spaces and automated generation of high-coverage test suites.
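To make the idea of region decomposition concrete, here is an illustrative Python sketch (not ImandraX syntax or output): each region pairs a constraint on the inputs with the single outcome produced there, and together the regions partition the input space.

```python
# Illustrative sketch of region decomposition for a toy order validator.
# A reasoner like ImandraX derives such regions automatically and proves
# they are disjoint and exhaustive over ALL inputs; here we only spot-check.

def validate_order(qty: int, price: int) -> str:
    if qty <= 0:
        return "reject_qty"
    if price <= 0:
        return "reject_price"
    if qty * price > 1_000_000:
        return "reject_notional"
    return "accept"

# A region is a (constraint, outcome) pair.
regions = [
    (lambda q, p: q <= 0,                                  "reject_qty"),
    (lambda q, p: q > 0 and p <= 0,                        "reject_price"),
    (lambda q, p: q > 0 and p > 0 and q * p > 1_000_000,   "reject_notional"),
    (lambda q, p: q > 0 and p > 0 and q * p <= 1_000_000,  "accept"),
]

def covering_region(q: int, p: int) -> str:
    matches = [out for cond, out in regions if cond(q, p)]
    # Disjointness and exhaustiveness: exactly one region applies.
    assert len(matches) == 1
    return matches[0]

# Spot-check agreement between the code and its region decomposition.
for q, p in [(-1, 5), (3, -2), (2000, 1000), (10, 10)]:
    assert covering_region(q, p) == validate_order(q, p)
```

Each region also yields a representative test input, which is how decomposition supports systematic, high-coverage test generation.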
While CodeLogician currently integrates ImandraX as its primary reasoning backend, the framework is explicitly designed to be reasoner-agnostic . Additional formal reasoning tools can be incorporated without architectural changes, allowing CodeLogician to evolve alongside advances in automated reasoning.
## 1.4 The Missing Benchmark: Mathematical Reasoning About Software
Despite rapid progress in evaluating LLMs, existing benchmarks fall into two largely disconnected categories. Mathematical proof benchmarks evaluate abstract symbolic reasoning over mathematical domains, but do not address reasoning about real-world software behavior. Engineering benchmarks, such as those focused on debugging or patch generation, evaluate practical development skills but do not require precise or exhaustive semantic reasoning.
What is missing is a benchmark that evaluates an LLM's ability to apply mathematical reasoning to executable software logic. To address this gap, we introduce a new benchmark dataset code-logic-bench [9] designed to measure reasoning about program state spaces, constraints, decision boundaries, and edge cases. Ground truth is defined via formal modeling and automated reasoning, enabling objective evaluation of correctness, coverage, and boundary precision. The benchmark supports principled comparison between LLM-only reasoning and LLMs augmented with formal reasoning via CodeLogician.
## 1.5 Contributions
This paper makes the following contributions:
- We present CodeLogician, a neuro-symbolic framework in which LLM-driven agents construct formal models of software systems and reason about them using automated reasoning engines.
- We introduce a new benchmark dataset targeting mathematical reasoning about software logic, filling a critical gap between existing mathematical and engineering-oriented benchmarks.
- We provide a rigorous empirical evaluation demonstrating substantial improvements when LLMs are augmented with formal reasoning, compared to LLM-only approaches.
## 2 CodeLogician overview
CodeLogician is a neuro-symbolic, agentic governance framework for AI-assisted software development. It provides a uniform orchestration layer that exposes formal reasoning engines (called 'reasoners'), autoformalization agents, and analysis workflows through consistent interfaces. It supports multiple modes of use, including direct invocation of reasoners, agent-driven autoformalization, and LLM-assisted utilities for documentation and analysis. It is available through both CLI and server-based workflows, including an interactive TUI that can invoke agents over entire directories, making formal reasoning, verification, and test-case generation accessible across development pipelines.
Rather than competing with LLM-based coding tools, CodeLogician is designed to augment and guide them by supplying mathematically grounded feedback, proofs, counterexamples, and exhaustive behavioral analyses.
At its core, CodeLogician serves as a source of mathematical ground truth. Where LLM-based tools operate probabilistically, CodeLogician anchors reasoning about software behavior in formal logic, ensuring that conclusions are derived from explicit models and automated reasoning rather than statistical inference.
## 2.1 Positioning and Scope
CodeLogician targets reasoning tasks that arise above the level of individual lines of code, where system behavior emerges from the interaction of components, requirements, and environments. Typical application areas include architectural reasoning, multi-system integration, protocol and workflow validation, and functional correctness of critical business logic.
Figure 1: Imandra CodeLogician, a neuro-symbolic agent and framework for precise reasoning about software logic, with CLI, MCP, and VS Code interfaces. Available from www.codelogician.dev.
<details>
<summary>Image 1 Details</summary>

Screenshot of the CodeLogician TUI ("IMANDRA CodeLogician, version 1.0") analyzing the sample project `sample_math_lib`. The main window shows the project source tree (modules `analysis`, `core`, `stats`, plus `utils.py`); a Model View panel pairs the Python source of `analysis/numerical/derivatives.py` with its generated IML model and formalization status (`TRANSPARENT`); a server panel reports a local CodeLogician server at `http://127.0.0.1:8000` with summary statistics (15 models, 1 verification goal, 1 opaque function, 0 decompositions); and a chat panel walks through an MCP-driven formalization session invoking the tools `get_codelogician_overview`, `new_thread`, `init_state`, `agents_workflow`, `get_formalization_state`, and `get_iml_code`.
</details>
While CodeLogician can, in principle, reason about low-level implementation details, its primary value lies in governing system-level behavior. Domains such as memory safety, SQL injection detection, or hardware-adjacent concerns are already well served by specialized static analyzers and runtime tools. CodeLogician complements these point solutions by providing a framework for validating architectural intent, invariants, and cross-component interactions with mathematical completeness.
## 2.2 Extensible Reasoner-Oriented Design
CodeLogician is explicitly designed as a multi-reasoner framework. Its first release integrates the ImandraX automated reasoning engine, which supports executable formal models, proof-oriented verification, and exhaustive state-space analysis. However, the framework is intentionally reasoner-agnostic and is architected to host additional formal tools (e.g., temporal logic model checkers such as TLA+) under the same orchestration and governance layer.
This design allows CodeLogician to evolve alongside advances in formal methods without coupling its user-facing workflows to a single solver, logic, or verification paradigm.
## 2.3 Modes of Operation
CodeLogician exposes its capabilities through several complementary modes of operation, all backed by the same underlying orchestration and reasoning infrastructure.
Direct invocation. In direct mode, users or external tools invoke reasoning engines and agents explicitly. This mode supports fine-grained control over formalization, verification, and analysis steps and is particularly useful for expert users and debugging complex models.
Agent-mediated workflows. In agentic modes, CodeLogician coordinates LLM-driven agents that automatically formalize source code, refine models, generate verification goals, and invoke reasoning backends. These workflows enable end-to-end analysis with minimal human intervention while retaining the ability to request feedback when ambiguities or inconsistencies arise.
Bring-Your-Own-Agent (BYOA). CodeLogician can be used as a reasoning backend for externally hosted LLM agents. In this mode, an LLM interacts with CodeLogician through structured commands, receives formal feedback, and iteratively improves its own formalization behavior. This enables tight integration with existing LLM-based coding assistants while preserving a clear separation between statistical generation and logical reasoning.
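The BYOA loop can be sketched as follows. This is a purely hypothetical illustration: the function names, command shapes, and feedback fields are invented for exposition and are not CodeLogician's actual protocol.

```python
# Hypothetical BYOA interaction loop: an external LLM agent alternates
# statistical generation with formal checking. All names and payload shapes
# here are invented for illustration.

def byoa_loop(llm_formalize, reasoner_check, source: str, max_rounds: int = 3):
    model = llm_formalize(source, feedback=None)          # LLM proposes a formal model
    for _ in range(max_rounds):
        feedback = reasoner_check(model)                  # reasoner returns formal feedback
        if feedback["status"] == "ok":
            return model                                  # model admitted by the reasoner
        model = llm_formalize(source, feedback=feedback)  # LLM repairs using the feedback
    return None                                          # give up after max_rounds

# Toy stand-ins: the "LLM" fixes its model once it sees feedback.
def fake_llm(source, feedback):
    return "model_v2" if feedback else "model_v1"

def fake_check(model):
    if model == "model_v2":
        return {"status": "ok"}
    return {"status": "error", "detail": "unbound identifier"}

assert byoa_loop(fake_llm, fake_check, "def f(x): return x + 1") == "model_v2"
```

The key design point is the separation of roles: generation stays probabilistic and external, while acceptance is decided by logical reasoning.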
Server-based operation. For large-scale or continuous analysis, CodeLogician can be deployed in a server mode that monitors codebases or projects and incrementally updates formal models and analyses in response to changes. This supports use cases such as continuous verification, regression analysis, and long-running agentic workflows.
## 2.4 Governance and Feedback
A central contribution of CodeLogician is its role as a governance layer for AI-assisted development. Rather than merely accepting or rejecting LLM outputs, CodeLogician provides structured, semantically rich feedback grounded in formal reasoning. This includes:
- proofs of correctness when properties hold,
- counterexamples when properties fail,
- explicit characterization of decision boundaries and edge cases,
- quantitative coverage information derived from exhaustive analysis.
This feedback can be consumed by humans, external agents, or LLMs themselves, enabling iterative improvement and learning. In this sense, CodeLogician does not merely analyze code; it shapes how AI systems reason about software over time.
## 2.5 Relationship to the CodeLogician IML Agent
The CodeLogician IML Agent described in Section 5 is one concrete realization of the framework's capabilities. It implements an end-to-end autoformalization pipeline that translates source code into executable formal models and invokes ImandraX for analysis.
More generally, the framework supports multiple agents, reasoning engines, and workflows operating over shared representations and governed by the same orchestration logic. This separation between framework and agents allows CodeLogician to scale across domains, reasoning techniques, and modes of interaction.
## 2.6 Summary
CodeLogician is a unifying framework for integrating LLMs with formal reasoning engines in a principled, extensible manner. By providing orchestration, governance, and structured feedback around automated reasoning, it enables AI-assisted software development to move beyond heuristic reasoning toward mathematically grounded analysis. The framework establishes a firm foundation upon which multiple agents and reasoners can coexist, evolve, and collaborate in the service of rigorous software engineering.
## 3 Overview of ImandraX
ImandraX (cf. Figure 2) is an automated reasoning engine for analyzing executable mathematical models of software systems. Its input language is the Imandra Modeling Language (IML), a pure functional subset of OCaml augmented with directives for specification and reasoning (e.g., `verify`, `lemma`, and `instance`) [14]. IML models are ordinary programs, but they are also formal models: ImandraX contains a mechanized formal semantics for IML, and thus IML code is both executable code and rigorous mathematical logic to which ImandraX applies automated reasoning.
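This executable/formal duality can be approximated in Python for intuition (a sketch only; in IML, a `verify` directive invokes ImandraX's prover over all inputs, whereas plain code can only check a bounded domain exhaustively):

```python
# Sketch of the "executable model + verification goal" duality.
# In IML the model and the goal live in one language and the goal is
# proved for ALL inputs; here we mimic it with a bounded exhaustive check.

def clamp(x: int, lo: int, hi: int) -> int:
    # Executable model: an ordinary, pure, total function.
    return lo if x < lo else hi if x > hi else x

def goal(x: int, lo: int, hi: int) -> bool:
    # Verification goal: whenever lo <= hi, the result lies in [lo, hi].
    return not (lo <= hi) or (lo <= clamp(x, lo, hi) <= hi)

# Bounded exhaustive check over a small domain (a prover covers all ints).
assert all(goal(x, lo, hi)
           for x in range(-10, 11)
           for lo in range(-5, 6)
           for hi in range(-5, 6))
```

The same `clamp` definition serves as runnable code, as a test subject, and as the object of a mathematical claim, which is precisely what IML provides natively.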
Figure 2: The ImandraX automated reasoning engine and theorem prover. Interfaces are available for both humans (VS Code) and AI assistants (MCP and CLI), and may be installed from www.imandra.ai/core .
CodeLogician relies on ImandraX as its primary reasoning backend 4 . In the autoformalization workflow (cf. Section 4), LLM-driven agents translate source code into executable IML models, make abstraction choices explicit, and introduce assumptions where required. ImandraX then performs the mathematical analysis of these models, enabling both proof-oriented verification and richer forms of behavioral understanding.
This section summarizes the two ImandraX capabilities most heavily used by CodeLogician: verification goals and state-space region decomposition. As in Section 4, we emphasize that LLMs are used to construct explicit formal mental models, while ImandraX is used to reason about them.
## 3.1 Formal Models in the Imandra Modeling Language (IML)
IML inherits the modeling discipline discussed in Section 4: models are pure (no side effects), statically typed functional programs in ImandraX's fragment of OCaml, structured to support inductive reasoning modulo decision procedures. Imperative constructs such as loops are represented via recursion, and systems with mutable state are modeled as state machines in which the evolving state is carried explicitly through function inputs and outputs.
The question is therefore not whether a system can be modeled, but at what level of abstraction it should be modeled (Section 4). ImandraX provides analysis capabilities that remain meaningful across these abstraction levels, provided the model is executable and its assumptions are made explicit.
## 3.2 Verification Goals
Formal verification in ImandraX is performed via the statement and analysis of verification goals (VGs): boolean-valued IML functions that encode properties of the system being analyzed. VGs are typically implicitly universally quantified: when one proves a theorem in ImandraX, one establishes that a boolean-valued VG is true for all possible inputs. Unlike testing, which evaluates finitely many concrete executions, VGs quantify over (typically infinite) input spaces and yield either: (i) a proof that the property holds universally, or (ii) a counterexample witness demonstrating failure.
As programs and datatypes may involve recursion over structures of arbitrary depth, these proofs may involve induction. A unique feature of ImandraX is its seamless integration of bounded and unbounded verification: every goal may be subjected to both bounded model checking and full inductive theorem proving. Bounded proofs are automatic, based on ImandraX's recursive function unrolling modulo decision procedures, and typically take place in a setting for which they are complete for counterexamples. ImandraX contains deep automation for constructing inductive proofs, including a lifting of the Boyer-Moore inductive waterfall to ImandraX's typed, higher-order setting [14]. Moreover, ImandraX contains a rich tactic language for the interactive construction of proofs and counterexamples, and for more general system and state-space exploration.

4 https://www.imandra.ai/core

```
order1 = { peg = NEAR;  order_type = PEGGED;    qty = 1800; price = 10.5; time = 237; ... };
order2 = { peg = NEAR;  order_type = PEGGED_CI; qty = 8400; price = 12.0; time = 237; ... };
order3 = { order_type = PEGGED_CI; qty = 8400;  price = 10.0; time = 236; ... };
```

Figure 3: Counterexample for Order Ranking Transitivity
For example, consider the transitivity of an order ranking function for a trading venue [16]. This may naturally be conjectured as a lemma, which may then be subjected to verification:
```
lemma rank_transitivity side order1 order2 order3 mkt =
order_higher_ranked(side,order1,order2,mkt) &&
order_higher_ranked(side,order2,order3,mkt)
==>
order_higher_ranked(side,order1,order3,mkt)
```
In practice, verification is iterative: counterexamples often reveal missing preconditions, modeling gaps, or specification errors. For example, in Fig. 3 we see a truncated version of an ImandraX-synthesized counterexample for rank transitivity of a real-world trading venue [16].
In ImandraX, all counterexamples are 'first-class' objects which may be reflected into the runtime and directly computed with via ImandraX's eval directive. This unified computational environment for programs, conjectures, counterexamples and proofs is important for the analysis of real-world software: verification goals should immediately be amenable to automated analysis, counterexamples for false goals should be synthesized efficiently, and these counterexamples should be available for direct execution through the system under analysis, enabling rapid triage and understanding of goal violations and false conjectures. This use of counterexamples aligns with the autoformalization perspective: constructing a good formal model is a refinement process in which assumptions and intended semantics are progressively made explicit, and counterexamples to conjectures can rapidly help one fix faulty logic and recognize missing assumptions.
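For instance (a hedged sketch: the CX module name follows Imandra Core conventions and the exact surface syntax may differ in a given ImandraX installation), a false conjecture yields a counterexample that can immediately be executed:

```
let double (x : int) : int = 2 * x

(* This conjecture is false (e.g., for any x <= 0), so ImandraX
   synthesizes a counterexample and reflects it into the runtime: *)
verify (fun x -> double x > x)

(* The counterexample value is then available for direct computation: *)
eval (double CX.x)
```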
## 3.3 Region Decomposition
Many questions central to program understanding are not naturally expressed as binary VGs. To support richer semantic analysis, ImandraX provides region decomposition, which partitions a function's input space into finitely many symbolic regions corresponding to distinct behaviors.
Each region comprises:
- constraints over inputs that characterize when execution enters the region,
- an invariant result describing the output (or a symbolic characterization of it) for all inputs satisfying the constraints, and
- sample points witnessing satisfiable instances of the constraints.
Region decomposition provides a mathematically precise notion of 'edge cases' and directly supports the systematic test generation goals described in Section 4.
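Following the form of the decomposition query shown in Figure 5, a top-level decomposition can be requested directly (sketch; `match_price` is the venue pricing function of Figures 4 and 5):

```
(* Decompose the input space of match_price into symbolic regions,
   each with constraints, an invariant result, and sample points. *)
Modular_decomp.top "match_price"
```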
ImandraX decomposes functions into regions corresponding to the distinct path conditions induced by their control flow, each with explicit constraints and invariant outputs. These decompositions may be performed relative to logical side-conditions and configurable abstraction boundaries called basis functions to tailor state-space exploration for a particular goal. In real software systems, small boundary differences (e.g., a flipped inequality) often lead to qualitatively different outcomes; region decomposition isolates these differences and makes the boundaries of logically invariant behavior explicit and auditable.

Figure 4: A principal region decomposition of a trading venue pricing function
## 3.4 Focusing Decomposition: Side Conditions and Basis
Autoformalization often produces models whose full state spaces are too large, or simply too broad, for a given analysis objective. ImandraX provides two complementary mechanisms to focus decomposition, mirroring the abstraction discipline of Section 4.
Side conditions. A side condition is a boolean predicate with the same signature as the decomposed function. When supplied, ImandraX explores only regions for which the side condition holds. This is useful for restricting analysis to scenarios of operational or regulatory interest. Figure 5 illustrates a region decomposition query augmented with a side-condition.
Basis functions. A basis specifies functions to be treated as atomic during decomposition. ImandraX will not explore their internal branching structure, reducing the number of regions and improving interpretability. Basis selection provides a principled way to 'abstract away' auxiliary computations while retaining their influence on the surrounding model.
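Combining the two mechanisms, the side condition of Figure 5 can be paired with a basis. In the sketch below, `~assuming` appears in Figure 5, while the `~basis` parameter name and the `fill_qty` helper are illustrative assumptions:

```
(* Restrict decomposition to states where both best orders are Market orders. *)
let side_condition (ob : order_book) (ref_price : real) =
  match best_buy ob, best_sell ob with
  | Some bb, Some bs -> bb.order_type = Market && bs.order_type = Market
  | _ -> false

(* Decompose match_price under the side condition, treating the auxiliary
   fill_qty computation as atomic so its internal branching is not explored. *)
Modular_decomp.(top ~assuming:"side_condition" ~basis:["fill_qty"] "match_price")
```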
Figure 5: A refined decomposition of the venue pricing function with a side-condition
Together, side-conditions and basis functions provide a rich query language for targeting the analysis of system state spaces towards particular classes of behaviors [11, 12].
## 3.5 Test Generation from Regions
Each region produced by decomposition includes sample points satisfying its constraints. These witnesses can be extracted to generate tests that correspond directly to semantic distinctions in program behavior, rather than to ad hoc input distributions. In CodeLogician , this capability underpins automated test generation: agents use regions to produce tests that cover all behavioral cases discovered by decomposition, consistent with the 'exhaustive coverage' objective motivating autoformalization.
## 3.6 Summary
ImandraX provides the formal reasoning substrate for CodeLogician. Verification goals support proof-oriented analysis, while region decomposition exposes the structure of program state spaces, makes decision boundaries explicit, and enables systematic test-case generation. Together with the autoformalization process (Section 4), these capabilities allow LLM-driven agents to construct formal mental models of software and to answer rich semantic questions beyond binary verification outcomes.
## 4 Autoformalization of source code
Autoformalization is the process of translating executable source code into precise, mathematical models suitable for automated reasoning. While this translation is partially automated by CodeLogician, it relies on a particular way of thinking about software: one that emphasizes explicit state, clear abstraction boundaries, and well-defined assumptions. Just as developers strive to write clear code, effective automated reasoning depends on writing or synthesizing clear formal models.
## 4.1 Pure Functional Modeling
All formal models produced by CodeLogician are expressed in the Imandra Modeling Language (IML), a pure functional subset of OCaml. Purity here means the absence of side effects: functions cannot mutate global state or influence values outside their scope. Instead, all state changes must be made explicit via function arguments and return values.
This restriction is not a fundamental limitation. Programs with side effects can be systematically transformed into state-machine models in which the evolving program state is explicitly carried through computations. Such techniques are well established in the literature and trace back to foundational work by Boyer and Moore [4, 3]. For example, this (infinite) state-machine model approach underlies successful formal verification of highly complex industrial systems, including large-scale financial market infrastructure [16, 15], complex hardware models [17], robotic controllers integrating deep neural networks [6], and high-frequency communication protocols [10].
## 4.2 Key Modeling Constructs
Several structural features of IML are particularly important for automated reasoning:
Static Types IML is a statically typed functional language. All variable types are determined before execution and cannot change dynamically. This eliminates entire classes of ambiguity present in dynamically typed languages. During autoformalization, CodeLogician extracts type information from source programs-using static annotations where available and inference otherwise-and enforces type consistency in the generated models. Type errors detected by ImandraX often correspond directly to logical inconsistencies in the original program.
Recursion and Induction Loops in imperative source code are rewritten as recursive functions. Both for and while constructs are mapped to structurally recursive definitions. This transformation is essential because reasoning about unbounded iteration requires inductive reasoning, which is naturally expressed over recursive functions.
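For instance, a simple Python accumulation loop can be modeled as a structurally recursive IML function (a sketch; in practice ImandraX may additionally require a termination measure for admission):

```
(* Python source:
     total = 0
     for i in range(n):
         total += i
   IML model: recursion on the decreasing measure (n - i). *)
let rec sum_below (i : int) (n : int) (total : int) : int =
  if i >= n then total
  else sum_below (i + 1) n (total + i)

let sum_to (n : int) : int = sum_below 0 n 0
```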
State Machines Many real-world programs are best modeled as state machines. These machines are typically infinite-state, as the state may include complex data structures such as lists, maps, or trees in addition to numeric values. State-machine modeling is particularly effective for representing code with side effects and for capturing the behavior of complex reactive systems, such as trading platforms or protocol handlers.
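A minimal sketch of this style: mutable state becomes an explicit record threaded through a transition function, and recursion over an event list replaces the program's main loop (the types and events here are illustrative, not drawn from the paper's case studies):

```
type state = { count : int; halted : bool }
type event = Incr | Reset | Halt

(* One transition: the new state is a pure function of the old state and an event. *)
let step (s : state) (e : event) : state =
  if s.halted then s
  else match e with
    | Incr  -> { s with count = s.count + 1 }
    | Reset -> { s with count = 0 }
    | Halt  -> { s with halted = true }

(* Running the machine over a trace of events. *)
let rec run (s : state) (es : event list) : state =
  match es with
  | [] -> s
  | e :: rest -> run (step s e) rest
```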
## 4.3 Abstraction Levels
A central question in autoformalization is not merely whether a program can be modeled, but at what level of abstraction it should be modeled. The appropriate abstraction depends on the properties one wishes to analyze.
We distinguish three broad abstraction levels:
High-Level Models High-level models typically arise from domain-specific languages (DSLs) designed to describe complex systems concisely. In such cases, direct use of CodeLogician is often unnecessary. Instead, the DSL itself is translated into IML. For example, the Imandra Protocol Language (IPL) provides a compact notation for symbolic state-transition systems and compiles into IML with significant structural expansion. Reasoning at this level focuses on system-wide invariants and high-level behavioral properties.
Mid-Level Models Application-level software represents the primary target for CodeLogician. At this level, source programs can be translated directly into IML without introducing an intermediate DSL. This category includes most business logic, data-processing pipelines, and control-heavy application code.
Low-Level Models Low-level code, such as instruction sequences or hardware-near implementations, can also be modeled in ImandraX, but doing so requires an explicit formalization of the execution environment (e.g., memory model, virtual machine, or instruction semantics). Such models are outside the intended scope of CodeLogician, which assumes access to semantic abstractions of the underlying execution platform.
## 4.4 Dealing with Unknowns
Real-world programs inevitably rely on external libraries, services, or system calls. Autoformalization therefore requires explicit handling of unknown or partially specified behavior. CodeLogician adopts techniques analogous to mocking in software testing, but within a formal reasoning framework.
## 4.4.1 Opaque Functions
IML supports opaque functions, which declare a function's type without specifying its implementation. Opaque functions allow models to type-check and support reasoning without making any assumptions about the function's internal behavior.
For example, calls to external libraries such as pseudo-random number generators are introduced as opaque functions during autoformalization. Reasoning about code that depends on such functions remains sound, but necessarily conservative.
## 4.4.2 Axioms and Assumptions
To refine reasoning about opaque functions, users (or agents) may introduce axioms that constrain their behavior. These axioms encode assumptions about external components, such as value ranges or monotonicity properties. Introducing axioms narrows the set of possible behaviors considered during reasoning and can significantly strengthen the conclusions that can be drawn.
Importantly, axioms are explicit and local: reasoning results are valid only under the stated assumptions, making the modeling process auditable and transparent.
## 4.4.3 Approximation Functions
In many cases, opaque functions can be replaced with sound approximations. For common mathematical functions, CodeLogician provides libraries of approximation functions, such as bounded Taylor-series expansions for trigonometric operations. These approximations allow models to remain executable and enable test generation, while still supporting rigorous reasoning within known bounds.
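As a sketch, a fifth-order Taylor expansion of sine (accurate only on a bounded domain around zero; the function name is illustrative) keeps a model executable where an opaque trigonometric primitive would not be:

```
(* Bounded Taylor-series stand-in for an opaque sine primitive:
   sin x ~ x - x^3/3! + x^5/5!, with known error bounds near 0. *)
let sin_approx (x : real) : real =
  x -. (x *. x *. x) /. 6.0 +. (x *. x *. x *. x *. x) /. 120.0
```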
## 4.5 Summary
Autoformalization is not merely a syntactic translation process. It is a disciplined approach to modeling software behavior in which state, control flow, assumptions, and abstraction boundaries are made explicit. By combining LLM-driven translation with principled formal modeling techniques, CodeLogician enables automated reasoning engines to answer deep semantic questions about software behavior that are inaccessible to LLMs alone.
## 5 CodeLogician IML Agent
CodeLogician is a neuro-symbolic agent framework designed to enable Large Language Models (LLMs) to reason rigorously about software systems. Rather than asking LLMs to reason directly about source code, CodeLogician teaches LLM-driven agents to construct explicit formal models in mathematical logic and to delegate semantic reasoning to automated reasoning engines.
The current implementation integrates the ImandraX automated reasoning engine together with static analysis tools and auxiliary knowledge sources within an agentic architecture built on LangGraph. CodeLogician exposes both low-level programmatic commands and high-level agentic workflows, allowing users and external agents to flexibly combine automated reasoning with guided interaction. When required information is missing or inconsistencies are detected, the agent can request human
feedback, enabling mixed-initiative refinement. The framework can also be embedded into external agent systems via a remote graph API.
## 5.1 From Source Code to Formal Models
At the core of CodeLogician lies the process of autoformalization (Section 4). Given a source program written in a conventional programming language (e.g., Python), the agent incrementally constructs an executable formal model in the Imandra Modeling Language (IML).
Autoformalization is not a single translation step, but a structured refinement process that involves:
- extracting structural, control-flow, and type information from the source program,
- refactoring imperative constructs into pure functional representations,
- making implicit assumptions explicit (e.g., about external libraries or partial functions),
- synthesizing an IML model that is admissible to the reasoning engine.
The resulting model is functionally equivalent to the source program at the chosen level of abstraction and suitable for mathematical analysis by ImandraX.
## 5.2 Verification Goals and Logical Properties
Once a formal model has been constructed, CodeLogician assists in formulating verification goals (VGs). A verification goal is a boolean-valued predicate over the model that expresses a property expected to hold for all possible inputs. Unlike traditional testing, verification goals quantify symbolically over infinite input spaces and admit mathematical proof or counterexample.
For example, consider a ranking function used in a financial trading system. A fundamental correctness requirement is transitivity: if one order ranks above a second, and the second ranks above a third, then the first must rank above the third. Such properties are expressed directly as verification goals in IML and discharged by ImandraX.
When a verification goal does not hold, ImandraX produces a counterexample witness. These counterexamples are concrete executions that violate the property and often expose subtle logical flaws that are difficult to detect through testing or informal inspection. In industrial case studies, this approach has uncovered violations of core invariants in production systems that would otherwise remain hidden.
## 5.3 Beyond Yes/No Verification
While the proof or refutation of a verification goal determines whether or not a property is true, many forms of program understanding require richer, non-binary kinds of semantic information. CodeLogician therefore relies not only on formal verification, but also on exhaustive behavioral analysis techniques provided by ImandraX.
In particular, CodeLogician uses state-space exploration to identify decision boundaries, semantic edge cases, and invariant behaviors. These capabilities enable explanations, documentation, and test generation that go far beyond yes/no verification and are central to CodeLogician's value proposition.
## 5.4 Agent Architecture and State
CodeLogician is implemented as a stateful agent whose execution is organized around an explicit agent state. This state captures all artifacts relevant to the reasoning process and evolves as the agent performs analysis and refinement.
The agent state includes:
- the original source program,
- extracted formalization metadata (types, control flow, assumptions),
- the generated IML model,
- the model's admission and executability status,
- verification goals, decomposition requests, and related artifacts.
Agent operations fall into three broad categories:
- State transformations , which update the agent state explicitly (e.g., modifying the source program or model);
- Agentic workflows , which orchestrate multi-step processes such as end-to-end autoformalization and analysis;
- Low-level commands , which perform individual actions such as model admission, verification, or decomposition.
This separation allows users and external agents to mix automated and guided reasoning as appropriate.
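As a sketch, the agent state and a representative state transformation might look as follows (field and function names are illustrative, not CodeLogician's actual schema):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AgentState:
    src_code: str                                           # original source program
    formalization_data: dict = field(default_factory=dict)  # types, control flow, assumptions
    iml_model: Optional[str] = None                         # generated IML model
    admitted: bool = False                                  # admission status
    executable: bool = False                                # executability status
    verification_goals: list = field(default_factory=list)

def set_model(state: AgentState, model: str) -> AgentState:
    """A state transformation: explicitly updates the model and resets
    downstream status, since a new model must be re-admitted."""
    state.iml_model = model
    state.admitted = False
    state.executable = False
    return state
```

Agentic workflows and low-level commands then operate over such a state, which is what makes mixing automated and guided reasoning possible.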
## 5.5 Model Refinement and Executability
Autoformalization proceeds as an iterative refinement process. The immediate objective is to produce a model that is admissible to ImandraX, meaning that it is well-typed and free of missing definitions. A further objective is to make the model executable, eliminating opaque functions either by implementation or by sound approximation.
These two stages correspond to:
- Admitted models , which can be verified but may contain unresolved abstractions;
- Executable models , which support full behavioral analysis and test generation.
This refinement process mirrors established practices in model-based software engineering and digital twinning. By working with an executable mathematical model, CodeLogician enables rigorous reasoning about systems whose direct analysis in the source language would be impractical.
## 5.6 Assessing Model Correctness
A natural question arises when constructing formal models: how do we know the model is correct? CodeLogician adopts a notion of correctness based on functional equivalence at the chosen abstraction level. Two programs are considered equivalent if they produce the same observable outputs for the same inputs within the modeled domain.
Exhaustive behavioral analysis plays a central role in this assessment. By enumerating all distinct behavioral regions of the model, CodeLogician provides a structured and auditable view of program behavior. This makes mismatches between intended and modeled semantics explicit and actionable.
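A minimal sketch of this equivalence check (illustrative functions and witnesses, not the tool's API): compare source and model on one witness input per behavioral region, including decision boundaries.

```python
def first_divergence(witnesses, source_fn, model_fn):
    """Return the first region witness on which source and model disagree,
    or None if they agree on every witness."""
    for w in witnesses:
        if source_fn(w) != model_fn(w):
            return w
    return None

source    = lambda x: 100 if x > 99 else x + 1
model_ok  = lambda x: 100 if x > 99 else x + 1
model_bad = lambda x: 100 if x > 100 else x + 1  # off-by-one at the boundary

witnesses = [0, 99, 100]  # one witness per region, plus the decision boundary
# first_divergence(witnesses, source, model_ok)  -> None
# first_divergence(witnesses, source, model_bad) -> 100
```

Because region decomposition supplies a witness for every distinct behavioral region, a divergence anywhere in the modeled domain shows up on some witness, making semantic mismatches explicit and actionable.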
## 5.7 Region Decomposition, Test Case Generation, and Refinement
The final stage is where Imandra shows its power. We perform region decomposition on the IML code to obtain a finite number of symbolically described regions such that (a) the union of all regions covers the entire state space, and (b) within each region, the behavior of the system is invariant.
Consider the following IML code that models order discount logic:
```
type priority = Standard | Premium

type order = { amount: int; customer: priority }

let discount = fun o ->
  match o.customer with
  | Premium -> if o.amount > 100 then 20 else 10
  | Standard -> if o.amount > 100 then 10 else 0
```
Region decomposition partitions the input space into four distinct regions based on customer type and order amount. For each region, ImandraX provides concrete witness values that satisfy the region's constraints. To bridge the gap between these symbolic results and executable test cases, we developed imandrax-codegen , an open-source code generation tool. Given the decomposition artifacts, the tool automatically generates code including type definitions and test cases. We use Python as the target language in this example; support for additional languages is planned:
```
@dataclass
class order:
    amount: int
    customer: priority

@dataclass
class Standard:
    pass

@dataclass
class Premium:
    pass

priority = Standard | Premium

def test_1():
    """test_1
    - invariant: 0
    - constraints:
      - not (o.customer = Premium)
      - o.amount <= 100
    """
    result: int = discount(o=order(0, Standard()))
    expected: int = 0
    assert result == expected

def test_2():
    """test_2
    - invariant: 10
    - constraints:
      - not (o.customer = Premium)
      - o.amount >= 101
    """
    result: int = discount(o=order(101, Standard()))
    expected: int = 10
    assert result == expected

# ... test_3, test_4 are omitted for brevity
```
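For the generated tests to run, a `discount` implementation must be in scope. One illustrative Python source consistent with the IML model (our reconstruction, with a capitalized `Order` dataclass standing in for the generated `order` type) is:

```python
from dataclasses import dataclass

@dataclass
class Standard:
    pass

@dataclass
class Premium:
    pass

@dataclass
class Order:
    amount: int
    customer: object  # Standard() or Premium()

def discount(o: Order) -> int:
    # Mirrors the IML model: Premium gets 20 above the threshold, else 10;
    # Standard gets 10 above the threshold, else 0.
    if isinstance(o.customer, Premium):
        return 20 if o.amount > 100 else 10
    return 10 if o.amount > 100 else 0
```

Each of the four regions maps to one generated test, e.g. `discount(Order(0, Standard()))` yields 0 and `discount(Order(101, Premium()))` yields 20.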
This workflow ensures that we have test cases covering all possible behavioral regions of the system; in this example, all four combinations of customer priority and order-amount threshold.
The entire pipeline is configurable, balancing automation and human intervention: the user can choose to double-check refactored code at intermediate steps, provide feedback on error messages, or let the agent proceed autonomously. As mentioned in the previous section, FDB provides critical RAG support at all stages of the pipeline.
## 5.8 Interfaces
CodeLogician exposes both input and output interfaces designed to be simultaneously user-friendly and programmatically accessible. The interfaces follow a simple input schema paired with a rich, structured output format, enabling use in interactive development environments as well as integration into larger agent-based systems.
The input interface accepts source programs (currently Python) as textual input, together with optional configuration parameters that control the autoformalization and reasoning process. This minimal input design allows CodeLogician to be invoked uniformly from command-line tools, IDEs, or external agents.
The output interface is structured using Pydantic models, providing strong validation, typing, and serialization guarantees. The output schema captures detailed artifacts produced during each reasoning attempt, including:
- generated IML models,
- compilation and admission results,
- verification outcomes and counterexamples,
- region decomposition results and associated metadata.
This structured representation makes CodeLogician suitable both for standalone use and as a component within larger neuro-symbolic pipelines.
CodeLogician also exposes configuration parameters that control the degree of automation and human intervention. Parameters such as `confirm_refactor`, `decomp`, `max_attempts`, and `max_attempts_wo_human_feedback` allow users to trade off full automation against mixed-initiative interaction, depending on the complexity and criticality of the target code.
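Collected as a record, such a configuration might look like this (the defaults shown are illustrative, not the shipped values):

```python
from dataclasses import dataclass

@dataclass
class CodeLogicianConfig:
    confirm_refactor: bool = False            # pause for human review of refactored code
    decomp: bool = True                       # run region decomposition after admission
    max_attempts: int = 5                     # total model-refinement attempts
    max_attempts_wo_human_feedback: int = 2   # automated retries before requesting feedback
```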
## 5.8.1 Programmatic Access via Python RemoteGraph
CodeLogician can be directly integrated with other LangGraph agents using the RemoteGraph interface. By specifying the agent's URI within Imandra Universe and supplying an API key, external agents can invoke CodeLogician as a remote reasoning component. This enables composition with other agentic workflows, including multi-agent systems and tool-augmented LLM pipelines. In addition to programmatic access, CodeLogician is also available through a web-based chat interface in Imandra Universe.
## 5.8.2 VS Code Extension
CodeLogician is distributed with a free Visual Studio Code (VS Code) extension [8], providing a lightweight and accessible interface for interactive use. The extension communicates with the CodeLogician agent via the RemoteGraph interface and exposes core commands through the VS Code command palette.
The extension is built around a command-queue abstraction that allows users to sequence multiple operations, such as autoformalization, verification, and decomposition, before execution. It supports simultaneous formalization of multiple source files and provides visibility into the agent's internal state, including generated models and reasoning artifacts.
Figure 6 illustrates the VS Code extension's command management system, state viewer, and an example IML model produced by CodeLogician.
## 5.9 Summary
The CodeLogician IML agent operationalizes the neuro-symbolic approach advocated in this paper. LLMs are used to construct formal mental models of software systems, while automated reasoning engines provide mathematically precise analysis. By combining autoformalization, verification goals, and exhaustive behavioral analysis within an agentic framework, CodeLogician enables rigorous reasoning about real-world software systems beyond the capabilities of LLM-only approaches.
## 6 CodeLogician Server
The CodeLogician server provides a persistent backend for project-wide formalization and analysis. It manages formalization tasks, maintains analysis state across sessions, and serves multiple client interfaces (TUI and CLI). Unlike one-shot invocation modes, the server is designed for continuous and incremental operation: it caches results, reacts to code changes, and recomputes only the minimal set of affected artifacts required to keep formal models consistent with the underlying code base.
Figure 6: CodeLogician VS Code extension, showing command management, agent state visualization, and a generated IML model.
<details>
<summary>Image 5 Details</summary>

### Visual Description
## Software Interface: Imandra Code Logician
### Overview
The image shows a software interface, likely for a tool named "Imandra Code Logician." It displays various panels related to code analysis, command execution, and interaction summaries. The interface includes code editors, command management options, and a summary of interactions.
### Components/Axes
* **Top-Left Panel:**
* Header: "IMANDRA: CODE LOGICIAN"
* Section: "COMMAND MANAGEMENT"
* Buttons: "Add command", "Clear graph state"
* **Top-Center Panel:**
* File Name: "rd_rb.rb"
* Code Editor: Contains Ruby code defining functions `g(x)` and `f(x)`.
* **Top-Right Panel:**
* Title: "Select a command to execute (11 certain of 16 total)"
* List of Commands: "Update Source Code", "Inject Formalization Context", "Set IML Model", "Admit IML Model" (Not available in current state), "Sync IML Model" (Not available in current state), "Generate IML Model"
* **Bottom-Left Panel:**
* Title: "Summary of Interactions - rd_rb.rb (Imandra CodeLogician)"
* Table: "Summary of Interactions" with columns "#", "Command", and "Response".
* **Bottom-Right Panel:**
* Explorer: Shows a file tree with "TEST" as the root, containing "code_logician", "rd_rb.iml", and "rd_rb.rb".
* Code Editor: Displays the content of "rd_rb.iml", which contains code defining functions `g(x)` and `f(x)` using a syntax similar to OCaml.
### Detailed Analysis
**Top-Left Panel:**
* The "COMMAND MANAGEMENT" section provides buttons to "Add command" and "Clear graph state".
**Top-Center Panel:**
* The file "rd_rb.rb" contains the following Ruby code:
```ruby
def g(x)
if x > 22
9
else
100 + x
end
end
def f(x)
if x > 99
100
elsif x > 23 && x < 70
89 + x
end
end
```
**Top-Right Panel:**
* The command selection panel indicates that 11 out of 16 commands are available.
* The "Admit IML Model" and "Sync IML Model" commands are "Not available in current state".
**Bottom-Left Panel:**
* The "Summary of Interactions" table has the following entries:
* **Row 1:**
* Command: `init_state`
* Parameters:
* `src_code`: `def g(x) if x > 22 9 else end`
* `src_lang`: `ruby`
* Response: `Task (DONE)`
* **Row 2:**
* Command: `agent_formalizer`
* Parameters:
* `no_check_formalization_h`: `False`
* `no_refactor`: `False`
* `no_gen_model_hitl`: `False`
* `max_tries_wo_hitl`: `2`
* `max_tries`: `3`
* Response: `Task (DONE)`
* **Row 3:**
* Command: `gen_region_decomps`
* Parameters:
* `function_name`: `f`
* Response: `Pending`
**Bottom-Right Panel:**
* The file "rd_rb.iml" contains the following code:
```
let g (x : int) : int =
if x > 22 then
9
else
100 + x
let f (x : int) : int =
if x > 99 then
100
```
### Key Observations
* The interface allows users to manage commands, view and edit code in different languages (Ruby and a language similar to OCaml), and see a summary of interactions.
* Some commands are not available in the current state, suggesting dependencies or preconditions.
* The "Summary of Interactions" table provides information about the commands executed, their parameters, and their responses.
### Interpretation
The Imandra Code Logician appears to be a tool for analyzing and formalizing code. It allows users to define functions, execute commands to analyze the code, and view the results of these commands. The tool supports multiple programming languages and provides a structured way to interact with the code analysis process. The "Summary of Interactions" table is a key component, providing a record of the analysis steps and their outcomes. The fact that some commands are not always available suggests a stateful analysis process where the order of operations matters.
</details>
## 6.1 Server Responsibilities
At a high level, the server is responsible for:
- Persistent analysis state : storing formalization artifacts and their status across sessions;
- Project management : tracking multiple repositories, configurations, and formalization strategies;
- Incremental updates : monitoring file and dependency changes and triggering minimal reformalization;
- Caching : avoiding redundant computation by reusing results for unchanged modules;
- Interface multiplexing : serving requests from interactive and programmatic clients via a shared backend.
These responsibilities enable a workflow in which developers and agents can repeatedly query and refine formal models while the server maintains a coherent, up-to-date view of the formalized state of the project.
## 6.2 Strategies: Project-Wide Formalization via PyIML
The server executes strategies: configurable pipelines that formalize a code base and manage dependencies between its components. For Python projects, CodeLogician provides the PyIML strategy, which translates Python source code into IML models suitable for reasoning in ImandraX.
Given a project directory, the PyIML strategy:
1. Analyzes Python modules and infers their dependency structure from imports;
2. Translates source constructs (functions, classes, types) into corresponding IML artifacts;
3. Manages formalization across the codebase in dependency-consistent order;
4. Maintains a project metamodel capturing modules, dependencies, and formalization status.
The strategy also handles opaque functions (Section 4) by preserving type information while abstracting implementation details when direct translation is not possible. This allows the system to admit models and begin formal reasoning even in the presence of external libraries or currently unknown semantics, while making the loss of executability explicit.
## 6.3 Metamodel-Based Formalization
Formalizing large software systems requires reasoning not only about individual source files, but also about their interdependencies. To support scalable project-wide analysis, the CodeLogician server maintains an explicit metamodel that captures the structure, dependencies, and formalization status of a collection of models corresponding to a repository.
## 6.4 Core Definitions
Model A model represents the formalization state of a single source file or module. Each model contains:
- src code : the source code of the corresponding file or module;
- formalization status : the current level of formalization achieved;
- dependencies : references to other models on which this model depends;
- dependency context : an aggregated formal representation (in IML) of available dependencies, used during formalization.
Metamodel A metamodel is a container for all models associated with a given project directory (repository). It provides operations for model insertion, deletion, update, and dependency management, and serves as the authoritative representation of the project's formalization state.
Interdependence By interdependence, we mean that models import symbols (e.g., functions, classes, constants) from other project files or third-party libraries. By default, a model can be formalized even when referenced symbols lack formal implementations; however, such symbols must be treated as opaque, which limits the strength of reasoning that can be performed.
Autoformalization Autoformalization is the process by which the CodeLogician agent translates source code, together with its dependency context, into an IML model and submits it to the reasoning engine. This process may produce artifacts such as verification goals, counterexamples, and test cases.
## 6.5 Formalization Levels
Each model is assigned a formalization level reflecting how much semantic information is available:
- Unknown : no formalization has been performed;
- Error during validation : admission failed due to issues such as inconsistent types, missing symbols, or non-termination;
- Admitted with opaqueness : the model has been admitted but contains opaque symbols;
- Admitted transparent : the model has been admitted and contains no opaque elements.
The dependency context is computed by aggregating all available IML models of dependencies and injecting this context alongside the model's own source code during formalization.
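These levels form a natural ordering that the server can consult when deciding whether further formalization work is worthwhile; a minimal encoding (names and ordering are ours, for illustration):

```python
from enum import IntEnum

class FormalizationLevel(IntEnum):
    UNKNOWN = 0                   # no formalization performed
    ERROR_DURING_VALIDATION = 1   # admission failed
    ADMITTED_WITH_OPAQUENESS = 2  # admitted, but contains opaque symbols
    ADMITTED_TRANSPARENT = 3      # admitted with no opaque elements

def supports_behavioral_analysis(level: FormalizationLevel) -> bool:
    # Only fully transparent models support exhaustive behavioral
    # analysis and executable test generation.
    return level is FormalizationLevel.ADMITTED_TRANSPARENT
```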
Objective. The server aims to maximize the number of models that are formalized and to achieve the highest possible formalization level for each, while ensuring that the most up-to-date source code, dependency context, and human input are reflected.
## 6.6 Conditions Triggering Re-Formalization
A model requires re-formalization when:
- its underlying source code has changed or has never been formalized; or
- the source code or formalization status of one or more dependencies (direct or indirect) has changed.
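Operationally, this amounts to taking the transitive closure of the changed modules over the reverse dependency graph; a sketch using the running example's modules:

```python
def dirty_set(changed, rdeps):
    """Modules needing re-formalization: the changed modules plus every
    module that (transitively) depends on one of them."""
    dirty, stack = set(changed), list(changed)
    while stack:
        m = stack.pop()
        for dependent in rdeps.get(m, []):
            if dependent not in dirty:
                dirty.add(dependent)
                stack.append(dependent)
    return dirty

# Reverse dependencies: module -> modules that import it.
rdeps = {
    "basic.py": ["math_ops.py", "utils/helpers.py"],
    "math_ops.py": ["utils/helpers.py"],
    "utils/helpers.py": [],
}
# dirty_set({"math_ops.py"}, rdeps) -> {"math_ops.py", "utils/helpers.py"}
```

Everything outside the returned set keeps its cached formalization artifacts.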
## 6.7 Autoformalization Workflow
Initial state. Figure 7 shows an example repository with a hierarchical structure in which modules import symbols from one another.
Figure 7: Example project structure.
<details>
<summary>Image 6 Details</summary>

### Visual Description
## File System Diagram: sample_simple_math_lib
### Overview
The image depicts a file system directory structure for a Python library named "sample_simple_math_lib". It shows the hierarchical arrangement of files and subdirectories within the library.
### Components/Axes
The diagram uses a tree-like structure to represent the file system. Each line represents a file or directory. Indentation indicates the level of nesting.
### Detailed Analysis
The root directory is "sample_simple_math_lib".
- Inside the root directory, there are several files and subdirectories:
- `__init__.py`
- `advanced` (subdirectory)
- Inside `advanced`:
- `__init__.py`
- `geometry` (subdirectory)
- Inside `geometry`:
- `__init__.py`
- `shapes.py`
- `basic.py`
- `math_ops.py`
- `README.md`
- `utils` (subdirectory)
- Inside `utils`:
- `__init__.py`
- `helpers.py`
### Key Observations
- The library is organized into modules such as "advanced", "basic", "math_ops", and "utils".
- The "advanced" module further contains a "geometry" submodule.
- Each module and submodule contains an `__init__.py` file, indicating that they are Python packages.
- A `README.md` file is present at the root level, likely containing documentation for the library.
### Interpretation
The diagram illustrates the structure of a simple Python math library. The use of `__init__.py` files indicates that the subdirectories are intended to be treated as Python packages, allowing for modular organization and import of functions and classes within the library. The presence of modules like "advanced", "basic", "math_ops", and "utils" suggests a categorization of mathematical functions based on their complexity or purpose. The `README.md` file is a standard practice for providing information about the library's usage and functionality.
</details>
The dependency graph inferred from imports is shown in Figure 8. An edge from a source node to a target node indicates that the source imports symbols from the target.
Figure 8: Dependency graph inferred from imports.
<details>
<summary>Image 7 Details</summary>

### Visual Description
## Diagram: Python Module Dependency Graph
### Overview
The image is a dependency graph illustrating the relationships between several Python modules. Each module is represented by a rounded rectangle containing the module's name and file path. Arrows indicate dependencies, showing which modules import or rely on others.
### Components/Axes
* **Nodes:** Each node represents a Python module. The text inside each node specifies the module's name and its file path. The format is `PythonModule(name='module_name', path='file_path')`.
* **Edges:** Arrows indicate dependencies. An arrow from Module A to Module B means Module A depends on Module B.
### Detailed Analysis or ### Content Details
Here's a breakdown of each module and its dependencies:
1. **Top Row, Left:**
* `PythonModule(name='__init__', path='__init__.py')`
* This module has no incoming or outgoing dependencies shown in the diagram.
2. **Top Row, Second from Left:**
* `PythonModule(name='advanced.__init__', path='advanced/__init__.py')`
* This module has no incoming or outgoing dependencies shown in the diagram.
3. **Top Row, Center:**
* `PythonModule(name='utils.__init__', path='utils/__init__.py')`
* This module has a dependency on `utils.helpers`.
4. **Top Row, Second from Right:**
* `PythonModule(name='advanced.geometry.shapes', path='advanced/geometry/shapes.py')`
* This module has dependencies on both `utils.helpers` and `basic`.
5. **Top Row, Right:**
* `PythonModule(name='advanced.geometry.__init__', path='advanced/geometry/__init__.py')`
* This module has no incoming or outgoing dependencies shown in the diagram.
6. **Middle Row, Center:**
* `PythonModule(name='utils.helpers', path='utils/helpers.py')`
* This module has dependencies on both `math_ops` and `basic`.
7. **Bottom Row, Left:**
* `PythonModule(name='math_ops', path='math_ops.py')`
* This module has no outgoing dependencies shown in the diagram.
8. **Bottom Row, Right:**
* `PythonModule(name='basic', path='basic.py')`
* This module has no outgoing dependencies shown in the diagram.
### Key Observations
* The `utils.helpers` module is a central point, with dependencies from `utils.__init__` and `advanced.geometry.shapes`, and providing dependencies to `math_ops` and `basic`.
* The `basic` module is a leaf node, meaning no other modules depend on it in this diagram.
* The `math_ops` module is also a leaf node.
* The `advanced.geometry.shapes` module depends on both `utils.helpers` and `basic`.
### Interpretation
The diagram illustrates the module structure and dependencies within a Python project, likely related to advanced geometry calculations. The `utils.helpers` module seems to provide utility functions used by other modules. The `basic` and `math_ops` modules likely contain fundamental operations, with `basic` being a very low-level module given its dependencies. The `advanced.geometry.shapes` module utilizes both utility functions and basic operations, suggesting it implements more complex geometric shapes or algorithms. The `__init__.py` modules likely serve to define packages.
</details>
```
CodeLogician(SourceCode, DependencyContext) : Model
```
Figure 9: Topological formalization order.
Autoformalization proceeds bottom-up along a topological ordering of the dependency graph (Figure 9).
Suppose we wish to reason about functions in utils/helpers.py . To do so, CodeLogician formalizes that module and all of its dependencies in dependency order (Figure 10).
```
1. CodeLogician(basic.py, []) = Model(basic.py)
2. CodeLogician(math_ops.py, [Model(basic.py)]) = Model(math_ops.py)
3. CodeLogician(utils/helpers.py, [Model(basic.py), Model(math_ops.py)]) = Model(utils/helpers.py)
```
Figure 10: Formalization sequence for a target module.
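This ordering is a standard post-order traversal of the import graph: dependencies are formalized before their dependents. A minimal sketch reproducing the sequence of Figure 10:

```python
def formalization_order(target, deps, done=None, order=None):
    """Post-order DFS over the import graph: each module is emitted
    only after all of its dependencies."""
    if done is None:
        done, order = set(), []
    for d in deps[target]:
        if d not in done:
            formalization_order(d, deps, done, order)
    done.add(target)
    order.append(target)
    return order

# Import graph from the running example: module -> modules it imports.
deps = {
    "basic.py": [],
    "math_ops.py": ["basic.py"],
    "utils/helpers.py": ["basic.py", "math_ops.py"],
}
# formalization_order("utils/helpers.py", deps)
# -> ["basic.py", "math_ops.py", "utils/helpers.py"]
```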
After completion, formalized models appear highlighted, while unformalized modules remain unchanged (Figure 11).
## 6.8 Minimal Re-Formalization Strategy
When source code changes, the server computes precise diffs to identify the minimal set of models that require re-formalization, ensuring consistency while avoiding unnecessary recomputation.
- Case 1: Source modified, dependencies unchanged. If math_ops.py is modified without altering dependency relations (Figure 12), the modified module is marked as source modified, and only its formalized downstream dependents are updated.
- Case 2: New dependency added. If a new module is introduced (Figure 13), formalization proceeds in dependency order to restore consistency.
- Case 3: Dependencies merged. When multiple modules are merged, the server recomputes formalization only for affected nodes (Figure 14).
- Case 4: Upstream dependency removed. If a dependency is removed from an unformalized module, no re-formalization is required for already formalized nodes (Figure 15).
## 6.9 Summary and Error Resilience
In response to file updates, CodeLogician :
1. updates the dependency graph;
2. computes the minimal set of required re-formalization tasks;
3. avoids unnecessary re-formalization by targeting only affected models.
<details>
<summary>Image 8 Details</summary>

### Visual Description
## Dependency Diagram and Formalization Status
### Overview
The image presents a dependency diagram of Python modules and a table detailing their formalization status. The diagram shows relationships between modules, while the table provides information on their formalization status, synchronization status, and whether they can or need formalization.
### Components/Axes
**Diagram Components:**
* **Nodes:** Represented as rounded rectangles, each labeled as "PythonModule(name='module_name', path='module_path')". The nodes are colored white except for three nodes colored green.
* **Edges:** Represented as arrows, indicating dependencies between modules.
* **Node Labels:** Each node contains the module name and its file path.
**Table Columns:**
* **Path:** File path of the Python module.
* **Model Name:** Name of the model associated with the module.
* **Formalization Status:** Status of formalization (transparent or unknown).
* **Sync Status:** Status of synchronization (synced, source\_modified, dependency\_changed, or never\_formalized).
* **Can Formalize:** Indicates whether the module can be formalized (represented by a checkmark).
* **Needs Formalization:** Indicates whether the module needs formalization (represented by a checkmark or an "X").
### Detailed Analysis
**Diagram Nodes and Connections:**
* **Top Row (White Nodes):**
* `PythonModule(name='__init__', path='__init__.py')`
* `PythonModule(name='advanced.__init__', path='advanced/__init__.py')`
* `PythonModule(name='utils.__init__', path='utils/__init__.py')`
* `PythonModule(name='advanced.geometry.shapes', path='advanced/geometry/shapes.py')`
* `PythonModule(name='advanced.geometry.__init__', path='advanced/geometry/__init__.py')`
* **Middle Row (Green Nodes):**
* `PythonModule(name='utils.helpers', path='utils/helpers.py')`
* **Bottom Row (Green Nodes):**
* `PythonModule(name='math.ops', path='math.ops.py')`
* `PythonModule(name='basic', path='basic.py')`
**Connections:**
* `utils.__init__.py` connects to `utils.helpers.py`.
* `advanced.geometry.shapes.py` connects to `utils.helpers.py` and `basic.py`.
* `utils.helpers.py` connects to `math.ops.py` and `basic.py`.
**Table Data:**
| Path | Model Name | Formalization Status | Sync Status | Can Formalize | Needs Formalization |
| :------------------------------- | :----------- | :------------------- | :----------------- | :------------ | :------------------ |
| basic.py | basic.iml | transparent | synced | ✓ | X |
| math.ops.py | math.ops.iml | transparent | source\_modified | ✓ | X |
| utils/helpers.py | helpers.iml | transparent | dependency\_changed | ✓ | X |
| advanced/geometry/\_\_init\_\_.py | init.iml | unknown | never\_formalized | ✓ | ✓ |
| advanced/geometry/shapes.py | shapes.iml | unknown | never\_formalized | ✓ | ✓ |
| utils/\_\_init\_\_.py | init.iml | unknown | never\_formalized | ✓ | ✓ |
| advanced/\_\_init\_\_.py | init.iml | unknown | never\_formalized | ✓ | ✓ |
| \_\_init\_\_.py | init.iml | unknown | never\_formalized | ✓ | ✓ |
### Key Observations
* The diagram shows a dependency structure where `utils.helpers.py` and `basic.py` are central modules, with multiple other modules depending on them.
* The table indicates that `basic.py`, `math.ops.py`, and `utils/helpers.py` have a "transparent" formalization status and do not need formalization.
* The modules in the `advanced/geometry` and `utils` directories, along with the root `__init__.py`, have an "unknown" formalization status and "never\_formalized" sync status, indicating they need formalization.
### Interpretation
The diagram and table provide insights into the formalization status of a Python project. The green nodes in the diagram likely represent modules that are already formalized or considered stable, while the white nodes represent modules that may require further attention. The table confirms this, showing that the modules with "transparent" formalization status do not need formalization, while those with "unknown" status do. The "never\_formalized" sync status suggests that these modules have not been included in the formalization process yet. The arrows indicate the direction of dependency, showing which modules rely on others. This information can be used to prioritize formalization efforts and understand the impact of changes in different modules.
</details>
Figure 11: Summary of formalization status, formalized modules highlighted.
Figure 12: Source modification without dependency changes.
<details>
<summary>Image 9 Details</summary>

### Visual Description
## Diagram: Python Module Dependency and Formalization Status
### Overview
The image presents a diagram illustrating the dependencies between Python modules and a table summarizing their formalization status and synchronization status. The diagram shows the relationships between different Python modules, while the table provides details on their model names, formalization status, sync status, and whether they can be formalized or need formalization.
### Components/Axes
**Diagram Components:**
* **Nodes:** Represented by rounded rectangles, each labeled as "PythonModule(name='module_name', path='module_path')". The nodes are colored differently: green, red, and yellow.
* **Edges:** Represented by arrows, indicating dependencies between modules.
* **Node Colors:**
* Green: basic.py
* Red: math_ops.py
* Yellow: utils.helpers.py
* White: All other modules
**Table Columns:**
* **Path:** The file path of the Python module.
* **Model Name:** The name of the model associated with the module.
* **Formalization Status:** Indicates whether the module has been formalized (transparent or unknown).
* **Sync Status:** Indicates the synchronization status of the module (synced, source\_modified, dependency\_changed, or never\_formalized).
* **Can Formalize:** Indicates whether the module can be formalized (represented by a checkmark or an "X").
* **Needs Formalization:** Indicates whether the module needs formalization (represented by a checkmark or an "X").
### Detailed Analysis
**Diagram Node Details:**
* **Top Row (White Nodes):**
* PythonModule(name='\_\_init\_\_.py', path='\_\_init\_\_.py')
* PythonModule(name='advanced.\_\_init\_\_.py', path='advanced/\_\_init\_\_.py')
* PythonModule(name='utils.\_\_init\_\_.py', path='utils/\_\_init\_\_.py')
* PythonModule(name='advanced.geometry.shapes.py', path='advanced/geometry/shapes.py')
* PythonModule(name='advanced.geometry.\_\_init\_\_.py', path='advanced/geometry/\_\_init\_\_.py')
* **Middle Row (Yellow Node):**
* PythonModule(name='utils.helpers.py', path='utils/helpers.py')
* **Bottom Row (Red and Green Nodes):**
* PythonModule(name='math.ops.py', path='math\_ops.py') (Red)
* PythonModule(name='basic.py', path='basic.py') (Green)
**Table Data:**
| Path | Model Name | Formalization Status | Sync Status | Can Formalize | Needs Formalization |
| :------------------------------- | :----------- | :------------------- | :----------------- | :------------ | :------------------ |
| basic.py | basic.iml | transparent | synced | ✓ | X |
| math.ops.py | math\_ops.iml | transparent | source\_modified | ✓ | ✓ |
| utils/helpers.py | helpers.iml | transparent | dependency\_changed | X | ✓ |
| advanced/geometry/\_\_init\_\_.py | init.iml | unknown | never\_formalized | ✓ | ✓ |
| advanced/geometry/shapes.py | shapes.iml | unknown | never\_formalized | X | ✓ |
| utils/\_\_init\_\_.py | init.iml | unknown | never\_formalized | X | ✓ |
| advanced/\_\_init\_\_.py | init.iml | unknown | never\_formalized | ✓ | ✓ |
| \_\_init\_\_.py | init.iml | unknown | never\_formalized | ✓ | ✓ |
**Diagram Edge Details:**
* 'utils.helpers.py' depends on 'math.ops.py'
* 'utils.helpers.py' depends on 'basic.py'
* 'advanced.geometry.shapes.py' depends on 'utils.helpers.py'
### Key Observations
* The diagram visualizes the dependencies between Python modules.
* The table provides a summary of the formalization and synchronization status of each module.
* Modules with "transparent" formalization status have different sync statuses.
* Modules under "advanced/geometry" and init files have "unknown" formalization status and "never\_formalized" sync status.
* The "Can Formalize" and "Needs Formalization" columns provide a quick overview of the formalization requirements for each module.
### Interpretation
The diagram and table together provide a comprehensive view of the state of the Python modules in terms of their dependencies, formalization, and synchronization. The "transparent" formalization status suggests that these modules have been formalized to some extent, while "unknown" indicates that they haven't been formalized yet. The sync status provides information on whether the modules are up-to-date with the latest changes. The "Can Formalize" and "Needs Formalization" columns highlight the modules that require attention in terms of formalization efforts. The dependency graph shows how changes in one module might affect others, which is crucial for maintaining the system's integrity. The data suggests a need to prioritize the formalization of modules with "unknown" formalization status, especially those under "advanced/geometry" and init files, and to address the synchronization issues in modules with "source\_modified" or "dependency\_changed" sync status.
</details>
Figure 13: Addition of a new dependency.
<details>
<summary>Image 10 Details</summary>

### Visual Description
## Diagram: Python Module Dependency Graph
### Overview
The image is a directed graph representing dependencies between Python modules. Each node represents a Python module, and the edges represent import relationships. The nodes are labeled with the module's name and file path. Node colors indicate module type or importance.
### Components/Axes
* **Nodes:** Represent Python modules. Each node is a rounded rectangle containing the module's name and path.
* Format: `PythonModule(name='module_name', path='module/path.py')`
* **Edges:** Represent dependencies (import relationships) between modules. Arrows indicate the direction of the dependency.
* **Node Colors:**
* White: Base modules or entry points.
* Yellow: Utility or helper modules.
* Red: Core or fundamental modules.
### Detailed Analysis
Here's a breakdown of each module and its dependencies:
1. **Top Row (Left to Right):**
* `PythonModule(name='__init__', path='__init__.py')` - White node, no incoming or outgoing edges.
* `PythonModule(name='advanced.__init__', path='advanced/__init__.py')` - White node, has one outgoing edge.
* `PythonModule(name='utils.__init__', path='utils/__init__.py')` - White node, has one outgoing edge.
* `PythonModule(name='advanced.geometry.shapes', path='advanced/geometry/shapes.py')` - White node, has one outgoing edge.
* `PythonModule(name='advanced.geometry.__init__', path='advanced/geometry/__init__.py')` - White node, no incoming or outgoing edges.
2. **Middle Row:**
* `PythonModule(name='utils.helpers', path='utils/helpers.py')` - Yellow node, has two outgoing edges and one incoming edge from `utils.__init__.py`.
* `PythonModule(name='math.ops', path='math.ops.py')` - Yellow node, has one outgoing edge and one incoming edge from `utils.helpers`.
* `PythonModule(name='basic', path='basic.py')` - Red node, has one outgoing edge and three incoming edges from `utils.helpers`, `math.ops`, and `advanced/geometry/shapes.py`.
3. **Bottom Row:**
* `PythonModule(name='basic', path='basic.py')` - White node, has one incoming edge from `basic.py`.
**Dependencies (Edges):**
* `advanced.__init__.py` -> `basic.py`
* `utils.__init__.py` -> `utils.helpers`
* `utils.helpers` -> `basic.py`
* `utils.helpers` -> `math.ops.py`
* `math.ops.py` -> `basic.py`
* `advanced.geometry.shapes` -> `basic.py`
* `basic.py` -> `basic.py`
### Key Observations
* The `basic.py` module (red node) is a central module, depended upon by several other modules.
* Utility modules (`utils.helpers`, `math.ops`) are colored yellow.
* The `__init__.py` modules and `advanced.geometry.shapes` appear to be entry points or initializers, as they have no incoming dependencies.
* The `basic.py` module at the bottom depends on the `basic.py` module in the middle row, suggesting a potential recursive or core dependency.
### Interpretation
The diagram illustrates the architecture and dependencies within a Python project. The red node (`basic.py`) likely represents a core module that provides fundamental functionalities, as it is depended upon by several other modules. The yellow nodes (`utils.helpers`, `math.ops`) are utility modules that provide helper functions or mathematical operations. The white nodes represent entry points, initializers, or modules with no dependencies. The graph highlights the flow of dependencies and the importance of the `basic.py` module in the overall project structure. The dependency of `basic.py` on itself suggests a fundamental or recursive relationship within the core logic.
</details>
<details>
<summary>Image 11 Details</summary>

### Visual Description
## Data Table: File Formalization Status
### Overview
The image presents a data table detailing the formalization status of various files, including their paths, model names, formalization status, sync status, and whether they can or need to be formalized.
### Components/Axes
The table has the following columns:
* **Path:** File path (e.g., `basic_.py`, `basic.py`).
* **Model Name:** Name of the model associated with the file (e.g., `basic_.iml`, `basic.iml`).
* **Formalization Status:** Status of the file's formalization (e.g., `unknown`, `transparent`).
* **Sync Status:** Status of synchronization (e.g., `never_formalized`, `source_modified`, `dependency_changed`).
* **Can Formalize:** Indicates whether the file can be formalized (✓ for yes, X for no).
* **Needs Formalization:** Indicates whether the file needs to be formalized (✓ for yes).
### Detailed Analysis
The table's contents:
| Path | Model Name | Formalization Status | Sync Status | Can Formalize | Needs Formalization |
| :------------------------------- | :----------- | :------------------- | :----------------- | :------------ | :------------------ |
| basic\_.py | basic\_.iml | unknown | never\_formalized | ✓ | ✓ |
| basic.py | basic.iml | transparent | source\_modified | X | ✓ |
| math\_ops.py | math\_ops.iml | transparent | dependency\_changed | X | ✓ |
| utils/helpers.py | helpers.iml | transparent | dependency\_changed | X | ✓ |
| advanced/geometry/\_\_init\_\_.py | init.iml | unknown | never\_formalized | ✓ | ✓ |
| advanced/geometry/shapes.py | shapes.iml | unknown | never\_formalized | X | ✓ |
| utils/\_\_init\_\_.py | init.iml | unknown | never\_formalized | X | ✓ |
| advanced/\_\_init\_\_.py | init.iml | unknown | never\_formalized | ✓ | ✓ |
| \_\_init\_\_.py | init.iml | unknown | never\_formalized | ✓ | ✓ |
### Key Observations
* Files with `transparent` formalization status cannot be formalized.
* All files need formalization.
* Files with `unknown` formalization status and `never_formalized` sync status can be formalized.
### Interpretation
The data suggests a system where certain files are either already formalized (`transparent`) or require formalization. The sync status provides additional context, indicating whether the file has been modified or has dependency changes. The "Can Formalize" and "Needs Formalize" columns highlight the actions required for each file. The consistent "Needs Formalization" status across all files suggests that the formalization process is incomplete or ongoing for the entire set of files listed. The files with "transparent" formalization status and "X" in the "Can Formalize" column likely represent files that have already been formalized and do not require further action.
</details>
Figure 15: Removal of an upstream dependency.
<details>
<summary>Image 12 Details</summary>

### Visual Description
## Diagram: Merged Dependencies of Python Modules
### Overview
The image is a diagram illustrating the merged dependencies between various Python modules. The diagram consists of rectangular nodes representing Python modules, with arrows indicating dependencies between them. The nodes are colored differently to highlight specific relationships or categories. The diagram is split into two similar structures, one above the other.
### Components/Axes
* **Nodes:** Represent Python modules. Each node contains the module's name and path.
* Format: `PythonModule(name='module_name', path='module/path.py')`
* **Arrows:** Indicate dependencies. An arrow from Module A to Module B means Module A depends on Module B.
* **Colors:**
* White: Initial modules.
* Yellow: `utils.helpers` module in the top structure.
* Red: `math.ops` module in the top structure.
* Green: `utils.helpers`, `math.ops`, and `basic` modules in the bottom structure.
* **Title:** "Figure 14: Merged dependencies."
### Detailed Analysis
**Top Structure:**
* **Row 1 (Top-Left to Top-Right):**
* `PythonModule(name='__init__', path='__init__.py')` (White)
* `PythonModule(name='advanced.__init__', path='advanced/__init__.py')` (White)
* `PythonModule(name='utils.__init__', path='utils/__init__.py')` (White)
* `PythonModule(name='advanced.geometry.shapes', path='advanced/geometry/shapes.py')` (White)
* `PythonModule(name='advanced.geometry.__init__', path='advanced/geometry/__init__.py')` (White)
* **Row 2 (Center):**
* `PythonModule(name='utils.helpers', path='utils/helpers.py')` (Yellow)
* **Row 3 (Center):**
* `PythonModule(name='math.ops', path='math.ops.py')` (Red)
* **Dependencies (Top Structure):**
* `utils.__init__.py` -> `utils.helpers.py`
* `advanced.geometry.shapes.py` -> `utils.helpers.py`
* `utils.helpers.py` -> `math.ops.py`
**Bottom Structure:**
* **Row 1 (Top-Left to Top-Right):**
* `PythonModule(name='__init__', path='__init__.py')` (White)
* `PythonModule(name='advanced.__init__', path='advanced/__init__.py')` (White)
* `PythonModule(name='utils.__init__', path='utils/__init__.py')` (White)
* `PythonModule(name='advanced.geometry.shapes', path='advanced/geometry/shapes.py')` (White)
* `PythonModule(name='advanced.geometry.__init__', path='advanced/geometry/__init__.py')` (White)
* **Row 2 (Center):**
* `PythonModule(name='utils.helpers', path='utils/helpers.py')` (Green)
* **Row 3 (Bottom-Left to Bottom-Right):**
* `PythonModule(name='math.ops', path='math.ops.py')` (Green)
* `PythonModule(name='basic', path='basic.py')` (Green)
* **Dependencies (Bottom Structure):**
* `utils.__init__.py` -> `utils.helpers.py`
* `advanced.geometry.shapes.py` -> `utils.helpers.py`
* `utils.helpers.py` -> `math.ops.py`
* `utils.helpers.py` -> `basic.py`
### Key Observations
* The diagram illustrates the dependencies between Python modules.
* The top structure shows `utils.helpers.py` (Yellow) depending on `math.ops.py` (Red).
* The bottom structure shows `utils.helpers.py` (Green) depending on both `math.ops.py` (Green) and `basic.py` (Green).
* The `__init__.py` modules and `advanced.geometry` modules do not have any dependencies shown in the diagram.
* The color change from Yellow/Red to Green in the bottom structure suggests a change or evolution in the module dependencies.
### Interpretation
The diagram depicts how the dependencies of the `utils.helpers` module have changed. In the top structure, `utils.helpers` depends on `math.ops`. However, in the bottom structure, `utils.helpers` depends on both `math.ops` and `basic`. This suggests that the functionality of `utils.helpers` has been extended to incorporate elements from the `basic` module, or that the `basic` module has been introduced as a new dependency. The color change might indicate a refactoring or enhancement of the module architecture. The merging of dependencies likely refers to the integration of the `basic` module into the dependency chain of `utils.helpers`.
</details>
To ensure robustness during active development, the server uses graceful degradation. If parsing or dependency errors occur, it preserves the last known-good dependency graph and formal models, tracks errors for debugging, and automatically recovers once inconsistencies are resolved.
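A minimal Python sketch of this graceful-degradation strategy (the class, method, and parser names here are hypothetical, not CodeLogician's actual API):

```python
def parse_modules(source_files):
    """Toy stand-in for the real parser: fails on files containing 'syntax error'."""
    for name, text in source_files.items():
        if "syntax error" in text:
            raise ValueError(f"parse error in {name}")
    return dict(source_files)  # a trivial stand-in for a dependency graph

class FormalizationServer:
    def __init__(self):
        self.last_good_graph = None  # last dependency graph that parsed cleanly
        self.errors = []             # tracked for debugging

    def refresh(self, source_files):
        try:
            graph = parse_modules(source_files)
        except Exception as exc:
            self.errors.append(exc)      # record the error, but keep serving
            return self.last_good_graph  # degrade to the last known-good state
        self.errors.clear()              # sources are consistent again
        self.last_good_graph = graph     # automatic recovery
        return graph
```

The key design point is that a transient parse failure never discards previously validated state: clients keep seeing the last consistent graph until the sources are repaired.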
## 7 Performance
## 7.1 Dataset
We created 50 models with the following characteristics:
## · State Machine Architecture
Each model defines explicit states and manages transitions between those states via step functions.
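As an illustration, a toy model of this shape might look like the following (the states, fields, and thresholds are invented for exposition and are not taken from the benchmark):

```python
from dataclasses import dataclass
from enum import Enum

class Mode(Enum):
    IDLE = "idle"
    ACTIVE = "active"
    FAULT = "fault"

@dataclass
class SystemState:
    mode: Mode = Mode.IDLE
    tick: int = 0          # temporal evolution via an explicit step counter
    temperature: int = 20

def step(state: SystemState, reading: int) -> SystemState:
    """One transition of the state machine, driven by a sensor reading."""
    if reading > 100:  # safety override trumps normal logic
        return SystemState(Mode.FAULT, state.tick + 1, reading)
    next_mode = Mode.ACTIVE if reading > 50 else Mode.IDLE
    return SystemState(next_mode, state.tick + 1, reading)
```

The benchmark models are substantially larger, but share this skeleton: explicit state, a pure step function, and guard conditions that partition the input space.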
## · Complex Multi-Entity Systems
All models represent systems with multiple interacting components, for example:
- BGP path vector routing: speakers, neighbors, routes
- NPC pathfinding: characters, obstacles, terrain
- Smart warehouse auto-controller: robots, inventory, orders, zones
- Options market maker: options, positions, hedges
- Nuclear reactor safety system: control rods, safety systems, parameters
- TLS handshake protocol: client, server, certificates, sessions
## · Temporal Evolution
Models explicitly track state changes over time using counters and/or timestamps, and evolve via simulation step functions.
## · Decision Logic
Each model implements complex decision-making based on current state and parameters (e.g. routing decisions, pathfinding, task assignment, hedging, safety checks, protocol steps). This logic often exhibits characteristic interaction patterns, including:
- *Override patterns*, where special conditions trump normal logic except in specific subcases
- *Dead-zone patterns*, where integer boundary combinations create gaps leading to undefined behavior or unexpected fallback states
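A hypothetical decision function illustrating both interaction patterns (all names and thresholds are invented):

```python
def decide(pressure: int, override_on: bool, maintenance: bool) -> str:
    """Toy decision logic exhibiting an override pattern and a dead zone."""
    # Override pattern: the emergency override trumps normal logic,
    # except in the maintenance subcase.
    if override_on and not maintenance:
        return "shutdown"
    # Dead-zone pattern: the integer boundaries below leave 90..99
    # unhandled, falling through to an unexpected fallback branch.
    if pressure >= 100:
        return "vent"
    if pressure <= 89:
        return "normal"
    return "fallback"  # the dead zone: 90 <= pressure <= 99
```

Region decomposition surfaces exactly such gaps: the fallback region appears as an explicit constraint set (90 ≤ pressure ≤ 99, override inactive or in maintenance) rather than an invisible fall-through.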
## · Safety and Correctness Properties
Many models include explicit invariants or constraints that must always hold, such as safety limits, risk limits, or convergence conditions.
All models are rooted in real-world systems that exist in production environments. Each model includes three questions that probe different aspects of state machine understanding, such as enumerating distinct behavioral scenarios, identifying conditions under which specific behaviors occur, and determining whether safety or correctness properties can be violated.
We evaluate two approaches to answering these questions:
- LLM-only: The LLM analyzes the Python source code directly and answers the questions based on its own reasoning over the code.
- CodeLogician (LLM + ImandraX): The model is translated into IML (Imandra Modeling Language), and ImandraX performs formal region decomposition and verification. The LLM then synthesizes answers from these formal analysis results.
## 7.2 Metrics
We design 7 metrics to evaluate the performance of LLM-only reasoning compared to CodeLogician . Each metric is designed to assess a specific aspect of reasoning about complex state machines and decision logic.
## 7.2.1 State Space Estimation Accuracy
This metric measures how accurately the LLM estimates the number of distinct behavioral scenarios or regions in the state space. The ground truth comes from the exact region count produced by CodeLogician's decomposition analysis.
$$\text{Formula:} \quad \text{score} = \frac{1}{1 + \log_2(\mathit{diff} + 1)}, \quad \text{where } \mathit{diff} = |n_{\mathrm{llm}} - n_{\mathrm{decomp}}|$$
Design rationale: The logarithmic penalty reflects that being off by orders of magnitude (e.g., estimating 10 scenarios when there are 1000) is fundamentally more problematic than small absolute differences. The formula yields: perfect match (diff=0) → score=1.0, off by 1 → score=0.5, off by 3 → score=0.33, and converges to 0 as the difference grows. When the LLM provides only qualitative descriptions without any quantification, we assign score=0.0. For extraction, we use the following rules: (a) explicit numbers are used directly, (b) ranges use the midpoint, (c) approximations extract the implicit number, (d) qualitative descriptions with implicit combinatorial reasoning (e.g., '4 × 5 × 3 combinations') are computed, and (e) pure qualitative statements with no extractable quantity are marked as unknown.
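The scoring formula is a direct computation; a Python transcription:

```python
import math

def state_space_score(n_llm: int, n_decomp: int) -> float:
    """Score = 1 / (1 + log2(diff + 1)), where diff is the absolute
    error between the LLM's estimate and the decomposition count."""
    diff = abs(n_llm - n_decomp)
    return 1.0 / (1.0 + math.log2(diff + 1))
```

Consistent with the design rationale, a perfect match scores 1.0, an off-by-one estimate scores 0.5, and the score decays only logarithmically in the absolute error, so order-of-magnitude misses dominate the penalty.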
## 7.2.2 Outcome Precision
This metric evaluates the precision of constraints and conditions in the LLM's answer compared to qualitative descriptions. It measures whether the LLM identifies exact constraints that determine system behavior.
## Scoring Rubric:
- 1.0 : Exact match (both value and operator are correct).
- 0.8 : Value is correct, but the operator is approximately equivalent (e.g., ≥ 20 vs. > 19).
- 0.5 : Value is approximately correct within a reasonable range (e.g., ≥ 20 vs. ≥ 18).
- 0.2 : Qualitative but directionally correct (e.g., 'high' when the correct constraint is > 620).
- 0.0 : Incorrect or missing.
Design rationale: This metric distinguishes between responses that capture exact conditions versus vague qualitative descriptions. The graduated scoring recognizes that approximate answers are more useful than purely qualitative ones, even if not perfectly precise.
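The rubric can be read as a category-to-score lookup; the category labels below are our own names for the rubric rows (the judge described in Section 7.3.2 assigns the category):

```python
# Outcome Precision rubric as a lookup table (labels are illustrative).
OUTCOME_PRECISION = {
    "exact_match": 1.0,            # value and operator both correct
    "equivalent_operator": 0.8,    # e.g. >= 20 vs. > 19
    "approximate_value": 0.5,      # e.g. >= 20 vs. >= 18
    "directionally_correct": 0.2,  # e.g. 'high' vs. > 620
    "incorrect_or_missing": 0.0,
}
```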
## 7.2.3 Direction Accuracy
This metric assesses whether the LLM reaches conclusions in the correct direction, particularly for yes/no questions about system properties or behaviors. For example, does the LLM correctly identify that certain scenarios can violate a safety property, or does it incorrectly claim the property always holds?
## Scoring Rubric:
- 1.0 : Completely correct conclusion with sound reasoning.
- 0.75 : Correct conclusion but with flawed or incomplete reasoning.
- 0.5 : Partially correct.
- 0.25 : Incorrect conclusion but identifies some relevant factors.
- 0.0 : Completely incorrect.
Design rationale: Getting the fundamental direction of an answer wrong (e.g., claiming a system is safe when it has critical vulnerabilities) is a severe error in code understanding. This metric prioritizes correctness of the conclusion while also rewarding sound reasoning processes.
## 7.2.4 Coverage Completeness
This metric measures the proportion of actual decision scenarios or behavioral regions that the LLM identifies in its analysis. It evaluates how thoroughly the LLM explores the state space.
## Scoring Approach (Highest Applicable Method).
- (a) If the LLM provides explicit scenario counts, compare them directly to the decomposition count.
- (b) If the LLM enumerates specific scenarios, count the number of scenarios enumerated.
- (c) If the LLM categorizes outcomes, compute the score as $\frac{\#\,\text{categories identified}}{\#\,\text{categories in decomposition}}$.
- (d) If there is only a qualitative acknowledgment of multiplicity, assign a score of 0.25.
- (e) If there is no meaningful coverage, assign a score of 0.0.
Design rationale: Complete coverage is essential for sound verification. Missing scenarios can hide critical bugs or edge cases. We use multiple scoring methods to accommodate different response styles, always choosing the method most favorable to the LLM to avoid penalizing valid reasoning approaches.
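One possible implementation of this highest-applicable-method cascade (the exact handling of counts in methods (a) and (b), such as capping at the decomposition count, is our assumption):

```python
def coverage_score(n_decomp: int,
                   explicit_count=None,
                   enumerated=None,
                   categories_found=None, categories_total=None,
                   acknowledges_multiplicity=False) -> float:
    """Coverage Completeness: evaluate every applicable method (a)-(e)
    and keep the one most favorable to the LLM."""
    candidates = []
    if explicit_count is not None:                    # (a) explicit counts
        candidates.append(min(explicit_count, n_decomp) / n_decomp)
    if enumerated is not None:                        # (b) enumerated scenarios
        candidates.append(min(len(enumerated), n_decomp) / n_decomp)
    if categories_found is not None and categories_total:
        candidates.append(categories_found / categories_total)  # (c) categories
    if acknowledges_multiplicity:                     # (d) qualitative only
        candidates.append(0.25)
    return max(candidates, default=0.0)               # (e) no coverage at all
```

Taking the maximum over the applicable methods realizes the "most favorable to the LLM" rule stated in the design rationale.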
## 7.2.5 Control Flow Understanding
This metric assesses whether the LLM correctly captures the structure of control flow, including branches, guards, overrides, and short-circuit behavior. We evaluate the following four aspects:
- *Precedence of conditions*: Correct handling of logical precedence (e.g., AND before OR, explicit priority ordering).
- *Branching structure*: Correct identification of control structures such as if-else-if chains versus independent conditional checks.
- *Short-circuit behavior*: Recognition of early returns, guard clauses, or other forms of short-circuit evaluation.
- *Override logic*: Identification of safety or exception-handling logic that supersedes normal operation.
Scoring Formula. The score is computed as:
$$\text{score} = \frac{\#\text{correct} + 0.5 \times \#\text{partial}}{4}.$$
Design Rationale. Control flow structure is fundamental to program behavior. Misunderstanding whether conditions are evaluated sequentially (with short-circuiting) versus independently, or misinterpreting logical precedence, leads to incorrect analysis. Each aspect is therefore assessed independently and aggregated into a single score, with partial credit awarded to reflect nuanced but incomplete understanding.
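As a quick sanity check, the formula transcribes directly:

```python
def control_flow_score(correct: int, partial: int) -> float:
    """Aggregate over the four aspects (precedence, branching,
    short-circuiting, override logic), with half credit for
    partially correct aspects."""
    assert 0 <= correct + partial <= 4
    return (correct + 0.5 * partial) / 4
```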
## 7.2.6 Edge Case Detection
This metric evaluates how many rare or unexpected edge cases the LLM identifies in its analysis. An edge case is defined as any behavioral region exhibiting at least one of the following properties:
- Complex conjunctive conditions : Three or more conjunctive constraints combined in a single condition.
- Exact boundary conditions : Equality constraints at precise thresholds (e.g., x = 10 rather than x > 10).
- Negation of the common case : Scenarios such as sensor failure, timeout, or other exceptional modes.
- Counterintuitive outcomes : Behaviors that contradict naive or intuitive expectations.
- Safety-critical violations : Invariant violations, error states, or other safety-relevant failures.
Scoring Formula. The score is computed as:
$$\text{score} = \frac{\#\,\text{edge cases identified by the LLM}}{\#\,\text{total edge cases in the decomposition}}.$$
Design Rationale. Edge cases are where defects most commonly occur. Real-world systems tend to fail at boundary conditions, under rare combinations of constraints, or in situations that violate designer assumptions. This metric therefore rewards systematic identification of such cases. The definition is intentionally broad in order to capture multiple forms of non-obvious or unexpected behavior.
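A sketch of how a decomposition region might be classified against the five properties above (the encoding of a region as a dict of constraint tuples and flags is purely illustrative, not ImandraX's actual output format):

```python
def is_edge_case(region: dict) -> bool:
    """A region is an edge case if it exhibits at least one of the
    five properties listed above."""
    conjuncts = region.get("conjuncts", [])  # list of (operator, value) pairs
    return (
        len(conjuncts) >= 3                              # complex conjunction
        or any(op == "==" for op, _ in conjuncts)        # exact boundary condition
        or region.get("negates_common_case", False)      # e.g. sensor failure, timeout
        or region.get("counterintuitive", False)         # contradicts naive expectation
        or region.get("violates_invariant", False)       # safety-critical violation
    )
```

The metric's score is then simply the fraction of such regions that the LLM's answer mentions.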
## 7.2.7 Decision Boundary Clarity
This metric measures whether the LLM identifies the exact constraints that determine decision boundaries in the system. Each region produced by decomposition is defined by Boolean constraints on inputs, such as:
- deviation ≥ 80 cm,
- state of charge < 20%,
- connection count ≥ 3,
- temperature > 100 °C.
These constraints may take the form of inequalities, equalities, or logical combinations thereof.
Scoring Formula. The score is computed as:
$$\text{score} = \frac{\#\,\text{constraints identified by the LLM}}{\#\,\text{constraints in the decomposition}}.$$
Design Rationale. Decision boundaries represent critical points in the state space at which system behavior changes qualitatively. Accurately identifying these constraints is essential for precise reasoning about when different behaviors occur. Qualitative descriptions such as 'low battery' or 'high temperature' are insufficient for formal analysis and are therefore penalized by this metric.
Applicability Note. If a metric is not applicable or irrelevant to a specific question (e.g., decision boundaries are requested for a system that has none), that metric is excluded from the evaluation for that question.
Overall Evaluation Context. Collectively, these seven metrics capture complementary dimensions of how LLM-only reasoning deviates from CodeLogician's formally grounded analysis. Together, they provide a multifaceted assessment of the value added by formal methods tools. A perfect score of 1.0 across all metrics would indicate that an LLM can achieve the same level of rigor and completeness as formal analysis without assistance. The gaps revealed by these metrics quantify the specific ways in which formal decomposition and verification extend beyond unaided code reading and natural-language reasoning.
## 7.3 Evaluation
Our evaluation proceeds in two stages: answer generation and answer evaluation.
## 7.3.1 Answer Generation
We obtain answers from five frontier models: GPT-5.2, Grok Code Fast 1, Claude Sonnet 4.5, Claude Opus 4.5, and Gemini 3 Pro. For the LLM-only condition, each model analyzes the Python source code directly and answers the three questions per model based on its own reasoning. For the CodeLogician condition, answers are derived from ImandraX's formal region decomposition and verification results, which provide exact scenario counts, precise constraint boundaries, and proof outcomes (proven or refuted with counterexamples). The CodeLogician answers serve as ground truth for evaluation.
## 7.3.2 Answer Evaluation (LLM-as-a-Judge)
We employ the LLM-as-a-judge methodology to evaluate how well LLM-only answers align with CodeLogician answers. Four evaluator models (Gemini 2.5 Flash, Claude Haiku 4.5, GPT-4o Mini, and Grok 4 Fast) independently score each LLM response according to the seven metric definitions, providing detailed reasoning for each score. We aggregate these evaluations by averaging across the four evaluators to reduce bias from any single judge. While LLM-based evaluation introduces some variability, we plan to complement these results with deterministic methods, such as test case generation from region decomposition results, in the future.
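The aggregation step amounts to a per-metric mean over the judges; a minimal sketch (judge and metric names are illustrative):

```python
from statistics import mean

def aggregate_scores(judge_scores: dict) -> dict:
    """Average each metric over the evaluator models to reduce
    single-judge bias. Input maps judge name -> {metric: score}."""
    metrics = next(iter(judge_scores.values())).keys()
    return {m: mean(j[m] for j in judge_scores.values()) for m in metrics}
```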
All data, code, and scripts to reproduce these results are publicly available at https://github.com/imandra-ai/code-logic-bench.
## 7.4 Results
## 7.4.1 Overall Performance
State Space Estimation Accuracy (0.186). This metric reveals the most dramatic performance gap. The low scores reflect both severe underestimation (when LLMs attempt quantification but miss combinatorial explosion by orders of magnitude) and a complete inability to quantify, where models resort to vague statements such as 'many scenarios' without attempting enumeration. This failure to grasp state space size has direct consequences: it leads to underestimation of verification effort, inadequate test budgets, and false confidence that manual analysis has covered sufficient ground. In safety-critical systems, such miscalibration can mean the difference between rigorous validation and deploying systems with untested corner cases.
In one environmental monitoring system, when asked to enumerate distinct response scenarios, the model focused on high-level alert categories. In contrast, formal decomposition revealed a combinatorial explosion arising from Boolean constraint boundaries, where different combinations of sensor drift conditions, timing windows, measurement ranges, and pollutant-specific limits produce orders of magnitude more behavioral regions than surface-level structure suggests.
Outcome Precision (0.613). This metric exposes substantial imprecision in how LLMs specify constraints. Mean scores indicate frequent reliance on qualitative descriptions or approximate values rather than exact specifications. Such imprecision introduces ambiguity into requirements documentation, test case generation, and safety audits. For example, stating 'temperature must not be too high' instead of 'temperature < 620 °C' prevents consistent implementation, hinders compliance verification, and leaves boundary conditions untested.
Direction Accuracy (0.635). Direction Accuracy measures whether LLMs reach correct conclusions, particularly for safety-critical properties where errors have real-world consequences. This metric uses graduated scoring from 0.0 (completely incorrect) to 1.0 (correct with sound reasoning), with partial credit awarded for identifying relevant factors despite an incorrect conclusion.
A mean score of 0.635 indicates that while LLMs often identify relevant aspects of system behavior, they frequently reach flawed conclusions or provide incomplete reasoning. The most severe failures occur when LLMs categorically misidentify whether safety properties hold. In one data protection system, formal verification proved that a critical privacy property always holds, yet the LLM incorrectly concluded that the property could be violated, misinterpreting how boundary conditions propagate through compliance checks. Similarly, in an autonomous driving system, the LLM claimed that a sensor reliability property could be violated under specific conditions, when verification showed it always holds.
These failures represent fundamental errors about system guarantees: the difference between confidently deploying a system and discovering critical vulnerabilities only after incidents occur.
Coverage Completeness (0.490). Coverage Completeness highlights a weak performance area, with LLMs identifying fewer than half of the actual decision scenarios. This systematic undercounting creates blind spots where critical bugs tend to hide. The gap arises because LLMs reason at categorical levels while missing how combinations of constraints multiply scenarios.
When asked to enumerate scenarios, models typically identify only the obvious, top-level categories visible in code structure. However, each category subdivides along Boolean constraint boundaries, where combinations of comparisons, timing windows, and state checks produce distinct behavioral regions. The resulting combinatorial interaction creates orders of magnitude more scenarios than surface-level reasoning suggests. This mirrors a common developer pitfall: believing that 'all cases' have been tested when only the obvious paths have been exercised.
Control Flow Understanding (0.746). Control Flow Understanding represents the strongest area of LLM performance. Models demonstrate relatively strong capability in interpreting branching logic, condition precedence, and short-circuit behavior. This is the only metric where mean performance exceeds 0.7, suggesting that LLMs can effectively infer common control-flow idioms through code analysis.
Nevertheless, complex override patterns and deeply nested conditionals remain challenging. Formal methods retain a key advantage by providing deterministic analysis that is independent of surface-level code structure or stylistic variation.
Edge Case Detection (0.597). Edge Case Detection shows that LLMs identify fewer than 60% of edge cases. Models tend to focus on obvious scenarios while missing counterintuitive behaviors. In monitoring systems, LLMs typically recognize high-level operational modes but overlook scenarios in which subsystems in non-standard states can still trigger critical actions through indirect pathways, such as quorum voting.
These counterintuitive interactions, where naive reasoning predicts one outcome but complex logical dependencies produce another, are precisely where production bugs most often emerge. Failing to identify such cases leaves test suites weakest exactly where failures are most likely to occur in the field.
Decision Boundary Clarity (0.695). Decision Boundary Clarity shows moderate performance. LLMs partially succeed in identifying constraints that influence system behavior but struggle to extract precise boundary conditions. Models often recognize that factors such as 'weather adjustments' or 'sensor drift' matter, yet fail to identify the exact constraint expressions where behavior changes, such as specific comparisons ( ≥ vs. > ), threshold values (79 vs. 80), or logical combinations.
LLMs also miss narrow 'dead zones' where small ranges of inputs trigger different logic, and how constraints vary across operational modes. This limitation is critical because implementations require exact Boolean conditions. Developers need precise constraint expressions to implement, test boundary behavior, and verify correctness. Region decomposition exhaustively enumerates these constraints, enabling precise implementation and comprehensive test case generation.
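A toy illustration of why the exact comparison operator matters; the threshold of 80 mirrors the 79-vs-80 example above, but the functions themselves are hypothetical:

```python
def alert_ge(reading: int) -> str:
    """Boundary written as >= 80 (hypothetical threshold)."""
    return "ALERT" if reading >= 80 else "OK"

def alert_gt(reading: int) -> str:
    """Boundary written as > 80; differs on exactly one input."""
    return "ALERT" if reading > 80 else "OK"

# The two variants disagree only at the boundary point itself, which
# is why a vague constraint description is not implementable.
for r in (79, 80, 81):
    print(r, alert_ge(r), alert_gt(r))
# 79 OK OK
# 80 ALERT OK   <- the single input where >= vs. > matters
# 81 ALERT ALERT
```

Region decomposition pins down these boundary points exactly, whereas informal descriptions leave the single disagreeing input ambiguous.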
These data are presented in Figures 17 and 18.
## Mean Score by Model
Figure 16: Mean score by model across seven metrics, relative to the scores when models were augmented with CodeLogician. Claude Opus 4.5 achieves the highest score (0.601), followed by GPT-5.2 (0.589). Despite these differences, all models perform substantially below the level achieved when augmented with formal methods.
<details>
<summary>Image 13 Details</summary>

### Visual Description
## Horizontal Bar Chart: Model Performance Comparison
### Overview
The image is a horizontal bar chart comparing the overall mean scores of several language models (LLMs) on a specific task, presumably related to code generation or understanding, as indicated by the "LLM + CodeLogician" label. The chart displays the performance of models from Anthropic, OpenAI, X-AI, and Google.
### Components/Axes
* **Y-axis (left)**: "Model" - Lists the names of the language models being compared.
* Models:
* anthropic/claude-opus-4.5
* openai/gpt-5.2
* anthropic/claude-sonnet-4.5
* x-ai/grok-code-fast-1
* google/gemini-3-pro-preview
* **X-axis (bottom)**: "Overall Mean Score" - Numerical scale ranging from 0 to 1, with increments of 0.1.
* **Bars**: Horizontal bars representing the overall mean score for each model. The bars are a light blue color.
* **Values**: Numerical values displayed at the end of each bar, indicating the exact overall mean score.
* **Vertical Dotted Line (right)**: A vertical dotted line at x=1, labeled "LLM + CodeLogician" in teal.
### Detailed Analysis
The chart presents the following data:
* **anthropic/claude-opus-4.5**: Score of 0.601. The bar extends to approximately 60% of the x-axis.
* **openai/gpt-5.2**: Score of 0.589. The bar extends to approximately 59% of the x-axis.
* **anthropic/claude-sonnet-4.5**: Score of 0.576. The bar extends to approximately 58% of the x-axis.
* **x-ai/grok-code-fast-1**: Score of 0.534. The bar extends to approximately 53% of the x-axis.
* **google/gemini-3-pro-preview**: Score of 0.532. The bar extends to approximately 53% of the x-axis.
### Key Observations
* The anthropic/claude-opus-4.5 model has the highest overall mean score (0.601).
* The google/gemini-3-pro-preview model has the lowest overall mean score (0.532).
* The scores are relatively close, with a range of approximately 0.07 between the highest and lowest scores.
### Interpretation
The chart provides a comparison of the overall performance of several language models. The anthropic/claude-opus-4.5 model appears to outperform the others in this specific evaluation. The "LLM + CodeLogician" label suggests that the models were evaluated on tasks that involve both language understanding and code-related abilities. The proximity of the scores suggests that all models are reasonably competent, but there are still performance differences. The vertical line at 1.0 indicates the maximum possible score, and all models fall significantly short of this maximum, suggesting room for improvement in all models.
</details>
## 7.4.2 Model Comparison by Metric
Figures 20 and 21 show how each model performed on individual metrics: Claude Opus 4.5 achieves the highest overall performance (0.601), leading in State Space Estimation Accuracy (0.222), Decision Boundary Clarity (0.763), Edge Case Detection (0.640), and Coverage Completeness (0.524). GPT-5.2 achieves the strongest performance in Control Flow Understanding (0.810) and Outcome Precision (0.659), indicating effective pattern recognition for common control-flow structures and relatively precise constraint specification.
Shared Limitations Across Models. Despite these relative differences, all evaluated models perform similarly on State Space Estimation Accuracy , with scores clustered in the 0.164-0.222 range. Even Claude Opus 4.5, which achieves the highest score on this metric, still falls far short of the precision achievable through formal decomposition. This tight clustering on the most challenging metric indicates that fundamental architectural limitations affect current LLMs in a broadly similar manner when operating without formal reasoning tools.
## Mean Score by Metric
Figure 17: Mean score by metric, relative to performance once augmented with CodeLogician.
<details>
<summary>Image 14 Details</summary>

### Visual Description
## Bar Chart: Metric Mean Scores
### Overview
The image is a horizontal bar chart comparing the mean scores of different metrics. The metrics are listed on the vertical axis, and the mean scores are displayed on the horizontal axis. A vertical dotted line indicates the performance of "LLM + CodeLogician".
### Components/Axes
* **Vertical Axis (Metric):** Lists the different metrics being evaluated.
* Control Flow Understanding
* Decision Boundary Clarity
* Direction Accuracy
* Outcome Precision
* Edge Case Detection
* Coverage Completeness
* State Space Estimation Accuracy
* **Horizontal Axis (Mean Score):** Represents the mean score, ranging from 0 to 1, with increments of 0.1.
* **Bars:** Horizontal bars represent the mean score for each metric. The bars are light blue.
* **Vertical Dotted Line:** A vertical dotted line at x=1, labeled "LLM + CodeLogician". The line is green.
### Detailed Analysis
The following are the mean scores for each metric, extracted from the bar chart:
* **Control Flow Understanding:** 0.746
* **Decision Boundary Clarity:** 0.695
* **Direction Accuracy:** 0.635
* **Outcome Precision:** 0.613
* **Edge Case Detection:** 0.597
* **Coverage Completeness:** 0.49
* **State Space Estimation Accuracy:** 0.186
The bars are arranged in descending order of mean score.
### Key Observations
* "Control Flow Understanding" has the highest mean score (0.746), while "State Space Estimation Accuracy" has the lowest (0.186).
* The mean scores generally decrease as you move down the chart, with a significant drop for "State Space Estimation Accuracy".
* The "LLM + CodeLogician" line is at the maximum score of 1.
### Interpretation
The bar chart provides a comparison of the performance of different metrics. The high score for "Control Flow Understanding" suggests that the system performs well in this area. The low score for "State Space Estimation Accuracy" indicates a potential area for improvement. The "LLM + CodeLogician" line at 1.0 suggests that this combined approach achieves perfect performance, at least according to the scale of this chart. The chart highlights the relative strengths and weaknesses of the system across different evaluation metrics.
</details>
Figure 18: Median score by metric, relative to performance once augmented with CodeLogician.
<details>
<summary>Image 15 Details</summary>

### Visual Description
## Horizontal Bar Chart: Median Score by Metric
### Overview
The image is a horizontal bar chart displaying the median scores for different metrics. The metrics are listed on the vertical axis, and the median scores are represented by the length of the horizontal bars. A vertical dotted line indicates the performance of "LLM + CodeLogician".
### Components/Axes
* **Title:** Median Score by Metric
* **Vertical Axis (Metric):**
* Control Flow Understanding
* Decision Boundary Clarity
* Direction Accuracy
* Outcome Precision
* Edge Case Detection
* Coverage Completeness
* State Space Estimation Accuracy
* **Horizontal Axis (Median Score):**
* Scale: 0 to 1
* Increments: 0.1
* **Vertical Dotted Line:** Labeled "LLM + CodeLogician" at x = 1
### Detailed Analysis
The chart presents median scores for seven different metrics. Each metric has a corresponding horizontal bar, with the length of the bar indicating the median score. The scores are as follows:
* **Control Flow Understanding:** 0.833
* **Decision Boundary Clarity:** 0.759
* **Direction Accuracy:** 0.783
* **Outcome Precision:** 0.665
* **Edge Case Detection:** 0.588
* **Coverage Completeness:** 0.457
* **State Space Estimation Accuracy:** 0.093
The "LLM + CodeLogician" performance is marked by a vertical dotted line at a median score of 1.
### Key Observations
* "Control Flow Understanding" has the highest median score (0.833).
* "State Space Estimation Accuracy" has the lowest median score (0.093).
* The scores vary significantly across the different metrics.
* All metrics score below the "LLM + CodeLogician" benchmark of 1.
### Interpretation
The chart provides a comparative analysis of the median scores for different metrics. The data suggests that "Control Flow Understanding" is the strongest area, while "State Space Estimation Accuracy" is the weakest. The vertical line representing "LLM + CodeLogician" serves as a benchmark, indicating the target performance level. The fact that all metrics score below this benchmark suggests areas for improvement. The large variance in scores across metrics indicates that performance is not uniform across all areas.
</details>
Figure 19: Score distributions across the seven metrics.
<details>
<summary>Image 16 Details</summary>

### Visual Description
## Box Plot: Score Distribution by Metric
### Overview
The image is a box plot illustrating the distribution of scores across various metrics. The plot displays the median, quartiles, and outliers for each metric, providing a visual comparison of their performance. A horizontal dashed line is present at the top of the chart at a score of 1.
### Components/Axes
* **Title:** Score Distribution by Metric
* **X-axis:** Metric, with the following categories:
* State Space Estimation
* Control Flow Understanding
* Edge Case Detection
* Decision Boundary
* Outcome Precision
* Direction Accuracy
* Coverage Completeness
* **Y-axis:** Score, ranging from 0 to 1, with tick marks at intervals of 0.2.
* **Box Plot Elements:** Each box plot represents the interquartile range (IQR), with a line indicating the median. Whiskers extend from the box to show the range of the data, and outliers are plotted as individual points.
* **Horizontal Line:** A dashed horizontal line is present at the score of 1.
### Detailed Analysis
Here's a breakdown of the score distribution for each metric:
* **State Space Estimation:**
* The box extends from approximately 0.0 to 0.25.
* The median is around 0.1.
* There are several outliers above the upper whisker, ranging from approximately 0.65 to 1.0.
* **Control Flow Understanding:**
* The box extends from approximately 0.6 to 1.0.
* The median is around 0.83.
* **Edge Case Detection:**
* The box extends from approximately 0.4 to 1.0.
* The median is around 0.57.
* **Decision Boundary:**
* The box extends from approximately 0.5 to 0.93.
* The median is around 0.76.
* **Outcome Precision:**
* The box extends from approximately 0.5 to 0.95.
* The median is around 0.68.
* **Direction Accuracy:**
* The box extends from approximately 0.75 to 1.0.
* The median is around 0.82.
* **Coverage Completeness:**
* The box extends from approximately 0.15 to 0.75.
* The median is around 0.44.
### Key Observations
* **State Space Estimation** has the lowest scores and the most outliers.
* **Control Flow Understanding, Direction Accuracy** have the highest median scores and less variance.
* **Coverage Completeness** has a wide interquartile range, indicating variability in scores.
### Interpretation
The box plot provides a comparative view of the performance of different metrics. State Space Estimation appears to be the weakest performing metric, with low scores and several outliers, suggesting inconsistent results. Control Flow Understanding and Direction Accuracy show consistently high scores. The wide interquartile range for Coverage Completeness suggests that the performance of this metric varies significantly. The horizontal line at a score of 1 may represent a target or ideal performance level.
</details>
Impact of Formal Methods Augmentation. The overall performance gap of 40-47 percentage points, comparing LLM-only performance (0.53-0.60) against perfect scores (1.0) achieved with region decomposition and verification, quantifies the value added by formal methods tools such as ImandraX when augmenting LLM-based code analysis.
Reproducibility. All raw data and scripts to reproduce these results are available at https:// github.com/imandra-ai/code-logic-bench .
Figure 20: Radar chart showing model performance in the seven metrics tested. Claude Opus 4.5 leads in State Space Estimation and Decision Boundary, while GPT 5.2 outperforms the others on Control Flow Understanding. Despite these variations, all models perform significantly worse than when augmented with CodeLogician.
<details>
<summary>Image 17 Details</summary>

### Visual Description
## Radar Chart: Model Comparison Across Metrics
### Overview
The image is a radar chart comparing the performance of several language models (LLMs) across different metrics related to code understanding and generation. The chart visualizes the strengths and weaknesses of each model in areas such as edge case detection, state space estimation, control flow understanding, coverage completeness, direction accuracy, outcome precision, and decision boundary handling. A larger area covered by a model's line indicates better performance across the metrics.
### Components/Axes
* **Title:** Model Comparison Across Metrics
* **Subtitle:** Edge Case Detection
* **Axes (Metrics):**
* State Space Estimation
* Control Flow Understanding
* Coverage Completeness
* Direction Accuracy
* Outcome Precision
* Decision Boundary
* Edge Case Detection
* **Radial Scale:** 0 to 1, with increments of 0.2 (0, 0.2, 0.4, 0.6, 0.8, 1)
* **Legend (Top-Right):**
* Purple: anthropic/claude-opus-4.5
* Orange: openai/gpt-5.2
* Red: anthropic/claude-sonnet-4.5
* Teal: x-ai/grok-code-fast-1
* Blue: google/gemini-3-pro-preview
* Dashed Green: LLM + CodeLogician
### Detailed Analysis
* **anthropic/claude-opus-4.5 (Purple):**
* Trend: Strongest overall profile; like all models, weakest on State Space Estimation.
* Approximate Values: State Space Estimation (0.22), Control Flow Understanding (0.79), Coverage Completeness (0.52), Direction Accuracy (0.66), Outcome Precision (0.66), Decision Boundary (0.76), Edge Case Detection (0.64)
* **openai/gpt-5.2 (Orange):**
* Trend: Close to anthropic/claude-opus-4.5, leading on Control Flow Understanding.
* Approximate Values: State Space Estimation (0.17), Control Flow Understanding (0.81), Coverage Completeness (0.51), Direction Accuracy (0.61), Outcome Precision (0.66), Decision Boundary (0.74), Edge Case Detection (0.58)
* **anthropic/claude-sonnet-4.5 (Red):**
* Trend: Slightly below the two leaders on most metrics.
* Approximate Values: State Space Estimation (0.19), Control Flow Understanding (0.73), Coverage Completeness (0.45), Direction Accuracy (0.66), Outcome Precision (0.62), Decision Boundary (0.69), Edge Case Detection (0.63)
* **x-ai/grok-code-fast-1 (Teal):**
* Trend: Lowest profile overall, alongside google/gemini-3-pro-preview.
* Approximate Values: State Space Estimation (0.16), Control Flow Understanding (0.71), Coverage Completeness (0.46), Direction Accuracy (0.63), Outcome Precision (0.59), Decision Boundary (0.63), Edge Case Detection (0.55)
* **google/gemini-3-pro-preview (Blue):**
* Trend: Similar to x-ai/grok-code-fast-1.
* Approximate Values: State Space Estimation (0.18), Control Flow Understanding (0.69), Coverage Completeness (0.44), Direction Accuracy (0.62), Outcome Precision (0.58), Decision Boundary (0.66), Edge Case Detection (0.61)
* **LLM + CodeLogician (Dashed Green):**
* Trend: Represents the outer boundary or ideal performance.
* Value: All metrics are at the maximum value of 1.
### Key Observations
* The models anthropic/claude-opus-4.5, openai/gpt-5.2, and anthropic/claude-sonnet-4.5 show the highest performance across all metrics.
* x-ai/grok-code-fast-1 and google/gemini-3-pro-preview have similar performance profiles, slightly below the top three.
* The "LLM + CodeLogician" line represents a theoretical upper bound or target performance level.
### Interpretation
The radar chart provides a visual comparison of different LLMs in the context of code-related tasks. The proximity of the models' lines to the outer "LLM + CodeLogician" boundary indicates their overall effectiveness. The chart highlights the relative strengths and weaknesses of each model, allowing for informed decisions about which model to use for specific applications. The tight clustering of the top three models suggests a high degree of competitiveness in the LLM landscape. The chart suggests that there is still room for improvement in all models, as none of them reach the ideal performance represented by the "LLM + CodeLogician" line.
</details>
## 8 Case Studies
## 8.1 LSE GTT Order Expiry
This example concerns an extended (by Imandra) version of the London Stock Exchange trading system specification. The Guide to the Trading System (MIT201) [13] specifies:
'Any GTT [Good Till Time] orders with an expiry time during any auction call phase will not be expired until after uncrossing has completed and are therefore eligible to participate in that uncrossing.'
The model implements a state machine with two trading modes (Continuous and Auction Call), GTT order lifecycle management, and an auction protection mechanism. The auction protection logic is implemented in the following Python function, with the full program available in Appendix A.
```
def maybe_expire_gtt_order_v2(state: State) -> State:
    """
    Handle GTT order expiry with auction protection.
    """
```
## Model Comparison Across All Metrics
Figure 21: Chart showing model performance in the seven metrics tested. Scores are relative to those of the LLMs when augmented with CodeLogician.
<details>
<summary>Image 18 Details</summary>

### Visual Description
## Bar Chart: LLM Performance on Code-Related Metrics
### Overview
The image is a bar chart comparing the performance of five different Large Language Models (LLMs) across seven code-related metrics. The chart displays the scores (ranging from 0 to 1) achieved by each model on each metric. A horizontal line indicates a baseline performance level labeled "LLM + CodeLogician".
### Components/Axes
* **X-axis:** Metrics: State Space Estimation, Control Flow Understanding, Edge Case Detection, Decision Boundary, Outcome Precision, Direction Accuracy, Coverage Completeness.
* **Y-axis:** Score (0-1), with tick marks at intervals of 0.1 from 0 to 1.1.
* **Legend (top-left):**
* Purple: anthropic/claude-opus-4.5
* Red: anthropic/claude-sonnet-4.5
* Blue: google/gemini-3-pro-preview
* Orange: openai/gpt-5.2
* Teal: x-ai/grok-code-fast-1
* **Horizontal Line:** A dashed teal line at approximately y=1.0, labeled "LLM + CodeLogician".
### Detailed Analysis
**1. State Space Estimation:**
* anthropic/claude-opus-4.5 (Purple): 0.22
* anthropic/claude-sonnet-4.5 (Red): 0.19
* google/gemini-3-pro-preview (Blue): 0.18
* openai/gpt-5.2 (Orange): 0.17
* x-ai/grok-code-fast-1 (Teal): 0.16
**2. Control Flow Understanding:**
* anthropic/claude-opus-4.5 (Purple): 0.79
* anthropic/claude-sonnet-4.5 (Red): 0.73
* google/gemini-3-pro-preview (Blue): 0.69
* openai/gpt-5.2 (Orange): 0.81
* x-ai/grok-code-fast-1 (Teal): 0.71
**3. Edge Case Detection:**
* anthropic/claude-opus-4.5 (Purple): 0.64
* anthropic/claude-sonnet-4.5 (Red): 0.63
* google/gemini-3-pro-preview (Blue): 0.61
* openai/gpt-5.2 (Orange): 0.58
* x-ai/grok-code-fast-1 (Teal): 0.55
**4. Decision Boundary:**
* anthropic/claude-opus-4.5 (Purple): 0.76
* anthropic/claude-sonnet-4.5 (Red): 0.69
* google/gemini-3-pro-preview (Blue): 0.66
* openai/gpt-5.2 (Orange): 0.74
* x-ai/grok-code-fast-1 (Teal): 0.63
**5. Outcome Precision:**
* anthropic/claude-opus-4.5 (Purple): 0.66
* anthropic/claude-sonnet-4.5 (Red): 0.62
* google/gemini-3-pro-preview (Blue): 0.58
* openai/gpt-5.2 (Orange): 0.66
* x-ai/grok-code-fast-1 (Teal): 0.59
**6. Direction Accuracy:**
* anthropic/claude-opus-4.5 (Purple): 0.66
* anthropic/claude-sonnet-4.5 (Red): 0.66
* google/gemini-3-pro-preview (Blue): 0.62
* openai/gpt-5.2 (Orange): 0.61
* x-ai/grok-code-fast-1 (Teal): 0.63
**7. Coverage Completeness:**
* anthropic/claude-opus-4.5 (Purple): 0.52
* anthropic/claude-sonnet-4.5 (Red): 0.45
* google/gemini-3-pro-preview (Blue): 0.44
* openai/gpt-5.2 (Orange): 0.51
* x-ai/grok-code-fast-1 (Teal): 0.46
### Key Observations
* The "State Space Estimation" metric shows the lowest scores across all models.
* "Control Flow Understanding" and "Decision Boundary" generally have higher scores compared to other metrics.
* openai/gpt-5.2 (Orange) achieves the highest score in "Control Flow Understanding" with a value of 0.81.
* anthropic/claude-opus-4.5 (Purple) generally performs well across all metrics.
* None of the models reach the "LLM + CodeLogician" baseline (1.0) on any metric.
### Interpretation
The bar chart provides a comparative analysis of the performance of different LLMs on specific code-related tasks. The low scores in "State Space Estimation" suggest that this is a challenging area for these models. The higher scores in "Control Flow Understanding" and "Decision Boundary" indicate relative strengths in these areas. The fact that no model reaches the "LLM + CodeLogician" baseline suggests that there is still room for improvement in these LLMs' coding abilities. The performance differences between the models highlight the varying strengths and weaknesses of each architecture.
</details>
```
    if state.gtt_order is None:
        return state
    expires_at = state.gtt_order.expires_at
    if isinstance(state.mode, AuctionCallMode):
        expires_at = max(expires_at, state.mode.uncross_at + 1)
    if state.time >= expires_at:
        return expire_order(state)
    else:
        return state
```
## 8.1.1 Verification
The critical property to verify is that auction uncross and order expiry events can never occur simultaneously. CodeLogician expressed this as a conflict detection predicate in IML:
```
let conflict_reachable msgs state k =
let final_state = run state (List.take k msgs) in
final_state.auction_event = Uncrossed && final_state.order_event = Expired
```
The verification goal asserts that for all valid initial states and message sequences, `conflict_reachable` should return `false`. CodeLogician refuted the verification goal, producing a concrete counterexample:
```
let msgs = [Tick; Tick]
let state = {
time = 2699;
mode = AuctionCallMode {uncross_at = 2700};
gtt_order = Some {expires_at = 2700};
market_order_present_after_uncross = true;
market_order_extension_period = 1
}
```
This sequence of events is illustrated in Figure 22 (Appendix A).
The counterexample reveals a subtle interaction between two features: the auction protection logic extends order expiry to `uncross_at + 1`, while the market order extension feature delays the actual uncross to `uncross_at + extension_period`. When `extension_period = 1`, both events occur at the same tick.
This case demonstrates formal verification's ability to catch feature interaction bugs: the auction protection and market order extension features each work correctly in isolation, but their interaction was not properly considered. Traditional testing would likely miss this edge case, as it requires specific timing conditions that are difficult to anticipate manually.
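The timing coincidence behind the counterexample can be replayed with plain arithmetic; the names mirror the state fields above, but this sketch stands in for the full state machine in Appendix A:

```python
# Values taken from the counterexample above.
uncross_at = 2700
extension_period = 1      # market-order extension delays the uncross
expires_at = 2700         # GTT order's nominal expiry time

# Auction protection pushes expiry one tick past the *scheduled* uncross...
protected_expiry = max(expires_at, uncross_at + 1)   # 2701
# ...but the extension delays the *actual* uncross by the same amount.
actual_uncross = uncross_at + extension_period       # 2701

# Both events land on the same tick: the conflict is reachable.
print(protected_expiry == actual_uncross)  # -> True
```

The protection logic reasons about the scheduled uncross time while the extension changes the actual one, and the two-tick counterexample is the shortest trace exposing the mismatch.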
## 8.2 London Stock Exchange Fee Schedule
This case study applies CodeLogician to the London Stock Exchange Trading Services Price List, verifying fee calculation logic with mathematical certainty. For algorithmic trading firms, accurate fee prediction is essential: a fraction of a basis point can determine whether a strategy is profitable, and errors across millions of daily orders compound into significant revenue leakage or unexpected costs. Exchange fee schedules present challenges: tiered pricing, minimum floors, rebates, and surcharges can interact in complex ways, and specifications are published as prose documents rather than machine-readable formats. Traditional testing struggles to cover all combinations exhaustively.
## 8.2.1 Verified Invariants
CodeLogician formally verified two invariants against all possible inputs.
The first invariant, 'Non-Persistent Order Charge Isolation,' establishes that an Immediate or Cancel (IOC) or Fill or Kill (FOK) order costs exactly £0.01 more than an identical Day (DAY) order, due to the order entry charge. CodeLogician expressed this property in IML as:
```
fun (price : real) (quantity : int) (is_aggressive : bool) (is_hidden : bool)
    (package : package_type) (lps_tier : lps_tier_type) (security_type : security_type)
    (cumulative_value : real) ->
  calculate_total_cost price quantity true is_aggressive is_hidden package lps_tier
    security_type cumulative_value
  = calculate_total_cost price quantity false is_aggressive is_hidden package lps_tier
      security_type cumulative_value +. 0.01
```
CodeLogician proved this property holds for all combinations of inputs, demonstrating that the order entry charge logic is completely isolated from all other fee components.
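The flavor of this invariant can be illustrated with a toy Python stand-in for `calculate_total_cost` (the real function and its parameters live in the IML model above); note that the loop below only spot-checks a finite grid, whereas the IML proof covers all inputs:

```python
from decimal import Decimal

ORDER_ENTRY_CHARGE = Decimal("0.01")  # surcharge for IOC/FOK orders

def total_cost(notional: Decimal, rate_bp: Decimal, non_persistent: bool) -> Decimal:
    """Toy stand-in for calculate_total_cost: a basis-point execution
    fee plus the order entry charge for non-persistent orders."""
    fee = notional * rate_bp / Decimal(10_000)
    return fee + (ORDER_ENTRY_CHARGE if non_persistent else Decimal(0))

# Spot-check the invariant on a small grid. A loop can only sample
# inputs; the formal proof covers every combination exhaustively.
for notional in (Decimal(1_000), Decimal(250_000)):
    for rate in (Decimal("0.45"), Decimal("0.15")):
        assert total_cost(notional, rate, True) == \
            total_cost(notional, rate, False) + ORDER_ENTRY_CHARGE
print("invariant holds on sampled grid")
```

The contrast is the point: a passing grid search gives evidence, while the proof gives certainty that no other fee component ever interacts with the order entry charge.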
The second invariant, 'LPS Zero-Fee Execution,' guarantees that when the rate equals 0.00bp and the minimum charge equals £0.00, the execution fee is exactly zero:
```
fun (notional_value : real) ->
calculate_execution_fee notional_value 0.0 0.0 false = 0.0
```
This ensures that liquidity providers, who receive a 0.00bp rate with no minimum, are never accidentally charged due to floating-point errors or logic edge cases.
## 8.2.2 Region Decomposition
Region decomposition revealed 6 distinct behavioral regions in the execution fee function, arising from the combinatorial interaction of factors including hidden order premiums, minimum charge floors, and negative rebate rates. The regions are partitioned by whether the order is hidden (triggering the +0.25bp premium), whether the rate is non-negative, and whether the minimum charge floor dominates the calculated fee.
The analysis identified 'crossover notionals': trade sizes below which the minimum charge dominates the calculated fee. For example, a £1,000 trade at the standard 0.45bp rate yields a calculated fee of £0.045, but the minimum floor of £0.11 applies, resulting in an effective rate of 11bp, 24 times the stated rate. Such boundaries vary by pricing scheme: the crossover for Standard Tier 1 (0.45bp, 11p minimum) is £2,444.44, while Package 1 (0.15bp, 5p minimum) crosses over at £3,333.33.
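The crossover computation itself is simple arithmetic, and can be sketched directly from the figures quoted above (the function name is illustrative, not part of the verified model):

```python
from decimal import Decimal

def crossover_notional(rate_bp: Decimal, min_charge: Decimal) -> Decimal:
    """Trade size at which the calculated fee equals the minimum charge:
    solve notional * rate_bp / 10_000 = min_charge for notional."""
    return (min_charge * Decimal(10_000) / rate_bp).quantize(Decimal("0.01"))

# Standard Tier 1: 0.45bp rate, 11p minimum charge.
print(crossover_notional(Decimal("0.45"), Decimal("0.11")))  # -> 2444.44
# Package 1: 0.15bp rate, 5p minimum charge.
print(crossover_notional(Decimal("0.15"), Decimal("0.05")))  # -> 3333.33
```

Below these notionals, the effective rate exceeds the stated rate, which is exactly the regime the region decomposition surfaces.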
Negative rates (LPS rebates) bypass the minimum charge logic entirely, a design decision that prevents floors from being applied to rebates. CodeLogician verified an edge case where a large hidden order with LPS FTSE100 Top Tier (-0.15bp rebate + 0.25bp hidden premium) pays a net 0.10bp, not zero.
While this particular example may seem straightforward, the same analysis, when applied to complex fee schedules, can reveal hidden complexity and compliance risks, and provide a mathematical foundation for trust in the system.
## 8.3 Multilateral Netting Engine
In this example, CodeLogician analyzes a simple multilateral netting engine to find two critical bugs. The system has two critical invariants:
1. Zero-sum Conservation - the sum of all net positions must equal zero
2. Netting Efficiency - no party's exposure may increase as a result of netting.
Formal methods found issues with both of these requirements in the code: bugs were identified before any testing, concrete counterexamples were produced, and once the code was corrected, CodeLogician proved it mathematically correct. An excerpt of the code is shown below, with the full program available in Appendix B.
```
def calculate_net_obligations(trades: List[Trade]) -> Dict[str, float]:
    """
    Calculate net obligations for all participants in a multilateral netting cycle.
    """
    net_positions: Dict[str, float] = {}
    for trade in trades:
        net_positions[trade.payer_id] = \
            net_positions.get(trade.payer_id, 0.0) - trade.amount
        net_positions[trade.receiver_id] = \
            net_positions.get(trade.receiver_id, 0.0) + trade.amount
    return net_positions
```
## 8.3.1 Input Validation
The first bug found was an input validation problem: negative values were accepted, causing a violation of the netting efficiency requirement. CodeLogician found the counterexample:
```
Trade(payer_id="", receiver_id="", amount=-10834.0)
```
In this case, Net Exposure, |10834| + |-10834| = 21668, is greater than Gross Exposure, |-10834| = 10834.
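One possible validation guard against this class of input, sketched with an illustrative `Trade` dataclass (the actual types appear in Appendix B); the specific checks are assumptions about a reasonable fix, not the verified code:

```python
from dataclasses import dataclass

@dataclass
class Trade:
    payer_id: str
    receiver_id: str
    amount: float

def validate_trade(trade: Trade) -> None:
    """Reject the malformed inputs behind the counterexample:
    empty party identifiers, self-trades, and non-positive amounts."""
    if not trade.payer_id or not trade.receiver_id:
        raise ValueError("party identifiers must be non-empty")
    if trade.payer_id == trade.receiver_id:
        raise ValueError("payer and receiver must differ")
    if trade.amount <= 0:
        raise ValueError("amount must be positive")

validate_trade(Trade("BANK_A", "BANK_B", 100.0))  # passes silently
```

With a guard like this in place, the counterexample trade is rejected at the boundary rather than silently inflating net exposure.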
## 8.3.2 Floating Point Precision
The second issue found was with the use of floating-point arithmetic. CodeLogician identified that floating-point errors would cause the zero-sum conservation requirement to be violated. IEEE 754 [5] binary representation cannot exactly represent most decimal values; accumulated errors compound with each operation.
Consider the classic demonstration: summing 0.1 ten times in floating-point yields 0.9999999999999999, not 1.0. In high-frequency clearing with thousands of trades per second, such errors accumulate to material discrepancies. CodeLogician's counterexample demonstrated a scenario where `sum(net_positions.values()) != 0.0`, violating the zero-sum invariant despite mathematically balanced trades.
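The classic demonstration is easy to reproduce, and it also shows why migrating to `Decimal` restores exact conservation:

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 exactly; the drift is
# visible after only ten additions.
total = 0.0
for _ in range(10):
    total += 0.1
print(total)          # -> 0.9999999999999999
print(total == 1.0)   # -> False

# Decimal arithmetic keeps the sum exactly balanced.
dtotal = sum(Decimal("0.1") for _ in range(10))
print(dtotal == Decimal("1.0"))  # -> True
```

The same drift, spread across thousands of offsetting obligations, is what makes a floating-point net position fail the exact zero-sum check.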
The fix required migrating from Python's float to Decimal [2] for arbitrary-precision arithmetic, combined with defense-in-depth validation within the core algorithm. After applying the fix, CodeLogician re-verified the implementation, proving all three verification goals: exact zero-sum conservation, tolerance-based zero-sum, and netting efficiency. The corrected implementation provides mathematical certainty of correctness, a requirement for systems subject to EMIR [7] and Dodd-Frank Title VII [1].
## 9 Conclusion and Future Work
We have introduced CodeLogician , a neuro-symbolic agent and framework for rigorous reasoning about software systems that integrates Large Language Models (LLMs) with automated reasoning engines. We argue that while LLMs excel at translation, abstraction, and interaction, they fundamentally lack the capacity for precise, exhaustive reasoning about program behavior. CodeLogician addresses this limitation by teaching LLM-driven agents to construct explicit formal models of software and by delegating semantic reasoning to formal tools such as ImandraX .
A central contribution of this work is the demonstration that formal augmentation fundamentally changes the reasoning capabilities of LLM-based systems. Using a newly introduced benchmark targeting mathematical reasoning about software logic, we showed that LLM-only approaches systematically fail on tasks requiring exact boundary identification, combinatorial reasoning, and exhaustive coverage. By contrast, LLMs augmented with CodeLogician achieve complete and precise analysis, closing a 40-47 percentage point gap across multiple rigorously defined metrics. These results provide empirical evidence that rigorous program analysis cannot be achieved by statistical reasoning alone and that formal reasoning engines are essential for scaling AI-assisted software development beyond heuristic methods.
Beyond empirical results, we presented CodeLogician as a general framework rather than a single agent or solver. Its reasoner-agnostic architecture, agentic orchestration layer, and support for continuous, project-scale autoformalization enable principled integration of multiple reasoning tools and workflows. The CodeLogician server and PyIML strategies demonstrate how formalization can be
performed incrementally and persistently over evolving codebases, making formal reasoning practical in real-world development environments.
Future Work. Applying LLMs to formal reasoning about software presents both substantial challenges and exceptional opportunities. LLMs remain imperfect at maintaining global consistency, respecting strict semantic constraints, and scaling reasoning across large, interdependent codebases. At the same time, their strengths in translation, abstraction, and interaction make them uniquely well suited to act as interfaces between informal software artifacts and formal mathematical models.
Our ongoing work in this direction is embodied in SpecLogician , a complementary project focused on scalable, data-driven formal specification and refinement for large software systems. SpecLogician extends the ideas introduced in this paper by shifting the emphasis from individual code fragments to collections of specifications, traces, tests, logs, and other real-world artifacts. By combining LLMs with automated reasoning, SpecLogician aims to incrementally synthesize, refine, and validate formal specifications at scale, even in settings where complete or authoritative specifications do not exist upfront.
Looking forward, we plan to deepen the integration between CodeLogician and SpecLogician , enabling closed-loop workflows in which specifications, implementations, and empirical artifacts co-evolve under formal constraints. Additional future directions include extending support to new source languages and reasoning backends, improving dependency-aware and incremental formalization strategies, and using formal counterexamples, region decompositions, and proofs to actively guide LLM behavior during development.
Ultimately, we view CodeLogician and SpecLogician as steps toward a new class of AI-assisted software engineering systems in which LLMs and formal reasoning engines operate symbiotically. By grounding AI-assisted development in explicit formal models and mathematically sound analysis, such systems can move beyond probabilistic correctness toward rigorous, auditable, and scalable reasoning about complex software systems.
## A LSE GTT Order Expiry
## A.1 Python model
```
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Union

# Supporting definitions (reconstructed from their uses in the model below)
Time = int  # discrete time in ticks


class AuctionEvent(Enum):
    AUCTION_CALL_STARTED = "auction_call_started"
    UNCROSSED = "uncrossed"


class OrderEvent(Enum):
    CREATED = "created"
    EXPIRED = "expired"


@dataclass
class GTTOrder:
    expires_at: Time


@dataclass
class ContinuousMode:
    start_auction_call_at: Time


@dataclass
class AuctionCallMode:
    uncross_at: Time


Mode = Union[ContinuousMode, AuctionCallMode]


@dataclass
class State:
    """The complete state of the trading venue."""
    time: Time
    auction_call_duration: Time
    auction_interval: Time
    mode: Mode
    gtt_order: Optional[GTTOrder]
    auction_event: Optional[AuctionEvent]
    order_event: Optional[OrderEvent]
    # Extended state for market order feature
    market_order_present_after_uncross: bool
    market_order_extension_period: Time


class Message(Enum):
    """Messages that can be processed by the venue."""
    TICK = "tick"


@dataclass
class CreateGTTOrder:
    """Message to create a GTT order."""
    expires_at: Time


# Helper functions

def is_valid_state(state: State) -> bool:
    """
    Check if a state is valid according to business invariants.
    VG: State validity invariants should always hold:
    - Time is non-negative
    - Durations are positive
    - Scheduled times are in the future
    - GTT order expiry is in the future
    """
    if state.time < 0:
        return False
    if state.auction_call_duration <= 0:
        return False
    if state.auction_interval <= 0:
        return False
    if state.market_order_extension_period <= 0:
        return False
    # Check mode-specific invariants
    if isinstance(state.mode, ContinuousMode):
        if state.mode.start_auction_call_at <= state.time:
            return False
    elif isinstance(state.mode, AuctionCallMode):
        if state.mode.uncross_at <= state.time:
            return False
    # GTT order expiry must be in the future
    if state.gtt_order is not None:
        if state.gtt_order.expires_at <= state.time:
            return False
    return True


def events_none(state: State) -> bool:
    """Check if no events are set (used for initial states)."""
    return state.auction_event is None and state.order_event is None


# Core business logic functions

def maybe_create_gtt_order(expires_at: Time, state: State) -> State:
    """
    Handle an incoming GTT order.
    VG: Orders with expiry <= current time should be rejected
    """
    if expires_at <= state.time:
        # Reject order - no state change
        return state
    else:
        # Accept order
        return State(
            time=state.time,
            auction_call_duration=state.auction_call_duration,
            auction_interval=state.auction_interval,
            mode=state.mode,
            gtt_order=GTTOrder(expires_at=expires_at),
            auction_event=state.auction_event,
            order_event=OrderEvent.CREATED,
            market_order_present_after_uncross=state.market_order_present_after_uncross,
            market_order_extension_period=state.market_order_extension_period
        )


def start_auction(state: State) -> State:
    """
    Start an auction call phase.
    Sets mode to AuctionCall, schedules uncross, emits event.
    """
    return State(
        time=state.time,
        auction_call_duration=state.auction_call_duration,
        auction_interval=state.auction_interval,
        mode=AuctionCallMode(uncross_at=state.time + state.auction_call_duration),
        gtt_order=state.gtt_order,
        auction_event=AuctionEvent.AUCTION_CALL_STARTED,
        order_event=state.order_event,
        market_order_present_after_uncross=state.market_order_present_after_uncross,
        market_order_extension_period=state.market_order_extension_period
    )


def uncross(state: State) -> State:
    """
    Complete the auction uncross.
    Sets mode to Continuous, schedules next auction, emits event.
    """
    return State(
        time=state.time,
        auction_call_duration=state.auction_call_duration,
        auction_interval=state.auction_interval,
        mode=ContinuousMode(start_auction_call_at=state.time + state.auction_interval),
        gtt_order=state.gtt_order,
        auction_event=AuctionEvent.UNCROSSED,
        order_event=state.order_event,
        market_order_present_after_uncross=state.market_order_present_after_uncross,
        market_order_extension_period=state.market_order_extension_period
    )


def maybe_uncross(uncross_at: Time, state: State) -> State:
    """
    Handle uncrossing with market order extension logic.
    VG: If market orders remain, auction is extended by extension_period
    """
    if state.market_order_present_after_uncross:
        # Market orders remain - check if extension period has elapsed
        if state.time >= uncross_at + state.market_order_extension_period:
            # Extension period complete - uncross and clear market order flag
            new_state = State(
                time=state.time,
                auction_call_duration=state.auction_call_duration,
                auction_interval=state.auction_interval,
                mode=state.mode,
                gtt_order=state.gtt_order,
                auction_event=state.auction_event,
                order_event=state.order_event,
                market_order_present_after_uncross=False,
                market_order_extension_period=state.market_order_extension_period
            )
            return uncross(new_state)
        else:
            # Still in extension period
            return state
    else:
        # No market orders - uncross immediately
        return uncross(state)


def maybe_change_mode(state: State) -> State:
    """
    Handle timed mode changes (auction start or uncross).
    VG: Mode changes occur at scheduled times
    """
    if isinstance(state.mode, ContinuousMode):
        if state.time >= state.mode.start_auction_call_at:
            return start_auction(state)
    elif isinstance(state.mode, AuctionCallMode):
        if state.time >= state.mode.uncross_at:
            return maybe_uncross(state.mode.uncross_at, state)
    return state


def maybe_expire_gtt_order_v2(state: State) -> State:
    """
    Handle GTT order expiry with auction protection.
    CRITICAL FIX: Per MIT201, GTT orders expiring during auction call
    should not expire until AFTER uncross completes.
    VG: GTT orders with expiry during auction call must not expire
    until after uncross (expires_at is extended to uncross_at + 1)
    """
    if state.gtt_order is None:
        return state
    expires_at = state.gtt_order.expires_at
    # Key fix: Delay expiry if we're in auction call mode
    if isinstance(state.mode, AuctionCallMode):
        # Extend expiry to after uncross
        expires_at = max(expires_at, state.mode.uncross_at + 1)
    if state.time >= expires_at:
        # Expire the order
        return State(
            time=state.time,
            auction_call_duration=state.auction_call_duration,
            auction_interval=state.auction_interval,
            mode=state.mode,
            gtt_order=None,
            auction_event=state.auction_event,
            order_event=OrderEvent.EXPIRED,
            market_order_present_after_uncross=state.market_order_present_after_uncross,
            market_order_extension_period=state.market_order_extension_period
        )
    else:
        return state


def tick(state: State) -> State:
    """
    Process a clock tick - advance time and handle timed events.
    Order of operations matters:
    1. Advance time
    2. Check for order expiry
    3. Check for mode changes
    VG: Time advances monotonically
    """
    # Advance time
    state = State(
        time=state.time + 1,
        auction_call_duration=state.auction_call_duration,
        auction_interval=state.auction_interval,
        mode=state.mode,
        gtt_order=state.gtt_order,
        auction_event=state.auction_event,
        order_event=state.order_event,
        market_order_present_after_uncross=state.market_order_present_after_uncross,
        market_order_extension_period=state.market_order_extension_period
    )
    # Check for order expiry
    state = maybe_expire_gtt_order_v2(state)
    # Check for mode changes
    state = maybe_change_mode(state)
    return state


def step(msg: Message | CreateGTTOrder, state: State) -> State:
    """
    Process a single message and return the new state.
    VG: Each message type should transition state correctly
    """
    if msg == Message.TICK:
        return tick(state)
    elif isinstance(msg, CreateGTTOrder):
        return maybe_create_gtt_order(msg.expires_at, state)
    else:
        return state


def run(state: State, msgs: List[Message | CreateGTTOrder]) -> State:
    """
    Process a sequence of messages.
    VG: Event fields are cleared before processing each message
    VG: Message sequence processing is deterministic
    """
    for msg in msgs:
        # Clear events before processing each message
        state = State(
            time=state.time,
            auction_call_duration=state.auction_call_duration,
            auction_interval=state.auction_interval,
            mode=state.mode,
            gtt_order=state.gtt_order,
            auction_event=None,
            order_event=None,
            market_order_present_after_uncross=state.market_order_present_after_uncross,
            market_order_extension_period=state.market_order_extension_period
        )
        state = step(msg, state)
    return state


def conflict_reachable(msgs: List[Message | CreateGTTOrder], state: State, k: int = 5) -> bool:
    """
    CRITICAL VERIFICATION GOAL:
    Check if it's possible to reach a state where BOTH:
    - An uncross event occurs AND
    - A GTT order expires
    at the same time.
    Per MIT201, this should NEVER happen - GTT orders expiring during
    auction call must wait until after uncross.
    VG: verify that NOT conflict_reachable for all valid initial states
    and message sequences
    """
    # Limit message sequence length
    limited_msgs = msgs[:k]
    # Run the model
    final_state = run(state, limited_msgs)
    # Check for the conflict
    return (final_state.auction_event == AuctionEvent.UNCROSSED and
            final_state.order_event == OrderEvent.EXPIRED)


# Verification Goals (VGs) explained in comments above each function
# These are the properties the author was verifying:
# VG1: State validity invariants (is_valid_state)
# VG2: Order rejection logic (maybe_create_gtt_order)
# VG3: Auction timing correctness (start_auction, uncross)
# VG4: Market order extension logic (maybe_uncross)
# VG5: Mode transition correctness (maybe_change_mode)
# VG6: GTT order expiry protection during auction (maybe_expire_gtt_order_v2)
# VG7: Time monotonicity (tick)
# VG8: Message processing determinism (step, run)
# VG9: CRITICAL - No simultaneous uncross and expiry (conflict_reachable)
```
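The initial conditions CodeLogician found (detailed in the sequence diagram of Appendix A.2) can be replayed against the two per-tick checks at the heart of the conflict. The sketch below re-states just those checks as standalone functions; this is a simplification of the model, with the full State record elided:

```python
# Minimal sketch of the model's two per-tick checks, applied to the initial
# conditions from the counterexample: t = 2699, uncross_at = 2700,
# order expires_at = 2700, market_order_extension_period = 1.
def expires_this_tick(t, expires_at, in_auction, uncross_at):
    if in_auction:
        expires_at = max(expires_at, uncross_at + 1)  # MIT201 protection
    return t >= expires_at

def uncrosses_this_tick(t, uncross_at, market_orders_present, extension):
    if market_orders_present:
        return t >= uncross_at + extension  # auction extended for market orders
    return t >= uncross_at

# Tick 1 (t = 2700): neither event fires — the order is protected and the
# auction is extended. Tick 2 (t = 2701): BOTH fire in the same tick, the
# MIT201 violation.
for t in (2700, 2701):
    expired = expires_this_tick(t, 2700, in_auction=True, uncross_at=2700)
    uncrossed = uncrosses_this_tick(t, 2700, market_orders_present=True, extension=1)
    print(t, expired, uncrossed)
# → 2700 False False
# → 2701 True True
```

The market-order extension pushes the uncross to exactly the tick at which the protected expiry deadline (uncross_at + 1) is reached, so the two events coincide.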
## A.2 Sequence diagram
Figure 22: Sequence diagram illustrating how the initial conditions found by CodeLogician cause a violation of the MIT201 specification.
<details>
<summary>Image 19 Details</summary>

### Visual Description
## Sequence Diagram: GTT Expiry and Mode/Uncross Check
### Overview
The image is a sequence diagram illustrating the interaction between different components (Time, GTT Expiry Check, Mode/Uncross Check, and State) during the Good-Till-Time (GTT) order expiry and mode/uncross check process. It shows the flow of events and decisions made at different ticks, particularly focusing on auction mode and market order presence.
### Components/Axes
* **Horizontal Axes (Top)**:
* Time
* GTT Expiry Check
* Mode/Uncross Check
* State
* **Vertical Axis**: Represents the progression of time and events.
* **Nodes**:
* Initial State (yellow box): t = 2699, Auction Mode, uncross_at = 2700, order expires_at = 2700, market_extension = 1
* After Tick 1 (yellow box): t = 2700, Still in auction
* Final State (pink box): Both events occur in the same tick, MIT201 violated
### Detailed Analysis
* **Initial State**:
* Time (t) is 2699.
* The system is in Auction Mode.
* `uncross_at` is set to 2700.
* `order expires_at` is set to 2700.
* `market_extension` is set to 1.
* **Tick 1: t = 2699 -> 2700**
* **GTT Expiry Check**:
* Checks expiry.
* Asks: "In auction mode?"
* Extends to max(2700, 2700+1) = 2701.
* Checks: "2700 >= 2701? NO"
* Result: "Order NOT expired"
* **Mode/Uncross Check**:
* Checks mode change.
* Checks: "2700 >= 2700? YES"
* Asks: "Market orders present? 2700 >= 2700 + 1? NO"
* Result: "Stay in auction"
* **After Tick 1**:
* Time (t) is 2700.
* The system is still in auction.
* **Tick 2: t = 2700 -> 2701**
* **GTT Expiry Check**:
* Checks expiry.
* Asks: "In auction mode?"
* Extends to max(2700, 2700+1) = 2701.
* Checks: "2701 >= 2701? YES"
* Result: "ORDER EXPIRES"
* **Mode/Uncross Check**:
* Checks mode change.
* Checks: "2701 >= 2700? YES"
* Asks: "Market orders present? 2701 >= 2700 + 1? YES"
* Result: "UNCROSS"
* **Final State**:
* "Both events occur in the same tick"
* "MIT201 violated"
### Key Observations
* The diagram illustrates a scenario where an order is initially set to expire at time 2700.
* In Tick 1, the order does not expire because 2700 is not greater than or equal to 2701. The system stays in auction because market orders are not present.
* In Tick 2, the order expires because 2701 is greater than or equal to 2701. The system uncrosses because market orders are present.
* The final state indicates a violation of MIT201, implying that both the order expiring and the uncross event occurring in the same tick is problematic.
### Interpretation
The sequence diagram depicts a critical race condition or conflict where the order expiry and uncross events occur simultaneously, leading to a violation of the MIT201 rule. This suggests a potential flaw in the system's logic or timing mechanism, where the expiry check and uncross check are not properly synchronized or prioritized. The diagram highlights the importance of ensuring that these events are handled in a specific order or with proper synchronization to avoid the MIT201 violation. The `market_extension` parameter seems to play a role in extending the auction, but the simultaneous expiry and uncross indicate a failure to prevent the conflict.
</details>
## B Multilateral Netting Engine
## B.1 Python model
```
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trade:
    # Reconstructed from its uses below: a bilateral payment obligation
    payer_id: str
    receiver_id: str
    amount: float


def calculate_net_obligations(trades: List[Trade]) -> Dict[str, float]:
    """
    Compute each participant's net position from a list of bilateral trades.
    (Signature reconstructed from the doctest below.)

    Invariants Maintained:
    1. Zero-Sum: sum(net_positions.values()) == 0.0
    2. Netting Efficiency: sum(|net|) <= sum(|gross|)

    Example:
    >>> trades = [
    ...     Trade("A", "B", 100.0),
    ...     Trade("B", "C", 50.0),
    ...     Trade("C", "A", 30.0)
    ... ]
    >>> net = calculate_net_obligations(trades)
    >>> net
    {'A': -70.0, 'B': 50.0, 'C': 20.0}
    >>> sum(net.values())  # Zero-sum check
    0.0
    """
    net_positions: Dict[str, float] = {}
    for trade in trades:
        # Payer: outgoing payment (negative)
        net_positions[trade.payer_id] = net_positions.get(trade.payer_id, 0.0) - trade.amount
        # Receiver: incoming payment (positive)
        net_positions[trade.receiver_id] = net_positions.get(trade.receiver_id, 0.0) + trade.amount
    return net_positions


def verify_zero_sum(net_positions: Dict[str, float], tolerance: float = 1e-10) -> bool:
    """
    Verify the Zero-Sum invariant (Conservation of Cash).

    Args:
        net_positions: Dictionary of participant net positions
        tolerance: Floating-point tolerance for zero comparison

    Returns:
        True if sum of all positions is within tolerance of zero
    """
    total = sum(net_positions.values())
    return abs(total) < tolerance


def verify_netting_efficiency(
    trades: List[Trade],
    net_positions: Dict[str, float]
) -> bool:
    """
    Verify the Netting Efficiency invariant.

    The sum of absolute net positions should never exceed the sum of
    absolute gross trade amounts. Netting should reduce exposure.

    Args:
        trades: Original bilateral trades
        net_positions: Computed net positions

    Returns:
        True if netting reduces or maintains total exposure
    """
    gross_exposure = sum(trade.amount for trade in trades)
    net_exposure = sum(abs(position) for position in net_positions.values())
    return net_exposure <= gross_exposure


def calculate_netting_benefit(
    trades: List[Trade],
    net_positions: Dict[str, float]
) -> float:
    """
    Calculate the reduction in cash flows achieved by netting.

    Returns:
        Percentage reduction in gross obligations (0.0 to 100.0)
    """
    if not trades:
        return 0.0
    gross_exposure = sum(trade.amount for trade in trades)
    net_exposure = sum(abs(position) for position in net_positions.values())
    if gross_exposure == 0.0:
        return 0.0
    reduction = ((gross_exposure - net_exposure) / gross_exposure) * 100.0
    return reduction
```
## B.2 Evaluation Example: Transportation Risk Manager
To illustrate the patterns observed across the 50-model evaluation, we present a detailed analysis of the transportation risk manager model. This system evaluates route decisions based on transport mode (Truck, Rail, Air, Sea), cargo type (Standard, Fragile, Hazardous, Perishable), weather conditions, and geopolitical risks, determining whether routes are approved and whether emergency routing should be activated.
## B.2.1 The Three Questions
The model was evaluated using three questions designed to test different aspects of code reasoning:
1. State Space Estimation (Q1): 'How many distinct routing scenarios exist for Sea transport when operating under geopolitical tensions (Tensions, Sanctions, or WarZone)?'
2. Conditional Analysis (Q2): 'Can Air transport ever be rejected for Standard (non-Hazardous) cargo? If yes, under what conditions?'
3. Property Verification (Q3): 'For Air transport specifically, does having both cargo_value > 100000 AND time_critical < 24 always activate emergency routing when the route is approved?'
## B.2.2 Question 1: State Space Estimation
CodeLogician Result. Region decomposition identified 117 distinct scenarios for Sea transport under geopolitical tensions. These scenarios arise from the interaction of:
- 3 geopolitical risk levels (Tensions, Sanctions, WarZone)
- 5 weather conditions (Clear, Rain, Snow, Storm, Hurricane)
- 4 cargo types (Standard, Fragile, Hazardous, Perishable)
- Numeric threshold boundaries for cargo_value and time_critical
Critically, the decomposition reveals that numeric parameters create additional scenario boundaries. For example, one region is defined by:
```
mode = Sea
geopolitical = Sanctions
cargo = Hazardous
weather = Clear
time_critical <= 23
cargo_value >= 100001
```
This scenario results in emergency routing = true because three of the four emergency conditions are satisfied. A different region with cargo_value <= 100000 produces emergency routing = false, a behaviorally distinct scenario that categorical analysis misses.
LLM Estimates. Table 1 summarizes the LLM estimates. All models significantly underestimated the scenario count, with errors ranging from 2× to 39×.
Table 1: Q1 state space estimates by model. All LLMs underestimated the scenario count, consistent with the 0.186 mean State Space Estimation Accuracy reported in the main evaluation.
| Model                | Estimate | Actual | Error Factor |
|----------------------|----------|--------|--------------|
| grok-code-fast-1     | 15       | 117    | 7.8×         |
| claude-sonnet-4.5    | 14       | 117    | 8.4×         |
| claude-opus-4.5      | 15       | 117    | 7.8×         |
| gemini-3-pro-preview | 60       | 117    | 2.0×         |
| gpt-5.2              | 3        | 117    | 39×          |
Analysis. The LLMs applied simple categorical reasoning: 3 (geopolitical) × 5 (weather) = 15 scenarios, or variations thereof. This approach correctly counts the categorical combinations but fails to account for how numeric thresholds (cargo_value > 100000, time_critical < 24, route_cost < 1000, route_time < 72) partition the state space further. The emergency routing logic alone creates multiple behavioral regions within each categorical combination.
GPT-5.2's estimate of 3 scenarios (counting only the geopolitical risk levels) represents an extreme case of categorical oversimplification. Gemini-3-pro's estimate of 60 came closest by attempting to include cargo types (3 × 5 × 4 = 60), but still missed the numeric threshold boundaries.
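The gap between categorical counting and behavioral counting can be sketched directly. The snippet below is illustrative only: the category names come from the text above, and the commentary on threshold splitting is a simplification of what region decomposition actually computes.

```python
from itertools import product

# Categorical counting: what the LLM estimates were doing.
geopolitical = ["Tensions", "Sanctions", "WarZone"]
weather = ["Clear", "Rain", "Snow", "Storm", "Hurricane"]
cargo = ["Standard", "Fragile", "Hazardous", "Perishable"]

categorical = len(list(product(geopolitical, weather, cargo)))
print(categorical)  # → 60, where the best LLM estimate stopped

# Numeric predicates such as cargo_value > 100000 and time_critical < 24
# split a categorical combination further, but only when both sides of the
# threshold are reachable AND produce different observable behavior. Region
# decomposition counts exactly those behaviorally distinct cells, which is
# why the true count (117) lies between 60 and a naive 60 * 2^k bound.
```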
## B.2.3 Question 2: Conditional Analysis
CodeLogician Result. Decomposition confirmed that Air transport can be rejected for Standard cargo under exactly 2 scenarios:

1. weather = Hurricane AND geopolitical = WarZone (extreme risk)
2. weather = Hurricane AND geopolitical = Sanctions (air grounded)
LLM Results. All five models correctly identified both rejection scenarios. For example, Claude Opus 4.5 reasoned:
'Air transport can still be rejected via the is_route_viable check. The air grounded condition triggers when: mode=AIR AND weather=HURRICANE AND geopolitical=SANCTIONS. The extreme risk condition triggers when: weather=HURRICANE AND geopolitical=WAR_ZONE (applies to all modes including Air).'
Analysis. This question tests explicit boolean logic that is directly visible in the code:
```
def _is_route_viable(mode, weather, geopolitical):
    extreme_risk = (weather == HURRICANE
                    and geopolitical == WAR_ZONE)
    air_grounded = (mode == AIR
                    and weather == HURRICANE
                    and geopolitical == SANCTIONS)
    return not (extreme_risk or air_grounded)
```
The conditions are explicit, categorical, and do not involve numeric thresholds or complex interactions. This explains the 100% success rate: LLMs excel at tracing explicit control flow when conditions are directly stated in the code. This result is consistent with the relatively higher Control Flow Understanding score (0.746) reported in the main evaluation.
## B.2.4 Question 3: Property Verification
The Property. The question asks whether, for approved Air routes, the conjunction of cargo_value > 100000 AND time_critical < 24 is sufficient to guarantee emergency routing activation.
CodeLogician Result. ImandraX proved the property always holds. The intuition behind why this property holds can be understood as follows:
## Why the Property Holds
Emergency routing activates when at least 3 of 4 conditions are true:
- c1: cargo_value > 100000, given as true by the property premise
- c2: time_critical < 24, given as true by the property premise
- c3: route_cost < 1000, varies depending on weather and geopolitical factors
- c4: route_time < 72; for Air, base time = 8 and max multiplier = 3, so max time = 24 < 72
Since c4 is always true for Air transport (route time cannot exceed 24 hours), and c1 and c2 are given by the premise, we have at least 3 conditions satisfied regardless of c3 .
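This 3-of-4 argument can be checked mechanically. In the sketch below the constant names and the majority-vote helper are illustrative; the figures (Air base time 8 hours, maximum weather multiplier 3, threshold 72 hours) are those quoted in the text.

```python
AIR_BASE_TIME_HOURS = 8       # figure quoted in the text
MAX_WEATHER_MULTIPLIER = 3    # figure quoted in the text

def emergency_routing(c1: bool, c2: bool, c3: bool, c4: bool) -> bool:
    # Emergency routing activates when at least 3 of the 4 conditions hold.
    return sum([c1, c2, c3, c4]) >= 3

# c4 (route_time < 72) is an invariant for Air: the worst case is 8 * 3 = 24.
max_air_route_time = AIR_BASE_TIME_HOURS * MAX_WEATHER_MULTIPLIER
c4 = max_air_route_time < 72  # True regardless of weather or geopolitics

# With c1 and c2 given by the property premise, c3 cannot matter:
assert all(emergency_routing(True, True, c3, c4) for c3 in (True, False))
```

This is exactly the compensation argument that Grok-code-fast-1 missed: once c4 is known to be invariant for Air, the boundary behavior of c3 is irrelevant.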
LLM Results. The models produced mixed results:
- Correct: GPT-5.2, Gemini-3-pro-preview correctly concluded the property holds.
- Incorrect: Grok-code-fast-1 concluded 'No,' claiming that route_cost = 1000 (exactly) would make c3 false, leaving only 2 conditions met.
- Self-corrected: Claude Opus 4.5 initially answered 'No' but corrected itself mid-reasoning.
Anatomy of Self-Correction. Claude Opus 4.5's reasoning demonstrates the challenge of discovering numeric invariants without formal analysis:
'Air route cost can exceed 1000: e.g., AIR + STORM + WAR_ZONE = 500 × 3 × 4 = 6000 > 1000. In scenarios where cost ≥ 1000 (c3 false) and route time ≥ 72 (never happens for Air, max is 24), emergency would fail... Wait, re-checking: Air + HURRICANE scenarios are limited. Let me verify c4 is always true: max Air time = 8 × 3 = 24 < 72, so yes c4 is always true for Air. Correction: With cargo_value > 100000 (c1=true), time_critical < 24 (c2=true), and route time always < 72 for Air (c4=true), we have 3 conditions, so emergency IS always activated.'
The model eventually discovered the invariant through explicit arithmetic verification, but only after initially reaching the wrong conclusion. Grok-code-fast-1 made a similar initial error but did not self-correct, focusing on the boundary case route cost = 1000 without recognizing that c4 compensates.
Analysis. This question tests the ability to reason about numeric constraints that determine behavior. The key insight (Air's route time is bounded at 24 hours) requires:
1. Recognizing that route time depends on base time and weather multipliers
2. Computing the maximum possible value for Air specifically
3. Connecting this bound to the emergency threshold (72 hours)
ImandraX establishes this property through logical deduction. LLMs may arrive at the same conclusion through explicit arithmetic (as Claude Opus eventually did), but this process is unreliable: some models self-correct while others do not, as demonstrated by Grok's failure to recognize that c4 compensates for any value of c3.
## B.2.5 Summary
This case study illustrates the patterns observed across the full evaluation:
1. State Space Estimation: LLMs systematically underestimate scenario counts by treating numeric parameters as irrelevant. The 2-39× error range in Q1 is consistent with the 0.186 mean accuracy reported in the main section.
2. Explicit Control Flow: When conditions are directly visible as boolean expressions (Q2), LLMs perform well, achieving 100% accuracy on the conditional analysis task.
3. Numeric Invariants: Properties requiring discovery of domain-specific bounds (Q3) challenge LLMs. While self-correction is possible, it is unreliable: some models correct themselves, others do not.
While different LLMs exhibit varying performance levels, with some models such as Claude Opus 4.5 demonstrating self-correction capabilities that others lack, all models share inherent limitations when reasoning about code without formal methods support. Region decomposition provides exhaustive state space analysis that no LLM achieved, and theorem proving establishes properties with logical certainty that LLMs can only approximate through heuristic reasoning. These limitations are fundamental to current LLM architectures rather than artifacts of any particular model.
## References
- [1] 111th United States Congress. Dodd-Frank Wall Street Reform and Consumer Protection Act. https://www.congress.gov/bill/111th-congress/house-bill/4173 , 2010.
- [2] F. Batista. PEP 327 - Decimal Data Type. https://peps.python.org/pep-0327/ , 2003.
- [3] W. R. Bevier, W. A. Hunt, J. S. Moore, and W. D. Young. Special Issue on System Verification. Journal of Automated Reasoning , 5(4):409-530, 1989.
- [4] R. S. Boyer and J. S. Moore. A Computational Logic . Academic Press Professional, Inc., USA, 1979.
- [5] I. S. Committee. IEEE standard for floating-point arithmetic. IEEE Std 754-2008 , pages 1-70, 2008. doi: 10.1109/IEEESTD.2008.4610935.
- [6] R. Desmartin, O. Isac, G. Passmore, E. Komendantskaya, K. Stark, and G. Katz. A Certified Proof Checker for Deep Neural Network Verification in Imandra. In Y. Forster and C. Keller, editors, 16th International Conference on Interactive Theorem Proving (ITP 2025), volume 352 of Leibniz International Proceedings in Informatics (LIPIcs), pages 1:1-1:21, Dagstuhl, Germany, 2025. Schloss Dagstuhl - Leibniz-Zentrum für Informatik. ISBN 978-3-95977-396-6. doi: 10.4230/LIPIcs.ITP.2025.1. URL https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ITP.2025.1 .
- [7] European Parliament and Council of the European Union. Regulation (EU) No 648/2012 on OTC derivatives, central counterparties and trade repositories (EMIR). https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32012R0648 , 2012.
- [8] Imandra, Inc. CodeLogician IML Agent. https://marketplace.visualstudio.com/items?itemName=imandra.imandra-code-logician , 2025.
- [9] Imandra Inc. Code Logic Bench: A Benchmark for Evaluating Program Reasoning. https://github.com/imandra-ai/code-logic-bench , 2026. GitHub repository.
- [10] Imandra Inc. Imandra Protocol Language (IPL) documentation. https://docs.imandra.ai/ipl/ , 2026. Accessed: 2026-01-16.
- [11] Imandra Inc. Region decomposition: Stock exchange trade pricing. https://docs.imandra.ai/imandra-docs/notebooks/six-swiss-exchange-pricing/ , 2026. Accessed: 2026-01-16.
- [12] Imandra Inc. Supervised learning with Imandra. https://docs.imandra.ai/imandra-docs/notebooks/supervised-learning/ , 2026. Accessed: 2026-01-16.
- [13] London Stock Exchange. MIT201 - Guide to the trading system. https://docs.londonstockexchange.com/sites/default/files/documents/mit201-guide-to-the-trading-system-15_6_20240429.pdf , 2024.
- [14] G. Passmore, S. Cruanes, D. Ignatovich, D. Aitken, M. Bray, E. Kagan, K. Kanishev, E. Maclean, and N. Mometto. The Imandra Automated Reasoning System (System Description). In Proc. 10th Int. Joint Conf. Automated Reasoning (IJCAR) , pages 464-471, 2020.
- [15] G. O. Passmore. Some Lessons Learned in the Industrialization of Formal Methods for Financial Algorithms. In Proc. 24th Int. Symposium on Formal Methods (FM) , pages 717-721, 2021.
- [16] G. O. Passmore and D. Ignatovich. Formal verification of financial algorithms. In L. de Moura, editor, Automated Deduction - CADE 26: 26th International Conference on Automated Deduction, Gothenburg, Sweden, August 6-11, 2017, Proceedings, volume 10395 of Lecture Notes in Computer Science, pages 26-41. Springer, 2017. doi: 10.1007/978-3-319-63046-5_3. URL https://doi.org/10.1007/978-3-319-63046-5_3 .
- [17] C. M. Wintersteiger. Formal verification of the IEEE P3109 standard for binary floating-point formats for machine learning. In 2025 IEEE 32nd Symposium on Computer Arithmetic (ARITH) , pages 157-160, 2025. doi: 10.1109/ARITH64983.2025.00032.