# Aristotle: Mastering Logical Reasoning with A Logic-Complete Decompose-Search-Resolve Framework
> Corresponding author: Hao Fei
Abstract
In the context of large language models (LLMs), current advanced reasoning methods have made impressive strides in various reasoning tasks. However, when it comes to logical reasoning tasks, major challenges remain in both efficacy and efficiency. This is rooted in the fact that these systems fail to fully leverage the inherent structure of logical tasks throughout reasoning processes such as decomposition, search, and resolution. To address this, we propose a logic-complete reasoning framework, Aristotle, with three key components: Logical Decomposer, Logical Search Router, and Logical Resolver. In our framework, symbolic expressions and logical rules are comprehensively integrated into the entire reasoning process, significantly alleviating the bottlenecks of logical reasoning, i.e., reducing sub-task complexity, minimizing search errors, and resolving logical contradictions. The experimental results on several datasets demonstrate that Aristotle consistently outperforms state-of-the-art reasoning frameworks in both accuracy and efficiency, particularly excelling in complex logical reasoning scenarios. We will open-source all our code at http://llm-symbol.github.io/Aristotle.
1 Introduction
LLMs Patel et al. (2023); Chowdhery et al. (2023) have unlocked unprecedented potential in semantic understanding Zhao et al. (2023), sparking immense hope for realizing AGI. A fundamental requirement for true intelligence is the ability to perform human-level reasoning, such as commonsense reasoning Wang et al. (2024c), mathematical problem-solving Wang et al. (2024a), and geometric reasoning Eisner et al. (2024). To achieve this, researchers have drawn inspiration from human reasoning processes, proposing various methods and strategies for LLM-based reasoning. One of the most groundbreaking works is the Chain-of-Thought (CoT) Wei et al. (2022), which breaks down complex problems into smaller sub-problems, solving them step by step. The birth of CoT has elevated the reasoning capabilities of LLMs to new heights. Further research has built on this foundation by closely emulating human cognitive patterns, introducing more advanced approaches, such as Least-to-Most Zhou et al. (2023), Tree-of-Thought (ToT) Yao et al. (2023), Graph-of-Thought (GoT) Besta et al. (2024), and Plan-and-Solve Wang et al. (2023a), which have achieved progressively better results on reasoning benchmarks. In summary, successful LLM-based reasoning methods generally involve three key modules Huang and Chang (2023); Li et al. (2024): problem decomposition, path searching, and problem resolution.
<details>
<summary>x1.png Details</summary>

Bar charts comparing ToT (blue) and Ours (red). Error Rate (%): SE 15.0 (ToT) vs. 4.9 (Ours), an 11.1-point reduction; RE 28.4 (ToT) vs. 0.4 (Ours), a 28.0-point reduction. Average Reasoning Steps: 24.6 (ToT) vs. 11.7 (Ours), a 12.9-step reduction.
</details>
Figure 1: Our reasoning framework vs. the SoTA ToT: comparison in terms of Search Error (SE) and single-step Reasoning Error (RE), as well as in terms of average number of reasoning steps.
<details>
<summary>x2.png Details</summary>

Flowchart of the reasoning pipeline on the running example "Dave is nice": the Translator and Decomposer turn the premises into clauses such as Patient(x, False) ∨ Nice(x, True), Smart(x, False) ∨ Patient(x, True), and Smart(Dave, True), with the question Nice(Dave, True). Starting from the negated statement, the Search Router finds complementary clauses and the Resolver resolves them over multiple rounds until a contradiction is found or ruled out, after which the final answer (True/False/Unknown/Self-Contradiction) is concluded.
</details>
Figure 2: Our Aristotle logical reasoning framework (best viewed via zooming in). In step 1, the
Translator and
Decomposer together transform $P$ and $S$ into $P_{n}$ and $S_{n}$ . Then, we initialize the $C_{\text{current}}$ using the decomposed $S_{n}$ and $\neg S_{n}$ . In step 2, the
Search Router uses the $C_{\text{current}}$ and $P_{n}$ to search for $C_{\text{complement}}$ . The
Resolver then resolves $C_{\text{current}}$ with $C_{\text{complement}}$ to produce $C_{\text{resolved}}$ . The reasoning is complete when: (1) $C_{\text{resolved}}$ determines whether a contradiction exists; or (2) the maximum number of iterations $I_{\text{max}}$ is reached. In step 3, Aristotle concludes the proofs $D_{S_{n}}$ and $D_{\neg S_{n}}$ based on the Proof Determination. Using these proofs, Aristotle determines the final answer based on Eq. (1). Note that two distinct reasoning paths are implemented: a solid box representing the path starting from $\neg S_{n}$ , and a dotted box representing the path starting from $S_{n}$ . The complete reasoning process for both paths, including all iterations, is shown in the right part “ Reasoning Trajectories ”.
Compared to other forms of general reasoning, logical reasoning Huang and Chang (2023) stands out as one of the most challenging tasks, as it demands the strictest evidence, arguments, and logical rigor to arrive at sound conclusions or judgments. Logical reasoning more closely mirrors human-level cognitive processes, making it crucial in high-stakes domains such as mathematical proof generation, legal analysis, and scientific discovery Cummins et al. (1991); Markovits and Vachon (1989). In recent years, numerous studies have investigated how to integrate LLMs into logical reasoning. For example, some methods Pan et al. (2023); Gao et al. (2023) use LLMs to translate textual problems into symbolic expressions, which are then addressed by external logic solvers. Subsequent work, such as SymbCoT Xu et al. (2024), suggests that LLMs themselves can handle both symbolic translation and logic resolution, thus avoiding potential information loss caused when using external solvers. While SymbCoT has achieved state-of-the-art (SoTA) performance, the inherent simplicity of CoT’s linear reasoning process leaves considerable room for further improvement in LLM-based logical reasoning.
In response, certain research Yao et al. (2023); Besta et al. (2024); Zhang et al. (2023) has applied sophisticated general-purpose reasoning methods (e.g., ToT, GoT) directly to logical reasoning tasks. Unfortunately, these approaches largely overlook the inherent structure of logical tasks and fail to effectively integrate logical rules into the decompose-search-resolve framework, leaving key issues unresolved in both reasoning efficacy and efficiency:
$\blacktriangleright$ From an efficacy perspective, first, when LLMs decompose logical problems, they often rely on the linguistic token relations rather than the underlying logical structure, leading to disconnected sub-problems and faulty reasoning. Specifically, when reasoning hinges on specific logical relationships, neglecting them can result in disjointed sub-problems, breaking the logical chain and ultimately leading to incorrect conclusions. Furthermore, during the search stage, current path search methods rely heavily on evaluators that may be unreliable, selecting nodes based on possibly flawed logic, causing error propagation through subsequent reasoning steps Chen et al. (2024); Wang et al. (2024b). For the resolving step, these methods guide LLMs to solve sub-questions with simple text prompts, which frequently contain logical errors, resulting in numerous faulty nodes in the search space Xu et al. (2024). These errors propagate through subsequent reasoning steps, causing entire paths to fail and leading to reasoning failure. Our preliminary experiments reveal that directly applying SoTA general-purpose reasoning methods with a search mechanism to logical tasks results in significant errors, with 28.4% for reasoning and 15.0% for search, as shown in Fig. 1.
$\blacktriangleright$ From an efficiency perspective, these approaches also lead to significant shortcomings. For example, generating large numbers of incorrect nodes wastes computational resources Ning et al. (2024). Moreover, relying on unreliable evaluators introduces bias into the search process, leading to unnecessary node and path explorations, ultimately reducing efficiency Chen et al. (2024). We note that inefficient logical reasoning systems can significantly undermine their value in practical application scenarios.
To address these challenges, we propose a novel reasoning framework, Aristotle, which effectively tackles the performance and the efficiency bottlenecks in existing logical reasoning tasks by completely integrating symbolic expressions and rules into each stage of the decomposition, search, and resolution. Fig. 2 illustrates the overall framework. Specifically, we first introduce a Logical Decomposer that breaks down the original problem into smaller and simpler components based on its logical structure, reducing the complexity of logical tasks. We then devise a Logical Search Router, which leverages proof by contradiction to directly search for logical inconsistencies, thereby reducing search errors from unreliable evaluators and minimizing the number of steps required by existing methods. Finally, we develop a Logical Resolver, which rigorously resolves logical contradictions at each reasoning step, guided by the Logical Search Router. Overall, Aristotle thoroughly considers the inherent logical structure of tasks, fully incorporating logical symbols into the entire decompose-search-resolve framework. This ensures a more logically coherent reasoning process, leading to more reliable final results.
We conducted experiments across multiple logical reasoning benchmarks, where our method surpasses the current SoTA baselines by 4.5% with GPT-4 and 5.4% with GPT-4o. Further analysis revealed that the decomposer, search router, and resolver modules each contributed to: (i) reducing task complexity during problem decomposition, leading to improved accuracy in subsequent search and reasoning phases; (ii) focusing search efforts on the most direct and relevant paths, which reduced errors and enhanced efficiency; (iii) achieving near-perfect logical reasoning accuracy. Moreover, we observe that Aristotle delivers even greater performance improvements in complex scenarios, such as those with more intricate logical structures or longer reasoning chains. Overall, this work marks the first successful complete integration of symbolic logic expressions into every stage of an LLM-based reasoning framework (decomposition, search, and resolution), demonstrating that LLMs can perform complete logical reasoning over symbolic structures.
2 Aristotle Architecture
We first formally define the logical reasoning task. Given a set of premises $P=\{p_{1},p_{2},...,p_{n}\}$ , where each $p_{i}$ represents a logical statement, a reasoner should derive an answer $A$ regarding a given statement $S$ . The possible answer is true ( $T$ ), false ( $F$ ), unknown ( $U$ ), or self-contradictory ( $SD$ ). Self-contradictory means a statement can be proved true and false simultaneously. The formal definition of each answer can be found in Eq. (1).
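The mapping from the two proof outcomes to the four answers follows standard proof by contradiction; a minimal sketch, with our own function name and boolean encoding (the formal version appears in Eq. (1)):

```python
def conclude_answer(neg_refuted: bool, pos_refuted: bool) -> str:
    """Map the two refutation outcomes to the final answer A.

    neg_refuted: a contradiction was derived starting from the negation of S.
    pos_refuted: a contradiction was derived starting from S itself.
    """
    if neg_refuted and pos_refuted:
        return "SD"  # self-contradictory: S provable both true and false
    if neg_refuted:
        return "T"   # refuting the negation proves S true
    if pos_refuted:
        return "F"   # refuting S proves it false
    return "U"       # neither refuted: truth value unknown
```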
As illustrated in Fig. 2, Aristotle has an architecture with four modules: Translator, Decomposer, Search Router, and Resolver.
Translator.
We use the LLM itself to parse the given premises $P$ and question statement $S$ into a symbolic format, which aims to eliminate ambiguity and ensure precision in the logical statements. We specifically use the Logic Programming (LP) language, adopting Prolog’s grammar Clocksin and Mellish (2003) to represent the problem as facts, rules, and queries. Facts and rules Baader et al. (2003) are derived from $P$ , while queries are formulated based on $S$ . We denote the translated premises (facts and rules) as $P_{t}$ , and the queries as $S_{t}$ . The details of the grammar can be found in Appendix A.
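As a concrete illustration (the predicate names below are ours, following standard Prolog grammar rather than the paper's exact prompt), the running example from Fig. 2 might translate as:

```python
# Hypothetical LP translation of the Fig. 2 example, written in Prolog grammar.
# Rules and facts come from the premises P; the query comes from the statement S.
premises_t = [
    "nice(X) :- patient(X).",   # rule: if X is patient, then X is nice
    "patient(X) :- smart(X).",  # rule: if X is smart, then X is patient
    "smart(dave).",             # fact: Dave is smart
]
query_t = "nice(dave)."         # query: is Dave nice?

def is_rule(statement: str) -> bool:
    """A Prolog rule contains the neck operator ':-'; a fact does not."""
    return ":-" in statement
```

In Prolog grammar, the `:-` operator separates a rule's head from its body, so facts and rules derived from $P$ are syntactically distinguishable.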
Decomposer.
By breaking down the logical statement into a standardized logical form, we can simplify the reasoning process, making it easier to apply formal rules and perform efficient logical calculations. To achieve this, we use an LLM to transform the parsed premises $P_{t}$ and queries $S_{t}$ into a standardized logical form through Normalization Davis and Putnam (1960) and Skolemization Nonnengart (1996), converting them into Conjunctive Normal Form (CNF) and eliminating quantifiers; the results are denoted $P_{n}$ and $S_{n}$ . For example, the logical rule $\forall x\,(P(x)\to Q(x))$ will be decomposed into $\neg P(x)\lor Q(x)$ .
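To make the rewrite concrete, here is a minimal sketch under our own encoding (not the paper's prompt format): a literal is a (predicate, argument, polarity) triple, and an implication becomes the clause joining the negated antecedent with the consequent:

```python
from typing import NamedTuple

class Literal(NamedTuple):
    predicate: str   # e.g. "Patient"
    argument: str    # e.g. "x" or "Dave"
    polarity: bool   # truth value attached to the predicate

def negate(lit: Literal) -> Literal:
    """Flip the polarity of a literal."""
    return lit._replace(polarity=not lit.polarity)

def decompose_implication(antecedent: Literal, consequent: Literal) -> list:
    """Rewrite antecedent -> consequent as the CNF clause (not antecedent) or consequent."""
    return [negate(antecedent), consequent]
```

Decomposing the rule "if people are patient, then they are nice" this way yields the clause [Patient(x, False), Nice(x, True)], matching the clause notation used in Fig. 2.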
Search Router.
We adopt the proof-by-contradiction Bishop (1967) approach because it allows us to straightforwardly search for complementary clauses. This method reduces search errors and directly targets logical conflicts, making the reasoning process faster and more efficient. We design a rule-based module to search for clauses $C_{\text{complement}}\in P_{n}$ such that $C_{\text{current}}$ and $C_{\text{complement}}$ contain complementary terms. Two terms are complementary when they share the same predicate and argument but have opposite polarity. For example, if $C_{\text{current}}$ is $\text{P}(\text{x},\text{True})$ , clauses in the premises that contain $\text{P}(\text{x},\text{False})$ will be found by the Search Router as $C_{\text{complement}}$ , since the two terms share the same predicate $P$ and argument $x$ but have opposite polarity (True vs. False). We will explain how we define $C_{\text{current}}$ in Section 3. We include more details about the search strategy in Appendices B and H.
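A rule-based complement search can be sketched as follows, representing each term as a (predicate, argument, polarity) triple. This encoding is our own, and treating the variable "x" as matching any constant is a deliberate simplification of unification:

```python
VARIABLE = "x"  # simplification: a single universally quantified variable

def complementary(t1, t2) -> bool:
    """Same predicate, compatible argument, opposite polarity."""
    same_predicate = t1[0] == t2[0]
    args_match = t1[1] == t2[1] or VARIABLE in (t1[1], t2[1])
    return same_predicate and args_match and t1[2] != t2[2]

def search_complements(c_current, premises):
    """Return every premise clause containing a term complementary to C_current."""
    return [clause for clause in premises
            if any(complementary(t, u) for t in c_current for u in clause)]
```

On the Fig. 2 example, with current clause Nice(Dave, False) and premises {Patient(x, False) ∨ Nice(x, True); Smart(x, False) ∨ Patient(x, True); Smart(Dave, True)}, only the first premise is returned, since Nice(x, True) is complementary to the current term.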
Resolver.
To conduct effective step-wise reasoning during proof by contradiction, we adhere to the resolution principle Robinson (1965), as it provides clear and concise instructions for resolving logical conflicts, minimizing the likelihood of reasoning errors. It works by canceling out the complementary terms identified by the Search Router and connecting the remaining terms, a process implemented using an LLM. Formally, given two clauses $C_{\text{current}}$ and $C_{\text{complement}}$ , where:
$$
C_{\text{current}}=P(x,\text{True})\lor A
$$
$$
C_{\text{complement}}=P(x,\text{False})\lor B
$$
Here, $P(x,\text{True})$ and $P(x,\text{False})$ are complementary terms. The Resolver cancels them out and connects the remaining terms. The resolved clause becomes:
$$
C_{\text{resolved}}=\text{Resolve}(C_{\text{current}},C_{\text{complement}})=A\lor B
$$
If the remaining clause is empty or a contradiction ( $\bot$ ), e.g., resolving $C_{\text{current}}=(A)$ and $C_{\text{complement}}=(\neg A)$ yields the empty clause $C_{\text{resolved}}=\bot$ (an empty clause is equivalent to a contradiction), we can conclude the proof and determine the answer, as explained in detail in Step 2 of Section 3.
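A single resolution step can be sketched as follows. In the paper this step is carried out by an LLM; the rule-based version below is only a hedged illustration of the underlying principle, with the literal encoding and names as assumptions:

```python
# Hypothetical sketch of one Resolver step under the resolution principle:
# cancel a complementary pair, then join the remaining literals. An empty
# result plays the role of the empty clause (a contradiction).

def is_complementary(lit1, lit2):
    """Same predicate and arguments, opposite polarity."""
    return lit1[0] == lit2[0] and lit1[1] == lit2[1] and lit1[2] != lit2[2]

def resolve(c_current, c_complement):
    """Resolve two clauses on their first complementary pair; returns the
    remaining literals (deduplicated), with [] standing for the empty clause."""
    for a in c_current:
        for b in c_complement:
            if is_complementary(a, b):
                rest = [l for l in c_current if l != a] + \
                       [l for l in c_complement if l != b]
                # deduplicate while preserving order
                return [l for i, l in enumerate(rest) if l not in rest[:i]]
    raise ValueError("clauses share no complementary pair")

# (P(x,True) v A) resolved with (P(x,False) v B) gives A v B:
print(resolve([('P', ('x',), True), ('A', (), True)],
              [('P', ('x',), False), ('B', (), True)]))
# (A) resolved with (~A) gives the empty clause:
print(resolve([('A', (), True)], [('A', (), False)]))  # []
```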
3 Logic-Complete Reasoning Processing
With Aristotle, we now demonstrate how each module comes into play to form the integrated dual-path reasoning process.
Step 1: Search Initialization.
As shown in the step 1 of Fig. 2, given the original premises $P$ and the question statement $S$ , we first translate them into symbolic format $P_{t}$ and $S_{t}$ , and then decompose them into $P_{n}$ and $S_{n}$ , respectively.
Translate and Decompose $\blacktriangleright$ Input: $P$ , $S$ $\blacktriangleright$ Output: $P_{n}$ , $S_{n}$ , $C_{\text{current}}$ = { $S_{n}$ , $\neg S_{n}$ }
To implement proof by contradiction, we initialize the current clause $C_{\text{current}}$ with both $S_{n}$ and its negation $\neg S_{n}$ , i.e., $C_{\text{current}}=S_{n}$ and $C_{\text{current}}=\neg S_{n}$ . Considering both $S_{n}$ and $\neg S_{n}$ is necessary because we need both proofs to rigorously conclude an answer, as marked in Eq. (1) and explained in detail in Step 3.
Step 2: Search and Resolve.
At this stage, two reasoning paths are initiated: one from $C_{\text{current}}=S_{n}$ and the other from $C_{\text{current}}=\neg S_{n}$ , as initialized in Step 1. On both paths we aim to reach a final answer via proof by contradiction, iteratively searching for complementary clauses and resolving conflicts, which lets us systematically reach an accurate final answer more quickly. Specifically, for each reasoning path, as presented in Step 2 of Fig. 2, the Search Router selects clauses $C_{\text{complement}}∈ P_{n}$ that are complementary to $C_{\text{current}}$ .
Search $\blacktriangleright$ Input: $P_{n}$ , $C_{\text{current}}$ $\blacktriangleright$ Output: $C_{\text{complement}}$
The Resolver module then applies the resolution rule Resolve( $C_{\text{current}}$ , $C_{\text{complement}}$ ) to produce a new clause $C_{\text{resolved}}$ .
Resolve $\blacktriangleright$ Input: $C_{\text{current}}$ , $C_{\text{complement}}$ $\blacktriangleright$ Output: $C_{\text{resolved}}$
If $C_{\text{resolved}}$ indicates a contradiction or confirms the absence of one, we terminate the reasoning process. Otherwise, we update $C_{\text{current}}$ = $C_{\text{resolved}}$ and repeat the Search and Resolve process. If the process reaches the maximum number of iterations $I_{\text{max}}$ without finding a contradiction, we conclude that there is no contradiction and terminate. Given the determination of whether a contradiction exists, we use the formula below to formally establish the proofs $D_{S_{n}}$ (starting from $C_{\text{current}}$ = $S_{n}$ ) and $D_{\neg S_{n}}$ (starting from $C_{\text{current}}$ = $\neg S_{n}$ ), which determine whether $P_{n}$ entails $S_{n}$ or $\neg S_{n}$ .
Proof Determination
$$
D_{S_{n}}=\begin{cases}P_{n}\vdash\neg S_{n}&(C_{\text{resolved}}=\text{Contradiction})\\
P_{n}\not\vdash\neg S_{n}&(C_{\text{resolved}}=\text{No Contradiction})\end{cases}
$$
$$
D_{\neg S_{n}}=\begin{cases}P_{n}\vdash S_{n}&(C_{\text{resolved}}=\text{Contradiction})\\
P_{n}\not\vdash S_{n}&(C_{\text{resolved}}=\text{No Contradiction})\end{cases}
$$
Step 3: Conclude Answer.
These proofs $D_{S_{n}}$ and $D_{\neg S_{n}}$ can then be used to conclude the truth value $A$ of $S$ based on Eq. (1). For example, consider a statement $S$ . If we obtain $D_{S_{n}}$ = $P\vdash\neg S$ and $D_{\neg S_{n}}$ = $P\not\vdash S$ , their combination leads to the conclusion $A$ that $S$ is false according to Eq. (1).
Final Answer $\blacktriangleright$ Input: $D_{S_{n}}∈\left\{P_{n}\vdash\neg S_{n},P_{n}\not\vdash\neg S_{n}\right\}$ $D_{\neg S_{n}}∈\left\{P_{n}\vdash S_{n},P_{n}\not\vdash S_{n}\right\}$ $\blacktriangleright$ Output: $A∈\{\text{True, False, Unknown, Self-Contradictory}\}$
$$
A=\left\{\begin{array}[]{ll}\text{True,}&P_{n}\vdash S_{n}\land P_{n}\not\vdash\neg S_{n}\\
\text{False,}&P_{n}\not\vdash S_{n}\land P_{n}\vdash\neg S_{n}\\
\text{Unknown,}&P_{n}\not\vdash S_{n}\land P_{n}\not\vdash\neg S_{n}\\
\text{Self-Contradictory,}&P_{n}\vdash S_{n}\land P_{n}\vdash\neg S_{n}\\
\end{array}\right. \tag{1}
$$
The full algorithm and an example case can be found in Appendix H and I, respectively.
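Steps 1–3 can be put together in a short self-contained sketch. This is a hedged toy over ground (variable-free) clauses that assumes the query normalizes to a single literal; in the actual framework the resolve operation is performed by an LLM, and all encodings and function names below are illustrative assumptions:

```python
# Hypothetical end-to-end sketch of the dual-path proof by contradiction
# (Steps 1-3). Literals are (predicate, arguments, polarity) tuples.

def complementary(lit1, lit2):
    """Same predicate and arguments, opposite polarity."""
    return lit1[0] == lit2[0] and lit1[1] == lit2[1] and lit1[2] != lit2[2]

def resolve_once(current, premises):
    """One Search + Resolve iteration against the premise set; returns the
    resolved clause, or None when no complementary premise clause exists."""
    for clause in premises:
        for a in current:
            for b in clause:
                if complementary(a, b):
                    rest = [l for l in current if l != a] + \
                           [l for l in clause if l != b]
                    return [l for i, l in enumerate(rest) if l not in rest[:i]]
    return None

def derives_contradiction(start, premises, i_max=10):
    """Iterate Search and Resolve from `start`; True iff the empty clause
    (a contradiction) is reached within I_max iterations."""
    current = start
    for _ in range(i_max):
        resolved = resolve_once(current, premises)
        if resolved is None:      # no complementary clause to resolve with
            return False
        if not resolved:          # empty clause reached: contradiction
            return True
        current = resolved
    return False                  # I_max reached: no contradiction found

def negate(unit_clause):
    """Negate a single-literal query clause."""
    (pred, args, pol), = unit_clause
    return [(pred, args, not pol)]

def aristotle_answer(premises, s_n, i_max=10):
    """Combine both proof paths into the four-way answer of Eq. (1)."""
    proves_not_s = derives_contradiction(s_n, premises, i_max)      # P |- ~S ?
    proves_s = derives_contradiction(negate(s_n), premises, i_max)  # P |- S ?
    if proves_s and not proves_not_s:
        return "True"
    if proves_not_s and not proves_s:
        return "False"
    if not proves_s and not proves_not_s:
        return "Unknown"
    return "Self-Contradictory"

premises = [[('P', (), True)],                    # P
            [('P', (), False), ('Q', (), True)]]  # ~P v Q  (i.e. P -> Q)
print(aristotle_answer(premises, [('Q', (), True)]))  # "True": premises entail Q
```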
| Category | Method | GPT-4 ProntoQA | GPT-4 ProofWriter | GPT-4 LogicNLI | GPT-4 Avg | GPT-4o ProntoQA | GPT-4o ProofWriter | GPT-4o LogicNLI | GPT-4o Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LR | Naive | 77.4 | 53.1 | 49.0 | 59.8 | 89.6 | 48.7 | 53.0 | 63.8 |
| LR | CoT | 98.9 | 68.1 | 51.0 | 72.6 | 98.0 | 77.2 | 61.0 | 78.7 |
| AR | CoT-SC | 93.4 | 69.3 | 57.3 | 73.3 | 99.6 | 78.3 | 64.3 | 80.7 |
| AR | CR | 98.2 | 71.7 | 62.0 | 77.3 | 99.6 | 82.2 | 61.0 | 80.9 |
| AR | DetermLR | 98.6 | 79.2 | 57.0 | 78.3 | 93.4 | 69.8 | 58.0 | 75.7 |
| AR | ToT | 97.6 | 70.3 | 52.7 | 73.5 | 98.6 | 69.0 | 56.7 | 74.8 |
| SR | SymbCoT | 99.6 | 82.5 | 59.0 | 80.4 | 99.4 | 82.3 | 58.7 | 80.1 |
| SR | Logic-LM | 83.2 | 79.7 | - | - | 83.2 | 72.0 | - | - |
| | Ours | 99.6 | 86.8 | 68.3 | 84.9 | 99.6 | 88.5 | 70.7 | 86.3 |
| | | (+0.0) | (+4.3) | (+6.3) | (+4.5) | (+0.0) | (+6.2) | (+6.4) | (+5.4) |
Table 1: Performance on GPT-4 and GPT-4o. The best score is in bold and the second best is underlined. Brackets show the improvement of the best score over the second best.
4 Experiments
We present the experiment settings, baselines, and results in this section.
| Method | ProntoQA | ProofWriter | LogicNLI | Avg |
| --- | --- | --- | --- | --- |
| $\bullet$ Claude-3.5-Sonnet | | | | |
| CoT-SC | 98.0 | 78.5 | 54.3 | 77.0 |
| CR | 88.8 | 57.8 | 57.7 | 68.1 |
| ToT | 92.0 | 69.5 | 46.7 | 69.4 |
| Ours | 99.0 | 86.5 | 61.3 | 82.3 |
| | (+1.0) | (+8.0) | (+3.6) | (+5.3) |
| $\bullet$ Llama-3.1-405b | | | | |
| CoT-SC | 84.0 | 69.5 | 60.3 | 71.3 |
| CR | 96.0 | 56.3 | 50.7 | 67.7 |
| ToT | 98.4 | 65.5 | 56.7 | 73.5 |
| Ours | 98.4 | 89.5 | 69.0 | 85.6 |
| | (+0.0) | (+20.0) | (+8.7) | (+12.1) |
Table 2: Performance by using Claude-3.5-Sonnet and Llama-3.1-405B LLMs.
4.1 Settings
LLMs.
We assess the baselines and our method using GPT-4 and GPT-4o. We also include Claude-3.5-Sonnet and Llama-3.1-405B to verify whether our method generalizes to LLMs beyond the GPT series.
Dataset.
We evaluated both the baselines and our method on three carefully selected logical reasoning datasets: ProntoQA Saparov and He (2023), ProofWriter Tafjord et al. (2021), and LogicNLI Tian et al. (2021). These datasets were chosen to reflect increasing levels of difficulty, with ProntoQA being the easiest, ProofWriter moderately complex, and LogicNLI the most challenging due to its intricate logical structures. ProntoQA focuses on basic deductive logical relationships, ProofWriter introduces more complex structures such as “and/or,” and LogicNLI presents the most intricate reasoning with constructs such as “either/or” and “if and only if”. This progression enables us to comprehensively evaluate the effectiveness of our method across varying levels of complexity in logical structure. The details of each dataset can be found in Appendix D.
Baselines.
We compare with a wide range of established baselines. Those baselines can be classified into three main categories. (1) Linear Reasoning (LR) refers to approaches where the model arrives at an answer through a single-step process, using a straightforward response based on the initial prompt including: Naive Prompting and CoT Wei et al. (2022); (2) Aggregative Reasoning (AR) refers to methods where the model performs reasoning multiple times or aggregates the results to reach a final answer. This includes: CoT-SC Wang et al. (2023b); Cumulative Reasoning (CR; Zhang et al., 2023); DetermLR Sun et al. (2024); ToT Yao et al. (2023); (3) Symbolic Reasoning (SR), which engages symbolic expressions and rules in the reasoning framework including: SymbCoT Xu et al. (2024) and Logic-LM Pan et al. (2023). More details can be found in Appendix G.
4.2 Main Result
The main results are presented in Table 1, from which we draw the following observations:
Our method consistently outperforms all baselines across the three datasets.
Specifically, we achieve average improvements over CoT-SC, ToT, CR, and SymbCoT of 11.6%, 11.4%, 7.6%, and 4.5% on GPT-4, and 5.6%, 11.5%, 5.4%, and 6.2% on GPT-4o, respectively. These results demonstrate the general advantage of our method over the existing baselines across different datasets.
Our method performs even more effectively in complex logical scenarios.
We notice in Table 1 that our approach does not yield an improvement on the ProntoQA dataset. This can be attributed to the relative simplicity of the dataset, where most baselines already achieve high accuracy, leaving limited room for further enhancement. However, our improvements are more pronounced on the challenging datasets. Specifically, we achieve a 4.3% and 6.2% improvement over the second-best baseline on ProofWriter with GPT-4 and GPT-4o, respectively. On the most challenging dataset, LogicNLI, we observe even greater improvements of 6.3% for GPT-4 and 6.4% for GPT-4o. These results highlight the advantages of our method in scenarios involving complex logical structures and increased difficulty.
Our method is generalizable across different models.
In Table 2, we present the results for two models outside the GPT series (Claude-3.5-Sonnet and Llama-3.1-405B). We compare our method with strong baselines that aggregate multiple reasoning paths. Our method demonstrates similar improvements over these strong baselines, highlighting its generalizability across different models.
<details>
<summary>x9.png Details</summary>

Bar chart of ablation accuracy (%) on ProofWriter and LogicNLI under four configurations. ProofWriter: Ours 88.5, w/o Decomposer 62.0, w/o Search Router 37.7, w/o Resolver 47.1. LogicNLI: Ours 70.7, w/o Decomposer 42.8, w/o Search Router 39.1, w/o Resolver 57.5.
</details>
Figure 3: Ablation results (w/ GPT-4o).
4.3 Model Ablation
To evaluate the contribution of each module in our framework, we conducted an ablation study by replacing each module individually with simpler alternatives. Specifically, we substitute (1) the Decomposer by prompting the LLM for simple decomposition, (2) the Resolver by prompting the LLM to infer using the given premises, and (3) the Search Router by prompting the LLM to search for relevant premises.
The results, shown in Fig. 3, demonstrate that removing any module leads to a significant performance drop, highlighting the importance of each component. Notably, replacing the Search Router results in the largest performance decline (50.8% and 31.6% for ProofWriter and LogicNLI, respectively), emphasizing the benefits of searching for complementary premises under the proof-by-contradiction strategy. In addition, the Decomposer has a greater impact than the Resolver on LogicNLI, whereas on ProofWriter the Resolver plays a more significant role than the Decomposer. This is because LogicNLI includes more complex logical structures, such as “either…or…”, “vice versa”, and “if and only if”, while ProofWriter primarily involves simpler conjunctions such as “and” and “or”. As a result, LogicNLI relies more heavily on the Decomposer to break down complex logical statements into simpler forms for optimal performance.
5 Analysis and Discussion
We now take one step further, delving into the underlying working mechanisms of our system.
<details>
<summary>x10.png Details</summary>

Scatter plot of accuracy (%) vs. number of steps. Approximate points as labeled: Ours (12, 85); DetermLR (16, 77); CR (16, 72); CoT-SC (16, 68); ToT (24, 70).
</details>
Figure 4: Accuracy vs. Efficiency on ProofWriter using GPT-4. Efficiency is measured as the average number of visited nodes/steps required to solve the problem. The upper-left corner is the optimal point, representing the best performance with the fewest visited nodes.
5.1 Accuracy vs. Efficiency
Our method achieves better reasoning accuracy with higher efficiency.
Here, we measure the average number of steps or nodes required to solve problems in the ProofWriter dataset. As shown in Fig. 4, our method achieves the highest accuracy across all baselines while visiting the fewest nodes, indicating both superior efficacy and efficiency. Specifically, it visits only 11.65 nodes on average, reducing the number of visited nodes by 52.6%, 30.5%, and 20.4% compared to ToT, CR, and DetermLR, respectively. This demonstrates that our approach effectively balances accuracy and computational efficiency: by directly targeting contradictions, it significantly streamlines the reasoning process, making it both precise and efficient.
5.2 Step-wise Reasoning Accuracy
The Resolver can achieve near-perfect accuracy in one-step logical inference.
To understand why our framework is effective, we must also examine its one-step logical reasoning accuracy. Since the final answer is derived from these individual inferences (i.e., nodes in ToT and steps in ours), their accuracy directly impacts the overall performance. We compare the one-step reasoning accuracy of our method with that of ToT, as shown in Fig. 5, by randomly sampling 100 cases and evaluating them manually. ToT demonstrates around 70% accuracy, consistent with prior research showing that LLMs can sometimes introduce logical errors. In contrast, our Aristotle achieves near-perfect accuracy in one-step inference, underscoring the effectiveness of the Resolver module’s use of the resolution principle. This is because the resolution principle provides a systematic and logically rigorous way to resolve contradictions, simplifying each reasoning step compared to methods that rely on LLMs to reason freely from previous steps and multiple premises.
<details>
<summary>x12.png Details</summary>

Bar chart of one-step inference accuracy (%). ProofWriter: ToT 74.0, Ours 99.7 (+25.7). LogicNLI: ToT 69.2, Ours 99.5 (+30.3).
</details>
Figure 5: One-step reasoning accuracy using GPT-4o.
5.3 Search Error
Our method effectively reduces errors from the search strategy.
Apart from one-step logical inference, the Search Router also plays a crucial role. Previous research has shown that methods relying on an evaluator to guide the search tend to underperform, as the evaluator can be unreliable and may mislead the reasoning process, resulting in incorrect answers. We assess the search error, as shown in Fig. 6. (In ToT, search errors occur when the evaluator selects a logically flawed node for expansion; in our method, they arise when either the complementary clause is not found or a non-complementary clause is selected. The evaluation is conducted manually.) Our search strategy significantly reduces errors, lowering them by 11.2% in ProofWriter and 9.0% in LogicNLI. This demonstrates that our logic-based search approach outperforms LLM self-evaluation, effectively addressing the limitations posed by unreliable evaluators in logical reasoning. One explanation is that our method simplifies the search process to identifying complementary clauses, a task with clear definitions and rules that an LLM can easily follow given a few examples. In contrast, having an LLM evaluate logical inferences, as in ToT, requires complex judgments, making it more prone to errors Chen et al. (2024); Wang et al. (2024b).
<details>
<summary>x13.png Details</summary>

Bar chart of error rates (%); outer bars show overall error, inner bars the search-caused portion. ProofWriter: ToT 31.0 (search 14.0), Ours 11.5 (search 2.8). LogicNLI: ToT 43.2 (search 16.0), Ours 29.5 (search 5.0).
</details>
Figure 6: The outer bar shows the overall error rate. The inner bar represents the proportion of the error caused by the search strategy.
5.4 Complex Reasoning
Our method demonstrates a clear advantage in handling problems of increasing difficulty.
We evaluate accuracy across different reasoning difficulties in ProofWriter, as shown in Fig. 7. Our method consistently outperforms others at all depths, maintaining superior accuracy. It excels particularly at moderate and challenging depths, surpassing baselines like SymbCoT and Logic-LM. Even as other methods struggle at higher depths, our approach remains robust, demonstrating better scalability and resilience to problem difficulty. This effectiveness is due to two main factors: (1) using the resolution principle minimizes errors at each step and prevents them from compounding, and (2) streamlining the reasoning process reduces steps, lowering the likelihood of error accumulation.
<details>
<summary>x14.png Details</summary>

Line graph of accuracy (%) vs. reasoning depth (0–5) on ProofWriter. Approximate values from depth 0 to depth 5: Ours 98→78; SymbCoT 90→70; CoT 82→72; Logic-LM 76→50.
</details>
Figure 7: Studying the effect of reasoning depth with GPT-4 on ProofWriter.
The cost scaling is stable with increased reasoning depth.
To complement the performance analysis across reasoning depths, we also address concerns about the scalability of our framework on tasks requiring greater reasoning depth. Specifically, we provide a detailed examination of the marginal computational cost and token usage associated with extended reasoning chains. Aristotle is designed to handle increased reasoning depth in an efficient and controlled manner.
Each reasoning step in our framework consists of two operations. First, a search operation is performed using a deterministic rule-based method, which introduces minimal computational overhead and does not contribute additional token usage as reasoning depth increases. Second, a resolve operation is executed by the resolver module, which processes one reasoning step at a time and is the only component contributing to token-based computational cost.
This modular design ensures that total token usage grows linearly with the number of reasoning steps, keeping the marginal cost per additional step low and predictable. To empirically validate the efficiency and stability of our approach, we analyzed token usage for individual reasoning steps across two benchmark datasets, ProofWriter and LogicNLI, as shown in Table 3. The statistics indicate that token usage per reasoning step is highly consistent, exhibiting low standard deviation and coefficient of variation. Moreover, the absolute token cost per step remains modest, ensuring that deeper reasoning does not impose a significant computational burden. Given the stable, linear scaling of token consumption and the low per-step cost, our framework maintains its efficiency even for tasks requiring extended reasoning chains.
| Dataset | TU | Std. Dev. | CV |
| --- | --- | --- | --- |
| ProofWriter | 3076.8 | 19.1 | 0.71% |
| LogicNLI | 2071.1 | 26.4 | 0.85% |
Table 3: Token usage statistics per reasoning step. TU denotes Token Usage, Std. Dev. denotes Standard Deviation, and CV denotes Coefficient of Variation.
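The linear-scaling argument above amounts to a simple cost model: the rule-based search adds no tokens, so total usage is a fixed setup cost (translation and decomposition) plus depth times the Resolver's per-step cost. The sketch below illustrates this with the Table 3 per-step average for ProofWriter; the setup constant and function name are placeholder assumptions.

```python
# Back-of-the-envelope token-cost model implied by the linear-scaling claim.

def total_tokens(depth, per_step_tokens, setup_tokens=0.0):
    """Tokens consumed by a proof of the given reasoning depth:
    fixed setup cost + depth * per-step Resolver cost (search is token-free)."""
    return setup_tokens + depth * per_step_tokens

# ProofWriter averages ~3076.8 tokens per resolve step (Table 3), so a
# depth-5 proof costs roughly 5 * 3076.8 ~= 15,384 tokens on top of setup.
print(total_tokens(depth=5, per_step_tokens=3076.8))
```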
[Figure: two pie charts of error-type proportions. ProofWriter: Contra Error 27.0%, Search 24.3%, Imax 18.9%, Translation 10.8%, Decompose 10.8%, Resolve 8.1%. LogicNLI: Contra Error 33.3%, Translation 23.8%, Search 16.7%, Resolve 9.5%, Decompose 9.5%, Imax 7.1%.]
Figure 8: Error analysis on ProofWriter and LogicNLI with GPT-4o.
5.5 Error Analysis
To thoroughly understand the limitations of our framework, we conduct a manual error analysis on the ProofWriter and LogicNLI datasets using GPT-4o. The detailed error statistics are presented in Fig. 8.
The majority of errors stem from the Contradiction Error, primarily due to flaws in dataset construction. The next most common source is the Search module, where complementary clauses exist but are not retrieved. This is often due to unexpected symbols (e.g., LaTeX code) in the LLM’s output that disrupt regular expression matching.
Translation and Decomposition errors are the third largest category. These occur when the LLM struggles to parse complex logical relationships or convert them into the correct symbolic form. Such errors are more prevalent in LogicNLI, which features more intricate constructs (e.g., "either…or…" and "if and only if") compared to the simpler logic in ProofWriter (e.g., "and," "or").
Another notable source of error is Insufficient Iterations, where the reasoning process terminates prematurely, often concluding "No contradiction" when further iterations might reveal one. While increasing the iteration threshold could mitigate this, it must be balanced against computational efficiency.
Finally, Resolution errors, often due to incorrect variable instantiations, are relatively infrequent. This is because logical statements have been reduced to a simple conjunctive normal form (CNF) by this point, making them easier to interpret, and the resolution principle offers clear guidance for resolving inconsistencies.
Potential improvements include incorporating more targeted in-context learning (ICL) examples to enhance translation and resolution. Besides, enhancing regular expression patterns to accommodate a wider range of syntactic variations could further reduce search errors. Additionally, tuning iteration limits would help achieve a better balance between accuracy and computational efficiency. More details of error analysis can be found at Appendix F.
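As an illustration of the regex-hardening suggestion, the pattern below (purely hypothetical, not the framework's actual matcher) tolerates common LaTeX debris such as `\text{...}` wrappers and `\neg` markers around predicate literals:

```python
import re

# Tolerant literal matcher (illustrative only): accepts an optional LaTeX
# \text{...} wrapper and negation written either as ¬/~ or as \neg.
LITERAL = re.compile(
    r"(?:\\text\{)?"             # optional LaTeX \text{ wrapper
    r"(?P<neg>[¬~]|\\neg\s*)?"   # optional negation, unicode or LaTeX form
    r"(?P<pred>[A-Za-z_]\w*)"    # predicate name
    r"\}?\s*"                    # optional closing-brace debris
    r"\((?P<args>[^()]*)\)"      # argument list
)

for s in [r"Smart(Gary, True)", r"\text{Smart}(Gary, True)", r"\neg Smart(Gary, True)"]:
    m = LITERAL.search(s)
    print(m.group("pred"), m.group("args"), bool(m.group("neg")))
```

All three variants resolve to the same predicate `Smart` with arguments `Gary, True`, so stray LaTeX in the LLM's output no longer breaks the match.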
6 Related Work
Enhancing the logical reasoning of LLMs to achieve human-like levels has been a key focus in recent research Huang and Chang (2023); Dunivin (2024). Existing approaches can be broadly categorized into the following:
Linear Reasoning.
These approaches prompt the LLM once to emulate human reasoning in a sequential manner. A representative example is the CoT method Wei et al. (2022), which guides the model to generate reasoning steps linearly. Building upon CoT, SymbCoT Xu et al. (2024) incorporates symbolic expressions and rules into the linear reasoning process to achieve more rigorous logical reasoning. However, these methods lack any search or backtracking mechanism to avoid flawed reasoning paths, leading to suboptimal performance. In this paper, we therefore adopt a search-based reasoning framework.
Sampling Methods.
These methods obtain the final answer through repeated sampling to enhance reasoning diversity and accuracy Wang et al. (2023b); Fu et al. (2023); Manakul et al. (2023). This involves running the reasoning process multiple times, with the increased diversity of reasoning paths contributing to better results. However, they do not resolve the underlying issue of flawed logical reasoning inherent to the LLM, which our framework addresses via the resolution principle Robinson (1965).
Iterative Reasoning with Search.
These methods rely on an evaluator to search over different reasoning paths and avoid flawed nodes. Techniques such as ToT Yao et al. (2023) and its variants Xie et al. (2023); Zhang et al. (2023); Sun et al. (2024) generate multiple thoughts during reasoning. An evaluator repeatedly selects the most promising paths to expand to the next level until the final answer is reached. However, the evaluator may not be reliable Chen et al. (2024); Wang et al. (2024b), potentially selecting incorrect nodes for expansion and propagating errors. This paper proposes a search mechanism based on matching symbolic logic, avoiding the use of an unreliable evaluator.
Reasoning Relying on External Tools.
These approaches integrate well-developed rule-based reasoning engines, with LLMs acting as mere translators that convert the natural language of reasoning questions into specific symbolic forms to be processed by the external engines. Examples include Logic-LM Pan et al. (2023), LINC Olausson et al. (2023), and PAL Gao et al. (2023). The limitation of this approach lies in the strict formatting requirements of external logic solvers; LLMs inevitably introduce syntax errors during translation, leading to failures in the reasoning process. Fortunately, the success of SymbCoT Xu et al. (2024) preliminarily demonstrates that enabling LLMs to perform logical reasoning over symbols is feasible and promising. In this paper, we further show that symbolic logic expressions can be fully integrated into all stages of the reasoning framework, including decomposition, search, and inference, thereby demonstrating that LLMs themselves can fully achieve high-level symbolic logical reasoning.
7 Conclusion
We presented Aristotle, a logic-complete reasoning framework designed to tackle the challenges of logical reasoning, which comprehensively integrates symbolic expressions and logical rules into three core components: Logical Decomposer, Logical Search Router, and Logical Resolver. These modules streamline the reasoning process by reducing task complexity, minimizing search errors, and rigorously resolving logical contradictions. Our experiments show that our method significantly and consistently outperforms state-of-the-art frameworks in both accuracy and efficiency.
Acknowledgements
This work is supported by the Ministry of Education, Singapore, under its MOE AcRF TIER 3 Grant (MOE-MOET32022-0001).
Potential Limitations
Our method faces two potential limitations. First, the reasoning process relies on the quality of translation and decomposition. However, even with a few-shot approach, LLMs cannot always guarantee that these processes are fully correct. Future work could consider more advanced methods, such as fine-tuning, to guarantee the quality of these processes. Second, our reasoning approach requires that all necessary information is explicitly stated in the premises. If any implicit information or assumptions exist, our method may fail to capture them, leading to incorrect reasoning. A more detailed analysis of this limitation is provided in Appendix E. Nevertheless, existing methods explore how to make implicit information explicit; future work could integrate them into this framework to address the limitation.
Ethics Statement
The datasets employed in our research were carefully selected from publicly available or ethically sourced materials, and we take deliberate steps to identify and reduce biases within these datasets to uphold fairness in our outcomes.
References
- Baader et al. (2003) Franz Baader, Diego Calvanese, Deborah L. McGuinness, Daniele Nardi, and Peter F. Patel-Schneider, editors. 2003. The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press.
- Besta et al. (2024) Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. 2024. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 17682–17690, Vancouver, Canada.
- Bishop (1967) Errett Bishop. 1967. Foundations of Constructive Analysis. Mcgraw-Hill.
- Chen et al. (2024) Ziru Chen, Michael White, Ray Mooney, Ali Payani, Yu Su, and Huan Sun. 2024. When is tree search useful for LLM planning? It depends on the discriminator. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 13659–13678, Bangkok, Thailand.
- Chowdhery et al. (2023) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2023. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24:240:1–240:113.
- Clocksin and Mellish (2003) William F Clocksin and Christopher S Mellish. 2003. Programming in Prolog. Springer Science & Business Media.
- Cummins et al. (1991) D.D. Cummins, T. Lubart, O. Alksnis, et al. 1991. Conditional reasoning and causation. Memory & Cognition, 19:274–282.
- Davis and Putnam (1960) Martin Davis and Hilary Putnam. 1960. A computing procedure for quantification theory. J. ACM, 7(3):201–215.
- Dunivin (2024) Zackary Okun Dunivin. 2024. Scalable qualitative coding with LLMs: Chain-of-Thought reasoning matches human performance in some hermeneutic tasks. ArXiv preprint, abs/2401.15170.
- Eisner et al. (2024) Ben Eisner, Yi Yang, Todor Davchev, Mel Vecerík, Jonathan Scholz, and David Held. 2024. Deep se(3)-equivariant geometric reasoning for precise placement tasks. In The Twelfth International Conference on Learning Representations, Vienna, Austria.
- Fu et al. (2023) Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. 2023. Complexity-based prompting for multi-step reasoning. In The Eleventh International Conference on Learning Representations, Kigali, Rwanda.
- Gao et al. (2023) Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. PAL: program-aided language models. In International Conference on Machine Learning, volume 202, pages 10764–10799, Honolulu, Hawaii, USA. PMLR.
- Han et al. (2022) Simeng Han, Hailey Schoelkopf, Yilun Zhao, Zhenting Qi, Martin Riddell, Luke Benson, Lucy Sun, Ekaterina Zubova, Yujie Qiao, Matthew Burtell, David Peng, Jonathan Fan, Yixin Liu, Brian Wong, Malcolm Sailor, Ansong Ni, Linyong Nan, Jungo Kasai, Tao Yu, Rui Zhang, Shafiq R. Joty, Alexander R. Fabbri, Wojciech Kryscinski, Xi Victoria Lin, Caiming Xiong, and Dragomir Radev. 2022. FOLIO: natural language reasoning with first-order logic. ArXiv preprint, abs/2209.00840.
- Huang and Chang (2023) Jie Huang and Kevin Chen-Chuan Chang. 2023. Towards reasoning in large language models: A survey. In Findings of the Association for Computational Linguistics: ACL 2023, pages 1049–1065, Toronto, Canada.
- Li et al. (2024) Haoming Li, Zhaoliang Chen, Jonathan Zhang, and Fei Liu. 2024. LASP: surveying the state-of-the-art in large language model-assisted AI planning. ArXiv preprint, abs/2409.01806.
- Liu et al. (2023) Hanmeng Liu, Ruoxi Ning, Zhiyang Teng, Jian Liu, Qiji Zhou, and Yue Zhang. 2023. Evaluating the logical reasoning ability of ChatGPT and GPT-4. ArXiv preprint, abs/2304.03439.
- Liu et al. (2020) Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. 2020. LogiQA: A challenge dataset for machine reading comprehension with logical reasoning. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pages 3622–3628.
- Manakul et al. (2023) Potsawee Manakul, Adian Liusie, and Mark Gales. 2023. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9004–9017, Singapore.
- Markovits and Vachon (1989) Henry Markovits and Robert Vachon. 1989. Reasoning with contrary-to-fact propositions. Journal of Experimental Child Psychology, 47:398–412.
- Ning et al. (2024) Xuefei Ning, Zinan Lin, Zixuan Zhou, Zifu Wang, Huazhong Yang, and Yu Wang. 2024. Skeleton-of-Thought: Prompting llms for efficient parallel generation. In The Twelfth International Conference on Learning Representations, Vienna, Austria.
- Nonnengart (1996) Andreas Nonnengart. 1996. Strong skolemization.
- Olausson et al. (2023) Theo Olausson, Alex Gu, Ben Lipkin, Cedegao Zhang, Armando Solar-Lezama, Joshua Tenenbaum, and Roger Levy. 2023. LINC: A neurosymbolic approach for logical reasoning by combining language models with First-Order Logic provers. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5153–5176, Singapore.
- Pan et al. (2023) Liangming Pan, Alon Albalak, Xinyi Wang, and William Wang. 2023. Logic-LM: Empowering large language models with symbolic solvers for faithful logical reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 3806–3824, Singapore.
- Parmar et al. (2024) Mihir Parmar, Nisarg Patel, Neeraj Varshney, Mutsumi Nakamura, Man Luo, Santosh Mashetty, Arindam Mitra, and Chitta Baral. 2024. LogicBench: Towards systematic evaluation of logical reasoning ability of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 13679–13707, Bangkok, Thailand.
- Patel et al. (2023) Ajay Patel, Bryan Li, Mohammad Sadegh Rasooli, Noah Constant, Colin Raffel, and Chris Callison-Burch. 2023. Bidirectional language models are also few-shot learners. In The Eleventh International Conference on Learning Representations, Kigali, Rwanda.
- Patel et al. (2024) Nisarg Patel, Mohith Kulkarni, Mihir Parmar, Aashna Budhiraja, Mutsumi Nakamura, Neeraj Varshney, and Chitta Baral. 2024. Multi-LogiEval: Towards evaluating multi-step logical reasoning ability of large language models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 20856–20879, Bangkok, Thailand.
- Robinson (1965) John Alan Robinson. 1965. A machine-oriented logic based on the resolution principle. J. ACM, 12(1):23–41.
- Saparov and He (2023) Abulhair Saparov and He He. 2023. Language models are greedy reasoners: A systematic formal analysis of Chain-of-Thought. In The Eleventh International Conference on Learning Representations, Kigali, Rwanda.
- Sun et al. (2024) Hongda Sun, Weikai Xu, Wei Liu, Jian Luan, Bin Wang, Shuo Shang, Ji-Rong Wen, and Rui Yan. 2024. DetermLR: Augmenting LLM-based logical reasoning from indeterminacy to determinacy. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9828–9862, Bangkok, Thailand.
- Tafjord et al. (2021) Oyvind Tafjord, Bhavana Dalvi, and Peter Clark. 2021. ProofWriter: Generating implications, proofs, and abductive statements over natural language. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3621–3634.
- Tian et al. (2021) Jidong Tian, Yitian Li, Wenqing Chen, Liqiang Xiao, Hao He, and Yaohui Jin. 2021. Diagnosing the first-order logical reasoning ability through LogicNLI. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3738–3747, Online and Punta Cana, Dominican Republic.
- Tyagi et al. (2024) Nemika Tyagi, Mihir Parmar, Mohith Kulkarni, Aswin RRV, Nisarg Patel, Mutsumi Nakamura, Arindam Mitra, and Chitta Baral. 2024. Step-by-step reasoning to solve grid puzzles: Where do LLMs falter? ArXiv preprint, abs/2407.14790.
- Wang et al. (2024a) Ke Wang, Houxing Ren, Aojun Zhou, Zimu Lu, Sichun Luo, Weikang Shi, Renrui Zhang, Linqi Song, Mingjie Zhan, and Hongsheng Li. 2024a. Mathcoder: Seamless code integration in LLMs for enhanced mathematical reasoning. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024.
- Wang et al. (2023a) Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. 2023a. Plan-and-Solve Prompting: Improving zero-shot Chain-of-Thought reasoning by large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2609–2634, Toronto, Canada.
- Wang et al. (2024b) Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Lingpeng Kong, Qi Liu, Tianyu Liu, and Zhifang Sui. 2024b. Large language models are not fair evaluators. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 9440–9450, Bangkok, Thailand.
- Wang et al. (2024c) Weiqi Wang, Tianqing Fang, Chunyang Li, Haochen Shi, Wenxuan Ding, Baixuan Xu, Zhaowei Wang, Jiaxin Bai, Xin Liu, Cheng Jiayang, Chunkit Chan, and Yangqiu Song. 2024c. CANDLE: Iterative conceptualization and instantiation distillation from large language models for commonsense reasoning. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 2351–2374, Bangkok, Thailand.
- Wang et al. (2023b) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023b. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, Kigali, Rwanda.
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-Thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems, New Orleans, LA, USA.
- Xie et al. (2023) Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, James Xu Zhao, Min-Yen Kan, Junxian He, and Michael Qizhe Xie. 2023. Self-evaluation guided beam search for reasoning. In Proceedings of the Annual Conference on Neural Information Processing Systems, New Orleans, LA, USA.
- Xu et al. (2024) Jundong Xu, Hao Fei, Liangming Pan, Qian Liu, Mong-Li Lee, and Wynne Hsu. 2024. Faithful logical reasoning via Symbolic Chain-of-Thought. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, pages 13326–13365, Bangkok, Thailand.
- Yao et al. (2023) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of Thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems, New Orleans, LA, USA.
- Yu et al. (2020) Weihao Yu, Zihang Jiang, Yanfei Dong, and Jiashi Feng. 2020. Reclor: A reading comprehension dataset requiring logical reasoning. In 8th International Conference on Learning Representations.
- Zhang et al. (2023) Yifan Zhang, Jingqin Yang, Yang Yuan, and Andrew Chi-Chih Yao. 2023. Cumulative reasoning with large language models. ArXiv preprint, abs/2308.04371.
- Zhao et al. (2023) Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023. A survey of large language models. ArXiv preprint, abs/2303.18223.
- Zhou et al. (2023) Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V. Le, and Ed H. Chi. 2023. Least-to-Most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations, Kigali, Rwanda.
Appendix A Logical Grammar
Facts: A fact is an assertion about specific individuals in the domain. It involves a predicate applied to constants (not variables) and states that a particular relationship holds between these individuals. For example, $\text{Sees}(\text{Jack},\text{Tom},\text{True})$ asserts Jack sees Tom.
Rules: A rule delineates a logical relationship between predicates and forms an integral component of the domain’s terminological knowledge. These rules typically incorporate variables and are universally quantified, ensuring their applicability across all relevant instances within the domain. Rules can involve logical connectors such as "and" ( $\land$ ), "or" ( $\lor$ ), "either…or…" (exclusive or, $\oplus$ ), and "not" ( $\neg$ ), appearing on both sides of the implication ( $\rightarrow$ ) or biconditional ( $\leftrightarrow$ ) operators. For instance, the rule
$$
\forall x,y\left(\text{Sees}(x,y)\rightarrow\text{Knows}(x,y)\right)
$$
asserts that for all individuals $x$ and $y$ , if $x$ sees $y$ , then $x$ knows $y$ .
Query: A query is a fact that needs to be proven based on the given facts and rules.
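To ground the grammar, the sketch below shows one possible encoding of facts and rules as predicate literals. The `Literal` class and the lowercase-variable convention are our illustrative assumptions, not the framework's actual data structures:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Literal:
    """A predicate applied to arguments; positive=False encodes negation.
    By convention here, lowercase arguments are variables and capitalized
    arguments are constants (specific individuals)."""
    predicate: str
    args: tuple
    positive: bool = True

    def __str__(self) -> str:
        sign = "" if self.positive else "¬"
        return f"{sign}{self.predicate}({', '.join(self.args)})"

# Fact: Sees(Jack, Tom) — a relationship between two constants.
fact = Literal("Sees", ("Jack", "Tom"))

# Rule ∀x,y (Sees(x, y) → Knows(x, y)), written in clause form as
# ¬Sees(x, y) ∨ Knows(x, y); x and y are implicitly universally quantified.
rule = (Literal("Sees", ("x", "y"), positive=False),
        Literal("Knows", ("x", "y")))

print(fact)                        # Sees(Jack, Tom)
print(" ∨ ".join(map(str, rule)))  # ¬Sees(x, y) ∨ Knows(x, y)
```

A query would simply be another `Literal` over constants whose truth is to be established from the facts and rules.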
Appendix B Three potential situations for searching complement
When searching for a complementary clause during the resolution process, three potential situations may arise.
- If exactly one complementary clause is found, the resolution is performed directly.
- If multiple complementary clauses are identified, we prioritize the shorter clauses, while saving the remaining clauses as backup options. In cases where the current reasoning path cannot find any further complementary clauses before reaching the predefined maximum number of iterations, we will backtrack and attempt to use these backup clauses.
- If no complementary clause is found initially, we will backtrack to the backup list. Should the backup list also be exhausted, we will conclude the reasoning process by determining the result as "No contradiction found."
This approach ensures a structured and efficient search and resolution process, while also accounting for cases where multiple or no complementary clauses are found, improving overall search robustness.
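The three situations above can be sketched as a small propositional search loop. This is a simplified illustration under our own conventions (literals are strings, `~` marks negation, clauses are frozensets); the actual framework operates on first-order clauses with the LLM-based resolver:

```python
def complement(lit: str) -> str:
    """Negation of a string literal: '~P' <-> 'P'."""
    return lit[1:] if lit.startswith("~") else "~" + lit

def find_complements(clause, pool):
    """All pool clauses containing a literal complementary to one in `clause`."""
    return [c for c in pool if any(complement(l) in c for l in clause)]

def resolve(c1, c2):
    """One resolution step on the first complementary pair found."""
    for lit in c1:
        if complement(lit) in c2:
            return (c1 - {lit}) | (c2 - {complement(lit)})
    raise ValueError("clauses are not complementary")

def search(goal, pool, max_iters=20):
    backups, current = [], goal
    for _ in range(max_iters):
        candidates = find_complements(current, pool)
        if candidates:                 # situations 1/2: complements exist
            candidates.sort(key=len)   # prioritize shorter clauses
            partner = candidates[0]
            backups.extend((current, c) for c in candidates[1:])  # save backups
        elif backups:                  # situation 3: backtrack to a backup pair
            current, partner = backups.pop()
        else:                          # backup list exhausted
            return "No contradiction found"
        current = resolve(current, partner)
        if not current:                # empty clause derived
            return "Contradiction"
    return "No contradiction found"    # iteration cap reached

# Premises: P and P → Q (clause ~P ∨ Q); negated query: ~Q.
pool = [frozenset({"P"}), frozenset({"~P", "Q"})]
print(search(frozenset({"~Q"}), pool))  # Contradiction
```

The iteration cap plays the role of the predefined maximum number of iterations mentioned above.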
Appendix C Conjunctive Normal Form
Conjunctive Normal Form (CNF) is a standardized way of expressing logical statements in Boolean logic. In CNF, a formula is composed of a conjunction (AND, denoted as $\land$ ) of clauses, where each clause is a disjunction (OR, denoted as $\lor$ ) of literals. A literal is either a variable or its negation. For example, the logical statement $(A\lor\neg B)\land(C\lor D)$ is in CNF. Each group within the parentheses is a clause, and the entire expression is the conjunction of these clauses. The reason we need logical statements to be in CNF to conduct resolution is that resolution is a fundamental inference rule in automated theorem proving. It works by finding pairs of clauses where one contains a literal and the other contains its negation. These pairs can then be combined to eliminate the opposing literals, simplifying the overall logical expression. Since CNF breaks down complex statements into smaller, manageable components of ANDs and ORs, it allows the resolution rule to systematically and efficiently simplify or refute logical expressions, thus enabling automated reasoning systems to solve problems.
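For illustration, suppose we add a third clause $(B \lor C)$; one resolution step on the complementary pair $\neg B$ and $B$ then proceeds as:

$$
\frac{(A \lor \neg B) \qquad (B \lor C)}{A \lor C}
$$

The opposing literals are eliminated and the remaining literals merge into a new, shorter clause; repeating this until the empty clause is derived constitutes a refutation proof.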
Appendix D Dataset Specifications
ProntoQA is a synthetic dataset designed to assess models’ deductive reasoning abilities. For our evaluation, we use the most difficult 5-hop subset. Each question requires verifying whether a given claim is "True" or "False" based on the provided premises. The dataset focuses on basic logical relationships: for instance, given "X is Y" and "Y is Z", determine whether "X is Z".
ProofWriter is a popular dataset for logical deductive reasoning. It has five subsets with different reasoning depths (from 1 to 5); we use the hardest depth-5 subset for evaluation. The contexts in this dataset contain more challenging logical relationships, such as combinations of "and" and "or".
LogicNLI is a challenging NLI-style dataset that effectively disentangles the target logical reasoning from commonsense inference. In addition to common logical relationships such as "and" and "or", it presents the most difficult relationships among the three datasets, such as "either…or…", "vice versa", and "if and only if".
The test sizes are 500 for ProntoQA, 600 for ProofWriter, and 300 for LogicNLI, respectively.
Appendix E Evaluation on Real-world Dataset
| Method | FOLIO | LogiQA |
| --- | --- | --- |
| CoT | 75.0 | 65.6 |
| Aristotle | 76.5 | 31.2 |
Table 4: Performance on the real-world benchmarks FOLIO and LogiQA.
The ability to solve real-world problems is important, as it reflects a model’s applicability beyond synthetic or constrained settings. Numerous benchmarks Han et al. (2022); Patel et al. (2024); Liu et al. (2023); Parmar et al. (2024); Liu et al. (2020); Yu et al. (2020); Tyagi et al. (2024) have been introduced to evaluate this capability by incorporating tasks that require implicit reasoning, background knowledge, and commonsense understanding.
We include an evaluation on real-world datasets FOLIO Han et al. (2022) and LogiQA Liu et al. (2020), to provide a deeper understanding of Aristotle. As shown in Table 4, Aristotle’s performance is suboptimal. This is primarily due to its design: it operates on explicitly stated premises and deliberately avoids relying on unstated assumptions or external background knowledge. However, real-world scenarios frequently depend on such implicit information and commonsense reasoning, making these capabilities essential for robust performance.
While Aristotle promotes clarity and precision in logical inference, it may not fully capture the complexities inherent in real-world tasks—a limitation we have acknowledged in the main paper. Notably, Aristotle’s performance on LogiQA is significantly worse than on FOLIO. Our qualitative analysis reveals that this discrepancy stems from LogiQA’s greater reliance on commonsense knowledge and implicit assumptions, which makes it less compatible with Aristotle’s strictly premise-driven reasoning approach.
To address this, our plan is to incorporate external knowledge to better handle such scenarios in future work. For instance, information retrieval from the internet or commonsense knowledge graphs could supplement the explicit premises with the necessary implicit knowledge for reasoning. This integration would enable our method to leverage background information that is not explicitly provided, improving its applicability and performance in solving real-world tasks.
Appendix F Error Analysis
Here, we present a more detailed error analysis.
F.1 False Contradiction
False contradiction refers to cases where the method identifies a contradiction when none should exist, leading to an incorrect final answer. This issue often arises when the ground truth of a problem is false. For example, when the ground truth is false, we should find a contradiction when reasoning from the negation of the statement and no contradiction when reasoning from the original statement, as outlined in Equation 1.
However, our method sometimes finds a contradiction even when reasoning from the original statement, altering the final answer and producing errors. This is likely due to the way the datasets are constructed. When the ground truth is false, for instance, the dataset may be built such that the false statement is provable, but the construction process might fail to ensure that the true statement is not provable. This oversight results in both the true and false values being provable, making the problem self-contradictory. This situation occurs more frequently when the ground truth is either true or false (rather than unknown), suggesting that the dataset did not fully account for the exclusive relationship between proving one value and excluding the other.
In an ideal dataset construction where the ground truth is true or false, the premises should only allow for the ground truth to be provable, while any other possible answers should be logically excluded. That is, if a statement is provably true, it should be impossible to prove its negation, and vice versa. Ensuring this exclusivity is crucial for logical consistency.
That said, it’s important to note that these False Contradictions represent only a small portion of the overall dataset. While this issue can affect some instances, it doesn’t significantly undermine the dataset’s overall effectiveness for testing reasoning models.
F.2 Resolver Instantiation Error
An instantiation error occurs when the resolver incorrectly substitutes a variable, leading to an inaccurate conclusion. For example, given the two clauses $Smart(Gary,False)$ and $Smart(Gary,True)\lor Nice(x,False)$, the correct resolution recognizes that $Smart(Gary,False)$ directly complements $Smart(Gary,True)$, yielding the simplified clause $Nice(x,False)$ without needing to instantiate $x$. However, if the resolver mistakenly instantiates $x$ as $Gary$, the clause becomes $Nice(Gary,False)$, which is more specific than necessary.
This error restricts the generality of the conclusion, as the correct clause $Nice(x,False)$ is intended to apply to any individual, not just "Gary." Such improper instantiation can lead to faulty reasoning in subsequent steps, where conclusions might be incorrectly drawn because the reasoning process has been prematurely narrowed to a specific case. Ensuring that instantiation only occurs when necessary can help prevent these errors and maintain the validity of logical deductions.
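The correct behavior can be sketched as follows, using our own illustrative tuple encoding of literals as `(predicate, args, truth_value)` (not the framework's actual representation); lowercase argument names stand for variables:

```python
def complementary(l1, l2):
    """Two literals complement iff they share predicate and arguments
    but carry opposite truth values."""
    return l1[0] == l2[0] and l1[1] == l2[1] and l1[2] != l2[2]

def resolve(clause1, clause2):
    """Resolve on the first complementary pair. Crucially, side literals
    such as Nice(x, False) are carried over untouched, so the variable x
    keeps its generality and is never instantiated unnecessarily."""
    for l1 in clause1:
        for l2 in clause2:
            if complementary(l1, l2):
                return [l for l in clause1 if l != l1] + \
                       [l for l in clause2 if l != l2]
    return None  # no complementary pair: the clauses cannot be resolved

c1 = [("Smart", ("Gary",), False)]
c2 = [("Smart", ("Gary",), True), ("Nice", ("x",), False)]
print(resolve(c1, c2))  # [('Nice', ('x',), False)] — x remains a variable
```

Because the complementary pair is already ground, no substitution is needed, and the conclusion stays applicable to any individual.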
Appendix G Baselines
Here we illustrate the details of each baseline.
Naive Prompting
Naive Prompting refers to a basic prompting technique where a model is given a question or task without any complex instructions or intermediate steps. The model is expected to output a direct answer based on its existing knowledge. In this approach, the reasoning process is implicit, and the model simply leverages its pre-trained knowledge to respond without additional structured reasoning or step-by-step guidance.
Chain-of-Thought (CoT)
Chain-of-Thought (CoT) prompting is a more advanced prompting strategy that encourages the model to generate intermediate reasoning steps before arriving at a final answer. Instead of asking the model for an immediate response, the prompt guides the model to break down the reasoning process into smaller, logical steps. This allows the model to engage in more thoughtful problem-solving and often leads to better performance on tasks requiring multi-step reasoning Wei et al. (2022).
Chain-of-Thought with Self-Consistency (CoT-SC)
Chain-of-Thought with Self-Consistency (CoT-SC) improves upon the standard CoT method by running the chain-of-thought reasoning process multiple times independently. Instead of producing just one reasoning chain per query, the model generates multiple chains for the same task. After running these different reasoning processes, the final answer is determined by applying majority voting on the outputs. This ensures that the model selects the answer that is most consistent across multiple reasoning attempts, which helps reduce variability and errors caused by randomness or incorrect intermediate steps in any single chain Wang et al. (2023b).
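The aggregation step of CoT-SC reduces to a majority vote over the final answers of the sampled chains. A minimal sketch (the sampled answers below are illustrative, not drawn from any experiment):

```python
from collections import Counter

def self_consistency(answers):
    """Majority vote over the final answers of independently sampled chains."""
    return Counter(answers).most_common(1)[0][0]

# e.g. five sampled chains for one query, each distilled to a final label:
chains = ["True", "True", "Unknown", "True", "False"]
print(self_consistency(chains))  # True
```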
Cumulative Reasoning (CR)
Cumulative Reasoning builds on the idea that reasoning can be improved over successive iterations. The model does not simply reach a conclusion in one step, but rather, the reasoning evolves across multiple stages or passes. In this process, intermediate results are used as building blocks for the final solution, allowing the model to accumulate information and refine its reasoning step by step Zhang et al. (2023).
DetermLR
DetermLR is a reasoning approach that rethinks the process as an evolution from indeterminacy to determinacy. It categorizes premises into determinate (clear) and indeterminate (uncertain) types, guiding models to convert indeterminate data into determinate insights. The approach uses quantitative methods to prioritize relevant premises and employs reasoning memory to store and retrieve historical reasoning paths. This helps streamline the reasoning process, progressively refining the model’s understanding to produce more determinate and accurate conclusions Sun et al. (2024).
Tree-of-Thought (ToT)
Tree-of-Thought (ToT) is a framework that uses a tree-like structure for reasoning. Instead of generating a single chain of thought, the model explores multiple reasoning pathways in parallel, branching out into different possible solutions. The tree structure allows the model to evaluate and prune different paths, keeping only the most promising routes to reach the correct solution. This approach is particularly useful for problems where multiple reasoning paths can lead to the answer, allowing for exploration and selection of the best path Yao et al. (2023).
SymbCoT
SymbCoT integrates symbolic expressions and rules into the chain-of-thought (CoT) process. It translates natural language input into symbolic representations, allowing the model to reason based on these symbolic expressions. The LLM then applies symbolic rules to process and analyze the information, enhancing its ability to handle tasks that require formal reasoning and structured problem-solving Xu et al. (2024).
Logic-LM
Logic-LM translates natural language input into a symbolic format and then applies a rule-based logical engine to perform reasoning. This approach leverages formal logic rules to process and analyze the symbolic representation, enabling more structured and precise reasoning, particularly for tasks that require strict logical inferences Pan et al. (2023).
Appendix H Full Algorithm
Input: Premises $P$, Question Statement $S$, LLM $p_{\theta}$, Translator $T()$, Decomposer $D()$, Search Router $SR()$, Resolver $R()$, Search Round Limit $S_{\max}$
$P_{t},S_{t}\leftarrow T(P,S)$ ; // Translate the given premises and statement
$P_{n},S_{n}\leftarrow D(P_{t},S_{t})$ ; // Decompose the translated premises and statement
$C_{\text{current\_list}}\leftarrow[S_{n},\neg S_{n}]$ ; // Initiate search with $S_{n}$ and its negation
foreach $C_{\text{current}}$ in $C_{\text{current\_list}}$ do
    $Search\_round\leftarrow 0$ ;
    while $Search\_round<S_{\max}$ do
        $C_{\text{searched\_list}}\leftarrow SR(C_{\text{current}},P_{n})$ ; // Search for complementary clauses
        $num\_searched\_C\leftarrow\text{len}(C_{\text{searched\_list}})$ ;
        if $num\_searched\_C\geq 1$ then
            $C_{\text{searched}}\leftarrow C_{\text{searched\_list}}.\text{pop}(0)$ ;
            $cache[C_{\text{current}}]\leftarrow C_{\text{searched\_list}}$ ; // If more than one clause is found, save the rest in $cache$
        end if
        if $num\_searched\_C==0$ then
            if $cache$ is not empty then
                $C_{\text{current}}\leftarrow\text{next}(\text{iter}(cache))$ ; // If no clause is found for $C_{\text{current}}$, backtrack via $cache$
                $C_{\text{searched}}\leftarrow cache[C_{\text{current}}].\text{pop}(0)$ ;
            else
                if started from $S_{n}$ then $D_{S_{n}}\leftarrow P\not\vdash\neg S$ ; // If $cache$ is also empty, make the conclusion
                if started from $\neg S_{n}$ then $D_{\neg S_{n}}\leftarrow P\not\vdash S$ ;
                break ;
            end if
        end if
        $C_{\text{resolved}}\leftarrow R(C_{\text{current}},C_{\text{searched}})$ ;
        if $C_{\text{resolved}}==\text{`Contradiction'}$ then
            if started from $S_{n}$ then $D_{S_{n}}\leftarrow P\vdash\neg S$ ; // If a contradiction is found, make the conclusion
            if started from $\neg S_{n}$ then $D_{\neg S_{n}}\leftarrow P\vdash S$ ;
            break ;
        else
            $P_{n}\leftarrow P_{n}\cup\{C_{\text{resolved}}\}$ ; // Append $C_{\text{resolved}}$ to $P_{n}$
            $C_{\text{current}}\leftarrow C_{\text{resolved}}$ ;
        end if
        $Search\_round\leftarrow Search\_round+1$ ;
    end while
    if no contradiction was found within $S_{\max}$ rounds then
        if started from $S_{n}$ then $D_{S_{n}}\leftarrow P\not\vdash\neg S$ ;
        if started from $\neg S_{n}$ then $D_{\neg S_{n}}\leftarrow P\not\vdash S$ ;
    end if
end foreach
Algorithm 1 Methodology
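One reasoning path of Algorithm 1 can be sketched in code. The following is a minimal, purely symbolic instantiation under stated assumptions: clauses are encoded as frozensets of `(predicate, entity, truth)` literals pre-grounded to the relevant individual, and `toy_router`/`toy_resolver` are deterministic stand-ins for the LLM-driven Search Router and Resolver (the per-clause cache is simplified to a flat backtracking list).

```python
def aristotle_path(premises, s_n, search_router, resolver, max_rounds):
    """One reasoning path of Algorithm 1: repeatedly resolve the current
    clause against complementary premises; return True iff a contradiction
    is derived (i.e., the premises entail the negation of s_n)."""
    current = s_n
    cache = []                               # leftover candidates for backtracking
    for _ in range(max_rounds):
        candidates = search_router(current, premises)
        if not candidates and not cache:
            return False                     # nothing left to resolve against
        searched = candidates.pop(0) if candidates else cache.pop(0)
        cache.extend(candidates)             # save alternatives, as in the cache step
        resolved = resolver(current, searched)
        if resolved == "Contradiction":
            return True
        premises = premises + [resolved]     # append C_resolved to P_n
        current = resolved
    return False                             # round limit reached, no contradiction

# Deterministic stand-ins for the LLM modules (hypothetical clause encoding):
def toy_router(current, premises):
    """Return premises containing a literal complementary to one in `current`."""
    return [c for c in premises
            if any((p, e, not t) in c for (p, e, t) in current)]

def toy_resolver(current, searched):
    """Cancel one complementary pair and merge the rest of both clauses."""
    clash = next(l for l in current if (l[0], l[1], not l[2]) in searched)
    rest = (current | searched) - {clash, (clash[0], clash[1], not clash[2])}
    return rest if rest else "Contradiction"

# The Appendix I problem, pre-grounded to Dave:
P_n = [frozenset({("Green", "Dave", False), ("Nice", "Dave", True)}),   # green -> nice
       frozenset({("Smart", "Dave", False), ("Green", "Dave", True)}),  # smart -> green
       frozenset({("Smart", "Dave", True)})]                            # Dave is smart
S_n = frozenset({("Nice", "Dave", False)})           # "Dave is not nice"
neg_S_n = frozenset({("Nice", "Dave", True)})

print(aristotle_path(P_n, S_n, toy_router, toy_resolver, 10))      # True:  P ⊢ ¬S
print(aristotle_path(P_n, neg_S_n, toy_router, toy_resolver, 10))  # False: P ⊬ S
```

The two calls reproduce the two reasoning paths of the case study in Appendix I: the path starting from $S_{n}$ derives a contradiction in three resolutions, while the path from $\neg S_{n}$ finds no complementary clause and terminates immediately.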
Appendix I Case Study
Given the premises, we need to determine whether the question statement $S$, "Dave is not nice," is true, false, unknown, or self-contradictory. We start the first reasoning path from $C_{\text{current}}$ = $S_{n}$; $C_{\text{complement}}$ denotes the complementary clause found by the Search Router from $P_{n}$.
Logical Resolution Steps
Translated and Decomposed Premises $P_{n}$:
1. If someone is green then they are nice ::: $\forall x\left(\text{Green}(x,\text{False})\lor\text{Nice}(x,\text{True})\right)$
2. If someone is smart then they are green ::: $\forall x\left(\text{Smart}(x,\text{False})\lor\text{Green}(x,\text{True})\right)$
3. Dave is smart ::: $\text{Smart}(\text{Dave},\text{True})$
Question Statement $S_{n}$: Dave is not nice ::: $\text{Nice}(\text{Dave},\text{False})$
$C_{\text{current}}:\quad\text{Nice}(\text{Dave},\text{False})$
Resolution Steps:
1. First Resolution: $\text{Resolve}\left(C_{\text{current}}=\text{Nice}(\text{Dave},\text{False}),\ C_{\text{complement}}=\forall x\left(\text{Green}(x,\text{False})\lor\text{Nice}(x,\text{True})\right)\right)$
• Instantiate $C_{\text{complement}}$ for $x=\text{Dave}$: $\text{Green}(\text{Dave},\text{False})\lor\text{Nice}(\text{Dave},\text{True})$
• Resolve with $C_{\text{current}}$: $\text{Nice}(\text{Dave},\text{False})$ and $\text{Green}(\text{Dave},\text{False})\lor\text{Nice}(\text{Dave},\text{True})$
• Since $\text{Nice}(\text{Dave},\text{False})$ contradicts $\text{Nice}(\text{Dave},\text{True})$, the new resolved clause is $C_{\text{resolved}}=\text{Green}(\text{Dave},\text{False})$
• Update $C_{\text{current}}$ to $C_{\text{resolved}}$: $C_{\text{current}}=C_{\text{resolved}}=\text{Green}(\text{Dave},\text{False})$
2. Second Resolution: $\text{Resolve}\left(C_{\text{current}}=\text{Green}(\text{Dave},\text{False}),\ C_{\text{complement}}=\forall x\left(\text{Smart}(x,\text{False})\lor\text{Green}(x,\text{True})\right)\right)$
• Instantiate $C_{\text{complement}}$ for $x=\text{Dave}$: $\text{Smart}(\text{Dave},\text{False})\lor\text{Green}(\text{Dave},\text{True})$
• Resolve with $C_{\text{current}}$: $\text{Green}(\text{Dave},\text{False})$ and $\text{Smart}(\text{Dave},\text{False})\lor\text{Green}(\text{Dave},\text{True})$
• Since $\text{Green}(\text{Dave},\text{False})$ contradicts $\text{Green}(\text{Dave},\text{True})$, the new resolved clause is $C_{\text{resolved}}=\text{Smart}(\text{Dave},\text{False})$
• Update $C_{\text{current}}$ to $C_{\text{resolved}}$: $C_{\text{current}}=C_{\text{resolved}}=\text{Smart}(\text{Dave},\text{False})$
3. Third Resolution: $\text{Resolve}\left(C_{\text{current}}=\text{Smart}(\text{Dave},\text{False}),\ C_{\text{complement}}=\text{Smart}(\text{Dave},\text{True})\right)$
• Resolve $\text{Smart}(\text{Dave},\text{False})$ with $\text{Smart}(\text{Dave},\text{True})$
• Since $\text{Smart}(\text{Dave},\text{False})$ contradicts $\text{Smart}(\text{Dave},\text{True})$, the final resolved clause is $C_{\text{resolved}}=\text{Contradiction}$
Conclusion: $D_{S_{n}}$ = $P\vdash\neg S$
We then start the second reasoning path from $C_{\text{current}}$ = $\neg S_{n}$ .
Logical Resolution Steps
Translated and Decomposed Premises $P_{n}$ : 1.
If someone is green then they are nice ::: $∀ x\left(\text{Green}(x,\text{False})\lor\text{Nice}(x,\text{True})\right)$ 2.
If someone is smart then they are green ::: $∀ x\left(\text{Smart}(x,\text{False})\lor\text{Green}(x,\text{True})\right)$ 3.
Dave is smart ::: $\text{Smart}(\text{Dave},\text{True})$ Question Statement $S_{n}$ : Dave is not nice ::: Nice(Dave, False) $C_{\text{current}}=\neg S_{n}=\text{Nice}(\text{Dave},\text{True})$ Resolution Steps: 1.
First Resolution:
No complementary clause was found from the given premises, thus we directly conclude "No contradiction found"
Conclusion: $D_{\neg S_{n}}$ = $P\not\vdash S$
Since we get: $D_{S_{n}}$ = $P\vdash\neg S$ and $D_{\neg S_{n}}$ = $P\not\vdash S$ from two reasoning paths correspondingly, according to Eq. (1), the final answer is False.
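The combination of the two paths' deductions into the final verdict can be sketched as a small lookup (Eq. (1) itself is not reproduced in this appendix; the mapping below is the reading illustrated by this case study — a contradiction reached from $S_{n}$ means $P\vdash\neg S$, one reached from $\neg S_{n}$ means $P\vdash S$):

```python
def final_answer(contradiction_from_s, contradiction_from_neg_s):
    """Combine the outcomes of the two reasoning paths into a verdict."""
    if contradiction_from_s and contradiction_from_neg_s:
        return "Self-contradictory"    # both S and ¬S provable
    if contradiction_from_neg_s:
        return "True"                  # P ⊢ S
    if contradiction_from_s:
        return "False"                 # P ⊢ ¬S
    return "Unknown"                   # neither provable

# Case study: a contradiction from S_n, none from ¬S_n:
print(final_answer(True, False))  # False
```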
Appendix J Full Prompting of Each Module
J.1 ProntoQA
Translation
Task Description: You are given a problem description and a question. The task is to:
1. Define all the predicates in the problem.
2. Parse the problem into logic rules based on the defined predicates.
3. Write all the facts mentioned in the problem.
4. Parse the question into the logic form.
Premises $P$:
• Each jompus is fruity. Every jompus is a wumpus. Every wumpus is not transparent. Wumpuses are tumpuses. Tumpuses are mean. Tumpuses are vumpuses. Every vumpus is cold. Each vumpus is a yumpus. Yumpuses are orange. Yumpuses are numpuses. Numpuses are dull. Each numpus is a dumpus. Every dumpus is not shy. Impuses are shy. Dumpuses are rompuses. Each rompus is liquid. Rompuses are zumpuses. Alex is a tumpus.
Statement $S$:
• True or false: Alex is not shy.
Facts (included in $P_{t}$):
• $Tumpuses(Alex)$ ::: Alex is a tumpus.
• (… more facts …)
Rules (included in $P_{t}$):
• $Jompus(x)\Rightarrow Fruity(x)$ ::: Each jompus is fruity.
• (… more rules …)
Translated Query $S_{t}$:
• $Shy(Alex,False)$ ::: Alex is not shy.
Decomposition
Task Description: You are given a problem description and a question. The task is to:
1. Given the premises and conjecture in logical form, decompose the logical statements using normalization and skolemization.
2. Normalization: Convert each premise and conjecture into Prenex Normal Form (PNF), then into Conjunctive Normal Form (CNF).
3. Skolemization: Eliminate existential quantifiers by introducing Skolem constants or functions.
Premises $P_{t}$:
• $Jompus(x,True)\to Shy(x,False)$
• $Jompus(x,True)\to Yumpus(x,True)$
• (… more premises …)
Query $S_{t}$:
• $Sour(Max,True)$
Decomposed Premises $P_{n}$:
1. $\neg Jompus(x,True)\lor Shy(x,False)$
2. $\neg Jompus(x,True)\lor Yumpus(x,True)$
(… additional decomposed premises …)
Query $S_{n}$:
• $Sour(Max,True)$
Resolve
Task Description: You are given a problem description and a question. The task is to:
1. Check for complementary/contradictory terms. Two terms are contradictory if they share the same predicate and arguments but differ in boolean value (True vs. False).
2. If contradictory terms are found, apply the resolution rule: from $(P(x,True)\lor Q(x,True))$ and $(P(x,False)\lor M(x,True))$, derive $Q(x,True)\lor M(x,True)$.
3. If the resolution leads to an empty clause or a direct contradiction, output "Contradiction"; otherwise, output the new clause after resolution.
Example: Given Clauses ($C_{\text{current}}$ and $C_{\text{complement}}$):
• $Difficult(Bradley,True)\lor Known(x,False)$
• $Difficult(x,False)\lor Embarrassed(x,True)\lor Colorful(x,False)$
Resolved $C_{\text{resolved}}$:
• $Known(x,False)\lor Embarrassed(Bradley,True)\lor Colorful(Bradley,False)$
(… more examples …)
J.2 ProofWriter
Translation
Task Description: You are given a problem description and a question. The task is to:
1. Define all the predicates in the problem.
2. Parse the problem into logic rules based on the defined predicates.
3. Write all the facts mentioned in the problem.
4. Parse the question into the logic form.
Premises $P$:
• Anne is quiet. Erin is furry. Erin is green. Fiona is furry. Fiona is quiet. Fiona is red. Fiona is rough. Fiona is white. Harry is furry. Harry is quiet. Harry is white. Young people are furry. If Anne is quiet then Anne is red. Young, green people are rough. If someone is green then they are white. If someone is furry and quiet then they are white. If someone is young and white then they are rough. All red people are young.
Statement $S$:
• Is the following statement true, false, or unknown? Anne is white.
Facts (included in $P_{t}$):
• $Quiet(Anne,True)$ ::: Anne is quiet.
• (… more facts …)
Rules (included in $P_{t}$):
• $Young(x,True)\Rightarrow Furry(x,True)$ ::: Young people are furry.
• (… more rules …)
Query $S_{t}$:
• $White(Anne,True)$ ::: Anne is white.
Decomposition
Task Description: You are given a problem description and a question. The task is to:
1. Given the premises and conjecture in logical form, decompose the logical statements using normalization and skolemization.
2. Normalization: Convert each premise and conjecture into Prenex Normal Form (PNF), then into Conjunctive Normal Form (CNF).
3. Skolemization: Eliminate existential quantifiers by introducing Skolem constants or functions.
Premises $P_{t}$:
• $Quiet(Anne,True)$ ::: Anne is quiet.
• $Young(x,True)\Rightarrow Furry(x,True)$ ::: Young people are furry.
• (… more premises …)
Query $S_{t}$:
• $White(Anne,True)$ ::: Anne is white.
Decomposed Premises $P_{n}$:
1. $Quiet(Anne,True)$
2. $Young(x,False)\lor Furry(x,True)$
(… additional decomposed premises …)
Query $S_{n}$:
• $White(Anne,True)$
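The clause-level transformation this prompt asks for can be illustrated in a few lines. The sketch below converts an implication into a single CNF clause by flipping each antecedent's boolean value rather than writing an explicit negation, matching the style of the decomposed premises above (the `(predicate, variable, truth)` triple encoding is a hypothetical representation, not the paper's implementation):

```python
def decompose(rule):
    """Turn an implication P1 ∧ ... ∧ Pk -> Q into one CNF clause,
    flipping each antecedent's boolean value instead of writing ¬P."""
    antecedents, consequent = rule
    clause = [(p, v, not t) for (p, v, t) in antecedents]  # negate premises
    clause.append(consequent)                              # keep the consequent
    return clause

# "Young people are furry":  Young(x, True) => Furry(x, True)
rule = ([("Young", "x", True)], ("Furry", "x", True))
print(decompose(rule))  # [('Young', 'x', False), ('Furry', 'x', True)]
```

This reproduces decomposed premise 2 above: $Young(x,False)\lor Furry(x,True)$.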
Resolve
Task Description: You are given a problem description and a question. The task is to:
1. Check for complementary/contradictory terms. Two terms are contradictory if they share the same predicate and arguments but differ in boolean value (True vs. False).
2. If contradictory terms are found, apply the resolution rule: from $(P(x,True)\lor Q(x,True))$ and $(P(x,False)\lor M(x,True))$, derive $Q(x,True)\lor M(x,True)$.
3. If the resolution leads to an empty clause or a direct contradiction, output "Contradiction"; otherwise, output the new clause after resolution.
Example: Given Clauses ($C_{\text{current}}$ and $C_{\text{complement}}$):
• $Difficult(Bradley,True)\lor Known(x,False)$
• $Difficult(x,False)\lor Embarrassed(x,True)\lor Colorful(x,False)$
Resolved $C_{\text{resolved}}$:
• $Known(x,False)\lor Embarrassed(Bradley,True)\lor Colorful(Bradley,False)$
(… more examples …)
J.3 LogicNLI
Translation
Task Description: You are given a problem description and a question. The task is to:
1. Define all the predicates in the problem.
2. Parse the problem into logic rules based on the defined predicates.
3. Write all the facts mentioned in the problem.
4. Parse the question into the logic form.
5. Please make sure to differentiate 'or' and 'either…or…'. For 'or', translate it with the inclusive or ($\lor$) operator. For 'either…or…', translate it with the exclusive or ($\oplus$) operator.
6. Please be careful when translating clauses with the words "equivalent", "vice versa", and "if and only if". Make sure you use the biconditional ($\leftrightarrow$) in those translations.
Premises $P$:
• Medwin is doubtful. Roberto is not bitter. Roberto is not grieving. If someone is not bitter, then he is not grieving. Medwin being not sociable implies that Medwin is not pure. If there is someone who is either not pure or not doubtful, then Lynda is not grieving.
Statement $S$:
• Bernard is not bitter.
Facts (included in $P_{t}$):
• $Doubtful(Medwin,True)$ ::: Medwin is doubtful.
• (… more facts …)
Rules (included in $P_{t}$):
• $Bitter(x,False)\Rightarrow Grieving(x,False)$ ::: If someone is not bitter, then he is not grieving.
• (… more rules …)
Query $S_{t}$:
• $Bitter(Bernard,False)$ ::: Bernard is not bitter.
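The inclusive/exclusive distinction in step 5 comes down to one truth-table row: both operators agree except when both operands are true. A two-line check with plain Python booleans (illustrative only; `a != b` is boolean XOR):

```python
# Inclusive 'or' (∨) vs exclusive 'either…or…' (⊕) over all assignments.
for a in (True, False):
    for b in (True, False):
        print(a, b, a or b, a != b)   # they differ only on the True/True row
```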
Decomposition
Task Description: You are given a problem description and a question. The task is to:
1. Given the premises and conjecture in logical form, decompose the logical statements using normalization and skolemization.
2. Normalization: Convert each premise and conjecture into Prenex Normal Form (PNF), then into Conjunctive Normal Form (CNF).
3. Skolemization: Eliminate existential quantifiers by introducing Skolem constants or functions.
Premises $P_{t}$:
• $Doubtful(Medwin,True)$ ::: Medwin is doubtful.
• $Bitter(x,False)\Rightarrow Grieving(x,False)$ ::: If someone is not bitter, then he is not grieving.
• (… more premises …)
Query $S_{t}$:
• $Bitter(Bernard,False)$ ::: Bernard is not bitter.
Decomposed Premises $P_{n}$:
1. $Doubtful(Medwin,True)$
2. $Bitter(x,True)\lor Grieving(x,False)$
(… additional decomposed premises …)
Query $S_{n}$:
• $Bitter(Bernard,False)$
Resolve
Task Description: You are given a problem description and a question. The task is to:
1. Check for complementary/contradictory terms. Two terms are contradictory if they share the same predicate and arguments but differ in boolean value (True vs. False).
2. If contradictory terms are found, apply the resolution rule: from $(P(x,True)\lor Q(x,True))$ and $(P(x,False)\lor M(x,True))$, derive $Q(x,True)\lor M(x,True)$.
3. If the resolution leads to an empty clause or a direct contradiction, output "Contradiction"; otherwise, output the new clause after resolution.
Example: Given Clauses ($C_{\text{current}}$ and $C_{\text{complement}}$):
• $Difficult(Bradley,True)\lor Known(x,False)$
• $Difficult(x,False)\lor Embarrassed(x,True)\lor Colorful(x,False)$
Resolved $C_{\text{resolved}}$:
• $Known(x,False)\lor Embarrassed(Bradley,True)\lor Colorful(Bradley,False)$
(… more examples …)