# A Balanced Neuro-Symbolic Approach for Commonsense Abductive Logic
## Abstract
Although Large Language Models (LLMs) have demonstrated impressive formal reasoning abilities, they often break down when problems require complex proof planning. One promising approach for improving LLM reasoning abilities involves translating problems into formal logic and using a logic solver. While off-the-shelf logic solvers are in principle substantially more efficient than LLMs at logical reasoning, they assume that all relevant facts are provided in a question and are unable to deal with missing commonsense relations. In this work, we propose a novel method that uses feedback from the logic solver to augment a logic problem with commonsense relations provided by the LLM, in an iterative manner. This involves a search procedure through potential commonsense assumptions to maximize the chance of finding useful facts while keeping cost tractable. On a collection of pure-logical reasoning datasets, from which some commonsense information has been removed, our method consistently achieves considerable improvements over existing techniques, demonstrating the value of balancing neural and symbolic elements when working in human contexts.
## 1 Introduction
Large Language Models (LLMs) have demonstrated impressive abilities to reason formally, often via chain-of-thought reasoning (Wei et al., 2022). While state-of-the-art LLM-based systems show impressive reasoning capabilities, it is unclear whether this comes from the LLM itself or from sophisticated post-learning refinement algorithms. At the same time, open-source LLMs still demonstrate an inability to scale naturally to problems that require complex proof planning (Saparov and He, 2023; Dziri et al., 2023). Such problems are exactly the type on which symbolic logic solvers excel: such solvers have a long history and were for a long time considered a key component of any path to artificial intelligence (Nilsson, 1991). Nevertheless, they are greatly restricted by their need for problems to be stated in symbolic language and for every relevant fact to be provided as input. These constraints have ultimately limited them to highly specialized applications, and they have never had the broad impact that was hoped for (Crevier, 1993).
These complementary strengths of neural and symbolic methods have motivated a revival of interest in neuro-symbolic methods, where an LLM incorporates a logic solver to improve its reasoning abilities (Ye et al., 2023; Lee and Hwang, 2024; Lyu et al., 2023; Olausson et al., 2023). In these approaches, the LLM translates problems formulated in natural language into symbolic language, addressing one of the key deficiencies of a purely symbolic approach. Nonetheless, these hybrid systems remain impractical because they are ultimately purely deductive: every relevant fact must be provided as input. This means that the symbolic solvers are often unable to reach a conclusion simply because obvious, commonsense assumptions are left unstated, and it is often difficult to predict which should be included until one is presented with a failed reasoning chain.
For example, consider the problem in Figure 1. A logic solver would return "unknown" for the target query as, formally speaking, neither its truth nor its falsehood is implied by the premises. A human, however, would easily solve this problem by supplying the additional commonsense fact that white surfaces reflect light ( $turns\_white(fox,winter)\rightarrow reflects(fox,sun)$ ). This ability to supply missing information is usually termed abductive reasoning, and is a key mark of human intelligence.
<details>
<summary>x1.png Details</summary>

### Visual Description
## [Text-Based Diagram]: Natural Language to Logical Form Translation
### Overview
The image presents a side-by-side comparison demonstrating the translation of natural language statements about animal winter survival strategies into formal logical expressions. It is divided into two distinct panels: a left panel containing English text and a right panel containing corresponding symbolic logic. The purpose is to illustrate how descriptive sentences can be represented in a structured, logical format.
### Components/Axes
The image is composed of two primary rectangular panels:
1. **Left Panel (Natural Language):** Contains two text blocks.
* **Top Block (Context):** A paragraph describing animal behaviors in winter.
* **Bottom Block (Question):** A specific statement about the arctic fox, enclosed in a purple border.
2. **Right Panel (Logical Form):** Contains a list of logical expressions.
* **Top Section:** A series of existential quantifier (â) statements corresponding to the context paragraph.
* **Bottom Section:** A set of three logical statements enclosed in a purple border, corresponding to the question block.
### Detailed Analysis / Content Details
#### Left Panel - Natural Language Text
* **Context Paragraph:**
> Context: Some animals tough winter out. They do not leave. They do not hide. They must survive. Sometimes nature helps them out. Some animals grow thicker coats in the winter. Other animals change color. The arctic fox is brown in the summer. His coat turns white in the winter.
* **Question Block (Purple Border):**
> Question: The arctic fox's coat turns white in the winter because white absorbs the sun and is warmer.
#### Right Panel - Logical Expressions
* **Top Section (Corresponding to Context):**
* `∃x : tough_out(x, winter)`
* `∃x : ¬hide(x) ∧ ¬leave(x) ∧ survive(x)`
* `∃x : helps(nature, x)`
* `∃x : grow_coat(x, winter)`
* `∃x : change_color(x, winter)`
* `brown(fox, summer)`
* `turns_white(fox, winter)`
* **Bottom Section (Purple Border, Corresponding to Question):**
* `absorbs(white, sun)`
* `warmer(coat)`
* `turns_white(fox, winter)`
### Key Observations
1. **Parallel Structure:** There is a direct, line-by-line correspondence between the sentences in the left panel's context and the logical expressions in the right panel's top section. Each sentence is translated into a formal predicate.
2. **Highlighted Correspondence:** The purple border explicitly links the natural language "Question" block to a specific set of three logical statements, indicating this is the target translation or inference being examined.
3. **Logical Syntax:** The right panel uses standard first-order logic notation, including existential quantifiers (∃), logical conjunction (∧), and negation (¬). Predicates like `tough_out`, `helps`, `absorbs`, and `warmer` are defined to capture the relationships and properties described in the text.
4. **Specificity Shift:** The context uses general terms ("Some animals"), which are captured by the existential quantifier `∃x`. The statements about the arctic fox (`brown(fox, summer)`, `turns_white(fox, winter)`) are specific facts and are represented as ground predicates without a quantifier.
### Interpretation
This image serves as an educational or technical example in the fields of computational linguistics, knowledge representation, or artificial intelligence. It demonstrates the process of **semantic parsing**: converting unstructured natural language into a structured, machine-readable logical form.
* **What it demonstrates:** The translation captures the *meaning* of the text. For instance, "They do not leave. They do not hide." becomes `¬hide(x) ∧ ¬leave(x)`, explicitly linking the two actions with a logical "and". The causal statement in the question ("because white absorbs the sun and is warmer") is broken down into its constituent propositions (`absorbs(white, sun)`, `warmer(coat)`) alongside the observed fact (`turns_white(fox, winter)`), setting the stage for a logical inference or explanation.
* **Relationship between elements:** The left panel provides the human-readable narrative and the specific claim to be analyzed. The right panel provides the formal, unambiguous representation that a computer system could use for reasoning, theorem proving, or knowledge base population. The purple boxes highlight a specific task: formalizing a causal explanation.
* **Notable aspect:** The logical form for the question does not explicitly encode the word "because." Instead, it lists the relevant facts. A reasoning system would need to use these facts, potentially along with other axioms (e.g., `∀x (absorbs(x, sun) → warmer(x))`), to infer or justify the causal link implied in the natural language question. This highlights a common challenge in AI: representing implicit knowledge and causal relationships.
</details>
Figure 1: An example from a children's comprehension exercise booklet, taken from https://www.ereadingworksheets.com/worksheets/reading/nonfiction-passages/wintertime. (We selected a choice from multiple-choice question 3 and re-phrased it as a True/False question, according to the logic-problem framing.) Left: the problem phrased in human language. Right: the same problem translated to first-order logic.
The limitation of current neuro-symbolic LLM systems to deductive reasoning means that they have so far been mostly of theoretical interest, since they tend to break down when confronted with more complex problems where enumerating every possible background fact is not realistic. However, besides their translation skills, LLMs also possess another striking ability: their training on prodigious amounts of internet data has made them very adept at recognizing commonsense statements, to the point where they have been regarded as potential universal databases (Petroni et al., 2019). In a way, LLMs seem to have internalized most commonsense knowledge.
<details>
<summary>x2.png Details</summary>

### Visual Description
## [Conceptual Diagram]: Spectrum of AI Reasoning Approaches
### Overview
The image is a conceptual diagram illustrating a spectrum of artificial intelligence reasoning methodologies. It positions various techniques along a horizontal axis that ranges from purely symbolic/logical approaches on the left to purely linguistic/neural approaches on the right. The diagram uses text labels, icons, and a color gradient to represent the continuum and the relative placement of different AI paradigms.
### Components/Axes
* **Main Axis:** A horizontal black line with arrowheads at both ends, indicating a continuous spectrum.
* **Axis End Labels:**
* **Left End:** "Symbolic (logical)" accompanied by a blue calculator icon.
* **Right End:** "Linguistic (neural)" accompanied by a pink/purple llama icon.
* **Plotted Points (from left to right):**
1. **Semantic Parsers** (Red dot, positioned closer to the symbolic end).
2. **LLM-Tres** (Red dot, positioned left of center).
3. **ARGOS** (Red dot, positioned at the center). This label is in bold, uppercase text.
4. **Logic of Thought** (Red dot, positioned right of center).
5. **COT** (Red dot, positioned closer to the linguistic end).
* **Additional Visual Element:** A yellow smiley face icon is placed directly above the central "ARGOS" point.
* **Color Gradient Bar:** A horizontal bar runs beneath the main axis. It transitions from solid red on the far left, through a greenish-yellow in the center, back to solid red on the far right. This visually reinforces the spectrum concept.
### Detailed Analysis
The diagram maps specific AI techniques onto a conceptual scale:
* **Symbolic/Logical Pole (Left):** Represented by a calculator, suggesting rule-based, formal logic systems.
* **Linguistic/Neural Pole (Right):** Represented by a llama (likely referencing the LLaMA family of Large Language Models), suggesting systems based on statistical pattern recognition in natural language.
* **Technique Placement:**
* **Semantic Parsers** are placed nearest to the symbolic end, indicating they are heavily reliant on formal logic and structured representations.
* **LLM-Tres** (likely a specific model or technique) is placed left of center, suggesting a hybrid approach leaning towards symbolic reasoning.
* **ARGOS** is positioned at the exact center of the spectrum, directly under the smiley face, implying it represents a balanced or optimal integration of both symbolic and neural methods.
* **Logic of Thought** is placed right of center, suggesting a method that applies logical structures within a neural framework.
* **COT (Chain-of-Thought)** is placed nearest to the linguistic/neural end, indicating it is a prompting technique primarily used with and within large neural language models.
### Key Observations
1. **Central Emphasis:** The diagram's focal point is "ARGOS," highlighted by its central position, bold text, and the smiley face icon above it. This suggests the creator's intent to present ARGOS as a harmonious or superior middle-ground approach.
2. **Symmetry of Extremes:** The color gradient bar is symmetric, being red at both ends and green in the middle. This visually implies that the pure extremes (pure symbolic or pure neural) are similar in some regard (perhaps in their limitations), while the blended center is distinct (perhaps optimal).
3. **Non-Linear Labeling:** The labels "LLM-Tres" and "Logic of Thought" are placed above the axis, while "Semantic Parsers," "ARGOS," and "COT" are placed below. This appears to be a layout choice for clarity rather than a semantic distinction.
### Interpretation
This diagram argues that the future or ideal state of AI reasoning lies not at the extremes of pure symbolic logic or pure neural linguistics, but in a synthesis of the two. It presents a historical or methodological progression from left to right, but crucially places "ARGOS" as the balanced apex.
* **What it suggests:** The field is moving from separate paradigms (symbolic AI vs. connectionist AI) toward hybrid systems. Techniques like Chain-of-Thought (COT) are seen as neural approaches incorporating a form of logic, while Semantic Parsers are logical systems trying to handle language. The central placement of ARGOS, with a positive icon, proposes that a specific integration of these paradigms is the most promising direction.
* **Relationships:** The elements relate as points on a continuum. The icons (calculator, llama) serve as archetypes for the two foundational paradigms. The color bar provides a visual metaphor for the spectrum, with green (often associated with "go" or "good") highlighting the center.
* **Anomalies/Notes:** The term "LLM-Tres" is ambiguous without external context. The diagram is a high-level conceptual model; it does not provide empirical data to support the precise placement of each technique, which is based on the author's perspective. The smiley face is a subjective, non-technical element indicating approval.
</details>
Figure 2: Symbolic-Linguistic Spectrum depicting the positioning of LLM-Tres (Toroghi et al., 2024), Logic-of-Thought (Liu et al., 2024), and Chain-of-Thought (COT) (Wei et al., 2022) relative to our approach.
This realization has led some works to use an LLM itself to supply missing but relevant clauses when reasoning. Notably, Toroghi et al. (2024) proposed a method that operates an exhaustive search over a heavily constrained set of rules in the symbolic space, whereas Liu et al. (2024) proposed a method that uses LLM prompting to produce new rules which might be deducible from the given logical context. While these methods lie on opposite ends of the symbolic-linguistic reasoning spectrum (Figure 2), they both limit themselves to searching over such a restricted space of possible commonsense that they cannot solve practical problems.
In this work, we seek to improve AI reasoning abilities by using an LLM to provide relevant unstated commonsense clauses to a logic solver, but unlike previous works, without imposing significant constraints on the shape or content of such clauses. Furthermore, and most importantly, our method ARGOS (**A**bductive **R**easoning with **G**eneralization **O**ver **S**ymbolics) can abduce propositions not previously instantiated in the input problem. To compensate for the far more general search space, we guide the search using feedback from the logic solver in the form of the SAT problem backbone, another novel contribution. The resulting system strikes a balance between linguistic and symbolic approaches, allowing us to use both their strengths while minimizing their weaknesses to achieve true abductive reasoning.
The contributions of this paper are as follows.
- We propose a novel framing of the commonsense logical reasoning problem founded upon classical logical principles and an aim towards more realistic use-cases.
- We introduce a novel algorithm that (i) searches over larger spaces of commonsense facts; and (ii) uses logic solver feedback in the form of the backbone graph to increase practicality and efficiency.
- We demonstrate empirically on multiple benchmarks and large language models that our method improves substantially over existing symbolic and neural methods on abductive reasoning problems where background information is missing.
## 2 Related Work
Previous LLM-related logical reasoning methods combine symbolic and neural approaches, but usually rely much more on one or the other. Appendix G provides an extended review.
#### Neural Methods
Wei et al. (2022) were the first to present a framework for LLM-based reasoning, showing that providing examples of rationales for answers to questions can induce the LLM to do the same, leading to improved accuracy. Kojima et al. (2022) showed that this can be induced without any few-shot examples by prepending the sentence "Let's think step by step" before generating an answer. This is known as "Chain of Thought" (COT). Following this, Wang et al. (2023) proposed self-consistency (SC), using COT multiple times and taking the mode as the prediction. However, Saparov and He (2023) observed that COT and SC suffer from challenges in proof planning: rationale steps tend to be factual but of low value. This motivated guidance of the LLM at a step-level. Yao et al. (2023) proposed Tree of Thoughts (TOT), which explores hand-crafted trees using an LLM to solve reasoning tasks. TOT is poorly suited to logical reasoning settings as logic problems have highly variable tree-structures. Kazemi et al. (2023) and Lee and Hwang (2024) proposed more logic-focused methods, with reverse reasoning, starting at the answer and ending at the problem. These back-chaining methods, however, underperform symbolic approaches.
#### Symbolic Methods
Acknowledging that LLMs are poor proof-planners, a series of methods, including F-COT (Lyu et al., 2023) and SAT-LM (Ye et al., 2023), proposed to offload the reasoning burden from the LLM to more specialized tools. In these works, the LLM converts the text to symbolic logic, and a solver is then employed. Logic-LM (Pan et al., 2023) extended this to include a self-refinement step. While these methods perform well on simple datasets, they fail to account for ambiguity and the exclusion of common knowledge. Addressing this, Liu et al. (2024) and Wang et al. (2022) proposed algorithms that produce new clauses via logical deduction and then add the logic back to the text for an LLM to solve. While this might help the LLM, it does not add information to the problem, because any added relations are already deducible. Instead of producing clauses via deduction, Toroghi et al. (2024) proposed a method that exhaustively searches for new single-proposition modus-ponens clauses. However, the search is conducted only over the propositions from the question, and repeated until the problem is solvable by classical logic, diminishing robustness. This search space is highly restricted and leaves out nearly all necessary information for some logic problems.
## 3 Background
Propositional logic is a logical system built around propositions, which are statements of fact such as "It is sunny" or "I need an umbrella" that can be true or false. Propositions are often denoted by single-letter variables such as $A$ or $B$, called propositional variables, which can be tied together by logical connectives (such as $\wedge$, $\vee$ or $\rightarrow$) to form further compound propositions.
A deductive propositional logic problem is composed of a set of propositional variables $\mathcal{V}$; a set of propositions (each represented by a propositional variable in $\mathcal{V}$) and compound propositions (built by using logical connectives to connect propositions via their representative propositional variables in $\mathcal{V}$), called the premises $\mathcal{P}=\{P_{1},\dots,P_{K}\}$; and a proposition or compound proposition $Q$, also built from those variables, called the query. The premises are given to be true ( $\vdash\mathcal{P}$ ), and the goal of the problem is to determine whether they imply the query, $\mathcal{P}\vdash Q$, or its negation, $\mathcal{P}\vdash\neg Q$. Such problems are usually solved by translating them into two Boolean Satisfiability (SAT) problems, one for $Q$ and one for $\neg Q$. Let $\mathcal{L}(\mathcal{V})=\{A\,|\,A\in\mathcal{V}\}\cup\{\neg A\,|\,A\in\mathcal{V}\}$ denote the set of all so-called literals of the problem. The backbone of the problem is the collection of all those literals which are implied by the premises, $\mathcal{B}=\{L\in\mathcal{L}(\mathcal{V})\,|\,\mathcal{P}\vdash L\}$. In effect, they are values for the propositions represented by the variables in the problem which can be inferred from the premises.

In an abductive commonsense propositional logic problem the premises $\mathcal{P}$ entail neither the query $Q$ nor its negation $\neg Q$: the problem is underdetermined. Instead, one must augment the premises with additional commonsense propositions $\mathcal{C}$, which represent background facts or knowledge left unstated in the problem, until either $(\mathcal{P}\land\mathcal{C})\vdash Q$ or $(\mathcal{P}\land\mathcal{C})\vdash\neg Q$. Thus, the goal of an abductive problem is not only to find the truth-value of $Q$, but also a corresponding set of commonsense propositions $\mathcal{C}$ that completes the problem. We assume that $\mathcal{P}\land\mathcal{C}\not\vdash\bot$, that is, that the premises $\mathcal{P}$ are not contradictory with commonsense (i.e. that $\mathcal{P}$ and $\mathcal{C}$ are consistent). One can show that, under this assumption, the answer to the problem will not depend on the choice of commonsense set $\mathcal{C}$; details are provided in Appendix A.
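To make the definition of the backbone concrete, the following is a minimal brute-force sketch (stdlib Python only; the helper names `models` and `backbone` are ours, not part of any described implementation) that enumerates truth assignments of a small CNF problem. A practical system would obtain the backbone from a SAT solver rather than by enumeration.

```python
from itertools import product

# A literal is a (variable, polarity) pair; premises are CNF clauses,
# each clause a list of literals. Purely illustrative representation.

def models(variables, clauses):
    """Yield every truth assignment that satisfies all clauses."""
    for values in product([True, False], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(any(assignment[v] == pol for v, pol in clause)
               for clause in clauses):
            yield assignment

def backbone(variables, clauses):
    """Literals true in every model of the premises (premises assumed satisfiable)."""
    sat_models = list(models(variables, clauses))
    result = set()
    for v in variables:
        if all(m[v] for m in sat_models):
            result.add((v, True))
        elif all(not m[v] for m in sat_models):
            result.add((v, False))
    return result

# Premises: A and (A -> B); C is unconstrained, so it is not in the backbone.
V = ["A", "B", "C"]
P = [[("A", True)], [("A", False), ("B", True)]]
print(backbone(V, P))  # {('A', True), ('B', True)}
```

Here the premises force $A$ and $B$ to be true in every model, while $C$ varies, so only the first two literals belong to the backbone.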
In practice, the problems we encounter in real life are often stated in terms of first-order logic. First-order logic is a logical system that extends propositional logic to entities and their predicates. An $n$ -ary predicate is a symbol of a relation, such as $\mathit{MotherOf}$ , that takes as arguments $n$ terms such as $x$ and $y$ to become a formula $MotherOf(x,y)$ , and becomes true or false when constants, such as $\mathit{Alice}$ and $\mathit{Bob}$ , are used as a grounding for its arguments. Predicates can be connected by logical connectors, and can also be quantified over a discrete or abstract set of entities with $\forall$ and $\exists$ , to form compound propositions such as $\forall x\forall y[\mathit{MotherOf}(x,y)\rightarrow\neg\mathit{Male}(x)]$ .
First-order logic formulas over a finite set of entities can always be converted into equivalent propositional logic formulas, a process known as grounding, by instantiating a propositional variable for every predicate $F(x)$ and entity $A$, and expanding $\forall x\,F(x)$ into the compound proposition ( $F(A)\wedge F(B)\wedge\dots$ ) and $\exists x\,F(x)$ into $(F(A)\vee F(B)\vee\dots)$. Given two propositional literals, we will declare them related in first-order logic if they have an entity in common. For example, $\mathit{MotherOf}(\mathit{Alice},\mathit{Bob})$ and $\neg\mathit{Male}(\mathit{Alice})$ are related because both involve the entity $\mathit{Alice}$.
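The grounding expansion and the relatedness check can be sketched as follows (a toy illustration; the helper names and the `~`-for-negation string encoding are our own assumptions, not notation from the text):

```python
# Illustrative sketch of grounding over a finite entity set and of the
# "related" check on ground literals. Negation is written with a '~' prefix.

def ground_forall(predicate, entities):
    """forall x F(x) expands to the conjunction F(A) and F(B) and ..."""
    return [f"{predicate}({e})" for e in entities]   # read as a big AND

def ground_exists(predicate, entities):
    """exists x F(x) expands to the disjunction F(A) or F(B) or ..."""
    return [f"{predicate}({e})" for e in entities]   # read as a big OR

def entities_of(literal):
    """Entity arguments of a ground literal such as '~Male(Alice)'."""
    args = literal.lstrip("~").split("(", 1)[1].rstrip(")")
    return {a.strip() for a in args.split(",")}

def related(l1, l2):
    """Two ground literals are related iff they share an entity."""
    return bool(entities_of(l1) & entities_of(l2))

print(ground_forall("Survive", ["Fox", "Hare"]))       # ['Survive(Fox)', 'Survive(Hare)']
print(related("MotherOf(Alice,Bob)", "~Male(Alice)"))  # True
print(related("MotherOf(Alice,Bob)", "Tall(Carol)"))   # False
```

The only difference between the two expansions is how the resulting ground atoms are connected (conjunction for $\forall$, disjunction for $\exists$).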
<details>
<summary>x3.png Details</summary>

### Visual Description
## Diagram: Hybrid Symbolic-Neural Reasoning Framework
### Overview
The image is a three-panel flowchart illustrating a hybrid system that combines symbolic (calculator icon) and neural (llama icon) processing steps for logical problem-solving. The diagram details the overall framework, a specific logic module, and an augmentation module, showing the flow of data and decision points.
### Components/Axes
The diagram is divided into three vertical panels, each with a distinct background color and title:
- **Panel A (Left, light green background):** Titled "A) The Framework". Contains the high-level system flow.
- **Panel B (Center, light blue background):** Titled "B) The Logic Module". Details the core solving process.
- **Panel C (Right, light pink background):** Titled "C) The Augmentation Module". Details the clause generation process.
**Legend (Bottom Center):**
- Green oval: `start`
- Red oval: `stop`
- Llama icon: `neural step`
- Calculator icon: `symbolic step`
### Detailed Analysis
#### Panel A: The Framework
1. **Top Section:** A green-bordered rectangle contains two white boxes labeled `Text` and `Logic`.
2. **Flow:** An arrow points from this rectangle down to a light blue box labeled `Logic` (containing a calculator and llama icon).
3. **Decision Point:** From the `Logic` box, two arrows branch out:
- Labeled `no`: Points left to a light pink box labeled `Problem Augmentation` (containing a calculator and llama icon).
- Labeled `yes`: Points right to a red oval labeled `Solution`.
4. **Loop:** An arrow from the `Problem Augmentation` box loops back up to the input of the `Logic` box.
#### Panel B: The Logic Module
1. **Input:** An arrow labeled `in` enters from the top.
2. **Step 1:** Points to a white box labeled `SAT Solver` (containing a calculator icon).
3. **Decision 1:** An arrow leads to the text `solvable?`.
- `yes` branch: Points down to a red oval labeled `Solution`.
- `no` branch: Points right to a white box labeled `SC` (containing a llama icon).
4. **Decision 2:** From the `SC` box, an arrow leads to the text `confidence >? γ` (where γ is the Greek letter gamma).
- `yes` branch: Points down to a red oval labeled `Solution`.
- `no` branch: Points down to the text `Not Solvable`.
#### Panel C: The Augmentation Module
1. **Input:** An arrow labeled `in` enters from the top.
2. **Step 1:** Points to a white box labeled `Antecedent Selection` (containing a calculator icon).
3. **Parallel Processing:** Two arrows branch from `Antecedent Selection`:
- Left arrow points to a white box labeled `Scores` (containing a llama icon).
- Right arrow points to a white box labeled `Generate` (containing a llama icon).
4. **Feedback Loop:** An arrow loops from the `Generate` box back to the `Scores` box.
5. **Decision & Output:** An arrow from the `Scores` box leads to the text `scores >? τ` (where τ is the Greek letter tau).
- The `yes` branch (implied) points to a blue oval labeled `Generated Clause`.
### Key Observations
1. **Hybrid Architecture:** Every major processing step (`Logic`, `Problem Augmentation`, `SC`, `Scores`, `Generate`) explicitly combines a symbolic step (calculator) and a neural step (llama), indicating a tightly integrated neuro-symbolic approach.
2. **Iterative Refinement:** Both the overall framework (Panel A) and the augmentation module (Panel C) contain feedback loops (`Problem Augmentation` -> `Logic`; `Generate` -> `Scores`), suggesting an iterative problem-solving or refinement process.
3. **Multi-Stage Verification:** The Logic Module (Panel B) employs a two-stage verification: first a symbolic SAT solver, then a neural confidence check (`SC`), before declaring a problem unsolvable.
4. **Threshold-Based Decisions:** Key decisions are governed by thresholds: a confidence threshold `γ` in the Logic Module and a score threshold `τ` in the Augmentation Module.
### Interpretation
This diagram outlines a sophisticated system designed to solve logical problems (likely from natural language text) by leveraging the strengths of both symbolic AI and neural networks.
- **What it demonstrates:** The system first attempts a direct logical solution using a SAT solver (symbolic). If that fails, it doesn't give up immediately. Instead, it uses a neural component (`SC`) to assess the confidence of a potential solution path. If confidence is high enough, it accepts a solution; otherwise, it deems the problem unsolvable. Concurrently, an augmentation module works to generate new logical clauses (`Generated Clause`) by selecting antecedents and using neural scoring/generation, which can be fed back to reformulate (`Problem Augmentation`) and re-attempt the problem.
- **Relationships:** The Framework (A) is the orchestrator, calling upon the Logic Module (B) for solving and the Augmentation Module (C) for problem reformulation when initial solving fails. The modules are interdependent, with the output of C potentially becoming the input for A and B in a subsequent iteration.
- **Notable Anomalies/Patterns:** The architecture is notable for its defensive, multi-check approach before declaring failure (`Not Solvable`). The clear separation of "solvable" (a syntactic check) from "confidence" (a semantic/quality check) is a key design pattern. The use of distinct colors (green start, red stop, blue generated output) provides clear visual cues for process initiation, termination, and successful output generation.
</details>
Figure 3: ARGOS at a glance. See Section 4.1 and Appendix F for details. (A) Given a propositional logic problem, we iteratively augment the problem with new propositions until it is solvable. (B) We attempt to solve the problem both with a logic solver, and with self-consistency (Wang et al., 2023). (C) If we fail, we attempt to add additional commonsense propositions by combining literals from the backbone as antecedents, and generating a right-hand-side using an LLM. We test the proposition for commonsense and relevance using this same LLM, and add it to the pool if it passes the tests.
### 3.1 Problem Statement
We are given an abductive propositional logic problem in both textual and logical form, as defined in Section 3, and we are also provided with a large language model and a SAT solver. As described, the task is to determine whether the target query is true or false given the premises and some additional commonsense propositions which must be found. Four annotated examples are provided, intended for few-shot prompting. In particular, the task is inference-only and no training phase is involved. We evaluate performance based on the number of correctly answered questions on a test dataset.
## 4 Methodology
We now describe our novel algorithm to tackle the problem described in Section 3.1. This algorithm is described by the diagram in Figure 3, and formally as Algorithm 1 in Appendix D.
### 4.1 Algorithm
We start the algorithm by initializing our set of commonsense propositions as empty, $\mathcal{C}=\{\}$ . As shown in module B of Figure 3, we first try to solve the problem using the SAT Solver (sat_solve) to test whether either $(\mathcal{P}\land\mathcal{C})\vdash Q$ or $(\mathcal{P}\land\mathcal{C})\vdash\neg Q$ . If it reaches one of these conclusions, our job is finished; if not, we at least obtain from our call the backbone $\mathcal{B}=\{L\in\mathcal{L}\,|\,\mathcal{P}\vdash L\}$ .
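The two entailment tests reduce to satisfiability checks: $(\mathcal{P}\land\mathcal{C})\vdash Q$ holds exactly when $\mathcal{P}\land\mathcal{C}\land\neg Q$ is unsatisfiable. A toy brute-force version of this check for an atomic query (stdlib Python; a real implementation would call the SAT solver instead) might look like:

```python
from itertools import product

# Clauses are CNF; a literal is a (variable, polarity) pair. Illustrative only.

def satisfiable(variables, clauses):
    """Brute-force satisfiability over all truth assignments."""
    for values in product([True, False], repeat=len(variables)):
        a = dict(zip(variables, values))
        if all(any(a[v] == pol for v, pol in c) for c in clauses):
            return True
    return False

def entails(variables, clauses, query_var):
    """Return 'true', 'false', or 'unknown' for an atomic query."""
    if not satisfiable(variables, clauses + [[(query_var, False)]]):
        return "true"    # premises and not-Q unsatisfiable  =>  premises |- Q
    if not satisfiable(variables, clauses + [[(query_var, True)]]):
        return "false"   # premises and Q unsatisfiable      =>  premises |- not-Q
    return "unknown"

V = ["A", "B"]
P = [[("A", True)], [("A", False), ("B", True)]]   # A, and A -> B
print(entails(V, P, "B"))                          # true
print(entails(["A", "B"], [[("A", True)]], "B"))   # unknown
```

The "unknown" outcome is precisely the underdetermined case in which the algorithm proceeds to the self-consistency check and, if needed, to problem augmentation.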
Next, still in module B of Figure 3, we attempt to solve the problem using the LLM (llm_solve) by $k$-shot self-consistency (we use $k=5$ in our experiments). We ask the LLM whether the query is true or false, providing it the premises and the commonsense found so far. Details can be found in Appendix F.1. If the fraction of votes passes a certain threshold $\gamma$, we also conclude either $(\mathcal{P}\land\mathcal{C})\vdash Q$ or $(\mathcal{P}\land\mathcal{C})\vdash\neg Q$ respectively, and the algorithm is finished. This parameter $\gamma$ (initialized at $\gamma=1$ in our experiments) is reduced by a fixed amount $\gamma\leftarrow\gamma-\alpha$ at every iteration ( $\alpha=0.1$ in our experiments). Thus, the maximum cost of our algorithm, in terms of the number of COTs required, is bounded by $\text{cost}<k\frac{\gamma-0.5}{\alpha}$: once $\gamma=0.5$, the fraction of votes is guaranteed to pass the threshold, since the vote is over binary classes. For details on empirical cost, see Appendix B.
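The vote-and-decay mechanism can be sketched as follows (the LLM's $k$ COT votes are faked with a fixed list, and `sc_iterations` is a hypothetical helper of ours, not code from the system):

```python
def sc_iterations(votes, gamma=1.0, alpha=0.1, max_iters=10):
    """Return (answer, iteration) once the winning vote share passes gamma,
    where gamma decays by alpha per iteration (gamma <- gamma - alpha)."""
    frac_true = sum(votes) / len(votes)
    share = max(frac_true, 1 - frac_true)  # winning side's share (binary vote)
    for it in range(max_iters):
        if share >= gamma:
            return frac_true >= 0.5, it
        gamma = round(gamma - alpha, 10)   # rounding avoids float drift
    return None, max_iters

# 4 of k=5 COT samples vote "true": accepted once gamma has decayed to 0.8.
print(sc_iterations([True, True, True, True, False]))  # (True, 2)
```

Since the winning share of a binary vote is always at least 0.5, the loop must terminate by the time $\gamma$ reaches 0.5, which is what yields the stated cost bound.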
If neither solving method succeeds in establishing $Q$ or $\neg Q$ , we try to add a new commonsense proposition to our pool $\mathcal{C}$ , as illustrated in module C of Figure 3. In practice, we define a proposition to be commonsense if it seems true to a large language model without any context. To guarantee that the added proposition will grow the problem's backbone, we search for commonsense propositions of the form $L_{1}\wedge L_{2}\rightarrow L_{\text{right}}$ , where $L_{1}$ and $L_{2}$ are literals in the backbone $\mathcal{B}$ , and $L_{\text{right}}$ is a new literal suggested by the LLM. Note that $L_{1}$ and $L_{2}$ may be the same literal, in which case we in effect have a formula of the form $L_{1}\rightarrow L_{\text{right}}$ , thereby allowing both single- and two-literal antecedents. In addition, by adding $\emptyset$ to the set of backbone literals, we can also have $\emptyset\rightarrow L_{\text{right}}$ , allowing zero-literal antecedents. This search routine (find_new_commonsense) is described in Algorithm 2 in the Appendix. In detail, we start by iterating over pairs of literals in the backbone, prioritizing the literals that share the most entities with others in the backbone, $\text{score}_{\mathcal{B}}(L)=\#\{L^{\prime}\in\mathcal{B}\,|\,L^{\prime}\text{ has an entity in common with }L\}$ , so that highly-scored literals are taken first. This score measures the relevance of a literal to the problem. To understand the rationale behind this choice, consider an example in which six relations are known about John and only one about Jane. If asked to guess whom the problem is about, the natural guess would be John: while problems often include extraneous information, it is rare that the majority of a problem is extraneous. Next, for a given pair of literals $L_{1},L_{2}$ , we prompt the LLM (llm_generate) to generate a right-hand-side literal $L_{\text{right}}$ for $L_{1}\wedge L_{2}\rightarrow L_{\text{right}}$ .
In doing so, the LLM might introduce new variables not previously involved in the problem. Details can be found in Appendix F.2. We choose this forward-chaining approach rather than a goal-oriented backward-chaining one for simplicity, since LLMs are much easier to prompt for forward chaining (COT) than for backward chaining (recursive algorithms such as LAMBADA).
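The entity-overlap prioritization can be sketched as below; `entities_of` is a hypothetical helper that returns the set of entities mentioned in a literal, and the actual routine is Algorithm 2 in the Appendix.

```python
from itertools import combinations_with_replacement

def score(literal, backbone, entities_of):
    """Number of other backbone literals sharing an entity with `literal`."""
    return sum(1 for other in backbone
               if other != literal
               and entities_of(other) & entities_of(literal))

def prioritized_pairs(backbone, entities_of):
    """Candidate antecedent pairs (L1, L2), highest-scoring literals first.

    L1 == L2 corresponds to a single-literal antecedent; the empty
    antecedent can be modelled by appending a dummy element to the backbone.
    """
    ranked = sorted(backbone,
                    key=lambda l: score(l, backbone, entities_of),
                    reverse=True)
    yield from combinations_with_replacement(ranked, 2)
```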
Finally, we use the LLM to evaluate each generated $L_{\text{right}}$ twice. First, we score (llm_commonsense_score) whether $L_{1}\wedge L_{2}\rightarrow L_{\text{right}}$ is likely to be commonsense. Second, we score (llm_relevance_score) whether $L_{1}\wedge L_{2}\rightarrow L_{\text{right}}$ is likely to be relevant to our current context. Each procedure returns a probability between 0 and 1. Details can be found in Appendices F.3 and F.4, respectively, and a human evaluation in F.5.
We stop the search at the first new proposition $L_{1}\wedge L_{2}\rightarrow L_{\text{right}}$ whose commonsense and relevance scores are both above a given threshold $\tau$ (we use $\tau=0.3$ in our experiments). When this happens, we update the commonsense set $\mathcal{C}$ with the new proposition and restart the process. If no such proposition is found, further iterations cannot change anything, and we fall back on our best guess, namely the self-consistency estimate. In addition, if after multiple iterations the self-consistency threshold reaches zero, we also exit with the self-consistency estimate.
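Putting generation and the two scores together, the acceptance step of find_new_commonsense can be sketched as follows, where the `llm_*` callables stand in for the prompted procedures of Appendices F.2-F.4:

```python
def find_new_commonsense(pairs, llm_generate, llm_commonsense_score,
                         llm_relevance_score, tau=0.3):
    """Return the first rule L1 ∧ L2 → L_right whose commonsense and
    relevance scores both clear the threshold tau, or None if the
    search is exhausted."""
    for l1, l2 in pairs:
        l_right = llm_generate(l1, l2)
        if (llm_commonsense_score(l1, l2, l_right) >= tau
                and llm_relevance_score(l1, l2, l_right) >= tau):
            return (l1, l2, l_right)
    return None
```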
### 4.2 Example
Consider again the winter fox problem from the introduction. Figure 4 illustrates a hypothetical run of our ARGOS algorithm on this problem. To simplify the illustration, we use only the SAT solver, not self-consistency. We start with the premises (in black) and the query (in purple) in the top left.
We first run the logic solver, which fails to reach any conclusion but returns an initial backbone. The algorithm chooses the antecedents $L_{1}=L_{2}=turns\_white(fox,winter)$ from this backbone, generating a new proposition $turns\_white(fox,winter)\rightarrow reflects(fox,sun)$ . It is commonsensical and relevant to the question, so we add it to the problem. We call the SAT solver again, which adds $reflects(fox,sun)$ to the backbone. Next, the algorithm selects the antecedent $L_{1}=L_{2}=reflects(fox,sun)$ from the new backbone and generates $reflects(fox,sun)\rightarrow\neg absorbs(fox,sun)$ , which is similarly commonsensical and relevant. The SAT solver is called again and adds $\neg absorbs(fox,sun)$ to the backbone. Finally, in the third iteration, ARGOS picks $L_{1}=\neg absorbs(fox,sun)$ and $L_{2}=turns\_white(fox,winter)$ from the backbone and generates $\neg absorbs(fox,sun)\wedge turns\_white(fox,winter)\rightarrow\neg absorbs(white,sun)$ , a logical conclusion it deems consistent with commonsense and relevant to the question. At this point, we call the SAT solver once more, which concludes that $\neg absorbs(white,sun)$ is true and therefore that the query must be false, returning $\mathcal{P}\land\mathcal{C}\vdash\neg Q$ as the conclusion.
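The deductive effect of the three added implications can be mimicked by a small forward-chaining closure (in ARGOS itself this deduction is delegated to the SAT solver; the string encoding with `~` for negated atoms is ours):

```python
def forward_chain(facts, rules):
    """Close a fact set under Horn rules (antecedent_set, consequent)."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in facts and body <= facts:
                facts.add(head)
                changed = True
    return facts

rules = [
    ({"turns_white(fox,winter)"}, "reflects(fox,sun)"),
    ({"reflects(fox,sun)"}, "~absorbs(fox,sun)"),
    ({"~absorbs(fox,sun)", "turns_white(fox,winter)"}, "~absorbs(white,sun)"),
]
# the closure contains ~absorbs(white,sun), contradicting the query atom
closure = forward_chain({"turns_white(fox,winter)"}, rules)
```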
<details>
<summary>Figure 4 (x4.png) content</summary>

The diagram shows the problem text ("Some animals tough winter out... The arctic fox is brown in the summer. His coat turns white in the winter. Query: The arctic fox's coat is white in the winter because white absorbs the sun and is warmer.") and its logical form on the left, the evolving backbone of entailed literals on the right, and the three newly added implications ( $turns\_white(fox,winter)\rightarrow reflects(fox,sun)$ ; $reflects(fox,sun)\rightarrow\neg absorbs(fox,sun)$ ; $\neg absorbs(fox,sun)\wedge turns\_white(fox,winter)\rightarrow\neg absorbs(white,sun)$ ) highlighted in orange as new, ending with the conclusion that the query is false.
</details>
Figure 4: Overview of ARGOS with the winter fox example. We iteratively add to the logic problem and query a logic solver to look for conflicts within the backbone compared to the query. Eventually, we find that $absorbs(white,sun)$ is $False$ , contradicting the query.
## 5 Experiments
#### Models
We employ Llama3-8B (L8B), Llama3-70B (L70B), and Mistral 7B (M7B) as LLMs. Our method depends on access to logit-level outputs, so closed-source models are excluded. Each experiment is conducted on one or two NVIDIA Tesla V100 GPUs, depending on the LLM's GPU memory requirements. As a logic solver, we use CaDiCaL (Biere et al., 2024).
#### Benchmarks
Unfortunately, there are few natural language reasoning datasets that are both strongly logically structured and commonsense-abductive. However, given a dataset of classical commonsense-based logic problems, data transformations that introduce the need for abduction are typically achievable. For a list of common datasets that have proven unsuitable for our setting, with explanations, see Appendix I. For our experiments, we use abductive versions of ProntoQA (Saparov and He, 2023), CLUTRR (Sinha et al., 2019), and FOLIO (Han et al., 2024). CLUTRR is originally multiple-choice rather than True/False; we convert it by randomly asking whether either the correct choice or an incorrect choice is True. While these datasets are better described with first-order logic, we render them propositional by unrolling their quantified formulas over all instantiated terms. While strongly logical and therefore obvious choices, these datasets are not representative of real-world applications or of the generalizability of our method. To test our method's generalizability as well as its broader applicability to real use-cases, we also include some datasets that are not strictly logical. CosmosQA (Huang et al., 2019) and QUAIL (Rogers et al., 2020) are reading comprehension MCQA datasets. Reading comprehension is key for general summarization and interactive QA tasks, which are a common LLM use-case in practice. ESNLI (Camburu et al., 2018) is a short-form natural-language-inference dataset. Each of these datasets requires some form of reasoning, but the structure of both the text and the necessary reasoning is generally fuzzy, requiring subjective interpretation. For the MCQA datasets, we process them into True/False questions as we did for CLUTRR. We note that ProntoQA, CosmosQA and ESNLI performances are already saturated by self-consistency.
Despite this, the results are valuable as they demonstrate that on these apparently simple tasks ARGOS is able to compare with purely neural methods, avoiding the performance collapse that more symbolic methods encounter. For few-shot examples, we randomly remove four problems from each dataset and annotate them with COTs, using the same four examples and annotations for each method. For more dataset details, please see Appendix H.
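The multiple-choice-to-True/False conversion described above can be sketched as follows (a simplified illustration; the prompt phrasing is hypothetical):

```python
import random

def to_true_false(question, correct, distractors, rng=random):
    """Convert a multiple-choice item into a True/False item by randomly
    asking about either the correct choice or a distractor."""
    if rng.random() < 0.5:
        return f"{question} Is the following true: {correct}?", True
    return f"{question} Is the following true: {rng.choice(distractors)}?", False
```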
#### Evaluation
We compare against COT (Wei et al., 2022), Self-Consistency (Wang et al., 2023), SAT-LM (Ye et al., 2023), Logic-of-Thoughts (Liu et al., 2024) and LLM-Tres (Toroghi et al., 2024). For a fair comparison with 20-shot self-consistency and LoT, we set ARGOS's hyperparameters such that it makes no more than 20 COT calls per problem on average. Details are provided in Appendix B. We report accuracy on the abduction-modified evaluation sets; results are shown in Table 1.
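The bootstrap confidence intervals reported in Table 1 can be computed along these lines (a percentile-bootstrap sketch; the resampling count and seeding are our own choices):

```python
import random

def bootstrap_ci(correct_flags, n_boot=10000, level=0.95, seed=0):
    """Percentile bootstrap CI for accuracy over per-question 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(correct_flags)
    accs = sorted(sum(rng.choices(correct_flags, k=n)) / n
                  for _ in range(n_boot))
    lo = accs[int((1 - level) / 2 * n_boot)]
    hi = accs[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi
```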
Figure 5: (a) The number of CLUTRR problems for which ARGOS flips SC predictions correctly and incorrectly. (b) SC and ARGOS accuracies on CLUTRR subsets, partitioned by the number of ARGOS iterations each datapoint receives.
### 5.1 Results and Discussion
As can be seen in Table 1, ARGOS provides significant performance improvements over existing methods (up to +13%). Of the datasets, FOLIO is the most representative of human-generated logical reasoning problems; ARGOS outperforms the baselines on FOLIO, improving performance by 3-10%. On more structured problems (CLUTRR), the symbolic components of ARGOS become more reliable, and we see more consistent gains of 6-8%. On QUAIL, a highly ambiguous dataset that is also formatted inconsistently because it was constructed by scraping forums and wikis, ARGOS improves over self-consistency by up to 13%, demonstrating its ability to adapt even to non-logical contexts. On ProntoQA, ESNLI and CosmosQA, despite the very competitive neural baseline performances, ARGOS performs comparably. Symbolic baselines (SAT-LM, LoT-20, LLM-Tres) see large performance gaps, at times being reduced to guessing. SAT-LM, despite the fact that some datasets are strongly logically structured and that we filter out mis-translated problems, still cannot answer the problems. Even in the best case, it is impossible for purely symbolic methods to handle realistic reasoning scenarios. LLM-Tres, despite having abductive capabilities, is so restricted in its abduction space that it is almost never capable of identifying the rules necessary to solve CLUTRR or ESNLI problems.
Table 1: Binary classification accuracy (True/False) of all methods on the datasets, using the chosen language models. Bolded text indicates that the method has the best performance, and that its performance is better than the next-best-performing method in a statistically significant way ( $p$ -value < 0.005 according to a Wilcoxon pair-wise rank test). Small-font numbers to the right indicate the bounds of the 95% confidence interval, derived via a bootstrap approach.
| | FOLIO | | | CLUTRR | | | PQA | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | M7B | L8B | L70B | M7B | L8B | L70B | M7B | L8B | L70B |
| SC20 | 66% 66.4 65.5 | 71% 71.7 70.1 | 77% 77.7 75.9 | 59% 59.3 58.8 | 69% 69.5 68.8 | 69% 69.4 68.8 | 97% 97.2 95.6 | 95% 95.6 94.4 | 93% 94.1 92.4 |
| COT | 66% 66.4 65.5 | 68% 69.1 67.2 | 72% 72.5 71.8 | 59% 59.3 58.8 | 68% 68.4 67.8 | 66% 66.3 65.6 | 82% 82.9 81.7 | 90% 91.2 89.6 | 93% 94.1 92.4 |
| SAT-LM | 43% 43.2 42.8 | 43% 43.2 42.8 | 43% 43.2 42.8 | 50% 50.4 49.9 | 50% 50.4 49.9 | 50% 50.4 49.9 | 50% 50.3 49.8 | 50% 50.3 49.8 | 50% 50.3 49.8 |
| LoT-20 | 57% 57.3 56.6 | 69% 69.5 68.7 | 70% 70.4 69.5 | 71% 71.6 70.7 | 70% 70.2 69.7 | 69% 69.3 68.7 | 88% 88.4 87.5 | 97% 98.2 96.3 | 95% 95.7 94.3 |
| LLM-Tres | 66% 66.6 65.9 | 63% 63.2 62.4 | 63% 63.2 62.4 | 51% 51.5 50.8 | 51% 51.6 50.8 | 53% 53.2 52.8 | 80% 81.4 79.2 | 83% 83.8 82.3 | 76% 76.6 75.2 |
| ARGOS | 70% 70.6 69.8 | 81% 81.8 80.0 | 80% 80.5 78.8 | 78% 78.4 77.7 | 76% 76.3 75.8 | 78% 78.2 77.7 | 98% 98.7 97.9 | 97% 98.2 96.3 | 97% 98.1 96.2 |
| | CosmosQA | | | ESNLI | | | QUAIL | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | M7B | L8B | L70B | M7B | L8B | L70B | M7B | L8B | L70B |
| SC20 | 84% 84.3 82.9 | 81% 81.3 79.7 | 90% 91.1 89.9 | 97% 97.7 97.1 | 96% 97.0 96.3 | 99% 99.5 99.2 | 70% 71.0 67.9 | 68% 69.1 65.6 | 75% 75.6 72.4 |
| COT | 81% 81.4 80.1 | 76% 77.5 74.2 | 88% 88.7 86.9 | 96% 96.4 95.8 | 88% 88.4 86.9 | 99% 99.4 98.9 | 71% 71.9 68.8 | 65% 66.2 63.6 | 75% 75.6 72.4 |
| SAT-LM | 35% 37.5 34.8 | 35% 37.5 34.8 | 35% 37.5 34.8 | 49% 50.0 47.7 | 49% 50.0 47.7 | 49% 50.0 47.7 | 53% 55.0 51.6 | 53% 55.0 51.6 | 53% 55.0 51.6 |
| LoT-20 | 77% 77.2 76.8 | 75% 75.9 74.2 | 85% 85.7 84.3 | 71% 72.1 70.4 | 76% 77.5 75.7 | 75% 76.1 74.4 | 62% 63.6 59.9 | 56% 57.1 53.8 | 72% 73.1 69.9 |
| LLM-Tres | 73% 74.0 72.1 | 72% 72.7 70.9 | 71% 71.5 69.8 | 51% 52.5 50.8 | 51% 52.5 50.8 | 51% 52.5 50.8 | 63% 65.9 62.8 | 60% 62.1 58.8 | 58% 60.1 57.1 |
| ARGOS | 84% 84.3 82.7 | 83% 84.0 82.6 | 90% 90.7 89.4 | 95% 95.6 94.9 | 96% 96.2 95.5 | 98% 98.0 97.4 | 82% 83.4 80.6 | 82% 83.4 80.7 | 80% 81.8 78.5 |
#### RQ1: How useful are the scoring and backbone-tracking elements?
In Table 2, we test the importance of two elements of ARGOS: (i) score thresholding and (ii) backbone computation. Ablating either element in isolation results in a decrease in performance. Moreover, ablating both results in a larger performance drop than the sum of the two single ablations' decreases. The fully-ablated method, however, still performs strongly relative to the next strongest baseline (SC-20), highlighting the strength of the general concept behind the method. For further ablations, see Appendix E.
#### RQ2: How often are ARGOS' added clauses useful or harmful?
An important criterion when adding clauses is that they do not corrupt the logic of the problem, undesirably changing its outcome. It can be shown (see Appendix A) that as long as clauses are commonsensical, their addition will not corrupt the problem. However, it is possible that our method adds non-commonsensical clauses, since the commonsense scoring is not perfectly reliable. Given CLUTRR's strict structure, and since we know the full knowledge base from which it was constructed, we can reconstruct the full problems and test whether ARGOS' added clauses corrupt the logic such that a different answer is derived for the logic problem. We find that on CLUTRR, ARGOS never corrupts a problem. It is then not surprising that ARGOS sees significant performance gains: added information should in principle never negatively affect a wholly rational reasoner's solution, and so performance should only improve. It is also, of course, important that the added clauses contribute to the (correct) solution of the problem. To identify what information is important to the solution, we add the full relational reasoning rules to the SAT problem representing each CLUTRR example. We then extract the proof, taking all variables mentioned in the proof as important to solving the problem. We can then measure the number of problems for which ARGOS adds at least one new variable that is important to the proof. We find that ARGOS, on Llama 8B, identifies important new variables for 65% of the CLUTRR questions we test on.
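The corruption test can be sketched as follows: an added clause set corrupts a problem if, once added, the solver derives a decided answer that differs from the one implied by the gold knowledge base. The brute-force entailment check here stands in for the SAT solver, and all names are our own.

```python
from itertools import product

def _satisfiable(clauses, n_vars):
    """Brute-force satisfiability over DIMACS-style integer clauses."""
    return any(all(any(a[abs(l) - 1] == (l > 0) for l in c) for c in clauses)
               for a in product([False, True], repeat=n_vars))

def entailed_answer(clauses, query_var, n_vars):
    """True/False if the formula decides the query (by refutation:
    P ⊢ Q iff P ∧ ¬Q is unsatisfiable), else None."""
    if not _satisfiable(clauses + [[-query_var]], n_vars):
        return True
    if not _satisfiable(clauses + [[query_var]], n_vars):
        return False
    return None

def corrupts(clauses, added, query_var, n_vars, gold_answer):
    """Added clauses corrupt the problem if they yield a decided answer
    that disagrees with the gold knowledge base's answer."""
    after = entailed_answer(clauses + added, query_var, n_vars)
    return after is not None and after != gold_answer
```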
#### RQ3: Does ARGOS attribute more compute to harder problems? How does this affect the solution of harder problems?
In Figure 5 (b), we examine the proportion of CLUTRR problems that are solved correctly by SC and ARGOS, over subsets of the dataset grouped by the number of ARGOS iterations before termination. The error bars are 5/95% confidence intervals. As the number of ARGOS iterations increases, the problems become harder for SC to solve (indicated by a lower proportion of correct SC solutions). This tells us that ARGOS' method of evaluating solvability is working as intended: harder problems are being assigned more computation. Another interpretation of this result is that problems which have more missing information, or for which the missing information is more difficult to infer, receive more ARGOS iterations (in order for ARGOS to find the necessary information). This is supported by the fact that the decrease in proportion seen for SC is not present for ARGOS: if SC's performance is dropping due to missing information, then ARGOS is successfully recovering the necessary missing information.
This ability to address the obstacles that cause SC performance to drop contributes to a large number of answers being flipped from incorrect (when solved by self-consistency) to correct (when solved by ARGOS). Changes to the answers caused by new information are more often than not in the right direction: on CLUTRR with L70B, we find 112 correct and 35 incorrect flips. Figure 5 (a) shows the number of correct and incorrect flips ARGOS achieves. As the number of ARGOS iterations increases, both the correct and incorrect flip counts increase, but the correct flip counts increase much faster. For a closer look at confidence-score vs. iteration behavior, see Appendix K.
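The flip counts in Figure 5 (a) can be computed as follows (a sketch over per-question gold labels and predictions):

```python
def count_flips(gold, sc_pred, argos_pred):
    """Count SC answers that ARGOS flips, split into correct flips
    (SC wrong, ARGOS right) and incorrect flips (SC right, ARGOS wrong)."""
    triples = list(zip(gold, sc_pred, argos_pred))
    correct = sum(1 for g, s, a in triples if s != g and a == g)
    incorrect = sum(1 for g, s, a in triples if s == g and a != g)
    return correct, incorrect
```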
Table 2: Ablations. We ablate elements of ARGOS: (i) the score thresholding, taking the first clause sampled at each iteration (ARGOS - No T), (ii) the backbone-tracking, generating prompts by randomly selecting two variables (ARGOS - No BB).
| | FOLIO L8B |
| --- | --- |
| SC-20 | 71% 71.7 70.1 |
| ARGOS - No T | 79% 79.8 78.5 |
| ARGOS - No BB | 79% 79.4 78.4 |
| ARGOS - No Both | 76% 77.2 75.2 |
| ARGOS | 81% 81.8 80.0 |
<details>
<summary>Figure 6 (x5.png) content</summary>

The figure contrasts two reasoning paths on the same CLUTRR item. Context: Antonio is the father of Laura; Pedro is the son of Harold; Antonio is the son of Harold; Shantel is the mother of Pedro. Question: Is Shantel Laura's grandmother? (Gold label: True.) The COT path reasons: Shantel is Pedro's mother; Antonio is Harold's son; Antonio is Laura's father; then incorrectly concludes "Shantel is Laura's aunt" and answers No. The ARGOS path adds the facts that Antonio is the brother of Pedro, Shantel is the mother of Antonio, and Harold is the husband of Shantel, enabling the chain: Antonio is Shantel's son; Laura is Antonio's daughter; therefore Shantel is Laura's grandmother, answering Yes.
</details>
Figure 6: COT vs ARGOS on a CLUTRR problem.
Figure 6 provides an example of a question from CLUTRR that is misclassified by self-consistency but flipped to correct by ARGOS. The COT seems confused, displaying its characteristic inability to plan out a proof: in steps 1-3 it provides disjoint pieces of information that neither follow from each other nor move towards the target conclusion. This confusion eventually leads to an incorrect step, "Shantel is Laura's aunt", resulting in an incorrect conclusion. ARGOS, after 3 iterations, provides several pieces of key information which would each require at least one additional reasoning step to find, halving the necessary chain length. For some examples in which ARGOS fails, see Appendix J. For results, evaluated by a human, on ARGOS and COT faithfulness on FOLIO, see Appendix C.
### 5.2 Impact of Imperfect Logical Translation
Here, we test empirically whether the assumption of perfect logical translation made in our experimental procedure is justified. In our experiments, we assumed that we started from a propositional logic formulation. Some datasets came with an official formulation, while for the others we translated from text using Claude Opus 4, filtering out failed translations. This was done to fairly evaluate the methods on abductive reasoning, regardless of translation quality. In general, logical translation is kept as a module separate from proof planning/execution in logical reasoning systems. Moreover, the translation task is intrinsically simpler for LLMs, since it is linguistic rather than cognitive: LLMs have already demonstrated strong abilities at logic translation (Yang et al., 2024), and are expected to continue improving faster at translation than at reasoning. To validate this claim in our context, we re-tested ARGOS with Llama 8B on FOLIO using a translator, but without filtering out failed translations. Performance only decreased marginally, from 80% to 78%, still outperforming the next best method (SC at 71%). On QUAIL, ARGOS performance dropped from 82% to 73%, which, while large relative to the drop on FOLIO, still leaves ARGOS as the best-performing method on QUAIL.
## 6 Conclusion
We have presented a method for addressing realistic natural-language logic problems, where "realistic" entails a need for abduction and commonsense. We demonstrate empirically that existing methods, whether neural or symbolic, struggle in this setting. The method we present addresses this weakness by (a) balancing neural and symbolic elements and allowing them to speak to each other; and (b) avoiding the commonplace design choice of heavily restricting the abduction-clause search space. On both general and highly structured logic problems, our method demonstrates the power of a balanced neuro-symbolic approach, meaningfully outperforming all existing work.
#### Limitations
A limitation of our work is that it is currently restricted to problems that are strictly True or False, eliminating cases where logic might be used to select an option from a list of choices, or where the correct answer is "Maybe". In our experimental work, we addressed the consequences this had on dataset selection by converting datasets to True/False. The method could, however, be extended to multiple-choice questions by posing each choice as an individual True/False question, combined with a decision heuristic for when no choice, or multiple choices, are determined True. Another limitation is that we restrict ARGOS to generating rules with up to two literals in the antecedent. While many-literal propositional formulas can often be decomposed into smaller ones, this may not always be the case, and an ideal method would allow for large-antecedent generation. Thirdly, while our goal was to develop methods for open-source LLMs, the method would be more easily applicable if it did not require logit-level access. A potential future direction might be to convert the scoring system from a logit-based one to a verbalized score. Additionally, most benchmarks employed were modified to our setting, making the evaluated tasks at times artificial. ARGOS also sometimes depends on self-consistency for its final prediction, so unfaithful or hallucinatory COTs can affect ARGOS's answer. Finally, this work focuses on forward chaining; a future direction may be backward-chaining approaches to abductive reasoning.
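The multiple-choice extension suggested above could look like the following hypothetical heuristic, where `true_false_solver` is a stand-in that returns a (verdict, confidence) pair per choice:

```python
def answer_multiple_choice(choices, true_false_solver):
    """Pose each choice as a True/False question; if exactly one choice is
    judged True, pick it, otherwise fall back on solver confidence."""
    verdicts = [true_false_solver(c) for c in choices]  # (is_true, confidence)
    true_idx = [i for i, (t, _) in enumerate(verdicts) if t]
    if len(true_idx) == 1:
        return true_idx[0]
    # zero or several True verdicts: pick the choice most confidently True
    # (equivalently, least confidently False)
    return max(range(len(choices)),
               key=lambda i: verdicts[i][1] if verdicts[i][0]
               else -verdicts[i][1])
```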
## Reproducibility Statement
In the supplementary material, we provide the full code used to implement and benchmark our method as well as the baselines, including data processing steps. We also took care to include in the Appendix detailed descriptions of our algorithm and our prompts. While our human modification of the FOLIO text is not provided, the process for generating it is described carefully in the Appendix.
## References
- A. Biere, T. Faller, K. Fazekas, M. Fleury, N. Froleyks, and F. Pollitt (2024) CaDiCaL 2.0. In Proc. Int. Conf. on Computer Aided Verification, pp. 133–152.
- O. Camburu, T. Rocktäschel, T. Lukasiewicz, and P. Blunsom (2018) e-SNLI: natural language inference with natural language explanations. In Proc. Conf. Neural Information Processing Systems, Vol. 31.
- P. Clark, O. Tafjord, and K. Richardson (2021) Transformers as soft reasoners over language. In Proc. Int. Joint Conf. on Artificial Intelligence, pp. 3882–3890.
- D. Crevier (1993) AI: The Tumultuous History of the Search for Artificial Intelligence. Basic Books.
- N. Dziri, X. Lu, M. Sclar, X. Li, L. Jiang, B. Y. Lin, S. Welleck, P. West, C. Bhagavatula, R. Le Bras, J. Hwang, S. Sanyal, X. Ren, A. Ettinger, Z. Harchaoui, and Y. Choi (2023) Faith and fate: limits of transformers on compositionality. In Proc. Conf. Neural Information Processing Systems, Vol. 36, pp. 70293–70332.
- S. Han, H. Schoelkopf, Y. Zhao, Z. Qi, M. Riddell, W. Zhou, J. Coady, D. Peng, Y. Qiao, L. Benson, L. Sun, A. Wardle-Solano, H. Szabó, E. Zubova, M. Burtell, J. Fan, Y. Liu, B. Wong, M. Sailor, A. Ni, L. Nan, J. Kasai, T. Yu, R. Zhang, A. Fabbri, W. M. Kryscinski, S. Yavuz, Y. Liu, X. V. Lin, S. Joty, Y. Zhou, C. Xiong, R. Ying, A. Cohan, and D. Radev (2024) FOLIO: natural language reasoning with first-order logic. In Proc. Conf. on Empirical Methods in Natural Language Processing, pp. 22017–22031.
- L. Huang, R. Le Bras, C. Bhagavatula, and Y. Choi (2019) Cosmos QA: machine reading comprehension with contextual commonsense reasoning. In Proc. Conf. Empirical Methods in Natural Language Processing and Int. Joint Conf. on Natural Language Processing (EMNLP-IJCNLP), pp. 2391–2401.
- M. Kazemi, N. Kim, D. Bhatia, X. Xu, and D. Ramachandran (2023) LAMBADA: backward chaining for automated reasoning in natural language. In Proc. Conf. Association for Computational Linguistics, pp. 6547–6568.
- T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa (2022) Large language models are zero-shot reasoners. In Proc. Conf. Neural Information Processing Systems, pp. 22199–22213.
- J. Lee and W. Hwang (2024) SymBa: symbolic backward chaining for structured natural language reasoning. arXiv preprint arXiv:2402.12806.
- J. Liu, L. Cui, H. Liu, D. Huang, Y. Wang, and Y. Zhang (2021) LogiQA: a challenge dataset for machine reading comprehension with logical reasoning. In Proc. Int. Joint Conf. on Artificial Intelligence, pp. 3622–3628.
- T. Liu, W. Xu, W. Huang, Y. Zeng, J. Wang, X. Wang, H. Yang, and J. Li (2024) Logic-of-thought: injecting logic into contexts for full reasoning in large language models. arXiv preprint arXiv:2409.17539.
- Q. Lyu, S. Havaldar, A. Stein, L. Zhang, D. Rao, E. Wong, M. Apidianaki, and C. Callison-Burch (2023) Faithful chain-of-thought reasoning. In Int. Joint Conf. Natural Language Processing and Conf. Asia-Pacific Chapter of the Association for Computational Linguistics.
- N. J. Nilsson (1991) Logic and artificial intelligence. Artificial Intelligence 47 (1), pp. 31–56.
- T. Olausson, A. Gu, B. Lipkin, C. Zhang, A. Solar-Lezama, J. Tenenbaum, and R. Levy (2023) LINC: a neurosymbolic approach for logical reasoning by combining language models with first-order logic provers. In Proc. Conf. on Empirical Methods in Natural Language Processing, pp. 5153–5176.
- L. Pan, A. Albalak, X. Wang, and W. Wang (2023) Logic-LM: empowering large language models with symbolic solvers for faithful logical reasoning. In Findings of the Association for Computational Linguistics, pp. 3806–3824.
- F. Petroni, T. Rocktäschel, S. Riedel, P. Lewis, A. Bakhtin, Y. Wu, and A. Miller (2019) Language models as knowledge bases? In Proc. Conf. Empirical Methods in Natural Language Processing and Int. Joint Conf. on Natural Language Processing (EMNLP-IJCNLP), pp. 2463–2473.
- A. Plaat, A. Wong, S. Verberne, J. Broekens, N. van Stein, and T. Back (2024) Reasoning with large language models: a survey. arXiv preprint arXiv:2407.11511.
- A. Rogers, O. Kovaleva, M. Downey, and A. Rumshisky (2020) Getting closer to AI-complete question answering: a set of prerequisite real tasks. In Proc. AAAI Conf. Artificial Intelligence, Vol. 34, pp. 8722–8731.
- A. Saparov and H. He (2023) Language models are greedy reasoners: a systematic formal analysis of chain-of-thought. In Proc. Int. Conf. Learning Representations.
- K. Sinha, S. Sodhani, J. Dong, J. Pineau, and W. L. Hamilton (2019) CLUTRR: a diagnostic benchmark for inductive reasoning from text. In Proc. Conf. Empirical Methods in Natural Language Processing.
- O. Tafjord, B. Dalvi, and P. Clark (2021) ProofWriter: generating implications, proofs, and abductive statements over natural language. In Proc. Conf. Association for Computational Linguistics: ACL-IJCNLP, pp. 3621–3634.
- J. Tian, Y. Li, W. Chen, L. Xiao, H. He, and Y. Jin (2021) Diagnosing the first-order logical reasoning ability through LogicNLI. In Proc. Conf. Empirical Methods in Natural Language Processing, pp. 3738–3747.
- A. Toroghi, W. Guo, A. Pesaranghader, and S. Sanner (2024) Verifiable, debuggable, and repairable commonsense logical reasoning via LLM-based theory resolution. In Proc. Conf. on Empirical Methods in Natural Language Processing, pp. 6634–6652.
- S. Wang, W. Zhong, D. Tang, Z. Wei, Z. Fan, D. Jiang, M. Zhou, and N. Duan (2022) Logic-driven context extension and data augmentation for logical reasoning of text. In Proc. Conf. Association for Computational Linguistics, pp. 1619–1629.
- X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou (2023) Self-consistency improves chain of thought reasoning in language models. In Proc. Int. Conf. on Learning Representations.
- J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. (2022) Chain-of-thought prompting elicits reasoning in large language models. In Proc. Conf. Neural Information Processing Systems, pp. 24824–24837.
- J. Xu, H. Fei, L. Pan, Q. Liu, M. Lee, and W. Hsu (2024) Faithful logical reasoning via symbolic chain-of-thought. In Proc. Conf. Association for Computational Linguistics, pp. 13326–13365.
- S. Xue, Z. Huang, J. Liu, X. Lin, Y. Ning, B. Jin, X. Li, and Q. Liu (2024) Decompose, analyze and rethink: solving intricate problems with human-like reasoning cycle. In Proc. Conf. Neural Information Processing Systems, pp. 357–385.
- Y. Yang, S. Xiong, A. Payani, E. Shareghi, and F. Fekri (2024) Harnessing the power of large language models for natural language to first-order logic translation. In Proc. Conf. Association for Computational Linguistics, pp. 6942–6959.
- S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan (2023) Tree of thoughts: deliberate problem solving with large language models. In Proc. Conf. Neural Information Processing Systems, pp. 11809–11822.
- X. Ye, Q. Chen, I. Dillig, and G. Durrett (2023) SatLM: satisfiability-aided language models using declarative prompting. In Proc. Conf. Neural Information Processing Systems.
- W. Yu, Z. Jiang, Y. Dong, and J. Feng (2020) ReClor: a reading comprehension dataset requiring logical reasoning. In Proc. Int. Conf. on Learning Representations.
## Appendix A Abductive logic problems are well-defined
In this section, we prove that the solution of an abductive propositional logic problem, as defined in Section 3, does not depend on which subset of the commonsense set $\mathcal{C}$ is used in the derivation.
**Proposition 1**
*Let $\mathcal{P}$ be a set of premises, $Q$ a query proposition, and $\mathcal{C}_{1},\mathcal{C}_{2}\subseteq\mathcal{C}$ subsets of a commonsense set $\mathcal{C}$ of additional propositions such that $(\mathcal{P}\land\mathcal{C}_{1})\vdash L_{1}$ and $(\mathcal{P}\land\mathcal{C}_{2})\vdash L_{2}$ for literals $L_{1},L_{2}\in\{Q,\neg Q\}$. If $\mathcal{P}$ is consistent with $\mathcal{C}$ ($\mathcal{P}\land\mathcal{C}\not\vdash\bot$), then $L_{1}\leftrightarrow L_{2}$.*
*Proof.*
Suppose $L_{1}\not\leftrightarrow L_{2}$: without loss of generality, take $(\mathcal{P}\land\mathcal{C}_{1})\vdash Q$ and $(\mathcal{P}\land\mathcal{C}_{2})\vdash\neg Q$. Then $(\mathcal{P}\land\mathcal{C}_{1}\land\mathcal{C}_{2})\vdash(Q\land\neg Q)\vdash\bot$. But $\mathcal{C}_{1},\mathcal{C}_{2}\subseteq\mathcal{C}$, so $(\mathcal{P}\land\mathcal{C})\leftrightarrow(\mathcal{P}\land\mathcal{C}_{1}\land\mathcal{C}_{2}\land[\mathcal{C}\setminus(\mathcal{C}_{1}\cup\mathcal{C}_{2})])$, and hence $\mathcal{P}\land\mathcal{C}\vdash\bot$. This contradicts our assumption that $\mathcal{P}\land\mathcal{C}\not\vdash\bot$. ∎
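Although the proof above is fully general, Proposition 1 can also be checked by brute force on a small propositional instance. The toy premises below (an implication $a\rightarrow q$ with commonsense subsets $\{a\}$ and $\{a\land b\}$) are purely illustrative:

```python
from itertools import product

# Each formula is a function from a truth assignment (dict) to bool.
ATOMS = ["a", "b", "q"]

def models(formulas):
    """All assignments satisfying every formula."""
    out = []
    for values in product([False, True], repeat=len(ATOMS)):
        v = dict(zip(ATOMS, values))
        if all(f(v) for f in formulas):
            out.append(v)
    return out

def entails(formulas, lit):
    """formulas |- lit: lit holds in every model, and a model exists."""
    ms = models(formulas)
    return bool(ms) and all(lit(v) for v in ms)

# Premises P: a -> q.  Commonsense subsets C1 = {a}, C2 = {a AND b}.
P = [lambda v: (not v["a"]) or v["q"]]
C1 = [lambda v: v["a"]]
C2 = [lambda v: v["a"] and v["b"]]
C = C1 + C2

Q = lambda v: v["q"]
notQ = lambda v: not v["q"]

assert models(P + C)       # P is consistent with C
assert entails(P + C1, Q)  # (P AND C1) |- Q
assert entails(P + C2, Q)  # (P AND C2) |- Q
# Proposition 1: since P is consistent with C, neither subset derives the
# opposite literal, so the two conclusions agree.
assert not entails(P + C1, notQ) and not entails(P + C2, notQ)
```

Both subsets derive the same literal $Q$, as the proposition guarantees whenever $\mathcal{P}\land\mathcal{C}\not\vdash\bot$.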
## Appendix B Cost Discussion
While in theory COT generation continues until an answer is found, in practice an upper limit on the number of generated tokens must be enforced, in case (a) the LLM continues generating past its answer, or (b) the LLM goes off-track and never answers the question. Under a worst-case assumption, each COT generation is therefore equal in cost. In addition, the various method-specific LLM generations that are employed require small token limits relative to COT, so the number of COT calls dominates the total number of tokens generated by any method. As problems become harder and more logically complex, the necessary COT generation length increases, making this even more true. We can therefore say that the cost of each method scales in proportion to the number of COTs generated. For example, Self-Consistency takes two hours longer to run than ARGOS on FOLIO with Llama 8B, despite ARGOS requiring more total LLM calls (counting scoring and literal-generation calls): the average number of COT-specific calls is lower for ARGOS than for SC, and the scoring and literal-generation calls are much shorter than COT calls. In our implementation, we generate at most 25 tokens for literal generation, at most 1 token for scoring, and at most 300 tokens for COT generation. Given this, budgeting method costs in terms of COT calls is well justified. For SC and COT, the cost evaluation is trivial: COT always makes 1 COT call and SC makes a fixed number of COT calls, specified as a hyper-parameter. Similarly, LOT makes some small generative calls followed by SC, so its number of COT calls is also fixed. LLM-Tres makes no COT calls, and neither does SAT-LM. ARGOS' cost varies by problem, but its hyper-parameters (number of COT calls per iteration and threshold/annealing constants) can be set such that its average (or worst-case) number of calls stays under a budget.
A summary of method cost in terms of COT calls is provided in Table 3. In Figure 7, we show a histogram of the individual problem costs incurred by ARGOS with Llama 8B, expressed as the number of COTs required per problem. Most problems require only around 10 COTs, which leaves budget for ARGOS to exceed the average cap on problems requiring more ablation or deeper search. Consider a new dataset composed of only the hardest problems, for example the CLUTRR problems for which ARGOS takes 8-10 steps (the final bar in Figure 5(b)). Testing ARGOS limited to 20 COTs per problem (strictly, not on average), its performance drops from 65% to 54% on these problems, whereas SC achieves 40%. This indicates a clear and steep cost-performance tradeoff for ARGOS, but even with strict cost limitations ARGOS outperforms SC.
Table 3: Average number of COT calls required by each method.
| | Cost (Avg # COT) |
| --- | --- |
| COT | 1 |
| SC | 20 |
| LOT | 20 |
| SAT-LM | 0 |
| LLM-Tres | 0 |
| ARGOS | 18.4 |
The histogram plots the number of COTs required per problem (x-axis, 10 to 35+, in bins of width 5) against the problem count (y-axis, 0 to 600). The distribution is strongly right-skewed: the 10-15 bin dominates with roughly 650 problems, counts drop sharply in the 15-20 bin, and a long, thin tail of rare, high-cost problems extends beyond 35 COTs.
Figure 7: Histogram of individual problem costs to ARGOS-Llama 8B on CLUTRR
## Appendix C Manual Human Faithfulness Evaluation
For COT, we manually checked the chains that result in the correct final answer, generated with Llama 8B. For ARGOS, if ARGOS made the correct final prediction using the SC output, we check one of the chains from the final SC call. If ARGOS made the final prediction using the symbolic solver, the reasoning is considered correct if none of the added clauses are incorrect (the reasoning itself is beyond reproach, as it is executed by a solver). Additionally, for ARGOS, we evaluate whether the added information is useful for solving the problem, and whether it is new to the problem. We find that Llama 8B generates a faithful COT reasoning process when it gets the answer right 72% of the time, whereas ARGOS-L8B does so 85% of the time, showing that in general ARGOS-L8B is more faithful than pure COT-based methods (i.e., COT, SC). Three potential explanations for this result, ordered from weakest to strongest in terms of justifying our method, are: (1) the logical form of ARGOS' augmented prompt, regardless of its content, incites a more logical structure in the LLM generation; (2) ARGOS successfully extracts the key elements of the problem, stabilizing the LLM and making it less likely to become unfaithful or to hallucinate; (3) ARGOS adds new information which allows otherwise difficult problems to be solved. Empirically, we find that ARGOS-L8B adds at least one piece of information necessary for generating the faithful proof 85% of the time, lending credibility to explanation (2). Additionally, ARGOS adds new information to the text problem 72% of the time, which lends some credibility to explanation (3), the explanation that most strongly justifies the extent of our method, whose explicit goal is to search for new and crucial commonsense.
Most interestingly, we find that 100% of the time in which we solve symbolically, the information added by ARGOS is faithful, useful, and novel. This is perhaps not surprising: to achieve the contrary, we would have to add information exactly contradictory to the true proof. Nonetheless, it is an extremely satisfying finding, as it shows that when we avoid typical symbolic robustness issues such as symbol mismatch, ARGOS maintains the rigour characteristic of symbolic methods.
## Appendix D Algorithm Description
In this section we provide a detailed description of our ARGOS algorithm. The main procedure is summarized as Algorithm 1, which uses the find_new_commonsense subroutine in Algorithm 2.
Algorithm 1 ARGOS
1: premises $\mathcal{P}$ , query $Q$ , SC sample-count $k$ , scoring threshold $\tau\in(0,1]$ ,
2: self-consistency threshold $\gamma\in(0,1]$ and decay $\alpha\in(0,1]$
3: commonsense set $\mathcal{C}\leftarrow\{\}$
4: while $\gamma>0$ do
5: // Attempt solving with the SAT solver
6: sat_conclusion, backbone ${\mathcal{B}}$ $\leftarrow\texttt{sat\_solve}(\mathcal{P}\land\mathcal{C}\vdash Q,\neg Q)$
7: if sat_conclusion is $(\mathcal{P}\land\mathcal{C})\vdash Q$ or $(\mathcal{P}\land\mathcal{C})\vdash\neg Q$ then
8: return $\mathcal{C}$ , sat_conclusion
9: end if
10: // Else attempt solving with the LLM ( $k$ -shot self-consistency)
11: llm_conclusion, llm_confidence $\leftarrow\texttt{llm\_solve}(\mathcal{P}\land\mathcal{C}\vdash Q,\neg Q)$
12: if llm_confidence $>\gamma$ then
13: return $\mathcal{C}$ , llm_conclusion
14: end if
15: // Else find a new commonsense proposition to add to the pool
16: $C\leftarrow\texttt{find\_new\_commonsense}(\mathcal{P},\mathcal{C},\mathcal{B},\tau)$
17: if $C$ is not None then
18: // New commonsense has been found, we try again with an enlarged $\mathcal{C}$ and smaller $\gamma$
19: $\mathcal{C}\leftarrow\mathcal{C}\cup\{C\}$
20: $\gamma\leftarrow\gamma-\alpha$
21: else
22: // We failed, return best guess
23: return $\mathcal{C}$ , llm_conclusion
24: end if
25: end while
26: // We ran out of time, return best guess
27: return $\mathcal{C}$ , llm_conclusion
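The main loop of Algorithm 1 can be sketched in Python as follows. `sat_solve`, `llm_solve`, and `find_new_commonsense` are placeholders for the solver and LLM subroutines (Appendix F), and the parameter defaults are illustrative, not our tuned values:

```python
def argos(premises, query, sat_solve, llm_solve, find_new_commonsense,
          tau=0.3, gamma=1.0, alpha=0.2):
    """Sketch of the ARGOS main loop (Algorithm 1).

    sat_solve(P, C, Q)            -> (conclusion or None, backbone)
    llm_solve(P, C, Q)            -> (conclusion, confidence)
    find_new_commonsense(P, C, B, tau) -> new clause or None
    """
    commonsense = []
    llm_conclusion = None
    while gamma > 0:
        # Attempt solving with the SAT solver.
        sat_conclusion, backbone = sat_solve(premises, commonsense, query)
        if sat_conclusion is not None:
            return commonsense, sat_conclusion
        # Else attempt solving with the LLM (k-shot self-consistency).
        llm_conclusion, llm_confidence = llm_solve(premises, commonsense, query)
        if llm_confidence > gamma:
            return commonsense, llm_conclusion
        # Else search for a new commonsense proposition to add to the pool.
        clause = find_new_commonsense(premises, commonsense, backbone, tau)
        if clause is None:
            return commonsense, llm_conclusion  # failed, return best guess
        commonsense.append(clause)
        gamma -= alpha  # anneal the self-consistency threshold
    return commonsense, llm_conclusion  # ran out of iterations, best guess
```

The annealing of $\gamma$ means each failed iteration both enlarges $\mathcal{C}$ and lowers the bar for accepting the LLM's self-consistent answer, guaranteeing termination.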
Algorithm 2 find_new_commonsense
1: premises $\mathcal{P}$ , commonsense $\mathcal{C}$ , backbone $\mathcal{B}=\text{backbone}(\mathcal{P}\land\mathcal{C})$ , scoring threshold $\tau\in(0,1]$
2: for $L_{1}\in\mathcal{B}$ from highest to lowest score do
3: for $L_{2}\in\mathcal{B}$ from highest to lowest score do
4: for $L_{\text{right}}$ in $\texttt{llm\_generate}_{\mathcal{P}\land\mathcal{C}}(L_{1}\wedge L_{2}\rightarrow?)$ do
5: commonsense_score $\leftarrow\texttt{llm\_commonsense\_score}(L_{1}\wedge L_{2}\rightarrow L_{\text{right}})$
6: relevance_score $\leftarrow\texttt{llm\_relevance\_score}_{\mathcal{P}\land\mathcal{C}}(L_{1}\wedge L_{2}\rightarrow L_{\text{right}})$
7: if commonsense_score $\;>\tau$ and relevance_score $\;>\tau$ then
8: // We found a new relevant commonsense clause
9: return $L_{1}\wedge L_{2}\rightarrow L_{\text{right}}$
10: end if
11: end for
12: end for
13: end for
14: // We failed to find a new relevant commonsense clause
15: return None
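The subroutine admits a similarly compact sketch. The `generate`, `commonsense_score`, and `relevance_score` callables stand in for the LLM interactions of Appendix F, and the backbone is assumed to be pre-sorted from highest to lowest score:

```python
def find_new_commonsense(backbone, generate, commonsense_score,
                         relevance_score, tau=0.3):
    """Sketch of Algorithm 2: scan backbone pairs for the first candidate
    rule l1 AND l2 -> l_right that passes both score thresholds."""
    for l1 in backbone:
        for l2 in backbone:
            for l_right in generate(l1, l2):
                rule = (l1, l2, l_right)  # encodes l1 AND l2 -> l_right
                if (commonsense_score(rule) > tau
                        and relevance_score(rule) > tau):
                    return rule  # found a relevant commonsense clause
    return None  # search exhausted without a passing clause
```

Because the backbone is sorted, the first rule returned is the one built from the most promising literal pair, which is what keeps the search tractable in practice.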
## Appendix E Ablations
| | FOLIO 8B | CLUTRR 8B |
| --- | --- | --- |
| ARGOS-Symbolic | 59% | 72% |
| ARGOS | 81% | 76% |
| SC | 71% | 69% |
| LLM-Tres | 63% | 51% |
Table 4: Ablating the SC-solver on ARGOS. ARGOS-Symbolic denotes the ablated version of ARGOS.
In Table 4, we ablate the SC-solver, solving problems with only the symbolic solver. We impose a threshold of 100 proposed rules, since the previous bounding system was in terms of SC confidence. We find that removing the SC option only somewhat hurts ARGOS performance on CLUTRR, but has a catastrophic effect on FOLIO. This is not surprising, since CLUTRR is far simpler and more logically structured than FOLIO. Interestingly, we find that LLM-Tres performs slightly better than ARGOS-Symbolic on FOLIO. On investigation, we found that propositions with a single-literal antecedent were generally enough to solve FOLIO problems, so LLM-Tres' methodological restriction to such rules benefits it, whereas on CLUTRR, where nearly all necessary propositions have two literals in the antecedent, LLM-Tres is reduced to guessing. The massive performance gap ARGOS-Symbolic exhibits between the more linguistic and the more logical datasets illustrates the need for balanced, neuro-symbolic approaches.
| | FOLIO 8B |
| --- | --- |
| No Commonsense | 79% |
| No Contextual | 80% |
| Full ARGOS | 81% |
Table 5: Ablating individual score thresholds
In Table 5, we separately ablate the commonsense and contextualness scoring components. We find that both are necessary. This makes sense: while they may overlap in terms of the rules they serve to reject (for example, gibberish output would be neither commonsensical nor contextually relevant), it is easily conceivable that the sets of proposed propositions they correctly reject do not totally overlap.
## Appendix F Algorithm LLM Details
In this section we provide details about the parts of the algorithm that involve interactions with the large language model.
### F.1 Solving with 5-round self-consistency ( llm_solve )
This subroutine aims to establish whether the premises and commonsense entail the query, i.e., $(\mathcal{P}\land\mathcal{C})\vdash Q$ or $(\mathcal{P}\land\mathcal{C})\vdash\neg Q$. The routine always returns an answer, but can make mistakes. We few-shot prompt the LLM 5 times with the prompt in Table 6.
Table 6: COT prompt
| Here are some facts and rules: [premises $\mathcal{P}$ ] |
| --- |
| Here is some additional info we found: [commonsense $\mathcal{C}$ ] |
| True or false: [query $Q$ ]? |
| Answer: |
Each call returns an answer $a_{1},\dots,a_{5}\in\{\text{True},\text{False}\}$, each with a confidence $c_{1},\dots,c_{5}\in(0,1]$. The algorithm returns the most common answer $a^{*}$ (True or False), together with a total confidence score given by the sum of the confidences of the calls agreeing with $a^{*}$, divided by 5:
$$c^{*}=\frac{1}{5}\sum_{i=1}^{5}c_{i}\,\mathbbm{1}\big[a_{i}=a^{*}\big].$$
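The aggregation defined by the $c^{*}$ formula can be written directly; the function name and example values below are ours:

```python
from collections import Counter

def self_consistency(answers, confidences, k=5):
    """Majority answer over k COT samples, with total confidence
    c* = (1/k) * sum of confidences of the calls agreeing with it."""
    a_star, _ = Counter(answers).most_common(1)[0]
    c_star = sum(c for a, c in zip(answers, confidences) if a == a_star) / k
    return a_star, c_star

# 3 "True" votes (0.9, 0.8, 0.7) vs. 2 "False" votes (0.6, 0.5)
answers = ["True", "True", "False", "True", "False"]
confidences = [0.9, 0.8, 0.6, 0.7, 0.5]
a_star, c_star = self_consistency(answers, confidences)
# majority answer "True", c* = (0.9 + 0.8 + 0.7) / 5 ≈ 0.48
```

Note that minority-vote confidences are discarded entirely, so a unanimous low-confidence answer can still outscore a split high-confidence one.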
### F.2 Generating new commonsense literals $L_{\text{right}}$ ( llm_generate )
This subroutine aims to find a plausible right-hand-side literal $L_{\text{right}}$ for a proposition $L_{1}\wedge L_{2}\rightarrow L_{\text{right}}$. The new literal may involve new variables not previously seen in the problem. We use a slightly different prompt for CLUTRR than for the other datasets, because of CLUTRR's more distinct structure. Anecdotally, we found that CLUTRR's more consistent linguistic structure allowed for very straightforward prompt formats, while more linguistically complex datasets required more robust prompt formatting. For example, asking whether a rule seems contradictory was more robust in situations where a rule was ambiguous without context (e.g., student(Rina) implies like_coffee(Rina), vs. Mom(Hannah, Sam) and Sibling(Sam, Mary) implies Mom(Hannah, Mary)).
Table 7: Clause generation prompt for all datasets but CLUTRR. $e$ is an entity appearing in $L_{1}$ or $L_{2}$ .
| Fill in the blank with a known predicate: [ $L_{1}\wedge L_{2}$ ] implies ___([ $e$ ]). |
| --- |
| Known predicates are: [all predicates appearing in the premises $\mathcal{P}$ and the commonsense $\mathcal{C}$ ] |
| Answer: |
Table 8: Clause generation prompt for CLUTRR. $e_{1}$ and $e_{2}$ are entities appearing in $L_{1}$ or $L_{2}$ .
| If [ $L_{1}\wedge L_{2}$ ] then ___([ $e_{1}$ ],[ $e_{2}$ ]). Fill in the blank. |
| --- |
| Answer: |
Thus, for example, in FOLIO if $L_{1}=\textit{drinksCoffee}(Rina)$ and $L_{2}=Loves(Mary,Sam)$ , then we would make three calls, one for $\textit{drinksCoffee}(Rina)\wedge Loves(Mary,Sam)\rightarrow F(Rina)$ , $\textit{drinksCoffee}(Rina)\wedge Loves(Mary,Sam)\rightarrow F(Mary)$ and $\textit{drinksCoffee}(Rina)\wedge Loves(Mary,Sam)\rightarrow F(Sam)$ respectively. With such calls, the method might return the set $\{productive(Rina),hasFeelings(Mary),isLoved(Sam)\}$ , for example.
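The one-call-per-entity scheme above can be sketched as follows. The helper name and the crude regex-based entity extraction are ours; the prompt wording loosely follows Table 7:

```python
import re

def candidate_calls(l1, l2, predicates):
    """Build one fill-in-the-blank generation prompt per entity
    appearing in l1 or l2 (illustrative sketch, not our exact prompt)."""
    entities = []
    for lit in (l1, l2):
        # entities are the comma-separated arguments inside the parentheses
        args = re.findall(r"\(([^)]*)\)", lit)
        for arg in (args[0].split(",") if args else []):
            if arg.strip() not in entities:
                entities.append(arg.strip())
    return [
        f"Fill in the blank with a known predicate: {l1} and {l2} "
        f"implies ___({e}). Known predicates are: {', '.join(predicates)}."
        for e in entities
    ]

prompts = candidate_calls("drinksCoffee(Rina)", "Loves(Mary,Sam)",
                          ["drinksCoffee", "Loves", "productive"])
# one prompt each for Rina, Mary, and Sam
```

Each prompt is then sent to the LLM, yielding one candidate $L_{\text{right}}$ per entity.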
### F.3 Scoring propositions $L_{1}\wedge L_{2}\rightarrow L_{\text{right}}$ for commonsense ( llm_commonsense_score )
This procedure uses the LLM to score how much our new proposition $L_{1}\wedge L_{2}\rightarrow L_{\text{right}}$ is likely to be commonsense. In detail, we ask the LLM whether, without any context, the rule seems contradictory (FOLIO/ProntoQA) or true (CLUTRR). We record the logits of the âYesâ and âNoâ tokens following the prompt, and we return as commonsense score $P[\text{No}]=\exp(\text{logit}_{\text{No}})/(\exp(\text{logit}_{\text{Yes}})+\exp(\text{logit}_{\text{No}}))$ , except for CLUTRR where we return $P[\text{Yes}]=\exp(\text{logit}_{\text{Yes}})/(\exp(\text{logit}_{\text{Yes}})+\exp(\text{logit}_{\text{No}}))$ since the question is inverted.
Table 9: Commonsense scoring prompt for FOLIO/ProntoQA.
| Does the following rule seem contradictory? |
| --- |
| Rule: [ $L_{1}\wedge L_{2}\rightarrow L_{\text{right}}$ ] |
| Answer: |
Table 10: Commonsense scoring prompt for CLUTRR.
| Does the following rule seem true? |
| --- |
| Rule: [ $L_{1}\wedge L_{2}\rightarrow L_{\text{right}}$ ] |
| Answer: |
### F.4 Scoring propositions $L_{1}\wedge L_{2}\rightarrow L_{\text{right}}$ for context-relevance ( llm_relevance_score )
This procedure uses the LLM to score how much our new proposition $L_{1}\wedge L_{2}\rightarrow L_{\text{right}}$ is likely to be relevant to the context. This helps eliminate propositions, like âThe sky is blueâ, that are true and commonsense but unlikely to help prove our query. In this case, we use the same prompt for all datasets. We record the logits of the tokens âYesâ and âNoâ following the text, and return $P[\text{Yes}]=\exp(\text{logit}_{\text{Yes}})/(\exp(\text{logit}_{\text{Yes}})+\exp(\text{logit}_{\text{No}}))$ as relevance score.
Table 11: Context scoring prompt for all datasets.
| Here are some facts and rules: [premises $\mathcal{P}$ and commonsense $\mathcal{C}$ ] |
| --- |
| Does the following new rule seem contextually relevant to the facts and rules? [ $L_{1}\wedge L_{2}\rightarrow L_{\text{right}}$ ] |
| Answer: |
### F.5 Manual Evaluation of Scoring Module
We conducted a human review of 50 ARGOS-proposed rules and their corresponding scores, for Llama 8B on FOLIO. Classifying with the decision rule of thresholding at 0.3 for commonsense-ness and contextual-ness separately, we found that the commonsense binary classification is correct 76% of the time and the contextualness classification is correct 91% of the time. The scores are not in general well-calibrated, but calibration is not needed for the thresholding that we do.
## Appendix G In-Depth literature review
Since 2022, when Wei et al. (2022) found that models had the capacity to solve (at the time) difficult reasoning problems simply by prompting the model to output a detailed rationale (a "chain of thought" (COT)) before making a decision, there has been significant interest in leveraging and improving LLMs' capacity for reasoning.
#### Semantic Logic Parsers
More recently, a series of methods were proposed which effectively bypass the reasoning process by prompting LLMs to translate the given input to symbolic language (parsing the logic) and then using external, programmatic solvers to solve the problem in the symbolic space. F-COT, proposed by Lyu et al. (2023), and SAT-LM, proposed by Ye et al. (2023), are two contemporaneous works which prompt the LLM to translate the given problem into its corresponding symbolic language, to be solved by the appropriate solver. Logic-LM, proposed by Pan et al. (2023) and published around the same time, includes a self-refinement step to catch and re-translate mistranslations, but this module provided only minor improvements.
The logic-parsing strategy proved extremely effective, converting reasoning tasks into the more linguistic task of translation. Since the algorithmic tools never make mistakes, if the translation is accurate then the solution will be too. These methods, however, rely on the widely unacknowledged and strong assumption that all necessary information is provided in the input. This assumption presumes a well-informed, precise, and careful end-user, and it greatly limits the applicability and generalizability of these symbolic methods. To address this, ARGOS aims specifically to provide missing information by exploring the logical space and leveraging logical tools. Xu et al. (2024) argue that using external solvers is not relevant to the LLM's actual reasoning capacity and present a method which still parses the text into symbolic language, but uses the LLM as the symbolic solver by inputting the symbolic expressions directly to the LLM. In fact, this alternative was already presented in SAT-LM, where it was shown to be weaker than directly using an external solver when problems are fully defined (which is the case in the setting chosen by both works).
#### Deductive Logical Algorithms
Given the findings discussed above that LLMs are efficient logic-parsers, the door was opened to more agentic algorithms which operate in the logic space (as opposed to the textual space). Kazemi et al. (2023) and Lee and Hwang (2024) proposed algorithms which reason backwards through the problem in logic space, using the LLM after each deduction step to choose the next deduction. These methods performed worse than the earlier parsing methods: they rely on the same completeness assumptions on the input, but additionally require the LLM to greedily trace a reasoning path, opening another avenue for errors. This step is unnecessary, since logic tools can search the solution space exhaustively at very low cost.
#### Clause-generative Algorithms (Abduction)
In response to the literature's inability to address cases where the completeness assumption fails, some works explicitly aim to generate logical clauses in the textual space. The very similar algorithms proposed by Liu et al. (2024) and Wang et al. (2022) first translate the text to logic and then produce new clauses which are deducible via classical logic from the logical translation of the input. By translating the newly deduced logic back to text, the textual representation becomes linguistically richer (although no new information is added to the underlying logic), and COT is then used to solve the augmented problem in text space. In the logical sense, this method is not truly abductive, as no new logical information is produced.
Instead of producing clauses via deduction on the input, Toroghi et al. (2024) proposed a method which leverages LLMsâ knowledge of commonsense to supply missing clauses during the reasoning process. Given the logical translation of the input text, the space of all 2-variable clauses which are possible to construct using the variables given in the question is explored exhaustively, with each rule being given a probability of being true by the LLM. Reasoning paths are formed with the various rules.
This method suffers from a rigidity regarding the search space: the exhaustive search forces the space to be restricted to 2-variable rules constructible from the input. Often the missing information in a logic problem involves variables which are not named in the problem and so are unseen in the abduction input. In other cases, the necessary rules might have a different arity. Such cases appear even in commonly used logic-reasoning evaluation datasets. For example, CLUTRR (Sinha et al., 2019), a dataset of reasoning problems about family relationships, requires 3-variable rules of the form $mom(A,B)\land sister(B,C)\rightarrow mom(A,C)$. The algorithm is by construction incapable of producing such rules. The evaluation procedures presented by Toroghi et al. (2024) use simple and rigidly structured data in which only 2-variable rules over existing variables are required.
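To make this limitation concrete, the following sketch (with hypothetical predicate names and an illustrative string syntax for rules, not taken from any of the cited implementations) enumerates the exhaustive 2-variable rule space that such a method searches, and checks that the 3-variable CLUTRR rule above can never appear in it:

```python
from itertools import product

# Predicates observed in a hypothetical CLUTRR question.
predicates = ["mom", "sister"]

def two_variable_rules(preds):
    """Enumerate all rules of the form p(A,B) -> q(.,.) over two variables."""
    for p, q in product(preds, repeat=2):
        yield f"{p}(A,B) -> {q}(A,B)"
        yield f"{p}(A,B) -> {q}(B,A)"

rules = set(two_variable_rules(predicates))

# The rule CLUTRR actually needs mentions a third variable C, so no
# enumeration restricted to two variables can ever produce it.
needed = "mom(A,B) & sister(B,C) -> mom(A,C)"
assert needed not in rules
```

The search space stays small (here, 8 rules) precisely because it excludes the third variable that the target rule requires.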
Since the algorithm output is dependent on the logical proof, if the necessary clauses cannot be found then a full proof will never be found, and the algorithm is forced to provide either no output or a guess. In contrast to these works, ARGOS aims to introduce logical clauses which provide new information about potentially unseen variables and which can be of varying sizes. In order to accommodate the very large search space, we introduce several innovative methods to dynamically select sub-spaces and to efficiently search them. Our proposed method aims to build its solution from both the language-reasoning space and the logic space in order to leverage the exactness of logical tools while remaining robust to failures to find the necessary logical clauses.
#### Generalist Reasoners
Despite much research, Chain-of-Thought (COT), proposed by Wei et al. (2022), remains one of the most generally applicable and robust reasoning approaches (Plaat et al., 2024). Thus, much research has been done on how to augment the COT process (the previously discussed work by Liu et al. (2024) could be viewed as part of this category). While the focus of our paper is logical algorithms, some generalist methods cannot be ignored. A simple but highly effective approach known as self-consistency (SC), proposed by Wang et al. (2023), prompts via COT several times and takes the mode of the outputs as the prediction. This adds robustness by smoothing outlier errors that might be present in a few chains, and is easily scalable to a computational budget by choosing the appropriate number of samples. However, the marginal return decreases as the number of sampled chains increases, as shown in the original paper (Wang et al., 2023). COT's and SC's poor performance in proof planning (Saparov and He, 2023) motivated Tree of Thoughts (TOT), proposed by Yao et al. (2023), in which a reasoning tree is hand-crafted for a problem and the LLM is prompted to explore this tree for likely paths. TOT demands a specific tree topology and exploration design for each task, and so is more a general framework than an explicit method applicable to logical reasoning. The tree structures of logical reasoning problems, even within the same dataset, are highly varied, so TOT is poorly suited to logical reasoning settings. Many works go beyond TOT in exploring potential chains; most recently, Xue et al. (2024) proposed a recursive method which also conducts a tree-like search but allows for dynamically structured trees.
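The self-consistency procedure described above is simple to state. A minimal sketch, where `sample_chain` is a hypothetical callable standing in for one chain-of-thought call that returns a final answer string:

```python
from collections import Counter

def self_consistency(prompt, sample_chain, n_samples=10):
    """Run several independent chain-of-thought samples and return the
    modal final answer. `sample_chain` is assumed to return one answer
    string per call (e.g. "True" or "False")."""
    answers = [sample_chain(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```

Scaling the computational budget then amounts to choosing `n_samples`, with the diminishing marginal returns noted above.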
## Appendix H Used Datasets
The following is a description of the datasets which we have used in our experiments.
#### ProntoQA (Saparov and He, 2023)
ProntoQA is a dataset which provides three types of groundings over the same symbolic structure: a chain of single-variable modus ponens steps. One of these groundings is hand-crafted to be true according to commonsense. The dataset is, at this point, fairly easy for language models to handle. However, it remains a common dataset in logical reasoning, and offers the convenience that simply removing randomly chosen inference rules from a problem yields abduction cases, since all rules are commonsense. There are 59 problems in the dataset.
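This rule-removal construction can be sketched as follows (function and variable names are our own illustration, not from the ProntoQA release):

```python
import random

def make_abductive(rules, facts, rng=random):
    """Drop one randomly chosen inference rule from a problem, producing
    an abduction case: the reasoner must recover the removed commonsense
    rule to complete the proof. Returns (kept_rules, facts, removed_rule)."""
    i = rng.randrange(len(rules))
    kept = rules[:i] + rules[i + 1:]
    return kept, facts, rules[i]
```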
#### CLUTRR (Sinha et al., 2019)
CLUTRR is a dataset of family-relational reasoning problems in which family relations (e.g. "Sam is the mother of John") are given in the form of simple stories (e.g. "John went with his mom Sam to the mall"). Traditionally, the task in CLUTRR is to deduce the relationship between two people, given the context. To structure the problem as a true/false classical logic problem, however, we restructure the task as determining whether a given relationship between two people holds. Practically, we construct the labels by taking the ground-truth relationship as the query 50% of the time (label "True"), and a random other relationship between the two people the other 50% of the time (label "False"), in order to balance the dataset. While the task is naturally abductive, in that the input contexts include no abstract or even grounded relational inference rules (e.g. "if A is the mother of B then B is the child of A"), most symbolic methods rely on a practitioner to hand-craft a knowledge base of relational rules which is appended to each problem in the dataset. Simply by forbidding this provision, the task becomes truly abductive for the reasoning model. There are 1000 problems in the dataset.
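A sketch of this label-balancing construction (function signature and field names are our own illustration, not from the CLUTRR release):

```python
import random

def to_true_false(context, person_a, person_b, gold_relation,
                  candidate_relations, rng=random):
    """Convert a CLUTRR instance to a balanced True/False query.
    Half the time the gold relation is queried (label "True"); otherwise
    a different relation between the same pair is queried (label "False")."""
    if rng.random() < 0.5:
        return context, (person_a, gold_relation, person_b), "True"
    distractor = rng.choice(
        [r for r in candidate_relations if r != gold_relation])
    return context, (person_a, distractor, person_b), "False"
```

Drawing the label first and then the distractor keeps the two classes balanced in expectation regardless of how many candidate relations exist for a pair.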
#### FOLIO (Han et al., 2024)
FOLIO is a dataset of logic problems hand-crafted by expert annotators. The dataset is the most diverse of the three we use, both linguistically and structurally. While not perfectly commonsensical, it is generally grounded in commonsense, simply because the annotators were humans who exist in and tend to operate within real-world contexts. Given this pseudo-commonsensicality, randomly removing rules to introduce commonsense abduction is not an option, as non-commonsense rules may be removed from the context. Thus, we engaged human annotators to replace randomly selected phrases in problems with semantically equivalent expressions, such that each replacement requires at least one new rule indicating that the replaced phrase is implied by its replacement; for example, "NBA Player" $\rightarrow$ "Pro Basketball Player". Annotators were instructed to reject problems where no replacement could easily be found, and some annotations failed to affect the problem due to either weak replacements or non-interaction with the problem's solve-path. These problems were discarded, leaving an abduction variant of FOLIO with 108 True-or-False problems.
#### ESNLI (Camburu et al., 2018)
ESNLI is a dataset of premise-conclusion pairs from the human-explanation NLI literature, in which a machine is asked to explain how the premise entails or contradicts the conclusion. We adapt this dataset by inverting the task: given the premise and conclusion, the machine must determine whether the conclusion is entailed (True) or contradicted (False). The task naturally requires abduction, as we leave out the human explanations with which the pairs are annotated. On inspection, the dataset is mostly commonsensical.
#### CosmosQA (Huang et al., 2019)
CosmosQA is an MCQA dataset designed to test machine reading comprehension. We adapt it to our setting by constructing "False" problems from questions whose correct answer is "None of the Above", randomly selecting one of the other answer choices as the query; for "True" problems, the correct answer choice itself is the query. The dataset is mostly commonsensical.
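A sketch of this MCQA-to-True/False adaptation (names and signature are our own illustration, not from the CosmosQA release):

```python
import random

def mcqa_to_true_false(choices, answer_idx,
                       none_label="None of the Above", rng=random):
    """Convert an MCQA item to a (query, label) pair. When the gold
    answer is "None of the Above", a random wrong choice becomes a
    "False" query; otherwise the gold answer itself is a "True" query."""
    if choices[answer_idx] == none_label:
        wrong = [c for c in choices if c != none_label]
        return rng.choice(wrong), "False"
    return choices[answer_idx], "True"
```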
#### QUAIL (Rogers et al., 2020)
QUAIL is an MCQA dataset also designed to test machine reading comprehension. It is derived primarily by scraping online wiki and forum pages, and so often contains artifacts such as timestamps or CSS formatting quirks. We adapt it from MCQA to True/False in an identical fashion to CosmosQA. Because of its provenance, questions are often extremely vague; the larger context of the webpage from which a problem comes is not included but is often key to answering it. When solving QUAIL problems manually, the intuitive approach is often to begin by inferring the original context of the scraped page, making the dataset valuable for abductive settings and for robust evaluation of commonsense flexibility.
## Appendix I Unused Datasets
While there are many logical-reasoning-related datasets available, many are unsuitable because they are either not truly logically-structured or are not commonsensical. Here, we will list some of the most commonly used datasets for logic-adjacent applications and explain their weaknesses/unsuitability for our setting.
#### LogiQA
While LogiQA (Liu et al., 2021) is generally commonsensical and logically themed, its questions do not pose an immediate logical problem. Many are in fact meta-logical, asking questions about the underlying logic of the text, for example: "Which of the following makes the same logical mistake as above". These questions could indeed be re-framed formal-logically, but this would require far more logical aptitude than language models currently possess, and hand-translating LogiQA questions into logic problems is too time-consuming.
#### RECLOR
RECLOR (Yu et al., 2020) suffers the same weakness as LogiQA. Questions are meta-logical or ask for subjective judgments about some described commonsense logic. Again, this type of question both fails to evaluate true logical reasoning in real-world contexts and proves problematic for careful evaluation, given that different questions ask different things of the reasoner.
#### Soft Reasoner
The Soft Reasoner dataset (Clark et al., 2021) is strictly logical, but is plainly non-commonsensical by construction. The logical problems are constructed without consideration of commonsense or real-world contexts. The dataset is constructed by building clauses from variables/predicates which are randomly selected from a hand-selected bag of words.
#### LogicNLI
LogicNLI (Tian et al., 2021) suffers from the Soft Reasoner dataset's weakness to an even greater degree: not only are the problems randomly generated and non-commonsensical, they often do not even comply with grammar. For example, phrases such as "Quinlan does not entire" appear frequently. While this may suit the authors' aim of producing arbitrary text as a stand-in for symbolic logic, it is not amenable to the evaluation of real-world logical reasoning in human contexts.
#### ProofWriter
ProofWriter (Tafjord et al., 2021) suffers from the weakness shared by all semi-random auto-generated datasets: a lack of commonsense grounding. By sampling clauses randomly from bags of predicates, no guarantee can be made about the realism of the data. Examination of the dataset shows that the "facts" described in problem contexts range from unlikely to nonsensical; the very first problem includes the rule "All red things are rough", which is obviously untrue in the real world, as there are red things which are not rough. Of course, building a commonsensical dataset was not the creators' aim, so this is of little surprise.
## Appendix J Confusing Problems for ARGOS
Here, we give some examples of problems which confused ARGOS. Note that in all of the problems provided, the given premises are at least somewhat contradictory with commonsense, breaking the setting.
Table 12: Confusing since Fido is typically a dog's name (and not a cat's). If we ask the LLM, with no context, whether Fido is a dog, it will say yes.
| All tigers are cats. No cats are dogs. All Bengal tigers are tigers. All huskies are dogs. Fido is either a Bengal tiger or a cat. |
| --- |
| True or False: Fido is a husky animal. |
Table 13: Confusing since Detroit City is probably not a horse according to commonsense.
| Detroit City is a horse. Some horses are racehorses. If a horse falls in a race, it poses risks to its rider. Detroit City fell in a race. A horse is a racehorse if it is in a race. |
| --- |
| True or False: Detroit City has been in multiple races. |
Table 14: Confusing because (1) edible can refer to beverages when the contextual distinction is between safe and unsafe for consumption, but not when the distinction is between eating and drinking; (2) Coke is not apple juice.
| All drinks on the counter are edible. All juices on the counter are drinks. Orange juice is a type of juice. Everything on the counter is either orange juice or apple juice. All apple juices on the counter are sweet. The coke is on the counter and if the coke is apple juice, then the coke is a drink. If the coke is not apple juice, then the coke is not edible. |
| --- |
| True or False: The coke is edible and sweet. |
## Appendix K Solvability Progression Illustration
In this section we provide figures for CLUTRR, CosmosQA and QUAIL, illustrating how ARGOS's SC-solvability measure progresses per question over a 100-question slice of each dataset. Positive % values indicate a positive classification, and negative values a negative classification.
<details>
<summary>x7.png Details</summary>

### Visual Description
Heatmap titled "Confidence progression". Y-axis: Question ID; X-axis: number of iterations (an initial "SC" column, then iterations 1-10). Cell color encodes each question's confidence from +100% down to -100%, with an additional "terminated" color for questions that have exited. Most questions show mixed confidence in the early iterations and reach the terminated state by roughly iteration 4; a few rows remain active with varying confidence into later iterations.
</details>
Figure 8: CosmosQA: This dataset, being mostly solved by vanilla SC, sees little fluctuation and exits from ARGOS early. With that said, we still see a flip from low-confidence negative to high-confidence positive.
<details>
<summary>x8.png Details</summary>

### Visual Description
Heatmap titled "Confidence progression". Y-axis: Question ID; X-axis: number of iterations ("SC", then 1-10). Cell color encodes each question's confidence from +100% down to -100%, with an additional "terminated" color. Confidence fluctuates considerably across the early iterations (SC through roughly 4), with many rows flipping sign, before most questions reach the terminated state by iterations 5-10.
</details>
Figure 9: CLUTRR: We see a significant amount of confidence fluctuation and flipping, indicating that meaningful elements within the logic of the problem are being modified by ARGOS in order to affect the answer. This is not surprising, since our construction of generated propositions as taking purely backbone variables as antecedents ensures that added ARGOS propositions will be effectual.
<details>
<summary>x9.png Details</summary>

### Visual Description
Heatmap titled "Confidence progression". Y-axis: Question ID; X-axis: number of iterations (0-10). Cell color encodes each question's confidence from +100% down to -100%, with an additional "terminated" color. Most rows begin at high positive confidence, fluctuate briefly over the first few iterations, and are uniformly terminated from roughly iteration 5 onward.
</details>
Figure 10: QUAIL: despite ARGOS's strong performance on QUAIL relative to SC, we find that few ARGOS iterations are in fact necessary. While QUAIL is made complicated for language models by its irregular form and often disjoint nature, its logical structure (while very ambiguous) is simple, meaning that QUAIL problems can be solved with only a small number of well-chosen propositions.