arXiv:2402.10767
# Inference to the Best Explanation in Large Language Models
**Contact**: d.dalal1@universityofgalway.ie
Abstract
While Large Language Models (LLMs) have found success in real-world applications, their underlying explanatory process is still poorly understood. This paper proposes IBE-Eval, a framework inspired by philosophical accounts on Inference to the Best Explanation (IBE) to advance the interpretation and evaluation of LLM explanations. IBE-Eval estimates the plausibility of natural language explanations through a combination of explicit logical and linguistic features including: consistency, parsimony, coherence, and uncertainty. Extensive experiments are conducted on Causal Question Answering (CQA), where IBE-Eval is tasked to select the most plausible causal explanation amongst competing ones generated by the LLM (e.g. GPT 3.5 or LLaMA 2). The experiments reveal that IBE-Eval can successfully identify the best explanation with up to 77% accuracy ( $\approx 27\%$ above random), improving upon a GPT 3.5-as-a-judge baseline ( $\approx {+}17\%$ ) while being intrinsically more efficient and interpretable. Additional analysis suggests that, despite LLM-specific variances, generated explanations tend to conform to IBE criteria and that IBE-Eval is significantly correlated with human judgment, opening up opportunities for future development of automated explanation verification tools.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Causal Reasoning Diagram: Balloon Expansion
### Overview
The image presents a diagram illustrating a causal reasoning process, specifically focusing on determining the cause of a balloon's expansion. It outlines competing hypotheses, explanations generated by a Large Language Model (LLM), and an Inference to the Best Explanation (IBE) framework to evaluate these explanations.
### Components/Axes
The diagram contains the following main components:
1. **Causal Question:** Poses the problem: "The balloon expanded. What was the cause? A) I blew into it. B) I pricked it."
2. **Competing Hypotheses:** Presents two premises:
* Premise 1: "I blew into the balloon. Conclusion: The balloon expanded." (highlighted in green)
* Premise 2: "I pricked the balloon. Conclusion: The balloon expanded." (highlighted in red)
3. **Explanation Prompt:** Provides instructions for generating explanations for each option, treating each option as a premise and generating a step-by-step logical proof.
4. **LLM:** A central node representing the Large Language Model that processes the hypotheses and generates explanations.
5. **Competing Explanations:** Presents two explanations generated by the LLM:
* Explanation 1 (E1): Based on blowing into the balloon.
* Explanation 2 (E2): Based on pricking the balloon.
6. **Inference to the Best Explanation (IBE):** Evaluates the explanations based on selection criteria.
### Detailed Analysis or Content Details
**1. Causal Question and Competing Hypotheses:**
* The causal question sets up the problem: "The balloon expanded. What was the cause?" with two possible answers: "A) I blew into it" and "B) I pricked it."
* The competing hypotheses frame these answers as premises leading to the conclusion that "The balloon expanded."
**2. Explanation Prompt:**
* The prompt instructs the system to identify the most plausible cause and generate step-by-step explanations for each option.
* It emphasizes the use of IF-THEN rules and commonsense assumptions.
**3. LLM:**
* The LLM acts as the central processing unit, taking the hypotheses and prompt as input and generating explanations.
**4. Competing Explanations:**
* **Explanation 1 (E1):**
* Step 1: "IF someone blows into a balloon, THEN it can cause the balloon to inflate."
* Assumption: "Blowing air into a balloon increases the amount of air inside it, leading to inflation."
* Step 2: "IF the balloon inflates, THEN it can cause the balloon to expand."
* Assumption: "When a balloon inflates, it stretches and expands in size."
* Conclusion: "Therefore, since I blew into the balloon, it caused the balloon to inflate, which resulted in its expansion."
* **Explanation 2 (E2):**
* Step 1: "IF a balloon is pricked, THEN the balloon may deflate."
* Assumption: "Pricking a balloon can cause it to lose air and deflate."
* Step 2: "IF a balloon deflates, THEN there is a decrease in air pressure inside the balloon."
* Assumption: "When a balloon deflates, there is a decrease in air pressure inside the balloon."
* Step 3: "IF there is a decrease in air pressure inside the balloon, THEN the external air pressure will cause the balloon to expand."
* Assumption: "When the air pressure inside a balloon is lower than the external air pressure, the balloon will expand."
* Conclusion: "Therefore, since the balloon was pricked, it may have deflated, resulting in a decrease in air pressure inside the balloon, causing the external air pressure to make the balloon expand."
**5. Inference to the Best Explanation (IBE):**
* **Explanation 1 (E1) Evaluation:**
* Consistency: 1.0
* Parsimony: -2.0
* Coherence: 0.51
* Uncertainty: 2.0
* Result: E1 is accepted (green checkmark)
* **Explanation 2 (E2) Evaluation:**
* Consistency: 1.0
* Parsimony: -3.0
* Coherence: 0.28
* Uncertainty: 3.0
* Result: E2 is rejected (red X)
### Key Observations
* The diagram illustrates a structured approach to causal reasoning, using an LLM to generate explanations and an IBE framework to evaluate them.
* The selection criteria (Consistency, Parsimony, Coherence, Uncertainty) are used to quantify the quality of each explanation.
* Explanation 1 (blowing into the balloon) is favored over Explanation 2 (pricking the balloon) based on the IBE evaluation.
### Interpretation
The diagram demonstrates a process for automated causal reasoning. The LLM generates explanations based on provided premises, and the IBE framework provides a quantitative method for selecting the best explanation. The specific values assigned to the selection criteria (Consistency, Parsimony, Coherence, Uncertainty) reflect the relative strengths and weaknesses of each explanation. In this case, the "blowing into the balloon" explanation is deemed more plausible than the "pricking the balloon" explanation, suggesting that the LLM and IBE framework favor the former as the cause of the balloon's expansion. The negative parsimony values indicate a penalty for complexity in the explanations. The lower coherence and higher uncertainty for Explanation 2 contribute to its rejection.
</details>
Figure 1: IBE-Eval qualifies LLM-generated explanations with a set of logical and linguistic selection criteria to identify the most plausible hypothesis. The corresponding explanation for each hypothesis is evaluated across the IBE criteria of logical consistency, parsimony, internal coherence, and linguistic uncertainty. A final plausibility score is computed across those features, and the hypothesis with the highest score is identified as the best explanation.
1 Introduction
Large Language Models (LLMs) such as OpenAI's GPT Brown et al. (2020) and LLaMA Touvron et al. (2023) have been highly effective across a diverse range of language understanding and reasoning tasks Liang et al. (2023). While LLM performances have been thoroughly investigated across various benchmarks Wang et al. (2019); Srivastava et al. (2023); Gao et al. (2023); Touvron et al. (2023), the principles and properties behind their step-wise reasoning process are still poorly understood Valentino et al. (2021). LLMs are notoriously black-box and can be difficult to interpret Chakraborty et al. (2017); Danilevsky et al. (2020). Moreover, the commercialization of LLMs has led to strategic secrecy around model architectures and training details Xiang (2023); Knight (2023). Finally, LLMs are susceptible to hallucinations and adversarial perturbations Geirhos et al. (2020); Camburu et al. (2020), often producing plausible but factually incorrect answers Ji et al. (2023); Huang et al. (2023). As the size and complexity of LLM architectures increase, the systematic study of generated explanations becomes crucial to better interpret and validate the LLM's internal inference and reasoning processes Wei et al. (2022b); Lampinen et al. (2022); Huang and Chang (2022).
The automatic evaluation of natural language explanations presents several challenges Atanasova et al. (2023); Camburu et al. (2020). Without resource-intensive annotation Wiegreffe and Marasovic (2021); Thayaparan et al. (2020); Dalvi et al. (2021); Camburu et al. (2018), explanation quality methods tend to rely on either weak supervision, where the identification of the correct answer is taken as evidence of explanation quality, or require the injection of domain-specific knowledge Quan et al. (2024). In this paper, we seek to better understand the LLM explanatory process through the investigation of explicit linguistic and logical properties. While explanations are hard to formalize due to their open-ended nature, we hypothesize that they can be analyzed as linguistic objects, with measurable features that can serve to define criteria for assessing their quality.
Specifically, this paper investigates the following overarching research question: "Can the linguistic and logical properties associated with LLM-generated explanations be used to qualify the models' reasoning process?". To this end, we propose an interpretable framework inspired by philosophical accounts of abductive inference, also known as Inference to the Best Explanation (IBE) - i.e. the process of selecting among competing explanatory theories Lipton (2017). In particular, we aim to measure the extent to which LLM-generated explanations conform to IBE expectations when attempting to identify the most plausible explanation. Concretely, we present IBE-Eval, a framework designed to estimate the plausibility of natural language explanations through a set of explicit logical and linguistic features, namely: logical consistency, parsimony, coherence, and linguistic uncertainty.
To evaluate the efficacy of IBE-Eval, we conduct extensive experiments in the multiple-choice Causal Question Answering (CQA) setting. The overall results and contributions of the paper can be summarized as follows:
1. To the best of our knowledge, we are the first to propose an interpretable framework inspired by philosophical accounts on Inference to the Best Explanation (IBE) to automatically assess the quality of natural language explanations.
2. We propose IBE-Eval, a framework that can be instantiated with external tools for the automatic evaluation of LLM-generated explanations and the identification of the best explanation in a multiple-choice CQA setting.
3. We provide empirical evidence that LLM-generated explanations tend to conform to IBE expectations, with varying levels of statistical significance correlated to the LLM's size.
4. We additionally find that uncertainty, parsimony, and coherence are the best predictors of plausibility and explanation quality across all LLMs. However, we also find that the LLMs tend to be strong rationalizers and can produce logically consistent explanations even for less plausible candidates, making the consistency metric less effective in practice.
5. IBE-Eval can successfully identify the best explanation supporting the correct answers with up to 77% accuracy ( $\approx 27\%$ above random and $\approx 17\%$ over the GPT 3.5-as-a-Judge baseline).
6. IBE-Eval is significantly correlated with human judgment, outperforming a GPT 3.5-as-a-Judge baseline in terms of alignment with human preferences.
For reproducibility, our code is made available on GitHub (https://github.com/dhairyadalal/IBE-eval) to encourage future research in the field.
2 Inference to the Best Explanation (IBE)
Explanatory reasoning is a distinctive feature of human rationality underpinning problem-solving and knowledge creation in both science and everyday scenarios Lombrozo (2012); Deutsch (2011). Accepted epistemological accounts characterize the creation of an explanation as composed of two distinct phases: conjecturing and criticism Popper (2014). The explanatory process always involves a conflict between plausible explanations, which is typically resolved through the criticism phase via a selection process, where competing explanations are assessed according to a set of criteria such as parsimony, coherence, unification power, and hardness to variation Lipton (2017); Harman (1965); Mackonis (2013); Thagard (1978, 1989); Kitcher (1989); Valentino and Freitas (2022).
As LLMs become interfaces for natural language explanations, epistemological frameworks offer an opportunity for developing criticism mechanisms to understand the explanatory process underlying state-of-the-art models. To this end, this paper considers an LLM as a conjecture device producing linguistic objects that can be subject to criticism. In particular, we focus on a subset of criteria that can be computed on explicit linguistic and logical features, namely: consistency, parsimony, coherence, and uncertainty.
To assess the LLM's alignment to such criteria, we focus on the task of selecting among competing explanations in a multiple-choice CQA setting (Figure 1). Specifically, given a set of competing hypotheses (i.e. the multiple-choice options), $H=\{h_{1},h_{2},...,h_{n}\}$ , we prompt the LLM to generate plausible explanations supporting each hypothesis (Section 3). Subsequently, we adopt the proposed IBE selection criteria to assess the quality of the generated explanations (Section 4). IBE-Eval computes an explanation plausibility score derived from the linear combination of the computed selection criteria. The explanation with the highest score is selected as the predicted answer; we additionally assess the extent to which the observable IBE features correlate with QA accuracy. We hypothesize that IBE-Eval will produce higher scores for the explanation associated with the correct answer and that the IBE criteria should meaningfully differentiate between competing explanations.
3 Explanation Generation
For the first stage, the LLM is prompted to generate competing explanations for the hypotheses using a modified Chain-of-Thought (CoT) prompt Wei et al. (2022a). Specifically, the CoT prompt is modified to instruct the LLM to produce an explanation for each competing hypothesis (see Figure 1). We adopt a methodology similar to Valentino et al. (2021), where the generated explanation is constrained into an entailment form for the downstream IBE evaluation. In particular, we posit that a valid explanation should demonstrate an entailment relationship between the premise and conclusion, which are derived from the question-answer pair.
To elicit logical connections between explanation steps and facilitate subsequent analysis, the LLM is constrained to use weak syllogisms expressed as If-Then statements. Additionally, the LLM is instructed to produce the associated causal or commonsense assumption underlying each explanation step. This output is then post-processed to extract the explanation steps and supporting knowledge for evaluation via the IBE selection criteria. Additional details and examples of prompts are reported in Appendix A.2.
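As a concrete illustration, the post-processing step can be sketched as a lightweight parser over the constrained output format. This is only a sketch: the exact output format and tooling are described in Appendix A.2, and the regular expressions below assume the Figure 1 style of output (`IF ..., THEN ...` steps, each followed by an `Assumption:` line).

```python
import re

def parse_explanation(text: str):
    """Split a generated explanation into If-Then steps and their assumptions.

    Assumes the (hypothetical) output format illustrated in Figure 1:
        Step 1: IF <antecedent>, THEN <consequent>.
        Assumption: <supporting commonsense statement>.
    """
    # Capture the antecedent and consequent of each If-Then step.
    steps = re.findall(r"\bIF\b\s+(.+?),?\s*\bTHEN\b\s+(.+?)(?:\.|\n|$)", text)
    # Capture the assumption attached to each step (one per line).
    assumptions = re.findall(r"Assumption:\s*(.+)", text)
    return steps, assumptions
```

The extracted `(if, then)` pairs and assumptions then feed the consistency, coherence, and uncertainty metrics described in the next section.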
4 Linguistic & Inference Criteria
To perform IBE, we investigate a set of criteria that can be automatically computed on explicit logical and linguistic features, namely: consistency, parsimony, coherence, and uncertainty.
Consistency.
Consistency aims to verify whether the explanation is logically valid. Given a hypothesis, comprised of a premise $p_{i}$ , a conclusion $c_{i}$ , and an explanation consisting of a set of If-Then statements $E=\{s_{1},\ldots,s_{n}\}$ , we define $E$ to be logically consistent if $p_{i}\cup E\vDash c_{i}$ . Specifically, an explanation is logically consistent if it is possible to build a deductive proof linking premise and conclusion.
To evaluate logical consistency, we leverage external symbolic solvers along with autoformalization - i.e., the translation of natural language into a formal language Wu et al. (2022). Specifically, the hypotheses and explanations are formalized into a Prolog program which will attempt to generate a deductive proof via backward chaining Weber et al. (2019).
To perform autoformalization, we leverage the translation capabilities of GPT 3.5. Specifically, we instruct GPT 3.5 to convert each If-Then explanation step from the generated explanation into an implication rule and the premise statement into grounding atoms. In turn, the entailment condition and the conclusion are used to create a Prolog query. The query instructs the Prolog solver to attempt to find a path through the implication rules such that the conclusion can be connected to the premise. Further details about the autoformalization process can be found in Appendix A.3.
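To make the shape of the mapping concrete, a single If-Then step can be rendered as a Prolog implication rule roughly as follows. This is a simplified sketch only: the paper delegates the actual translation to GPT 3.5 (Appendix A.3), and the atom-naming scheme here is purely illustrative.

```python
def clause_to_atom(clause: str) -> str:
    """Naively normalise a clause into a Prolog-style atom name:
    lowercase, strip punctuation, join words with underscores."""
    words = "".join(ch for ch in clause.lower()
                    if ch.isalnum() or ch.isspace()).split()
    return "_".join(words)

def step_to_prolog_rule(if_clause: str, then_clause: str) -> str:
    """Render one If-Then explanation step as a Prolog rule `head :- body.`
    (the Then clause becomes the head, the If clause the body)."""
    return f"{clause_to_atom(then_clause)} :- {clause_to_atom(if_clause)}."
```

For example, `step_to_prolog_rule("someone blows into a balloon", "the balloon inflates")` yields the rule `the_balloon_inflates :- someone_blows_into_a_balloon.`.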
After autoformalization, following recent work on neuro-symbolic integration for LLM explanations Quan et al. (2024), we adopt an external Prolog solver for entailment verification https://github.com/neuro-symbolic-ai/explanation_based_ethical_reasoning. The explanation is considered consistent if the Prolog solver can satisfy the query and successfully build a deductive proof. Technical details can be found in Appendix A.5.
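The consistency check itself amounts to backward chaining from the conclusion toward the premise. A minimal propositional sketch of this idea, standing in for the external Prolog solver (which additionally handles variables and unification), might look like:

```python
def entails(premise_atoms: set, rules: list, goal: str, _seen=None) -> bool:
    """Backward chaining over propositional rules `(head, body)`.

    Returns True if `goal` is derivable from the premise atoms, i.e. the
    explanation is judged logically consistent (p ∪ E ⊨ c). The `_seen`
    set is a simple cycle guard; a real solver is more sophisticated.
    """
    if _seen is None:
        _seen = set()
    if goal in premise_atoms:          # goal grounded directly in the premise
        return True
    if goal in _seen:                  # avoid infinite regress on cyclic rules
        return False
    _seen.add(goal)
    for head, body in rules:
        if head == goal and entails(premise_atoms, rules, body, _seen):
            return True
    return False
```

With the Figure 1 example, the rules `balloon_inflates :- blew_into_balloon` and `balloon_expands :- balloon_inflates` let the solver connect the conclusion `balloon_expands` back to the premise.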
Parsimony.
The parsimony principle, also known as Ockham's razor, favors the selection of the simplest explanation consisting of the fewest elements and assumptions Sober (1981). Epistemological accounts posit that an explanation with fewer assumptions tends to leave fewer statements unexplained, improving specificity and alleviating the infinite regress Thagard (1978). Further, parsimony is an essential feature of causal interpretability, as only parsimonious solutions are guaranteed to reflect causation in comparative analysis Baumgartner (2015). In this paper, we adopt two metrics as proxies of parsimony, namely proof depth and concept drift. Proof depth, denoted as $Depth$ , is defined as the cardinality of the set of rules, $R$ , required by the Prolog solver to connect the conclusion to the premise via backward chaining. Let $h$ be a hypothesis candidate composed of a premise $p$ and a conclusion $c$ , and let $E$ be a formalized explanation represented as a set of rules $R^{\prime}$ . The proof depth is the number of rules $|R|$ , with $R\subseteq R^{\prime}$ , traversed during backward chaining to connect the conclusion $c$ to the premise $p$ :
$$
Depth(h)=|R|
$$
Concept drift, denoted as $Drift$ , is defined as the number of additional concepts and entities, outside the ones appearing in the hypothesis (i.e., premise and conclusion), that are introduced by the LLM to support the entailment. For simplicity, we consider nouns as concepts. Let $N=\{Noun_{p},Noun_{c},Noun_{E}\}$ be the unique nouns found in the premise, conclusion, and explanation steps. Concept drift is the cardinality of the set difference between the nouns found in the explanation and the nouns in the hypothesis:
$$
Drift(h)=|Noun_{E}\setminus(Noun_{p}\cup Noun_{c})|
$$
Intuitively, the parsimony principle would predict the most plausible hypothesis as the one supported by an explanation with the smallest observed proof depth and concept drift. Implementation details can be found in Appendix A.6.
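Proof depth is read directly off the Prolog proof as the number of rules traversed. Concept drift, assuming the nouns have already been extracted (in practice via a POS tagger; see Appendix A.6), reduces to a set difference, sketched below:

```python
def concept_drift(nouns_premise, nouns_conclusion, nouns_explanation) -> int:
    """Drift(h) = |Noun_E \\ (Noun_p ∪ Noun_c)|: the number of concepts the
    explanation introduces beyond those already in the hypothesis.
    Inputs are collections of noun strings, assumed pre-extracted."""
    hypothesis_nouns = set(nouns_premise) | set(nouns_conclusion)
    return len(set(nouns_explanation) - hypothesis_nouns)
```

For the Figure 1 example, E2 introduces extra concepts such as "air" and "pressure" that never appear in the hypothesis, so its drift (and hence its parsimony penalty) is higher than E1's.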
Coherence.
Coherence attempts to measure the logical validity at the level of the specific explanation steps. An explanation can be formally consistent on the surface while still including implausible or ungrounded intermediate assumptions. Coherence evaluates the quality of each intermediate If-Then implication by measuring the entailment strength between the If and Then clauses. To this end, we employ a fine-tuned natural language inference (NLI) model. Formally, let $S$ be a set of explanation steps, where each step $s$ consists of an If-Then statement, $s=(If_{s},Then_{s})$ . For a given step $s_{i}$ , let $ES(s_{i})$ denote the entailment score obtained via the NLI model between $If_{s}$ and $Then_{s}$ clauses. The step-wise entailment score $SWE(S)$ is then calculated as the averaged sum of the entailment scores across all explanation steps $|S|$ :
$$
\text{SWE}(S)=\frac{1}{|S|}\sum_{i=1}^{|S|}\text{ES}(s_{i})
$$
We hypothesize that explanations supporting more plausible hypotheses should receive higher coherence scores, as such explanations should exhibit stronger step-wise entailment. Additional details can be found in Appendix A.7.
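The step-wise entailment score can be sketched as follows, with the NLI model abstracted as any callable returning an entailment probability in [0, 1] (the stub used in the example is purely illustrative; the paper uses a fine-tuned NLI model, per Appendix A.7):

```python
def stepwise_entailment(steps, entail_score) -> float:
    """SWE(S): mean entailment score over a list of (If, Then) steps.

    `entail_score` is any callable (premise, hypothesis) -> [0, 1],
    e.g. the entailment probability from a fine-tuned NLI model.
    """
    scores = [entail_score(if_clause, then_clause)
              for if_clause, then_clause in steps]
    return sum(scores) / len(scores)
```

A low average indicates that at least some intermediate implications are weakly entailed, even if the explanation is formally consistent overall.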
Uncertainty.
Finally, we consider the linguistic certainty expressed in the generated explanation as a proxy for plausibility. Hedging words such as probably, might be, could be, etc., typically signal ambiguity and are often used when the truth condition of a statement is unknown or improbable. Pei and Jurgens (2021) found that the strength of scientific claims in research papers is strongly correlated with the use of direct language. In contrast, they found that the use of hedging language suggested that the veracity of the claim was weaker or highly contextualized.
To measure the linguistic uncertainty ( $UC$ ) of an explanation, we consider the explanationâs underlying assumptions ( $A_{i}$ ) and the overall explanation summary ( $S$ ). The linguistic uncertainty score is extracted using the fine-tuned sentence-level RoBERTa model from Pei and Jurgens (2021). The overall linguistic uncertainty score ( $UC_{\text{overall}}$ ) is the sum of the assumption and explanation summary scores:
$$
UC_{\text{overall}}=UC(A)+UC(S)
$$
where $UC(A)$ is the sum of the linguistic uncertainty scores $UC(a_{i})$ across all the assumptions $a_{i}\in A$ associated with the explanation steps:
$$
UC(A)=\sum_{i=1}^{|A|}UC(a_{i})
$$
and $UC(S)$ is the linguistic uncertainty of the explanation summary. We hypothesize that the LLM will use more hedging language when explaining the weaker hypothesis, resulting in a higher uncertainty score. Further details can be found in Appendix A.8.
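The aggregation itself is a simple sum, sketched below; the per-sentence scores would come from the fine-tuned sentence-level RoBERTa model of Pei and Jurgens (2021), which is abstracted away here:

```python
def overall_uncertainty(assumption_scores, summary_score) -> float:
    """UC_overall = Σ_i UC(a_i) + UC(S).

    `assumption_scores` holds one sentence-level uncertainty score per
    explanation-step assumption; `summary_score` is UC(S) for the summary.
    """
    return sum(assumption_scores) + summary_score
```

Note that, because the assumption scores are summed rather than averaged, longer explanations naturally accumulate more uncertainty, which dovetails with the parsimony criterion.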
4.1 Inference to Best Explanation
After the IBE criteria are computed for each competing hypothesis, they are used to generate the final explanation plausibility score. We define a simple linear regression model $\theta(\cdot)$ , fitted on a small set of training examples consisting of extracted IBE features, to predict the probability that an explanation $E_{i}$ corresponds to the correct answer. Specifically, we employ IBE-Eval to score each generated explanation independently and then select the final answer $a$ via argmax:
$$
a=\operatorname*{argmax}_{i}[\theta(E_{1}),\ldots,\theta(E_{n})]
$$
Additional details can be found in Appendix A.9.
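The final selection step can be sketched as a linear scorer followed by an argmax. The weights below are illustrative stand-ins for the fitted coefficients of $\theta$ (Appendix A.9), not the paper's actual values:

```python
def select_best_explanation(feature_rows, weights, bias=0.0) -> int:
    """Score each explanation's IBE feature vector with a linear model θ
    and return the index of the highest-scoring (best) explanation."""
    scores = [bias + sum(w * x for w, x in zip(weights, row))
              for row in feature_rows]
    return max(range(len(scores)), key=scores.__getitem__)

# Illustrative feature vectors from Figure 1:
# [consistency, parsimony, coherence, uncertainty]
E1 = [1.0, -2.0, 0.51, 2.0]
E2 = [1.0, -3.0, 0.28, 3.0]
# Hypothetical weights: reward consistency/parsimony/coherence,
# penalise uncertainty (signs match the regression analysis in Figure 2).
WEIGHTS = [0.5, 0.5, 1.0, -0.5]
```

With these illustrative weights, `select_best_explanation([E1, E2], WEIGHTS)` returns `0`, i.e. E1 ("I blew into the balloon"), matching the outcome shown in Figure 1.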
<details>
<summary>extracted/6246183/correlation.png Details</summary>

### Visual Description
## Heatmap: LLM Performance on COPA and E-CARE
### Overview
The image presents two heatmaps comparing the performance of three Large Language Models (LLMs) - LLaMA 2 7B, LLaMA 2 13B, and GPT 3.5 - on two tasks: COPA and E-CARE. The heatmaps display correlation values between the LLMs and different aspects of the tasks: Consistency, Depth, Coherence, Uncertainty, and Drift. The color intensity represents the strength and direction (positive or negative) of the correlation, with green indicating positive correlation and red indicating negative correlation. Significance levels are indicated by asterisks (*, **, ***).
### Components/Axes
* **Titles:** "COPA" (left heatmap), "E-CARE" (right heatmap)
* **Y-axis Label:** "LLM"
* **Y-axis Categories:** LLaMA 2 7B, LLaMA 2 13B, GPT 3.5
* **X-axis Categories:** Consistency, Depth, Coherence, Uncertainty, Drift
* **Color Scale (Corr.):**
* Green: Positive correlation, ranging up to 2.5
* White: 0 correlation
* Red: Negative correlation, ranging down to -2.5
* **Significance Levels:**
* \* : p < 0.05
* \*\* : p < 0.01
* \*\*\* : p < 0.001
### Detailed Analysis
#### COPA Heatmap
| LLM | Consistency | Depth | Coherence | Uncertainty | Drift |
|-------------|-------------|-----------|-----------|-------------|-----------|
| LLaMA 2 7B | 1.37 | -2.95\*\* | 1.22 | -3.10\*\* | -0.27 |
| LLaMA 2 13B | 1.36 | -1.28 | 3.87\*\*\* | -2.17\* | -3.33\*\*\* |
| GPT 3.5 | 4.67\*\*\* | -4.893\*\*\* | 3.60\*\*\* | -4.34\*\*\* | -3.22\*\* |
#### E-CARE Heatmap
| LLM | Consistency | Depth | Coherence | Uncertainty | Drift |
|-------------|-------------|----------|-----------|-------------|-----------|
| LLaMA 2 7B | 0.20 | -0.53 | 2.18\* | -2.11\*\* | -0.78\* |
| LLaMA 2 13B | 1.167 | -1.18 | 1.67\* | -1.52\* | -1.91\* |
| GPT 3.5 | 3.10\*\* | -2.91\*\* | 0.98 | -2.61\*\* | -5.14\*\*\* |
### Key Observations
* **COPA:** GPT 3.5 shows the strongest positive correlation with Consistency and Coherence, but also the strongest negative correlation with Depth, Uncertainty, and Drift. LLaMA 2 13B shows a strong positive correlation with Coherence.
* **E-CARE:** GPT 3.5 shows the strongest positive correlation with Consistency, but also the strongest negative correlation with Drift. All models show negative correlations with Depth, Uncertainty, and Drift.
* **Significance:** GPT 3.5 generally has more statistically significant correlations (higher number of asterisks) compared to the LLaMA models.
* **Consistency:** All models show a positive correlation with Consistency in both tasks.
* **Depth, Uncertainty, Drift:** All models show a negative correlation with Depth, Uncertainty, and Drift in both tasks.
* **Coherence:** All models show a positive correlation with Coherence in both tasks, though for GPT 3.5 on E-CARE the value (0.98) is comparatively weak and not marked as statistically significant.
### Interpretation
The heatmaps suggest that GPT 3.5 generally performs better in terms of Consistency and Coherence compared to the LLaMA models, but it also exhibits stronger negative correlations with Depth, Uncertainty, and Drift. This could indicate that while GPT 3.5 is more consistent and coherent, it might be more prone to errors or biases related to depth, uncertainty, and drift. The LLaMA models show more moderate correlations, suggesting a more balanced performance across different aspects of the tasks. The statistical significance of the correlations indicates the reliability of these observations, with GPT 3.5 generally showing more significant correlations. The negative correlations with Depth, Uncertainty, and Drift across all models suggest that these aspects of the tasks are challenging for all LLMs.
</details>
Figure 2: A regression analysis measuring the correlation between IBE criteria and question accuracy. All the LLMs tend to conform to IBE expectations, with GPT 3.5 exhibiting the most consistent and significant alignment. Linguistic uncertainty is the strongest IBE predictor for explanation quality, where higher uncertainty is negatively correlated with question accuracy. Statistical significance is noted as: '***' p < 0.001, '**' p < 0.01, '*' p < 0.05.
| Baselines | COPA (GPT 3.5) | COPA (LLaMA 2 13B) | COPA (LLaMA 2 7B) | E-CARE (GPT 3.5) | E-CARE (LLaMA 2 13B) | E-CARE (LLaMA 2 7B) |
| --- | --- | --- | --- | --- | --- | --- |
| GPT3.5 Judge | .59 | .47 | .63 | .43 | .61 | .52 |
| Human | .95 | 1.0 | .91 | .90 | .91 | .92 |
| IBE Features | | | | | | |
| Consistency | .51 | .52 | .55 | .54 | .54 | .54 |
| Depth (Parsimony) | .67 | .53 | .63 | .66 | .56 | .54 |
| Drift (Parsimony) | .67 | .63 | .58 | .66 | .57 | .57 |
| Coherence | .66 | .66 | .56 | .56 | .57 | .59 |
| Linguistic Uncertainty | .70 | .65 | .61 | .59 | .56 | .60 |
| Composed Model | | | | | | |
| Random | .50 | .50 | .50 | .50 | .50 | .50 |
| + Consistency | .51 | .52 | .55 | .54 | .54 | .54 |
| + Depth | .67 | .53 | .63 | .66 | .56 | .56 |
| + Drift | .70 | .65 | .65 | .72 | .66 | .65 |
| + Coherence | .73 | .71 | .69 | .73 | .68 | .69 |
| + Linguistic Uncertainty | .77 | .74 | .70 | .74 | .70 | .73 |
Table 1: An ablation study and evaluation of the IBE criteria and the composed IBE-Eval model. IBE-Eval outperforms the GPT 3.5 Judge baseline by an average of +17.5% across all models and tasks.
5 Experimental Setting
Causal Question-Answering (CQA) requires reasoning about the causes and effects given an event description. We specifically consider the task of cause and effect prediction in a multiple-choice setting, where given a question and two candidate answers, the LLM must decide which is the most plausible cause or effect. Causal reasoning is a challenging task as the model must both possess commonsense knowledge about causal relationships and consider the event context which would make one option more plausible than the other. For our experiments, we use the Choice of Plausible Alternatives (COPA) Gordon et al. (2012) and E-CARE Du et al. (2022) datasets.
COPA.
COPA is a multiple-choice commonsense causal QA dataset consisting of 500 train and test examples that were manually generated. Each multiple-choice example consists of a question premise and a set of answer candidates which are potential causes or effects of the premise. COPA is a well-established causal reasoning benchmark that is both a part of SuperGlue Wang et al. (2019) and the CALM-Bench Dalal et al. (2023).
E-CARE.
E-CARE is a large-scale multiple-choice causal crowd-sourced QA dataset consisting of 15K train and 2K test examples. Similar to COPA, the task requires the selection of the most likely cause or effect given an event description. We randomly sample 500 examples from the E-CARE test set for our experiments.
LLMs.
We consider GPT 3.5 Turbo, LLaMA 2 13B, and LLaMA 2 7B for all experiments. GPT 3.5 is a proprietary model Brown et al. (2020) and is highly effective across a wide range of natural language reasoning tasks Laskar et al. (2023). We additionally evaluate the open-source LLaMA 2 model Touvron et al. (2023). We consider both the 13B and 7B variants of LLaMA 2, as both are seen as viable commodity GPT alternatives and have been widely adopted by the research community for LLM benchmarking and evaluation.
Baselines.
We employ LLM-as-a-Judge Zheng et al. (2023) and human evaluators as baseline methods for the selection of the best explanation in the CQA setting. Zheng et al. (2023) found that LLMs can align with human judgment and be utilized for automated evaluation and judgment. We specifically use GPT 3.5 as the LLM judge. For each CQA example, we present the judges with two competing explanations generated by the target LLM. The judge is asked to identify the best and most plausible explanation. Additional details about the baselines can be found in Appendix A.4.
6 Preliminary Analysis
We conduct a preliminary analysis as a sanity check to measure the extent to which LLMs generate self-evident or tautological explanations - i.e., explanations that simply restate the premises and conclusions. Tautological explanations present a risk for IBE-Eval, as the metrics would be theoretically uninformative if the LLM adopts the tested causal relation as the explanation itself (e.g. A → B) without providing additional supporting statements.
We consider the parsimony metric to compute the percentage of explanations with proof depth equal to 1 (i.e., explanations containing only one inference step) and concept drift equal to 0 (i.e., no additional concepts other than the ones stated in premises and conclusions appear in the explanation). In such cases, the LLM is effectively generating a self-evident or tautological explanation.
We found that about 2% of the cases consist of self-evident explanations. For GPT 3.5, LLaMA 2 13B, and LLaMA 2 7B, 2% of the generated explanations exhibit a concept drift of 0, and on average 1.5% of the explanations have a proof depth of 1. We then conducted an error analysis to evaluate the cases where IBE-Eval selected a self-evident explanation as the best one. Across all LLMs, less than 0.1% of the errors were caused by the selection of such explanations. Our analysis suggests that the impact of self-evident explanations is not significant and that the IBE framework can be robustly applied to identify such cases.
7 Results
To assess the LLMs' alignment with the proposed IBE framework and evaluate the efficacy of IBE-Eval, we run a regression analysis and conduct a set of ablation studies to evaluate the relationship between the IBE criteria and question accuracy. The main results are presented in Figure 2 and Table 1.
Our regression analysis finds that the IBE criteria are generally consistent across the LLMs, as demonstrated by similar correlation patterns on both the COPA and E-CARE tasks (Figure 2). GPT 3.5 exhibits the strongest alignment with IBE expectations: nearly all the IBE criteria have statistically significant and directionally aligned correlations across both tasks. Thus our proposed IBE criteria can serve as promising building blocks for future work on automated explanation evaluation.
In Table 1 we evaluate the accuracy of the individual IBE criteria and of IBE-Eval in selecting the most plausible explanation in the CQA setting. We find that, although the IBE criteria taken independently are generally limited in their ability to identify the more plausible explanation, they still outperform the GPT 3.5-as-a-judge baseline. IBE-Eval, which combines all IBE criteria, improves the ability to select the best explanation by 17% over both the GPT 3.5-as-a-judge and random baselines. We achieve up to 77% accuracy using only the extracted IBE criteria, demonstrating IBE's potential value for automatic explanation evaluation.
Next, we explore each explanation feature in further detail to better understand the variances across the IBE criteria and LLMs.
Consistency.
We find that the LLMs are surprisingly strong conjecture models. The LLMs can generate logically consistent explanations for any hypothesis, as indicated by similar consistency scores for explanations supporting both correct and incorrect answers (Figure 3). Moreover, we observe that consistency tends to be a statistically insignificant predictor for the LLaMA models. We therefore conclude that evidence of logical consistency provides a limited signal for plausibility and is better understood in the context of the other IBE criteria. For the incorrect candidate explanations, we find that LLMs over-rationalize, introducing additional premises to demonstrate entailment in their explanations.
<details>
<summary>extracted/6246183/consistency.png Details</summary>

### Visual Description
## Bar Chart: Model Logical Consistency
### Overview
The image is a horizontal bar chart comparing the logical consistency of three language models: Llama 2 7B, Llama 2 13B, and ChatGPT. The chart displays the percentage of logically consistent responses for both correct and incorrect options.
### Components/Axes
* **Y-axis:** Categorical axis listing the language models: Llama 2 7B, Llama 2 13B, and ChatGPT.
* **X-axis:** Numerical axis labeled "% Logically Consistent". The scale ranges implicitly from 0% to 100%.
* **Legend (Top-Right):**
* Green: "correct"
* Red: "incorrect"
### Detailed Analysis
The chart presents two bars for each language model, representing the percentage of logically consistent responses for correct and incorrect options.
* **Llama 2 7B:**
* Correct (Green): 70%
* Incorrect (Red): 76%
* **Llama 2 13B:**
* Correct (Green): 78%
* Incorrect (Red): 77%
* **ChatGPT:**
* Correct (Green): 81%
* Incorrect (Red): 82%
### Key Observations
* ChatGPT shows the highest logical consistency for both correct and incorrect options.
* Llama 2 7B has the lowest logical consistency for correct options.
* For Llama 2 13B, the percentage of logically consistent responses is slightly higher for correct options (78%) than for incorrect options (77%).
* For all models, the percentage of logically consistent responses is very similar for correct and incorrect options, with differences of 1-6%.
### Interpretation
The chart suggests that ChatGPT exhibits the best logical consistency among the three models tested. The proximity of the "correct" and "incorrect" bars for each model indicates that logical consistency is not strongly dependent on the correctness of the option. This could imply that the models are consistently applying logic, regardless of whether the conclusion is correct or not. The small differences between correct and incorrect options suggest that the models' logical reasoning is somewhat independent of the factual accuracy of the input or desired output.
</details>
Figure 3: An evaluation of explanation consistency. LLMs are strong rationalizers and can generate logically consistent explanations at equal rates for explanations associated with both correct and incorrect answer options.
Parsimony.
<details>
<summary>extracted/6246183/parsimony.png Details</summary>

### Visual Description
## Horizontal Bar Chart: Model Performance Comparison
### Overview
The image presents two horizontal bar charts comparing the performance of three language models (Llama 2 7B, Llama 2 13B, and ChatGPT) on two metrics: Average Proof Depth and Explanation Concept Drift. Each model has two bars representing "correct" and "incorrect" option types, indicated by green and red colors, respectively. The charts display the numerical values for each bar.
### Components/Axes
* **Chart Titles:**
* Top Chart: "Avg. Proof Depth"
* Bottom Chart: "Expl. Concept Drift"
* **Y-Axis Labels (Model Names):**
* Llama 2 7B
* Llama 2 13B
* ChatGPT
* **X-Axis Labels:**
* Top Chart: "Depth"
* Bottom Chart: "Drift"
* **Legend (Bottom):**
* "Option Type"
* Green: "correct"
* Red: "incorrect"
### Detailed Analysis
**Top Chart: Avg. Proof Depth**
* **Llama 2 7B:**
* Correct (Green): 2.68
* Incorrect (Red): 2.94
* **Llama 2 13B:**
* Correct (Green): 3.07
* Incorrect (Red): 3.27
* **ChatGPT:**
* Correct (Green): 2.63
* Incorrect (Red): 3.17
**Bottom Chart: Expl. Concept Drift**
* **Llama 2 7B:**
* Correct (Green): 3.96
* Incorrect (Red): 4.33
* **Llama 2 13B:**
* Correct (Green): 3.67
* Incorrect (Red): 4.55
* **ChatGPT:**
* Correct (Green): 2.95
* Incorrect (Red): 4.18
### Key Observations
* For both metrics, the "incorrect" option type consistently has a higher value than the "correct" option type for all models.
* Llama 2 13B has the highest "incorrect" value for "Avg. Proof Depth" (3.27).
* Llama 2 13B has the highest "incorrect" value for "Expl. Concept Drift" (4.55).
* ChatGPT has the lowest "correct" value for "Expl. Concept Drift" (2.95).
### Interpretation
The data suggests that all three models tend to have higher "Proof Depth" and "Concept Drift" when the option is incorrect. This could indicate that the models struggle more when generating incorrect answers, leading to more complex or convoluted reasoning processes. Llama 2 13B appears to have the highest "Concept Drift" when incorrect, potentially indicating a greater tendency to deviate from the correct explanation path. ChatGPT has the lowest "Concept Drift" when correct, suggesting it might be more consistent in its reasoning when providing correct answers. The difference between "correct" and "incorrect" values could be a useful metric for evaluating model reliability and identifying areas for improvement.
</details>
Figure 4: Explanation parsimony is evaluated using proof depth and concept drift. Both metrics are consistently lower for explanations supporting the correct answers suggesting that LLMs are able to generate efficient explanations for the more plausible hypothesis.
The results suggest that parsimony has a more consistent effect and is a better predictor of explanation quality. We observe negative correlations between proof depth, concept drift, and question-answering accuracy, suggesting that LLMs tend to introduce more concepts and explanation steps when explaining less plausible hypotheses. On average, we found depth and drift to be 6% and 10% greater, respectively, for the incorrect option across all LLMs (Figure 4). Moreover, the results suggest that as LLM parameter size increases, so does the tendency to over-rationalize: the average difference in depth and drift is greatest for GPT 3.5, suggesting that the model finds the most efficient explanations for stronger hypotheses while articulating more elaborate explanations for weaker candidates. Finally, we found that the LLaMA models tend to generate more complex explanations overall, with LLaMA 2 13B exhibiting the largest concept drift for less plausible hypotheses. The parsimony criterion supports IBE's predictive power with an average 14% improvement over consistency.
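Concept drift can be approximated as the number of concepts mentioned in an explanation but absent from the stated premise and conclusion. The sketch below uses a crude lexical proxy over whitespace tokens; the paper's pipeline extracts concepts with proper NLP tooling rather than raw token overlap:

```python
def concept_drift(premise: str, conclusion: str, explanation: str) -> int:
    """Count tokens in the explanation that appear in neither the premise
    nor the conclusion (a rough lexical proxy for introduced concepts)."""
    stated = set(premise.lower().split()) | set(conclusion.lower().split())
    introduced = set(explanation.lower().split()) - stated
    return len(introduced)
```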
Coherence.
<details>
<summary>extracted/6246183/coherence.png Details</summary>

### Visual Description
## Bar Chart: Avg. Coherence Scores
### Overview
The image is a bar chart comparing the average coherence scores of three Large Language Models (LLMs): ChatGPT, Llama 2 13B, and Llama 2 7B. The chart displays coherence scores for both "Correct" and "Incorrect" types, represented by green and red bars, respectively. A secondary y-axis shows the relative difference percentage, plotted as a dashed red line with black circular markers.
### Components/Axes
* **Title:** Avg. Coherence Scores
* **X-axis:** LLM (ChatGPT, Llama 2 13B, Llama 2 7B)
* **Left Y-axis:** Coherence Score (scale from 0.0 to 0.3, with increments of 0.1)
* **Right Y-axis:** Rel. Difference % (scale from 0% to 30%, with increments of 10%)
* **Legend:** Located on the right side of the chart.
* Correct (Green)
* Incorrect (Red)
### Detailed Analysis
* **ChatGPT:**
* Correct: Coherence score is approximately 0.25.
* Incorrect: Coherence score is approximately 0.20.
* Relative Difference: Approximately 23%.
* **Llama 2 13B:**
* Correct: Coherence score is approximately 0.28.
* Incorrect: Coherence score is approximately 0.19.
* Relative Difference: Approximately 31%.
* **Llama 2 7B:**
* Correct: Coherence score is approximately 0.23.
* Incorrect: Coherence score is approximately 0.19.
* Relative Difference: Approximately 18%.
* **Relative Difference Trend:** The red dashed line shows the relative difference percentage. It starts at approximately 23% for ChatGPT, increases to approximately 31% for Llama 2 13B, and then decreases to approximately 18% for Llama 2 7B.
### Key Observations
* For all three LLMs, the coherence score for "Correct" type is higher than the coherence score for "Incorrect" type.
* Llama 2 13B has the highest coherence score for "Correct" type and the highest relative difference percentage.
* Llama 2 7B has the lowest coherence score for "Correct" type and the lowest relative difference percentage.
* The "Incorrect" coherence scores are relatively similar across all three LLMs, ranging from approximately 0.19 to 0.20.
### Interpretation
The chart suggests that Llama 2 13B performs the best in terms of coherence score when the model's output is correct. The relative difference percentage indicates the degree to which the coherence score differs between correct and incorrect outputs. A higher percentage suggests a greater disparity in coherence between correct and incorrect outputs. The data indicates that while all models perform better on "Correct" outputs, Llama 2 13B exhibits the most significant difference in coherence between correct and incorrect outputs. ChatGPT and Llama 2 7B show relatively similar performance, with Llama 2 7B having a slightly lower relative difference percentage.
</details>
Figure 5: An evaluation of explanation coherence and question accuracy. The average coherence score is consistently higher for explanations corresponding to the correct hypotheses across the LLMs.
Similar to parsimony, we found coherence to be a better indicator of explanation quality, being statistically significant for both GPT 3.5 and LLaMA 2 13B on COPA, and for both LLaMA 2 models on E-CARE. The average coherence score is consistently greater for the stronger hypothesis across all LLMs and datasets (see Figure 5). Both GPT 3.5 and LLaMA 2 13B exhibit a higher relative difference between the correct and incorrect hypotheses than LLaMA 2 7B.
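A coherence score of this kind can be sketched as an aggregate over step-wise entailment scores, where each score would come from an NLI model judging consecutive explanation steps. The mean aggregation below is an illustrative assumption, not necessarily the paper's exact formulation:

```python
def coherence(step_entailment_scores: list) -> float:
    """Aggregate per-step entailment scores into a single coherence value.
    The scores are assumed to come from an NLI model applied to each
    consecutive pair of explanation steps; here we simply average them."""
    if not step_entailment_scores:
        return 0.0
    return sum(step_entailment_scores) / len(step_entailment_scores)
```

A drawback noted in the Limitations section applies here: averaging can hide a single weak entailment step inside an otherwise strong explanation.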
Uncertainty.
<details>
<summary>extracted/6246183/uncertainty.png Details</summary>

### Visual Description
## Bar Charts and Stacked Bar Chart: Model Uncertainty and Hedge Cues
### Overview
The image presents three charts. The top left chart shows the average uncertainty score for three language models (ChatGPT, Llama 2 13B, and Llama 2 7B) when generating explanations, separated by whether the explanation was correct or incorrect. The top right chart shows the average ratio of explanation hedge cues for the same models, again separated by correctness. Both of these charts also include a secondary y-axis showing the relative difference between the correct and incorrect values. The bottom chart is a stacked bar chart showing the distribution of different categories of hedge cues (Conditional, Doxatic, and Epistemic) in incorrect explanations for each model.
### Components/Axes
**Top Left Chart: Avg. Uncertainty**
* **Title:** Avg. Uncertainty
* **Y-axis (left):** Score, ranging from 0 to 4.
* **X-axis:** Language models: ChatGPT, Llama 2 13B, Llama 2 7B.
* **Legend:**
* Correct (Green)
* Incorrect (Red)
* **Y-axis (right):** Rel. Difference, ranging from 0% to 10%.
**Top Right Chart: Avg. Ratio of Expl. Hedge Cues**
* **Title:** Avg. Ratio of Expl. Hedge Cues
* **Y-axis (left):** Ratio, ranging from 0.00 to 0.04.
* **X-axis:** Language models: ChatGPT, Llama 2 13B, Llama 2 7B.
* **Legend:**
* Correct (Green)
* Incorrect (Red)
* **Y-axis (right):** Rel. Difference, ranging from 0% to 20%.
**Bottom Chart: Hedge Cues in Incorrect Expl.**
* **Title:** Hedge Cues in Incorrect Expl.
* **Y-axis:** Language models: Llama 2 7B, Llama 2 13B, ChatGPT.
* **X-axis:** Percentage, ranging from 0% to 100%.
* **Legend:**
* Conditional (Red)
* Doxatic (Green)
* Epistemic (Blue)
### Detailed Analysis
**Top Left Chart: Avg. Uncertainty**
* **ChatGPT:**
* Correct: Score ~3.6
* Incorrect: Score ~3.8
* Rel. Difference: ~8%
* **Llama 2 13B:**
* Correct: Score ~3.8
* Incorrect: Score ~4.1
* Rel. Difference: ~3%
* **Llama 2 7B:**
* Correct: Score ~3.4
* Incorrect: Score ~3.9
* Rel. Difference: ~12%
The red dashed line, representing the relative difference, shows an upward trend from ChatGPT to Llama 2 7B.
**Top Right Chart: Avg. Ratio of Expl. Hedge Cues**
* **ChatGPT:**
* Correct: Ratio ~0.03
* Incorrect: Ratio ~0.04
* Rel. Difference: ~22%
* **Llama 2 13B:**
* Correct: Ratio ~0.035
* Incorrect: Ratio ~0.045
* Rel. Difference: ~12%
* **Llama 2 7B:**
* Correct: Ratio ~0.032
* Incorrect: Ratio ~0.04
* Rel. Difference: ~20%
The red dashed line, representing the relative difference, shows a downward trend from ChatGPT to Llama 2 13B, then an upward trend to Llama 2 7B.
**Bottom Chart: Hedge Cues in Incorrect Expl.**
* **ChatGPT:**
* Epistemic: ~85%
* Doxatic: ~5%
* Conditional: ~10%
* **Llama 2 13B:**
* Epistemic: ~80%
* Doxatic: ~5%
* Conditional: ~15%
* **Llama 2 7B:**
* Epistemic: ~75%
* Doxatic: ~5%
* Conditional: ~20%
### Key Observations
* For all models, the average uncertainty score is higher for incorrect explanations than for correct explanations.
* For all models, the average ratio of explanation hedge cues is higher for incorrect explanations than for correct explanations.
* Epistemic hedge cues are the most prevalent type of hedge cue in incorrect explanations across all models.
* The proportion of Epistemic cues decreases from ChatGPT to Llama 2 7B, while the proportion of Conditional cues increases.
### Interpretation
The data suggests that higher uncertainty and a greater ratio of hedge cues in explanations are associated with incorrect explanations generated by these language models. This could indicate that the models are less confident and more hesitant when providing incorrect explanations. The distribution of hedge cue categories suggests that the models rely heavily on epistemic hedging (indicating a lack of knowledge or certainty) when generating incorrect explanations. The shift in hedge cue distribution from ChatGPT to Llama 2 7B might reflect differences in the models' training data or architectures, leading to variations in how they express uncertainty.
</details>
Figure 6: Evaluation of linguistic uncertainty in LLM-generated explanations. LLMs tend to use more hedging language in explanations supporting less plausible hypotheses. Across the LLMs, the hedging language is predominantly epistemic (Appendix A.8).
The results reveal that linguistic uncertainty is the strongest predictor of explanation quality and is a statistically significant feature for all LLMs. This suggests that LLMs use more qualifying language when explaining weaker hypotheses (see Figure 6). We found that uncertainty can improve accuracy by 13 percentage points on COPA and 4 percentage points on E-CARE. We also examine the uncertainty cues expressed by LLMs by analyzing both the frequency of hedge words and the types of hedge cues employed in incorrect explanations. We find that the distribution of hedge cues tends to be similar, with only minor differences between LLMs (Figure 6). Epistemic cues were most frequently used by all three models, with LLaMA 2 7B being more likely to use conditional cues. See Appendix A.8 for further details.
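The hedge-cue ratio underlying this analysis can be sketched as the share of tokens in an explanation that belong to a hedge lexicon. The tiny cue set below is a hand-picked illustration; a real implementation would draw on a full lexicon such as the CoNLL-2010 hedge cues (Farkas et al., 2010):

```python
# Illustrative sample of hedge cues; not the full lexicon used in the paper.
HEDGE_CUES = {"may", "might", "could", "possibly", "likely", "perhaps", "suggests"}

def hedge_ratio(explanation: str) -> float:
    """Fraction of tokens in the explanation that are hedging terms."""
    tokens = explanation.lower().split()
    if not tokens:
        return 0.0
    return sum(t in HEDGE_CUES for t in tokens) / len(tokens)
```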
7.1 Correlation with Human Judgement.
We first sample 100 generated explanation pairs across both the COPA and E-CARE tasks and the evaluated LLMs. Two human evaluators are instructed to examine each pair of explanations and select the more plausible one. No additional information about the original question or the correct answer is provided, to avoid biasing the judges.
On average, the human evaluators were able to identify the explanation associated with the correct answer 96% (COPA) and 91% (E-CARE) of the time. We compute the inter-evaluator agreement between the two human evaluators and find a Cohen's Kappa of 0.68, suggesting strong agreement between the two evaluators.
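Cohen's Kappa corrects raw agreement for the agreement expected by chance. A minimal sketch of the standard formula (in practice one might use `sklearn.metrics.cohen_kappa_score`):

```python
def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators labeling the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
              for c in categories)
    return (p_o - p_e) / (1 - p_e)
```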
To evaluate whether IBE-Eval is correlated with human judgment, we compute Spearman's rank correlation between GPT 3.5-as-a-judge, IBE-Eval, and human judgment. We find that GPT 3.5-as-a-judge exhibits a weak and statistically insignificant correlation with human judgment (0.31). In contrast, IBE-Eval is significantly aligned with human preferences (Spearman's correlation of 0.64, p < 0.01), further suggesting IBE's potential for automatic explanation evaluation.
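Spearman's rank correlation is the Pearson correlation computed over the ranks of the two variables. A self-contained sketch is below; it assumes no tied values for simplicity (ties would require average ranks), and in practice `scipy.stats.spearmanr` would be used, which also returns the p-value:

```python
def spearman(x: list, y: list) -> float:
    """Spearman's rank correlation: Pearson correlation of the ranks.
    Assumes no ties (ties would need average ranks)."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```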
8 Related Work
Explorations of LLM reasoning capabilities across various domains (e.g., arithmetic, commonsense, planning, symbolic) are an emerging area of interest Xu et al. (2023); Huang and Chang (2023). Prompt-based methods Wei et al. (2022b); Zhou et al. (2023); Wang et al. (2023), such as CoT, investigate strategies to elicit specific types of reasoning behavior through direct LLM interaction. Olausson et al. (2023) investigate automatic proof generation and propose a neurosymbolic framework with an LLM semantic parser and external solver. Creswell et al. (2022) propose an inference framework where the LLM acts as both a selection and inference module to produce explanations consisting of causal reasoning steps in entailment tasks. Research on LLM faithfulness Atanasova et al. (2023) investigates whether LLM explanations are robust to spurious input alterations. Parcalabescu and Frank (2024) propose a self-consistency measure, CC-SHAP, which measures how specific alterations to a model's input contribute to the generated explanation. This paper primarily draws inspiration from recent work on the evaluation of natural language explanations Quan et al. (2024); Valentino et al. (2021); Wiegreffe and Marasovic (2021); Thayaparan et al. (2020); Dalvi et al. (2021); Camburu et al. (2018). However, unlike previous methods that require extensive human annotations or specific domain knowledge, we are the first to propose a set of criteria that can be automatically computed from explicit linguistic and logical features.
9 Conclusion
This paper proposed IBE-Eval, an interpretable framework for LLM explanation evaluation inspired by philosophical accounts of Inference to the Best Explanation (IBE). IBE-Eval can identify the best explanation supporting the correct answer with up to 77% accuracy in CQA scenarios, improving upon a GPT 3.5-as-a-judge baseline by +17%. Our regression study suggests that LLM explanations tend to conform to IBE expectations and that IBE-Eval is strongly correlated with human judgment. Linguistic uncertainty is the strongest IBE predictor of explanation quality, closely followed by parsimony and coherence. However, we also found that LLMs tend to be strong conjecture models, able to generate logically consistent explanations even for less plausible hypotheses, suggesting limited applicability for the logical consistency criterion in isolation. We believe our findings can open new lines of research on external evaluation methods for LLMs as well as interpretability tools for understanding the LLMs' underlying explanatory process.
10 Limitations
IBE-Eval offers an interpretable explanation evaluation framework utilizing logical and linguistic features. Our current instantiation of the framework is primarily limited in that it does not consider grounded knowledge for factuality. We observe that LLMs can generate factually incorrect but logically consistent explanations. In some cases, the coherence metric can identify those factual errors when the step-wise entailment score is comparatively lower. However, our reliance on aggregated metrics can hide weaker internal entailment, especially when the explanation is longer or the entailment strength of the surrounding explanation steps is stronger. Future work can introduce metrics to evaluate grounded knowledge or perform more granular evaluations of explanations to better weigh factual inaccuracies.
Additionally, IBE-Eval currently does not support the evaluation of single natural language explanations and has only been evaluated in the limited domain of causal commonsense reasoning. Future work will explore globally calibrating IBE-Eval plausibility scores to extend the evaluation to more diverse explanation generation and QA settings. Calibration would allow IBE-Eval to generate comparable scores across unrelated explanations and could be used to derive global thresholds for explanation classification.
Finally, the list of criteria considered in this work is not exhaustive and can be extended in future work. However, additional criteria for IBE might not be straightforward to implement (e.g., unification power, hardness to variation) and would probably require further progress in both epistemological accounts and existing NLP technology.
11 Ethics Statement
The human annotators for computing the human judgment baseline are all authors of the paper and, as such, were not further compensated for the annotation task.
Acknowledgements
This work was partially funded by the Swiss National Science Foundation (SNSF) project NeuMath (200021_204617), by the EPSRC grant EP/T026995/1 entitled "EnnCore: End-to-End Conceptual Guarding of Neural Architectures" under Security for all in an AI-enabled society, by the CRUK National Biomarker Centre, and supported by the Manchester Experimental Cancer Medicine Centre, the Science Foundation Ireland under grants SFI/18/CRT/6223 (Centre for Research Training in Artificial Intelligence), SFI/12/RC/2289_P2 (Insight), co-funded by the European Regional Development Fund, and the NIHR Manchester Biomedical Research Centre.
References
- Atanasova et al. (2023) Pepa Atanasova, Oana-Maria Camburu, Christina Lioma, Thomas Lukasiewicz, Jakob Grue Simonsen, and Isabelle Augenstein. 2023. Faithfulness tests for natural language explanations. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 283–294, Toronto, Canada. Association for Computational Linguistics.
- Baumgartner (2015) Michael Baumgartner. 2015. Parsimony and causality. Quality & Quantity, 49:839–856.
- Bowman et al. (2015) Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
- Buitinck et al. (2013) Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gaël Varoquaux. 2013. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pages 108–122.
- Camburu et al. (2018) Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. 2018. e-SNLI: Natural language inference with natural language explanations. Advances in Neural Information Processing Systems, 31.
- Camburu et al. (2020) OM Camburu, B Shillingford, P Minervini, T Lukasiewicz, and P Blunsom. 2020. Make up your mind! Adversarial generation of inconsistent natural language explanations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020. ACL Anthology.
- Chakraborty et al. (2017) Supriyo Chakraborty, Richard Tomsett, Ramya Raghavendra, Daniel Harborne, Moustafa Alzantot, Federico Cerutti, Mani Srivastava, Alun Preece, Simon Julier, Raghuveer M. Rao, Troy D. Kelley, Dave Braines, Murat Sensoy, Christopher J. Willis, and Prudhvi Gurram. 2017. Interpretability of deep learning models: A survey of results. In 2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), pages 1–6.
- Creswell et al. (2022) Antonia Creswell, Murray Shanahan, and Irina Higgins. 2022. Selection-inference: Exploiting large language models for interpretable logical reasoning.
- Dalal et al. (2023) Dhairya Dalal, Paul Buitelaar, and Mihael Arcan. 2023. CALM-bench: A multi-task benchmark for evaluating causality-aware language models. In Findings of the Association for Computational Linguistics: EACL 2023, pages 296â311, Dubrovnik, Croatia. Association for Computational Linguistics.
- Dalvi et al. (2021) Bhavana Dalvi, Peter Jansen, Oyvind Tafjord, Zhengnan Xie, Hannah Smith, Leighanna Pipatanangkura, and Peter Clark. 2021. Explaining answers with entailment trees. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7358–7370.
- Danilevsky et al. (2020) Marina Danilevsky, Kun Qian, Ranit Aharonov, Yannis Katsis, Ban Kawas, and Prithviraj Sen. 2020. A survey of the state of explainable AI for natural language processing. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 447–459, Suzhou, China. Association for Computational Linguistics.
- Deutsch (2011) David Deutsch. 2011. The beginning of infinity: Explanations that transform the world. Penguin UK.
- Du et al. (2022) Li Du, Xiao Ding, Kai Xiong, Ting Liu, and Bing Qin. 2022. e-CARE: a new dataset for exploring explainable causal reasoning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 432–446, Dublin, Ireland. Association for Computational Linguistics.
- Farkas et al. (2010) Richárd Farkas, Veronika Vincze, György Móra, János Csirik, and György Szarvas. 2010. The CoNLL-2010 shared task: Learning to detect hedges and their scope in natural language text. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning – Shared Task, pages 1–12, Uppsala, Sweden. Association for Computational Linguistics.
- Gao et al. (2023) Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2023. A framework for few-shot language model evaluation.
- Geirhos et al. (2020) Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. 2020. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673.
- Gordon et al. (2012) Andrew Gordon, Zornitsa Kozareva, and Melissa Roemmele. 2012. SemEval-2012 task 7: Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 394–398, Montréal, Canada. Association for Computational Linguistics.
- Harman (1965) Gilbert H Harman. 1965. The inference to the best explanation. The Philosophical Review, 74(1):88–95.
- Honnibal and Montani (2017) Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.
- Huang and Chang (2022) Jie Huang and Kevin Chen-Chuan Chang. 2022. Towards reasoning in large language models: A survey. arXiv preprint arXiv:2212.10403.
- Huang and Chang (2023) Jie Huang and Kevin Chen-Chuan Chang. 2023. Towards reasoning in large language models: A survey.
- Huang et al. (2023) Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2023. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.
- Ji et al. (2023) Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. ACM Comput. Surv., 55(12).
- Kitcher (1989) Philip Kitcher. 1989. Explanatory unification and the causal structure of the world.
- Knight (2023) Will Knight. 2023. AI is becoming more powerful, but also more secretive.
- Lampinen et al. (2022) Andrew Lampinen, Ishita Dasgupta, Stephanie Chan, Kory Mathewson, Mh Tessler, Antonia Creswell, James McClelland, Jane Wang, and Felix Hill. 2022. Can language models learn from explanations in context? In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 537–563.
- Laskar et al. (2023) Md Tahmid Rahman Laskar, M Saiful Bari, Mizanur Rahman, Md Amran Hossen Bhuiyan, Shafiq Joty, and Jimmy Xiangji Huang. 2023. A systematic study and comprehensive evaluation of chatgpt on benchmark datasets.
- Liang et al. (2023) Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher RĂ©, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. 2023. Holistic evaluation of language models.
- Lipton (2017) Peter Lipton. 2017. Inference to the best explanation. A Companion to the Philosophy of Science, pages 184–193.
- Lombrozo (2012) Tania Lombrozo. 2012. Explanation and abductive inference. Oxford Handbook of Thinking and Reasoning, pages 260–276.
- Mackonis (2013) Adolfas Mackonis. 2013. Inference to the best explanation, coherence and other explanatory virtues. Synthese, 190(6):975–995.
- Nie et al. (2019) Yixin Nie, Haonan Chen, and Mohit Bansal. 2019. Combining fact extraction and verification with neural semantic matching networks. In Association for the Advancement of Artificial Intelligence (AAAI).
- Nie et al. (2020) Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.
- Olausson et al. (2023) Theo X. Olausson, Alex Gu, Benjamin Lipkin, Cedegao E. Zhang, Armando Solar-Lezama, Joshua B. Tenenbaum, and Roger Levy. 2023. Linc: A neurosymbolic approach for logical reasoning by combining language models with first-order logic provers.
- Parcalabescu and Frank (2024) Letitia Parcalabescu and Anette Frank. 2024. On measuring faithfulness or self-consistency of natural language explanations.
- Pei and Jurgens (2021) Jiaxin Pei and David Jurgens. 2021. Measuring sentence-level and aspect-level (un)certainty in science communications. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9959–10011, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Popper (2014) Karl Popper. 2014. Conjectures and refutations: The growth of scientific knowledge. Routledge.
- Quan et al. (2024) Xin Quan, Marco Valentino, Louise A Dennis, and André Freitas. 2024. Enhancing ethical explanations of large language models through iterative symbolic refinement. arXiv preprint arXiv:2402.00745.
- R Core Team (2013) R Core Team. 2013. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
- Sober (1981) Elliott Sober. 1981. The principle of parsimony. The British Journal for the Philosophy of Science, 32(2):145–156.
- Srivastava et al. (2023) Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. 2023. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models.
- Thagard (1989) Paul Thagard. 1989. Explanatory coherence. Behavioral and Brain Sciences, 12(3):435–467.
- Thagard (1978) Paul R Thagard. 1978. The best explanation: Criteria for theory choice. The Journal of Philosophy, 75(2):76–92.
- Thayaparan et al. (2020) Mokanarangan Thayaparan, Marco Valentino, and André Freitas. 2020. A survey on explainability in machine reading comprehension. arXiv preprint arXiv:2010.00389.
- Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models.
- Valentino and Freitas (2022) Marco Valentino and André Freitas. 2022. Scientific explanation and natural language: A unified epistemological-linguistic perspective for explainable AI. arXiv preprint arXiv:2205.01809.
- Valentino et al. (2021) Marco Valentino, Ian Pratt-Hartmann, and André Freitas. 2021. Do natural language explanations represent valid logical arguments? Verifying entailment in explainable NLI gold standards. In Proceedings of the 14th International Conference on Computational Semantics (IWCS), pages 76–86, Groningen, The Netherlands (online). Association for Computational Linguistics.
- Wang et al. (2019) Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. Superglue: A stickier benchmark for general-purpose language understanding systems. CoRR, abs/1905.00537.
- Wang et al. (2023) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-consistency improves chain of thought reasoning in language models.
- Weber et al. (2019) Leon Weber, Pasquale Minervini, Jannes MĂŒnchmeyer, Ulf Leser, and Tim RocktĂ€schel. 2019. Nlprolog: Reasoning with weak unification for question answering in natural language.
- Wei et al. (2022a) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed H. Chi, Quoc Le, and Denny Zhou. 2022a. Chain of thought prompting elicits reasoning in large language models. CoRR, abs/2201.11903.
- Wei et al. (2022b) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022b. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
- Wiegreffe and Marasovic (2021) Sarah Wiegreffe and Ana Marasovic. 2021. Teach me to explain: A review of datasets for explainable natural language processing. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1).
- Williams et al. (2018) Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.
- Wu et al. (2022) Yuhuai Wu, Albert Q. Jiang, Wenda Li, Markus N. Rabe, Charles Staats, Mateja Jamnik, and Christian Szegedy. 2022. Autoformalization with large language models.
- Xiang (2023) Chloe Xiang. 2023. OpenAI's GPT-4 is closed source and shrouded in secrecy.
- Xu et al. (2023) Fangzhi Xu, Qika Lin, Jiawei Han, Tianzhe Zhao, Jun Liu, and Erik Cambria. 2023. Are large language models really good logical reasoners? a comprehensive evaluation and beyond.
- Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Haotong Zhang, Joseph Gonzalez, and Ion Stoica. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. ArXiv, abs/2306.05685.
- Zhou et al. (2023) Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, and Ed Chi. 2023. Least-to-most prompting enables complex reasoning in large language models.
Appendix A Appendix
A.1 Reproducibility
All experimental code is available online at https://github.com/dhairyadalal/IBE-eval to encourage future research in the field. We additionally summarize below the model implementations and technical resources used to compute the proposed IBE criteria:
- We adopt the Prolog solver for neuro-symbolic integration from Quan et al. (2024).
- We use spaCy Honnibal and Montani (2017) for tokenization and part-of-speech (POS) tagging.
- To compute coherence, we employ the RoBERTa-based NLI model Nie et al. (2020), which has been fine-tuned on a range of NLI and fact verification datasets, including SNLI Bowman et al. (2015), aNLI Nie et al. (2020), MultiNLI Williams et al. (2018), and FEVER-NLI Nie et al. (2019).
- To measure sentence-level uncertainty, we employ a fine-tuned RoBERTa model provided by Pei and Jurgens (2021).
- We use a fine-tuned BERT-based token classification model to label each word in the generated explanation with the uncertainty categories introduced in the CoNLL-2010 shared task on hedge detection Farkas et al. (2010).
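As a rough illustration of how such uncertainty signals behave, the sketch below counts hedge cues with a small hand-written lexicon. This is a deliberately simplified, hypothetical stand-in: the actual pipeline uses the fine-tuned RoBERTa and BERT classifiers cited above, and the `HEDGE_CUES` set here is illustrative only.

```python
import re

# Hypothetical hedge-cue lexicon for illustration; the paper instead uses
# fine-tuned classifiers (Pei and Jurgens, 2021; Farkas et al., 2010).
HEDGE_CUES = {"may", "might", "could", "can", "possibly", "perhaps", "likely"}

def hedge_density(explanation: str) -> float:
    """Fraction of word tokens that are hedge cues (a crude uncertainty proxy)."""
    tokens = re.findall(r"[a-zA-Z']+", explanation.lower())
    if not tokens:
        return 0.0
    return sum(t in HEDGE_CUES for t in tokens) / len(tokens)
```

For example, `hedge_density("The balloon may expand.")` returns 0.25, since one of the four word tokens is a hedge cue.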
A.2 Explanation Prompting
<details>
<summary>extracted/6246183/eev.png Details</summary>

### Visual Description
## Table: Entailment Forms Examples
### Overview
The image presents a table illustrating examples of entailment forms for cause and effect prediction. The table is divided into three columns: "Type," "Example," and "Entailment Forms." Each row provides a specific type of prediction (cause or effect), an example context and question, and the corresponding entailment forms (premise and conclusion).
### Components/Axes
* **Columns:**
* Type: Specifies the type of prediction (Cause Prediction, Effect Prediction).
* Example: Provides a context and a question related to the prediction type, along with possible answers.
* Entailment Forms: Shows the premise and conclusion related to the example.
* **Rows:**
* Cause Prediction: Illustrates cause prediction with an example involving a balloon expanding.
* Effect Prediction: Illustrates effect prediction with an example involving a child punching a stack of blocks.
### Detailed Analysis or ### Content Details
**Row 1: Cause Prediction**
* **Type:** Cause Prediction
* **Example:**
* Context: The balloon expanded. (in blue)
* Question: What was the cause?
* A) I blew into it. (in purple)
* B) I pricked it. (in purple)
* **Entailment Forms:**
* Premise: I blew into it. (in purple)
* Conclusion: The balloon expanded. (in blue)
* Premise: I pricked it. (in purple)
* Conclusion: The balloon expanded. (in blue)
**Row 2: Effect Prediction**
* **Type:** Effect Prediction
* **Example:**
* Context: The child punched the stack of blocks. (in blue)
* Question: What was the effect?
* A) The stack towered over the boy's head. (in purple)
* B) The blocks scattered all over the rug. (in purple)
* **Entailment Forms:**
* Premise: The child punched the stack of blocks. (in blue)
* Conclusion: The stack towered over the boy's head. (in purple)
* Premise: The child punched the stack of blocks. (in blue)
* Conclusion: The blocks scattered all over the rug. (in purple)
### Key Observations
* The examples are designed to illustrate how a given context and question can lead to different entailment forms.
* The color-coding (blue and purple) highlights the relationship between the context/question and the premise/conclusion.
### Interpretation
The table demonstrates how entailment forms can be derived from different types of predictions (cause and effect). It shows that a single context can lead to multiple possible conclusions, depending on the premise. The examples provided are simple scenarios that can be easily understood, making the concept of entailment forms more accessible. The use of color-coding helps to visually connect the different parts of the examples and their corresponding entailment forms.
</details>
Figure 7: To perform IBE we convert the CQA context and answer candidates into an entailment form (i.e., EEV) Valentino et al. (2021).
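The EEV conversion shown in Figure 7 can be sketched as a small helper that swaps the premise and conclusion roles depending on the question's causal direction. The function name and dictionary layout are illustrative assumptions, not the paper's implementation.

```python
def to_entailment_form(context: str, answer: str, question_type: str) -> dict:
    """Convert a CQA item into an entailment (premise, conclusion) pair.

    Cause prediction treats the answer candidate as the premise and the
    context as the conclusion; effect prediction reverses the two roles.
    """
    if question_type == "cause":
        return {"premise": answer, "conclusion": context}
    if question_type == "effect":
        return {"premise": context, "conclusion": answer}
    raise ValueError(f"unknown question type: {question_type}")
```

With the balloon example, `to_entailment_form("The balloon expanded.", "I blew into it.", "cause")` yields the premise "I blew into it." and the conclusion "The balloon expanded."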
<details>
<summary>extracted/6246183/cot-prompt.png Details</summary>

### Visual Description
## Text Analysis: Cause Explanation Example
### Overview
The image presents an example of a problem-solving task where the goal is to identify the most plausible cause for a given context. It provides a context, a question, two possible options, explanations for each option, and the selected answer. The task involves generating step-by-step logical proofs for each option, using IF-THEN rules and common-sense assumptions.
### Components/Axes
The image is structured as follows:
* **Instructions:** A paragraph explaining the task and the required format for the answer.
* **Example:**
* **Context:** The woman banished the children from her property.
* **Question:** What was the cause?
* **Options:**
* (a) the children trampled through her garden
* (b) the children hit a ball into her yard
* **Option 1 Explanation:**
* **Premise:** the children trampled through her garden.
* **Conclusion:** The woman banished the children from her property.
* **Step 1:** IF children trample through someone's garden, THEN it can cause damage to the garden.
* **Assumption:** Trampling through a garden can result in damage to the garden.
* **Step 5:** Therefore, since the children trampled through her garden, causing damage, the woman may have felt upset or angry and decided to banish the children from her property as a way to prevent further damage.
* **Option 2 Explanation:**
* **Premise:** the children hit a ball into her yard.
* **Conclusion:** The woman banished the children from her property.
* **Step 1:** IF children hit a ball into her yard, THEN the woman may feel her property is being invaded.
* **Assumption:** Having objects thrown into one's yard can be seen as an invasion of privacy.
* **Step 5:** Therefore, since the children hit a ball into her yard, the woman may have felt her property was being invaded, which could have led to her becoming angry and ultimately banishing the children from her property.
* **Answer:** (a) the children trampled through her garden
* **Context:**
* **Question:**
* **Options:** |
### Detailed Analysis or ### Content Details
The example demonstrates how to analyze two possible causes for a given context. For each option, a premise and conclusion are stated, followed by a step-by-step explanation using an IF-THEN rule and a common-sense assumption. The explanation aims to provide a logical connection between the premise and the conclusion.
### Key Observations
The key observation is the structured approach to problem-solving. Each option is treated as a hypothesis, and a logical argument is constructed to support or refute it. The use of IF-THEN rules and assumptions makes the reasoning process explicit.
### Interpretation
The example illustrates a method for identifying the most plausible cause by systematically evaluating each option. The step-by-step explanation helps to clarify the reasoning process and identify potential weaknesses in the argument. The task emphasizes the importance of considering both causal relationships and common-sense assumptions when analyzing complex situations. The final section with empty context, question, and options suggests that the user is expected to fill in these fields and apply the same problem-solving approach.
</details>
Figure 8: An example of the modified CoT prompt template for explanation generation.
A modified CoT prompt is used to instruct the LLM to generate explanations. The prompt includes a set of instructions for explanation generation and an in-context example. Appended to the end of the prompt are the CQA context, causal question, and answer candidates. The LLM is instructed to first convert the options into the EEV format, consisting of a premise and a conclusion. The EEV format differs depending on the directionality of the causal question (see Figure 7): cause prediction questions treat the answer candidate as the premise and the context as the conclusion, while effect prediction reverses the relationship, treating the context as the premise and the answer options as the conclusion. After the EEV conversion, the model is instructed to generate a step-by-step explanation consisting of IF-THEN statements and the associated causal or commonsense assumptions. For ease of post-processing, the LLM is instructed to use headers and to enumerate steps using the Step # format. A full example of the prompt template is shown in Figure 8.
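Because the prompt enforces the Step #/Assumption headers, post-processing the generated explanation reduces to simple pattern matching. The sketch below is one plausible way to do it (the exact parsing code is not specified in the paper):

```python
import re

def parse_explanation(text: str):
    """Extract the IF-THEN steps and their assumptions from an explanation
    that follows the 'Step #:' / 'Assumption:' headers used in the prompt."""
    steps = re.findall(r"Step \d+:\s*(.+)", text)
    assumptions = re.findall(r"Assumption:\s*(.+)", text)
    return steps, assumptions
```

On the balloon explanation from Figure 8, this would recover two steps and two assumptions, which downstream criteria such as parsimony (step count) can then consume.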
<details>
<summary>extracted/6246183/autoformalization.png Details</summary>

### Visual Description
## Prolog Conversion Example
### Overview
The image presents instructions and an example of converting a premise, conclusion, and explanation into Prolog syntax. It provides guidelines for generating goals, facts, and rules, along with specific constraints on variable usage and constant usage. It then gives a concrete example related to Tom's health condition and its representation in Prolog. Finally, it sets up a template for a second example.
### Components/Axes
* **Instructions:** A block of text explaining the conversion process and constraints.
* **Example 1:**
* **Premise:** "Tom's pancreas was injured."
* **Conclusion:** "He has a high blood sugar level."
* **Explanation:** A series of "IF...THEN..." statements and a concluding statement linking the premise to the conclusion.
* **Goal:** `has_high_blood_sugar(tom).`
* **Formal Goal:** `has_high_blood_sugar(X) :- tom(X).`
* **Facts:**
* `injured_pancreas(tom)`
* `tom(tom)`
* **Rules:**
* `dysfunctional_pancreas(X) :- injured_pancreas(X).`
* `reduced_insulin_production(X) :- dysfunctional_pancreas(X)`
* `has_high_blood_sugar(X) :- reduced_insulin_production(X)`
* **Example 2:**
* **Premise:** (Blank)
* **Conclusion:** (Blank)
* **Explanation:** (Blank)
### Detailed Analysis or ### Content Details
**Instructions:**
The instructions specify the following:
* Convert premise, conclusion, and explanation to Prolog syntax.
* Generate the goal from the conclusion.
* Generate facts from the premise.
* Generate rules from the explanation.
* Use only one variable per predicate.
* Do not generate rules or facts with more than one variable.
* Examples of disallowed constructs: `'intoxicated(X, main)'`, `'intoxicated(X, Y)'`, `'leaking(water_pipe, frozen)'`.
* Goals and facts must refer to the same constant.
**Example 1 Breakdown:**
* **Premise:** Tom's pancreas was injured.
* **Conclusion:** He has a high blood sugar level.
* **Explanation:**
* IF pancreas are injured, THEN pancreas may be dysfunctional.
* IF pancreas are dysfunctional, THEN pancreas have a reduced capacity for insulin production.
* IF there is a reduced capacity for insulin production, THEN there is high levels of blood sugar.
* Therefore, since Tom's pancreas was injured, he may have a reduced capacity for insulin production, leading to insufficient insulin and high blood sugar levels.
* **Prolog Representation:**
* **Goal:** `has_high_blood_sugar(tom).` This states the goal is to prove Tom has high blood sugar.
* **Formal Goal:** `has_high_blood_sugar(X) :- tom(X).` This is likely incorrect, as it states that for all X, if X is Tom, then X has high blood sugar.
* **Facts:**
* `injured_pancreas(tom).` This states that Tom's pancreas is injured.
* `tom(tom).` This is likely incorrect, as it states that Tom is Tom.
* **Rules:**
* `dysfunctional_pancreas(X) :- injured_pancreas(X).` If X's pancreas is injured, then X's pancreas is dysfunctional.
* `reduced_insulin_production(X) :- dysfunctional_pancreas(X).` If X's pancreas is dysfunctional, then X has reduced insulin production.
* `has_high_blood_sugar(X) :- reduced_insulin_production(X).` If X has reduced insulin production, then X has high blood sugar.
**Example 2:**
This section provides a template for a second example, with placeholders for the premise, conclusion, and explanation.
### Key Observations
* The instructions emphasize the conversion of natural language statements into Prolog syntax.
* The example demonstrates how a simple medical scenario can be represented using Prolog facts and rules.
* The example highlights the importance of using appropriate constants and variables in Prolog.
* The "Formal Goal" and `tom(tom)` fact in Example 1 seem logically flawed.
### Interpretation
The image provides a basic introduction to representing knowledge in Prolog. The example demonstrates how to translate a simple causal chain (injured pancreas -> dysfunctional pancreas -> reduced insulin production -> high blood sugar) into Prolog rules. The instructions aim to guide users in constructing valid Prolog code by emphasizing the use of single variables per predicate and consistent use of constants. The example, while illustrative, contains potential logical errors in the "Formal Goal" and the `tom(tom)` fact, which could lead to incorrect inferences if used as is. The overall purpose is to teach the fundamentals of knowledge representation in Prolog using a relatable medical scenario.
</details>
Figure 9: An example of the autoformalization prompt.
A.3 Autoformalization
Autoformalization is the process of translating natural language descriptions into formal specifications Wu et al. (2022). We use the translational capabilities of GPT-3.5-Turbo to convert each explanation into a formal entailment hypothesis: the IF-THEN explanation steps are converted into a set of Prolog rules, the entailment premise is used to generate Prolog atoms, and the conclusion statement is translated into a Prolog query. We provide an example of the autoformalization prompt in Figure 9 and an example of the formalized output in Figure 11. After autoformalization, a post-processing script extracts the formalized rules, atoms, and query and assembles them into a Prolog program for entailment verification.
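To make the entailment check concrete, the toy sketch below performs backward chaining over single-antecedent propositional rules, mirroring the proof structure in Figure 11. It is a simplified stand-in for illustration, not the Prolog solver adopted from Quan et al. (2024), and the data layout (facts as a set, rules as a head-to-bodies dictionary) is an assumption of this sketch.

```python
def entails(query: str, facts: set, rules: dict) -> bool:
    """Backward chaining over propositional rules of the form head :- body.

    `rules` maps each head predicate to the list of body predicates that
    can derive it; `facts` holds the ground atoms from the premise.
    """
    seen = set()  # guard against cyclic rule chains

    def prove(goal: str) -> bool:
        if goal in facts:
            return True
        if goal in seen:
            return False
        seen.add(goal)
        return any(prove(body) for body in rules.get(goal, []))

    return prove(query)
```

With the balloon program (fact `blew_into_balloon`, rules `inflated_balloon :- blew_into_balloon` and `expanded_balloon :- inflated_balloon`), the query `expanded_balloon` succeeds, reproducing the proof chain in Figure 11.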
A.4 LLM-as-a-Judge Baseline
<details>
<summary>extracted/6246183/llm-judge-prompt.png Details</summary>

### Visual Description
## Text Block: Explanation Selection Task
### Overview
The image presents a task where the user must choose the more plausible of two explanations. The instructions emphasize logical consistency and arriving at the correct conclusion as criteria for a good explanation. The user is instructed to respond with either "explanation 1" or "explanation 2".
### Components/Axes
The image contains the following text elements:
* **Instructions:** "Given the two explanations below (explanation 1 and explanation 2) which explanation is more plausible. A good explanation should be logically consistent and arrive at the correct conclusion. Respond with either explanation 1 or explanation 2 as your final answer."
* **Explanation 1:** "Explanation 1:" followed by "..." indicating a placeholder for the first explanation.
* **Explanation 2:** "Explanation 2:" followed by "..." indicating a placeholder for the second explanation.
* **Answer:** "Answer:" indicating a space for the user's response.
### Detailed Analysis or ### Content Details
The image is structured as a prompt for a reasoning task. It provides a context, instructions, and placeholders for two explanations and an answer. The "..." placeholders suggest that the actual explanations are to be provided elsewhere (e.g., in a separate document or part of the task).
### Key Observations
The task focuses on evaluating the plausibility of explanations based on logical consistency and correctness. The user is expected to analyze the provided explanations and select the one that best meets these criteria.
### Interpretation
The image represents a question or task designed to assess critical thinking and reasoning skills. The user must evaluate the provided explanations and select the most plausible one based on logical consistency and the ability to arrive at the correct conclusion. The task emphasizes the importance of sound reasoning and the ability to distinguish between plausible and implausible explanations.
</details>
Figure 10: An example of prompt used by the LLM-as-a-Judge model for evaluating competing explanations.
GPT 3.5 serves as the judge model for the LLM-as-a-Judge baseline. As with the human evaluators, GPT 3.5 is presented with both generated explanations and asked to identify which explanation is more plausible.
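Assembling the judge prompt from Figure 10 amounts to filling a fixed template with the two competing explanations; the API call to GPT 3.5 is omitted here. The function name is an illustrative assumption.

```python
# Template following the LLM-as-a-Judge prompt in Figure 10.
JUDGE_TEMPLATE = (
    "Given the two explanations below (explanation 1 and explanation 2) "
    "which explanation is more plausible. A good explanation should be "
    "logically consistent and arrive at the correct conclusion. Respond "
    "with either explanation 1 or explanation 2 as your final answer.\n\n"
    "Explanation 1: {e1}\n\nExplanation 2: {e2}\n\nAnswer:"
)

def build_judge_prompt(e1: str, e2: str) -> str:
    """Fill the judge template with two competing explanations."""
    return JUDGE_TEMPLATE.format(e1=e1, e2=e2)
```

The resulting string is then sent to the judge model, whose reply is matched against "explanation 1" or "explanation 2" to score the baseline.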
A.5 Logical Consistency
<details>
<summary>extracted/6246183/logical-pipeline.png Details</summary>

### Visual Description
## Table: Explanation, Formalized Output, and Generated Proof
### Overview
The image presents a table with three columns: "Explanation," "Formalized Output," and "Generated Proof." It describes the process of a balloon expanding after someone blows into it, formalizes this process using Prolog, and generates a proof of the process.
### Components/Axes
* **Column 1:** Explanation
* Premise: I blew into it.
* Conclusion: The balloon expanded.
* Step 1: IF someone blows into a balloon, THEN it can cause the balloon to inflate.
* Assumption: Blowing air into a balloon increases the amount of air inside it, leading to inflation.
* Step 2: IF the balloon inflates, THEN it can cause the balloon to expand.
* Assumption: When a balloon inflates, it stretches and expands in size.
* Therefore, since I blew into the balloon, it caused the balloon to inflate, which resulted in its expansion.
* **Column 2:** Formalized Output
* Prolog Query: expanded\_balloon(me).
* Program:
* % Atoms
* blew\_into\_balloon(me).
* me(me).
* % Rules
* inflated\_balloon(X) :- blew\_into\_balloon(X).
* expanded\_balloon(X) :- inflated\_balloon(X).
* **Column 3:** Generated Proof
* expanded\_balloon(me) ->
* expanded\_balloon(X) :- inflated\_balloon(X) ->
* inflated\_balloon(X) :- blew\_into\_balloon(X) ->
* blew\_into\_balloon(me)
### Detailed Analysis or ### Content Details
**Column 1: Explanation**
* The explanation starts with a premise and a conclusion. The premise is "I blew into it," and the conclusion is "The balloon expanded."
* It then breaks down the process into two steps, each with an "IF...THEN" statement and an assumption.
* Step 1 states that blowing into a balloon can cause it to inflate, with the assumption that blowing air into a balloon increases the amount of air inside, leading to inflation.
* Step 2 states that if the balloon inflates, it can cause the balloon to expand, with the assumption that when a balloon inflates, it stretches and expands in size.
* The explanation concludes by stating that since "I blew into the balloon," it caused the balloon to inflate, which resulted in its expansion.
**Column 2: Formalized Output**
* The formalized output uses Prolog to represent the process.
* The Prolog query is "expanded\_balloon(me)."
* The program defines atoms and rules.
* The atoms are "blew\_into\_balloon(me)." and "me(me)."
* The rules are "inflated\_balloon(X) :- blew\_into\_balloon(X)." and "expanded\_balloon(X) :- inflated\_balloon(X)."
**Column 3: Generated Proof**
* The generated proof shows the logical steps to prove that the balloon expanded.
* It starts with "expanded\_balloon(me) ->" and then shows the steps to reach "blew\_into\_balloon(me)."
### Key Observations
* The table provides a structured way to understand the process of a balloon expanding.
* The explanation is in natural language, while the formalized output uses Prolog.
* The generated proof shows the logical steps to connect the premise to the conclusion.
### Interpretation
The table demonstrates how a simple real-world scenario can be formalized using logic programming. The "Explanation" provides an intuitive understanding, while the "Formalized Output" translates this understanding into a Prolog program. The "Generated Proof" then uses this program to logically derive the conclusion from the premise. This highlights the power of logic programming in representing and reasoning about real-world events.
</details>
Figure 11: An example of the autoformalization prompt.
An explanation hypothesis is considered logically consistent if the external solver can build a deductive proof connecting the conclusion to the premise. We use NLProlog Weber et al. (2019), a neuro-symbolic Prolog solver integrating backward chaining with word embedding models via a weak unification mechanism. NLProlog allows for a level of flexibility and robustness that is necessary for NLP use cases (e.g. unification applied to synonyms). We provide the autoformalized query, atoms, and rules to NLProlog. If NLProlog can satisfy the entailment query, it will return the proof consisting of the set of rules traversed, the weak unification score, and the proof depth. For simplicity, we assign a score of one if the entailment query is satisfied and zero if it is not. The proof depth score is evaluated as part of the parsimony analysis. An end-to-end example of consistency evaluation can be found in Figure 11.
Input: Symbolic KB $kb$, Goal $goal$, GloVe embedding model $e(·)$
Output: proof chain $chain$, proof depth $depth$

threshold ← 0.13;
$depth$ ← 1;
$chain$ ← empty list;
foreach step $t$ in backward_chaining($kb$, $goal$) do
  foreach $max\_unification(q, q_{t})$ do
    $unification\_score$ ← $CosineSimilarity(e(q, m_{s}), e(q_{t}, m_{s}))$;
    $depth$ ← $depth$ × $unification\_score$;
  end foreach
  $chain$ ← backward_chaining($kb$, $goal$);
end foreach
if $chain$ is not empty and $depth$ > threshold then
  $chain$ ← current_proof_chain$[0]$;
else
  $depth$ ← 0;
end if
return $chain$, $depth$;
Algorithm 1 Neuro-symbolic Solver
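The weak unification step in Algorithm 1 can be sketched in Python. This is a minimal illustration, not NLProlog itself: the embedding table, predicate names, and numeric values below are toy stand-ins for the GloVe vectors and KB symbols the solver actually uses.

```python
import math

# Toy word embeddings standing in for GloVe vectors (illustrative values only).
EMBEDDINGS = {
    "inflate": [0.9, 0.1, 0.3],
    "expand":  [0.8, 0.2, 0.4],
    "prick":   [0.1, 0.9, 0.2],
}

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def weak_unification_score(pred_a, pred_b):
    """Score how well two predicate symbols unify via embedding similarity."""
    return cosine_similarity(EMBEDDINGS[pred_a], EMBEDDINGS[pred_b])

def proof_depth(predicate_pairs, threshold=0.13):
    """Multiply unification scores along a proof chain, as in Algorithm 1;
    the aggregate score is zeroed if it falls below the threshold."""
    depth = 1.0
    for a, b in predicate_pairs:
        depth *= weak_unification_score(a, b)
    return depth if depth > threshold else 0.0
```

With these toy vectors, semantically close predicates such as "inflate" and "expand" unify strongly, while repeated weak unifications drive the aggregate score below the threshold.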
A.6 Parsimony
Parsimony measures the complexity of an explanation and is represented by the proof depth and concept drift metrics. Proof depth is automatically calculated by NLProlog and reflects the number of rules traversed by the solver to satisfy the entailment query. If the hypothesis is not logically consistent, depth is set to zero. The concept drift metric measures the entropy of novel concepts introduced to bridge the premise and conclusion. To compute the drift of an explanation, we consider the nouns found in the premise, conclusion, and explanation steps. We use Spacy Honnibal and Montani (2017) to tokenize the text and extract part-of-speech (POS) tags, and all tokens with the "NOUN" POS tag are extracted. For normalization purposes, we consider the lemma of each token. Concept drift is then calculated as the set difference between the unique nouns found across all explanation steps and those found in the premise and conclusion.
Input: Premise, Conclusion, Explanation, Spacy model $spacy(·)$
Output: Drift Score $drift$

$Noun_{p}$ ← $spacy(Premise)$;
$Noun_{c}$ ← $spacy(Conclusion)$;
$Noun_{E}$ ← $spacy(Explanation)$;
$N$ ← $\{Noun_{p}, Noun_{c}, Noun_{E}\}$;
$drift$ ← $length(set(Noun_{E}) - set(Noun_{p} \cup Noun_{c}))$;
return $drift$;
Algorithm 2 Concept Drift
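The set-difference computation in Algorithm 2 can be sketched as follows. The paper uses spaCy for POS tagging and lemmatization; here a hand-written noun-lemma lookup (`TOY_NOUN_LEMMAS`, a hypothetical stand-in) replaces that pipeline so the sketch stays self-contained.

```python
# Hand-made noun-lemma table standing in for spaCy POS tagging + lemmatization.
TOY_NOUN_LEMMAS = {
    "balloon": "balloon", "balloons": "balloon",
    "air": "air", "size": "size",
}

def extract_nouns(text):
    """Return the set of noun lemmas found in a text (toy stand-in for spaCy)."""
    tokens = text.lower().replace(".", "").replace(",", "").split()
    return {TOY_NOUN_LEMMAS[t] for t in tokens if t in TOY_NOUN_LEMMAS}

def concept_drift(premise, conclusion, explanation_steps):
    """Count noun lemmas that appear in the explanation steps but in
    neither the premise nor the conclusion (Algorithm 2)."""
    given = extract_nouns(premise) | extract_nouns(conclusion)
    novel = set()
    for step in explanation_steps:
        novel |= extract_nouns(step) - given
    return len(novel)
```

For the balloon example, explanation steps that introduce "air" and "size" yield a drift of 2, since neither noun appears in the premise or the conclusion.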
A.7 Coherence
Coherence evaluates the plausibility of the intermediate explanation steps. We propose stepwise entailment as a metric to measure the entailment strength of the If-then implications. We employ a RoBERTa-based NLI model Nie et al. (2020) that has been finetuned on a range of NLI and fact verification datasets consisting of SNLI Bowman et al. (2015), aNLI Nie et al. (2020), MultiNLI Williams et al. (2018), and FEVER-NLI Nie et al. (2019). To compute the stepwise entailment score, we first measure the entailment strength between the If and Then propositions. For example, to calculate the score of the statement "IF a balloon is pricked, THEN the balloon may deflate", we consider "a balloon is pricked" and "the balloon may deflate" as input sentences for the NLI model. The NLI model produces independent scores for the entailment and contradiction labels. We compute the entailment strength by subtracting the contradiction label score from the entailment label score. An entailment strength of one indicates the If-then implication is strongly plausible, whereas a score of zero suggests that it is likely implausible. The overall stepwise entailment score is the average of the entailment strength measures across all explanation steps.
Input: Explanation $E$, NLI Model $nli(·)$
Output: Average Entailment Strength $strength$

$EntailmentStrengthScores$ ← empty list;
foreach Step $(If_{s}, Then_{s})$ in $E$ do
  $EntailmentScore$ ← entailment label score of $nli(If_{s}, Then_{s})$;
  $ContradictionScore$ ← contradiction label score of $nli(If_{s}, Then_{s})$;
  $EntailmentStrength$ ← $EntailmentScore - ContradictionScore$;
  Append $EntailmentStrength$ to $EntailmentStrengthScores$;
end foreach
$strength$ ← $Avg(EntailmentStrengthScores)$;
return $strength$;
Algorithm 3 Stepwise Entailment
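Algorithm 3 can be sketched with a stub in place of the NLI model. The paper scores steps with a RoBERTa NLI model; `toy_nli` below returns hand-set (entailment, contradiction) probabilities purely for illustration.

```python
# Hand-set (entailment, contradiction) probabilities standing in for a
# RoBERTa NLI model; values are illustrative, not model outputs.
TOY_NLI_SCORES = {
    ("a balloon is pricked", "the balloon may deflate"): (0.90, 0.03),
    ("a balloon is pricked", "the balloon expands"):     (0.05, 0.85),
}

def toy_nli(premise, hypothesis):
    return TOY_NLI_SCORES[(premise, hypothesis)]

def stepwise_entailment(steps, nli=toy_nli):
    """Average (entailment - contradiction) over all If-Then steps
    of an explanation (Algorithm 3)."""
    strengths = []
    for if_s, then_s in steps:
        entail, contradict = nli(if_s, then_s)
        strengths.append(entail - contradict)
    return sum(strengths) / len(strengths)
```

A single plausible step scores close to one, while mixing in a contradicted step pulls the average down.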
A.8 Linguistic Uncertainty
Linguistic uncertainty measures the confidence of a statement, where hedging cues and indirect language suggest ambiguity around the proposition. To measure sentence-level uncertainty, we employ a finetuned RoBERTa model provided by Pei and Jurgens (2021). The model was trained on a sentence-level dataset of findings and statements extracted from news articles and scientific publications, with human-annotated judgments of sentence certainty. A scale from one to six was used to annotate sentences, where one corresponds to the lowest degree of certainty expressed by the sentence and six to the highest. We invert the scale to retrieve uncertainty scores. To compute the overall linguistic uncertainty of an explanation, we first compute the uncertainty of each assumption and of the explanation summary and then average all the scores.
We use a fine-tuned BERT-based token classification model to classify all the words in the generated explanation with uncertainty categories introduced in the 2010 CoNLL shared task on Hedge Detection Farkas et al. (2010). Farkas et al. (2010) classify hedge cues into three categories: epistemic, doxatic, and conditional. Epistemic cues refer to hedging scenarios where the truth value of a proposition can be determined but is unknown in the present (e.g. the blocks may fall). Doxatic cues refer to beliefs and hypotheses that can be held to be true or false by others (e.g. the child believed the blocks would fall). Finally, conditional cues refer to propositions whose truth value is dependent on another proposition's truth value (e.g. if the balloon is pricked it may deflate).
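The hedge-ratio statistic reported in our analysis (cues per token) can be sketched as below. The actual pipeline uses a BERT token classifier trained on the CoNLL-2010 hedge detection task; the small cue lexicon here (`HEDGE_CUES`) is a hand-made stand-in for illustration only.

```python
# Toy hedge-cue lexicon mapping cue words to their Farkas et al. (2010)
# category; a stand-in for the BERT token classifier, not its output.
HEDGE_CUES = {
    "may": "epistemic", "might": "epistemic",
    "believed": "doxatic",
    "if": "conditional",
}

def hedge_ratio(text):
    """Fraction of tokens in a text that are hedge cues."""
    tokens = text.lower().split()
    cues = [t for t in tokens if t in HEDGE_CUES]
    return len(cues) / len(tokens)
```

For example, "if the balloon is pricked it may deflate" contains two cues ("if", "may") among eight tokens, a ratio of 0.25.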
Input: Assumptions, Explanation Summary, Uncertainty Estimator Model $uc(·)$
Output: Overall Uncertainty

$AssumptionUncertaintyList$ ← empty list;
foreach Assumption in Assumptions do
  $UncertaintyScore$ ← $uc(Assumption)$;
  Append $UncertaintyScore$ to $AssumptionUncertaintyList$;
end foreach
$AverageAssumptionUncertainty$ ← $Avg(AssumptionUncertaintyList)$;
$ExplanationUncertainty$ ← $uc(ExplanationSummary)$;
$OverallExplanationUncertainty$ ← $AverageAssumptionUncertainty + ExplanationUncertainty$;
return $OverallExplanationUncertainty$;
Algorithm 4 Linguistic Uncertainty
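The aggregation in Algorithm 4 can be sketched with a stub scorer. The paper obtains per-sentence scores from a fine-tuned RoBERTa certainty model on an inverted 1-6 scale; `toy_uncertainty` below uses hand-set scores instead, purely for illustration.

```python
# Hand-set per-sentence uncertainty scores (inverted 1-6 certainty scale);
# a stand-in for the RoBERTa certainty model, not its output.
TOY_UNCERTAINTY = {
    "blowing air into a balloon increases the air inside it": 2.0,
    "when a balloon inflates, it may stretch and expand": 3.5,
    "blowing into the balloon caused it to expand": 2.5,
}

def toy_uncertainty(sentence):
    return TOY_UNCERTAINTY[sentence]

def overall_uncertainty(assumptions, summary, uc=toy_uncertainty):
    """Average assumption uncertainty plus summary uncertainty (Algorithm 4)."""
    avg_assumption = sum(uc(a) for a in assumptions) / len(assumptions)
    return avg_assumption + uc(summary)
```

With the toy scores above, two assumptions averaging 2.75 plus a summary score of 2.5 give an overall uncertainty of 5.25.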
A.9 Inference to Best Explanation
To perform IBE, we first fit a linear regression model over the extracted explanation features from the COPA train set and 500 randomly sampled examples from the E-CARE train set. We consider all explanations independently and annotate each explanation with a 1 if it corresponds to a correct answer or a 0 if it corresponds to an incorrect answer. After the linear model is fitted, we evaluate on the COPA and E-CARE test sets. For each example, we use the trained linear model to score each answer candidate's explanation and then select the candidate with the highest score. We use the linear regression implementation from scikit-learn Buitinck et al. (2013) for the IBE model. We additionally use the R stats package R Core Team (2013) for conducting our regression analysis.
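The selection step can be sketched as a linear scorer over the IBE features. The weights below are illustrative hand-set values, not the fitted scikit-learn coefficients; their signs merely reflect the trends in the appendix figures (consistency and coherence favor correct explanations, concept drift and uncertainty disfavor them).

```python
# Illustrative hand-set weights; the paper fits these with scikit-learn's
# linear regression on the COPA and E-CARE training features.
WEIGHTS = {
    "consistency": 0.8,
    "coherence": 0.6,
    "concept_drift": -0.3,
    "uncertainty": -0.2,
}

def ibe_score(features):
    """Linear score of one candidate explanation's feature dict."""
    return sum(WEIGHTS[name] * value for name, value in features.items())

def select_best(candidates):
    """Return the index of the candidate explanation with the highest score."""
    scores = [ibe_score(f) for f in candidates]
    return max(range(len(scores)), key=scores.__getitem__)
```

For instance, a candidate with lower drift and higher coherence outscores a competing candidate with the same consistency, so `select_best` picks it.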
A.10 E-CARE Results
A.10.1 E-CARE Consistency
See Figure 12.
<details>
<summary>extracted/6246183/consistency-ecare.png Details</summary>

### Visual Description
## Chart Type: Horizontal Bar Chart
### Overview
The image is a horizontal bar chart comparing the average consistency of three language models (Llama 2 7B, Llama 2 13B, and ChatGPT) on the E-CARE benchmark. The chart displays the percentage of logically consistent responses, broken down into "correct" and "incorrect" option types.
### Components/Axes
* **Title:** E-CARE: Avg. Consistency
* **Y-axis:** Language Models (Llama 2 7B, Llama 2 13B, ChatGPT)
* **X-axis:** % Logically Consistent
* **Legend (top-right):**
* Green: correct
* Red: incorrect
### Detailed Analysis
The chart presents the percentage of logically consistent responses for each language model, separated by whether the response was "correct" or "incorrect".
* **Llama 2 7B:**
* Correct: 80%
* Incorrect: 74%
* **Llama 2 13B:**
* Correct: 74%
* Incorrect: 70%
* **ChatGPT:**
* Correct: 78%
* Incorrect: 71%
### Key Observations
* For all three models, the percentage of "correct" logically consistent responses is higher than the percentage of "incorrect" logically consistent responses.
* Llama 2 7B has the highest percentage of "correct" logically consistent responses (80%).
* Llama 2 13B has the lowest percentage of "incorrect" logically consistent responses (70%).
### Interpretation
The chart suggests that all three language models exhibit a reasonable level of logical consistency on the E-CARE benchmark. However, there are differences in performance between the models. Llama 2 7B appears to be the most consistent in providing correct and logically consistent responses, while Llama 2 13B shows the lowest percentage of incorrect but logically consistent responses. ChatGPT falls in between the two Llama models in terms of both correct and incorrect logically consistent responses. The data indicates that while the models are generally consistent, there is still room for improvement, particularly in reducing the number of incorrect responses that are still logically consistent.
</details>
Figure 12: Average consistency comparison between correct and incorrect options for the E-CARE dataset.
A.10.2 E-CARE Proof Depth
See Figure 13.
<details>
<summary>extracted/6246183/depth-ecare.png Details</summary>

### Visual Description
## Horizontal Bar Chart: E-CARE: Avg. Proof Depth
### Overview
The image is a horizontal bar chart comparing the average proof depth of three different language models (Llama 2 7B, Llama 2 13B, and ChatGPT) on the E-CARE benchmark. The chart displays the average proof depth for both correct and incorrect answers, indicated by green and red bars, respectively.
### Components/Axes
* **Title:** E-CARE: Avg. Proof Depth
* **Y-axis Labels (Models):** Llama 2 7B, Llama 2 13B, ChatGPT
* **X-axis Label:** Depth
* **Legend (Option Type):**
* Green: correct
* Red: incorrect
### Detailed Analysis
The chart presents the average proof depth for each model, separated by whether the answer was correct or incorrect.
* **Llama 2 7B:**
* Correct (green): 1.86
* Incorrect (red): 2.03
* **Llama 2 13B:**
* Correct (green): 2.15
* Incorrect (red): 2.21
* **ChatGPT:**
* Correct (green): 1.98
* Incorrect (red): 2.18
### Key Observations
* For all three models, the average proof depth is higher for incorrect answers than for correct answers.
* Llama 2 13B has the highest average proof depth for both correct and incorrect answers compared to the other two models.
* Llama 2 7B has the lowest average proof depth for correct answers.
### Interpretation
The data suggests that a higher proof depth does not necessarily correlate with a correct answer. In fact, for all models tested, incorrect answers tend to have a higher average proof depth. This could indicate that the models are spending more computational effort on incorrect answers, or that the complexity of the problem is higher when the model fails to produce a correct answer. Llama 2 13B consistently exhibits the highest proof depth, suggesting a potentially different approach or architecture compared to the other models. The difference in proof depth between correct and incorrect answers is relatively small, but the consistent trend across all models is notable.
</details>
Figure 13: Comparison of average proof depth between correct and incorrect options.
<details>
<summary>extracted/6246183/drift-ecare.png Details</summary>

### Visual Description
## Bar Chart: E-CARE: Avg. Concept Drift
### Overview
The image is a horizontal bar chart comparing the average concept drift of three language models (ChatGPT, Llama 2 13B, and Llama 2 7B) based on whether the model's response was correct or incorrect. The chart displays two bars for each model, one representing the average drift when the response was correct (green) and the other when the response was incorrect (red).
### Components/Axes
* **Title:** E-CARE: Avg. Concept Drift
* **Y-axis:** Model Names (ChatGPT, Llama 2 13B, Llama 2 7B)
* **X-axis:** Drift
* **Legend (Top-Right):**
* Correct (Green)
* Incorrect (Red)
### Detailed Analysis
The chart presents the average concept drift for each model under two conditions: when the model's response is correct and when it is incorrect. The drift values are displayed at the end of each bar.
* **ChatGPT:**
* Correct (Green): 3.61
* Incorrect (Red): 5.02
* **Llama 2 13B:**
* Correct (Green): 5.19
* Incorrect (Red): 5.80
* **Llama 2 7B:**
* Correct (Green): 5.37
* Incorrect (Red): 5.87
### Key Observations
* For all three models, the average concept drift is higher when the response is incorrect compared to when it is correct.
* ChatGPT has the lowest drift when the response is correct (3.61), but also the largest difference between correct and incorrect responses (5.02 - 3.61 = 1.41).
* Llama 2 7B has the highest drift when the response is incorrect (5.87).
* The Llama 2 models have a smaller difference in drift between correct and incorrect responses compared to ChatGPT.
### Interpretation
The data suggests that all three language models exhibit a higher concept drift when they provide incorrect responses. This could indicate that errors are associated with a greater degree of deviation from the intended or expected concept. ChatGPT, while having the lowest drift for correct answers, shows the most significant increase in drift when it makes mistakes, suggesting its errors might be more conceptually divergent. The Llama 2 models show a more consistent level of drift regardless of the correctness of the response. This information is valuable for understanding the behavior of these models and identifying areas for improvement in their training or application.
</details>
Figure 14: Comparison of average concept drift between correct and incorrect options.
<details>
<summary>extracted/6246183/coherence-ecare.png Details</summary>

### Visual Description
## Bar Chart: E-CARE: Avg. Coherence
### Overview
The image is a bar chart comparing the average coherence scores of three language models (ChatGPT, Llama 2 13B, and Llama 2 7B) on the E-CARE dataset. The chart displays coherence scores for both "Correct" and "Incorrect" response types, represented by green and red bars, respectively. A dashed red line plots the relative difference between the "Correct" and "Incorrect" scores.
### Components/Axes
* **Title:** E-CARE: Avg. Coherence
* **X-axis:** Categorical axis with three categories: ChatGPT, Llama 2 13B, Llama 2 7B
* **Left Y-axis:** "Coherence Score" ranging from 0.0 to 0.25, with tick marks at 0.0, 0.1, and 0.2.
* **Right Y-axis:** "Rel. Difference %" ranging from 0% to 40%, with tick marks at 0%, 10%, 20%, 30%, and 40%.
* **Legend:** Located on the right side of the chart, indicating "Correct" responses with a green bar and "Incorrect" responses with a red bar.
* **Data Series:**
* Correct: Green bars representing the coherence score for correct responses.
* Incorrect: Red bars representing the coherence score for incorrect responses.
* Relative Difference: Dashed red line representing the relative difference between correct and incorrect scores.
### Detailed Analysis
**ChatGPT:**
* Correct (Green): Approximately 0.25
* Incorrect (Red): Approximately 0.23
* Relative Difference (Red Dashed Line): Approximately 11%
**Llama 2 13B:**
* Correct (Green): Approximately 0.23
* Incorrect (Red): Approximately 0.20
* Relative Difference (Red Dashed Line): Approximately 21%
**Llama 2 7B:**
* Correct (Green): Approximately 0.22
* Incorrect (Red): Approximately 0.18
* Relative Difference (Red Dashed Line): Approximately 28%
### Key Observations
* The "Correct" coherence scores are consistently higher than the "Incorrect" coherence scores for all three models.
* The relative difference between "Correct" and "Incorrect" scores increases from ChatGPT to Llama 2 13B to Llama 2 7B.
* ChatGPT has the highest "Correct" coherence score, while Llama 2 7B has the lowest.
### Interpretation
The chart suggests that all three language models exhibit higher coherence in their "Correct" responses compared to their "Incorrect" responses. The increasing relative difference from ChatGPT to Llama 2 7B indicates that the gap in coherence between correct and incorrect responses widens for the Llama 2 models, especially the 7B version. This could imply that the Llama 2 models are more sensitive to the quality of input or context, leading to a greater disparity in coherence when generating incorrect responses. ChatGPT, on the other hand, maintains a relatively smaller difference, suggesting a more consistent level of coherence regardless of response correctness.
</details>
Figure 15: Comparison of average coherence scores between correct and incorrect options.
<details>
<summary>extracted/6246183/uncertainty-ecare.png Details</summary>

### Visual Description
## Bar Chart: E-CARE: Avg. Uncertainty
### Overview
The image is a bar chart comparing the average uncertainty scores of three language models (ChatGPT, Llama 2 13B, and Llama 2 7B) on the E-CARE dataset. The chart displays scores for both "correct" and "incorrect" response types, represented by green and red bars, respectively. A secondary y-axis shows the relative difference between the "correct" and "incorrect" scores, plotted as a red dashed line with black circular markers.
### Components/Axes
* **Title:** E-CARE: Avg. Uncertainty
* **X-axis:** Language Models (ChatGPT, Llama 2 13B, Llama 2 7B)
* **Primary Y-axis:** Score, ranging from 0 to 3, with tick marks at 0, 1, 2, and 3.
* **Secondary Y-axis:** Rel. Difference, ranging from 0% to 5%, with tick marks at 0% and 5%.
* **Legend:** Located on the right side of the chart, indicating "correct" responses with a green bar and "incorrect" responses with a red bar.
### Detailed Analysis
* **ChatGPT:**
* Correct: The green bar extends to approximately 3.4.
* Incorrect: The red bar extends to approximately 3.6.
* **Llama 2 13B:**
* Correct: The green bar extends to approximately 3.4.
* Incorrect: The red bar extends to approximately 3.8.
* **Llama 2 7B:**
* Correct: The green bar extends to approximately 3.2.
* Incorrect: The red bar extends to approximately 3.5.
* **Relative Difference (Red Dashed Line):**
* ChatGPT: Starts at approximately 4.5%.
* Llama 2 13B: Drops to approximately 2.5%.
* Llama 2 7B: Rises to approximately 4.7%.
### Key Observations
* For all three language models, the "incorrect" scores are higher than the "correct" scores, indicating higher uncertainty for incorrect responses.
* Llama 2 13B has the highest "incorrect" score and the lowest relative difference.
* ChatGPT and Llama 2 7B have similar relative differences, but ChatGPT has slightly lower "incorrect" score.
### Interpretation
The chart suggests that all three language models exhibit higher uncertainty when generating incorrect responses compared to correct ones. The relative difference in uncertainty between correct and incorrect responses varies across the models, with Llama 2 13B showing the smallest difference. This could indicate that Llama 2 13B's uncertainty is less indicative of its correctness compared to the other two models. The data implies that uncertainty scores could potentially be used as a metric to assess the reliability of these language models, with lower uncertainty generally correlating with more accurate responses.
</details>
Figure 16: Comparison of average uncertainty scores between correct and incorrect options.
<details>
<summary>extracted/6246183/hedge-ratio-ecare.png Details</summary>

### Visual Description
## Bar and Line Chart: E-CARE: Avg. Ratio of Hedge Cues
### Overview
The image is a combination bar and line chart comparing the average ratio of hedge cues for three language models: ChatGPT, Llama 2 13B, and Llama 2 7B. The chart displays two sets of data: the ratio of hedge cues (represented by green bars) and the relative difference (represented by red bars and a dashed red line).
### Components/Axes
* **Title:** E-CARE: Avg. Ratio of Hedge Cues
* **X-axis:** Categorical axis representing the language models: ChatGPT, Llama 2 13B, Llama 2 7B.
* **Left Y-axis:** "Ratio", ranging from 0.00 to 0.04, with increments of 0.01.
* **Right Y-axis:** "Rel. Difference", ranging from 0% to 20%, with increments of 5%.
* **Data Series 1:** Ratio of Hedge Cues (Green Bars)
* **Data Series 2:** Relative Difference (Red Bars and Dashed Red Line)
### Detailed Analysis
* **ChatGPT:**
* Ratio of Hedge Cues (Green Bar): Approximately 0.032
* Relative Difference (Red Bar): Approximately 0.036
* Relative Difference (Red Line): Approximately 12%
* **Llama 2 13B:**
* Ratio of Hedge Cues (Green Bar): Approximately 0.040
* Relative Difference (Red Bar): Approximately 0.042
* Relative Difference (Red Line): Approximately 4%
* **Llama 2 7B:**
* Ratio of Hedge Cues (Green Bar): Approximately 0.035
* Relative Difference (Red Bar): Approximately 0.045
* Relative Difference (Red Line): Approximately 8%
**Trend Verification:**
* **Ratio of Hedge Cues (Green Bars):** Increases from ChatGPT to Llama 2 13B, then decreases slightly to Llama 2 7B.
* **Relative Difference (Red Bars):** Increases from ChatGPT to Llama 2 7B.
* **Relative Difference (Red Line):** Decreases from ChatGPT to Llama 2 13B, then increases to Llama 2 7B.
### Key Observations
* Llama 2 13B has the highest ratio of hedge cues.
* Llama 2 7B has the highest relative difference.
* The relative difference (red line) shows a decreasing trend from ChatGPT to Llama 2 13B, followed by an increasing trend to Llama 2 7B.
### Interpretation
The chart compares the average ratio of hedge cues for three language models. The green bars represent the ratio of hedge cues, while the red bars and red dashed line represent the relative difference. The data suggests that Llama 2 13B has the highest ratio of hedge cues, while Llama 2 7B has the highest relative difference. The relative difference line shows that the difference between the models is not consistent, with Llama 2 13B having a lower relative difference compared to ChatGPT and Llama 2 7B. This could indicate that Llama 2 13B is more consistent in its use of hedge cues compared to the other two models.
</details>
Figure 17: Comparison of the average ratio of hedge cues between correct and incorrect options.
<details>
<summary>extracted/6246183/hedge-distrib-ecare.png Details</summary>

### Visual Description
## Stacked Bar Chart: E-CARE: Hedge Cue Distrib.
### Overview
The image is a stacked bar chart comparing the distribution of hedge cues across three language models: Llama 2 7B, Llama 2 13B, and ChatGPT. The chart displays the proportion of three categories of hedge cues: Conditional (red), Doxatic (green), and Epistemic (blue). The x-axis represents the percentage, ranging from 0% to 100%. The y-axis lists the language models.
### Components/Axes
* **Title:** E-CARE: Hedge Cue Distrib.
* **Y-axis Labels:** Llama 2 7B, Llama 2 13B, ChatGPT
* **X-axis Labels:** 0%, 25%, 50%, 75%, 100%
* **Legend:** Located at the bottom of the chart.
* Conditional (red)
* Doxatic (green)
* Epistemic (blue)
### Detailed Analysis
Here's a breakdown of the percentage distribution for each language model:
* **Llama 2 7B:**
* Epistemic (blue): Approximately 25%
* Doxatic (green): Approximately 25%
* Conditional (red): Approximately 50%
* **Llama 2 13B:**
* Epistemic (blue): Approximately 30%
* Doxatic (green): Approximately 40%
* Conditional (red): Approximately 30%
* **ChatGPT:**
* Epistemic (blue): Approximately 40%
* Doxatic (green): Approximately 30%
* Conditional (red): Approximately 30%
### Key Observations
* Llama 2 7B has a significantly higher proportion of Conditional hedge cues compared to the other two models.
* ChatGPT has the highest proportion of Epistemic hedge cues among the three models.
* Llama 2 13B has a more balanced distribution of the three hedge cue categories compared to Llama 2 7B.
### Interpretation
The chart illustrates the different hedging strategies employed by the three language models. Llama 2 7B relies heavily on Conditional hedges, while ChatGPT favors Epistemic hedges. Llama 2 13B exhibits a more even distribution, suggesting a potentially more nuanced approach to hedging. These differences could reflect variations in the training data, model architecture, or specific design choices made during the development of each model. The data suggests that model size does not directly correlate with a specific hedging strategy, as Llama 2 7B and Llama 2 13B exhibit different distributions.
</details>
Figure 18: Distribution of hedge cues across incorrect explanations.
A.10.3 E-CARE Concept Drift
See Figure 14.
A.10.4 E-CARE Coherence
See Figure 15.
A.10.5 E-CARE Uncertainty
See Figure 16.
A.10.6 E-CARE Hedge Ratio
See Figure 17.
A.10.7 E-CARE Hedge Distribution
See Figure 18.
A.11 Causal Directionality
<details>
<summary>extracted/6246183/cause-effect-acc.png Details</summary>

### Visual Description
## Horizontal Bar Chart: LLM Accuracy by Type (Cause vs. Effect)
### Overview
The image is a horizontal bar chart comparing the accuracy of three Large Language Models (LLMs) - Llama 2 7B, Llama 2 13B, and ChatGPT - on two different tasks: identifying causes and identifying effects. Accuracy is measured as a percentage. The chart uses different colored bars to represent the accuracy for each task type (Cause and Effect) for each LLM.
### Components/Axes
* **Y-axis (Vertical):** Labeled "LLM", with the following categories:
* Llama 2 7B
* Llama 2 13B
* ChatGPT
* **X-axis (Horizontal):** Labeled "Accuracy (%)", representing the accuracy percentage.
* **Legend (Right Side):**
* Cause: Represented by a light green bar.
* Effect: Represented by a light blue bar.
* **Data Labels:** Accuracy percentages are displayed at the end of each bar.
### Detailed Analysis
* **Llama 2 7B:**
* Cause (light green): 64%
* Effect (light blue): 72%
* **Llama 2 13B:**
* Cause (light green): 72%
* Effect (light blue): 72%
* **ChatGPT:**
* Cause (light green): 71%
* Effect (light blue): 80%
### Key Observations
* Llama 2 13B has the same accuracy for both Cause and Effect tasks (72%).
* ChatGPT has the highest accuracy for Effect tasks (80%) compared to the other models.
* Llama 2 7B has the lowest accuracy for Cause tasks (64%) compared to the other models.
* For all models, the accuracy for identifying Effects is equal to or higher than the accuracy for identifying Causes.
### Interpretation
The chart suggests that different LLMs have varying levels of accuracy in identifying causes and effects. ChatGPT appears to be the most accurate at identifying effects, while Llama 2 7B struggles more with identifying causes. The consistent accuracy of Llama 2 13B across both tasks is also noteworthy. The data implies that the type of task (Cause vs. Effect) can influence the performance of an LLM. Further investigation could explore the specific characteristics of these tasks that lead to the observed differences in accuracy.
</details>
Figure 19: Accuracy in predicting the most plausible causes vs effects on COPA.
When considering causal directionality (i.e. cause vs effect), we observed that accuracy tended to differ between LLMs on COPA. In particular, we found both GPT and LLaMA 2 7B to be more accurate at predicting effects in causal scenarios (see Figure 19). We hypothesize that LLMs may struggle with causal sufficiency, as the space of potential causal explanations can be far greater than the range of effects once an event has been observed. This hypothesis is partly supported by the fact that GPT and LLaMA 2 7B express greater linguistic uncertainty and produce more complex explanations when predicting causes rather than effects.
A.12 Dataset Details
COPA is released under a BSD-2 license and made available for broad research usage with copyright notification restrictions people.ict.usc.edu/~gordon/copa.html. We do not modify or use COPA outside of its intended use which is primarily open-domain commonsense causal reasoning. E-CARE is released under the MIT license and can be used for broad purposes with copyright notification restrictions github.com/Waste-Wood/e-CARE?tab=MIT-1-ov-file#readme. We do not modify or use E-CARE outside of its intended use which is causal reasoning evaluation of language models.