# Inference to the Best Explanation in Large Language Models
**Contact**: d.dalal1@universityofgalway.ie
Abstract
While Large Language Models (LLMs) have found success in real-world applications, their underlying explanatory process is still poorly understood. This paper proposes IBE-Eval, a framework inspired by philosophical accounts of Inference to the Best Explanation (IBE) to advance the interpretation and evaluation of LLM explanations. IBE-Eval estimates the plausibility of natural language explanations through a combination of explicit logical and linguistic features including: consistency, parsimony, coherence, and uncertainty. Extensive experiments are conducted on Causal Question Answering (CQA), where IBE-Eval is tasked to select the most plausible causal explanation amongst competing ones generated by the LLM (e.g. GPT 3.5 or LLaMA 2). The experiments reveal that IBE-Eval can successfully identify the best explanation with up to 77% accuracy (+27% above random), improving upon a GPT 3.5-as-a-judge baseline (+17%) while being intrinsically more efficient and interpretable. Additional analysis suggests that, despite LLM-specific variances, generated explanations tend to conform to IBE criteria and that IBE-Eval is significantly correlated with human judgment, opening up opportunities for future development of automated explanation verification tools.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Causal Reasoning with Large Language Model (LLM)
### Overview
This diagram illustrates a process for causal reasoning, utilizing a Large Language Model (LLM) to evaluate competing hypotheses and explanations for a given causal question. The diagram outlines a causal question, competing hypotheses, an explanation prompt, competing explanations generated by the LLM, and an Inference to the Best Explanation (IBE) evaluation based on selection criteria.
### Components/Axes
The diagram is segmented into four main sections:
1. **Causal Question:** Presents the initial problem: "The balloon expanded. What was the cause? A) I blew into it. B) I pricked it."
2. **Competing Hypotheses:** States two premises and their conclusions:
* Premise 1: "I blew into the balloon. Conclusion: The balloon expanded."
* Premise 2: "I pricked the balloon. Conclusion: The balloon expanded."
3. **Explanation Prompt:** Describes the instructions given to the LLM: "For the scenario provided, identify which option is the most plausible cause of the context. Let's think step-by-step and generate an explanation for each option. Treat each option as the premise and the provided context as the conclusion. Generate a short step-by-step logical proof that explains how the premise can result in the conclusion. For each step provide an IF-THEN rule and the underlying causal or commonsense assumption."
4. **Competing Explanations & Inference to the Best Explanation (IBE):** This section is further divided into two explanation blocks (E1 and E2) and an IBE evaluation. Each explanation block consists of a step-by-step explanation with assumptions. The IBE section evaluates each explanation based on four criteria: Consistency, Parsimony, Coherence, and Uncertainty.
### Detailed Analysis or Content Details
**Causal Question:**
* Question: "The balloon expanded. What was the cause?"
* Options: A) I blew into it. B) I pricked it.
**Competing Hypotheses:**
* Hypothesis 1: Premise - "I blew into the balloon." Conclusion - "The balloon expanded."
* Hypothesis 2: Premise - "I pricked the balloon." Conclusion - "The balloon expanded."
**Explanation Prompt:** (Transcribed as is)
"For the scenario provided, identify which option is the most plausible cause of the context. Let's think step-by-step and generate an explanation for each option. Treat each option as the premise and the provided context as the conclusion. Generate a short step-by-step logical proof that explains how the premise can result in the conclusion. For each step provide an IF-THEN rule and the underlying causal or commonsense assumption."
**Explanation 1 (E1):**
* Step 1: "IF someone blows into a balloon, THEN it can cause the balloon to inflate." Assumption: "Blowing air into a balloon increases the amount of air inside it, leading to inflation."
* Step 2: "IF the balloon inflates, THEN it can cause the balloon to expand." Assumption: "When a balloon inflates, it stretches and expands in size."
* Conclusion: "Therefore, since I blew into the balloon, it caused the balloon to inflate, which resulted in its expansion."
**Explanation 2 (E2):**
* Step 1: "IF a balloon is pricked, THEN the balloon may deflate." Assumption: "Pricking a balloon can cause it to lose air and deflate."
* Step 2: "IF a balloon deflates, THEN there is a decrease in air pressure inside the balloon." Assumption: "When a balloon deflates, the air pressure inside the balloon decreases."
* Step 3: "IF there is a decrease in air pressure inside the balloon, THEN the external air pressure will cause the balloon to expand." Assumption: "When the air pressure inside a balloon is lower than the external air pressure, the balloon will expand."
* Conclusion: "Therefore, since the balloon was pricked, it may have deflated, resulting in a decrease in air pressure inside the balloon, causing the external air pressure to make the balloon expand."
**Inference to the Best Explanation (IBE):**
The IBE section presents evaluation scores for each explanation (E1 and E2) across four criteria. The values are as follows:
| Criteria | E1 | E2 |
|---|---|---|
| **Consistency** | 1.0 | 1.0 |
| **Parsimony** | -2.0 | -3.0 |
| **Coherence** | 0.51 | 0.28 |
| **Uncertainty** | 2.0 | 3.0 |
The IBE section visually indicates that E1 is favored over E2, with E1 positioned higher in the diagram.
### Key Observations
* The LLM generates plausible explanations for both hypotheses.
* Explanation 1 (blowing into the balloon) receives higher scores for Parsimony and Coherence, suggesting it is a simpler and more logically consistent explanation.
* Explanation 2 (pricking the balloon) has higher Uncertainty, indicating a less certain causal link.
* The IBE process uses numerical scores to quantify the quality of each explanation based on defined criteria.
* The diagram uses dashed arrows to indicate the flow of information and reasoning.
### Interpretation
This diagram demonstrates a structured approach to causal reasoning, leveraging an LLM to generate and evaluate competing explanations. The IBE framework provides a quantitative method for comparing explanations based on key criteria. The results suggest that, in this scenario, the LLM favors the explanation that blowing into the balloon caused the expansion, as it is more parsimonious and coherent than the explanation involving pricking the balloon. The higher uncertainty associated with the pricking explanation likely stems from the counterintuitive nature of a pricked balloon *expanding*; it typically deflates. The diagram highlights the LLM's ability to not only generate explanations but also to assess their plausibility based on logical principles. The use of IF-THEN rules and underlying assumptions explicitly reveals the reasoning process, making it transparent and auditable. This approach could be valuable in various domains where causal inference is critical, such as scientific discovery, troubleshooting, and decision-making. The diagram is a visual representation of a cognitive process, showing how an LLM can be used to simulate and support human reasoning.
</details>
Figure 1: IBE-Eval qualifies LLM-generated explanations with a set of logical and linguistic selection criteria to identify the most plausible hypothesis. The corresponding explanation for each hypothesis is evaluated across the IBE criteria of logical consistency, parsimony, internal coherence, and linguistic uncertainty. A final plausibility score is computed across those features, and the hypothesis with the highest score is identified as the best explanation.
1 Introduction
Large Language Models (LLMs) such as OpenAI's GPT Brown et al. (2020) and LLaMA Touvron et al. (2023) have been highly effective across a diverse range of language understanding and reasoning tasks Liang et al. (2023). While LLM performance has been thoroughly investigated across various benchmarks Wang et al. (2019); Srivastava et al. (2023); Gao et al. (2023); Touvron et al. (2023), the principles and properties behind their step-wise reasoning process are still poorly understood Valentino et al. (2021). LLMs are notoriously black-box and can be difficult to interpret Chakraborty et al. (2017); Danilevsky et al. (2020). Moreover, the commercialization of LLMs has led to strategic secrecy around model architectures and training details Xiang (2023); Knight (2023). Finally, LLMs are susceptible to hallucinations and adversarial perturbations Geirhos et al. (2020); Camburu et al. (2020), often producing plausible but factually incorrect answers Ji et al. (2023); Huang et al. (2023). As the size and complexity of LLM architectures increase, the systematic study of generated explanations becomes crucial to better interpret and validate the LLM's internal inference and reasoning processes Wei et al. (2022b); Lampinen et al. (2022); Huang and Chang (2022).
The automatic evaluation of natural language explanations presents several challenges Atanasova et al. (2023); Camburu et al. (2020). Without resource-intensive annotation Wiegreffe and Marasovic (2021); Thayaparan et al. (2020); Dalvi et al. (2021); Camburu et al. (2018), explanation quality methods tend to rely on either weak supervision, where the identification of the correct answer is taken as evidence of explanation quality, or require the injection of domain-specific knowledge Quan et al. (2024). In this paper, we seek to better understand the LLM explanatory process through the investigation of explicit linguistic and logical properties. While explanations are hard to formalize due to their open-ended nature, we hypothesize that they can be analyzed as linguistic objects, with measurable features that can serve to define criteria for assessing their quality.
Specifically, this paper investigates the following overarching research question: "Can the linguistic and logical properties associated with LLM-generated explanations be used to qualify the models' reasoning process?". To this end, we propose an interpretable framework inspired by philosophical accounts of abductive inference, also known as Inference to the Best Explanation (IBE) - i.e. the process of selecting among competing explanatory theories Lipton (2017). In particular, we aim to measure the extent to which LLM-generated explanations conform to IBE expectations when attempting to identify the most plausible explanation. Concretely, we present IBE-Eval, a framework designed to estimate the plausibility of natural language explanations through a set of explicit logical and linguistic features, namely: logical consistency, parsimony, coherence, and linguistic uncertainty.
To evaluate the efficacy of IBE-Eval, we conduct extensive experiments in the multiple-choice Causal Question Answering (CQA) setting. The overall results and contributions of the paper can be summarized as follows:
1. To the best of our knowledge, we are the first to propose an interpretable framework inspired by philosophical accounts of Inference to the Best Explanation (IBE) to automatically assess the quality of natural language explanations.
1. We propose IBE-Eval, a framework that can be instantiated with external tools for the automatic evaluation of LLM-generated explanations and the identification of the best explanation in a multiple-choice CQA setting.
1. We provide empirical evidence that LLM-generated explanations tend to conform to IBE expectations, with varying levels of statistical significance correlated to the LLM's size.
1. We additionally find that uncertainty, parsimony, and coherence are the best predictors of plausibility and explanation quality across all LLMs. However, we also find that the LLMs tend to be strong rationalizers and can produce logically consistent explanations even for less plausible candidates, making the consistency metric less effective in practice.
1. IBE-Eval can successfully identify the best explanation supporting the correct answers with up to 77% accuracy (+27% above random and +17% over the GPT 3.5-as-a-Judge baseline).
1. IBE-Eval is significantly correlated with human judgment, outperforming a GPT3.5-as-a-Judge baseline in terms of alignment with human preferences.
For reproducibility, our code is made available on GitHub (https://github.com/dhairyadalal/IBE-eval) to encourage future research in the field.
2 Inference to the Best Explanation (IBE)
Explanatory reasoning is a distinctive feature of human rationality underpinning problem-solving and knowledge creation in both science and everyday scenarios Lombrozo (2012); Deutsch (2011). Accepted epistemological accounts characterize the creation of an explanation as composed of two distinct phases: conjecturing and criticism Popper (2014). The explanatory process always involves a conflict between plausible explanations, which is typically resolved through the criticism phase via a selection process, where competing explanations are assessed according to a set of criteria such as parsimony, coherence, unification power, and hardness to variation Lipton (2017); Harman (1965); Mackonis (2013); Thagard (1978, 1989); Kitcher (1989); Valentino and Freitas (2022).
As LLMs become interfaces for natural language explanations, epistemological frameworks offer an opportunity for developing criticism mechanisms to understand the explanatory process underlying state-of-the-art models. To this end, this paper considers an LLM as a conjecture device producing linguistic objects that can be subject to criticism. In particular, we focus on a subset of criteria that can be computed on explicit linguistic and logical features, namely: consistency, parsimony, coherence, and uncertainty.
To assess the LLM's alignment to such criteria, we focus on the task of selecting among competing explanations in a multiple-choice CQA setting (Figure 1). Specifically, given a set of competing hypotheses (i.e. the multiple-choice options), $H=\{h_{1},h_{2},...,h_{n}\}$ , we prompt the LLM to generate plausible explanations supporting each hypothesis (Section 3). Subsequently, we adopt the proposed IBE selection criteria to assess the quality of the generated explanations (Section 4). IBE-Eval computes an explanation plausibility score derived from the linear combination of the computed selection criteria. The explanation with the highest score is selected as the predicted answer; we additionally assess the extent to which the observable IBE features correlate with QA accuracy. We hypothesize that IBE-Eval will produce higher scores for the explanation associated with the correct answer and that the IBE criteria should meaningfully differentiate between competing explanations.
3 Explanation Generation
For the first stage, the LLM is prompted to generate competing explanations for the hypotheses using a modified Chain-of-Thought (CoT) prompt Wei et al. (2022a). Specifically, the CoT prompt is modified to instruct the LLM to produce an explanation for each competing hypothesis (see Figure 1). We adopt a methodology similar to Valentino et al. (2021), where the generated explanation is constrained into an entailment form for the downstream IBE evaluation. In particular, we posit that a valid explanation should demonstrate an entailment relationship between the premise and conclusion, which are derived from the question-answer pair.
To elicit logical connections between explanation steps and facilitate subsequent analysis, the LLM is constrained to use weak syllogisms expressed as If-Then statements. Additionally, the LLM is instructed to produce the associated causal or commonsense assumption underlying each explanation step. This output is then post-processed to extract the explanation steps and supporting knowledge for evaluation via the IBE selection criteria. Additional details and examples of prompts are reported in Appendix A.2.
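The prompt assembly described above can be sketched as follows. The instruction text is transcribed from Figure 1; the helper name and the `Context:`/`Option:` formatting are illustrative assumptions (the actual prompts are in Appendix A.2):

```python
def build_explanation_prompt(context: str, option: str) -> str:
    """Assemble the modified CoT explanation prompt for one hypothesis.

    The instruction template is transcribed from Figure 1; the exact
    formatting used in the paper may differ (see Appendix A.2).
    """
    instructions = (
        "For the scenario provided, identify which option is the most "
        "plausible cause of the context. Let's think step-by-step and "
        "generate an explanation for each option. Treat each option as the "
        "premise and the provided context as the conclusion. Generate a "
        "short step-by-step logical proof that explains how the premise "
        "can result in the conclusion. For each step provide an IF-THEN "
        "rule and the underlying causal or commonsense assumption."
    )
    return f"{instructions}\n\nContext: {context}\nOption: {option}"

# One prompt per competing hypothesis (balloon example from Figure 1).
prompt = build_explanation_prompt("The balloon expanded.", "I blew into it.")
```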
4 Linguistic & Inference Criteria
To perform IBE, we investigate a set of criteria that can be automatically computed on explicit logical and linguistic features, namely: consistency, parsimony, coherence, and uncertainty.
Consistency.
Consistency aims to verify whether the explanation is logically valid. Given a hypothesis comprising a premise $p_{i}$ , a conclusion $c_{i}$ , and an explanation consisting of a set of If-Then statements $E=\{s_{1},...,s_{n}\}$ , we define $E$ to be logically consistent if $p_{i}\cup E\vDash c_{i}$ . Specifically, an explanation is logically consistent if it is possible to build a deductive proof linking premise and conclusion.
To evaluate logical consistency, we leverage external symbolic solvers along with autoformalization - i.e., the translation of natural language into a formal language Wu et al. (2022). Specifically, the hypotheses and explanations are formalized into a Prolog program which will attempt to generate a deductive proof via backward chaining Weber et al. (2019).
To perform autoformalization, we leverage the translation capabilities of GPT 3.5. Specifically, we instruct GPT 3.5 to convert each If-Then explanation step from the generated explanation into an implication rule and the premise statement into grounding atoms. On the other end, the entailment condition and the conclusion are used to create a Prolog query. The query instructs the Prolog solver to attempt to find a path through the implication rules such that the conclusion can be directly connected to the premise. Further details about the autoformalization process can be found in Appendix A.3.
After autoformalization, following recent work on neuro-symbolic integration for LLM explanations Quan et al. (2024), we adopt an external Prolog solver for entailment verification https://github.com/neuro-symbolic-ai/explanation_based_ethical_reasoning. The explanation is considered consistent if the Prolog solver can satisfy the query and successfully build a deductive proof. Technical details can be found in Appendix A.5.
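The backward-chaining check performed by the Prolog solver can be illustrated with a minimal sketch. The paper delegates this to an external Prolog engine; the toy propositional version below (names and simplifications are ours) only demonstrates the idea of searching backward from the conclusion to the premise through the If-Then rules:

```python
def entails(premise: str, conclusion: str, rules: list[tuple[str, str]]) -> bool:
    """Return True if `conclusion` is derivable from `premise` via `rules`.

    `rules` is a list of (if_atom, then_atom) pairs extracted from the
    LLM's If-Then explanation steps. This is a propositional stand-in for
    the Prolog backward-chaining proof used in the paper.
    """
    def prove(goal: str, visited: set[str]) -> bool:
        if goal == premise:
            return True
        if goal in visited:
            return False  # avoid cycles in the rule graph
        visited.add(goal)
        # Backward chaining: try every rule whose consequent matches the goal.
        return any(prove(body, visited) for body, head in rules if head == goal)

    return prove(conclusion, set())

# Explanation E1 from Figure 1, reduced to propositional atoms.
rules_e1 = [
    ("blew_into_balloon", "balloon_inflates"),
    ("balloon_inflates", "balloon_expands"),
]
print(entails("blew_into_balloon", "balloon_expands", rules_e1))  # True
```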
Parsimony.
The parsimony principle, also known as Ockham's razor, favors the selection of the simplest explanation consisting of the fewest elements and assumptions Sober (1981). Epistemological accounts posit that an explanation with fewer assumptions tends to leave fewer statements unexplained, improving specificity and alleviating the infinite regress Thagard (1978). Further, parsimony is an essential feature of causal interpretability, as only parsimonious solutions are guaranteed to reflect causation in comparative analysis Baumgartner (2015). In this paper, we adopt two metrics as a proxy of parsimony, namely proof depth and concept drift. Proof depth, denoted as $Depth$ , is defined as the cardinality of the set of rules, $R$ , required by the Prolog solver to connect the conclusion to the premise via backward chaining. Let $h$ be a hypothesis candidate composed of a premise $p$ and a conclusion $c$ , and let $E$ be a formalized explanation represented as a set of rules $R^{\prime}$ . The proof depth is the number of rules $|R|$ , with $R\subseteq R^{\prime}$ , traversed during backward chaining to connect the conclusion $c$ to the premise $p$ :
$$
Depth(h)=|R|
$$
Concept drift, denoted as $Drift$ , is defined as the number of additional concepts and entities, outside the ones appearing in the hypothesis (i.e., premise and conclusion), that are introduced by the LLM to support the entailment. For simplicity, we consider nouns as concepts. Let $Noun_{p}$ , $Noun_{c}$ , and $Noun_{E}$ be the sets of unique nouns found in the premise, conclusion, and explanation steps, respectively. Concept drift is the cardinality of the set difference between the nouns found in the explanation and the nouns in the hypothesis:
$$
Drift(h)=|Noun_{E}\setminus(Noun_{p}\cup Noun_{c})|
$$
Intuitively, the parsimony principle would predict the most plausible hypothesis as the one supported by an explanation with the smallest observed proof depth and concept drift. Implementation details can be found in Appendix A.6.
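The drift computation can be sketched directly from the definition above. In practice the noun sets would come from a POS tagger; here they are supplied by hand for illustration, and the function name is ours:

```python
def concept_drift(premise_nouns: set, conclusion_nouns: set,
                  explanation_nouns: set) -> int:
    """Drift(h): nouns introduced by the explanation beyond the hypothesis.

    In the paper's pipeline the noun sets would be extracted with a POS
    tagger; here they are passed in directly for illustration.
    """
    hypothesis_nouns = premise_nouns | conclusion_nouns
    return len(explanation_nouns - hypothesis_nouns)

# Explanation E2 from Figure 1 introduces "air" and "pressure" beyond the
# single hypothesis noun "balloon", yielding a drift of 2.
drift_e2 = concept_drift({"balloon"}, {"balloon"}, {"balloon", "air", "pressure"})
print(drift_e2)  # 2
```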
Coherence.
Coherence measures logical validity at the level of the individual explanation steps. An explanation can be formally consistent on the surface while still including implausible or ungrounded intermediate assumptions. Coherence evaluates the quality of each intermediate If-Then implication by measuring the entailment strength between the If and Then clauses. To this end, we employ a fine-tuned natural language inference (NLI) model. Formally, let $S$ be a set of explanation steps, where each step $s$ consists of an If-Then statement, $s=(If_{s},Then_{s})$ . For a given step $s_{i}$ , let $ES(s_{i})$ denote the entailment score obtained via the NLI model between the $If_{s}$ and $Then_{s}$ clauses. The step-wise entailment score $SWE(S)$ is then calculated as the average of the entailment scores across all $|S|$ explanation steps:
$$
\text{SWE}(S)=\frac{1}{|S|}\sum_{i=1}^{|S|}\text{ES}(s_{i})
$$
We hypothesize that explanations supporting more plausible hypotheses should receive higher coherence scores, as such explanations should exhibit stronger step-wise entailment. Additional details can be found in Appendix A.7.
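The SWE computation reduces to averaging per-step entailment scores. A minimal sketch, with the NLI scorer injected as a callable so the example stays self-contained (in practice this would be a fine-tuned NLI model, e.g. via a `transformers` text-classification pipeline):

```python
from typing import Callable

def step_wise_entailment(steps: list[tuple[str, str]],
                         entail_score: Callable[[str, str], float]) -> float:
    """SWE(S): mean entailment score ES(s_i) across all If-Then steps.

    `entail_score(premise, hypothesis)` stands in for a fine-tuned NLI
    model; any callable returning a probability in [0, 1] works here.
    """
    scores = [entail_score(if_clause, then_clause) for if_clause, then_clause in steps]
    return sum(scores) / len(scores)

# Dummy scorer standing in for the NLI model, with illustrative values.
toy_scores = {("a", "b"): 0.9, ("b", "c"): 0.3}
swe = step_wise_entailment([("a", "b"), ("b", "c")],
                           lambda p, h: toy_scores[(p, h)])
print(round(swe, 2))  # 0.6
```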
Uncertainty.
Finally, we consider the linguistic certainty expressed in the generated explanation as a proxy for plausibility. Hedging words such as probably, might be, could be, etc., typically signal ambiguity and are often used when the truth condition of a statement is unknown or improbable. Pei and Jurgens (2021) found that the strength of scientific claims in research papers is strongly correlated with the use of direct language. In contrast, they found that the use of hedging language suggested that the veracity of the claim was weaker or highly contextualized.
To measure the linguistic uncertainty ( $UC$ ) of an explanation, we consider the explanationâs underlying assumptions ( $A_{i}$ ) and the overall explanation summary ( $S$ ). The linguistic uncertainty score is extracted using the fine-tuned sentence-level RoBERTa model from Pei and Jurgens (2021). The overall linguistic uncertainty score ( $UC_{\text{overall}}$ ) is the sum of the assumption and explanation summary scores:
$$
UC_{\text{overall}}=UC(A)+UC(S)
$$
where $UC(A)$ is the sum of the linguistic uncertainty scores $UC(a_{i})$ across all $|A|$ assumptions associated with the explanation steps:
$$
UC(A)=\sum_{i=1}^{|A|}UC(a_{i})
$$
and $UC(S)$ is the linguistic uncertainty of the explanation summary. We hypothesize that the LLM will use more hedging language when explaining the weaker hypothesis, resulting in a higher uncertainty score. Further details can be found in Appendix A.8.
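The aggregation of $UC_{\text{overall}}$ is a straightforward sum. In the paper the per-sentence scores come from the fine-tuned RoBERTa certainty model of Pei and Jurgens (2021); the scores below are illustrative placeholders:

```python
def overall_uncertainty(assumption_scores: list[float],
                        summary_score: float) -> float:
    """UC_overall = sum_i UC(a_i) + UC(S).

    Per-sentence uncertainty scores would come from the sentence-level
    RoBERTa model of Pei and Jurgens (2021); here they are given directly.
    """
    return sum(assumption_scores) + summary_score

# Three hedged step assumptions plus the explanation summary (toy values).
uc = overall_uncertainty([1.0, 1.5, 0.5], 0.5)
print(uc)  # 3.5
```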
4.1 Inference to Best Explanation
After the IBE criteria are computed for each competing hypothesis, they are used to generate the final explanation plausibility score. We define a simple linear regression model $\theta(\cdot)$ , fitted on a small set of training examples consisting of extracted IBE features, to predict the probability that an explanation $E_{i}$ corresponds to the correct answer. Specifically, we employ IBE-Eval to score each generated explanation independently and then select the final answer $a$ via argmax:
$$
a=\operatorname*{argmax}_{i}[\theta(E_{1}),\ldots,\theta(E_{n})]
$$
Additional details can be found in Appendix A.9.
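The selection step can be sketched as a linear scorer followed by an argmax. The weights below are illustrative only (the paper fits them on a small training set; the actual coefficients are not reported here), with signs chosen to match the direction of the correlations in Figure 2:

```python
def ibe_eval_select(feature_rows: list[list[float]],
                    weights: list[float], bias: float = 0.0):
    """Score each explanation with a linear model theta(.) and pick argmax.

    `feature_rows[i]` holds the IBE features of explanation E_i in the
    order [consistency, depth, drift, coherence, uncertainty]. The
    weights are illustrative, not the paper's fitted coefficients.
    """
    def theta(x: list[float]) -> float:
        return bias + sum(w * f for w, f in zip(weights, x))

    scores = [theta(x) for x in feature_rows]
    return max(range(len(scores)), key=scores.__getitem__), scores

# Signs follow Figure 2: coherence/consistency help, depth/drift/uncertainty hurt.
weights = [0.2, -0.3, -0.3, 0.5, -0.4]
e1 = [1.0, 2, 0, 0.51, 2.0]  # E1 features from Figure 1 (drift assumed 0)
e2 = [1.0, 3, 2, 0.28, 3.0]  # E2 features from Figure 1 (drift assumed 2)
best, _ = ibe_eval_select([e1, e2], weights)
print(best)  # 0, i.e. E1 is selected, matching Figure 1
```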
<details>
<summary>extracted/6246183/correlation.png Details</summary>

### Visual Description
## Heatmap: Correlation Analysis of LLM Performance
### Overview
The image presents two heatmaps displaying correlation coefficients between different Large Language Models (LLMs) â LLaMA 2 7B, LLaMA 2 13B, and GPT 3.5 â and various evaluation metrics. The two heatmaps correspond to two different evaluation datasets: COPA and E-CARE. The color intensity represents the strength and direction of the correlation, with green indicating positive correlation and red indicating negative correlation. Significance levels are indicated by asterisks.
### Components/Axes
* **Y-axis:** LLM names (LLaMA 2 7B, LLaMA 2 13B, GPT 3.5).
* **X-axis:** Evaluation metrics (Consistency, Depth, Coherence, Uncertainty, Drift).
* **Color Scale:** Ranges from -2.5 (dark red) to +2.5 (dark green), with 0 represented by white.
* **Significance Markers:**
* `*`: p < 0.05
* `**`: p < 0.01
* `***`: p < 0.001
* **Titles:**
* Left heatmap: "COPA"
* Right heatmap: "E-CARE"
* **Central Label:** "Corr." (indicating correlation)
### Detailed Analysis or Content Details
**COPA Heatmap (Left)**
* **LLaMA 2 7B:**
* Consistency: 1.37
* Depth: -2.95**
* Coherence: 1.22
* Uncertainty: -3.10**
* Drift: -0.27
* **LLaMA 2 13B:**
* Consistency: 1.36
* Depth: -1.28
* Coherence: 3.87***
* Uncertainty: -2.17*
* Drift: -3.33***
* **GPT 3.5:**
* Consistency: 4.67***
* Depth: -4.893***
* Coherence: 3.60***
* Uncertainty: -4.34***
* Drift: -3.22***
**E-CARE Heatmap (Right)**
* **LLaMA 2 7B:**
* Consistency: 0.20
* Depth: -0.53
* Coherence: 2.18*
* Uncertainty: -2.11**
* Drift: -0.78*
* **LLaMA 2 13B:**
* Consistency: 1.167
* Depth: -1.18
* Coherence: 1.67*
* Uncertainty: -1.52*
* Drift: -1.91*
* **GPT 3.5:**
* Consistency: 3.10**
* Depth: -2.91*
* Coherence: 0.98
* Uncertainty: -2.61*
* Drift: -5.14***
### Key Observations
* **GPT 3.5 consistently shows strong correlations (positive or negative) with all metrics in both datasets.** It exhibits the highest absolute correlation values across most metrics.
* **Depth consistently shows a negative coefficient for all models in both datasets.** This suggests that deeper proofs (more rules needed to connect premise and conclusion) are associated with lower question accuracy.
* **Coherence shows positive correlation with LLaMA 2 13B and GPT 3.5 in both datasets.**
* **Uncertainty shows negative correlation with all models in both datasets.**
* **Drift shows negative correlation with all models in both datasets.**
* **LLaMA 2 7B generally has lower correlation values compared to LLaMA 2 13B and GPT 3.5.**
* **The significance levels (asterisks) indicate that many of the correlations are statistically significant.**
### Interpretation
The heatmaps reveal how different LLMs perform across various evaluation metrics, as measured by the COPA and E-CARE datasets. The correlation coefficients quantify the linear relationship between the LLM's performance and each metric.
The consistent negative coefficient for "Depth" suggests that longer, less parsimonious proofs tend to support less plausible explanations. The strong correlations observed for GPT 3.5 indicate that its performance is more predictably related to these metrics than the LLaMA models.
The differences between the COPA and E-CARE heatmaps suggest that the correlation patterns are dataset-dependent. The E-CARE dataset appears to show weaker overall correlations compared to COPA, potentially indicating that the E-CARE metrics are less sensitive to the specific characteristics of these LLMs.
The varying significance levels highlight which correlations are statistically robust and likely represent genuine relationships, rather than random chance. The LLaMA 2 7B model consistently shows weaker correlations, suggesting it may be less sensitive to the nuances captured by these metrics compared to the larger models. The presence of statistically significant correlations (indicated by asterisks) provides evidence for the validity of these observed relationships.
</details>
Figure 2: A regression analysis measuring the correlation between IBE criteria and question accuracy. All the LLMs tend to conform to IBE expectations with GPT 3.5 exhibiting the most consistent and significant alignment. Linguistic uncertainty is the strongest IBE predictor for explanation quality, where higher uncertainty is negatively correlated with question accuracy. Statistical significance is noted as: "***" p < 0.001, "**" p < 0.01, "*" p < 0.05.
| Baselines | COPA GPT 3.5 | COPA LLaMA 2 13B | COPA LLaMA 2 7B | E-CARE GPT 3.5 | E-CARE LLaMA 2 13B | E-CARE LLaMA 2 7B |
| --- | --- | --- | --- | --- | --- | --- |
| GPT3.5 Judge | .59 | .47 | .63 | .43 | .61 | .52 |
| Human | .95 | 1.0 | .91 | .90 | .91 | .92 |
| IBE Features | | | | | | |
| Consistency | .51 | .52 | .55 | .54 | .54 | .54 |
| Depth (Parsimony) | .67 | .53 | .63 | .66 | .56 | .54 |
| Drift (Parsimony) | .67 | .63 | .58 | .66 | .57 | .57 |
| Coherence | .66 | .66 | .56 | .56 | .57 | .59 |
| Linguistic Uncertainty | .70 | .65 | .61 | .59 | .56 | .60 |
| Composed Model | | | | | | |
| Random | .50 | .50 | .50 | .50 | .50 | .50 |
| + Consistency | .51 | .52 | .55 | .54 | .54 | .54 |
| + Depth | .67 | .53 | .63 | .66 | .56 | .56 |
| + Drift | .70 | .65 | .65 | .72 | .66 | .65 |
| + Coherence | .73 | .71 | .69 | .73 | .68 | .69 |
| + Linguistic Uncertainty | .77 | .74 | .70 | .74 | .70 | .73 |
Table 1: An ablation study and evaluation of the IBE criteria and the composed IBE-Eval model. IBE-Eval outperforms the GPT 3.5 Judge baseline by an average of +17.5% across all models and tasks.
5 Experimental Setting
Causal Question-Answering (CQA) requires reasoning about the causes and effects given an event description. We specifically consider the task of cause and effect prediction in a multiple-choice setting, where given a question and two candidate answers, the LLM must decide which is the most plausible cause or effect. Causal reasoning is a challenging task as the model must both possess commonsense knowledge about causal relationships and consider the event context which would make one option more plausible than the other. For our experiments, we use the Choice of Plausible Alternatives (COPA) Gordon et al. (2012) and E-CARE Du et al. (2022) datasets.
COPA.
COPA is a multiple-choice commonsense causal QA dataset consisting of 500 manually generated train and test examples. Each multiple-choice example consists of a question premise and a set of answer candidates which are potential causes or effects of the premise. COPA is a well-established causal reasoning benchmark that is part of both SuperGLUE Wang et al. (2019) and CALM-Bench Dalal et al. (2023).
E-CARE.
E-CARE is a large-scale multiple-choice causal crowd-sourced QA dataset consisting of 15K train and 2k test examples. Similar to COPA, the task requires the selection of the most likely cause or effect provided an event description. We randomly sample 500 examples from the E-CARE test set for our experiments.
LLMs.
We consider GPT-Turbo-3.5, LLaMA 2 13B, and LLaMA 2 7B for all experiments. GPT 3.5 is a proprietary model Brown et al. (2020) and is highly effective across a wide range of natural language reasoning tasks Laskar et al. (2023). We additionally evaluate the open-source LLaMA 2 model Touvron et al. (2023), considering both the 13B and 7B variants, as both are seen as viable commodity GPT alternatives and have been widely adopted by the research community for LLM benchmarking and evaluation.
Baselines.
We employ LLM-as-a-Judge Zheng et al. (2023) and human evaluators as baseline methods for selecting the best explanation in the CQA setting. Zheng et al. (2023) found that LLMs can align with human judgment and be utilized for automated evaluation. We specifically use GPT 3.5 as the LLM judge. For each CQA example, we present the judges with the two competing explanations generated by the target LLM and ask them to identify the most plausible one. Additional details about the baselines can be found in Appendix A.4.
6 Preliminary Analysis
We conduct a preliminary analysis as a sanity check to measure the extent to which LLMs generate self-evident or tautological explanations - i.e., explanations that simply restate the premises and conclusions. Tautological explanations present a risk for IBE-Eval, as the metrics would be theoretically uninformative if the LLM adopts the tested causal relation as the explanation itself (e.g. A → B) without providing additional supporting statements.
We use the parsimony metric to compute the percentage of explanations with a proof depth equal to 1 (i.e., explanations containing only one inference step) and a concept drift equal to 0 (i.e., no concepts other than the ones stated in the premises and conclusions appear in the explanation). In such cases, the LLM is effectively generating a self-evident or tautological explanation.
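As a rough sketch, this tautology check can be expressed as a filter over pre-computed parsimony metrics. The dictionary fields below are illustrative stand-ins, not the framework's actual data format:

```python
# Hypothetical sketch: flagging self-evident (tautological) explanations
# from pre-computed parsimony metrics.

def is_self_evident(depth: int, drift: int) -> bool:
    """An explanation is tautological if it has a single inference step
    and introduces no concepts beyond the premises and conclusion."""
    return depth == 1 and drift == 0

def tautology_rate(metrics) -> float:
    """Fraction of explanations flagged as self-evident."""
    flagged = [m for m in metrics if is_self_evident(m["depth"], m["drift"])]
    return len(flagged) / len(metrics)

explanations = [
    {"depth": 1, "drift": 0},  # restates the causal relation only
    {"depth": 3, "drift": 4},
    {"depth": 2, "drift": 1},
]
print(tautology_rate(explanations))  # 1 of 3 flagged
```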
We found that about 2% of the cases consist of self-evident explanations. For GPT 3.5, LLaMA 2 13B, and LLaMA 2 7B, 2% of the generated explanations exhibit a concept drift of 0, and on average 1.5% of the explanations have a proof depth of 1. We then conducted an error analysis to evaluate the cases where IBE-Eval selected a self-evident explanation as the best one. Across all LLMs, less than 0.1% of the errors were caused by the selection of such explanations. Our analysis suggests that the impact of self-evident explanations is not significant and that the IBE framework can be robustly applied to identify such cases.
7 Results
To assess the LLMs' alignment with the proposed IBE framework and evaluate the efficacy of IBE-Eval, we run a regression analysis and conduct a set of ablation studies to evaluate the relationship between the IBE criteria and question accuracy. The main results are presented in Figure 2 and Table 1.
Our regression analysis finds that the IBE criteria behave consistently across the LLMs, as demonstrated by similar correlation patterns on both the COPA and E-CARE tasks (Figure 2). GPT 3.5 exhibits the strongest alignment with IBE expectations: nearly all the IBE criteria have statistically significant and directionally aligned correlations across both tasks. Our proposed IBE criteria can therefore serve as promising building blocks for future work on automated explanation evaluation.
In Table 1 we evaluate the accuracy of the individual IBE criteria and of IBE-Eval in selecting the most plausible explanation in the CQA setting. We find that, although each IBE criterion in isolation is limited in its ability to identify the more plausible explanation, the criteria still outperform the GPT-3.5-as-a-judge baseline. IBE-Eval, which combines all IBE criteria, improves the ability to select the best explanation by 17% over both the GPT-3.5-as-a-judge and random baselines. We can achieve up to 77% accuracy utilizing just the extracted IBE criteria, demonstrating IBE's potential value for automatic explanation evaluation.
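The combination of criteria can be sketched as a weighted score over per-explanation features, with the higher-scoring candidate selected. The weights below are purely illustrative placeholders, not the coefficients fitted in our regression:

```python
# Illustrative sketch of IBE-Eval-style selection: combine per-criterion
# scores into a single plausibility score and pick the higher-scoring of
# two competing explanations. WEIGHTS are hypothetical placeholders.

WEIGHTS = {
    "consistency": 0.2,   # logical consistency: higher is better
    "coherence": 0.4,     # step-wise coherence: higher is better
    "depth": -0.2,        # parsimony (proof depth): lower is better
    "drift": -0.1,        # parsimony (concept drift): lower is better
    "uncertainty": -0.3,  # linguistic hedging: lower is better
}

def plausibility(scores: dict) -> float:
    return sum(WEIGHTS[k] * v for k, v in scores.items())

def select_best(expl_a: dict, expl_b: dict) -> str:
    return "A" if plausibility(expl_a) >= plausibility(expl_b) else "B"

a = {"consistency": 1.0, "coherence": 0.26, "depth": 2.6, "drift": 3.0, "uncertainty": 3.8}
b = {"consistency": 1.0, "coherence": 0.22, "depth": 3.2, "drift": 4.2, "uncertainty": 4.0}
print(select_best(a, b))  # "A": more coherent, more parsimonious, less hedged
```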
Next, we explore each explanation feature in further detail to better understand the variances across the IBE criteria and LLMs.
Consistency.
We find that the LLMs are surprisingly strong conjecture models: they can generate logically consistent explanations for virtually any hypothesis, as evidenced by similar consistency scores for explanations supporting correct and incorrect answers (Figure 3). Moreover, consistency tends to be a statistically insignificant predictor for the LLaMA models. We therefore conclude that evidence of logical consistency provides a limited signal for plausibility and is better understood in the context of the other IBE criteria. For the incorrect candidate explanations, we find that LLMs over-rationalize and introduce additional premises to demonstrate entailment in their explanations.
<details>
<summary>extracted/6246183/consistency.png Details</summary>

### Visual Description
Horizontal bar chart comparing the percentage of logically consistent explanations generated by LLaMA 2 7B, LLaMA 2 13B, and ChatGPT, with separate bars for explanations supporting the correct (green) and incorrect (red) answer options. Rates are near-identical within each model: approximately 70% (correct) vs. 76% (incorrect) for LLaMA 2 7B, 78% vs. 77% for LLaMA 2 13B, and 81% vs. 82% for ChatGPT.
</details>
Figure 3: An evaluation of explanation consistency. LLMs are strong rationalizers and can generate logically consistent explanations at equal rates for explanations associated with both correct and incorrect answer options.
Parsimony.
<details>
<summary>extracted/6246183/parsimony.png Details</summary>

### Visual Description
Two horizontal bar charts comparing LLaMA 2 7B, LLaMA 2 13B, and ChatGPT on average proof depth and explanatory concept drift, split by correct (green) and incorrect (red) answer options. Proof depth: LLaMA 2 7B approximately 2.68 (correct) vs. 2.94 (incorrect); LLaMA 2 13B 3.07 vs. 3.27; ChatGPT 2.63 vs. 3.17. Concept drift: LLaMA 2 7B approximately 3.96 vs. 4.33; LLaMA 2 13B 3.67 vs. 4.55; ChatGPT 2.95 vs. 4.18. Both metrics are consistently higher for the incorrect option across all models.
</details>
Figure 4: Explanation parsimony is evaluated using proof depth and concept drift. Both metrics are consistently lower for explanations supporting the correct answers, suggesting that LLMs are able to generate efficient explanations for the more plausible hypotheses.
The results suggest that parsimony has a more consistent effect and is a better predictor of explanation quality. We observe negative correlations between proof depth, concept drift, and question-answering accuracy, suggesting that LLMs tend to introduce more concepts and explanation steps when explaining less plausible hypotheses. On average, we found depth and drift to be 6% and 10% greater, respectively, for the incorrect option across all LLMs (Figure 4). Moreover, the results suggest that as the LLM parameter size increases, the tendency to over-rationalize increases as well. This is attested by the fact that the average difference in depth and drift is greatest for GPT 3.5, suggesting that the model finds the most efficient explanations for stronger hypotheses while over-elaborating explanations for weaker candidates. Finally, we found that the LLaMA models tend to generate more complex explanations overall, with LLaMA 2 13B exhibiting the largest concept drift for less plausible hypotheses. The parsimony criterion supports the IBE predictive power with an average improvement of 14% over consistency.
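The concept-drift component of parsimony can be illustrated with a toy version that counts content words in the explanation appearing in neither the premise nor the conclusion. The tokenization and stopword list here are simplified stand-ins; a full implementation would rely on proper linguistic processing (e.g., spaCy):

```python
# Toy sketch of concept drift: the number of concepts introduced by an
# explanation beyond those already stated in the premise and conclusion.

STOPWORDS = {"the", "a", "an", "is", "was", "it", "to", "of", "and", "because"}

def concepts(text: str) -> set:
    """Naive concept extraction: lowercased tokens minus stopwords."""
    return {t.strip(".,").lower() for t in text.split()} - STOPWORDS

def concept_drift(premise: str, conclusion: str, explanation: str) -> int:
    return len(concepts(explanation) - concepts(premise) - concepts(conclusion))

premise = "The man lost the race."
conclusion = "He was out of shape."
explanation = "Because he rarely exercised, his stamina declined, so he lost the race."
print(concept_drift(premise, conclusion, explanation))
```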
Coherence.
<details>
<summary>extracted/6246183/coherence.png Details</summary>

### Visual Description
Bar chart of average coherence scores for ChatGPT, LLaMA 2 13B, and LLaMA 2 7B, with green bars for explanations supporting the correct option and red bars for the incorrect one; a dashed line on a secondary axis shows the relative difference (%). Correct vs. incorrect scores: ChatGPT approximately 0.26 vs. 0.22 (relative difference ~24%); LLaMA 2 13B 0.31 vs. 0.28 (~31%); LLaMA 2 7B 0.24 vs. 0.18 (~20%). Coherence is consistently higher for the correct option across all models.
</details>
Figure 5: An evaluation of explanation coherence and question accuracy. The average coherence score is consistently higher for explanations corresponding to the correct hypotheses across the LLMs.
As with parsimony, we found coherence to be a better indicator of explanation quality, being statistically significant for both GPT 3.5 and LLaMA 2 13B on COPA and for both LLaMA 2 models on E-CARE. The average coherence score is consistently greater for the stronger hypothesis across all LLMs and datasets (see Figure 5). Both GPT 3.5 and LLaMA 2 13B exhibit a higher relative difference between the correct and incorrect hypotheses than LLaMA 2 7B.
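The step-wise coherence idea can be sketched as the average entailment strength between consecutive explanation steps. Here a trivial word-overlap function stands in for the NLI model a real implementation would use to score entailment:

```python
# Hedged sketch of a step-wise coherence score: mean pairwise entailment
# between consecutive explanation steps. `entailment_score` is a stand-in
# for a real NLI model.

def coherence(steps, entailment_score) -> float:
    """Mean entailment between consecutive steps; 0 for single-step chains."""
    if len(steps) < 2:
        return 0.0
    pairs = zip(steps, steps[1:])
    return sum(entailment_score(p, h) for p, h in pairs) / (len(steps) - 1)

def overlap(premise: str, hypothesis: str) -> float:
    """Toy scorer: fraction of hypothesis tokens covered by the premise."""
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    return len(p & h) / len(h)

steps = [
    "the ground is wet",
    "wet ground is slippery",
    "slippery ground causes falls",
]
print(coherence(steps, overlap))
```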
Uncertainty.
<details>
<summary>extracted/6246183/uncertainty.png Details</summary>

### Visual Description
Three charts comparing ChatGPT, LLaMA 2 13B, and LLaMA 2 7B. (1) Average uncertainty scores: approximately 3.8, 3.2, and 4.1 respectively, with relative differences between correct and incorrect explanations of ~6%, ~10%, and ~4%. (2) Average ratio of explanatory hedge cues: approximately 0.025, 0.035, and 0.04, with relative differences of ~10%, ~15%, and ~5%. (3) Stacked bars showing the distribution of hedge cue categories (conditional, doxastic, epistemic) in incorrect explanations for each model.
</details>
Figure 6: Evaluation of linguistic uncertainty in LLM-generated explanations. LLMs tend to use more hedging language in explanations supporting less plausible hypotheses. Across the LLMs, the hedging language is found to be predominantly epistemic (Appendix A.8).
The results reveal that linguistic uncertainty is the strongest predictor of explanation quality and is a statistically significant feature for all LLMs, suggesting that LLMs use more qualifying language when explaining weaker hypotheses (see Figure 6). We found that uncertainty can improve accuracy by 13pp on COPA and 4pp on E-CARE. We also examine the uncertainty cues expressed by LLMs by analyzing both the frequency of hedge words and the types of hedge cues employed in incorrect explanations. The distribution of hedge cues tends to be similar across LLMs, with only minor differences between models (Figure 6). Epistemic cues were most frequently used by all three models, with LLaMA 2 7B being more likely to use conditional cues. See Appendix A.8 for further details.
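A minimal lexicon-based sketch of the hedging signal is the ratio of hedge cues to tokens in an explanation. The cue list below is a small illustrative sample, not the full hedge lexicon a complete implementation would use:

```python
# Toy hedge-cue ratio: the fraction of tokens in an explanation that are
# hedging words. The cue set is a small illustrative sample grouped by the
# categories discussed above.

HEDGE_CUES = {
    "may", "might", "could", "possibly", "perhaps",  # epistemic
    "believe", "think", "suggest",                   # doxastic
    "if", "assuming", "unless",                      # conditional
}

def hedge_ratio(explanation: str) -> float:
    tokens = explanation.lower().split()
    if not tokens:
        return 0.0
    return sum(t.strip(".,") in HEDGE_CUES for t in tokens) / len(tokens)

weak = "The fire might have started if the wiring was possibly faulty."
strong = "The fire started because the wiring was faulty."
print(hedge_ratio(weak) > hedge_ratio(strong))  # more hedging for the weaker claim
```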
7.1 Correlation with Human Judgment
We first sample 100 generated explanation pairs across both the COPA and E-CARE tasks and all evaluated LLMs. Two human evaluators are instructed to assess each pair of explanations and select the more plausible one. No additional information about the original question or the correct answer is provided, to prevent biasing the judges.
On average, the human evaluators were able to identify the explanation associated with the correct answer 96% (COPA) and 91% (E-CARE) of the time. We compute the inter-evaluator agreement between the two human evaluators and find a Cohen's Kappa of 0.68, suggesting substantial agreement.
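The agreement computation can be sketched in a few lines. A pure-Python Cohen's kappa over two annotators' binary best-explanation choices (the label sequences here are made-up examples):

```python
# Cohen's kappa: chance-corrected agreement between two raters.
from collections import Counter

def cohens_kappa(rater_a, rater_b) -> float:
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    pa, pb = Counter(rater_a), Counter(rater_b)
    expected = sum((pa[l] / n) * (pb[l] / n) for l in set(rater_a) | set(rater_b))
    return (observed - expected) / (1 - expected)

ann1 = ["A", "A", "B", "A", "B", "B", "A", "A"]
ann2 = ["A", "A", "B", "B", "B", "B", "A", "A"]
print(cohens_kappa(ann1, ann2))  # 0.75
```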
To evaluate whether IBE-Eval is correlated with human judgment, we compute Spearman's rank correlation between human judgment and each of GPT-3.5-as-a-judge and IBE-Eval. We find that GPT-3.5-as-a-judge exhibits a weak and statistically insignificant correlation with human judgment (0.31). In contrast, IBE-Eval is significantly aligned with human preferences (Spearman's correlation of 0.64, p < 0.01), further suggesting IBE's potential for automatic explanation evaluation.
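For completeness, Spearman's rank correlation is Pearson correlation applied to the ranks of the two score lists. A pure-Python sketch with tie-aware average ranks:

```python
# Spearman's rank correlation: Pearson correlation over average ranks.

def ranks(xs):
    """1-based ranks, with ties assigned their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y) -> float:
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

print(spearman([1, 2, 3, 4, 5], [1, 3, 2, 5, 4]))  # 0.8
```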
8 Related Work
Explorations of LLM reasoning capabilities across various domains (e.g., arithmetic, commonsense, planning, symbolic) are an emerging area of interest Xu et al. (2023); Huang and Chang (2023). Prompt-based methods Wei et al. (2022b); Zhou et al. (2023); Wang et al. (2023), such as CoT, investigate strategies to elicit specific types of reasoning behavior through direct LLM interaction. Olausson et al. (2023) investigate automatic proof generation and propose a neurosymbolic framework with an LLM semantic parser and an external solver. Creswell et al. (2022) propose an inference framework where the LLM acts as both a selection and an inference module to produce explanations consisting of causal reasoning steps in entailment tasks. Research on LLM faithfulness Atanasova et al. (2023) investigates whether LLM explanations are robust to spurious input alterations. Parcalabescu and Frank (2024) propose CC-SHAP, a self-consistency measure of how specific alterations to a model's input contribute to the generated explanation. This paper primarily draws inspiration from recent work on the evaluation of natural language explanations Quan et al. (2024); Valentino et al. (2021); Wiegreffe and Marasovic (2021); Thayaparan et al. (2020); Dalvi et al. (2021); Camburu et al. (2018). However, differently from previous methods that require extensive human annotations or specific domain knowledge, we are the first to propose a set of criteria that can be automatically computed from explicit linguistic and logical features.
9 Conclusion
This paper proposed IBE-Eval, an interpretable framework for LLM explanation evaluation inspired by philosophical accounts of Inference to the Best Explanation (IBE). IBE-Eval can identify the best explanation supporting the correct answer with up to 77% accuracy in CQA scenarios, improving upon a GPT 3.5-as-a-judge baseline by +17%. Our regression study suggests that LLM explanations tend to conform to IBE expectations and that IBE-Eval is strongly correlated with human judgment. Linguistic uncertainty is the strongest IBE predictor of explanation quality, closely followed by parsimony and coherence. However, we also found that LLMs tend to be strong conjecture models able to generate logically consistent explanations for less plausible hypotheses, suggesting limited applicability for the logical consistency criterion in isolation. We believe our findings can open new lines of research on external evaluation methods for LLMs as well as interpretability tools for understanding the LLM's underlying explanatory process.
10 Limitations
IBE-Eval offers an interpretable explanation evaluation framework utilizing logical and linguistic features. Our current instantiation of the framework is primarily limited in that it does not consider grounded knowledge for factuality. We observe that LLMs can generate factually incorrect but logically consistent explanations. In some cases, the coherence metric can identify those factual errors when the step-wise entailment score is comparatively lower. However, our reliance on aggregated metrics can hide weaker internal entailment especially when the explanation is longer or the entailment strength of the surrounding explanation steps is stronger. Future work can introduce metrics to evaluate grounded knowledge or perform more granular evaluations of explanations to better weight factual inaccuracies.
Additionally, IBE-Eval currently does not support evaluating a single natural language explanation in isolation and was evaluated only in the limited domain of causal commonsense reasoning. Future work will explore globally calibrating IBE-Eval plausibility scores to extend evaluation to more diverse explanation generation and QA settings. Calibration would allow IBE-Eval to generate comparable scores across unrelated explanations and could be used to produce global thresholds for explanation classification.
Finally, the list of criteria considered in this work is not exhaustive and can be extended in future work. However, additional criteria for IBE might not be straightforward to implement (e.g., unification power, hardness to variation) and would probably require further progress in both epistemological accounts and existing NLP technology.
11 Ethics Statement
The human annotators for computing the human judgment baseline are all authors of the paper and as such were not further compensated for the annotation task.
Acknowledgements
This work was partially funded by the Swiss National Science Foundation (SNSF) project NeuMath (200021_204617), by the EPSRC grant EP/T026995/1 entitled "EnnCore: End-to-End Conceptual Guarding of Neural Architectures" under Security for all in an AI-enabled society, by the CRUK National Biomarker Centre, and supported by the Manchester Experimental Cancer Medicine Centre, the Science Foundation Ireland under grants SFI/18/CRT/6223 (Centre for Research Training in Artificial Intelligence), SFI/12/RC/2289_P2 (Insight), co-funded by the European Regional Development Fund, and the NIHR Manchester Biomedical Research Centre.
References
- Atanasova et al. (2023) Pepa Atanasova, Oana-Maria Camburu, Christina Lioma, Thomas Lukasiewicz, Jakob Grue Simonsen, and Isabelle Augenstein. 2023. Faithfulness tests for natural language explanations. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 283–294, Toronto, Canada. Association for Computational Linguistics.
- Baumgartner (2015) Michael Baumgartner. 2015. Parsimony and causality. Quality & Quantity, 49:839–856.
- Bowman et al. (2015) Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
- Buitinck et al. (2013) Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gaël Varoquaux. 2013. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pages 108–122.
- Camburu et al. (2018) Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. 2018. e-SNLI: Natural language inference with natural language explanations. Advances in Neural Information Processing Systems, 31.
- Camburu et al. (2020) Oana-Maria Camburu, Brendan Shillingford, Pasquale Minervini, Thomas Lukasiewicz, and Phil Blunsom. 2020. Make up your mind! Adversarial generation of inconsistent natural language explanations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. ACL Anthology.
- Chakraborty et al. (2017) Supriyo Chakraborty, Richard Tomsett, Ramya Raghavendra, Daniel Harborne, Moustafa Alzantot, Federico Cerutti, Mani Srivastava, Alun Preece, Simon Julier, Raghuveer M. Rao, Troy D. Kelley, Dave Braines, Murat Sensoy, Christopher J. Willis, and Prudhvi Gurram. 2017. Interpretability of deep learning models: A survey of results. In 2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), pages 1–6.
- Creswell et al. (2022) Antonia Creswell, Murray Shanahan, and Irina Higgins. 2022. Selection-inference: Exploiting large language models for interpretable logical reasoning.
- Dalal et al. (2023) Dhairya Dalal, Paul Buitelaar, and Mihael Arcan. 2023. CALM-bench: A multi-task benchmark for evaluating causality-aware language models. In Findings of the Association for Computational Linguistics: EACL 2023, pages 296–311, Dubrovnik, Croatia. Association for Computational Linguistics.
- Dalvi et al. (2021) Bhavana Dalvi, Peter Jansen, Oyvind Tafjord, Zhengnan Xie, Hannah Smith, Leighanna Pipatanangkura, and Peter Clark. 2021. Explaining answers with entailment trees. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7358–7370.
- Danilevsky et al. (2020) Marina Danilevsky, Kun Qian, Ranit Aharonov, Yannis Katsis, Ban Kawas, and Prithviraj Sen. 2020. A survey of the state of explainable AI for natural language processing. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 447–459, Suzhou, China. Association for Computational Linguistics.
- Deutsch (2011) David Deutsch. 2011. The Beginning of Infinity: Explanations That Transform the World. Penguin UK.
- Du et al. (2022) Li Du, Xiao Ding, Kai Xiong, Ting Liu, and Bing Qin. 2022. e-CARE: a new dataset for exploring explainable causal reasoning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 432–446, Dublin, Ireland. Association for Computational Linguistics.
- Farkas et al. (2010) Richárd Farkas, Veronika Vincze, György Móra, János Csirik, and György Szarvas. 2010. The CoNLL-2010 shared task: Learning to detect hedges and their scope in natural language text. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning – Shared Task, pages 1–12, Uppsala, Sweden. Association for Computational Linguistics.
- Gao et al. (2023) Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2023. A framework for few-shot language model evaluation.
- Geirhos et al. (2020) Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. 2020. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673.
- Gordon et al. (2012) Andrew Gordon, Zornitsa Kozareva, and Melissa Roemmele. 2012. SemEval-2012 task 7: Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 394–398, Montréal, Canada. Association for Computational Linguistics.
- Harman (1965) Gilbert H Harman. 1965. The inference to the best explanation. The Philosophical Review, 74(1):88–95.
- Honnibal and Montani (2017) Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.
- Huang and Chang (2022) Jie Huang and Kevin Chen-Chuan Chang. 2022. Towards reasoning in large language models: A survey. arXiv preprint arXiv:2212.10403.
- Huang and Chang (2023) Jie Huang and Kevin Chen-Chuan Chang. 2023. Towards reasoning in large language models: A survey.
- Huang et al. (2023) Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2023. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.
- Ji et al. (2023) Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. ACM Comput. Surv., 55(12).
- Kitcher (1989) Philip Kitcher. 1989. Explanatory unification and the causal structure of the world.
- Knight (2023) Will Knight. 2023. AI is becoming more powerful, but also more secretive.
- Lampinen et al. (2022) Andrew Lampinen, Ishita Dasgupta, Stephanie Chan, Kory Mathewson, Mh Tessler, Antonia Creswell, James McClelland, Jane Wang, and Felix Hill. 2022. Can language models learn from explanations in context? In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 537–563.
- Laskar et al. (2023) Md Tahmid Rahman Laskar, M Saiful Bari, Mizanur Rahman, Md Amran Hossen Bhuiyan, Shafiq Joty, and Jimmy Xiangji Huang. 2023. A systematic study and comprehensive evaluation of chatgpt on benchmark datasets.
- Liang et al. (2023) Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. 2023. Holistic evaluation of language models.
- Lipton (2017) Peter Lipton. 2017. Inference to the best explanation. A Companion to the Philosophy of Science, pages 184–193.
- Lombrozo (2012) Tania Lombrozo. 2012. Explanation and abductive inference. Oxford Handbook of Thinking and Reasoning, pages 260–276.
- Mackonis (2013) Adolfas Mackonis. 2013. Inference to the best explanation, coherence and other explanatory virtues. Synthese, 190(6):975–995.
- Nie et al. (2019) Yixin Nie, Haonan Chen, and Mohit Bansal. 2019. Combining fact extraction and verification with neural semantic matching networks. In Association for the Advancement of Artificial Intelligence (AAAI).
- Nie et al. (2020) Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.
- Olausson et al. (2023) Theo X. Olausson, Alex Gu, Benjamin Lipkin, Cedegao E. Zhang, Armando Solar-Lezama, Joshua B. Tenenbaum, and Roger Levy. 2023. LINC: A neurosymbolic approach for logical reasoning by combining language models with first-order logic provers.
- Parcalabescu and Frank (2024) Letitia Parcalabescu and Anette Frank. 2024. On measuring faithfulness or self-consistency of natural language explanations.
- Pei and Jurgens (2021) Jiaxin Pei and David Jurgens. 2021. Measuring sentence-level and aspect-level (un)certainty in science communications. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9959–10011, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Popper (2014) Karl Popper. 2014. Conjectures and refutations: The growth of scientific knowledge. Routledge.
- Quan et al. (2024) Xin Quan, Marco Valentino, Louise A Dennis, and André Freitas. 2024. Enhancing ethical explanations of large language models through iterative symbolic refinement. arXiv preprint arXiv:2402.00745.
- R Core Team (2013) R Core Team. 2013. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
- Sober (1981) Elliott Sober. 1981. The principle of parsimony. The British Journal for the Philosophy of Science, 32(2):145–156.
- Srivastava et al. (2023) Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, Ambrose Slone, Ameet Rahane, Anantharaman S. Iyer, Anders Andreassen, Andrea Madotto, Andrea Santilli, Andreas Stuhlmüller, Andrew Dai, Andrew La, Andrew Lampinen, Andy Zou, Angela Jiang, Angelica Chen, Anh Vuong, Animesh Gupta, Anna Gottardi, Antonio Norelli, Anu Venkatesh, Arash Gholamidavoodi, Arfa Tabassum, Arul Menezes, Arun Kirubarajan, Asher Mullokandov, Ashish Sabharwal, Austin Herrick, Avia Efrat, Aykut Erdem, Ayla Karakaş, B. Ryan Roberts, Bao Sheng Loe, Barret Zoph, Bartłomiej Bojanowski, Batuhan Özyurt, Behnam Hedayatnia, Behnam Neyshabur, Benjamin Inden, Benno Stein, Berk Ekmekci, Bill Yuchen Lin, Blake Howald, Bryan Orinion, Cameron Diao, Cameron Dour, Catherine Stinson, Cedrick Argueta, César Ferri Ramírez, Chandan Singh, Charles Rathkopf, Chenlin Meng, Chitta Baral, Chiyu Wu, Chris Callison-Burch, Chris Waites, Christian Voigt, Christopher D. Manning, Christopher Potts, Cindy Ramirez, Clara E. 
Rivera, Clemencia Siro, Colin Raffel, Courtney Ashcraft, Cristina Garbacea, Damien Sileo, Dan Garrette, Dan Hendrycks, Dan Kilman, Dan Roth, Daniel Freeman, Daniel Khashabi, Daniel Levy, Daniel Moseguí González, Danielle Perszyk, Danny Hernandez, Danqi Chen, Daphne Ippolito, Dar Gilboa, David Dohan, David Drakard, David Jurgens, Debajyoti Datta, Deep Ganguli, Denis Emelin, Denis Kleyko, Deniz Yuret, Derek Chen, Derek Tam, Dieuwke Hupkes, Diganta Misra, Dilyar Buzan, Dimitri Coelho Mollo, Diyi Yang, Dong-Ho Lee, Dylan Schrader, Ekaterina Shutova, Ekin Dogus Cubuk, Elad Segal, Eleanor Hagerman, Elizabeth Barnes, Elizabeth Donoway, Ellie Pavlick, Emanuele Rodola, Emma Lam, Eric Chu, Eric Tang, Erkut Erdem, Ernie Chang, Ethan A. Chi, Ethan Dyer, Ethan Jerzak, Ethan Kim, Eunice Engefu Manyasi, Evgenii Zheltonozhskii, Fanyue Xia, Fatemeh Siar, Fernando Martínez-Plumed, Francesca Happé, Francois Chollet, Frieda Rong, Gaurav Mishra, Genta Indra Winata, Gerard de Melo, Germán Kruszewski, Giambattista Parascandolo, Giorgio Mariani, Gloria Wang, Gonzalo Jaimovitch-López, Gregor Betz, Guy Gur-Ari, Hana Galijasevic, Hannah Kim, Hannah Rashkin, Hannaneh Hajishirzi, Harsh Mehta, Hayden Bogar, Henry Shevlin, Hinrich Schütze, Hiromu Yakura, Hongming Zhang, Hugh Mee Wong, Ian Ng, Isaac Noble, Jaap Jumelet, Jack Geissinger, Jackson Kernion, Jacob Hilton, Jaehoon Lee, Jaime Fernández Fisac, James B. Simon, James Koppel, James Zheng, James Zou, Jan Kocoń, Jana Thompson, Janelle Wingfield, Jared Kaplan, Jarema Radom, Jascha Sohl-Dickstein, Jason Phang, Jason Wei, Jason Yosinski, Jekaterina Novikova, Jelle Bosscher, Jennifer Marsh, Jeremy Kim, Jeroen Taal, Jesse Engel, Jesujoba Alabi, Jiacheng Xu, Jiaming Song, Jillian Tang, Joan Waweru, John Burden, John Miller, John U. Balis, Jonathan Batchelder, Jonathan Berant, Jörg Frohberg, Jos Rozen, Jose Hernandez-Orallo, Joseph Boudeman, Joseph Guerr, Joseph Jones, Joshua B. Tenenbaum, Joshua S. 
Rule, Joyce Chua, Kamil Kanclerz, Karen Livescu, Karl Krauth, Karthik Gopalakrishnan, Katerina Ignatyeva, Katja Markert, Kaustubh D. Dhole, Kevin Gimpel, Kevin Omondi, Kory Mathewson, Kristen Chiafullo, Ksenia Shkaruta, Kumar Shridhar, Kyle McDonell, Kyle Richardson, Laria Reynolds, Leo Gao, Li Zhang, Liam Dugan, Lianhui Qin, Lidia Contreras-Ochando, Louis-Philippe Morency, Luca Moschella, Lucas Lam, Lucy Noble, Ludwig Schmidt, Luheng He, Luis Oliveros Colón, Luke Metz, Lütfi Kerem Şenel, Maarten Bosma, Maarten Sap, Maartje ter Hoeve, Maheen Farooqi, Manaal Faruqui, Mantas Mazeika, Marco Baturan, Marco Marelli, Marco Maru, Maria Jose Ramírez Quintana, Marie Tolkiehn, Mario Giulianelli, Martha Lewis, Martin Potthast, Matthew L. Leavitt, Matthias Hagen, Mátyás Schubert, Medina Orduna Baitemirova, Melody Arnaud, Melvin McElrath, Michael A. Yee, Michael Cohen, Michael Gu, Michael Ivanitskiy, Michael Starritt, Michael Strube, Michał Swędrowski, Michele Bevilacqua, Michihiro Yasunaga, Mihir Kale, Mike Cain, Mimee Xu, Mirac Suzgun, Mitch Walker, Mo Tiwari, Mohit Bansal, Moin Aminnaseri, Mor Geva, Mozhdeh Gheini, Mukund Varma T, Nanyun Peng, Nathan A. Chi, Nayeon Lee, Neta Gur-Ari Krakover, Nicholas Cameron, Nicholas Roberts, Nick Doiron, Nicole Martinez, Nikita Nangia, Niklas Deckers, Niklas Muennighoff, Nitish Shirish Keskar, Niveditha S. Iyer, Noah Constant, Noah Fiedel, Nuan Wen, Oliver Zhang, Omar Agha, Omar Elbaghdadi, Omer Levy, Owain Evans, Pablo Antonio Moreno Casares, Parth Doshi, Pascale Fung, Paul Pu Liang, Paul Vicol, Pegah Alipoormolabashi, Peiyuan Liao, Percy Liang, Peter Chang, Peter Eckersley, Phu Mon Htut, Pinyu Hwang, Piotr Miłkowski, Piyush Patil, Pouya Pezeshkpour, Priti Oli, Qiaozhu Mei, Qing Lyu, Qinlang Chen, Rabin Banjade, Rachel Etta Rudolph, Raefer Gabriel, Rahel Habacker, Ramon Risco, Raphaël Millière, Rhythm Garg, Richard Barnes, Rif A. 
Saurous, Riku Arakawa, Robbe Raymaekers, Robert Frank, Rohan Sikand, Roman Novak, Roman Sitelew, Ronan LeBras, Rosanne Liu, Rowan Jacobs, Rui Zhang, Ruslan Salakhutdinov, Ryan Chi, Ryan Lee, Ryan Stovall, Ryan Teehan, Rylan Yang, Sahib Singh, Saif M. Mohammad, Sajant Anand, Sam Dillavou, Sam Shleifer, Sam Wiseman, Samuel Gruetter, Samuel R. Bowman, Samuel S. Schoenholz, Sanghyun Han, Sanjeev Kwatra, Sarah A. Rous, Sarik Ghazarian, Sayan Ghosh, Sean Casey, Sebastian Bischoff, Sebastian Gehrmann, Sebastian Schuster, Sepideh Sadeghi, Shadi Hamdan, Sharon Zhou, Shashank Srivastava, Sherry Shi, Shikhar Singh, Shima Asaadi, Shixiang Shane Gu, Shubh Pachchigar, Shubham Toshniwal, Shyam Upadhyay, Shyamolima, Debnath, Siamak Shakeri, Simon Thormeyer, Simone Melzi, Siva Reddy, Sneha Priscilla Makini, Soo-Hwan Lee, Spencer Torene, Sriharsha Hatwar, Stanislas Dehaene, Stefan Divic, Stefano Ermon, Stella Biderman, Stephanie Lin, Stephen Prasad, Steven T. Piantadosi, Stuart M. Shieber, Summer Misherghi, Svetlana Kiritchenko, Swaroop Mishra, Tal Linzen, Tal Schuster, Tao Li, Tao Yu, Tariq Ali, Tatsu Hashimoto, Te-Lin Wu, Théo Desbordes, Theodore Rothschild, Thomas Phan, Tianle Wang, Tiberius Nkinyili, Timo Schick, Timofei Kornev, Titus Tunduny, Tobias Gerstenberg, Trenton Chang, Trishala Neeraj, Tushar Khot, Tyler Shultz, Uri Shaham, Vedant Misra, Vera Demberg, Victoria Nyamai, Vikas Raunak, Vinay Ramasesh, Vinay Uday Prabhu, Vishakh Padmakumar, Vivek Srikumar, William Fedus, William Saunders, William Zhang, Wout Vossen, Xiang Ren, Xiaoyu Tong, Xinran Zhao, Xinyi Wu, Xudong Shen, Yadollah Yaghoobzadeh, Yair Lakretz, Yangqiu Song, Yasaman Bahri, Yejin Choi, Yichi Yang, Yiding Hao, Yifu Chen, Yonatan Belinkov, Yu Hou, Yufang Hou, Yuntao Bai, Zachary Seid, Zhuoye Zhao, Zijian Wang, Zijie J. Wang, Zirui Wang, and Ziyi Wu. 2023. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models.
- Thagard (1989) Paul Thagard. 1989. Explanatory coherence. Behavioral and Brain Sciences, 12(3):435–467.
- Thagard (1978) Paul R Thagard. 1978. The best explanation: Criteria for theory choice. The Journal of Philosophy, 75(2):76–92.
- Thayaparan et al. (2020) Mokanarangan Thayaparan, Marco Valentino, and André Freitas. 2020. A survey on explainability in machine reading comprehension. arXiv preprint arXiv:2010.00389.
- Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models.
- Valentino and Freitas (2022) Marco Valentino and André Freitas. 2022. Scientific explanation and natural language: A unified epistemological-linguistic perspective for explainable ai. arXiv preprint arXiv:2205.01809.
- Valentino et al. (2021) Marco Valentino, Ian Pratt-Hartmann, and André Freitas. 2021. Do natural language explanations represent valid logical arguments? verifying entailment in explainable NLI gold standards. In Proceedings of the 14th International Conference on Computational Semantics (IWCS), pages 76–86, Groningen, The Netherlands (online). Association for Computational Linguistics.
- Wang et al. (2019) Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. Superglue: A stickier benchmark for general-purpose language understanding systems. CoRR, abs/1905.00537.
- Wang et al. (2023) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-consistency improves chain of thought reasoning in language models.
- Weber et al. (2019) Leon Weber, Pasquale Minervini, Jannes Münchmeyer, Ulf Leser, and Tim Rocktäschel. 2019. NLProlog: Reasoning with weak unification for question answering in natural language.
- Wei et al. (2022a) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed H. Chi, Quoc Le, and Denny Zhou. 2022a. Chain of thought prompting elicits reasoning in large language models. CoRR, abs/2201.11903.
- Wei et al. (2022b) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022b. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
- Wiegreffe and Marasovic (2021) Sarah Wiegreffe and Ana Marasovic. 2021. Teach me to explain: A review of datasets for explainable natural language processing. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1).
- Williams et al. (2018) Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.
- Wu et al. (2022) Yuhuai Wu, Albert Q. Jiang, Wenda Li, Markus N. Rabe, Charles Staats, Mateja Jamnik, and Christian Szegedy. 2022. Autoformalization with large language models.
- Xiang (2023) Chloe Xiang. 2023. OpenAI's GPT-4 is closed source and shrouded in secrecy.
- Xu et al. (2023) Fangzhi Xu, Qika Lin, Jiawei Han, Tianzhe Zhao, Jun Liu, and Erik Cambria. 2023. Are large language models really good logical reasoners? a comprehensive evaluation and beyond.
- Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Haotong Zhang, Joseph Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-judge with MT-bench and chatbot arena. ArXiv, abs/2306.05685.
- Zhou et al. (2023) Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, and Ed Chi. 2023. Least-to-most prompting enables complex reasoning in large language models.
Appendix A Appendix
A.1 Reproducibility
All experimental code is available online at https://github.com/dhairyadalal/IBE-eval to encourage future research in the field. We additionally summarize below the model implementations and technical resources used to compute the proposed IBE criteria:
- We adopt the Prolog solver for neuro-symbolic integration from Quan et al. (2024).
- We use spaCy Honnibal and Montani (2017) for tokenization and part-of-speech (POS) tagging.
- To compute coherence, we employ the RoBERTa-based NLI model of Nie et al. (2020), finetuned on a range of NLI and fact verification datasets: SNLI Bowman et al. (2015), ANLI Nie et al. (2020), MultiNLI Williams et al. (2018), and FEVER-NLI Nie et al. (2019).
- To measure sentence-level uncertainty, we employ a finetuned RoBERTa model provided by Pei and Jurgens (2021).
- We use a fine-tuned BERT-based token classification model to label the words in the generated explanation with the uncertainty categories introduced in the CoNLL-2010 shared task on hedge detection Farkas et al. (2010).
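As a minimal sketch of how the coherence criterion above could be wired up, the snippet below scores each pair of explanation steps with an injected NLI entailment-probability function and averages the results. The mean-over-pairs aggregation and the function names are illustrative assumptions, not the paper's exact formulation; in practice `entail_prob` would wrap the RoBERTa-based NLI model of Nie et al. (2020).

```python
from itertools import combinations
from typing import Callable, Sequence


def coherence_score(steps: Sequence[str],
                    entail_prob: Callable[[str, str], float]) -> float:
    """Mean pairwise entailment probability over the explanation steps.

    `entail_prob(premise, hypothesis)` should return the entailment
    probability assigned by an NLI model. The aggregation used here
    (mean over unordered step pairs) is an illustrative assumption.
    """
    pairs = list(combinations(steps, 2))
    if not pairs:
        return 0.0
    return sum(entail_prob(p, h) for p, h in pairs) / len(pairs)
```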
A.2 Explanation Prompting
<details>
<summary>extracted/6246183/eev.png Details</summary>

### Visual Description
## Data Table: Entailment Forms Examples
### Overview
The image presents a data table illustrating examples of "Cause Prediction" and "Effect Prediction" types, along with their corresponding "Entailment Forms". The table has three columns: "Type", "Example", and "Entailment Forms".
### Components/Axes
The table has the following column headers:
* **Type:** Categorizes the prediction task (Cause Prediction, Effect Prediction).
* **Example:** Provides a context and question demonstrating the prediction type.
* **Entailment Forms:** Shows the premise and conclusion representing the logical relationship.
### Detailed Analysis or Content Details
The table contains two rows, each representing a different prediction type.
**Row 1: Cause Prediction**
* **Type:** Cause Prediction
* **Example:**
* Context: "The balloon expanded."
* Question: "What was the cause?"
* A) "I blew into it."
* B) "I pricked it."
* **Entailment Forms:**
* Premise: "I blew into it."
* Conclusion: "The balloon expanded."
* Premise: "I pricked it."
* Conclusion: "The balloon expanded."
**Row 2: Effect Prediction**
* **Type:** Effect Prediction
* **Example:**
* Context: "The child punched the stack of blocks."
* Question: "What was the effect?"
* A) "The stack towered over the boy's head."
* B) "The blocks scattered all over the rug."
* **Entailment Forms:**
* Premise: "The child punched the stack of blocks."
* Conclusion: "The stack towered over the boy's head."
* Premise: "The child punched the stack of blocks."
* Conclusion: "The blocks scattered all over the rug."
### Key Observations
The table demonstrates how different events can lead to the same outcome (in the case of Cause Prediction) and how a single event can have multiple possible effects (in the case of Effect Prediction). The "Entailment Forms" clearly show the logical connection between the premise and conclusion for each scenario.
### Interpretation
This table illustrates fundamental concepts in logical reasoning and natural language understanding. It highlights the complexities of inferring causality and predicting consequences. The examples demonstrate that a single effect can have multiple potential causes, and a single cause can lead to multiple potential effects. This is crucial for tasks like question answering, text comprehension, and building AI systems that can reason about the world. The table serves as a simplified model for how humans make inferences about events and their relationships. The inclusion of multiple possible conclusions for each premise underscores the inherent ambiguity in real-world scenarios.
</details>
Figure 7: To perform IBE we convert the CQA context and answer candidates into an entailment form (i.e., EEV) Valentino et al. (2021).
<details>
<summary>extracted/6246183/cot-prompt.png Details</summary>

### Visual Description
## Text Document: Scenario & Options for Plausibility Assessment
### Overview
The image presents a text-based scenario followed by a question and multiple-choice options. The document details a method for determining the most plausible cause within a given context, using step-by-step logical proofs and IF-THEN rules. It includes an example to illustrate the process. The document appears to be an instructional or assessment material related to critical thinking or logical reasoning.
### Components/Axes
The document is structured into the following sections:
* **Context:** Provides the scenario.
* **Question:** Poses the problem to be solved.
* **Options:** Lists the potential answers.
* **Option 1 Explanation:** Detailed breakdown of the first option.
* **Option 2 Explanation:** Detailed breakdown of the second option.
* **Answer:** States the selected most plausible option.
The example section includes:
* **Example Context:** "The woman banished the children from her property."
* **Example Question:** "What was the cause?"
* **Example Options:** (a) the children trampled through her garden, (b) the children hit a ball into her yard.
* **Example Option 1 Explanation:** Premise, Conclusion, Step 1, Assumption, and a concluding statement.
* **Example Option 2 Explanation:** Premise, Conclusion, Step 1, Assumption, and a concluding statement.
* **Example Answer:** "(a) the children trampled through her garden"
### Detailed Analysis or Content Details
The document outlines a method for evaluating plausibility:
1. **Premise:** State the option as a given fact.
2. **Conclusion:** State the outcome described in the context.
3. **Step 1:** Formulate an IF-THEN rule connecting the premise to the conclusion.
4. **Assumption:** Identify the underlying causal or commonsense assumption.
5. **Concluding Statement:** Summarize the logical connection.
The example demonstrates this process for two options:
**Example Option 1:**
* **Premise:** The children trampled through her garden.
* **Conclusion:** The woman banished the children from her property.
* **Step 1:** IF children trample through someone's garden, THEN it can cause damage to the garden.
* **Assumption:** Trampling through a garden can result in damage to the garden.
* **Concluding Statement:** Therefore, since the children trampled through her garden, causing damage, the woman may have felt upset or angry and decided to banish the children from her property as a way to prevent further damage.
**Example Option 2:**
* **Premise:** The children hit a ball into her yard.
* **Conclusion:** The woman banished the children from her property.
* **Step 1:** IF children hit a ball into her yard, THEN the woman may feel her property is being invaded.
* **Assumption:** Having objects thrown into one's yard can be seen as an invasion of privacy.
* **Concluding Statement:** Therefore, since the children hit a ball into her yard, the woman may have felt her property was being invaded, which could have led to her becoming angry and ultimately banishing the children from her property.
The final answer in the example is (a) the children trampled through her garden.
The document then presents a new, incomplete scenario with a question and options, awaiting application of the described method.
### Key Observations
The document emphasizes a structured approach to causal reasoning. The use of IF-THEN statements and explicit assumptions forces a clear articulation of the logical connection between cause and effect. The example demonstrates that plausibility is not simply about identifying a possible cause, but about constructing a logical argument to support it.
### Interpretation
This document is a pedagogical tool designed to teach or assess logical reasoning skills. It highlights the importance of breaking down complex scenarios into smaller, manageable steps and explicitly stating the underlying assumptions that connect cause and effect. The method presented is applicable to a wide range of problem-solving scenarios, from everyday decision-making to scientific inquiry. The document's focus on constructing logical proofs suggests an emphasis on critical thinking and evidence-based reasoning. The incomplete scenario at the end serves as an exercise for the reader to apply the learned method. The document is well-structured and clearly written, making it an effective learning resource.
</details>
Figure 8: An example of the modified CoT prompt template for explanation generation.
A modified CoT prompt instructs the LLM to generate explanations. The prompt includes a set of instructions for explanation generation and an in-context example; the CQA context, causal question, and answer candidates are appended to the end of the prompt. The LLM is first instructed to convert the options into the EEV format, consisting of a premise and a conclusion. The EEV format differs depending on the directionality of the causal question (see Figure 7): cause prediction questions treat the answer candidate as the premise and the context as the conclusion, whereas effect prediction reverses the relationship, treating the context as the premise and the answer options as the conclusion. After the EEV conversion, the model is instructed to generate a step-by-step explanation consisting of IF-THEN statements and the associated causal or commonsense assumptions. For ease of post-processing, the LLM is instructed to use headers and to enumerate steps using the Step # format. A full example of the prompt template is shown in Figure 8.
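The directional premise/conclusion mapping described above can be sketched as follows (the function and field names are illustrative, not taken from the released code):

```python
def to_eev(context: str, candidate: str, question_type: str) -> dict:
    """Build the entailment form (EEV) for one answer candidate.

    Cause prediction: the candidate is the premise and the context the
    conclusion; effect prediction reverses the two roles.
    """
    if question_type == "cause":
        return {"premise": candidate, "conclusion": context}
    if question_type == "effect":
        return {"premise": context, "conclusion": candidate}
    raise ValueError(f"unknown question type: {question_type!r}")
```

For the cause-prediction example of Figure 7, `to_eev("The balloon expanded.", "I blew into it.", "cause")` yields the premise "I blew into it." and the conclusion "The balloon expanded.".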
<details>
<summary>extracted/6246183/autoformalization.png Details</summary>

### Visual Description
## Text Document: Prolog Syntax Examples
### Overview
The image contains a text document demonstrating the conversion of premises, conclusions, and explanations into Prolog syntax. It provides two examples illustrating how to generate goals from conclusions, facts from premises, and rules from explanations, adhering to specific constraints regarding variable usage.
### Content Details
The document is structured with "Example" headings, followed by "Premise", "Conclusion", and "Explanation" sections. Below these sections are "Goal", "Formal Goal", "Facts", and "Rules" sections, presenting the Prolog code generated from the preceding text.
**Example 1:**
* **Premise:** Tom's pancreas was injured.
* **Conclusion:** He has a high blood sugar level.
* **Explanation:**
* IF pancreas are injured, THEN pancreas may be dysfunctional.
* IF pancreas are dysfunctional, THEN pancreas have a reduced capacity for insulin production.
* IF there is a reduced capacity for insulin production, THEN there are high levels of blood sugar.
* Therefore, since Tom's pancreas was injured, he may have a reduced capacity for insulin production, leading to insufficient insulin and high blood sugar levels.
* **Goal:**
* `has_high_blood_sugar(tom).`
* **Formal Goal:**
* `has_high_blood_sugar(X) :- tom(X).`
* **Facts:**
* `injured_pancreas(tom)`
* `tom(tom)`
* **Rules:**
* `dysfunctional_pancreas(X) :- injured_pancreas(X).`
* `reduced_insulin_production(X) :- dysfunctional_pancreas(X).`
* `has_high_blood_sugar(X) :- reduced_insulin_production(X).`
**Example 2:**
* **Premise:** [Text is truncated and unreadable]
* **Conclusion:** [Text is truncated and unreadable]
* **Explanation:** [Text is truncated and unreadable]
### Key Observations
The document emphasizes the importance of using only one variable per predicate in the generated Prolog code. It also highlights the need to ensure that goals and facts refer to the same constant. The second example is incomplete due to truncation.
### Interpretation
The document serves as a tutorial or guide for translating logical arguments into Prolog code. It demonstrates a systematic approach to representing knowledge and reasoning in a declarative programming paradigm. The examples illustrate how to break down complex statements into simpler facts and rules, enabling Prolog to infer new knowledge based on the provided information. The constraints imposed on variable usage likely aim to simplify the logic and avoid ambiguity in the Prolog program. The truncation of the second example limits a full understanding of its intended demonstration. The document is a pedagogical tool for learning Prolog.
</details>
Figure 9: An example of the autoformalization prompt.
A.3 Autoformalization
Autoformalization is the process of translating natural language descriptions into formal specifications Wu et al. (2022). We leverage the translation capabilities of GPT-3.5-Turbo to convert each explanation into a formal entailment hypothesis: the IF-THEN explanation steps are converted into a set of Prolog rules, the entailment description is used to generate Prolog atoms, and the conclusion statement is translated into a Prolog query. We provide an example of the autoformalization prompt in Figure 9 and an example of the formalized output in Figure 11. After autoformalization, a post-processing script extracts the formalized rules, atoms, and query, and assembles a Prolog program for entailment verification.
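The final assembly step could look like the sketch below, which combines the extracted facts, rules, and goal query into one Prolog source string whose terminal directive reports whether the goal is entailed. The program layout and the `entailed`/`not_entailed` markers are illustrative assumptions, not the released post-processing script.

```python
def assemble_program(facts, rules, goal):
    """Combine extracted Prolog facts, rules, and the goal query into a
    single program for entailment verification.

    The closing directive succeeds with `entailed` iff the facts and
    rules prove the goal; its exact form is an illustrative assumption.
    """
    clauses = [c if c.rstrip().endswith(".") else c + "." for c in facts + rules]
    clauses.append(
        f":- ({goal.rstrip('.')} -> writeln(entailed) ; writeln(not_entailed))."
    )
    return "\n".join(clauses)
```

Fed the Example 1 fragments of Figure 9, this yields a program that a standard Prolog solver can run directly.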
A.4 LLM-as-a-Judge Baseline
<details>
<summary>extracted/6246183/llm-judge-prompt.png Details</summary>

### Visual Description
## Screenshot: Plausibility Assessment Prompt
### Overview
The image is a screenshot of a text-based prompt designed to assess the plausibility of two given explanations. It presents a question asking the user to choose between "explanation 1" and "explanation 2" based on logical consistency and correctness. The prompt includes placeholders for the actual explanations and an "Answer:" field for the user's response.
### Components/Axes
The screenshot contains the following textual elements:
* **Header:** "Given the two explanations below (explanation 1 and explanation 2) which explanation is more plausible. A good explanation should be logically consistent and arrive at the correct conclusion."
* **Explanation 1 Label:** "Explanation 1:"
* **Explanation 1 Placeholder:** "... " (representing the content of explanation 1)
* **Explanation 2 Label:** "Explanation 2:"
* **Explanation 2 Placeholder:** "... " (representing the content of explanation 2)
* **Answer Label:** "Answer:"
### Detailed Analysis or Content Details
The image does not contain any factual data or numerical values. It is a template for a reasoning task. The content of the explanations is represented by ellipses ("..."), indicating that the actual text is missing. The prompt is structured to elicit a choice between two provided explanations.
### Key Observations
The screenshot is a simple, text-based interface. The layout is straightforward, with clear labels for each section. The use of ellipses suggests that the explanations are intended to be dynamically populated.
### Interpretation
The image represents a task designed to evaluate critical thinking and reasoning skills. The prompt requires the user to analyze two explanations, assess their logical coherence, and select the more plausible one. The absence of specific content in the explanations highlights the focus on the *process* of evaluation rather than the subject matter itself. The prompt is likely part of a larger assessment or experiment related to cognitive abilities or decision-making. It is a test of the ability to discern between good and bad reasoning.
</details>
Figure 10: An example of prompt used by the LLM-as-a-Judge model for evaluating competing explanations.
GPT 3.5 is used as the LLM for the LLM-as-a-Judge baseline. Similar to the human evaluators, GPT is presented with both generated explanations and asked to identify which explanation is more plausible.
A.5 Logical Consistency
<details>
<summary>extracted/6246183/logical-pipeline.png Details</summary>

### Visual Description
## Table: Logical Proof Generation
### Overview
The image presents a table demonstrating a logical proof generation process. The table is divided into three columns: "1. Explanation", "2. Formalized Output", and "3. Generated Proof". It illustrates how a simple premise and conclusion can be formalized into a Prolog query and program, and subsequently used to generate a proof.
### Components/Axes
The table has three columns as described above. Each column represents a stage in the logical proof process. The rows represent steps in the explanation and corresponding formalized outputs.
### Detailed Analysis or Content Details
**Column 1: Explanation**
* **Premise:** "I blew into it."
* **Conclusion:** "The balloon expanded."
* **Step 1:** "IF someone blows into a balloon, THEN it can cause the balloon to inflate."
* **Assumption:** "Blowing air into a balloon increases the amount of air inside it, leading to inflation."
* **Step 2:** "IF the balloon inflates, THEN it can cause the balloon to expand."
* **Assumption:** "When a balloon inflates, it stretches and expands in size."
* **Final Statement:** "Therefore, since I blew into the balloon, it caused the balloon to inflate, which resulted in its expansion."
**Column 2: Formalized Output**
* **Prolog Query:** `expanded_balloon(me).`
* **Program:**
* `% Atoms`
* `blew_into_balloon(me).`
* `me(me).`
* `% Rules`
* `inflated_balloon(X) :- blew_into_balloon(X).`
* `expanded_balloon(X) :- inflated_balloon(X).`
**Column 3: Generated Proof**
* `expanded_balloon(me) ->`
* `expanded_balloon(X) :- inflated_balloon(X) ->`
* `inflated_balloon(X) :- blew_into_balloon(X) ->`
* `blew_into_balloon(me)`
### Key Observations
The table demonstrates a clear chain of reasoning. The explanation provides a human-readable argument, while the formalized output translates this argument into a Prolog program. The generated proof shows how the Prolog program can be used to derive the conclusion from the premise. The use of atoms and rules in the Prolog program mirrors the assumptions and steps in the explanation.
### Interpretation
This table illustrates a fundamental concept in logic and artificial intelligence: the ability to represent knowledge and reasoning in a formal, machine-readable format. The example demonstrates how a simple everyday scenario can be broken down into logical components and used to generate a proof. This process is crucial for building intelligent systems that can reason and make inferences. The table highlights the connection between natural language reasoning and formal logic, and how Prolog can be used as a tool for implementing logical reasoning. The use of "me" as a variable is interesting, representing the subject performing the action. The structure of the proof clearly shows the dependency of each step on the previous one, forming a logical chain.
</details>
Figure 11: An example of the formalized output and the generated proof used for consistency evaluation.
An explanation hypothesis is considered logically consistent if the external solver can build a deductive proof connecting the conclusion to the premise. We use NLProlog Weber et al. (2019), a neuro-symbolic Prolog solver integrating backward chaining with word embedding models via a weak unification mechanism. NLProlog allows for a level of flexibility and robustness that is necessary for NLP use cases (e.g. unification applied to synonyms). We provide the autoformalized query, atoms, and rules to NLProlog. If NLProlog can satisfy the entailment query, it will return the proof consisting of the set of rules traversed, the weak unification score, and the proof depth. For simplicity, we assign a score of one if the entailment query is satisfied and zero if it is not. The proof depth score is evaluated as part of the parsimony analysis. An end-to-end example of consistency evaluation can be found in Figure 11.
Input: Symbolic KB $kb$, Goal $goal$, GloVe embedding model $e(·)$
Output: proof chain $chain$, proof depth $depth$
threshold ← 0.13;
$depth$ ← 1;
$chain$ ← empty list;
foreach step $t$ in backward_chaining($kb$, $goal$) do
  foreach $max\_unification(q, q_{t})$ do
    $unification\_score$ ← $CosineSimilarity$($e(q, m_{s}), e(q_{t}, m_{s})$);
    $depth$ ← $depth$ × $unification\_score$;
  end foreach
  $chain$ ← backward_chaining($kb$, $goal$);
end foreach
if $chain$ is not empty and $depth$ > threshold then
  $chain$ ← current_proof_chain$[0]$;
else
  $depth$ ← 0;
end if
return $chain$, $depth$;
Algorithm 1 Neuro-symbolic Solver
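To make the depth scoring in Algorithm 1 concrete, the sketch below shows the weak unification step in isolation: a running score starts at 1, is multiplied by the cosine similarity of each goal/rule predicate pair, and is zeroed out below the 0.13 threshold. The toy word vectors and function names are illustrative stand-ins for the GloVe embeddings and unification machinery used by NLProlog.

```python
import math

# Toy word vectors standing in for GloVe embeddings (illustrative values only).
EMB = {
    "inflate": [0.9, 0.1, 0.2],
    "expand":  [0.8, 0.2, 0.3],
    "deflate": [-0.7, 0.4, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def weak_unify_depth(goal_terms, rule_terms, threshold=0.13):
    """Multiply a running proof score by the unification score of each
    goal/rule predicate pair, as in Algorithm 1; zero below the threshold."""
    depth = 1.0
    for g, r in zip(goal_terms, rule_terms):
        depth *= cosine(EMB[g], EMB[r])
    return depth if depth > threshold else 0.0
```

Near-synonymous predicates (e.g. "inflate" vs "expand") unify with a score close to 1, while semantically opposed ones fall below the threshold and fail the proof.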
A.6 Parsimony
Parsimony measures the complexity of an explanation and is represented by the proof depth and concept drift metrics. Proof depth is automatically calculated by NLProlog and reflects the number of rules traversed by the solver to satisfy the entailment query. If the hypothesis is not logically consistent, depth is set to zero. The concept drift metric measures the entropy of novel concepts introduced to bridge the premise and conclusion. To compute the drift of an explanation, we consider the nouns found in the premise, conclusion, and explanation steps. We use Spacy Honnibal and Montani (2017) to tokenize and extract part-of-speech (POS) tags, and all tokens with the "NOUN" POS tag are extracted. For normalization purposes, we consider the lemma of each token. Concept drift is then calculated as the set difference between the unique nouns found across all explanation steps and those found in the premise and conclusion.
Input: Premise, Conclusion, Explanation, Spacy model $spacy$($·$)
Output: Drift Score $drift$
$Noun_{p}$ ← $spacy$($Premise$);
$Noun_{c}$ ← $spacy$($Conclusion$);
$Noun_{E}$ ← $spacy$($Explanation$);
$N$ ← $\{Noun_{p}, Noun_{c}, Noun_{E}\}$;
$drift$ ← $length(set(Noun_{E}) - set(Noun_{p} \cup Noun_{c}))$;
return $drift$;
Algorithm 2 Concept Drift
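Algorithm 2 reduces to a set difference over noun lemmas. The sketch below assumes the noun lists have already been extracted and lemmatized (the paper does this with spaCy's "NOUN" POS tags); only the counting step is shown, with illustrative argument names.

```python
def concept_drift(premise_nouns, conclusion_nouns, explanation_nouns):
    """Count novel concepts introduced by the explanation steps: unique nouns
    absent from both premise and conclusion (set difference, Algorithm 2)."""
    known = set(premise_nouns) | set(conclusion_nouns)
    return len(set(explanation_nouns) - known)
```

For instance, if the premise and conclusion only mention "balloon" but the explanation steps introduce "air" and "pressure", the drift score is 2.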
A.7 Coherence
Coherence evaluates the plausibility of the intermediate explanation steps. We propose stepwise entailment as a metric to measure the entailment strength of the If-then implications. We employ a RoBERTa-based NLI model Nie et al. (2020) that has been fine-tuned on a range of NLI and fact verification datasets, including SNLI Bowman et al. (2015), aNLI Nie et al. (2020), multilingual NLI Williams et al. (2018), and FEVER-NLI Nie et al. (2019). To compute the stepwise entailment score, we first measure the entailment strength between the If and Then propositions. For example, to score the statement "IF a balloon is pricked, THEN the balloon may deflate", we provide "a balloon is pricked" and "the balloon may deflate" as input sentences to the NLI model. The model produces independent scores for the entailment and contradiction labels. We compute the entailment strength by subtracting the contradiction label score from the entailment label score. An entailment strength of one indicates the If-then implication is strongly plausible, whereas a score of zero suggests that it is likely implausible. The overall stepwise entailment score is the average of the entailment strength measures across all explanation steps.
Input: Explanation $E$, NLI Model $nli$($·$)
Output: Average Entailment Strength $strength$
$EntailmentStrengthScores$ ← empty list;
foreach Step $(If_{s}, Then_{s})$ in $E$ do
  $EntailmentScore$ ← entailment label score of $nli$($If_{s}$, $Then_{s}$);
  $ContradictionScore$ ← contradiction label score of $nli$($If_{s}$, $Then_{s}$);
  $EntailmentStrength$ ← $EntailmentScore - ContradictionScore$;
  Append $EntailmentStrength$ to $EntailmentStrengthScores$;
end foreach
$strength$ ← $Avg$($EntailmentStrengthScores$);
return $strength$;
Algorithm 3 Stepwise Entailment
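The stepwise entailment computation can be sketched as follows, assuming `nli(premise, hypothesis)` returns independent entailment and contradiction scores, as the RoBERTa NLI model does; the dict-based interface is an illustrative assumption.

```python
def stepwise_entailment(steps, nli):
    """Average (entailment - contradiction) across all IF-THEN steps,
    as in Algorithm 3.

    `steps` is a list of (if_clause, then_clause) pairs; `nli` is assumed
    to return a dict with 'entailment' and 'contradiction' scores.
    """
    scores = []
    for if_s, then_s in steps:
        out = nli(if_s, then_s)
        scores.append(out["entailment"] - out["contradiction"])
    return sum(scores) / len(scores)
```

A stub scorer is enough to exercise the aggregation; in the paper the scores come from the fine-tuned RoBERTa NLI model.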
A.8 Linguistic Uncertainty
Linguistic uncertainty measures the confidence of a statement, where hedging cues and indirect language suggest ambiguity around the proposition. To measure sentence-level uncertainty, we employ a fine-tuned RoBERTa model provided by Pei and Jurgens (2021). The model was trained on a sentence-level dataset of findings and statements extracted from news articles and scientific publications, annotated with human judgments of sentence certainty. Sentences were annotated on a scale from one to six, where one corresponds to the lowest degree of certainty expressed by the sentence and six to the highest. We invert the scale to retrieve uncertainty scores. To compute the overall linguistic uncertainty of an explanation, we first compute the uncertainty for each assumption and for the explanation summary, and then average all the scores.
We use a fine-tuned BERT-based token classification model to classify all the words in the generated explanation into the uncertainty categories introduced in the 2010 CoNLL shared task on Hedge Detection Farkas et al. (2010). Farkas et al. (2010) classify hedge cues into three categories: epistemic, doxastic, and conditional. Epistemic cues refer to hedging scenarios where the truth value of a proposition can be determined but is unknown at present (e.g. the blocks may fall). Doxastic cues refer to beliefs and hypotheses that can be held to be true or false by others (e.g. the child believed the blocks would fall). Finally, conditional cues refer to propositions whose truth value depends on the truth value of another proposition (e.g. if the balloon is pricked it may deflate).
Input: Assumptions, Explanation Summary, Uncertainty Estimator Model $uc(·)$
Output: Overall Uncertainty
$AssumptionUncertaintyList$ ← empty list;
foreach Assumption in Assumptions do
  $UncertaintyScore$ ← $uc$($Assumption$);
  Append $UncertaintyScore$ to $AssumptionUncertaintyList$;
end foreach
$AverageAssumptionUncertainty$ ← $Avg$($AssumptionUncertaintyList$);
$ExplanationUncertainty$ ← $uc$($ExplanationSummary$);
$OverallExplanationUncertainty$ ← $AverageAssumptionUncertainty + ExplanationUncertainty$;
return $OverallExplanationUncertainty$;
Algorithm 4 Linguistic Uncertainty
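Following Algorithm 4, the aggregation can be sketched as below, assuming `uc(text)` returns a certainty rating on the 1-6 scale of Pei and Jurgens (2021), which is then inverted into an uncertainty score; the helper names are illustrative.

```python
def invert_certainty(score, lo=1, hi=6):
    """Map a 1-6 certainty rating to uncertainty (6 -> 1, 1 -> 6)."""
    return hi + lo - score

def overall_uncertainty(assumptions, summary, uc):
    """Combine the average per-assumption uncertainty with the summary
    uncertainty, as in Algorithm 4. `uc(text)` is assumed to return a
    1-6 certainty score from the sentence-level estimator."""
    assumption_scores = [invert_certainty(uc(a)) for a in assumptions]
    avg_assumption = sum(assumption_scores) / len(assumption_scores)
    return avg_assumption + invert_certainty(uc(summary))
```

With a stub estimator that rates every sentence 4 on the certainty scale, each inverted score is 3, giving an overall value of 6 for two assumptions plus a summary.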
A.9 Inference to Best Explanation
To perform IBE, we first fit a linear regression model over the extracted explanation features from the COPA train set and 500 randomly sampled examples from the E-CARE train set. We consider all explanations independently and annotate each explanation with a 1 if it corresponds to a correct answer and 0 if it corresponds to an incorrect answer. After the linear model is fitted, we evaluate on the COPA and E-CARE test sets. For each example, we use the trained linear model to score each answer candidate's explanation and select the candidate with the highest score. We use the linear regression implementation from scikit-learn Buitinck et al. (2013) for the IBE model. We additionally use the R stats package R Core Team (2013) for our regression analysis.
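The fit-then-select procedure can be sketched with an ordinary least-squares fit. The paper uses scikit-learn's linear regression; plain NumPy least squares is substituted here to keep the sketch self-contained, and the single-feature toy data and bias handling are illustrative assumptions.

```python
import numpy as np

def fit_ibe(features, labels):
    """Fit a least-squares linear model over per-explanation feature rows
    (e.g. consistency, depth, drift, coherence, uncertainty) with a bias term."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    w, *_ = np.linalg.lstsq(X, labels, rcond=None)
    return w

def select_best(w, candidate_features):
    """Score each candidate explanation with the fitted model and
    return the index of the highest-scoring candidate."""
    X = np.hstack([candidate_features, np.ones((candidate_features.shape[0], 1))])
    return int(np.argmax(X @ w))
```

At test time, each example's candidate explanations are featurized, scored by the fitted model, and the argmax candidate is selected as the best explanation.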
A.10 E-CARE Results
A.10.1 E-CARE Consistency
See Figure 12.
<details>
<summary>extracted/6246183/consistency-ecare.png Details</summary>

### Visual Description
## Horizontal Bar Chart: E-CARE Avg. Consistency
### Overview
This is a horizontal bar chart comparing the percentage of logically consistent responses (both correct and incorrect) from three different models: Chat-1, Llama 2 13B, and Llama 2 7B. The chart displays the average consistency for each model, broken down by whether the response was "correct" or "incorrect".
### Components/Axes
* **Title:** E-CARE: Avg. Consistency (top-center)
* **X-axis:** % Logically Consistent (bottom)
* **Y-axis:** Model Names (left): Chat-1, Llama 2 13B, Llama 2 7B
* **Legend:** Located in the top-right corner.
* "correct" - represented by a green color.
* "incorrect" - represented by a red color.
### Detailed Analysis
The chart consists of six horizontal bars, two for each model. Each pair represents the percentage of "correct" and "incorrect" responses.
* **Chat-1:**
* Correct: Approximately 78% (bar extends to around the 78% mark on the x-axis).
* Incorrect: Approximately 71% (bar extends to around the 71% mark on the x-axis).
* **Llama 2 13B:**
* Correct: Approximately 74% (bar extends to around the 74% mark on the x-axis).
* Incorrect: Approximately 70% (bar extends to around the 70% mark on the x-axis).
* **Llama 2 7B:**
* Correct: Approximately 80% (bar extends to around the 80% mark on the x-axis).
* Incorrect: Approximately 74% (bar extends to around the 74% mark on the x-axis).
### Key Observations
* Llama 2 7B has the highest percentage of logically consistent responses for both correct (80%) and incorrect (74%) options.
* Llama 2 7B consistently outperforms the other two models in terms of logical consistency for correct responses.
* The difference between correct and incorrect responses is relatively small for all three models, suggesting that all models sometimes produce logically inconsistent outputs.
### Interpretation
The data suggests that Llama 2 7B is the most logically consistent model among the three tested, particularly when generating correct responses. The relatively close values for correct and incorrect responses across all models indicate that logical consistency is a challenge for all of them. The chart highlights the importance of evaluating not only the correctness of model outputs but also their logical coherence. The E-CARE metric appears to be measuring the logical soundness of the responses, and this chart provides a comparative view of three models on this metric. The fact that incorrect responses are also assigned a consistency score suggests that the metric is evaluating the internal logic of the response, even if the response is factually wrong. This could be useful for identifying models that generate confidently incorrect but logically structured responses versus those that generate random or incoherent outputs.
</details>
Figure 12: Average consistency comparison between correct and incorrect options for the E-CARE dataset.
A.10.2 E-CARE Proof Depth
See Figure 13.
<details>
<summary>extracted/6246183/depth-ecare.png Details</summary>

### Visual Description
## Bar Chart: E-CARE: Avg. Proof Depth
### Overview
This is a horizontal bar chart comparing the average proof depth for three different language models: Llama 2 7B, Llama 2 13B, and ChatGPT. For each model, two bars are displayed, representing the average proof depth for correct and incorrect options. The chart aims to visualize how the models perform in terms of proof depth depending on the correctness of the answer.
### Components/Axes
* **Title:** E-CARE: Avg. Proof Depth (located at the top-center)
* **X-axis:** Depth (labeled at the bottom-center) - The scale is not explicitly shown, but values range from approximately 1.8 to 2.2.
* **Y-axis:** Model Names (listed vertically on the left):
* Llama 2 7B
* Llama 2 13B
* ChatGPT
* **Legend:** Located in the top-right corner.
* "correct" - represented by the color green.
* "incorrect" - represented by the color red.
### Detailed Analysis
The chart displays two bars for each model, one green (correct) and one red (incorrect).
* **Llama 2 7B:**
* Correct: The green bar extends to approximately 1.86.
* Incorrect: The red bar extends to approximately 2.03.
* **Llama 2 13B:**
* Correct: The green bar extends to approximately 2.15.
* Incorrect: The red bar extends to approximately 2.21.
* **ChatGPT:**
* Correct: The green bar extends to approximately 1.98.
* Incorrect: The red bar extends to approximately 2.18.
The bars are arranged vertically, with Llama 2 7B at the top and ChatGPT at the bottom. For each model, the red bar (incorrect) is positioned to the right of the green bar (correct).
### Key Observations
* For all three models, the average proof depth is higher for incorrect options than for correct options.
* Llama 2 13B has the highest average proof depth for both correct and incorrect options.
* Llama 2 7B has the lowest average proof depth for both correct and incorrect options.
* The difference between the correct and incorrect proof depths varies across models, from approximately 0.06 (Llama 2 13B) to 0.20 (ChatGPT).
### Interpretation
The data suggests that when language models make incorrect predictions, they tend to generate longer or more complex "proofs" compared to when they make correct predictions. This could indicate that the models are attempting to justify incorrect answers with more elaborate reasoning, or that the incorrect answers require more steps to reach. The higher average proof depth for Llama 2 13B might suggest that larger models are more prone to generating longer proofs, regardless of correctness. The consistent difference between correct and incorrect proof depths across all models suggests a systematic pattern in their reasoning process. It's important to note that "proof depth" is a metric specific to the E-CARE evaluation, and its precise meaning would require further context about the evaluation setup. The chart does not provide information about the *quality* of the proofs, only their length or complexity.
</details>
Figure 13: Comparison of average proof depth between correct and incorrect options.
<details>
<summary>extracted/6246183/drift-ecare.png Details</summary>

### Visual Description
## Horizontal Bar Chart: E-CARE: Avg. Concept Drift
### Overview
This is a horizontal bar chart comparing the average concept drift for three different language models: Llama 2 7B, Llama 2 13B, and ChatGPT. The chart displays the average drift for both correct and incorrect options. The x-axis represents the "Drift" value, and the y-axis lists the language models.
### Components/Axes
* **Title:** E-CARE: Avg. Concept Drift
* **X-axis Label:** Drift (no units specified)
* **Y-axis Labels:** Llama 2 7B, Llama 2 13B, ChatGPT
* **Legend:**
* **Label:** Option Type
* **Correct:** Represented by green bars.
* **Incorrect:** Represented by red/brown bars.
### Detailed Analysis
The chart consists of six horizontal bars, grouped by language model. Each model has two bars representing the average drift for correct and incorrect options.
* **Llama 2 7B:**
* Incorrect Drift: Approximately 5.87 (red/brown bar)
* Correct Drift: Approximately 5.37 (green bar)
* **Llama 2 13B:**
* Incorrect Drift: Approximately 5.8 (red/brown bar)
* Correct Drift: Approximately 5.19 (green bar)
* **ChatGPT:**
* Incorrect Drift: Approximately 5.02 (red/brown bar)
* Correct Drift: Approximately 3.61 (green bar)
The incorrect drift values are consistently higher than the correct drift values for each model.
### Key Observations
* ChatGPT exhibits the lowest average drift for correct options (3.61), significantly lower than Llama 2 7B (5.37) and Llama 2 13B (5.19).
* The difference between correct and incorrect drift is most pronounced for ChatGPT (5.02 - 3.61 = 1.41).
* Llama 2 7B and Llama 2 13B have very similar drift values for both correct and incorrect options.
* The incorrect drift values for Llama 2 7B and Llama 2 13B are nearly identical (5.87 and 5.8 respectively).
### Interpretation
The chart suggests that ChatGPT is more consistent in its responses, exhibiting less concept drift when providing correct answers compared to Llama 2 7B and Llama 2 13B. The larger difference between correct and incorrect drift for ChatGPT indicates that it is better at identifying and avoiding conceptual errors. The similarity in drift values between the two Llama 2 models suggests they perform similarly in this E-CARE task. The "Drift" metric likely represents a measure of how much the model's understanding or response deviates from the expected or correct concept. A higher drift value indicates a greater deviation. The data suggests that while all models exhibit some degree of concept drift, ChatGPT demonstrates a stronger ability to maintain conceptual consistency, particularly when generating correct responses. The E-CARE task is not defined, but it appears to be a benchmark for evaluating the conceptual understanding of language models.
</details>
Figure 14: Comparison of average concept drift between correct and incorrect options.
<details>
<summary>extracted/6246183/coherence-ecare.png Details</summary>

### Visual Description
## Bar Chart: E-CARE Avg. Coherence
### Overview
This bar chart compares the average coherence scores of three language models â ChatGPT, Llama 2 13B, and Llama 2 7B â alongside a line representing the relative difference in percentage between correct and incorrect responses. The chart aims to visualize the coherence of each model and the accuracy of their responses.
### Components/Axes
* **Title:** E-CARE: Avg. Coherence (top-center)
* **X-axis:** Language Models (ChatGPT, Llama 2 13B, Llama 2 7B)
* **Y-axis (left):** Coherence Score (ranging from approximately 0.0 to 0.25)
* **Y-axis (right):** Rel. Difference % (ranging from 0% to 40%)
* **Legend (top-right):**
* Type: Correct (Green)
* Type: Incorrect (Red)
* **Data Series:**
* Bar Chart: Average Coherence Score for each model (Green for Correct, Red for Incorrect)
* Line Chart: Relative Difference (%) between Correct and Incorrect responses (dashed red line with black markers)
### Detailed Analysis
The chart presents three sets of paired bars, one for each language model. Each model has a green bar representing the "Correct" responses and a red bar representing the "Incorrect" responses. A dashed red line with black markers connects the relative difference percentages for each model.
* **ChatGPT:**
* Correct (Green): Approximately 0.23
* Incorrect (Red): Approximately 0.12
* Relative Difference (%): Approximately 16%
* **Llama 2 13B:**
* Correct (Green): Approximately 0.25
* Incorrect (Red): Approximately 0.20
* Relative Difference (%): Approximately 25%
* **Llama 2 7B:**
* Correct (Green): Approximately 0.23
* Incorrect (Red): Approximately 0.32
* Relative Difference (%): Approximately 31%
The dashed red line shows an upward trend, indicating that the relative difference between correct and incorrect responses increases from ChatGPT to Llama 2 7B.
### Key Observations
* Llama 2 13B exhibits the highest average coherence score for correct responses.
* Llama 2 7B has the lowest average coherence score for correct responses and the highest for incorrect responses.
* The relative difference between correct and incorrect responses is lowest for ChatGPT and highest for Llama 2 7B.
* The line representing the relative difference shows a clear increasing trend.
### Interpretation
The data suggests that while Llama 2 13B demonstrates the highest coherence in correct responses, Llama 2 7B has a larger discrepancy between the coherence of correct and incorrect responses. This could indicate that Llama 2 7B is more prone to generating incoherent incorrect responses compared to the other two models. ChatGPT shows a moderate coherence score with a relatively low difference between correct and incorrect responses, suggesting a more consistent level of coherence across both types of responses. The increasing trend in the relative difference line implies that as the model size decreases (from 13B to 7B), the ability to maintain coherence in correct responses diminishes while the incoherence in incorrect responses increases. This could be due to the smaller model having less capacity to learn and generalize from the training data, leading to more erratic outputs. The E-CARE metric appears to be sensitive to the quality of responses, and this chart provides a comparative assessment of the three models based on this metric.
</details>
Figure 15: Comparison of average coherence scores between correct and incorrect options.
<details>
<summary>extracted/6246183/uncertainty-ecare.png Details</summary>

### Visual Description
## Bar Chart: E-CARE Avg. Uncertainty
### Overview
This bar chart compares the average uncertainty scores for three different language models: ChatGPT, Llama 2 13B, and Llama 2 7B. The chart uses paired bars to represent "correct" and "incorrect" responses, with a dashed line connecting the average scores for each model. The y-axis represents the score, while the right y-axis represents the relative difference.
### Components/Axes
* **Title:** E-CARE: Avg. Uncertainty
* **X-axis:** Model Name (ChatGPT, Llama 2 13B, Llama 2 7B)
* **Y-axis (left):** Score (Scale from 0 to 4, with increments of 1)
* **Y-axis (right):** Rel. Difference (Scale from 0% to 5%, with increments of 1%)
* **Legend (top-right):**
* Type: correct (Green)
* Type: incorrect (Red)
### Detailed Analysis
The chart consists of three groups of paired bars, one for each model. Each group contains a green bar representing the "correct" responses and a red bar representing the "incorrect" responses. A dashed red line connects the top of the green and red bars for each model, representing the average uncertainty score.
**ChatGPT:**
* Correct: Approximately 3.6 (Green bar)
* Incorrect: Approximately 3.2 (Red bar)
* Average (dashed line): Approximately 3.4
**Llama 2 13B:**
* Correct: Approximately 3.1 (Green bar)
* Incorrect: Approximately 3.7 (Red bar)
* Average (dashed line): Approximately 3.4
**Llama 2 7B:**
* Correct: Approximately 3.5 (Green bar)
* Incorrect: Approximately 3.8 (Red bar)
* Average (dashed line): Approximately 3.65
The relative difference scale on the right y-axis is not directly used to extract numerical values from the chart, but it indicates the scale for the difference between the correct and incorrect scores.
### Key Observations
* ChatGPT has the highest average score for correct responses and a relatively lower score for incorrect responses.
* Llama 2 13B has the lowest average score for correct responses and the highest score for incorrect responses.
* The average uncertainty score is approximately the same for ChatGPT and Llama 2 13B (around 3.4).
* Llama 2 7B has a higher average uncertainty score than both ChatGPT and Llama 2 13B.
* The dashed line connecting the bars shows a slight upward trend from ChatGPT to Llama 2 7B.
### Interpretation
The chart suggests that ChatGPT performs better on the E-CARE task in terms of lower uncertainty for correct responses, while Llama 2 13B exhibits higher uncertainty, particularly for incorrect responses. Llama 2 7B shows a slightly higher overall uncertainty compared to ChatGPT. The consistent average score of approximately 3.4 for ChatGPT and Llama 2 13B could indicate a similar level of confidence in their responses, despite differences in the correctness of those responses. The upward trend in the dashed line suggests that as the model size decreases (from 13B to 7B), the average uncertainty score increases. This could be due to the smaller model having less capacity to accurately assess its own confidence. The relative difference scale, while present, doesn't provide specific insights without knowing the exact values for each bar. The chart highlights the trade-off between model size and uncertainty in the context of the E-CARE task.
</details>
Figure 16: Comparison of average uncertainty scores between correct and incorrect options.
<details>
<summary>extracted/6246183/hedge-ratio-ecare.png Details</summary>

### Visual Description
## Bar Chart: E-CARE: Avg. Ratio of Hedge Cues
### Overview
This bar chart compares the average ratio of hedge cues across three large language models: ChatGPT, Llama 2 13B, and Llama 2 7B. A secondary y-axis displays the relative difference. A dashed line connects data points representing the relative difference between models.
### Components/Axes
* **Title:** E-CARE: Avg. Ratio of Hedge Cues (top-center)
* **X-axis:** Model Name (ChatGPT, Llama 2 13B, Llama 2 7B)
* **Y-axis (left):** Ratio (scale from 0.00 to 0.04, increments of 0.01)
* **Y-axis (right):** Rel. Difference (scale from 0% to 20%, increments of 5%)
* **Bars:** Green bars represent the average ratio of hedge cues for each model. Red shading within the green bars indicates a secondary value.
* **Dashed Line:** Red dashed line connects the relative difference data points.
* **Data Points:** Black circular markers are placed on the dashed line, indicating the relative difference for each model comparison.
### Detailed Analysis
* **ChatGPT:** The green bar has a height of approximately 0.033. The red shading within the bar reaches approximately 0.038. The corresponding relative difference is approximately 12%.
* **Llama 2 13B:** The green bar has a height of approximately 0.041. The red shading within the bar reaches approximately 0.044. The corresponding relative difference is approximately 6%.
* **Llama 2 7B:** The green bar has a height of approximately 0.035. The red shading within the bar reaches approximately 0.046. The corresponding relative difference is approximately 9%.
The dashed line starts at approximately 12% for ChatGPT, decreases to approximately 6% for Llama 2 13B, and then increases to approximately 9% for Llama 2 7B.
### Key Observations
* Llama 2 13B exhibits the highest average ratio of hedge cues (approximately 0.041).
* ChatGPT has the lowest relative difference (approximately 12%).
* The relative difference between ChatGPT and Llama 2 13B is the largest.
* The red shading within the bars consistently exceeds the height of the green bars, suggesting a difference between two measured values.
### Interpretation
The chart suggests that Llama 2 13B utilizes hedge cues more frequently than ChatGPT and Llama 2 7B. The relative difference line indicates that the change in hedge cue usage between models is not linear. The consistent difference between the green bar height and the red shading suggests that there are two different measurements being compared, potentially representing different methods for calculating the hedge cue ratio or different datasets used for evaluation. The use of hedge cues could indicate a model's uncertainty or attempt to mitigate potential risks associated with its responses. The higher ratio in Llama 2 13B might suggest a more cautious or nuanced approach to generating text. Further investigation is needed to understand the specific meaning of the red shading and the context of the "E-CARE" metric.
</details>
Figure 17: Comparison of the average ratio of hedge cues between correct and incorrect options.
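As a rough illustration of how a hedge-ratio statistic like the one in Figure 17 could be computed, the sketch below counts hedge cues per explanation and derives a relative difference between two sets of explanations. The cue lexicon and function names are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch: average ratio of hedge cues per explanation, and the relative
# difference between two sets of explanations (e.g. correct vs incorrect).
# The hedge lexicon below is a small illustrative sample, not the full
# cue list used in the paper.
HEDGE_CUES = {"may", "might", "could", "possibly", "perhaps",
              "believe", "think", "suggest", "if", "assuming"}

def hedge_ratio(explanation: str) -> float:
    """Fraction of tokens in an explanation that are hedge cues."""
    tokens = explanation.lower().split()
    if not tokens:
        return 0.0
    return sum(t.strip(".,;:") in HEDGE_CUES for t in tokens) / len(tokens)

def avg_ratio(explanations) -> float:
    """Average hedge ratio over a set of explanations."""
    return sum(hedge_ratio(e) for e in explanations) / len(explanations)

def relative_difference(correct, incorrect) -> float:
    """Relative difference (%) between the two average hedge ratios."""
    a, b = avg_ratio(correct), avg_ratio(incorrect)
    denom = max(a, b)
    return 100 * abs(a - b) / denom if denom else 0.0
```

In practice the lexicon would be far larger and cue matching would use proper tokenization, but the ratio and relative-difference arithmetic would be the same.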
<details>
<summary>extracted/6246183/hedge-distrib-ecare.png Details</summary>

### Visual Description
## Bar Chart: E-CARE: Hedge Cue Distrib.
### Overview
The image presents a horizontal bar chart comparing the distribution of three hedge cue categories (Conditional, Doxastic, and Epistemic) across three language models: Llama 2 7B, Llama 2 13B, and ChatGPT. Each bar represents 100% of the responses from a given model, segmented to show the proportion of each hedge cue category.
### Components/Axes
* **Title:** "E-CARE: Hedge Cue Distrib." positioned at the top-center.
* **Y-axis:** Lists the language models: "Llama 2 7B", "Llama 2 13B", and "ChatGPT" (from top to bottom).
* **X-axis:** Represents the percentage distribution, ranging from 0% to 100%.
* **Legend:** Located at the bottom-center, identifying the colors associated with each category:
* Conditional (Yellow)
* Doxastic (Green)
* Epistemic (Light Blue)
### Detailed Analysis
Each language model is represented by a horizontal bar, divided into three colored segments corresponding to the hedge cue categories.
* **Llama 2 7B:**
* Conditional (Yellow): Approximately 25%
* Doxastic (Green): Approximately 40%
* Epistemic (Light Blue): Approximately 35%
* **Llama 2 13B:**
* Conditional (Yellow): Approximately 20%
* Doxastic (Green): Approximately 50%
* Epistemic (Light Blue): Approximately 30%
* **ChatGPT:**
* Conditional (Yellow): Approximately 20%
* Doxastic (Green): Approximately 45%
* Epistemic (Light Blue): Approximately 35%
### Key Observations
* The Doxastic cue is the most prevalent across all three models, consistently occupying the largest segment of each bar.
* Llama 2 13B exhibits the highest proportion of Doxastic cues (approximately 50%).
* Llama 2 7B has the highest proportion of Conditional cues (approximately 25%).
* The Conditional cue consistently has the lowest proportion across all models.
### Interpretation
The chart suggests that all three language models, when responding within the E-CARE context, frequently employ Doxastic hedges, i.e. cues indicating belief or commitment. This could imply a tendency towards expressing opinions or making assertions, even while acknowledging uncertainty. The relatively lower use of Conditional hedges suggests a less frequent reliance on explicitly stating conditions or limitations. The differences between the models, particularly the higher Doxastic usage in Llama 2 13B and the higher Conditional usage in Llama 2 7B, might reflect variations in their training data or architectural design. These differences could be significant in applications where nuanced expression of uncertainty is crucial. The data suggests that the models are not equally calibrated in their use of hedging strategies. Further investigation would be needed to understand the implications of these differences for the reliability and trustworthiness of their outputs.
</details>
Figure 18: Distribution of hedge cues across incorrect explanations.
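A category distribution like the one in Figure 18 could be derived by mapping each detected cue to one of the three classes. The per-category lexicons below are small illustrative samples, not the paper's actual cue lists.

```python
from collections import Counter

# Illustrative (not the paper's) cue lexicons for the three categories.
CATEGORIES = {
    "Conditional": {"if", "unless", "assuming", "provided"},
    "Doxastic":    {"believe", "think", "suppose", "suspect"},
    "Epistemic":   {"may", "might", "possibly", "perhaps", "likely"},
}

def cue_category(token: str):
    """Return the hedge category of a token, or None if not a cue."""
    for cat, cues in CATEGORIES.items():
        if token in cues:
            return cat
    return None

def hedge_distribution(explanations):
    """Percentage of detected hedge cues falling in each category."""
    counts = Counter()
    for text in explanations:
        for tok in text.lower().split():
            cat = cue_category(tok.strip(".,;:"))
            if cat:
                counts[cat] += 1
    total = sum(counts.values()) or 1
    return {cat: 100 * counts[cat] / total for cat in CATEGORIES}
```

The three percentages sum to 100 for any set of explanations that contains at least one cue, matching the stacked 100% bars in the figure.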
A.10.3 E-CARE Concept Drift
See Figure 13.
A.10.4 E-CARE Coherence
See Figure 15.
A.10.5 E-CARE Uncertainty
See Figure 16.
A.10.6 E-CARE Hedge Ratio
See Figure 17.
A.10.7 E-CARE Hedge Distribution
See Figure 18.
A.11 Causal Directionality
<details>
<summary>extracted/6246183/cause-effect-acc.png Details</summary>

### Visual Description
## Horizontal Bar Chart: LLM Accuracy Comparison
### Overview
This horizontal bar chart compares the accuracy of three Large Language Models (LLMs), Llama 2 7B, Llama 2 13B, and ChatGPT, across two types of reasoning: "Cause" and "Effect". Accuracy is measured as a percentage.
### Components/Axes
* **Y-axis:** LLM (Llama 2 7B, Llama 2 13B, ChatGPT)
* **X-axis:** Accuracy (%) - ranging from approximately 60% to 80%.
* **Legend:** Located in the top-right corner, defining the color coding for "Type":
* Cause: Light Green
* Effect: Medium Blue
* **Data Series:** Two horizontal bars for each LLM, representing accuracy for "Cause" and "Effect" reasoning.
### Detailed Analysis
The chart presents accuracy data for each LLM and reasoning type.
* **Llama 2 7B:**
* Cause: Approximately 64% accuracy. The light green bar extends to the 64% mark on the x-axis.
* Effect: Approximately 72% accuracy. The medium blue bar extends to the 72% mark on the x-axis.
* **Llama 2 13B:**
* Cause: Approximately 72% accuracy. The light green bar extends to the 72% mark on the x-axis.
* Effect: Approximately 72% accuracy. The medium blue bar extends to the 72% mark on the x-axis.
* **ChatGPT:**
* Cause: Approximately 71% accuracy. The light green bar extends to the 71% mark on the x-axis.
* Effect: Approximately 80% accuracy. The medium blue bar extends to the 80% mark on the x-axis.
### Key Observations
* ChatGPT demonstrates the highest accuracy for "Effect" reasoning at approximately 80%.
* Llama 2 13B shows consistent accuracy across both "Cause" and "Effect" reasoning, at approximately 72%.
* Llama 2 7B has the lowest accuracy for "Cause" reasoning at approximately 64%.
* The difference in accuracy between "Cause" and "Effect" reasoning is most pronounced for Llama 2 7B and ChatGPT.
### Interpretation
The data suggests that ChatGPT excels in "Effect" reasoning, significantly outperforming the other models in this area. Llama 2 13B exhibits a balanced performance across both reasoning types. Llama 2 7B appears to struggle more with "Cause" reasoning compared to "Effect" reasoning. The consistent performance of Llama 2 13B might indicate that increasing model size improves reasoning capabilities, but ChatGPT's performance suggests that architectural differences or training data also play a crucial role. The disparity in accuracy between reasoning types could be due to the inherent complexity of identifying causal relationships versus understanding effects. The chart provides a comparative snapshot of LLM performance on these specific reasoning tasks, highlighting strengths and weaknesses of each model.
</details>
Figure 19: Accuracy in predicting the most plausible causes vs effects on COPA.
When considering causal directionality (i.e. cause vs effect), we observed that accuracy tended to differ between LLMs on COPA. In particular, we found both GPT and LLaMA 2 7B to be more accurate at predicting effects in causal scenarios (see Figure 19). We hypothesize that LLMs may struggle with causal sufficiency, since the space of potential causal explanations for an observed event can be far larger than the range of its plausible effects. This hypothesis is partly supported by the fact that GPT and LLaMA 2 7B express greater linguistic uncertainty and produce more complex explanations when predicting causes rather than effects.
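The cause-vs-effect accuracy split discussed above can be sketched as follows; the record schema (`type`, `prediction`, `gold`) is a hypothetical representation, not the paper's actual data format.

```python
# Sketch: accuracy split by causal direction (cause vs effect questions),
# as in Figure 19. Each record is a dict with a question 'type'
# ("cause" or "effect"), the model's 'prediction', and the 'gold' answer.
def accuracy_by_direction(records):
    """Return per-direction accuracy (%) over a list of prediction records."""
    totals = {"cause": 0, "effect": 0}
    hits = {"cause": 0, "effect": 0}
    for r in records:
        totals[r["type"]] += 1
        hits[r["type"]] += int(r["prediction"] == r["gold"])
    return {t: 100 * hits[t] / totals[t] if totals[t] else 0.0
            for t in totals}
```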
A.12 Dataset Details
COPA is released under a BSD-2 license and made available for broad research use with copyright-notification restrictions (people.ict.usc.edu/~gordon/copa.html). We do not modify COPA or use it outside its intended purpose, which is primarily open-domain commonsense causal reasoning. E-CARE is released under the MIT license and can be used for broad purposes with copyright-notification restrictions (github.com/Waste-Wood/e-CARE?tab=MIT-1-ov-file#readme). We do not modify E-CARE or use it outside its intended purpose, which is causal reasoning evaluation of language models.