2402.03744v2
# INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection
**Authors**:
- Zhihang Fu, Jieping Ye (Alibaba Cloud & Zhejiang University)
> Corresponding Author
## Abstract
Knowledge hallucination has raised widespread concerns about the security and reliability of deployed LLMs. Previous efforts in detecting hallucinations have relied on logit-level uncertainty estimation or language-level self-consistency evaluation, where semantic information is inevitably lost during the token-decoding procedure. Thus, we propose to explore the dense semantic information retained within LLMs' **IN**ternal **S**tates for halluc**I**nation **DE**tection (INSIDE). In particular, a simple yet effective EigenScore metric is proposed to better evaluate responses' self-consistency, which exploits the eigenvalues of responses' covariance matrix to measure the semantic consistency/diversity in the dense embedding space. Furthermore, from the perspective of self-consistent hallucination detection, a test-time feature clipping approach is explored to truncate extreme activations in the internal states, which reduces overconfident generations and potentially benefits the detection of overconfident hallucinations. Extensive experiments and ablation studies are performed on several popular LLMs and question-answering (QA) benchmarks, showing the effectiveness of our proposal. Code is available at https://github.com/alibaba/eigenscore
## 1 Introduction
Large Language Models (LLMs) have recently achieved a milestone breakthrough and demonstrated impressive abilities in various applications (Ouyang et al., 2022; OpenAI, 2023). However, it has been widely observed that even the state-of-the-art LLMs often make factually incorrect or nonsense generations (Cohen et al., 2023; Ren et al., 2022; Kuhn et al., 2022), which is also known as knowledge hallucination (Ji et al., 2023). The potentially unreliable generations make it risky to deploy LLMs in practical scenarios. Therefore, hallucination detection, that is, accurately detecting and rejecting responses when hallucinations occur in LLMs, has attracted more and more attention from the academic community (Azaria & Mitchell, 2023; Ren et al., 2022; Kuhn et al., 2022).
The token-level uncertainty estimation (e.g., predictive confidence or entropy) has shown its efficacy in hallucination detection on conventional NLP tasks (Malinin & Gales, 2020; Huang et al., 2023). However, how to derive sentence-level uncertainty from the token level remains a challenge, especially for modern auto-regressive LLMs whose response contents are generally diverse and sophisticated (Malinin & Gales, 2020; Kuhn et al., 2022; Duan et al., 2023). Thus, to avoid complicated token-to-sentence uncertainty derivation, researchers propose to evaluate sentence uncertainty directly from the output language (Kadavath et al., 2022; Yin et al., 2023; Zhou et al., 2023). Among recent advancements, prompting LLMs to generate multiple responses to the same question and evaluating the self-consistency of those responses has proven effective in hallucination detection (Wang et al., 2022; Shi et al., 2022). However, such post-hoc semantic measurement on decoded language sentences is inferior to precisely modeling the logical consistency/divergence (Manakul et al., 2023; Zhang et al., 2023).
Hence, instead of logit-level or language-level uncertainty estimation, this paper proposes to leverage the internal states of LLMs to conduct hallucination detection. The motivation is intuitive: LLMs preserve the highly-concentrated semantic information of the entire sentence within their internal states (Azaria & Mitchell, 2023), allowing for the direct detection of hallucinated responses in the sentence embedding space.
In particular, with the generalized framework of **IN**ternal **S**tates for halluc**I**nation **DE**tection (INSIDE), this paper performs hallucination detection from two perspectives. First, skipping secondary semantic extraction via extra models, we directly measure the self-consistency/divergence of the output sentences using the internal states of LLMs. To explore semantic consistency in the embedding space, Section 3.1 introduces an EigenScore metric based on the eigenvalues of the sentence embeddings' covariance matrix. Second, to handle self-consistent (overconfident) hallucinations, we propose to rectify abnormal activations of the internal states. Specifically, Section 3.2 develops a feature clipping approach to truncate extreme features, which tends to prevent overconfident generations during the auto-regressive procedure. In Section 4, the effectiveness of our method is validated through extensive experiments on several well-established QA benchmarks.
The main contributions of our work are as follows:
- We propose a generalized INSIDE framework that leverages the internal states of LLMs to perform hallucination detection.
- We develop an EigenScore metric to measure the semantic consistency in the embedding space, and demonstrate that the proposed EigenScore represents the differential entropy in the sentence embedding space.
- A test time feature clipping approach is introduced to truncate extreme activations in the feature space, which implicitly reduces overconfident generations and helps identify the overconfident hallucinations.
- We achieve state-of-the-art hallucination detection performance on several QA benchmarks, and conduct extensive ablation studies to verify the efficacy of our method.
## 2 Background on Hallucination Detection
In this work, we mainly focus on knowledge hallucination detection for natural language generation with LLMs, especially for the Q&A task (Reddy et al., 2019; Kwiatkowski et al., 2019). Given an input context $\bm{x}$, a typical LLM (Zhang et al., 2022; Touvron et al., 2023a) parameterized with $\bm{\theta}$ generates output sequences in an autoregressive manner, $y_t=f(\bm{x},y_1,y_2,\cdots,y_{t-1}|\bm{\theta})$, where $\bm{y}=[y_1,y_2,\cdots,y_T]$ denotes the output sequence and $y_t$ denotes the $t$-th output token. We denote by $p(y_t|y_{<t},\bm{x})$ the Maximum Softmax Probability (MSP) of the $t$-th token. For a traditional classification model, the MSP measures the confidence level of the classification result and has been widely used as an uncertainty measure of predictions (Hendrycks & Gimpel, 2016). Therefore, for the sequence generation task, a straightforward sequence uncertainty can be defined as the joint probability of the tokens, which is known as Perplexity (Ren et al., 2022),
$$
P(\bm{y}|\bm{x},\bm{\theta})=-\frac{1}{T}\log\prod_t p(y_t|y_{<t},\bm{x})=-\frac{1}{T}\sum_t\log p(y_t|y_{<t},\bm{x}) \tag{1}
$$
As shorter sequences generally have lower perplexity, the length of the output sequence $T$ is utilized to normalize the joint probability. Since different tokens contribute differently to the semantics of the sentence (Raj et al., 2023; Duan et al., 2023), the perplexity defined by averaging token-level uncertainty cannot effectively capture the uncertainty of the entire sequence. It has been demonstrated that utilizing multiple generations for one input is beneficial to estimate the sequence-level uncertainty (Malinin & Gales, 2020; Kuhn et al., 2022; Manakul et al., 2023). We denote $Y=[\bm{y}^1,\bm{y}^2,\cdots,\bm{y}^K]$ as the $K$ generated responses for input context $\bm{x}$. For a given LLM, multiple responses can be easily obtained via the top-p/top-k sampling strategy during inference time (Touvron et al., 2023a; Kadavath et al., 2022). In Malinin & Gales (2020), the Length Normalized Entropy is proposed to measure the sequence-level uncertainty by making use of multiple generations, which is defined as
$$
H(Y|\bm{x},\bm{\theta})=-E_{\bm{y}\in Y}\frac{1}{T_{\bm{y}}}\sum_t\log p(y_t|y_{<t},\bm{x}) \tag{2}
$$
When a model is uncertain about its response, it generates hallucination context, resulting in an answer distribution with a high entropy (Kadavath et al., 2022). It has been shown that the length-normalized entropy performs better than the non-normalized one (Lin et al., 2023).
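As a concrete illustration, the two logit-level baselines above (Eqs. 1 and 2) reduce to a few lines once per-token log-probabilities have been collected from the model; the sketch below uses hypothetical toy inputs rather than real LLM outputs:

```python
import numpy as np

def perplexity(token_logprobs):
    """Length-normalized negative joint log-probability of one response (Eq. 1)."""
    return float(-np.mean(token_logprobs))

def length_normalized_entropy(responses_logprobs):
    """Length-normalized predictive entropy (Eq. 2): the expectation, over K
    sampled responses, of each response's per-token negative log-probability."""
    return float(np.mean([perplexity(lp) for lp in responses_logprobs]))

# Toy per-token log-probabilities for two sampled answers per question.
confident = [[-0.1, -0.2, -0.1], [-0.15, -0.1]]
uncertain = [[-2.0, -1.5, -2.5], [-1.8, -2.2]]
assert length_normalized_entropy(confident) < length_normalized_entropy(uncertain)
```

The normalization by each response's own length $T_{\bm{y}}$ is what makes the entropy comparable across answers of different lengths.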
In addition to predictive uncertainty or entropy, the semantic consistency (Lin et al., 2023; Raj et al., 2023) among multiple responses has also been widely explored to measure the hallucination degree of LLMs, building on the hypothesis that LLMs are expected to generate similar outputs if they know the input context and are sure about the answers (Wang et al., 2022; Manakul et al., 2023). An intuitive semantic consistency metric is Lexical Similarity (Lin et al., 2022; 2023), which uses the average similarity across multiple answers as the consistency measure
$$
S(Y|\bm{x},\bm{\theta})=\frac{1}{C}\sum_{i=1}^K\sum_{j=i+1}^K sim(\bm{y}^i,\bm{y}^j) \tag{3}
$$
where $C=K\cdot(K-1)/2$ and $sim(\cdot,\cdot)$ is the similarity defined by Rouge-L (Lin, 2004).
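A minimal sketch of Eq. 3, with Python's `difflib.SequenceMatcher` ratio standing in for the Rouge-L score used in the paper (so the absolute numbers differ, but the pairwise-averaging structure is identical); the example answers are hypothetical:

```python
from difflib import SequenceMatcher

def lexical_similarity(answers):
    """Average pairwise similarity across K answers (Eq. 3).
    SequenceMatcher.ratio() is a stand-in for Rouge-L here."""
    k = len(answers)
    c = k * (k - 1) / 2                     # number of unordered pairs
    total = sum(
        SequenceMatcher(None, answers[i], answers[j]).ratio()
        for i in range(k) for j in range(i + 1, k)
    )
    return total / c

consistent = ["July 20, 1969", "July 20, 1969", "on July 20, 1969"]
divergent = ["July 20, 1969", "in 1970", "no idea at all"]
assert lexical_similarity(consistent) > lexical_similarity(divergent)
```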
## 3 Method
Figure 1: Illustration of our proposed hallucination detection pipeline. During inference time, for a given question, the extreme features in the penultimate layer are truncated and the EigenScore is computed based on the sentence embeddings across multiple responses.
In this section, we introduce the details of our proposed INSIDE framework for hallucination detection. The whole pipeline is illustrated in Fig. 1. In Section 3.1, we present a simple but effective EigenScore metric that exploits sentence-level semantics in the internal states of LLMs. In Section 3.2, a test-time feature clipping approach is introduced to effectively alleviate the issue of overconfident generation, thereby aiding the identification of self-consistent hallucinations.
### 3.1 Hallucination Detection by EigenScore
Existing uncertainty- or consistency-based hallucination detection metrics operate in the logit or language space, neglecting the dense semantic information retained within the internal states of LLMs. To better exploit this information, we propose to measure the semantic divergence in the sentence embedding space. For the $t$-th output token $y_t$, we denote the hidden embedding in the $l$-th layer as $\bm{h}^l_t\in\mathbb{R}^d$, where $d$ is the dimension of the hidden embedding ($d=4096$ for LLaMA-7B and $d=5120$ for LLaMA-13B). According to Ren et al. (2022); Azaria & Mitchell (2023), the sentence embedding can be obtained by averaging the token embeddings, $\bm{z}=\frac{1}{T}\sum_{t=1}^T\bm{h}_t$, or by taking the last token embedding, $\bm{z}=\bm{h}_T$. In our main experiments, we use the embedding of the last token in the middle layer as the sentence embedding, as it effectively captures the sentence semantics (Azaria & Mitchell, 2023). Comparison results for different sentence embeddings are presented in the ablation studies (Section 4.3). For $K$ generated sequences, the covariance matrix of the $K$ sentence embeddings can be computed as
$$
\bm{\Sigma}=Z^\top\cdot J_d\cdot Z \tag{4}
$$
where $\bm{\Sigma}\in\mathbb{R}^{K\times K}$ represents the covariance matrix that captures the relationship between different sentences in the embedding space, $Z=[\bm{z}_1,\bm{z}_2,\cdots,\bm{z}_K]\in\mathbb{R}^{d\times K}$ represents the embedding matrix of different sentences, $J_d=\bm{I}_d-\frac{1}{d}\bm{1}_d\bm{1}_d^\top$ is the centering matrix, and $\bm{1}_d\in\mathbb{R}^d$ is the all-one column vector. Then, the proposed EigenScore can be defined as the logarithm of the determinant (LogDet) of the covariance matrix,
$$
E(Y|\bm{x},\bm{\theta})=\frac{1}{K}\log\det(\bm{\Sigma}+\alpha\cdot I_K) \tag{5}
$$
Here, $\det(X)$ represents the determinant of matrix $X$, and a small regularization term $\alpha\cdot I_K$ is added to the covariance matrix to explicitly make it full rank. Since the matrix determinant can be obtained from the eigenvalues, the EigenScore can be computed as
$$
E(Y|\bm{x},\bm{\theta})=\frac{1}{K}\log\Big(\prod_i\lambda_i\Big)=\frac{1}{K}\sum_{i=1}^K\log\lambda_i \tag{6}
$$
where $\lambda=\{\lambda_1,\lambda_2,\cdots,\lambda_K\}$ denotes the eigenvalues of the regularized covariance matrix $\bm{\Sigma}+\alpha\cdot I_K$, which can be solved by Singular Value Decomposition (SVD). Eq. 6 shows that the hallucination degree of an LLM's generation can be measured by the average logarithm of the eigenvalues. This conclusion is intuitive, as the eigenvalues of the covariance matrix capture the divergence and correlation between embeddings of different sentences. When the LLM is confident in its answers and the $K$ generations have similar semantics, the sentence embeddings will be highly correlated and most eigenvalues will be close to 0. On the contrary, when the LLM is indecisive and hallucinating content, it will generate multiple sentences with diverse semantics, leading to larger eigenvalues. The following remark further explains why the proposed EigenScore is a good measure of knowledge hallucination.
Remark 1. The LogDet of the covariance matrix represents the differential entropy in the sentence embedding space. Differential entropy is the natural extension of the discrete Shannon entropy $H_e(X)=-\sum_x p(x)\log p(x)$. The differential entropy $H_{de}(X)$ in continuous space is defined by replacing the probability mass function with the density function $f(x)$ and integrating over $x$, i.e., $H_{de}(X)=-\int_x f(x)\log f(x)\,dx$. In principle (Zhouyin & Liu, 2021), for a multivariate Gaussian distribution $X\sim\mathcal{N}(\bm{\mu},\bm{\Sigma})$, the differential entropy can be represented as
$$
H_{de}(X)=\frac{1}{2}\log\det(\bm{\Sigma})+\frac{d}{2}(\log 2\pi+1)=\frac{1}{2}\sum_{i=1}^d\log\lambda_i+C \tag{7}
$$
where $d$ is the dimension of variables and $C$ is a constant. Therefore, the differential entropy is determined by the eigenvalues (LogDet) of the covariance matrix.
According to Remark 1, the proposed EigenScore defined in Eq. 6 represents the differential entropy in the sentence embedding space, which offers valuable insight into using EigenScore as a semantic divergence measure. Compared to existing uncertainty or consistency metrics obtained in the logit or language space (Malinin & Gales, 2020; Huang et al., 2023; Lin et al., 2022), the advantages of EigenScore are: (1) It captures the semantic divergence (entropy) in the dense embedding space, which retains more highly-concentrated semantic information than logits or language (Reimers & Gurevych, 2019). (2) Representing semantic divergence in the embedding space effectively addresses the semantic equivalence (linguistic invariance) problem (Kuhn et al., 2022) in natural language space. (3) Fine-grained semantic relationships among different responses can be exploited via the eigenvalues of the covariance matrix. Therefore, by exploring the dense semantic information in the internal states, EigenScore is expected to outperform existing uncertainty and consistency metrics, resulting in improved hallucination detection performance.
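The computation in Eqs. 4-6 amounts to a centered Gram matrix followed by an eigendecomposition. A minimal NumPy sketch, with synthetic random vectors standing in for real LLM sentence embeddings (the centering is applied directly rather than materializing the $d\times d$ matrix $J_d$):

```python
import numpy as np

def eigenscore(Z, alpha=1e-3):
    """EigenScore (Eqs. 4-6): mean log-eigenvalue of the regularized
    K x K covariance of K sentence embeddings. Z has shape (d, K)."""
    # J_d @ Z centers each embedding across its d dimensions; subtracting
    # the per-column mean is equivalent and avoids the d x d matrix.
    Zc = Z - Z.mean(axis=0, keepdims=True)
    K = Z.shape[1]
    sigma = Zc.T @ Zc                                 # Eq. 4
    eigvals = np.linalg.eigvalsh(sigma + alpha * np.eye(K))
    return float(np.sum(np.log(eigvals)) / K)         # Eq. 6

# Synthetic check: near-identical embeddings (a consistent model) should
# score lower than diverse embeddings (a hallucinating model).
rng = np.random.default_rng(0)
base = rng.normal(size=(512, 1))
consistent = base + 0.01 * rng.normal(size=(512, 8))
divergent = rng.normal(size=(512, 8))
assert eigenscore(consistent) < eigenscore(divergent)
```

Since $\bm{\Sigma}=(J_dZ)^\top(J_dZ)$ is positive semi-definite, the regularized eigenvalues are bounded below by $\alpha$ and the logarithms are always finite.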
(a) Neuron Activation
(b) Feature Distribution
Figure 2: Illustration of activation distributions in the penultimate layer of LLaMA-7B. (a) Activation distribution in the penultimate layer for a randomly sampled token. (b) Activation distribution of a randomly sampled neuron across numerous tokens.
### 3.2 Test Time Feature Clipping
Recent works have shown that LLMs are subject to the risk of self-consistent (overconfident) hallucinations (Ren et al., 2022; Ji et al., 2023), which has not been considered by existing consistency-based methods. Therefore, to address failure cases caused by overconfident generation, a test-time feature clipping approach is introduced during the computation of EigenScore. Figure 2 illustrates the activation distribution in the penultimate layer of LLaMA-7B. An intuitive observation is that the penultimate layer of LLMs tends to exhibit numerous extreme features, which increases the likelihood of overconfident and self-consistent generations. Inspired by prior works that rectify internal activations to reduce overconfident predictions for Out-of-Distribution (OOD) detection (Sun et al., 2021; Djurisic et al., 2022; Chen et al., 2024), we introduce a test-time feature clipping (FC) method to prevent LLMs from generating overconfident hallucinations. To rectify those extreme features, the FC operation is defined as the following piecewise function
$$
FC(h)=\begin{cases}h_{min},&h<h_{min}\\
h,&h_{min}\le h\le h_{max}\\
h_{max},&h>h_{max}\end{cases} \tag{8}
$$
where $h$ represents a feature of the hidden embedding in the penultimate layer of the LLM, and $h_{min}$ and $h_{max}$ are two thresholds determining the minimum and maximum truncation activations. When $h_{min}=-\infty$ and $h_{max}=+\infty$, the output feature embedding is equivalent to the original output. To determine the truncation thresholds, a memory bank, which dynamically pushes and pops elements, is maintained to store $N$ token embeddings at test time. Then, for each hidden neuron, the thresholds $h_{min}$ and $h_{max}$ are set to the bottom and top $p$-th percentiles of the features in the memory bank. Following the three-sigma rule (Pukelsheim, 1994), we set $p=0.2$ in all cases. This implies that the largest and smallest 0.2% of activations in the memory bank are identified as abnormal features and truncated to reduce overconfident generation.
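The clipping step above can be sketched in a few lines of NumPy, assuming a memory bank of cached token embeddings is already available; the bank contents and sizes here are hypothetical stand-ins for real hidden states:

```python
import numpy as np

def feature_clip(h, memory_bank, p=0.2):
    """Test-time feature clipping (Eq. 8). Per-neuron thresholds h_min/h_max
    are the bottom and top p-th percentiles of the activations cached in the
    memory bank (shape (N, d)); h is one hidden embedding of shape (d,)."""
    h_min = np.percentile(memory_bank, p, axis=0)          # bottom p-th percentile
    h_max = np.percentile(memory_bank, 100.0 - p, axis=0)  # top p-th percentile
    return np.clip(h, h_min, h_max)

# Hypothetical usage: a bank of N cached token embeddings, one extreme feature.
rng = np.random.default_rng(0)
bank = rng.normal(size=(2000, 512))
h = rng.normal(size=512)
h[0] = 50.0                       # an abnormal, overconfidence-prone activation
clipped = feature_clip(h, bank)
assert clipped[0] < 50.0          # the extreme feature is truncated
```

With $p=0.2$, only the most extreme 0.2% of activations on each side are touched, so ordinary features pass through unchanged.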
## 4 Experiments
### 4.1 Experimental Setup
Datasets. We utilize four widely used question answering (QA) datasets for evaluation: two open-book conversational QA datasets, CoQA (Reddy et al., 2019) and SQuAD (Rajpurkar et al., 2016), and two closed-book QA datasets, TriviaQA (Joshi et al., 2017) and Natural Questions (NQ) (Kwiatkowski et al., 2019). We follow Lin et al. (2023) in using the development split of CoQA with 7,983 QA pairs, the validation split of NQ with 3,610 QA pairs, and the validation split of TriviaQA (rc.nocontext subset) with 9,960 deduplicated QA pairs. For SQuAD, we filter out QA pairs whose is_impossible flag is True and use the resulting subset of the development-v2.0 split with 5,928 QA pairs. Sequence lengths vary across the four datasets: the ground-truth answers in CoQA and SQuAD are relatively long, while TriviaQA answers typically consist of only one or two words.
Models. We use two representative open-source LLMs, LLaMA (Touvron et al., 2023a) and OPT (Zhang et al., 2022), in our experiments. Specifically, we consider the off-the-shelf LLaMA-7B https://huggingface.co/decapoda-research/llama-7b-hf, LLaMA-13B https://huggingface.co/decapoda-research/llama-13b-hf and OPT-6.7B https://huggingface.co/facebook/opt-6.7b checkpoints, together with their corresponding tokenizers provided by Hugging Face. We use the pre-trained weights and do not fine-tune the models in any experiment.
Evaluation Metrics. Following prior work Kuhn et al. (2022); Ren et al. (2022), we evaluate the hallucination detection ability of different methods by employing them to determine whether the generation is correct or not. Therefore, the area under the receiver operator characteristic curve (AUROC) and Pearson Correlation Coefficient (PCC) are utilized as the performance measure. AUROC is a popular metric to evaluate the quality of a binary classifier and uncertainty measure (Ren et al., 2022; Lin et al., 2023). Higher AUROC scores are better. PCC is utilized to measure the correlation between the hallucination detection metric and the correctness measure, which is usually defined as the ROUGE score (Lin, 2004) or semantic similarity (Reimers & Gurevych, 2019) between the generated answers and ground truth answers. A higher PCC score is better.
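For concreteness, both metrics can be computed without an external evaluation library. The sketch below is our own minimal implementation (not the paper's evaluation code): AUROC via the rank-sum identity and PCC via the sample correlation, assuming higher scores indicate hallucination (label 1).

```python
import numpy as np


def auroc(scores, labels):
    """AUROC: probability that a random positive (hallucinated) example
    scores higher than a random negative one, via the rank-sum identity.
    Ties are not rank-averaged in this minimal version."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def pcc(x, y):
    """Pearson correlation between a detection metric and a correctness measure."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])
```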
Baselines. We compare our proposal with the most popular uncertainty-based methods Perplexity Ren et al. (2022) and Length-normalized Entropy (LN-Entropy) Malinin & Gales (2020), and the consistency-based metric Lexical Similarity (Lin et al., 2022). Besides, in order to investigate whether traditional OOD detection methods can be used for hallucination detection, we also introduce a popular OOD detection method Energy score (Liu et al., 2020) as a comparison method.
Correctness Measure. We follow Kuhn et al. (2022); Lin et al. (2023) in using both ROUGE-L (Lin, 2004) and semantic similarity (Reimers & Gurevych, 2019) as correctness measures. ROUGE-L https://github.com/google-research/google-research/tree/master/rouge is an n-gram based metric that computes the longest common subsequence between two pieces of text. A generation is regarded as correct when its ROUGE-L (f-measure) is larger than a given threshold, which we set to 0.5 in our main experiments. We also use embedding similarity as a correctness measure: the sentence embeddings of the model generation and the ground truth answer are extracted by the nli-roberta-large model https://huggingface.co/sentence-transformers/nli-roberta-large, and the generation is regarded as correct when the cosine similarity between the two embeddings exceeds 0.9.
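A minimal sketch of the ROUGE-L based correctness check (our own longest-common-subsequence implementation for illustration; the experiments use the Google rouge package linked above):

```python
def lcs_len(a, b):
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l_f(pred, ref):
    """ROUGE-L f-measure over whitespace tokens."""
    p_tok, r_tok = pred.split(), ref.split()
    lcs = lcs_len(p_tok, r_tok)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(p_tok), lcs / len(r_tok)
    return 2 * prec * rec / (prec + rec)


def is_correct(pred, ref, threshold=0.5):
    """Label a generation as correct when ROUGE-L f-measure exceeds the threshold."""
    return rouge_l_f(pred, ref) >= threshold
```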
Implementation Details. Our implementation is based on the PyTorch and Transformers libraries. For the sampling hyperparameters of the LLM decoder, we set the temperature to 0.5, top-p to 0.99 and top-k to 5 throughout the experiments. The number of generations is set to $K=10$. For the sentence embedding used in our proposal, we use the last token embedding of the sentence in the middle layer, i.e., the layer index is set to int(L/2). For the regularization term of the covariance matrix, we set $\alpha=0.001$. For the memory bank used to conserve token embeddings, we set $N=3000$. When implementing the Energy score, we average the token-level energy scores to obtain the sentence-level score.
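As a rough illustration of how these pieces fit together, the core EigenScore computation can be sketched as below. This is a hedged reconstruction from the description in this paper (eigenvalues of the covariance of the $K$ sentence embeddings, regularized by $\alpha I$); the Gram-matrix form and the mean-log-eigenvalue normalization are our assumptions and may differ in detail from the formula in Section 3.

```python
import numpy as np


def eigen_score(embeddings, alpha=1e-3):
    """Consistency/diversity score of K generations in embedding space (sketch).

    embeddings: (K, d) array of sentence embeddings, one per sampled generation.
    Larger values indicate more semantically diverse (less self-consistent)
    generations, i.e. a higher chance of hallucination.
    """
    Z = np.asarray(embeddings, dtype=float)
    Z = Z - Z.mean(axis=0, keepdims=True)      # center across the K generations
    K = Z.shape[0]
    # K x K Gram form keeps the eigendecomposition small, since d >> K;
    # its nonzero eigenvalues match those of the d x d covariance.
    cov = Z @ Z.T / K + alpha * np.eye(K)      # regularize for numerical stability
    eigvals = np.linalg.eigvalsh(cov)          # all >= alpha > 0
    return float(np.mean(np.log(eigvals)))
```

For identical generations the centered embeddings vanish and the score collapses to $\log\alpha$; diverse generations spread mass over larger eigenvalues and raise the score.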
### 4.2 Main Results
Table 1: Hallucination detection performance evaluation of different methods on four QA tasks. AUROC (AUC) and Pearson Correlation Coefficient (PCC) are utilized to measure the performance. $AUC_s$ represents AUROC score with sentence similarity as correctness measure, and $AUC_r$ represents AUROC score with ROUGE-L score as correctness measure. All numbers are percentages.
| Model | Method | CoQA AUC_s | CoQA AUC_r | CoQA PCC | SQuAD AUC_s | SQuAD AUC_r | SQuAD PCC | NQ AUC_s | NQ AUC_r | NQ PCC | TriviaQA AUC_s | TriviaQA AUC_r | TriviaQA PCC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA-7B | Perplexity | 64.1 | 68.3 | 20.4 | 57.5 | 60.0 | 10.2 | 74.0 | 74.7 | 30.1 | 83.6 | 83.6 | 54.4 |
| LLaMA-7B | Energy | 51.7 | 54.7 | 1.0 | 45.1 | 47.6 | -10.7 | 64.3 | 64.8 | 18.2 | 66.8 | 67.1 | 29.1 |
| LLaMA-7B | LN-Entropy | 68.7 | 73.6 | 30.6 | 70.1 | 70.9 | 30.0 | 72.8 | 73.7 | 29.8 | 83.4 | 83.2 | 54.0 |
| LLaMA-7B | Lexical Similarity | 74.8 | 77.8 | 43.5 | 74.9 | 76.4 | 44.0 | 73.8 | 75.9 | 30.6 | 82.6 | 84.0 | 55.6 |
| LLaMA-7B | EigenScore | 80.4 | 80.8 | 50.8 | 81.5 | 81.2 | 53.5 | 76.5 | 77.1 | 38.3 | 82.7 | 82.9 | 57.4 |
| LLaMA-13B | Perplexity | 63.2 | 66.2 | 20.1 | 59.1 | 61.7 | 14.2 | 73.5 | 73.4 | 36.3 | 84.7 | 84.5 | 56.5 |
| LLaMA-13B | Energy | 47.5 | 49.2 | -5.9 | 36.0 | 39.2 | -20.2 | 59.1 | 59.8 | 14.7 | 71.3 | 71.5 | 36.7 |
| LLaMA-13B | LN-Entropy | 68.8 | 72.9 | 31.2 | 72.4 | 74.0 | 36.6 | 74.9 | 75.2 | 39.4 | 83.4 | 83.1 | 54.2 |
| LLaMA-13B | Lexical Similarity | 74.8 | 77.6 | 44.1 | 77.4 | 79.1 | 48.6 | 74.9 | 76.8 | 40.3 | 82.9 | 84.3 | 57.5 |
| LLaMA-13B | EigenScore | 79.5 | 80.4 | 50.2 | 83.8 | 83.9 | 57.7 | 78.2 | 78.1 | 49.0 | 83.0 | 83.0 | 58.4 |
| OPT-6.7B | Perplexity | 60.9 | 63.5 | 11.5 | 58.4 | 69.3 | 8.6 | 76.4 | 77.0 | 32.9 | 82.6 | 82.0 | 50.0 |
| OPT-6.7B | Energy | 45.6 | 45.9 | -14.5 | 41.6 | 43.3 | -16.4 | 60.3 | 58.6 | 25.6 | 70.6 | 68.8 | 37.3 |
| OPT-6.7B | LN-Entropy | 61.4 | 65.4 | 18.0 | 65.5 | 66.3 | 22.0 | 74.0 | 76.1 | 28.4 | 79.8 | 80.0 | 43.0 |
| OPT-6.7B | Lexical Similarity | 71.2 | 74.0 | 38.4 | 72.8 | 74.0 | 39.3 | 71.5 | 74.3 | 23.1 | 78.2 | 79.7 | 42.5 |
| OPT-6.7B | EigenScore | 76.5 | 77.5 | 45.6 | 81.7 | 80.8 | 49.9 | 77.9 | 77.2 | 33.5 | 80.3 | 80.4 | 48.5 |
Effectiveness of EigenScore. In Table 1, we compare our proposed EigenScore with several representative reliability evaluation methods on three LLMs and four QA datasets. The results show that: (1) On both LLaMA and OPT models, EigenScore consistently outperforms the comparison methods by a large margin on CoQA, SQuAD and NQ under different evaluation metrics; in particular, it outperforms Lexical Similarity by up to 5.6% on CoQA and 8.9% on SQuAD in AUROC. (2) Interestingly, Perplexity performs best on TriviaQA but poorly on the other datasets, especially CoQA and SQuAD. This is because the generations and ground truth answers on TriviaQA are very simple, mostly only one or two words; the performance of different methods on TriviaQA is therefore close, and simply averaging token-level confidence as the uncertainty measure works well. (3) On average, performance with LLaMA-13B is better than with LLaMA-7B and OPT-6.7B, while LLaMA-7B is slightly better than OPT-6.7B. This indicates that better hallucination detection can be achieved with a more powerful pre-trained LLM.
Effectiveness of Feature Clipping. To demonstrate the effectiveness of the introduced test-time feature clipping, we compare the hallucination detection performance of different methods with and without applying the feature clipping technique. The results are shown in Table 2. As can be seen, the introduced feature clipping consistently improves the performance of different methods, with the largest improvement being 1.8% in AUROC.
Table 2: Hallucination detection performance evaluation of different methods with and without (w/o) applying feature clipping (FC). "+FC" denotes applying feature clipping, and EigenScore (w/o) denotes EigenScore without feature clipping. All numbers are percentages.
| Methods | LLaMA-7B CoQA AUC_s | LLaMA-7B CoQA PCC | LLaMA-7B NQ AUC_s | LLaMA-7B NQ PCC | OPT-6.7B CoQA AUC_s | OPT-6.7B CoQA PCC | OPT-6.7B NQ AUC_s | OPT-6.7B NQ PCC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LN-Entropy | 68.7 | 30.6 | 72.8 | 29.8 | 61.4 | 18.0 | 74.0 | 28.4 |
| LN-Entropy + FC | 70.0 | 33.4 | 73.4 | 31.1 | 62.6 | 21.4 | 74.8 | 30.3 |
| Lexical Similarity | 74.8 | 43.5 | 73.8 | 30.6 | 71.2 | 38.4 | 71.5 | 23.1 |
| Lexical Similarity + FC | 76.6 | 46.3 | 74.8 | 32.1 | 72.6 | 40.2 | 72.4 | 24.2 |
| EigenScore (w/o) | 79.3 | 48.9 | 75.9 | 38.3 | 75.3 | 43.1 | 77.1 | 32.2 |
| EigenScore | 80.4 | 50.8 | 76.5 | 38.3 | 76.5 | 45.6 | 77.9 | 33.5 |
### 4.3 Ablation Studies
*Figure 3(a) (x4.png): line chart of AUROC vs. number of generations (5 to 40). EigenScore climbs from ~74.5 to ~77.8 and stays highest throughout; Lexical Similarity plateaus near ~75 after 15 generations; LN-Entropy remains flat around ~73.*
*Figure 3(b) (x5.png): bar chart of AUROC vs. sentence-embedding layer index (5, 10, 20, 30, 33). Performance peaks at the middle layers (~80.6 at layer 20) and drops toward the shallowest (~78.9) and final (~79.2) layers; the orange dashed line (~80.4) and gray dashed line (~78.8) mark the two reference settings described in the caption below.*
Figure 3: (a) Performance in LLaMA-7B and NQ dataset with different number of generations. (b) Performance in LLaMA-7B and CoQA dataset with sentence embedding in different layers. Orange line indicates using the last token's embedding in the middle layer (layer 17) as sentence embedding. Gray line indicates using the averaged token embedding in the last layer as sentence embedding. The performance is measured by $AUROC_s$.
Number of Generations. For methods that exploit semantic consistency for hallucination detection, the number of generations $K$ is a key factor in performance. To evaluate its impact, we select $K$ from $\{5,10,15,20,30,40\}$ and run experiments with LLaMA-7B on the NQ dataset. The results in Figure 3 show that: (1) Our proposed EigenScore consistently outperforms LN-Entropy and Lexical Similarity by a large margin for every $K$. (2) When $K<15$, the performance of all methods increases with $K$; when $K>15$, it tends to remain stable. The results suggest that setting $K$ to 20 provides a good trade-off between performance and inference cost. (3) Compared to EigenScore and Lexical Similarity, LN-Entropy is less sensitive to the number of generations, which indicates that Lexical Similarity and our EigenScore are more effective at exploiting the information in multiple generations.
How EigenScore Performs with Different Sentence Embeddings. In the main experiments, we employ the embedding of the last token in the middle layer as the sentence embedding. Here, we investigate how the method performs with alternative sentence embeddings. Figure 3 shows the hallucination detection performance when sentence embeddings are taken from different layers. Using embeddings from the shallow or final layers yields significantly inferior performance compared to layers close to the middle. Another interesting observation is that using the last token's embedding as the sentence embedding outperforms simply averaging the token embeddings, which suggests that the last token in the middle layers retains more information about truthfulness.
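The embedding-extraction choices compared here can be captured in a small helper. This is an illustrative sketch (the function name and interface are our own), assuming `hidden_states` is the per-layer output of a Transformers forward pass with `output_hidden_states=True`, converted to arrays of shape (seq_len, d):

```python
import numpy as np


def sentence_embedding(hidden_states, layer=None, strategy="last_token"):
    """Extract a sentence embedding from per-layer hidden states.

    hidden_states: sequence of (seq_len, d) arrays; index 0 holds the input
    embeddings and index L the final layer. By default we take the last token
    of the middle layer int(L/2), the setting used in the main experiments;
    strategy="mean" averages token embeddings instead (the gray-line baseline
    in Figure 3 applies this to the last layer).
    """
    L = len(hidden_states) - 1                 # number of transformer layers
    layer = L // 2 if layer is None else layer
    h = np.asarray(hidden_states[layer])
    return h[-1] if strategy == "last_token" else h.mean(axis=0)
```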
Sensitivity to Correctness Measures. It is difficult to design automatic QA metrics that correlate well with human evaluation, so the choice of correctness measure is a crucial component of hallucination detection evaluation. In this section, we evaluate performance under different correctness-measure thresholds with LLaMA-7B on the CoQA dataset. The experimental results are presented in Table 3. They show that the threshold has a large influence on the final hallucination detection performance. Notably, our proposed EigenScore consistently outperforms the comparison methods at all thresholds. The results also indicate that the hallucination detection performance of all methods improves under a more rigorous correctness measure.
Table 3: Performance evaluation with different correctness measure thresholds in LLaMA-7B and CoQA dataset. The ROUGE-L (f-measure) score and Sentence Similarity with different thresholds are employed to measure the correctness of the generated answers.
| Methods | ROUGE-L (increasing thresholds) | | | Sentence Similarity (increasing thresholds) | | |
| --- | --- | --- | --- | --- | --- | --- |
| Perplexity | 65.2 | 68.3 | 68.1 | 63.7 | 63.5 | 64.1 |
| LN-Entropy | 67.4 | 73.6 | 74.1 | 65.2 | 65.6 | 68.7 |
| Lexical Similarity | 75.8 | 77.8 | 79.3 | 72.8 | 73.9 | 74.8 |
| EigenScore | 76.4 | 80.8 | 83.5 | 75.9 | 77.2 | 80.4 |
Sensitivity to Hyperparameters. The decoding hyperparameters of the LLM, including temperature, top-k and top-p, determine the diversity of the generations. To evaluate their impact, we provide a sensitivity analysis in Figure 4. As observed, performance is strongly influenced by temperature but shows little sensitivity to top-k. The performance of the consistency-based methods (EigenScore and Lexical Similarity) drops significantly when the temperature exceeds 1; a suitable temperature can be selected from $[0.1,1.0]$.
*Figure 4(a) (x6.png): line chart of AUROC vs. temperature (0.1 to 5). EigenScore and Lexical Similarity peak around temperature 0.3-0.5 (~80 and ~75, respectively) and degrade sharply above 1 (~58 and ~57 at temperature 5); LN-Entropy stays near 66-68 and Perplexity is flat at ~64 throughout.*
*Figure 4(b) (x7.png): line chart of AUROC vs. top-k (3 to 50). All four methods are nearly flat across the range, with a stable ranking: EigenScore (~80) > Lexical Similarity (~75) > LN-Entropy (~68) > Perplexity (~64).*
Figure 4: (a) Performance sensitivity to temperature. (b) Performance sensitivity to top-k. The performance is measured by $AUROC_s$ .
## 5 Related Work
Reliability Evaluation of LLMs During real-world deployments, the reliability of LLMs poses a substantial challenge, as LLMs reveal a propensity to produce unreliable generations (Ji et al., 2023; Zhang et al., 2023). Therefore, considerable efforts have been made to address the security and reliability evaluation of LLMs (Huang et al., 2023; Malinin & Gales, 2020; Kuhn et al., 2022; Kadavath et al., 2022; Cohen et al., 2023; Azaria & Mitchell, 2023). Among these methods, uncertainty-based metrics have been widely explored, typically involving the predictive confidence or entropy of the output tokens (Malinin & Gales, 2020; Kuhn et al., 2022; Duan et al., 2023). Besides, consistency-based methods also play an important role in reliability evaluation; they hypothesize that LLMs tend to generate logically inconsistent responses to the same question when they are indecisive and hallucinating (Kuhn et al., 2022; Raj et al., 2023; Manakul et al., 2023). Based on the consistency hypothesis, researchers have also found it feasible to prompt LLMs to evaluate their own responses (Kadavath et al., 2022; Cohen et al., 2023; Manakul et al., 2023).
Eigenvalue as Divergence Measure The eigenvalues or determinant of a covariance matrix capture the variability of the data and have been widely explored as divergence measures in a wide range of machine learning tasks (Wold et al., 1987; Kulesza & Taskar, 2011; Xu et al., 2021; Zhouyin & Liu, 2021; Cai et al., 2015). For instance, Wold et al. (1987) proposed the well-known Principal Component Analysis (PCA) and demonstrated that the largest eigenvalues of the sample covariance matrix correspond to the principal semantics of the sample set. Besides, the determinant of the covariance matrix, which is determined by its eigenvalues, has been utilized to sample a diverse subset in determinantal point processes (DPP) (Kulesza & Taskar, 2011) and data valuation (Xu et al., 2021), demonstrating that it is a good diversity measure. In addition, several studies have proposed to approximate the differential entropy with the logarithm of the determinant of the covariance matrix (Zhouyin & Liu, 2021; Klir & Wierman, 1999).
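The covariance log-determinant view above can be made concrete in a few lines. The sketch below is our own illustration (not code from any of the cited works): it scores a set of row-wise embeddings by the log-determinant of their regularized covariance spectrum, so that diverse embeddings receive a larger score than near-identical ones.

```python
import numpy as np

def logdet_diversity(embeddings: np.ndarray, alpha: float = 1e-3) -> float:
    """Log-determinant of the regularized covariance of row-wise embeddings.

    A larger value indicates greater diversity among the rows; `alpha`
    regularizes the spectrum so the determinant stays well defined when
    the number of samples is smaller than the embedding dimension.
    """
    n, _ = embeddings.shape
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # The n x n Gram matrix shares its nonzero eigenvalues with the
    # d x d covariance matrix and is cheaper to decompose when n << d.
    gram = centered @ centered.T / n
    eigvals = np.linalg.eigvalsh(gram) + alpha
    return float(np.sum(np.log(eigvals)))

rng = np.random.default_rng(0)
diverse = rng.normal(size=(10, 64))                      # ten unrelated vectors
consistent = np.tile(rng.normal(size=(1, 64)), (10, 1))  # ten identical vectors
assert logdet_diversity(diverse) > logdet_diversity(consistent)
```

Working on the Gram matrix rather than the full covariance is the standard trick when the number of sampled responses is far smaller than the embedding dimension.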
## 6 Conclusion
Measuring the hallucination degree of LLMs' generations is of critical importance in enhancing the security and reliability of LLM-based AI systems. This work presents the INSIDE framework to exploit the semantic information that is retained within the internal states of LLMs for hallucination detection. Specifically, a simple yet effective EigenScore is proposed to measure the semantic consistency across different generations in the embedding space. Besides, to identify the self-consistent (overconfident) hallucinations that have been overlooked by previous methods, a feature clipping technique is introduced to reduce overconfident generations by truncating extreme features. Significant performance improvements have been achieved on several popular LLMs and QA benchmarks. Although our experiments focus on QA tasks, our method does not make any assumptions about the task modality, and we believe it is widely applicable to other tasks, such as summarization and translation. We hope that our insights inspire future research to further explore the internal semantics of LLMs for hallucination detection.
## References
- Almazrouei et al. (2023) Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Maitha Alhammadi, Mazzotta Daniele, Daniel Heslow, Julien Launay, Quentin Malartic, et al. The falcon series of language models: Towards open frontier models. Hugging Face repository, 2023.
- Azaria & Mitchell (2023) Amos Azaria and Tom Mitchell. The internal state of an llm knows when it's lying. arXiv preprint arXiv:2304.13734, 2023.
- Bai et al. (2022) Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
- Cai et al. (2015) T Tony Cai, Tengyuan Liang, and Harrison H Zhou. Law of log determinant of sample covariance matrix and optimal estimation of differential entropy for high-dimensional gaussian distributions. Journal of Multivariate Analysis, 137:161–172, 2015.
- Chang et al. (2023) Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Kaijie Zhu, Hao Chen, Linyi Yang, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. A survey on evaluation of large language models. arXiv preprint arXiv:2307.03109, 2023.
- Chen et al. (2024) Chao Chen, Zhihang Fu, Kai Liu, Ze Chen, Mingyuan Tao, and Jieping Ye. Optimal parameter and neuron pruning for out-of-distribution detection. Advances in Neural Information Processing Systems, 36, 2024.
- Cohen et al. (2023) Roi Cohen, May Hamri, Mor Geva, and Amir Globerson. Lm vs lm: Detecting factual errors via cross examination. arXiv e-prints, pp. arXiv–2305, 2023.
- Djurisic et al. (2022) Andrija Djurisic, Nebojsa Bozanic, Arjun Ashok, and Rosanne Liu. Extremely simple activation shaping for out-of-distribution detection. In The Eleventh International Conference on Learning Representations, 2022.
- Duan et al. (2023) Jinhao Duan, Hao Cheng, Shiqi Wang, Chenan Wang, Alex Zavalny, Renjing Xu, Bhavya Kailkhura, and Kaidi Xu. Shifting attention to relevance: Towards the uncertainty estimation of large language models. arXiv preprint arXiv:2307.01379, 2023.
- Hendrycks & Gimpel (2016) Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations, 2016.
- Huang et al. (2023) Yuheng Huang, Jiayang Song, Zhijie Wang, Huaming Chen, and Lei Ma. Look before you leap: An exploratory study of uncertainty measurement for large language models. arXiv e-prints, pp. arXiv–2307, 2023.
- Ji et al. (2023) Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023.
- Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1601–1611, 2017.
- Kadavath et al. (2022) Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv e-prints, pp. arXiv–2207, 2022.
- Klir & Wierman (1999) George Klir and Mark Wierman. Uncertainty-based information: elements of generalized information theory, volume 15. Springer Science & Business Media, 1999.
- Kuhn et al. (2022) Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In The Eleventh International Conference on Learning Representations, 2022.
- Kulesza & Taskar (2011) Alex Kulesza and Ben Taskar. k-dpps: Fixed-size determinantal point processes. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 1193–1200, 2011.
- Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019.
- Li et al. (2023) Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. arXiv preprint arXiv:2306.03341, 2023.
- Liang et al. (2022) Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.
- Lin (2004) Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pp. 74–81, 2004.
- Lin et al. (2023) Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. Generating with confidence: Uncertainty quantification for black-box large language models. arXiv e-prints, pp. arXiv–2305, 2023.
- Lin et al. (2022) Zi Lin, Jeremiah Zhe Liu, and Jingbo Shang. Towards collaborative neural-symbolic graph semantic parsing via uncertainty. Findings of the Association for Computational Linguistics: ACL 2022, 2022.
- Liu et al. (2020) Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. Energy-based out-of-distribution detection. Advances in neural information processing systems, 33:21464–21475, 2020.
- Malinin & Gales (2020) Andrey Malinin and Mark Gales. Uncertainty estimation in autoregressive structured prediction. In International Conference on Learning Representations, 2020.
- Manakul et al. (2023) Potsawee Manakul, Adian Liusie, and Mark JF Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896, 2023.
- OpenAI (2023) OpenAI. Gpt-4 technical report, 2023.
- Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Pukelsheim (1994) Friedrich Pukelsheim. The three sigma rule. The American Statistician, 48(2):88–91, 1994.
- Raj et al. (2023) Harsh Raj, Vipul Gupta, Domenic Rosati, and Subhabrata Majumdar. Semantic consistency for assuring reliability of large language models. arXiv preprint arXiv:2308.09138, 2023.
- Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392, 2016.
- Reddy et al. (2019) Siva Reddy, Danqi Chen, and Christopher D Manning. Coqa: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266, 2019.
- Reimers & Gurevych (2019) Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, 2019.
- Ren et al. (2022) Jie Ren, Jiaming Luo, Yao Zhao, Kundan Krishna, Mohammad Saleh, Balaji Lakshminarayanan, and Peter J Liu. Out-of-distribution detection and selective generation for conditional language models. In The Eleventh International Conference on Learning Representations, 2022.
- Shi et al. (2022) Freda Shi, Daniel Fried, Marjan Ghazvininejad, Luke Zettlemoyer, and Sida I Wang. Natural language to code translation with execution. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 3533–3546, 2022.
- Sun et al. (2021) Yiyou Sun, Chuan Guo, and Yixuan Li. React: Out-of-distribution detection with rectified activations. Advances in Neural Information Processing Systems, 34:144–157, 2021.
- Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
- Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
- Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
- Wold et al. (1987) Svante Wold, Kim Esbensen, and Paul Geladi. Principal component analysis. Chemometrics and intelligent laboratory systems, 2(1-3):37–52, 1987.
- Xu et al. (2021) Xinyi Xu, Zhaoxuan Wu, Chuan Sheng Foo, and Bryan Kian Hsiang Low. Validation free and replication robust volume-based data valuation. Advances in Neural Information Processing Systems, 34:10837–10848, 2021.
- Yin et al. (2023) Zhangyue Yin, Qiushi Sun, Qipeng Guo, Jiawen Wu, Xipeng Qiu, and Xuanjing Huang. Do large language models know what they don't know? arXiv preprint arXiv:2305.18153, 2023.
- Zhang et al. (2022) Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
- Zhang et al. (2023) Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. Siren's song in the ai ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219, 2023.
- Zhou et al. (2023) Kaitlyn Zhou, Dan Jurafsky, and Tatsunori Hashimoto. Navigating the grey area: Expressions of overconfidence and uncertainty in language models. arXiv preprint arXiv:2302.13439, 2023.
- Zhouyin & Liu (2021) Zhanghao Zhouyin and Ding Liu. Understanding neural networks with logarithm determinant entropy estimator. arXiv preprint arXiv:2105.03705, 2021.
## Appendix A Performance Evaluation on TruthfulQA
TruthfulQA is an important benchmark for evaluating the truthfulness of LLMs. Therefore, we also compare our proposal with the baseline methods on the TruthfulQA benchmark. The optimal classification threshold is determined by maximizing the G-Mean value, which is defined as $\text{G-Mean}=\sqrt{TPR \times (1-FPR)}$. The results are presented in Table 4. For ITI (Li et al., 2023), which trains multiple binary classifiers on the internal embeddings for hallucination detection, we report the best performance from their paper. As can be seen, our proposal consistently outperforms the baseline methods and achieves performance comparable to ITI when we utilize 50 in-distribution prompts. It is worth noting that ITI relies on training 1024 binary classifiers on the TruthfulQA dataset, and its best performance (83.3) is reported on the validation set. Therefore, its best performance exceeds that of our proposal, which has not been trained on TruthfulQA. However, training on the validation set also limits the generalization of their method to other domains (Li et al., 2023). As TruthfulQA is a very challenging dataset for LLMs, zero-shot inference results in poor performance. Therefore, we follow previous work (Bai et al., 2022) and utilize different numbers of in-distribution prompts during inference. The results show that the performance improves significantly as the number of prompts increases, which also explains why ITI performs well.
Table 4: Performance comparison of different methods on the TruthfulQA dataset. LexicalSim denotes Lexical Similarity and SelfCKGPT denotes SelfCheckGPT. Hallucination detection accuracy is reported. # Prompt denotes the number of prompt templates. For ITI (Li et al., 2023), we report the best number in their paper directly. All numbers are percentages.
| # Prompt | Perplexity | LN-Entropy | LexicalSim | SelfCKGPT | ITI | EigenScore |
| --- | --- | --- | --- | --- | --- | --- |
| 5 | 70.0 | 71.2 | 73.6 | 74.2 | 83.3 | 76.7 |
| 20 | 76.4 | 77.7 | 77.9 | 76.8 | 83.3 | 79.5 |
| 50 | 73.1 | 77.9 | 73.6 | 78.3 | 83.3 | 81.3 |
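The G-Mean threshold-selection rule used above can be sketched as follows (our own illustration with hypothetical variable names, not the paper's code): sweep candidate thresholds over the detection scores and keep the one maximizing $\sqrt{TPR \times (1-FPR)}$.

```python
import numpy as np

def best_gmean_threshold(scores, labels):
    """Pick the decision threshold maximizing G-Mean = sqrt(TPR * (1 - FPR)).

    `scores`: higher means more likely hallucinated; `labels`: 1 = hallucination.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_t, best_g = None, -1.0
    for t in np.unique(scores):
        pred = scores >= t                       # flag as hallucination
        tpr = (pred & (labels == 1)).sum() / max((labels == 1).sum(), 1)
        fpr = (pred & (labels == 0)).sum() / max((labels == 0).sum(), 1)
        g = np.sqrt(tpr * (1.0 - fpr))
        if g > best_g:
            best_t, best_g = t, g
    return best_t, best_g
```

In practice one would compute the sweep from an ROC curve rather than from raw unique scores, but the selection criterion is the same.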
## Appendix B Comparison with More Competitive Methods
To demonstrate the effectiveness of our proposal, we also compare our EigenScore with several competitive methods, including Semantic Entropy (SemanticEnt) (Kuhn et al., 2022), Shifting Attention to Relevance (SentSAR) (Duan et al., 2023), and SelfCheckGPT (SelfCKGPT) (Manakul et al., 2023). We follow the experimental setting in Duan et al. (2023) and set the number of generations to $N=10$ for OPT-6.7B and $N=5$ for LLaMA. For SemanticEnt and SentSAR, we report the numbers in Duan et al. (2023) directly. For the implementation of SelfCheckGPT, we leverage the SelfCheckBERTScore provided in the official code package https://github.com/potsawee/selfcheckgpt. The comparison results in Table 5 demonstrate that our EigenScore significantly outperforms the competitors. Additionally, SentSAR and SelfCheckGPT exhibit comparable performance, which is much superior to Semantic Entropy. Note that SentSAR, SelfCheckGPT, and our proposal all evaluate the quality of LLMs' generations by exploring the self-consistency across multiple outputs. However, compared to Semantic Entropy (Kuhn et al., 2022) and SelfCheckGPT (Manakul et al., 2023), which rely on another language model for sentence embedding extraction, our approach leverages the internal states of LLMs, which retain highly concentrated semantic information. Moreover, the EigenScore, defined by the LogDet of the sentence covariance matrix, captures semantic consistency more effectively than sentence-wise similarity (Manakul et al., 2023). Furthermore, the proposed feature clipping strategy allows our method to identify overconfident hallucinations, which have not been investigated by previous works (Kuhn et al., 2022; Manakul et al., 2023).
Table 5: Performance comparison of EigenScore and several state-of-the-art methods on the CoQA dataset. AUC$_s$ represents AUROC with sentence similarity as the correctness measure, and AUC$_r$ represents AUROC with ROUGE-L as the correctness measure. All numbers are percentages.
| Model | SemanticEnt AUC$_s$ | SemanticEnt AUC$_r$ | SentSAR AUC$_s$ | SentSAR AUC$_r$ | SelfCKGPT AUC$_s$ | SelfCKGPT AUC$_r$ | EigenScore AUC$_s$ | EigenScore AUC$_r$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OPT-6.7B | 63.1 | 71.7 | 69.8 | 72.2 | 70.2 | 74.1 | 71.9 | 77.5 |
| LLaMA-7B | 64.9 | 68.2 | 70.4 | 65.8 | 68.7 | 72.9 | 71.2 | 75.7 |
| LLaMA-13B | 65.3 | 66.7 | 71.4 | 64.7 | 68.1 | 77.0 | 72.8 | 79.8 |
## Appendix C Performance Evaluation on More LLMs
In the main experiments, we evaluate the performance of different methods in LLaMA-7B, LLaMA-13B and OPT-6.7B. To demonstrate the robustness of our method across different models, we also provide the performance comparison in the recent LLaMA2-7B (Touvron et al., 2023b) and Falcon-7B models (Almazrouei et al., 2023). Table 6 reveals that our proposal consistently exhibits superior performance compared to the other methods across different LLMs.
Table 6: Performance evaluation on LLaMA2-7B and Falcon-7B. LexicalSim denotes Lexical Similarity and SelfCKGPT denotes SelfCheckGPT. AUC$_s$ and AUC$_r$ are utilized as correctness measures. Other experimental settings are consistent with Table 1.
| Model | Dataset | Perplexity AUC$_s$ | Perplexity AUC$_r$ | LN-Entropy AUC$_s$ | LN-Entropy AUC$_r$ | LexicalSim AUC$_s$ | LexicalSim AUC$_r$ | SelfCKGPT AUC$_s$ | SelfCKGPT AUC$_r$ | EigenScore AUC$_s$ | EigenScore AUC$_r$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA2-7B | CoQA | 62.2 | 66.6 | 69.9 | 75.2 | 74.4 | 77.5 | 72.4 | 75.1 | 78.6 | 80.7 |
| | NQ | 70.8 | 70.2 | 72.1 | 71.2 | 72.1 | 72.9 | 69.1 | 68.1 | 74.4 | 73.7 |
| Falcon-7B | CoQA | 57.0 | 60.6 | 62.6 | 63.2 | 74.8 | 76.4 | 76.7 | 77.9 | 80.8 | 80.6 |
| | NQ | 74.3 | 74.7 | 74.6 | 74.7 | 73.8 | 75.4 | 74.7 | 74.0 | 76.3 | 75.7 |
## Appendix D Computational Efficiency Analysis
As our proposal is a sampling-based approach, additional inference cost is required to generate multiple outputs for accurate hallucination detection. We compare our proposal with the base LLM and the other compared methods on LLaMA-7B and LLaMA-13B. All experiments are performed on an NVIDIA A100, and we set the number of generations to $N=10$ throughout the experiments. The average inference time per question is shown in Fig. 5. As observed, our EigenScore is about 10 times more efficient than methods that rely on another large model to measure self-consistency (such as SelfCheckGPT (Manakul et al., 2023)), and shares a similar computational overhead with LN-Entropy and Lexical Similarity. Compared to the overhead of generating multiple outputs, the cost of feature clipping and EigenScore computation is negligible (0.06s). It is worth noting that the inference overhead of generating multiple results does not grow linearly with the time required to generate a single output, owing to the sampling and decoding strategy of the autoregressive LLM.
<details>
<summary>x8.png Details</summary>

### Visual Description
Bar chart of inference cost (seconds/question) in LLaMA-7B for six methods: BaseLLM 0.24, Perplexity 0.24, LN-Entropy 0.80, LexicalSim 0.81, SelfCKGPT 10.68, EigenScore 0.81. SelfCKGPT is roughly an order of magnitude more expensive than every other method, while EigenScore matches the moderate cost of LN-Entropy and LexicalSim.
</details>
(a) LLaMA-7B
<details>
<summary>x9.png Details</summary>

### Visual Description
Bar chart of inference cost (seconds/question) in LLaMA-13B for six methods: BaseLLM 0.31, Perplexity 0.31, LN-Entropy 1.27, LexicalSim 1.28, SelfCKGPT 10.26, EigenScore 1.27. As with LLaMA-7B, SelfCKGPT is about 8 times more expensive than the moderate-cost cluster (LN-Entropy, LexicalSim, EigenScore), which in turn adds roughly one second over the base inference cost.
</details>
(b) LLaMA-13B
Figure 5: Inference cost comparison of different methods on LLaMA-7B and LLaMA-13B. BaseLLM denotes the LLM without any hallucination detection metric. LexicalSim denotes Lexical Similarity and SelfCKGPT denotes SelfCheckGPT.
## Appendix E Evaluation with Exact Match
In the main experiments, we employ ROUGE and sentence similarity as correctness measures, which are widely used for natural language generation evaluation (Chang et al., 2023; Kuhn et al., 2022; Huang et al., 2023). To facilitate comparison of our work's performance with other works, we also provide evaluation results employing exact match (Liang et al., 2022) as the correctness score, which is a much stricter criterion for determining a generation as correct. The results in Table 7 show conclusions similar to those in Table 1, demonstrating that our proposal significantly outperforms the compared methods in most cases.
Table 7: Performance evaluation with Exact Match as correctness measure. LexicalSim denotes the Lexical Similarity. The experimental settings are consistent with Table 1.
| Model | Dataset | Perplexity | LN-Entropy | LexicalSim | EigenScore |
| --- | --- | --- | --- | --- | --- |
| LLaMA-7B | CoQA | 63.7 | 70.7 | 76.1 | 83.0 |
| | SQuAD | 57.3 | 72.1 | 76.9 | 83.9 |
| | NQ | 75.3 | 75.6 | 75.8 | 80.1 |
| | TriviaQA | 82.5 | 83.4 | 81.8 | 82.4 |
| OPT-6.7B | CoQA | 59.4 | 61.7 | 71.8 | 79.4 |
| | SQuAD | 56.7 | 65.2 | 72.7 | 82.9 |
| | NQ | 79.8 | 78.1 | 73.2 | 79.8 |
| | TriviaQA | 83.8 | 81.3 | 79.3 | 82.7 |
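For reference, exact-match correctness is typically computed after a simple normalization. The sketch below follows the common SQuAD-style recipe (lowercasing, punctuation and article removal, whitespace collapsing); the exact variant used in our experiments may differ.

```python
import re
import string

def exact_match(prediction: str, reference: str) -> bool:
    """SQuAD-style exact match after normalizing both strings."""
    def norm(s: str) -> str:
        s = s.lower()
        # Drop punctuation, then remove English articles and extra spaces.
        s = "".join(ch for ch in s if ch not in set(string.punctuation))
        s = re.sub(r"\b(a|an|the)\b", " ", s)
        return " ".join(s.split())
    return norm(prediction) == norm(reference)
```

Under this normalization, "The Eiffel Tower!" and "eiffel tower" count as an exact match, while any substantive token difference does not.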
## Appendix F More visualization and ablation for Feature Clipping
In Fig. 6, we illustrate the distributions of neuron activations for four selected tokens. As can be seen, the distribution varies considerably across samples. Therefore, it is risky to determine the clipping threshold from only the current input sample (EigenScore-C). One feasible solution is to pre-compute the optimal threshold from a batch of input samples (EigenScore-P); another is to dynamically record activation values and determine the threshold during inference (EigenScore-MB). We have experimented with both solutions, and the results are presented in Table 8. They demonstrate that determining the threshold with a memory bank works slightly better. We attribute this difference to potential variations in the activation distributions across datasets.
Table 8: Ablation study of determining the clipping threshold with different techniques. EigenScore-C indicates determining the threshold with the current input sample. EigenScore-P indicates pre-computing the threshold with a batch of samples. EigenScore-MB denotes using a memory bank to determine the optimal threshold. AUC$_s$ is reported.
| Method | AUC | AUC |
| --- | --- | --- |
| EigenScore-C | 78.1 | 74.8 |
| EigenScore-P | 79.9 | 75.3 |
| EigenScore-MB | 80.4 | 76.5 |
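A minimal sketch of the memory-bank variant (EigenScore-MB), under our own assumptions about the interface: activations observed during inference are accumulated in a bank, and new activations are clipped to percentile thresholds of that bank. The tail fraction `p` and the bank capacity here are illustrative values, not the paper's tuned hyperparameters:

```python
import numpy as np

class FeatureClipper:
    """Test-time feature clipping with a running memory bank (sketch)."""

    def __init__(self, p: float = 0.2, capacity: int = 100_000):
        self.p = p                # tail percentile to clip on each side (assumed value)
        self.capacity = capacity  # keep only the most recent activations
        self.bank = []

    def update(self, acts: np.ndarray) -> None:
        # record observed activations during inference
        self.bank.extend(np.ravel(acts).tolist())
        self.bank = self.bank[-self.capacity:]

    def clip(self, acts: np.ndarray) -> np.ndarray:
        # truncate extreme activations to the bank's percentile thresholds
        lo = np.percentile(self.bank, self.p)
        hi = np.percentile(self.bank, 100 - self.p)
        return np.clip(acts, lo, hi)
```

The pre-computed variant (EigenScore-P) is the special case where `update` is called once on a calibration batch and the bank is then frozen.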
Each of the four panels in Fig. 6 plots "Neuron Activations" against "Neuron Indexes" (roughly 0 to 4000, the hidden dimension). For two of the tokens, activations form a dense band near zero within roughly ±5, with only a few mild spikes of magnitude up to about 10. For the other two tokens, a small number of neurons produce extreme spikes with magnitudes approaching 30, while the bulk of activations remain near zero.
Figure 6: Activation distributions of four selected tokens in LLaMA-7B.
## Appendix G Limitations and future work
By exploring the internal states of LLMs and introducing the EigenScore metric, the hallucination detection performance has been significantly improved. However, this study has several limitations. One critical limitation is that the proposed method relies on the internal information of LLMs and therefore cannot be applied to black-box models (OpenAI, 2023), where users cannot access the hidden states. Additionally, our proposal is a sampling-based approach that necessitates generating multiple outputs, which introduces additional inference overhead for accurate hallucination detection. Furthermore, we only focus on hallucination detection in this study and have not yet used our proposal to mitigate hallucinations. In future work, we hope to reduce the inference cost of our proposal and leverage the EigenScore to mitigate hallucinations. We believe the proposed EigenScore is a strong baseline for detecting hallucinations, and we invite other researchers to utilize and enhance our proposal.
## Appendix H Case Studies
### H.1 Hallucination Detection Cases
We show several cases with LLaMA-7B on the NQ dataset. The number of generations is set to $N=10$. We determine the optimal detection threshold by maximizing the G-Mean value, which is defined as $\text{G-Mean}=\sqrt{TPR\times(1-FPR)}$. The hallucination detection thresholds for the different metrics are Perplexity: 0.535; LN-Entropy: 0.153; LexicalSimilarity: 0.489; SelfCheckGPT (SentBERTScore): 0.168; EigenScore: -1.74. For LexicalSimilarity, a score larger than the threshold indicates non-hallucination; for the other metrics, a score smaller than the threshold indicates non-hallucination. Green indicates a non-hallucinated answer and Red indicates a hallucinated answer generated by the LLM. ✓ (✗) indicates that the hallucination status is (not) correctly identified by the metric.
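The threshold-selection step can be sketched as follows: scan candidate thresholds and keep the one maximizing G-Mean. This is a minimal sketch, assuming higher scores indicate hallucination (so a metric like LexicalSimilarity, where lower means hallucination, would first be negated):

```python
import numpy as np

def best_gmean_threshold(scores: np.ndarray, labels: np.ndarray):
    """Pick the detection threshold maximizing G-Mean = sqrt(TPR * (1 - FPR)).

    scores: per-sample detection scores (higher = predicted hallucination)
    labels: 1 = hallucination, 0 = non-hallucination
    """
    best_t, best_g = None, -1.0
    for t in np.unique(scores):
        pred = scores >= t  # predict hallucination at or above threshold
        tpr = pred[labels == 1].mean() if np.any(labels == 1) else 0.0
        fpr = pred[labels == 0].mean() if np.any(labels == 0) else 0.0
        g = np.sqrt(tpr * (1.0 - fpr))
        if g > best_g:
            best_g, best_t = g, t
    return best_t, best_g
```

On perfectly separable scores this returns the lowest threshold that catches every hallucination without false positives (G-Mean of 1.0); on real data it trades TPR against FPR.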
Question: the german princes who chose the holy roman empire were called GTAns: prince-electors LLMAns: electors BatchGenerations: ['electors', 'electors', 'electors', 'electors', 'electors', 'electors', 'electors', 'electors', 'electors', 'electors'] Perplexity: 0.361 ✓ LN-Entropy: 0.027 ✓ LexicalSimilarity: 1.0 ✓ SentBERTScore: 0.0 ✓ EigenScore: -2.63 ✓ EigenValue: [4.87719579e+00 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03]
Question: where is fe best absorbed in the body GTAns: in the duodenum LLMAns: in the small intestine BatchGenerations: ['in the liver', 'small intestine', 'in the intestines', 'the small intestine', 'the small intestine', 'in the liver', 'small intestine', 'fat', 'in the small intestine', 'fatty tissues'] Perplexity: 0.641 ✓ LN-Entropy: 0.213 ✓ LexicalSimilarity: 0.357 ✓ SentBERTScore: 0.258 ✓ EigenScore: -1.40 ✓ EigenValue: [3.71561676e+00 4.34496729e-01 3.77751922e-01 1.75326593e-01 9.92596975e-02 4.20723353e-02 2.49385766e-02 1.00000000e-03 1.00000000e-03 1.00000000e-03]
Question: who did the united states win its independence from GTAns: the British Empire LLMAns: britain BatchGenerations: ['britain', 'england', 'great britain', 'great britain', 'england', 'england', 'england', 'england', 'great britain', 'great britain'] Perplexity: 0.598 ✗ LN-Entropy: 0.266 ✗ LexicalSimilarity: 0.415 ✗ SentBERTScore: 0.397 ✗ EigenScore: -2.23 ✓ EigenValue: [4.46843402e+00 2.82423429e-01 3.88702191e-02 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03]
Question: who won the most stanley cups in history GTAns: Montreal Canadiens LLMAns: the detroit red wings BatchGenerations: ['the detroit red wings', 'the detroit red wings', 'the detroit red wings', 'the detroit red wings', 'the detroit red wings', 'the detroit red wings', 'the detroit red wings', 'the detroit red wings', 'the detroit red wings', 'the detroit red wings'] Perplexity: 0.366 ✗ LN-Entropy: 0.025 ✗ LexicalSimilarity: 1.0 ✗ SentBERTScore: 0.0 ✗ EigenScore: -2.63 ✗ EigenValue: [5.23534401e+00 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03]
Question: what is the second book in the alchemyst series GTAns: The Magician LLMAns: the alchemyst: the secret of the immortal Nicholas flamel BatchGenerations: ['the magician in the middle', "the magician's nephew", 'the magician', 'the alchemyst', 'the magician', 'the alchemyst', 'the magician in the middle', 'the magician in amsterdam', 'the magician in the middle', 'the secret of the near witch'] Perplexity: 0.459 ✗ LN-Entropy: 0.207 ✓ LexicalSimilarity: 0.499 ✗ SentBERTScore: 0.431 ✓ EigenScore: -1.41 ✓ EigenValue: [2.09138499e+00 6.95605781e-01 3.85931973e-01 3.40671669e-01 2.15372994e-01 1.77304781e-01 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03]
Question: a good that can be used in place of another good GTAns: A substitute good LLMAns: substitute BatchGenerations: ['a substitute', 'a substitute', 'substitute', 'a substitute', 'substitute', 'substitute', 'substitute', 'substitute', 'substitute', 'substitute'] Perplexity: 0.614 ✗ LN-Entropy: 0.194 ✗ LexicalSimilarity: 0.844 ✓ SentBERTScore: 0.076 ✓ EigenScore: -2.40 ✓ EigenValue: [5.11365925e+00 1.75884104e-01 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03]
Question: who said i will not go quietly into the night GTAns: Welsh poet Dylan Thomas LLMAns: douglas macarthur BatchGenerations: ['douglas macarthur', 'elvis presley', 'douglas macarthur', 'douglas macarthur', 'douglas macarthur', 'robin williams', 'robert f. kennedy', 'robin williams', 'robert f kennedy', 'Dwight Eisenhower'] Perplexity: 0.502 ✗ LN-Entropy: 0.121 ✗ LexicalSimilarity: 0.178 ✓ SentBERTScore: 0.337 ✓ EigenScore: -1.61 ✓ EigenValue: [3.32824135e+00 5.87944819e-01 3.70390066e-01 1.70849836e-01 1.17707239e-01 5.17925563e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03]
Question: when did they stop saying mass in latin GTAns: from 1967 onwards LLMAns: 1969 BatchGenerations: ['in 1969', '1969', 'around 1969', 'in 1965', 'in 1969', '1969', 'In 1969, the Roman Missal was changed from Latin to the vernacular.', '1969', '1969 when the second vatican council was held', '1969'] Perplexity: 0.631 ✓ LN-Entropy: 0.152 ✗ LexicalSimilarity: 0.540 ✗ SentBERTScore: 0.273 ✓ EigenScore: -1.59 ✓ EigenValue: [3.80408192e+00 4.83987672e-01 3.03207580e-01 8.80366008e-02 6.59790286e-02 3.26742841e-02 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03]
Question: where does the highway of heroes start and end GTAns: Glen Miller Road in Trenton LLMAns: it starts in ottawa and ends in toronto BatchGenerations: ['start in ottawa and end in toronto', 'it starts in toronto and ends in ottawa', "it starts at the queen's park gate at queen's park and ends at toronto's coronation park", "it starts in toronto and ends at queen's park", 'starts in Toronto and ends in Trenton', 'it starts at the CNE and ends at the cenotaph in downtown toronto', 'it starts in downtown toronto and ends in ottawa', 'it starts in the city of Toronto and ends in the city of Hamilton', 'it starts in toronto and ends in ottawa', 'it starts in Toronto and ends in Ottawa'] Perplexity: 0.657 ✓ LN-Entropy: 0.260 ✓ LexicalSimilarity: 0.594 ✗ SentBERTScore: 0.149 ✗ EigenScore: -1.05 ✓ EigenValue: [3.31983018e+00 3.98560810e-01 2.17094299e-01 2.06965709e-01 1.53575354e-01 1.27925588e-01 7.82365136e-02 3.28158137e-02 1.01995086e-02 1.00000000e-03]
Question: which body part(s) occupy the greatest portion of the primary motor cortex GTAns: face LLMAns: the hand BatchGenerations: ['the hand', 'the hand', 'the fingers', 'the hand', 'the hand', 'the hand and the foot', 'the hand', 'the hand and the foot', 'the hand and face', 'arms and legs'] Perplexity: 1.41 ✓ LN-Entropy: 0.300 ✓ LexicalSimilarity: 0.568 ✗ SentBERTScore: 0.163 ✗ EigenScore: -1.69 ✓ EigenValue: [3.76273036e+00 6.16284067e-01 1.96541049e-01 1.73505005e-01 1.28407153e-01 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03]
Question: who said have you no sense of decency GTAns: Joseph Nye Welch LLMAns: Joseph Nye Welch BatchGenerations: ['Joseph N. Welch', 'Joseph N. Welch', 'joe stalin', 'joseph mccarthy', 'Joseph N. Welch', 'Joseph N. Welch', 'Joseph Nye Welch', 'joseph mccarthy', 'joe mccarthy', 'joseph mccarthy'] Perplexity: 0.666 ✗ LN-Entropy: 0.212 ✗ LexicalSimilarity: 0.437 ✗ SentBERTScore: 0.391 ✗ EigenScore: -1.85 ✓ EigenValue: [3.63114083e+00 8.11672323e-01 2.00385898e-01 3.19140618e-02 1.74251264e-02 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03]
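The EigenScore values in the cases above are, per the paper's definition, derived from the eigenvalues of the regularized covariance of the $N$ generations' sentence embeddings. The sketch below is our minimal reimplementation: the regularizer `alpha=1e-3` matches the floor visible in the printed EigenValue arrays, but the embedding choice and normalization details are assumptions, so it need not reproduce the exact printed numbers:

```python
import numpy as np

def eigenscore(embeddings: np.ndarray, alpha: float = 1e-3) -> float:
    """Semantic-consistency score from N sampled generations.

    embeddings: (N, d) array of sentence embeddings, one per generation.
    Lower score = semantically consistent generations; higher = diverse.
    """
    n = embeddings.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    gram = J @ embeddings @ embeddings.T @ J   # N x N centered Gram/covariance
    lam = np.linalg.eigvalsh(gram + alpha * np.eye(n))  # regularized eigenvalues
    return float(np.mean(np.log(lam)))
```

When all generations collapse to the same embedding, the centered Gram matrix vanishes and every eigenvalue hits the `alpha` floor, giving the minimal score; diverse generations spread mass over many eigenvalues and raise it.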
### H.2 Model Generations with Many and Few Outliers
To demonstrate the relationship between the number of extreme features and model outputs, we provide several examples with many/few extreme features. The results show that when there are many extreme features, the model tends to generate consistent hallucinated outputs across multiple generations. In contrast, when there are few extreme features, the model generates diverse hallucinated outputs, which can be spotted by the different hallucination detection metrics.
Question: who sang on great gig in the sky GTAns: Clare Torry LLMAns: freddie mercury AvgNumOutliers: 15 BatchGenerations: ['freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury'] Perplexity: 0.263 ✗ LN-Entropy: 0.028 ✗ LexicalSimilarity: 1.0 ✗ SentBERTScore: 0.0 ✗ EigenScore: -2.63 ✗ EigenValue: [4.65740187e+00 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03]
Question: what are the top five wine producing states GTAns: Washington LLMAns: California AvgNumOutliers: 13 BatchGenerations: [' California,', ' California,', ' california,', ' California,', ' California,', ' California,', ' California,', ' california,', ' California,', ' California,'] Perplexity: 0.368 ✗ LN-Entropy: 0.075 ✗ LexicalSimilarity: 1.0 ✗ SentBERTScore: 0.054 ✗ EigenScore: -2.42 ✗ EigenValue: [5.30709315e+00 1.13222379e-01 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03]
Question: how many seasons of rules of engagement is there GTAns: 7 LLMAns: 4 seasons AvgNumOutliers: 2 BatchGenerations: ['3 seasons', '4 seasons', '4 seasons', '6 seasons', '7 seasons', '3 (2007-2009)', '3 (2007-2009)', '4 seasons', 'three', '11 seasons'] Perplexity: 0.996 ✓ LN-Entropy: 0.292 ✓ LexicalSimilarity: 0.307 ✓ SentBERTScore: 0.285 ✓ EigenScore: -1.60 ✓ EigenValue: [3.58548815e+00 5.87838054e-01 2.28057934e-01 1.36461300e-01 3.49712302e-02 1.11346059e-02 3.82259086e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03]
Question: where did the first persian gulf war take place GTAns: Israel LLMAns: kuwait AvgNumOutliers: 3 BatchGenerations: ['Iraq', 'Iraq and Kuwait', 'Iraq', 'kuwait', 'kuwait', 'in the middle east', 'in iraq', 'kuwait', 'iraq', 'kuwait'] Perplexity: 0.546 ✓ LN-Entropy: 0.281 ✓ LexicalSimilarity: 0.339 ✓ SentBERTScore: 0.224 ✓ EigenScore: -1.62 ✓ EigenValue: [3.59463352e+00 4.23782982e-01 2.57087067e-01 1.41513403e-01 6.20790226e-02 1.75980481e-02 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03]
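For reference, the AvgNumOutliers statistic reported in the cases above could be computed roughly as follows. Treating an activation more than `z` standard deviations from its token's mean as an "extreme feature" is our own assumption for illustration; the paper's exact cutoff may differ:

```python
import numpy as np

def avg_num_outliers(acts: np.ndarray, z: float = 6.0) -> float:
    """Average number of extreme activations per token.

    acts: (T, d) hidden-state activations for T tokens.
    An activation counts as extreme when it lies more than z standard
    deviations from the mean of its token's activation distribution.
    """
    mu = acts.mean(axis=1, keepdims=True)
    sd = acts.std(axis=1, keepdims=True)
    outliers = np.abs(acts - mu) > z * sd
    return float(outliers.sum(axis=1).mean())
```

Under this counting, the self-consistent hallucination cases above would score high (13-15 extreme features) while the diverse-generation cases score low (2-3), matching the trend the section describes.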
### H.3 Impact of Feature Clipping
The text in yellow represents model generations after applying feature clipping. The results show that after feature clipping, overconfident generations are appropriately suppressed, and some self-consistent hallucinations are finally identified.
Question: what are the top five wine producing states GTAns: Washington LLMAns: California BatchGenerations: [' California,', ' California,', ' california,', ' California,', ' California,', ' California,', ' California,', ' california,', ' California,', ' California,'] Perplexity: 0.368 ✗ LN-Entropy: 0.075 ✗ LexicalSimilarity: 1.0 ✗ SentBERTScore: 0.054 ✗ EigenScore: -2.42 ✗ EigenValue: [5.30709315e+00 1.13222379e-01 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03] BatchGenerations: ['california', 'california', 'Washington', 'california', 'new york', 'california', 'washington', 'california', 'new york', 'michigan'] EigenScore: -1.32 ✓ EigenValue: [3.23392755e+00 8.41049340e-01 2.52322804e-01 1.33473529e-01 7.19449437e-02 6.12184197e-02 1.02734249e-02 5.33703500e-03 3.09878029e-03 1.00000000e-03]
Question: who sang on great gig in the sky GTAns: Clare Torry LLMAns: freddie mercury AvgNumOutliers: 15 BatchGenerations: ['freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury'] Perplexity: 0.263 ✗ LN-Entropy: 0.028 ✗ LexicalSimilarity: 1.0 ✗ SentBERTScore: 0.0 ✗ EigenScore: -2.63 ✗ EigenValue: [4.65740187e+00 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03] BatchGenerations: ['claire torry', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury'] EigenScore: -2.38 ✗ EigenValue: [4.38745800e+00 3.14982649e-01 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03]
Question: who are you in assassin's creed 4 GTAns: third-person perspective LLMAns: Edward Kenway BatchGenerations: ['Edward Kenway', 'Edward Kenway', 'Edward Kenway', 'Edward Kenway', 'Edward Kenway', 'Edward Kenway', 'Edward Kenway', 'Edward Kenway', 'Edward Kenway', 'Edward Kenway'] Perplexity: 0.264 ✗ LN-Entropy: 0.002 ✗ LexicalSimilarity: 1.0 ✗ SentBERTScore: 0.0 ✗ EigenScore: -2.67 ✗ EigenValue: [2.10973201e+00 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03] BatchGenerations: ['Edward Kenway', 'Edward Kenway', 'Connor', 'Edward Kenway', 'connor', 'Connor', 'alexander hamilton', 'Edward Kenway', 'ezio', 'connor'] EigenScore: -1.68 ✓ EigenValue: [3.47825477e+00 7.48127381e-01 3.24792650e-01 2.17182636e-01 8.15050807e-02 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03]