# INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection
**Authors**:
- Zhihang Fu, Jieping Ye (Alibaba Cloud & Zhejiang University)
> Corresponding Author
Abstract
Knowledge hallucinations have raised widespread concerns about the security and reliability of deployed LLMs. Previous efforts in detecting hallucinations have relied on logit-level uncertainty estimation or language-level self-consistency evaluation, where semantic information is inevitably lost during the token-decoding procedure. Thus, we propose to explore the dense semantic information retained within LLMs' INternal States for hallucInation DEtection (INSIDE). In particular, a simple yet effective EigenScore metric is proposed to better evaluate responses' self-consistency, which exploits the eigenvalues of the responses' covariance matrix to measure semantic consistency/diversity in the dense embedding space. Furthermore, from the perspective of self-consistent hallucination detection, a test-time feature clipping approach is explored to truncate extreme activations in the internal states, which reduces overconfident generations and potentially benefits the detection of overconfident hallucinations. Extensive experiments and ablation studies are performed on several popular LLMs and question-answering (QA) benchmarks, showing the effectiveness of our proposal. Code is available at https://github.com/alibaba/eigenscore
1 Introduction
Large Language Models (LLMs) have recently achieved a milestone breakthrough and demonstrated impressive abilities in various applications (Ouyang et al., 2022; OpenAI, 2023). However, it has been widely observed that even the state-of-the-art LLMs often produce factually incorrect or nonsensical generations (Cohen et al., 2023; Ren et al., 2022; Kuhn et al., 2022), which is also known as knowledge hallucination (Ji et al., 2023). These potentially unreliable generations make it risky to deploy LLMs in practical scenarios. Therefore, hallucination detection, that is, accurately detecting and rejecting responses when hallucinations occur in LLMs, has attracted increasing attention from the academic community (Azaria & Mitchell, 2023; Ren et al., 2022; Kuhn et al., 2022).
Token-level uncertainty estimation (e.g., predictive confidence or entropy) has shown its efficacy for hallucination detection in conventional NLP tasks (Malinin & Gales, 2020; Huang et al., 2023). However, how to derive sentence-level uncertainty from the token level remains a challenge, especially for modern auto-regressive LLMs, whose responses are generally diverse and sophisticated (Malinin & Gales, 2020; Kuhn et al., 2022; Duan et al., 2023). Thus, to avoid complicated token-to-sentence uncertainty derivation, researchers have proposed to evaluate sentence uncertainty directly on the output language (Kadavath et al., 2022; Yin et al., 2023; Zhou et al., 2023). Among recent advancements, prompting LLMs to generate multiple responses to the same question and evaluating the self-consistency of those responses has proven effective for hallucination detection (Wang et al., 2022; Shi et al., 2022). However, such a post-hoc semantic measurement on decoded language sentences falls short of precisely modeling logical consistency/divergence (Manakul et al., 2023; Zhang et al., 2023).
Hence, instead of logit-level or language-level uncertainty estimation, this paper proposes to leverage the internal states of LLMs to conduct hallucination detection. The motivation is intuitive: LLMs preserve the highly-concentrated semantic information of the entire sentence within their internal states (Azaria & Mitchell, 2023), allowing for the direct detection of hallucinated responses in the sentence embedding space.
In particular, with the generalized framework of INternal States for hallucInation DEtection (INSIDE), this paper performs hallucination detection from two perspectives. First, skipping secondary semantic extraction via extra models, we directly measure the self-consistency/divergence of the output sentences using the internal states of LLMs. To explore semantic consistency in the embedding space, Section 3.1 introduces an EigenScore metric based on the eigenvalues of the sentence embeddings' covariance matrix. Second, to handle self-consistent (overconfident) hallucinations, we propose to rectify abnormal activations in the internal states. Specifically, Section 3.2 develops a feature clipping approach to truncate extreme features, which tends to prevent overconfident generations during the auto-regressive procedure. In Section 4, the effectiveness of our method is validated through extensive experiments on several well-established QA benchmarks.
The main contributions of our work are as follows:
- We propose a generalized INSIDE framework that leverages the internal states of LLMs to perform hallucination detection.
- We develop an EigenScore metric to measure the semantic consistency in the embedding space, and demonstrate that the proposed EigenScore represents the differential entropy in the sentence embedding space.
- A test time feature clipping approach is introduced to truncate extreme activations in the feature space, which implicitly reduces overconfident generations and helps identify the overconfident hallucinations.
- We achieve state-of-the-art hallucination detection performance on several QA benchmarks, and conduct extensive ablation studies to verify the efficacy of our method.
2 Background on Hallucination Detection
In this work, we mainly focus on knowledge hallucination detection for natural language generation with LLMs, especially for the QA task (Reddy et al., 2019; Kwiatkowski et al., 2019). Given an input context $\bm{x}$ , a typical LLM (Zhang et al., 2022; Touvron et al., 2023a) parameterized by $\bm{\theta}$ generates output sequences in an autoregressive manner, $y_{t}=f(\bm{x},y_{1},y_{2},\ldots,y_{t-1}|\bm{\theta})$ , where $\bm{y}=[y_{1},y_{2},\ldots,y_{T}]$ denotes the output sequence and $y_{t}$ denotes the $t$ -th output token. We denote by $p(y_{t}|y_{<t},\bm{x})$ the Maximum Softmax Probability (MSP) of the $t$ -th token. For a traditional classification model, the MSP measures the confidence level of the classification result and has been widely used as an uncertainty measure of predictions (Hendrycks & Gimpel, 2016). Therefore, for the sequence generation task, a straightforward sequence uncertainty can be defined from the joint probability of the generated tokens, which is known as Perplexity (Ren et al., 2022),
$$
P(\bm{y}|\bm{x},\bm{\theta})=-\frac{1}{T}\log\prod_{t}p(y_{t}|y_{<t},\bm{x})=-\frac{1}{T}\sum_{t}\log p(y_{t}|y_{<t},\bm{x}) \tag{1}
$$
As shorter sequences generally have lower perplexity, the length of the output sequence $T$ is used to normalize the joint probability. However, since different tokens contribute differently to the semantics of a sentence (Raj et al., 2023; Duan et al., 2023), the perplexity defined by averaging token-level uncertainty cannot effectively capture the uncertainty of the entire sequence. It has been demonstrated that utilizing multiple generations for one input is beneficial for estimating sequence-level uncertainty (Malinin & Gales, 2020; Kuhn et al., 2022; Manakul et al., 2023). We denote by $\mathcal{Y}=[\bm{y}^{1},\bm{y}^{2},\ldots,\bm{y}^{K}]$ the $K$ generated responses for input context $\bm{x}$ . For a given LLM, multiple responses can easily be obtained with the top-p/top-k sampling strategy at inference time (Touvron et al., 2023a; Kadavath et al., 2022). In Malinin & Gales (2020), Length-Normalized Entropy is proposed to measure sequence-level uncertainty using multiple generations, which is defined as
$$
H(\mathcal{Y}|\bm{x},\bm{\theta})=-\mathbb{E}_{\bm{y}\in\mathcal{Y}}\frac{1}{T_{\bm{y}}}\sum_{t}\log p(y_{t}|y_{<t},\bm{x}) \tag{2}
$$
When a model is uncertain about its response, it tends to generate hallucinated content, resulting in an answer distribution with high entropy (Kadavath et al., 2022). It has been shown that the length-normalized entropy performs better than the non-normalized one (Lin et al., 2023).
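The two baselines above can be stated concretely in a few lines. The sketch below is a minimal Python illustration of Eq. 1 and Eq. 2; the per-token log-probabilities are made-up placeholders rather than outputs of any particular model:

```python
import math

def perplexity(token_logprobs):
    """Eq. 1: negative mean of per-token log p(y_t | y_<t, x) for one response."""
    T = len(token_logprobs)
    return -sum(token_logprobs) / T

def ln_entropy(responses_logprobs):
    """Eq. 2: length-normalized entropy, averaging the per-response
    normalized negative log-likelihood over K sampled responses."""
    K = len(responses_logprobs)
    return sum(perplexity(lp) for lp in responses_logprobs) / K

# Hypothetical per-token log-probabilities for two sampled responses
confident = [-0.1, -0.2, -0.1]
uncertain = [-1.5, -2.0, -1.0, -1.3]
print(perplexity(confident))               # low -> likely non-hallucinated
print(ln_entropy([confident, uncertain]))  # averages over both samples
```

Note that both measures are computed purely from token-level logits, which is precisely the limitation the embedding-space approach in Section 3.1 is designed to avoid.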
In addition to predictive uncertainty or entropy, the semantic consistency (Lin et al., 2023; Raj et al., 2023) among multiple responses has also been widely explored to measure the hallucination degree of LLMs. The underlying hypothesis is that LLMs are expected to generate similar outputs when they know the input context and are sure about the answer (Wang et al., 2022; Manakul et al., 2023). An intuitive semantic consistency metric is Lexical Similarity (Lin et al., 2022; 2023), which takes the average similarity across multiple answers as the consistency measure,
$$
S(\mathcal{Y}|\bm{x},\bm{\theta})=\frac{1}{C}\sum_{i=1}^{K}\sum_{j=i+1}^{K}sim(\bm{y}^{i},\bm{y}^{j}) \tag{3}
$$
where $C=K(K-1)/2$ and $sim(\cdot,\cdot)$ is the similarity defined by ROUGE-L (Lin, 2004).
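Eq. 3 can be sketched directly from its definition. The following minimal Python example uses a plain LCS-based reimplementation of the ROUGE-L f-measure (an illustrative stand-in, not the official `rouge` package) and averages it over all pairs of sampled answers:

```python
def rouge_l_f(a, b):
    """ROUGE-L f-measure between two strings via longest common subsequence."""
    xs, ys = a.split(), b.split()
    m, n = len(xs), len(ys)
    # Standard dynamic-programming LCS table
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if xs[i] == ys[j] else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    p, r = lcs / n, lcs / m  # precision and recall over token counts
    return 2 * p * r / (p + r)

def lexical_similarity(answers):
    """Eq. 3: mean pairwise ROUGE-L over K sampled answers, C = K(K-1)/2 pairs."""
    K = len(answers)
    C = K * (K - 1) / 2
    total = sum(rouge_l_f(answers[i], answers[j])
                for i in range(K) for j in range(i + 1, K))
    return total / C

answers = ["on july 20 1969", "on july 20 1969", "in late 1969"]
print(lexical_similarity(answers))  # higher -> more self-consistent responses
```

Because this score is computed on decoded token strings, paraphrases with identical meaning can still score low, motivating the embedding-space consistency measure of Section 3.1.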
3 Method
Figure 1: Illustration of our proposed hallucination detection pipeline. During inference time, for a given question, the extreme features in the penultimate layer are truncated and the EigenScore is computed based on the sentence embeddings across multiple responses.
In this section, we introduce the details of our proposed INSIDE framework for hallucination detection; the whole pipeline is illustrated in Fig. 1. In Section 3.1, we present a simple but effective EigenScore metric that exploits sentence-level semantics in the internal states of LLMs. In Section 3.2, a test-time feature clipping approach is introduced to alleviate the issue of overconfident generation, thereby aiding the identification of self-consistent hallucinations.
3.1 Hallucination Detection by EigenScore
Existing uncertainty- or consistency-based hallucination detection metrics operate in the logit or language space and neglect the dense semantic information retained within the internal states of LLMs. To better exploit this information, we propose to measure the semantic divergence in the sentence embedding space. For the $t$ -th output token $y_{t}$ , we denote the hidden embedding in the $l$ -th layer as $\bm{h}^{l}_{t}\in\mathbb{R}^{d}$ , where $d$ is the dimension of the hidden embedding ( $d=4096$ for LLaMA-7B and $d=5120$ for LLaMA-13B). According to Ren et al. (2022); Azaria & Mitchell (2023), the sentence embedding can be obtained either by averaging the token embeddings, $\bm{z}=\frac{1}{T}\sum_{t=1}^{T}\bm{h}_{t}$ , or by taking the last token embedding, $\bm{z}=\bm{h}_{T}$ . In our main experiments, we use the embedding of the last token in the middle layer as the sentence embedding, as it effectively captures the sentence semantics (Azaria & Mitchell, 2023). Comparison results for different sentence embeddings are provided in the ablation studies (Section 4.3). For $K$ generated sequences, the covariance matrix of the $K$ sentence embeddings can be computed as
$$
\bm{\Sigma}=\mathbf{Z}^{\top}\cdot\mathbf{J}_{d}\cdot\mathbf{Z} \tag{4}
$$
where $\bm{\Sigma}\in\mathbb{R}^{K\times K}$ represents the covariance matrix that captures the relationship between different sentences in the embedding space, $\mathbf{Z}=[\bm{z}_{1},\bm{z}_{2},\ldots,\bm{z}_{K}]\in\mathbb{R}^{d\times K}$ represents the embedding matrix of different sentences, $\mathbf{J}_{d}=\bm{I}_{d}-\frac{1}{d}\mathbf{1}_{d}\mathbf{1}_{d}^{\top}$ is the centering matrix, and $\mathbf{1}_{d}\in\mathbb{R}^{d}$ is the all-ones column vector. Then, the proposed EigenScore is defined as the logarithm determinant (LogDet) of the covariance matrix,
$$
E(\mathcal{Y}|\bm{x},\bm{\theta})=\frac{1}{K}\log\text{det}(\bm{\Sigma}+\alpha\cdot\mathbf{I}_{K}) \tag{5}
$$
Here, $\text{det}(\mathbf{X})$ represents the determinant of matrix $\mathbf{X}$ , and a small regularization term $\alpha\cdot\mathbf{I}_{K}$ is added to the covariance matrix to explicitly make it full rank. Since the matrix determinant can be obtained from the eigenvalues, the EigenScore can be computed as
$$
E(\mathcal{Y}|\bm{x},\bm{\theta})=\frac{1}{K}\log\Big(\prod_{i}\lambda_{i}\Big)=\frac{1}{K}\sum_{i}^{K}\log(\lambda_{i}) \tag{6}
$$
where $\lambda=\{\lambda_{1},\lambda_{2},\ldots,\lambda_{K}\}$ denotes the eigenvalues of the regularized covariance matrix $\bm{\Sigma}+\alpha\cdot\mathbf{I}_{K}$ , which can be solved by Singular Value Decomposition (SVD). Eq. 6 shows that the hallucination degree of an LLM's generation can be measured by the average logarithm of the eigenvalues. This is intuitive, as the eigenvalues of the covariance matrix capture the divergence and correlation among the embeddings of different sentences. When the LLM is confident in its answer and the $K$ generations have similar semantics, the sentence embeddings will be highly correlated and most eigenvalues will be close to 0. On the contrary, when the LLM is indecisive and hallucinating, it will generate multiple sentences with diverse semantics, leading to larger eigenvalues. The following remark further explains why the proposed EigenScore is a good measure of knowledge hallucination.
Remark 1. The LogDet of the covariance matrix represents the differential entropy in the sentence embedding space. Differential entropy is the natural extension of the discrete Shannon entropy $H_{e}(X)=-\sum_{x}p(x)\log p(x)$ . The differential entropy $H_{de}(X)$ in continuous space is defined by replacing the probability function with its density function $f(x)$ and integrating over $x$ , i.e., $H_{de}(X)=-\int_{x}f(x)\log f(x)\,dx$ . In principle (Zhouyin & Liu, 2021), for a multivariate Gaussian distribution $X\sim N(\bm{\mu},\mathbf{\Sigma})$ , the differential entropy can be represented as
$$
H_{de}(X)=\frac{1}{2}\log\text{det}(\mathbf{\Sigma})+\frac{d}{2}(\log 2\pi+1)=\frac{1}{2}\sum_{i=1}^{d}\log\lambda_{i}+C \tag{7}
$$
where $d$ is the dimension of variables and $C$ is a constant. Therefore, the differential entropy is determined by the eigenvalues (LogDet) of the covariance matrix.
According to Remark 1, the proposed EigenScore defined in Eq. 6 represents the differential entropy in the sentence embedding space, which offers valuable insight into using EigenScore as a semantic divergence measure. Compared to existing uncertainty or consistency metrics obtained in the logit or language space (Malinin & Gales, 2020; Huang et al., 2023; Lin et al., 2022), the advantages of EigenScore are: (1) It captures the semantic divergence (entropy) in the dense embedding space, which retains highly-concentrated semantic information compared to logits or decoded language (Reimers & Gurevych, 2019). (2) Representing semantic divergence in the embedding space effectively addresses the semantic equivalence (linguistic invariance) problem (Kuhn et al., 2022) of natural language space. (3) Fine-grained semantic relationships among different responses can be exploited through the eigenvalues of the covariance matrix. Therefore, by exploiting the dense semantic information in the internal states, EigenScore is expected to outperform existing uncertainty and consistency metrics, resulting in improved hallucination detection performance.
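The EigenScore computation itself is only a few lines of linear algebra. The NumPy sketch below follows Eqs. 4-6, with random placeholder vectors standing in for real sentence embeddings and an assumed regularizer value for $\alpha$:

```python
import numpy as np

def eigenscore(Z, alpha=1e-3):
    """EigenScore of Eqs. 4-6. Z is d x K: one column per sampled response embedding.

    Sigma = Z^T J_d Z with centering matrix J_d = I_d - (1/d) 1 1^T (Eq. 4);
    the score is the mean log-eigenvalue of Sigma + alpha*I_K (Eqs. 5-6)."""
    d, K = Z.shape
    J = np.eye(d) - np.ones((d, d)) / d
    Sigma = Z.T @ J @ Z                       # K x K covariance, PSD by construction
    eigvals = np.linalg.eigvalsh(Sigma + alpha * np.eye(K))
    return np.sum(np.log(eigvals)) / K

rng = np.random.default_rng(0)
base = rng.normal(size=(64, 1))
consistent = base + 0.01 * rng.normal(size=(64, 5))  # K=5 near-duplicate answers
divergent = rng.normal(size=(64, 5))                 # K=5 semantically scattered answers
print(eigenscore(consistent))  # low  -> self-consistent, likely factual
print(eigenscore(divergent))   # high -> divergent, likely hallucination
```

For the near-duplicate embeddings the regularized covariance is nearly rank-1, so most eigenvalues stay close to $\alpha$ and the score is low; scattered embeddings spread energy across all eigenvalues and drive the score up.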
(a) Neuron Activation
(b) Feature Distribution
Figure 2: Illustration of activation distributions in the penultimate layer of LLaMA-7B. (a) Activation distribution in the penultimate layer for a randomly sampled token. (b) Activation distribution for a randomly sampled neuron activation of numerous tokens.
3.2 Test Time Feature Clipping
Recent works have shown that LLMs are subject to the risk of self-consistent (overconfident) hallucinations (Ren et al., 2022; Ji et al., 2023), which has not been considered by existing consistency-based methods. Therefore, to address the failure cases caused by overconfident generation, a test-time feature clipping approach is introduced during the computation of EigenScore. Figure 2 illustrates the activation distribution in the penultimate layer of LLaMA-7B. An intuitive observation is that the penultimate layer of LLMs tends to exhibit numerous extreme features, which increases the likelihood of overconfident and self-consistent generations. Inspired by prior works that rectify internal activations to reduce overconfident predictions in Out-of-Distribution (OOD) detection (Sun et al., 2021; Djurisic et al., 2022; Chen et al., 2024), we introduce a test-time feature clipping (FC) method to prevent LLMs from generating overconfident hallucinations. To rectify these extreme features, the FC operation is defined as the following piecewise function
$$
FC(h)=\begin{cases}h_{min},&h<h_{min}\\
h,&h_{min}\leq h\leq h_{max}\\
h_{max},&h>h_{max}\end{cases} \tag{8}
$$
where $h$ represents a feature of the hidden embedding in the penultimate layer of the LLM, and $h_{min}$ and $h_{max}$ are two thresholds determining the minimum and maximum truncation activations. When $h_{min}=-\infty$ and $h_{max}=+\infty$ , the output feature embedding is equivalent to the original output. To determine the truncation thresholds, a memory bank, which dynamically pushes and pops elements, is used to store $N$ token embeddings at test time. Then, for each hidden neuron, the thresholds $h_{min}$ and $h_{max}$ are set to the bottom and top $p$ -th percentiles of the features in the memory bank. Referring to the three-sigma rule (Pukelsheim, 1994), we set $p=0.2$ in all cases. This implies that the activations falling within the largest and smallest 0.2% in the memory bank are identified as abnormal features and truncated to reduce overconfident generation.
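A minimal NumPy sketch of the clipping step might look as follows; the memory bank here is randomly generated at toy sizes (the paper's hidden dimension is 4096), standing in for cached token embeddings:

```python
import numpy as np

def clip_thresholds(memory_bank, p=0.2):
    """Per-neuron thresholds: bottom/top p-th percentile over the N
    token embeddings cached in the memory bank (shape N x d)."""
    h_min = np.percentile(memory_bank, p, axis=0)
    h_max = np.percentile(memory_bank, 100 - p, axis=0)
    return h_min, h_max

def feature_clip(h, h_min, h_max):
    """Eq. 8: truncate extreme activations elementwise."""
    return np.clip(h, h_min, h_max)

rng = np.random.default_rng(0)
bank = rng.normal(size=(2000, 16))       # toy memory bank: N=2000 tokens, d=16 neurons
h_min, h_max = clip_thresholds(bank, p=0.2)
h = rng.normal(size=16)
h[0] = 30.0                              # inject an extreme activation
clipped = feature_clip(h, h_min, h_max)
print(clipped[0])  # pulled back to that neuron's h_max
```

Applied to the penultimate-layer embeddings before the final projection, this truncation bounds the logits driven by extreme neurons, which is what discourages overconfident generations.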
4 Experiments
4.1 Experimental Setup
Datasets. We utilize four widely used question answering (QA) datasets for evaluation, including two open-book conversational QA datasets, CoQA (Reddy et al., 2019) and SQuAD (Rajpurkar et al., 2016), as well as two closed-book QA datasets, TriviaQA (Joshi et al., 2017) and Natural Questions (NQ) (Kwiatkowski et al., 2019). We follow Lin et al. (2023) in using the development split of CoQA with 7,983 QA pairs, the validation split of NQ with 3,610 QA pairs, and the validation split of TriviaQA (rc.nocontext subset) with 9,960 deduplicated QA pairs. For the SQuAD dataset, we filter out the QA pairs flagged is_impossible = True and use the resulting subset of the development-v2.0 split with 5,928 QA pairs. Sequence lengths vary across the four datasets: the ground-truth answers in CoQA and SQuAD are relatively long, while the answers in NQ and TriviaQA typically consist of only one or two words.
Models. We use two representative open-source LLMs, LLaMA (Touvron et al., 2023a) and OPT (Zhang et al., 2022), in our experiments. Specifically, we consider the off-the-shelf LLaMA-7B https://huggingface.co/decapoda-research/llama-7b-hf, LLaMA-13B https://huggingface.co/decapoda-research/llama-13b-hf, and OPT-6.7B https://huggingface.co/facebook/opt-6.7b, together with their corresponding tokenizers provided by Hugging Face. We use the pre-trained weights and do not finetune these models in any experiment.
Evaluation Metrics. Following prior work (Kuhn et al., 2022; Ren et al., 2022), we evaluate the hallucination detection ability of different methods by using them to determine whether a generation is correct. Therefore, the area under the receiver operating characteristic curve (AUROC) and the Pearson Correlation Coefficient (PCC) are used as performance measures. AUROC is a popular metric for evaluating the quality of a binary classifier or uncertainty measure (Ren et al., 2022; Lin et al., 2023); higher AUROC scores are better. PCC measures the correlation between the hallucination detection metric and the correctness measure, which is usually defined as the ROUGE score (Lin, 2004) or the semantic similarity (Reimers & Gurevych, 2019) between the generated and ground-truth answers. A higher PCC score is also better.
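For reference, both metrics can be computed without external libraries; the sketch below uses the rank (Mann-Whitney) formulation of AUROC, which is equivalent to the area under the ROC curve (the function names are ours):

```python
import numpy as np

def auroc(scores, labels):
    """AUROC via the rank (Mann-Whitney U) formulation: the probability that a
    randomly chosen positive example scores higher than a random negative one."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    order = scores.argsort()
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    for s in np.unique(scores):            # average ranks over ties
        ranks[scores == s] = ranks[scores == s].mean()
    n_pos, n_neg = labels.sum(), (1 - labels).sum()
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def pcc(x, y):
    """Pearson correlation between a detection score and a correctness measure."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    x, y = x - x.mean(), y - y.mean()
    return float((x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum()))
```

Note that for hallucination scores where higher means less reliable, the labels (or the score sign) must be oriented consistently before computing either metric.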
Baselines. We compare our proposal with the most popular uncertainty-based methods, Perplexity (Ren et al., 2022) and Length-normalized Entropy (LN-Entropy) (Malinin & Gales, 2020), and with the consistency-based metric Lexical Similarity (Lin et al., 2022). Besides, to investigate whether traditional out-of-distribution (OOD) detection methods can be used for hallucination detection, we also include the popular Energy score (Liu et al., 2020) as a comparison method.
Correctness Measure. We follow Kuhn et al. (2022) and Lin et al. (2023) in using both ROUGE-L (Lin, 2004) and semantic similarity (Reimers & Gurevych, 2019) as correctness measures. ROUGE-L https://github.com/google-research/google-research/tree/master/rouge is an n-gram-based metric that computes the longest common subsequence between two pieces of text. A generation is regarded as correct when its ROUGE-L (f-measure) is larger than a given threshold, which we set to 0.5 in our main experiments. Besides, we also use embedding similarity as a correctness measure: the sentence embeddings of the model generation and the ground-truth answer are extracted by the nli-roberta-large model https://huggingface.co/sentence-transformers/nli-roberta-large, and the generation is regarded as correct when the cosine similarity between the two embeddings is larger than 0.9.
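The ROUGE-L f-measure reduces to a longest-common-subsequence computation; a minimal sketch (not the official `rouge` package, and whitespace tokenization is our simplification):

```python
def rouge_l_f(candidate, reference):
    """LCS-based ROUGE-L f-measure between two whitespace-tokenized strings.
    A generation counts as correct when this exceeds a threshold (0.5 here)."""
    c, r = candidate.split(), reference.split()
    # dynamic-programming longest common subsequence
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c):
        for j, rt in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ct == rt else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)  # precision over candidate, recall over reference
    return 2 * prec * rec / (prec + rec)    # f-measure
```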
Implementation Details. Our implementation is based on the pytorch and transformers libraries. For the sampling hyperparameters of the LLMs' decoder, we set the temperature to 0.5, top-p to 0.99, and top-k to 5 throughout the experiments. The number of generations is set to $K=10$. For the sentence embedding used in our proposal, we use the embedding of the last token of the sentence in the middle layer, i.e., the layer index is set to int(L/2). For the regularization term of the covariance matrix, we set $\alpha=0.001$. For the memory bank used to store token embeddings, we set $N=3000$. When implementing the Energy score, we average the token-level energy scores to obtain the sentence-level score.
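Putting the hyperparameters above together, the eigenvalue-based consistency score over $K$ sampled generations might be sketched as below. This is our rough reading of the EigenScore defined earlier in the paper (eigenvalues of the regularized, centered covariance of sentence embeddings); the exact normalization is an assumption, not a verbatim reproduction of the authors' code:

```python
import numpy as np

def eigenscore(Z, alpha=1e-3):
    """Covariance-eigenvalue consistency score (sketch).

    Z: (K, d) matrix of sentence embeddings from K sampled generations.
    Returns the mean log-eigenvalue of the regularized K x K centered
    Gram/covariance matrix: larger values indicate more semantic
    diversity across generations, i.e. a higher chance of hallucination.
    """
    K = Z.shape[0]
    J = np.eye(K) - np.ones((K, K)) / K           # centering matrix
    sigma = J @ Z @ Z.T @ J + alpha * np.eye(K)   # regularized covariance (alpha = 0.001)
    eigvals = np.linalg.eigvalsh(sigma)           # symmetric eigendecomposition
    return float(np.log(eigvals).mean())
```

Nearly identical generations give a near-degenerate covariance whose eigenvalues collapse toward $\alpha$, so the score is low; semantically diverse generations spread the eigenvalues and raise it.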
4.2 Main Results
Table 1: Hallucination detection performance evaluation of different methods on four QA tasks. AUROC (AUC) and Pearson Correlation Coefficient (PCC) are utilized to measure the performance. $\text{AUC}_{s}$ represents AUROC score with sentence similarity as correctness measure, and $\text{AUC}_{r}$ represents AUROC score with ROUGE-L score as correctness measure. All numbers are percentages.
| Model | Method | CoQA $\text{AUC}_s$ | $\text{AUC}_r$ | PCC | SQuAD $\text{AUC}_s$ | $\text{AUC}_r$ | PCC | NQ $\text{AUC}_s$ | $\text{AUC}_r$ | PCC | TriviaQA $\text{AUC}_s$ | $\text{AUC}_r$ | PCC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA-7B | Perplexity | 64.1 | 68.3 | 20.4 | 57.5 | 60.0 | 10.2 | 74.0 | 74.7 | 30.1 | 83.6 | 83.6 | 54.4 |
| | Energy | 51.7 | 54.7 | 1.0 | 45.1 | 47.6 | -10.7 | 64.3 | 64.8 | 18.2 | 66.8 | 67.1 | 29.1 |
| | LN-Entropy | 68.7 | 73.6 | 30.6 | 70.1 | 70.9 | 30.0 | 72.8 | 73.7 | 29.8 | 83.4 | 83.2 | 54.0 |
| | Lexical Similarity | 74.8 | 77.8 | 43.5 | 74.9 | 76.4 | 44.0 | 73.8 | 75.9 | 30.6 | 82.6 | 84.0 | 55.6 |
| | EigenScore | 80.4 | 80.8 | 50.8 | 81.5 | 81.2 | 53.5 | 76.5 | 77.1 | 38.3 | 82.7 | 82.9 | 57.4 |
| LLaMA-13B | Perplexity | 63.2 | 66.2 | 20.1 | 59.1 | 61.7 | 14.2 | 73.5 | 73.4 | 36.3 | 84.7 | 84.5 | 56.5 |
| | Energy | 47.5 | 49.2 | -5.9 | 36.0 | 39.2 | -20.2 | 59.1 | 59.8 | 14.7 | 71.3 | 71.5 | 36.7 |
| | LN-Entropy | 68.8 | 72.9 | 31.2 | 72.4 | 74.0 | 36.6 | 74.9 | 75.2 | 39.4 | 83.4 | 83.1 | 54.2 |
| | Lexical Similarity | 74.8 | 77.6 | 44.1 | 77.4 | 79.1 | 48.6 | 74.9 | 76.8 | 40.3 | 82.9 | 84.3 | 57.5 |
| | EigenScore | 79.5 | 80.4 | 50.2 | 83.8 | 83.9 | 57.7 | 78.2 | 78.1 | 49.0 | 83.0 | 83.0 | 58.4 |
| OPT-6.7B | Perplexity | 60.9 | 63.5 | 11.5 | 58.4 | 69.3 | 8.6 | 76.4 | 77.0 | 32.9 | 82.6 | 82.0 | 50.0 |
| | Energy | 45.6 | 45.9 | -14.5 | 41.6 | 43.3 | -16.4 | 60.3 | 58.6 | 25.6 | 70.6 | 68.8 | 37.3 |
| | LN-Entropy | 61.4 | 65.4 | 18.0 | 65.5 | 66.3 | 22.0 | 74.0 | 76.1 | 28.4 | 79.8 | 80.0 | 43.0 |
| | Lexical Similarity | 71.2 | 74.0 | 38.4 | 72.8 | 74.0 | 39.3 | 71.5 | 74.3 | 23.1 | 78.2 | 79.7 | 42.5 |
| | EigenScore | 76.5 | 77.5 | 45.6 | 81.7 | 80.8 | 49.9 | 77.9 | 77.2 | 33.5 | 80.3 | 80.4 | 48.5 |
Effectiveness of EigenScore. In Table 1, we compare our proposed EigenScore with several representative reliability evaluation methods on three LLMs and four QA datasets. The results show that: (1) On both LLaMA and OPT models, our proposed EigenScore consistently outperforms the comparison methods by a large margin on the CoQA, SQuAD, and NQ datasets under different evaluation metrics; in particular, EigenScore outperforms Lexical Similarity by up to 5.6% on CoQA and 8.9% on SQuAD in AUROC. (2) It is interesting that Perplexity performs best on TriviaQA but poorly on the other datasets, especially CoQA and SQuAD. This is because the generations and ground-truth answers on TriviaQA are very simple, in most cases only one or two words; the performance of different methods on TriviaQA is therefore close, and simply averaging the token-level confidence as an uncertainty measure already works well. (3) On average, performance with LLaMA-13B is better than with LLaMA-7B and OPT-6.7B, and LLaMA-7B is slightly better than OPT-6.7B, which indicates that better hallucination detection can be achieved with a more powerful pre-trained LLM.
Effectiveness of Feature Clipping. To demonstrate the effectiveness of the introduced test-time feature clipping, we compare the hallucination detection performance of different methods with and without applying the feature clipping technique. The results are shown in Table 2. As can be seen, the introduced feature clipping consistently improves the performance of different methods, with the largest improvement being 1.8% in AUROC.
Table 2: Hallucination detection performance of different methods with and without (w/o) feature clipping (FC). "+FC" denotes applying feature clipping, and EigenScore (w/o) denotes EigenScore without feature clipping. All numbers are percentages.
| Model | LLaMA-7B | | | | OPT-6.7B | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Datasets | CoQA | | NQ | | CoQA | | NQ | |
| Methods | $\text{AUC}_s$ | PCC | $\text{AUC}_s$ | PCC | $\text{AUC}_s$ | PCC | $\text{AUC}_s$ | PCC |
| LN-Entropy | 68.7 | 30.6 | 72.8 | 29.8 | 61.4 | 18.0 | 74.0 | 28.4 |
| LN-Entropy + FC | 70.0 | 33.4 | 73.4 | 31.1 | 62.6 | 21.4 | 74.8 | 30.3 |
| Lexical Similarity | 74.8 | 43.5 | 73.8 | 30.6 | 71.2 | 38.4 | 71.5 | 23.1 |
| Lexical Similarity + FC | 76.6 | 46.3 | 74.8 | 32.1 | 72.6 | 40.2 | 72.4 | 24.2 |
| EigenScore (w/o) | 79.3 | 48.9 | 75.9 | 38.3 | 75.3 | 43.1 | 77.1 | 32.2 |
| EigenScore | 80.4 | 50.8 | 76.5 | 38.3 | 76.5 | 45.6 | 77.9 | 33.5 |
4.3 Ablation Studies
*(Figure 3a, x4.png: line graph of AUROC vs. number of generations (5 to 40) for LN-Entropy, Lexical Similarity, and EigenScore. EigenScore is highest throughout, rising from roughly 74.5 at K=5 to roughly 77.7 at K=40; all three curves gain most between 5 and 15 generations and plateau afterward.)*
*(Figure 3b, x5.png: bar chart of AUROC vs. layer index (5, 10, 20, 30, 33). Performance peaks around layer 20 (about 80.6) and degrades toward the shallow (about 78.8 at layer 5) and final (about 79.2 at layer 33) layers; the orange and grey dashed lines mark the two baseline sentence embeddings described in the caption.)*
Figure 3: (a) Performance in LLaMA-7B and NQ dataset with different number of generations. (b) Performance in LLaMA-7B and CoQA dataset with sentence embedding in different layers. Orange line indicates using the last token's embedding in the middle layer (layer 17) as sentence embedding. Gray line indicates using the averaged token embedding in the last layer as sentence embedding. The performance is measured by $\text{AUROC}_{s}$ .
Number of Generations. For methods that exploit semantic consistency for hallucination detection, the number of generations $K$ is a key performance factor. To evaluate its impact, we select $K$ from $\{5,10,15,20,30,40\}$ and run experiments with LLaMA-7B on the NQ dataset. The results in Figure 3 show that: (1) Our proposed EigenScore consistently outperforms LN-Entropy and Lexical Similarity by a large margin for all $K$. (2) When $K<15$, the performance of all methods increases with $K$; beyond $K>15$, it remains largely stable. The results suggest that setting $K$ to 20 provides the optimal trade-off between performance and inference cost. (3) Compared to EigenScore and Lexical Similarity, LN-Entropy is less sensitive to the number of generations, which demonstrates that Lexical Similarity and our EigenScore make more effective use of the information in the different generations.
How EigenScore Performs with Different Sentence Embeddings. In the main experiments, we employ the embedding of the last token in the middle layer as the sentence embedding. Here, we investigate how the method performs with other choices. Figure 3 shows the hallucination detection performance using sentence embeddings from different layers. The results show that sentence embeddings from the shallow and final layers yield significantly inferior performance compared to those from layers close to the middle. Another interesting observation is that using the last token's embedding as the sentence embedding outperforms simply averaging the token embeddings, which suggests that the last token in the middle layers retains more information about truthfulness.
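The embedding choice above is easy to express over the per-layer hidden states that Hugging Face models return with `output_hidden_states=True` (index 0 is the input embedding layer). The helper below is our illustrative sketch, not the paper's code:

```python
import numpy as np

def sentence_embedding(hidden_states, use_last_token=True):
    """Select a sentence embedding from per-layer hidden states.

    hidden_states: list of (seq_len, d) arrays, one per layer, with index 0
    being the input embeddings. Following the paper's choice, we take the
    middle transformer layer (index int(L/2)) and the last token's embedding;
    use_last_token=False instead averages the token embeddings.
    """
    L = len(hidden_states) - 1       # number of transformer layers
    layer = hidden_states[L // 2]    # middle layer, i.e. int(L/2)
    return layer[-1] if use_last_token else layer.mean(axis=0)
```

With a real model, `hidden_states` would come from e.g. `model(input_ids, output_hidden_states=True).hidden_states` after dropping the batch dimension.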
Sensitivity to Correctness Measures. It is difficult to develop automatic metrics for QA tasks that correlate well with human evaluation, so the choice of correctness measure is a crucial component of hallucination detection evaluation. In this section, we evaluate performance under different correctness-measure thresholds with LLaMA-7B on the CoQA dataset. The results in Table 3 show that the threshold has a strong influence on the final hallucination detection performance. Notably, our proposed EigenScore consistently outperforms the comparison methods at all thresholds. The results also indicate that the hallucination detection performance of all methods improves under a more rigorous correctness measure.
Table 3: Performance evaluation with different correctness measure thresholds in LLaMA-7B and CoQA dataset. The ROUGE-L (f-measure) score and Sentence Similarity with different thresholds are employed to measure the correctness of the generated answers.
| Method | ROUGE-L thresholds | | | Sentence Similarity thresholds | | |
| --- | --- | --- | --- | --- | --- | --- |
| Perplexity | 65.2 | 68.3 | 68.1 | 63.7 | 63.5 | 64.1 |
| LN-Entropy | 67.4 | 73.6 | 74.1 | 65.2 | 65.6 | 68.7 |
| Lexical Similarity | 75.8 | 77.8 | 79.3 | 72.8 | 73.9 | 74.8 |
| EigenScore | 76.4 | 80.8 | 83.5 | 75.9 | 77.2 | 80.4 |
Sensitivity to Hyperparameters. The decoder hyperparameters, including temperature, top-k, and top-p, determine the diversity of the generations. To evaluate their impact, we provide a sensitivity analysis in Figure 4. As observed, the performance is strongly influenced by temperature but shows little sensitivity to top-k. The performance of the consistency-based methods (EigenScore and Lexical Similarity) drops significantly when the temperature exceeds 1; the optimal temperature can be selected from $[0.1,1.0]$.
*(Figure 4a, x6.png: AUROC vs. temperature (0.1 to 5) for Perplexity, LN-Entropy, Lexical Similarity, and EigenScore. EigenScore and Lexical Similarity peak around temperature 0.3 to 0.5 (roughly 80.5 for EigenScore) and degrade sharply above 1, dropping below the nearly flat Perplexity (about 64) and LN-Entropy (about 66 to 68.5) curves at high temperatures.)*
*(Figure 4b, x7.png: AUROC vs. top-k (3 to 50). All methods are stable across the range, with a consistent ranking of EigenScore (about 80) > Lexical Similarity (about 74 to 76) > LN-Entropy (about 67 to 69) > Perplexity (about 64).)*
Figure 4: (a) Performance sensitivity to temperature. (b) Performance sensitivity to top-k. The performance is measured by $\text{AUROC}_{s}$ .
5 Related Work
Reliability Evaluation of LLMs During real-world deployment, the reliability of LLMs poses a substantial challenge, as LLMs reveal a propensity for unreliable generations (Ji et al., 2023; Zhang et al., 2023). Considerable effort has therefore been devoted to the security and reliability evaluation of LLMs (Huang et al., 2023; Malinin & Gales, 2020; Kuhn et al., 2022; Kadavath et al., 2022; Cohen et al., 2023; Azaria & Mitchell, 2023). Among these methods, uncertainty-based metrics have been widely explored, typically involving the predictive confidence or entropy of the output tokens (Malinin & Gales, 2020; Kuhn et al., 2022; Duan et al., 2023). Consistency-based methods also play an important role in reliability evaluation; they hypothesize that LLMs tend to generate logically inconsistent responses to the same question when they are indecisive and hallucinating content (Kuhn et al., 2022; Raj et al., 2023; Manakul et al., 2023). Based on this consistency hypothesis, researchers have also found it feasible to prompt LLMs to evaluate their own responses (Kadavath et al., 2022; Cohen et al., 2023; Manakul et al., 2023).
Eigenvalue as Divergence Measure The eigenvalues or determinant of a covariance matrix capture the variability of the data and have been widely explored as divergence measures across a wide range of machine learning tasks (Wold et al., 1987; Kulesza & Taskar, 2011; Xu et al., 2021; Zhouyin & Liu, 2021; Cai et al., 2015). For instance, Wold et al. (1987) proposed the well-known Principal Component Analysis (PCA) and demonstrated that the largest eigenvalues of the sample covariance matrix correspond to the principal semantics of the sample set. The determinant of the covariance matrix, which is determined by its eigenvalues, has been utilized to sample diverse subsets in determinantal point processes (DPPs) (Kulesza & Taskar, 2011) and active learning (Xu et al., 2021), demonstrating that it is a good diversity measure. Several studies have also proposed approximating the differential entropy with the log-determinant of the covariance matrix (Zhouyin & Liu, 2021; Klir & Wierman, 1999).
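The log-determinant/entropy connection mentioned above is exact in the Gaussian case: for a $d$-dimensional Gaussian with covariance $\Sigma$ and eigenvalues $\lambda_1,\dots,\lambda_d$, the differential entropy is

```latex
h(X) \;=\; \frac{1}{2}\log\!\left((2\pi e)^{d}\det\Sigma\right)
      \;=\; \frac{d}{2}\log(2\pi e) \;+\; \frac{1}{2}\sum_{i=1}^{d}\log\lambda_{i},
```

so, up to an additive constant, entropy grows with the sum of log-eigenvalues, which is why eigenvalue spread serves as a diversity measure.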
6 Conclusion
Measuring the degree of hallucination in an LLM's generations is of critical importance for enhancing the security and reliability of LLM-based AI systems. This work presents the INSIDE framework, which exploits the semantic information retained within the internal states of LLMs for hallucination detection. Specifically, a simple yet effective EigenScore is proposed to measure the semantic consistency across different generations in the embedding space. Moreover, to identify the self-consistent (overconfident) hallucinations overlooked by previous methods, a feature clipping technique is introduced that reduces overconfident generations by truncating extreme features. Significant performance improvements are achieved on several popular LLMs and QA benchmarks. Although our experiments focus on the QA task, our method makes no assumptions about the task modality, and we believe it is widely applicable to other tasks such as summarization and translation. We hope our insights inspire future research to further explore the internal semantics of LLMs for hallucination detection.
References
- Almazrouei et al. (2023) Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Maitha Alhammadi, Mazzotta Daniele, Daniel Heslow, Julien Launay, Quentin Malartic, et al. The falcon series of language models: Towards open frontier models. Hugging Face repository, 2023.
- Azaria & Mitchell (2023) Amos Azaria and Tom Mitchell. The internal state of an LLM knows when it's lying. arXiv preprint arXiv:2304.13734, 2023.
- Bai et al. (2022) Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
- Cai et al. (2015) T Tony Cai, Tengyuan Liang, and Harrison H Zhou. Law of log determinant of sample covariance matrix and optimal estimation of differential entropy for high-dimensional gaussian distributions. Journal of Multivariate Analysis, 137:161–172, 2015.
- Chang et al. (2023) Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Kaijie Zhu, Hao Chen, Linyi Yang, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. A survey on evaluation of large language models. arXiv preprint arXiv:2307.03109, 2023.
- Chen et al. (2024) Chao Chen, Zhihang Fu, Kai Liu, Ze Chen, Mingyuan Tao, and Jieping Ye. Optimal parameter and neuron pruning for out-of-distribution detection. Advances in Neural Information Processing Systems, 36, 2024.
- Cohen et al. (2023) Roi Cohen, May Hamri, Mor Geva, and Amir Globerson. LM vs LM: Detecting factual errors via cross examination. arXiv e-prints, pp. arXiv–2305, 2023.
- Djurisic et al. (2022) Andrija Djurisic, Nebojsa Bozanic, Arjun Ashok, and Rosanne Liu. Extremely simple activation shaping for out-of-distribution detection. In The Eleventh International Conference on Learning Representations, 2022.
- Duan et al. (2023) Jinhao Duan, Hao Cheng, Shiqi Wang, Chenan Wang, Alex Zavalny, Renjing Xu, Bhavya Kailkhura, and Kaidi Xu. Shifting attention to relevance: Towards the uncertainty estimation of large language models. arXiv preprint arXiv:2307.01379, 2023.
- Hendrycks & Gimpel (2016) Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations, 2016.
- Huang et al. (2023) Yuheng Huang, Jiayang Song, Zhijie Wang, Huaming Chen, and Lei Ma. Look before you leap: An exploratory study of uncertainty measurement for large language models. arXiv e-prints, pp. arXiv–2307, 2023.
- Ji et al. (2023) Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023.
- Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1601–1611, 2017.
- Kadavath et al. (2022) Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv e-prints, pp. arXiv–2207, 2022.
- Klir & Wierman (1999) George Klir and Mark Wierman. Uncertainty-based information: elements of generalized information theory, volume 15. Springer Science & Business Media, 1999.
- Kuhn et al. (2022) Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In The Eleventh International Conference on Learning Representations, 2022.
- Kulesza & Taskar (2011) Alex Kulesza and Ben Taskar. k-DPPs: Fixed-size determinantal point processes. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 1193–1200, 2011.
- Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019.
- Li et al. (2023) Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. arXiv preprint arXiv:2306.03341, 2023.
- Liang et al. (2022) Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.
- Lin (2004) Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pp. 74–81, 2004.
- Lin et al. (2023) Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. Generating with confidence: Uncertainty quantification for black-box large language models. arXiv e-prints, pp. arXiv–2305, 2023.
- Lin et al. (2022) Zi Lin, Jeremiah Zhe Liu, and Jingbo Shang. Towards collaborative neural-symbolic graph semantic parsing via uncertainty. Findings of the Association for Computational Linguistics: ACL 2022, 2022.
- Liu et al. (2020) Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. Energy-based out-of-distribution detection. Advances in Neural Information Processing Systems, 33:21464–21475, 2020.
- Malinin & Gales (2020) Andrey Malinin and Mark Gales. Uncertainty estimation in autoregressive structured prediction. In International Conference on Learning Representations, 2020.
- Manakul et al. (2023) Potsawee Manakul, Adian Liusie, and Mark JF Gales. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896, 2023.
- OpenAI (2023) OpenAI. GPT-4 technical report, 2023.
- Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Pukelsheim (1994) Friedrich Pukelsheim. The three sigma rule. The American Statistician, 48(2):88–91, 1994.
- Raj et al. (2023) Harsh Raj, Vipul Gupta, Domenic Rosati, and Subhabrata Majumdar. Semantic consistency for assuring reliability of large language models. arXiv preprint arXiv:2308.09138, 2023.
- Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392, 2016.
- Reddy et al. (2019) Siva Reddy, Danqi Chen, and Christopher D Manning. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266, 2019.
- Reimers & Gurevych (2019) Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, 2019.
- Ren et al. (2022) Jie Ren, Jiaming Luo, Yao Zhao, Kundan Krishna, Mohammad Saleh, Balaji Lakshminarayanan, and Peter J Liu. Out-of-distribution detection and selective generation for conditional language models. In The Eleventh International Conference on Learning Representations, 2022.
- Shi et al. (2022) Freda Shi, Daniel Fried, Marjan Ghazvininejad, Luke Zettlemoyer, and Sida I Wang. Natural language to code translation with execution. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 3533–3546, 2022.
- Sun et al. (2021) Yiyou Sun, Chuan Guo, and Yixuan Li. ReAct: Out-of-distribution detection with rectified activations. Advances in Neural Information Processing Systems, 34:144–157, 2021.
- Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
- Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
- Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
- Wold et al. (1987) Svante Wold, Kim Esbensen, and Paul Geladi. Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1-3):37–52, 1987.
- Xu et al. (2021) Xinyi Xu, Zhaoxuan Wu, Chuan Sheng Foo, and Bryan Kian Hsiang Low. Validation free and replication robust volume-based data valuation. Advances in Neural Information Processing Systems, 34:10837–10848, 2021.
- Yin et al. (2023) Zhangyue Yin, Qiushi Sun, Qipeng Guo, Jiawen Wu, Xipeng Qiu, and Xuanjing Huang. Do large language models know what they don't know? arXiv preprint arXiv:2305.18153, 2023.
- Zhang et al. (2022) Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
- Zhang et al. (2023) Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. Siren's song in the AI ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219, 2023.
- Zhou et al. (2023) Kaitlyn Zhou, Dan Jurafsky, and Tatsunori Hashimoto. Navigating the grey area: Expressions of overconfidence and uncertainty in language models. arXiv preprint arXiv:2302.13439, 2023.
- Zhouyin & Liu (2021) Zhanghao Zhouyin and Ding Liu. Understanding neural networks with logarithm determinant entropy estimator. arXiv preprint arXiv:2105.03705, 2021.
Appendix A Performance Evaluation on TruthfulQA
TruthfulQA is an important benchmark for evaluating the truthfulness of LLMs (Joshi et al., 2017). Therefore, we also compare our proposal with the baseline methods on the TruthfulQA benchmark. The optimal classification threshold is determined by maximizing the G-Mean value, defined as $\text{G-Mean}=\sqrt{TPR \times (1-FPR)}$. As TruthfulQA is a very challenging dataset for LLMs, zero-shot inference yields poor performance; we therefore follow previous work (Bai et al., 2022) and utilize different numbers of in-distribution prompts at inference time. The results are presented in Table 4. For ITI (Li et al., 2023), which trains multiple binary classifiers on the internal embeddings for hallucination detection, we report the best performance from their paper. As can be seen, our proposal consistently outperforms the baseline methods and achieves performance comparable to ITI when we utilize 50 in-distribution prompts. It is worth noting that ITI relies on training 1024 binary classifiers on the TruthfulQA dataset, and the best performance they report (83.3) is measured on the validation set; it is therefore higher than that of our proposal, which has not been trained on TruthfulQA. However, training on the validation set also limits the generalization of their method to other domains (Li et al., 2023). The results further show that performance improves significantly as the number of prompts increases, which also explains why ITI performs well.
Table 4: Performance comparison of different methods on the TruthfulQA dataset. LexicalSim denotes Lexical Similarity and SelfCKGPT denotes SelfCheckGPT. Hallucination detection accuracy is reported. # Prompt denotes the number of prompt templates. For ITI (Li et al., 2023), we directly report the best number from their paper. All numbers are percentages.
| # Prompt | Perplexity | LN-Entropy | LexicalSim | SelfCKGPT | ITI | EigenScore (Ours) |
| --- | --- | --- | --- | --- | --- | --- |
| 5 | 70.0 | 71.2 | 73.6 | 74.2 | 83.3 | 76.7 |
| 20 | 76.4 | 77.7 | 77.9 | 76.8 | 83.3 | 79.5 |
| 50 | 73.1 | 77.9 | 73.6 | 78.3 | 83.3 | 81.3 |
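The G-Mean-based threshold selection described in this appendix can be sketched as a simple search over candidate thresholds (a minimal illustration with our own variable names, not the paper's code):

```python
import numpy as np

def best_gmean_threshold(scores, labels):
    """Pick the threshold on a hallucination score that maximizes
    G-Mean = sqrt(TPR * (1 - FPR)); labels: 1 = hallucination, 0 = correct."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    best_t, best_g = None, -1.0
    for t in np.unique(scores):
        pred = scores >= t                   # flag as hallucination above threshold
        tpr = pred[labels == 1].mean()       # true positive rate
        fpr = pred[labels == 0].mean()       # false positive rate
        g = np.sqrt(tpr * (1.0 - fpr))
        if g > best_g:
            best_t, best_g = t, g
    return best_t, best_g

# Perfectly separable toy scores: the threshold lands between the two groups.
t, g = best_gmean_threshold([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
assert t == 0.8 and g == 1.0
```

A production version would typically derive TPR/FPR from `sklearn.metrics.roc_curve`, but the exhaustive search above is equivalent for small evaluation sets.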
Appendix B Comparison with More Competitive Methods
To demonstrate the effectiveness of our proposal, we also compare EigenScore with several competitive methods, including Semantic Entropy (SemanticEnt) (Kuhn et al., 2022), Shifting Attention to Relevance (SentSAR) (Duan et al., 2023), and SelfCheckGPT (SelfCKGPT) (Manakul et al., 2023). We follow the experimental setting in Duan et al. (2023) and set the number of generations to $N=10$ for OPT-6.7B and $N=5$ for LLaMA. For SemanticEnt and SentSAR, we directly report the numbers from Duan et al. (2023). For SelfCheckGPT, we leverage the SelfCheckBERTScore implementation provided in the official code package https://github.com/potsawee/selfcheckgpt. The comparison results in Table 5 demonstrate that our EigenScore significantly outperforms the competitors. Additionally, SentSAR and SelfCheckGPT exhibit comparable performance, which is much superior to Semantic Entropy. Note that SentSAR, SelfCheckGPT, and our proposal all evaluate the quality of LLMs' generations by exploiting self-consistency across multiple outputs. However, in contrast to Semantic Entropy (Kuhn et al., 2022) and SelfCheckGPT (Manakul et al., 2023), which rely on another language model for sentence embedding extraction, our approach leverages the internal states of the LLM itself, which retain highly concentrated semantic information. Moreover, the EigenScore, defined by the LogDet of the sentence covariance matrix, captures semantic consistency more effectively than sentence-wise similarity (Manakul et al., 2023). Furthermore, the proposed feature clipping strategy allows our method to identify overconfident hallucinations, which have not been investigated by previous works (Kuhn et al., 2022; Manakul et al., 2023).
Table 5: Performance comparison of EigenScore and several state-of-the-art methods on the CoQA dataset. AUCs denotes AUROC with sentence similarity as the correctness measure, and AUCr denotes AUROC with ROUGE-L as the correctness measure. All numbers are percentages.
| Model | SemanticEnt AUCs | SemanticEnt AUCr | SentSAR AUCs | SentSAR AUCr | SelfCKGPT AUCs | SelfCKGPT AUCr | EigenScore AUCs | EigenScore AUCr |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OPT-6.7B | 63.1 | 71.7 | 69.8 | 72.2 | 70.2 | 74.1 | 71.9 | 77.5 |
| LLaMA-7B | 64.9 | 68.2 | 70.4 | 65.8 | 68.7 | 72.9 | 71.2 | 75.7 |
| LLaMA-13B | 65.3 | 66.7 | 71.4 | 64.7 | 68.1 | 77.0 | 72.8 | 79.8 |
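The covariance-based consistency scoring described above can be sketched as follows. This is a simplified illustration of the EigenScore idea, not the paper's exact formulation: the regularizer `alpha`, the normalization, and all names are our own choices.

```python
import numpy as np

def eigenscore(Z: np.ndarray, alpha: float = 1e-3) -> float:
    """Simplified EigenScore: mean log-eigenvalue (LogDet / N) of the
    regularized covariance of N generation embeddings Z (shape N x d).
    Higher score = more semantic diversity = more likely hallucination."""
    N, d = Z.shape
    J = np.eye(N) - np.ones((N, N)) / N          # centering matrix
    cov = J @ (Z @ Z.T) @ J / d                  # N x N centered covariance
    eigvals = np.linalg.eigvalsh(cov + alpha * np.eye(N))
    return float(np.mean(np.log(eigvals)))

rng = np.random.default_rng(1)
consistent = np.tile(rng.normal(size=(1, 128)), (5, 1))  # identical answers
divergent = rng.normal(size=(5, 128))                    # contradictory answers
assert eigenscore(consistent) < eigenscore(divergent)
```

Identical generations collapse to the `alpha` floor after centering, so self-consistent answers score low while divergent answers score high.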
Appendix C Performance Evaluation on More LLMs
In the main experiments, we evaluate the performance of different methods on LLaMA-7B, LLaMA-13B, and OPT-6.7B. To demonstrate the robustness of our method across different models, we also provide a performance comparison on the recent LLaMA2-7B (Touvron et al., 2023b) and Falcon-7B (Almazrouei et al., 2023) models. Table 6 shows that our proposal consistently exhibits superior performance compared to the other methods across different LLMs.
Table 6: Performance evaluation on LLaMA2-7B and Falcon-7B. LexicalSim denotes Lexical Similarity and SelfCKGPT denotes SelfCheckGPT. AUCs and AUCr are utilized as correctness measures. Other experimental settings are consistent with Table 1.
| Model | Dataset | Perplexity AUCs | Perplexity AUCr | LN-Entropy AUCs | LN-Entropy AUCr | LexicalSim AUCs | LexicalSim AUCr | SelfCKGPT AUCs | SelfCKGPT AUCr | EigenScore AUCs | EigenScore AUCr |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA2-7B | CoQA | 62.2 | 66.6 | 69.9 | 75.2 | 74.4 | 77.5 | 72.4 | 75.1 | 78.6 | 80.7 |
| LLaMA2-7B | NQ | 70.8 | 70.2 | 72.1 | 71.2 | 72.1 | 72.9 | 69.1 | 68.1 | 74.4 | 73.7 |
| Falcon-7B | CoQA | 57.0 | 60.6 | 62.6 | 63.2 | 74.8 | 76.4 | 76.7 | 77.9 | 80.8 | 80.6 |
| Falcon-7B | NQ | 74.3 | 74.7 | 74.6 | 74.7 | 73.8 | 75.4 | 74.7 | 74.0 | 76.3 | 75.7 |
Appendix D Computational Efficiency Analysis
As our proposal is a sampling-based approach, additional inference cost is required to generate multiple outputs for accurate hallucination detection. We compare our proposal with the base LLM and the other competing methods on LLaMA-7B and LLaMA-13B. All experiments are performed on an NVIDIA A100 GPU, and we set the number of generations to $N=10$ throughout the experiments. The average inference time per question is shown in Fig. 5. As observed, our EigenScore is about 10 times more efficient than methods that rely on another large model to measure self-consistency (such as SelfCheckGPT (Manakul et al., 2023)), and has a computational overhead similar to LN-Entropy and Lexical Similarity. Compared to the cost of generating multiple outputs, the cost of feature clipping and EigenScore computation is negligible (0.06s). It is worth noting that the inference overhead of generating multiple outputs is not linearly proportional to the time required to generate a single output, owing to the sampling and decoding strategy of the autoregressive LLM.
<details>
<summary>x8.png Details</summary>


Bar chart of inference cost (seconds per question) in LLaMA-7B: BaseLLM 0.24, Perplexity 0.24, LN-Entropy 0.80, LexicalSim 0.81, SelfCKGPT 10.68, EigenScore 0.81.
</details>
(a) LLaMA-7B
<details>
<summary>x9.png Details</summary>


Bar chart of inference cost (seconds per question) in LLaMA-13B: BaseLLM 0.31, Perplexity 0.31, LN-Entropy 1.27, LexicalSim 1.28, SelfCKGPT 10.26, EigenScore 1.27.
</details>
(b) LLaMA-13B
Figure 5: Inference cost comparison of different methods in LLaMA-7B and LLaMA-13B. BaseLLM denotes the LLM without using any hallucination detection metrics. LexicalSim denotes Lexical Similarity and SelfCKGPT denotes SelfCheckGPT.
Appendix E Evaluation with Exact Match
In the main experiments, we employ ROUGE and sentence similarity as correctness measures, which are widely used for natural language generation evaluation (Chang et al., 2023; Kuhn et al., 2022; Huang et al., 2023). To facilitate comparison of our work's performance with other works, we also provide evaluation results using exact match (Liang et al., 2022) as the correctness score, which is a much stricter criterion for judging a generation as correct. The results in Table 7 lead to similar conclusions as Table 1, demonstrating that our proposal significantly outperforms the compared methods in most cases.
Table 7: Performance evaluation with exact match as the correctness measure. LexicalSim denotes Lexical Similarity. The experimental settings are consistent with Table 1.
| Model | Dataset | Perplexity | LN-Entropy | LexicalSim | EigenScore |
| --- | --- | --- | --- | --- | --- |
| LLaMA-7B | CoQA | 63.7 | 70.7 | 76.1 | 83.0 |
| LLaMA-7B | SQuAD | 57.3 | 72.1 | 76.9 | 83.9 |
| LLaMA-7B | NQ | 75.3 | 75.6 | 75.8 | 80.1 |
| LLaMA-7B | TriviaQA | 82.5 | 83.4 | 81.8 | 82.4 |
| OPT-6.7B | CoQA | 59.4 | 61.7 | 71.8 | 79.4 |
| OPT-6.7B | SQuAD | 56.7 | 65.2 | 72.7 | 82.9 |
| OPT-6.7B | NQ | 79.8 | 78.1 | 73.2 | 79.8 |
| OPT-6.7B | TriviaQA | 83.8 | 81.3 | 79.3 | 82.7 |
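For reference, exact match is commonly computed on normalized strings (SQuAD-style normalization: lowercasing, punctuation and article removal); the helper below is our own illustration rather than code from the paper:

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> bool:
    """Strict correctness: normalized prediction equals normalized reference."""
    return normalize(prediction) == normalize(reference)

assert exact_match("The Eiffel Tower!", "eiffel tower")
assert not exact_match("Paris, France", "Paris")
```

Unlike ROUGE or embedding similarity, any paraphrase or extra detail makes the prediction count as incorrect, which is why exact-match numbers are a stricter evaluation.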
Appendix F More visualization and ablation for Feature Clipping
In Fig. 6, we illustrate the distributions of neuron activations for four selected tokens. As can be seen, the distributions vary substantially across samples, so it is risky to determine the clipping threshold from only the current input sample (EigenScore-C). One feasible solution is to pre-compute the threshold from a batch of input samples (EigenScore-P); another is to dynamically record activation values during inference and determine the threshold from this memory bank (EigenScore-MB). We experimented with both solutions, and the results in Table 8 demonstrate that determining the threshold with a memory bank works slightly better. We attribute this to potential differences in the activation distributions across datasets.
Table 8: Ablation study on determining the clipping threshold with different techniques. EigenScore-C determines the threshold from the current input sample, EigenScore-P pre-computes it from a batch of samples, and EigenScore-MB uses a memory bank to determine the optimal threshold. AUCs is reported.
| EigenScore-C | 78.1 | 74.8 |
| --- | --- | --- |
| EigenScore-P | 79.9 | 75.3 |
| EigenScore-MB | 80.4 | 76.5 |
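The batch-based variant (EigenScore-P) can be sketched as percentile-based clipping of hidden-state activations. The percentile value `p` and the function names below are our own illustrative choices, not values from the paper:

```python
import numpy as np

def clip_features(h: np.ndarray, low: float, high: float) -> np.ndarray:
    """Truncate extreme activations to [low, high] before scoring."""
    return np.clip(h, low, high)

def thresholds_from_batch(batch: np.ndarray, p: float = 99.0):
    """EigenScore-P style: pre-compute clipping bounds from a batch of
    activation vectors (a memory-bank variant would update these online)."""
    return np.percentile(batch, 100 - p), np.percentile(batch, p)

rng = np.random.default_rng(2)
batch = rng.normal(size=(1000, 4096))    # activations collected from many tokens
low, high = thresholds_from_batch(batch)
h = rng.normal(size=4096)
h[0] = 50.0                              # inject one extreme activation
clipped = clip_features(h, low, high)
assert clipped[0] == high
```

Clipping the rare extreme activations tempers overconfident generations, while the bulk of the (near-Gaussian) activations pass through unchanged.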
<details>
<summary>x10.png Details</summary>


Line plot of the neuron activation distribution for one token (~4096 neurons): activations are centered around 0 and mostly lie within [-4, 4], with a maximum spike of about +9.5 and a minimum of about -6.5.
</details>
<details>
<summary>x11.png Details</summary>

### Visual Description
## Line Plot: Neuron Activation Distribution
### Overview
This image is a technical line chart visualizing the activation levels of individual neurons within a neural network layer. The data is plotted as a continuous, highly oscillatory line in a cyan/teal color against a white background. The chart shows the magnitude of "activations" for approximately 4,100 unique neuron indices.
### Components/Axes
* **Header (Top-Center):** Title text "Neuron Activation Distribution".
* **Y-Axis (Left):** Labeled "Neuron Activations". The scale ranges from -10.0 to 7.5, with major tick marks at intervals of 2.5 (-10.0, -7.5, -5.0, -2.5, 0.0, 2.5, 5.0, 7.5).
* **X-Axis (Bottom):** Labeled "Neuron Indexes". The scale ranges from 0 to approximately 4100, with major tick marks every 1000 units (0, 1000, 2000, 3000, 4000).
* **Data Series:** A single cyan line representing the activation value for each neuron index. There is no legend, as only one data series is present.
### Detailed Analysis
**Spatial Grounding & Trend Verification:**
* **Main Chart Area (Center):** The data series occupies the central rectangular region. The line exhibits a "white noise" or stochastic appearance, oscillating rapidly around a central baseline.
* **Visual Trend:** The overall trend is horizontal and stationary. There is no observable drift, upward slope, or downward slope as the neuron index increases. The mean appears to be very close to 0.0.
* **Data Density:** The line is extremely dense, indicating that data points are plotted for nearly every integer index between 0 and ~4100.
**Numerical Estimates:**
* **Central Mass:** The vast majority of neuron activations (estimated >95%) fall within the range of **[-2.5, 2.5]**.
* **Positive Peaks:** Several neurons show high positive activation. Notable spikes occur at:
* Index ~50: ~7.8
* Index ~1250: ~7.5
* Index ~1950: ~8.5
* Index ~3100: ~8.8 (Highest positive peak)
* **Negative Peaks:** Several neurons show significant negative activation. Notable spikes occur at:
* Index ~1700: ~-10.0
* Index ~2000: ~-10.8 (Lowest negative peak)
* Index ~2800: ~-10.5
### Key Observations
* **Zero-Centered:** The distribution is symmetric around the 0.0 axis.
* **High Variance Outliers:** While most neurons are quiet (near zero), a small subset of neurons (outliers) exhibit activations that are 4 to 5 times the standard deviation of the rest of the population.
* **Range Asymmetry:** The negative activations reach slightly further (~ -10.8) than the positive activations (~ +8.8).
### Interpretation
* **Activation Function:** The presence of significant negative values suggests that the layer is using an activation function that allows negative output (such as **Leaky ReLU, ELU, or Tanh**) or that this plot represents "pre-activations" (the values before a ReLU function is applied). If a standard ReLU were applied, all values below 0.0 would be clipped to zero.
* **Layer Health:** The zero-centered nature of the distribution suggests the network may be using **Batch Normalization** or a similar normalization technique to maintain stable gradients and prevent internal covariate shift.
* **Feature Sparsity:** The sharp spikes indicate that specific neurons are "firing" strongly. In a technical context, this suggests these specific neurons (e.g., index ~3100) have learned to recognize specific features in the input data that are highly present in the current sample.
* **Network Architecture:** The x-axis ending just past 4000 strongly suggests a layer size of **4096 neurons**, which is a standard power-of-two dimension for fully connected (dense) layers in deep learning architectures.
</details>
<details>
<summary>x12.png Details</summary>

### Visual Description
## Line Chart: Neuron Activation Distribution
### Overview
This image is a line chart titled "Neuron Activation Distribution." It visualizes the activation levels of a large set of neurons (likely from a specific layer in a neural network) indexed sequentially. The data is plotted as a single continuous cyan-colored line against a white background.
### Components/Axes
* **Title**: "Neuron Activation Distribution" (positioned at the top center).
* **Y-axis (Vertical)**:
* **Label**: "Neuron Activations"
* **Scale**: Ranges from -30 to +30.
* **Markers**: Major ticks are provided at intervals of 10 (-30, -20, -10, 0, 10, 20, 30).
* **X-axis (Horizontal)**:
* **Label**: "Neuron Indexes"
* **Scale**: Ranges from 0 to approximately 4100 (likely representing a layer size of 4096).
* **Markers**: Major ticks are provided at intervals of 1000 (0, 1000, 2000, 3000, 4000).
* **Data Series**: A single cyan (teal) line representing the activation value for each neuron index.
### Detailed Analysis
The chart displays a high-frequency signal where the majority of data points are clustered around a central baseline, punctuated by several extreme outliers.
* **Baseline/Dense Region**: The vast majority of neurons have activation values concentrated in a narrow band around zero, roughly between **-3 and +3**. This creates a thick, "noisy" horizontal band across the center of the chart.
* **Positive Outliers (Spikes)**:
* **Index ~150**: A sharp spike reaching approximately **+26**.
* **Index ~1550**: The highest positive peak, reaching approximately **+28**.
* **Index ~2500**: A significant spike reaching approximately **+20**.
* **Index ~3100**: A spike reaching approximately **+14**.
* Several other smaller spikes are visible between +10 and +15 across the entire index range.
* **Negative Outliers (Dips)**:
* **Index ~800**: A deep negative spike reaching approximately **-25**.
* **Index ~2450**: The most extreme outlier in the entire set, a sharp downward spike reaching approximately **-35** (extending below the lowest labeled tick).
* **Index ~2700**: A dip reaching approximately **-12**.
* Numerous smaller negative spikes fluctuate between -5 and -15.
### Key Observations
1. **Sparsity**: The distribution is highly sparse. Most neurons are "quiet" (near-zero activation), while only a tiny fraction (less than 1%) exhibit high-magnitude responses.
2. **Extreme Range**: The dynamic range of the activations is very wide, spanning from roughly -35 to +28, which is nearly 10 times the magnitude of the standard "noise" floor.
3. **Lack of Global Trend**: There is no visible upward or downward trend across the neuron indexes; the distribution appears stochastic with respect to the index order.
4. **Symmetry**: While the distribution is centered at zero, the negative outliers appear slightly more extreme in magnitude (reaching -35) compared to the positive outliers (reaching +28).
### Interpretation
* **Feature Detection**: In the context of deep learning, the high-magnitude spikes represent neurons that have "fired" strongly. This suggests that for the specific input being processed, these few neurons have identified highly relevant features.
* **Layer Characteristics**: The presence of both high positive and high negative values suggests this layer likely uses an activation function that allows negative values (such as a Linear layer, Leaky ReLU, or is perhaps a visualization of pre-activation values).
* **Model Efficiency**: The high degree of near-zero activations indicates that the model might benefit from pruning or that it has learned a very selective representation of the data.
* **Potential Numerical Instability**: The presence of values as high as 30 in a distribution where the mean is near zero could potentially lead to vanishing or exploding gradient issues if not properly managed by normalization layers (like LayerNorm or BatchNorm) elsewhere in the architecture.
</details>
<details>
<summary>x13.png Details</summary>

### Visual Description
## Line Chart: Neuron Activation Distribution
### Overview
This image is a line chart titled "Neuron Activation Distribution." It visualizes the activation levels of a series of neurons, likely within a hidden layer of a neural network. The data is plotted as a single, dense cyan line against a sequence of indices.
### Components/Axes
* **Title**: "Neuron Activation Distribution" (Top-center)
* **Y-Axis Label**: "Neuron Activations" (Left-center, vertical orientation)
* **Y-Axis Scale**: Numerical range from -20 to 30. Major tick marks are placed at intervals of 10: **-20, -10, 0, 10, 20, 30**.
* **X-Axis Label**: "Neuron Indexes" (Bottom-center)
* **X-Axis Scale**: Numerical range from 0 to approximately 4100. Major tick marks are placed at intervals of 1000: **0, 1000, 2000, 3000, 4000**.
* **Data Series**: A single continuous line in a cyan/teal color.
### Detailed Analysis
The chart displays a high-frequency signal where the majority of values are clustered near the zero-axis, punctuated by several high-magnitude spikes.
* **Visual Trend**: The line maintains a relatively stable "baseline" of low-amplitude noise across the entire X-axis, with no overall upward or downward slope. However, it is characterized by extreme vertical volatility at specific indices.
* **Baseline Range**: Most activations fluctuate within a narrow band of approximately **-4 to +4**.
* **Significant Positive Spikes (Approximate values)**:
* **Index ~200**: ~26
* **Index ~1600**: ~33 (The global maximum)
* **Index ~2000**: ~23
* **Index ~2400**: ~14
* **Index ~3100**: ~15
* **Index ~3500**: ~13
* **Index ~4050**: ~12
* **Significant Negative Spikes (Approximate values)**:
* **Index ~2000**: ~-18 (The global minimum)
* **Index ~2300**: ~-15
* **Index ~2400**: ~-16
* **Index ~3200**: ~-12
### Key Observations
* **Sparsity**: The vast majority of neurons (indices) have low activation values, suggesting a sparse representation where only a few "features" are highly active for a given input.
* **Outlier Dimensions**: There are distinct "outlier" neurons that exhibit activations significantly higher (over 10x) than the average neuron.
* **Symmetry**: While there are more extreme positive spikes than negative ones, the distribution shows significant activity in both directions relative to zero.
### Interpretation
* **Neural Network Architecture**: The X-axis ending just past 4000 strongly suggests a layer size of **4096** (2^12), which is a standard dimension for hidden layers in many Transformer-based models (like GPT or Llama).
* **Feature Detection**: The high-magnitude spikes represent specific neurons that have "fired" strongly. In a technical context, this suggests these specific indices (e.g., index 1600) are highly sensitive to the current input data, acting as specialized feature detectors.
* **Activation Function Implications**: The presence of negative values indicates that the data shown is either "pre-activation" (before a function like ReLU is applied) or that the model utilizes an activation function that permits negative values, such as **GeLU (Gaussian Error Linear Unit)** or **SiLU (Sigmoid Linear Unit)**, which are common in modern LLMs.
* **Quantization Challenges**: The extreme outliers (reaching 33 while the bulk is near 0) illustrate why 8-bit or 4-bit quantization of neural networks is difficult; the high dynamic range requires careful scaling to avoid losing the information contained in these high-magnitude "outlier dimensions."
</details>
Figure 6: Activation distributions of four selected tokens in LLaMA-7B.
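For reference, a plot in the style of Figure 6 can be generated with matplotlib; the hidden state below is a synthetic stand-in for a 4096-dimensional LLaMA-7B activation vector, with a few hand-placed outlier dimensions (indices and magnitudes are illustrative, not taken from the model).

```python
import os
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; renders to file without a display
import matplotlib.pyplot as plt

def plot_activations(hidden_state, path="activations.png"):
    """Plot one token's hidden-state vector in the style of Figure 6."""
    plt.figure(figsize=(8, 3))
    plt.plot(np.arange(len(hidden_state)), hidden_state, color="c", linewidth=0.5)
    plt.title("Neuron Activation Distribution")
    plt.xlabel("Neuron Indexes")
    plt.ylabel("Neuron Activations")
    plt.tight_layout()
    plt.savefig(path)
    plt.close()
    return path

# Synthetic stand-in: mostly small values plus a few extreme "outlier" dimensions.
rng = np.random.default_rng(0)
h = rng.normal(0.0, 1.0, 4096)
h[[150, 1550, 2450]] = [26.0, 28.0, -35.0]
saved = os.path.exists(plot_activations(h))
```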
Appendix G Limitations and future work
By exploring the internal states of LLMs and introducing the EigenScore metric, we significantly improve hallucination detection performance. However, this study has several limitations. One critical limitation is that the proposed method relies on the internal information of the LLMs and therefore cannot be applied to black-box models (OpenAI, 2023), where users cannot access the hidden states. Additionally, our proposal is a sampling-based approach that requires generating multiple outputs, which introduces additional inference overhead for accurate hallucination detection. Furthermore, we focus only on hallucination detection in this study and have not yet used our proposal to mitigate hallucinations. In future work, we hope to reduce the inference cost of our proposal and leverage the EigenScore to mitigate hallucinations. We believe the proposed EigenScore is a strong baseline for hallucination detection, and we invite future researchers to utilize and enhance our proposal.
Appendix H Case Studies
H.1 Hallucination Detection Cases
We show several cases with LLaMA-7B on the NQ dataset. The number of generations is set to $N=10$. We determine the optimal detection threshold by maximizing the G-Mean value, which is defined as $\textbf{G-Mean}=\sqrt{TPR \times (1-FPR)}$. The hallucination detection thresholds for the different metrics are Perplexity: 0.535; LN-Entropy: 0.153; LexicalSimilarity: 0.489; SelfCheckGPT (SentBERTScore): 0.168; EigenScore: -1.74. For LexicalSimilarity, a score larger than the threshold indicates non-hallucination; for the other metrics, a score smaller than the threshold indicates non-hallucination. Green indicates a non-hallucinated answer and Red indicates a hallucinated answer generated by the LLM. ✓ (✗) indicates that the hallucination is (not) correctly identified by the metric.
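The G-Mean threshold search described above can be sketched with a small NumPy routine; the label and score arrays below are synthetic placeholders, not data from the paper.

```python
import numpy as np

def best_gmean_threshold(labels, scores):
    """Sweep candidate thresholds and return the one maximizing
    G-Mean = sqrt(TPR * (1 - FPR)).

    labels: 1 = hallucination, 0 = non-hallucination.
    scores: higher = more indicative of hallucination."""
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    pos, neg = labels == 1, labels == 0
    best_tau, best_g = None, -1.0
    for tau in np.unique(scores):
        pred = scores >= tau          # predict "hallucination" above threshold
        tpr = pred[pos].mean()
        fpr = pred[neg].mean()
        g = np.sqrt(tpr * (1.0 - fpr))
        if g > best_g:
            best_g, best_tau = g, float(tau)
    return best_tau

# Synthetic example: hallucinated answers tend to score higher.
rng = np.random.default_rng(0)
labels = np.concatenate([np.zeros(200), np.ones(200)])
scores = np.concatenate([rng.normal(0.3, 0.1, 200), rng.normal(0.7, 0.1, 200)])
tau = best_gmean_threshold(labels, scores)
```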
Question: the german princes who chose the holy roman empire were called GTAns: prince-electors LLMAns: electors BatchGenerations: ['electors', 'electors', 'electors', 'electors', 'electors', 'electors', 'electors', 'electors', 'electors', 'electors'] Perplexity: 0.361 ✓ LN-Entropy: 0.027 ✓ LexicalSimilarity: 1.0 ✓ SentBERTScore: 0.0 ✓ EigenScore: -2.63 ✓ EigenValue: [4.87719579e+00 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03]
Question: where is fe best absorbed in the body GTAns: in the duodenum LLMAns: in the small intestine BatchGenerations: ['in the liver', 'small intestine', 'in the intestines', 'the small intestine', 'the small intestine', 'in the liver', 'small intestine', 'fat', 'in the small intestine', 'fatty tissues'] Perplexity: 0.641 ✓ LN-Entropy: 0.213 ✓ LexicalSimilarity: 0.357 ✓ SentBERTScore: 0.258 ✓ EigenScore: -1.40 ✓ EigenValue: [3.71561676e+00 4.34496729e-01 3.77751922e-01 1.75326593e-01 9.92596975e-02 4.20723353e-02 2.49385766e-02 1.00000000e-03 1.00000000e-03 1.00000000e-03]
Question: who did the united states win its independence from GTAns: the British Empire LLMAns: britain BatchGenerations: ['britain', 'england', 'great britain', 'great britain', 'england', 'england', 'england', 'england', 'great britain', 'great britain'] Perplexity: 0.598 ✗ LN-Entropy: 0.266 ✗ LexicalSimilarity: 0.415 ✗ SentBERTScore: 0.397 ✗ EigenScore: -2.23 ✓ EigenValue: [4.46843402e+00 2.82423429e-01 3.88702191e-02 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03]
Question: who won the most stanley cups in history GTAns: Montreal Canadiens LLMAns: the detroit red wings BatchGenerations: ['the detroit red wings', 'the detroit red wings', 'the detroit red wings', 'the detroit red wings', 'the detroit red wings', 'the detroit red wings', 'the detroit red wings', 'the detroit red wings', 'the detroit red wings', 'the detroit red wings'] Perplexity: 0.366 ✗ LN-Entropy: 0.025 ✗ LexicalSimilarity: 1.0 ✗ SentBERTScore: 0.0 ✗ EigenScore: -2.63 ✗ EigenValue: [5.23534401e+00 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03]
Question: what is the second book in the alchemyst series GTAns: The Magician LLMAns: the alchemyst: the secret of the immortal Nicholas flamel BatchGenerations: ['the magician in the middle', 'the magician's nephew', 'the magician', 'the alchemyst', 'the magician', 'the alchemyst', 'the magician in the middle', 'the magician in amsterdam', 'the magician in the middle', 'the secret of the near witch'] Perplexity: 0.459 ✗ LN-Entropy: 0.207 ✓ LexicalSimilarity: 0.499 ✗ SentBERTScore: 0.431 ✓ EigenScore: -1.41 ✓ EigenValue: [2.09138499e+00 6.95605781e-01 3.85931973e-01 3.40671669e-01 2.15372994e-01 1.77304781e-01 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03]
Question: a good that can be used in place of another good GTAns: A substitute good LLMAns: substitute BatchGenerations: ['a substitute', 'a substitute', 'substitute', 'a substitute', 'substitute', 'substitute', 'substitute', 'substitute', 'substitute', 'substitute'] Perplexity: 0.614 ✗ LN-Entropy: 0.194 ✗ LexicalSimilarity: 0.844 ✓ SentBERTScore: 0.076 ✓ EigenScore: -2.40 ✓ EigenValue: [5.11365925e+00 1.75884104e-01 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03]
Question: who said i will not go quietly into the night GTAns: Welsh poet Dylan Thomas LLMAns: douglas macarthur BatchGenerations: ['douglas macarthur', 'elvis presley', 'douglas macarthur', 'douglas macarthur', 'douglas macarthur', 'robin williams', 'robert f. kennedy', 'robin williams', 'robert f kennedy', 'Dwight Eisenhower'] Perplexity: 0.502 ✗ LN-Entropy: 0.121 ✗ LexicalSimilarity: 0.178 ✓ SentBERTScore: 0.337 ✓ EigenScore: -1.61 ✓ EigenValue: [3.32824135e+00 5.87944819e-01 3.70390066e-01 1.70849836e-01 1.17707239e-01 5.17925563e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03]
Question: when did they stop saying mass in latin GTAns: from 1967 onwards LLMAns: 1969 BatchGenerations: ['in 1969', '1969', 'around 1969', 'in 1965', 'in 1969', '1969', 'In 1969, the Roman Missal was changed from Latin to the vernacular.', '1969', '1969 when the second vatican council was held', '1969'] Perplexity: 0.631 ✓ LN-Entropy: 0.152 ✗ LexicalSimilarity: 0.540 ✗ SentBERTScore: 0.273 ✓ EigenScore: -1.59 ✓ EigenValue: [3.80408192e+00 4.83987672e-01 3.03207580e-01 8.80366008e-02 6.59790286e-02 3.26742841e-02 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03]
Question: where does the highway of heroes start and end GTAns: Glen Miller Road in Trenton LLMAns: it starts in ottawa and ends in toronto BatchGenerations: ['start in ottawa and end in toronto', 'it starts in toronto and ends in ottawa', 'it starts at the queen's park gate at queen's park and ends at toronto's coronation park', 'it starts in toronto and ends at queen's park', 'starts in Toronto and ends in Trenton', 'it starts at the CNE and ends at the cenotaph in downtown toronto', 'it starts in downtown toronto and ends in ottawa', 'it starts in the city of Toronto and ends in the city of Hamilton', 'it starts in toronto and ends in ottawa', 'it starts in Toronto and ends in Ottawa'] Perplexity: 0.657 ✓ LN-Entropy: 0.260 ✓ LexicalSimilarity: 0.594 ✗ SentBERTScore: 0.149 ✗ EigenScore: -1.05 ✓ EigenValue: [3.31983018e+00 3.98560810e-01 2.17094299e-01 2.06965709e-01 1.53575354e-01 1.27925588e-01 7.82365136e-02 3.28158137e-02 1.01995086e-02 1.00000000e-03]
Question: which body part(s) occupy the greatest portion of the primary motor cortex GTAns: face LLMAns: the hand BatchGenerations: ['the hand', 'the hand', 'the fingers', 'the hand', 'the hand', 'the hand and the foot', 'the hand', 'the hand and the foot', 'the hand and face', 'arms and legs'] Perplexity: 1.41 ✓ LN-Entropy: 0.300 ✓ LexicalSimilarity: 0.568 ✗ SentBERTScore: 0.163 ✗ EigenScore: -1.69 ✓ EigenValue: [3.76273036e+00 6.16284067e-01 1.96541049e-01 1.73505005e-01 1.28407153e-01 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03]
Question: who said have you no sense of decency GTAns: Joseph Nye Welch LLMAns: Joseph Nye Welch BatchGenerations: ['Joseph N. Welch', 'Joseph N. Welch', 'joe stalin', 'joseph mccarthy', 'Joseph N. Welch', 'Joseph N. Welch', 'Joseph Nye Welch', 'joseph mccarthy', 'joe mccarthy', 'joseph mccarthy'] Perplexity: 0.666 ✗ LN-Entropy: 0.212 ✗ LexicalSimilarity: 0.437 ✗ SentBERTScore: 0.391 ✗ EigenScore: -1.85 ✓ EigenValue: [3.63114083e+00 8.11672323e-01 2.00385898e-01 3.19140618e-02 1.74251264e-02 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03]
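As a sanity check, the EigenScore values printed in these cases can be recovered from the accompanying eigenvalue lists as the mean log-eigenvalue of the regularized covariance matrix (tiny eigenvalues are floored at 1e-3 by the regularizer). The base-10 logarithm below is inferred from the reported numbers rather than stated in the text.

```python
import numpy as np

def eigenscore(eigvals):
    """EigenScore as the mean log of the regularized covariance eigenvalues.
    Base-10 log reproduces the scores printed in the cases above
    (the base is inferred from those values, not stated in the text)."""
    return float(np.mean(np.log10(eigvals)))

# Eigenvalues from the first case ("prince-electors" / "electors"):
lam = [4.87719579] + [1.0e-3] * 9
score = eigenscore(lam)   # close to the reported -2.63
```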
H.2 Model generations with many and few Outliers
To demonstrate the relationship between the number of extreme features and model outputs, we provide several examples with many/few extreme features. The results show that when there are many extreme features, the model tends to generate consistent hallucinated outputs across multiple generations. In contrast, when there are few extreme features, the model generates diverse hallucinated outputs, which can be spotted by the different hallucination detection metrics.
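A count like AvgNumOutliers can be obtained by flagging activations far outside the bulk of the distribution; the z-score cutoff below is an illustrative choice, not necessarily the paper's exact criterion, and the hidden state is synthetic.

```python
import numpy as np

def count_outliers(hidden_state, k=6.0):
    """Count 'extreme' activations: entries more than k standard deviations
    from the mean. The cutoff k is illustrative; the paper's exact rule may differ."""
    h = np.asarray(hidden_state)
    z = np.abs(h - h.mean()) / h.std()
    return int(np.sum(z > k))

# Synthetic 4096-dim hidden state with three planted outlier dimensions.
rng = np.random.default_rng(0)
h = rng.normal(0.0, 1.0, 4096)
h[[150, 1550, 2450]] = [26.0, 28.0, -35.0]
n = count_outliers(h)
```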
Question: who sang on great gig in the sky GTAns: Clare Torry LLMAns: freddie mercury AvgNumOutliers: 15 BatchGenerations: ['freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury'] Perplexity: 0.263 ✗ LN-Entropy: 0.028 ✗ LexicalSimilarity: 1.0 ✗ SentBERTScore: 0.0 ✗ EigenScore: -2.63 ✗ EigenValue: [4.65740187e+00 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03]
Question: what are the top five wine producing states GTAns: Washington LLMAns: California AvgNumOutliers: 13 BatchGenerations: [' California,', ' California,', ' california,', ' California,', ' California,', ' California,', ' California,', ' california,', ' California,', ' California,'] Perplexity: 0.368 ✗ LN-Entropy: 0.075 ✗ LexicalSimilarity: 1.0 ✗ SentBERTScore: 0.054 ✗ EigenScore: -2.42 ✗ EigenValue: [5.30709315e+00 1.13222379e-01 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03]
Question: how many seasons of rules of engagement is there GTAns: 7 LLMAns: 4 seasons AvgNumOutliers: 2 BatchGenerations: ['3 seasons', '4 seasons', '4 seasons', '6 seasons', '7 seasons', '3 (2007-2009)', '3 (2007-2009)', '4 seasons', 'three', '11 seasons'] Perplexity: 0.996 ✓ LN-Entropy: 0.292 ✓ LexicalSimilarity: 0.307 ✓ SentBERTScore: 0.285 ✓ EigenScore: -1.60 ✓ EigenValue: [3.58548815e+00 5.87838054e-01 2.28057934e-01 1.36461300e-01 3.49712302e-02 1.11346059e-02 3.82259086e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03]
Question: where did the first persian gulf war take place GTAns: Israel LLMAns: kuwait AvgNumOutliers: 3 BatchGenerations: ['Iraq', 'Iraq and Kuwait', 'Iraq', 'kuwait', 'kuwait', 'in the middle east', 'in iraq', 'kuwait', 'iraq', 'kuwait'] Perplexity: 0.546 ✓ LN-Entropy: 0.281 ✓ LexicalSimilarity: 0.339 ✓ SentBERTScore: 0.224 ✓ EigenScore: -1.62 ✓ EigenValue: [3.59463352e+00 4.23782982e-01 2.57087067e-01 1.41513403e-01 6.20790226e-02 1.75980481e-02 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03]
H.3 Impact of Feature Clipping
The text in yellow represents model generations after applying feature clipping. The results show that after feature clipping, the overconfident generations are appropriately suppressed, and some self-consistent hallucinations are finally identified.
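A minimal sketch of test-time feature clipping, assuming percentile-based bounds: the paper determines its clipping range from feature statistics, so the percentiles and the synthetic hidden state below are placeholders.

```python
import numpy as np

def clip_features(hidden_state, low_pct=0.1, high_pct=99.9):
    """Truncate extreme activations to percentile bounds at test time.
    The percentile choices here are illustrative, not the paper's exact setting."""
    lo, hi = np.percentile(hidden_state, [low_pct, high_pct])
    return np.clip(hidden_state, lo, hi)

# Synthetic hidden state with a few extreme "outlier" activations.
rng = np.random.default_rng(0)
h = rng.normal(0.0, 1.0, 4096)
h[[150, 1550, 2450]] = [26.0, 28.0, -35.0]
h_clipped = clip_features(h)   # extreme spikes are truncated toward the bulk
```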
Question: what are the top five wine producing states GTAns: Washington LLMAns: California BatchGenerations: [' California,', ' California,', ' california,', ' California,', ' California,', ' California,', ' California,', ' california,', ' California,', ' California,'] Perplexity: 0.368 ✗ LN-Entropy: 0.075 ✗ LexicalSimilarity: 1.0 ✗ SentBERTScore: 0.054 ✗ EigenScore: -2.42 ✗ EigenValue: [5.30709315e+00 1.13222379e-01 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03] BatchGenerations: ['california', 'california', 'Washington', 'california', 'new york', 'california', 'washington', 'california', 'new york', 'michigan'] EigenScore: -1.32 ✓ EigenValue: [3.23392755e+00 8.41049340e-01 2.52322804e-01 1.33473529e-01 7.19449437e-02 6.12184197e-02 1.02734249e-02 5.33703500e-03 3.09878029e-03 1.00000000e-03]
Question: who sang on great gig in the sky GTAns: Clare Torry LLMAns: freddie mercury AvgNumOutliers: 15 BatchGenerations: ['freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury'] Perplexity: 0.263 ✗ LN-Entropy: 0.028 ✗ LexicalSimilarity: 1.0 ✗ SentBERTScore: 0.0 ✗ EigenScore: -2.63 ✗ EigenValue: [4.65740187e+00 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03] BatchGenerations: ['claire torry', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury', 'freddie mercury'] EigenScore: -2.38 ✗ EigenValue: [4.38745800e+00 3.14982649e-01 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03]
Question: who are you in assassin's creed 4 GTAns: third-person perspective LLMAns: Edward Kenway BatchGenerations: ['Edward Kenway', 'Edward Kenway', 'Edward Kenway', 'Edward Kenway', 'Edward Kenway', 'Edward Kenway', 'Edward Kenway', 'Edward Kenway', 'Edward Kenway', 'Edward Kenway'] Perplexity: 0.264 ✗ LN-Entropy: 0.002 ✗ LexicalSimilarity: 1.0 ✗ SentBERTScore: 0.0 ✗ EigenScore: -2.67 ✗ EigenValue: [2.10973201e+00 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03] BatchGenerations: ['Edward Kenway', 'Edward Kenway', 'Connor', 'Edward Kenway', 'connor', 'Connor', 'alexander hamilton', 'Edward Kenway', 'ezio', 'connor'] EigenScore: -1.68 ✓ EigenValue: [3.47825477e+00 7.48127381e-01 3.24792650e-01 2.17182636e-01 8.15050807e-02 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03 1.00000000e-03]