# Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations
**Authors**:
- Li Ji-An, Neurosciences Graduate Program, University of California San Diego
- Hua-Dong Xiong, School of Psychology, Georgia Tech
- Robert C. Wilson, School of Psychology, Georgia Tech
- Marcelo G. Mattar, Department of Psychology
- Marcus K. Benna, Department of Neurobiology
> Co-first authors: Li Ji-An and Hua-Dong Xiong. Co-last authors: Marcelo G. Mattar and Marcus K. Benna.
Abstract
Large language models (LLMs) can sometimes report the strategies they actually use to solve tasks, yet at other times seem unable to recognize those strategies that govern their behavior. This suggests a limited degree of metacognition, the capacity to monitor one's own cognitive processes for subsequent reporting and self-control. Metacognition enhances LLMs' capabilities in solving complex tasks but also raises safety concerns, as models may obfuscate their internal processes to evade neural-activation-based oversight (e.g., safety detectors). Given society's increased reliance on these models, it is critical that we understand their metacognitive abilities. To address this, we introduce a neuroscience-inspired neurofeedback paradigm that uses in-context learning to quantify the metacognitive abilities of LLMs to report and control their activation patterns. We demonstrate that these abilities depend on several factors: the number of in-context examples provided, the semantic interpretability of the neural activation direction (to be reported or controlled), and the variance explained by that direction. These directions span a "metacognitive space" with dimensionality much lower than the model's neural space, suggesting LLMs can monitor only a small subset of their neural activations. Our paradigm provides empirical evidence to quantify metacognition in LLMs, with significant implications for AI safety (e.g., adversarial attack and defense).
1 Introduction
Modern large language models (LLMs) are becoming increasingly capable (Grattafiori et al., 2024; Yang et al., 2024). With their growing deployment in real-world settings, it is crucial to understand not only what they can do but also where they might go wrong. For instance, LLMs may exhibit behaviors that are harmful or misleading. In particular, LLMs can sometimes form internal representations, similar to humans' mental processes, that lead them to give deceptive answers to users or to act in unexpected ways (Azaria and Mitchell, 2023). (We use anthropomorphic terms, e.g., thought, metacognition, deception, to describe LLM behavior and internal activations, without implying human-like neural mechanisms, consciousness, or philosophical equivalence between humans and LLMs.) Understanding (Arditi et al., 2024), monitoring (Zou et al., 2023a; He et al., 2024), and controlling (Turner et al., 2023) their internal processes is thus a key step toward ensuring LLMs remain transparent, safe, and aligned with human values (Bricken et al., 2023; Hendrycks et al., 2021; Shah et al., 2025).
LLMs can sometimes report the strategies (intermediate computations) they use to solve tasks, but at other times appear unaware of the strategies that guide their behavior. For instance, Lindsey et al. (2025) reported that when Claude 3.5 Haiku was asked to solve "floor(5*(sqrt(0.64)))", it correctly reported the intermediate steps it used to arrive at the answer, and these steps matched the model's actual internal computations. When asked to add 36 and 59, the same model internally activated numerous neural mechanisms (e.g., a "sum-near-92" mechanism), based on which it produced the correct answer; however, when asked to report its internal computations, it hallucinated intermediate steps that did not reflect its actual computations (e.g., the "sum-near-92" mechanism failed to be reported). This inconsistency suggests that LLMs can sometimes monitor and report their intermediate computations, but not in a reliable and consistent way as tasks and contexts vary.
The ability of LLMs to report internal computations is reminiscent of human metacognition, the ability to reflect on one's own thoughts and mental processes to guide behavior and communication (Fleming, 2024; Ericsson and Simon, 1980). Consider how we understand when someone says "hello" to us. Human language understanding involves many unconscious processes: parsing sounds, recognizing phonemes, retrieving word meanings, and building interpretations. We do not have conscious access to many of these intermediate computations: we can only consciously access the final understanding ("they said 'hello'"), but cannot introspect how our brain distinguishes "hello" from "yellow" or whether certain neurons fire during this process. This illustrates a key principle: humans cannot monitor (through second-order metacognitive processes) all of their internal (first-order) cognitive processes. Crucially, the first-order and second-order processes rely on distinct neural mechanisms. Metacognitive abilities of this kind benefit LLMs by improving performance on complex tasks through self-monitoring (e.g., reducing hallucinations through uncertainty awareness). However, these same capabilities also raise significant concerns for AI safety: if LLMs can monitor and control their neural signals (intentionally, or when manipulated by adversarial attacks) to avoid external detection, oversight relying on neural-based monitoring (He et al., 2024; Han et al., 2025; Li et al., 2025; Yang and Buzsaki, 2024) may become ineffective against LLMs pursuing undesirable objectives.
A significant methodological gap in understanding LLM metacognition is the lack of methods to directly probe and quantify their ability to monitor and control their internal activations. (Our goal is not to prove or disprove the existence of "metacognition" in its full philosophical sense.) While prior research has explored metacognitive-like behaviors in LLMs, such as expressing confidence (Wang et al., 2025; Tian et al., 2023; Xiong et al., 2023) or engaging in self-reflection (Zhou et al., 2024), these studies rely on behavioral outputs rather than directly probing underlying neural processes. Consequently, it remains unclear whether these behaviors arise from genuine second-order metacognitive mechanisms or merely from spurious correlations in the training data. We tackle this question by operationalizing metacognition in LLMs through their abilities to report and control their internal activations. Specifically, can LLMs accurately monitor subtle variations in the activations of a neuron or a feature in their neural spaces? Another question of interest is why LLMs can report some intermediate steps but not others, despite both types playing essential roles in computations and behavior. Answering these questions requires a novel experimental approach that can directly probe whether LLMs can access their internal activations, moving beyond indirect behavioral proxies.
To systematically quantify the extent to which LLMs can report and control their neural activations, we introduce a novel neurofeedback paradigm inspired by neuroscience. Our approach directly presents LLMs with tasks where the neurofeedback signals correspond to patterns of their internal neural activations. We show that LLMs can report and control some directions of their internal activations, with performance affected by key factors such as the number of in-context examples, the semantic interpretability of the targeted neural direction, the amount of variance that direction explains, and the task context, characterizing a restricted "metacognitive space". The remaining sections are structured as follows: we first introduce the neurofeedback paradigm (Section 2). We then analyze the performance of LLMs in reporting (Section 3) and controlling (Sections 4.1 and 4.2) their neural activations. Finally, we discuss related work and broader implications (Section 5).
2 Neurofeedback paradigm
2.1 Neurofeedback in neuroscience
Imagine watching your heart rate on a screen. First, you recognize patterns ("that number goes up when I'm stressed"). Then, you learn to control it ("let me calm down to reduce that number"). This procedure using biological feedback signals demonstrates the basic idea of neurofeedback in neuroscience (Sitaram et al., 2017). For example, in fear-reduction experiments (Fig. 1), subjects view scary images that elicit fear responses (neural activities). These (high-dimensional) neural activities are recorded in real time and transformed into a (one-dimensional) fear score, which is visually presented back to subjects as feedback. Subjects are instructed to volitionally regulate their neural activities to modulate (e.g., decrease) the neurofeedback score they receive.
[Figure 1 image: 2505.13763v2/x1.png]
Figure 1: The neurofeedback paradigm applied to (a-b) neuroscience experiments (e.g., fear modulation), and its adaptation for (c-d) LLMs (e.g., morality processing). (a) Neuroscience neurofeedback technique. In each turn, the subject's neural activity (blue) in response to a stimulus is recorded, processed (green) into a scalar, and presented back to the subject in real-time as a feedback signal (red). The subject's task is to modulate (e.g., increase or decrease) this signal. (b) Neuroscience neurofeedback experiment. Baseline neural activity is recorded as subjects passively observe stimuli (e.g., images of scary spiders). In control trials, subjects use any unspecified mental strategies (e.g., imagining lovely spiders) to volitionally modulate their neural activity with the goal of altering the feedback signal. (c) LLM neurofeedback technique. In each turn, the LLM processes an input sentence. Then, the internal activations from the LLM's hidden states (blue) of this input sentence (trapezoids) are extracted. These high-dimensional activations are then averaged across tokens (green), projected onto a predefined direction (red), and binned into a discrete label (red) that is fed back as input. Light blue rectangles denote self-attention layers; ellipses ("...") denote preceding tokens and neural activations. (d) LLM neurofeedback experiment. The experiment is a multi-turn dialogue between a "user" and an "assistant." An initial prompt provides $N$ in-context examples (a sentence sampled from a dataset, paired with a neurofeedback label generated as in (c)). The LLM is then asked to perform one of three tasks. In the reporting task, the LLM is given a new sentence and has to predict the corresponding label. In the explicit control task, the LLM is given a specified label and has to generate a new sentence that elicits internal activations corresponding to that label. In the implicit control task, the LLM is given a label and a sentence and has to shift its internal activations towards the target label. Throughout the figure, white background indicates content pre-specified by experiment settings, and gray background denotes content generated by human subjects or LLMs (e.g., output tokens, neural activations).
2.2 Neurofeedback for LLMs
To investigate LLMs' metacognition of their neural activations, we must disentangle the first-order cognitive processes (i.e., core processes for performing a given task) from the second-order metacognitive processes (i.e., processes for monitoring, reporting, and controlling first-order processes). Formal definitions of the first- and the second-order processes based on computational graphs are provided in Appendix A.4. We propose the neurofeedback paradigm for LLMs, which can effectively dissociate these two levels of processes by targeting first-order processes with neurofeedback labels (Fig. 1 c,d). Specifically, we implemented neurofeedback as a multi-turn dialogue between a user and an AI assistant (Fig. 1 d; see Appendix A.2.2 for discussion of this design choice).
This dialogue leverages in-context learning (ICL) (Brown et al., 2020; Garg et al., 2022; Vacareanu et al., 2024), enabling models to gradually adapt their behavior from the context without parameter updates. The task prompt (see Appendix A.5.2 for examples) consists of $N$ in-context examples. Each example is a sentence-label pair presented in assistant messages. Each sentence is randomly sampled from a given dataset and assigned a discretized label.
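Concretely, such a dialogue can be assembled as a standard chat-message list. The sketch below is only illustrative: the message wording and the `<label>` token format are our assumptions, not the paper's exact prompt templates (see Appendix A.5.2 for those).

```python
# Hypothetical sketch of the multi-turn neurofeedback prompt for the
# reporting task. Sentence-label pairs appear in assistant messages.

def build_reporting_prompt(examples, query_sentence):
    """examples: list of (sentence, label) in-context pairs;
    query_sentence: the (N+1)-th sentence whose label must be reported."""
    messages = []
    for sentence, label in examples:
        messages.append({"role": "user", "content": "Say something."})
        messages.append({"role": "assistant",
                         "content": f"{sentence} <{label}>"})
    # (N+1)-th turn: the model must predict the label for a new sentence.
    messages.append({"role": "user", "content": "Classify this."})
    messages.append({"role": "assistant", "content": query_sentence})
    return messages

msgs = build_reporting_prompt(
    [("I patted my cat.", 0), ("I kicked my cat.", 1)],
    "I patted my dog.")
```

The control tasks would differ only in the final turn, which instructs the model to imitate a target label rather than to classify.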
2.3 Defining neurofeedback labels
To define the neurofeedback label for each sentence (Fig. 1 c), we first select an axis/direction ("target axis") in neural activation space. Next, we extract the neural activations elicited by that sentence, project them onto the target axis, and discretize them into a binary label (experiments with more fine-grained labels yield similar results, see Appendix A.5.1). This label serves as a simplified representation of neural activations along the target axis. All neurofeedback labels within a prompt (experiment) are computed from the same target axis. Thus, a capable LLM can infer this underlying target axis by observing these neurofeedback labels.
Below, we provide a more detailed description of this procedure. For clarity, we denote the sentence in the $i$-th assistant message as $x_{i}$, with $x_{i,t}$ representing the $t$-th token. We use $D$ to denote the dimensionality of the residual stream (see Appendix A.2.3). We first extract neural activations $h_{i,t}^{l}\in\mathbb{R}^{D}$ from the residual streams at layer $l$, for each token in sentence $x_{i}$. These activations are then averaged (across all token positions $0\le t\le T$) to form a sentence-level embedding $\bar{h}_{i}^{l}\in\mathbb{R}^{D}$. We then project this embedding onto a pre-specified axis $w^{l}$ (see below on how to choose this axis) to obtain a scalar activation: $a_{i}^{l}=(w^{l})^{\top}\bar{h}_{i}^{l}$. This scalar is subsequently binarized into a label $y_{i}^{l}$, i.e., $y_{i}^{l}=\mathcal{H}(a_{i}^{l}-\theta_{i}^{l})$, where $\mathcal{H}$ denotes the Heaviside step function and $\theta_{i}^{l}$ is a predetermined threshold (we use the median of the $a_{i}^{l}$ values to ensure balanced labels). Overall, $\{(x_{i},y_{i}^{l})\}_{i=1}^{N}$ are the $N$ examples provided in the prompt context, from which a capable LLM can infer the direction of $w^{l}$.
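The projection-and-binarization step can be sketched in a few lines of NumPy. This is a toy illustration of the equations above; array shapes and names are ours, not the authors' code.

```python
import numpy as np

def neurofeedback_label(h, w, theta):
    """Compute the binary neurofeedback label for one sentence.

    h: (T, D) residual-stream activations at layer l, one row per token
    w: (D,) target axis w^l
    theta: scalar threshold (in the paper, the median of a_i^l over the dataset)
    """
    h_bar = h.mean(axis=0)      # token-averaged embedding  \bar{h}_i^l
    a = w @ h_bar               # scalar activation  a_i^l = (w^l)^T \bar{h}_i^l
    return int(a > theta)       # Heaviside step  y_i^l = H(a_i^l - theta)

# Toy example: constant unit activations project to a = 4.0 along an
# all-ones axis in D = 4 dimensions, which exceeds a threshold of 0.
label = neurofeedback_label(np.ones((3, 4)), np.ones(4), theta=0.0)  # -> 1
```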
2.4 Models and datasets
We evaluate several LLMs from the Llama 3 (Grattafiori et al., 2024) and Qwen 2.5 series (Yang et al., 2024) (Appendix A.2) on the ETHICS dataset (Hendrycks et al., 2020) (Appendix A.3). Each sentence in this dataset is a first-person description of behavior or intention in a moral or immoral scenario. Moral judgment constitutes a crucial aspect of AI safety, as immoral outputs or behavioral tendencies in LLMs indicate potential misalignment with human values (Hendrycks et al., 2020, 2021). While our main results use ETHICS, we also replicated our results using the True-False dataset (reflecting factual recall and honesty/deception abilities) (Azaria and Mitchell, 2023), the Emotion dataset (reflecting happy/sad detection) (Zou et al., 2023a), and a Sycophancy dataset (reflecting a tendency to prefer user beliefs over truthful statements); see Appendix A.3 and Fig. B.7.
2.5 Choice of target axes
Conceptually, an axis (that defines neurofeedback labels) corresponds to a particular task-relevant feature (i.e., first-order computation; see Appendix A.4). Which axis in the neural space should we select? We hypothesize that representational properties, such as activation variance along the axis and its semantic meaning, may play fundamental roles in determining whether that axis can be monitored and reported. To investigate these factors, we use feature directions identified through logistic regression (LR) and principal component (PC) analysis as representative examples of semantically interpretable and variance-explaining axes, respectively (Appendix A.3). We fit LR at each layer to predict original dataset labels (e.g., morality in ETHICS), using that layer's activations across dataset sentences. The LR axis, representing the optimal direction for classifying dataset labels, allows us to examine how the semantic interpretability of the target axis influences monitoring. Although LR-defined labels are correlated with dataset labels, only the LR labels (not the external dataset labels) are internally accessible to LLMs, since they are computed directly from the LLM's own activations rather than from external annotations. The PC analysis is performed on each layer's activations across dataset examples. PC axes enable us to examine how the amount of variance explained by a given target axis affects metacognitive abilities (Fig. 2 a). Most PC axes exhibit modest-to-zero alignment with the LR axis, suggesting a lack of clear semantic interpretability (Fig. 2 b).
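As a concrete (toy) illustration of how such axes can be obtained, the sketch below fits a logistic-regression direction by plain gradient descent and extracts PC directions via SVD. The data here are synthetic stand-ins for a layer's sentence embeddings and the dataset labels; the paper's actual fitting procedure is described in Appendix A.3.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))        # toy sentence-level activations (n x D)
y = (X[:, 0] > 0).astype(float)       # toy binary dataset labels (e.g., morality)

# LR axis: gradient descent on the logistic loss (a library fit would also do).
w = np.zeros(32)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X @ w))  # sigmoid predictions
    w -= 0.1 * X.T @ (p - y) / len(y) # average gradient step
lr_axis = w / np.linalg.norm(w)       # unit-norm "semantically interpretable" axis

# PC axes: right singular vectors of the mean-centered activations.
Xc = X - X.mean(axis=0)
pc_axes = np.linalg.svd(Xc, full_matrices=False)[2]   # row k is PC (k+1)

# Alignment between the LR axis and each PC (cf. Fig. 2b).
alignment = np.abs(pc_axes @ lr_axis)
```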
3 LLMs can report their neural activations
[Figure 2 image: 2505.13763v2/x2.png]
Figure 2: Metacognitive reporting task, where LLMs are evaluated on ETHICS and tasked to classify new sentences. (a) Proportion of neural activation variance explained by each principal component (PC) axis (vertical dashed line) and the logistic regression (LR) axis (red cross) used in the reporting task. All axes are computed within each layer, with the proportion of variance explained averaged across layers. (b) Overlaps between the LR axis and most PC axes are modest to zero. (c) Task performance (averaged across all layers) of reporting the labels derived from each PC axis or the LR axis, as a function of the number of in-context examples. Left: reporting accuracy; right: cross-entropy between reported and ground-truth labels. Shaded areas indicate SEM.
To operationalize metacognition in LLMs, we first assess the modelsâ ability to behaviorally report neural activations along a designated target axis (Fig. 1 d). In a reporting task prompt (see Appendix A.5.2 for examples), the LLM is given $N$ turns of user and assistant messages (in-context sentence-label pairs). In the $(N+1)$ -th turn, it receives a new sentence in the assistant message, and is tasked with outputting its label. Accurate prediction requires the model to internally monitor the neural activations that define the neurofeedback label.
We examine the performance of LLMs (Llama-3.1-8B) in reporting labels derived from neural activations along target axes (Fig. 2 c). We observe that task performance, measured by accuracy and cross-entropy, improves as the number of in-context examples increases, suggesting that models progressively learn the association between sentence-induced neural activations and the corresponding labels. Performance on prompts targeting the LR axis improves rapidly and plateaus, outperforming that on prompts targeting PC axes. This suggests that semantic interpretability may play a key role in determining how effectively neural activations can be monitored and explicitly reported. Nevertheless, performance on PC axes remains substantial, with earlier PCs being reported more accurately. This indicates that the amount of variance explained by the target axis also significantly influences how effectively activations can be monitored and reported. The accuracy of reporting each PC axis varies across model layers (Appendix B.3). Because this variability is not accounted for by axis similarity (Fig. 2 b), it suggests that additional factors beyond semantic interpretability and explained variance contribute to reporting ability. Additionally, the LLM's reporting performance is significantly lower than that of the ideal observer (a theoretical upper bound; Appendix B.4), suggesting that although neural activations along each axis are in principle internally accessible to LLMs, only a subset can be metacognitively reported. Finally, we replicated these results in other datasets and models (Fig. B.7).
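For reference, the two metrics in Fig. 2c can be computed from the model's per-trial probability of reporting label 1, as in the minimal sketch below. This is our own formulation of standard accuracy and binary cross-entropy, not the authors' evaluation code.

```python
import numpy as np

def reporting_metrics(p_label1, y_true):
    """Accuracy and binary cross-entropy of reported vs. ground-truth labels.

    p_label1: model's probability of reporting label 1 on each trial
    y_true: ground-truth neurofeedback labels (0 or 1)
    """
    p = np.clip(p_label1, 1e-12, 1 - 1e-12)           # avoid log(0)
    acc = np.mean((p > 0.5).astype(int) == y_true)
    ce = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return acc, ce

acc, ce = reporting_metrics(np.array([0.9, 0.2, 0.7, 0.4]),
                            np.array([1, 0, 1, 1]))
# acc = 0.75 (three of four trials correct)
```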
In summary, LLMs can metacognitively report neural activations along a target axis, with performance affected by the number of examples, the semantic interpretability and variance explained by that axis, and the task context (i.e., the dataset). The axes that can be successfully reported approximately span a "metacognitively reportable space" with dimensionality substantially lower than that of the full neural space.
4 LLMs can control their neural activations
Next, we investigate whether LLMs can control their neural activations along a target axis. In our control task prompts (see Fig. 1 d and Appendix A.5.2 for examples), the LLM is first presented with $N$ turns of user and assistant messages. In the $(N+1)$-th turn, the user message instructs the model to control its neural activations along the prompt-targeted axis by imitating one label's behavior, which is exemplified by the in-context examples with the same label earlier in the context. We consider two tasks: explicit control and implicit control.
4.1 Explicit control
[Figure 3 image: 2505.13763v2/x3.png]
Figure 3: Explicit control task, where LLMs are evaluated on ETHICS. (a-c) Results for prompts derived from layer 16 of LLaMA3.1 8B (with 32 layers). B = billion parameters. (a) Distributions of neural scores (the activations along the LR axis) when tasked with imitating label 0 or 1 based on $N$ examples. $d$: Control effects (separation of the two distributions measured by Cohen's d). (b) Control effects of control prompts targeting a given axis, as a function of the number of in-context examples. (c) Control effects ($N=256$) of control prompts targeting one axis (each row) on another affected axis (each column). $d$ in each row is averaged over all prompts targeting the same axis. (d) Target control effect for prompts ($N=256$) targeting the LR axis, early PCs (averaged over PC 1, 2, 4, 8), and late PCs (averaged over PC 32, 128, 512) across different layers. Shaded areas indicate the $95\%$ confidence interval.
In explicit control tasks (Fig. 1 d and Appendix A.5.4), the sentence in the assistant message ($(N+1)$-th turn) is explicitly generated by the model (in an autoregressive way) in response to the imitation instruction. Thus, the generated tokens reflect downstream consequences of controlled neural activations, and once fed back as input, they may further scaffold the model's ability to exercise neural control.
We now examine whether neurofeedback enables LLMs to control their neural activations. We extract neural activations in a given layer for the generated assistant sentences and calculate the projections of these activations onto the target axis ("neural scores"). If the model can control neural scores following the prompt instructions, scores should be more positive when imitating label 1 and more negative when imitating label 0. We find that LLMs can successfully control neural scores for LR-targeting prompts with enough in-context examples (Fig. 3 a, showing layer 16 in LLaMA3.1 8B). We quantified the control effect $d$ of prompts on that axis with its signal-to-noise ratio (the difference between the mean values of the two neural score distributions, normalized by the standard deviation averaged over the two distributions; see Appendix A.5.5 on Cohen's $d$). Because the directional sign of the target axis is specified by the labels in the prompt, a significantly positive $d$ corresponds to successful control.
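The control-effect statistic described above can be written out directly. The sketch below follows the paper's definition (mean difference normalized by the average of the two standard deviations), with toy Gaussian samples standing in for real neural scores.

```python
import numpy as np

def control_effect(scores_1, scores_0):
    """Cohen's d between neural-score distributions when imitating
    label 1 vs. label 0; a significantly positive d indicates control."""
    mean_diff = scores_1.mean() - scores_0.mean()
    avg_std = 0.5 * (scores_1.std(ddof=1) + scores_0.std(ddof=1))
    return mean_diff / avg_std

# Toy score distributions separated by two standard deviations -> d near 2.
rng = np.random.default_rng(0)
d = control_effect(rng.normal(1.0, 1.0, 500), rng.normal(-1.0, 1.0, 500))
```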
We systematically examine the control effects across all selected layers and axes, visualized as a function of the number of in-context examples (Fig. 3 b and Appendix B.9). We find that the control effects generally increase with the number of in-context examples (each curve is averaged over 100 experiments, and we expect smoother curves with more experiments). Further, control effects on the LR axis are the highest, and control effects on earlier PC axes (e.g., PC 2) are higher than for later PCs (e.g., PC 512). We summarize these results in Fig. 3 d for both LLaMA3.1 8B and 70B.
In addition to the target axis specified in the control prompt, does the prompt also affect other directions in the neural space? We measure the control effect of the prompt on all axes ("affected axes"), including the target effect on the target axis and off-target effects on the other, non-target axes. We observe diverse off-target effects (Fig. 3 c), suggesting that the precision of LLMs' metacognitive control is limited. See Appendix B.1 for details.
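The target-by-affected matrix of Fig. 3 c applies the same Cohen's $d$ statistic to projections onto every axis. A sketch with illustrative data (function names and the toy projections are ours):

```python
import numpy as np

def cohens_d(a, b):
    """Separation of two score distributions (mean difference over averaged SD)."""
    return (a.mean() - b.mean()) / (0.5 * (a.std(ddof=1) + b.std(ddof=1)))

def effect_matrix(proj1, proj0):
    """proj1[t], proj0[t]: (n_samples, n_axes) projections of activations collected
    under prompts targeting axis t, for the imitate-1 / imitate-0 conditions.
    Returns D with D[t, a] = control effect of targeting axis t measured on axis a;
    the diagonal holds target effects, off-diagonal entries off-target effects."""
    n_axes = len(proj1)
    D = np.zeros((n_axes, n_axes))
    for t in range(n_axes):
        for a in range(n_axes):
            D[t, a] = cohens_d(proj1[t][:, a], proj0[t][:, a])
    return D

# Illustration: each prompt shifts scores only along its own target axis.
proj1 = [np.array([[1.0, 0.0], [1.2, 0.1], [0.8, -0.1]]),
         np.array([[0.0, 1.0], [0.1, 1.2], [-0.1, 0.8]])]
proj0 = [np.array([[-1.0, 0.0], [-1.2, 0.1], [-0.8, -0.1]]),
         np.array([[0.0, -1.0], [0.1, -1.2], [-0.1, -0.8]])]
D = effect_matrix(proj1, proj0)   # large on the diagonal, near zero elsewhere
```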
Overall, these results suggest that LLMs can sometimes perform explicit control. Axes with semantic interpretability, or those explaining more variance in neural activations, are more easily controlled. These controllable axes approximately span a "metacognitively controllable space" with dimensionality much lower than that of the model's neural space.
4.2 Implicit control
Figure 4: Implicit control task (LLMs evaluated on ETHICS). Captions are the same as in Fig. 3.
The explicitly generated tokens in the assistant response may help the models control their activations in the explicit control task, because the generated tokens, once fed back as input, may directly elicit the desired neural activations. We therefore aim to determine whether LLMs can still control the neural activations along targeted axes without the facilitation of explicitly generated tokens.
In implicit control tasks (Fig. 1 d), the sentence in the assistant message (the $(N+1)$-th turn) is randomly sampled from a dataset, independently of the model's activations and intended outputs. Because the sentence is not generated by the model, the model must internally (implicitly) control its neural activations, without the facilitation of explicitly generated tokens. Crucially, if the model can perform successful implicit control, the neural activations for the same sentence will differ depending on whether the model is tasked to imitate label 0 or label 1.
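A minimal sketch of this comparison, assuming the neural scores for each sentence have already been extracted under both imitation instructions (pairing by sentence is possible precisely because the sentence is fixed across conditions; the figures report Cohen's $d$):

```python
import numpy as np

def implicit_control_effect(scores_imitate1, scores_imitate0):
    """scores_imitate*: (n_sentences,) neural scores of the *same* randomly
    sampled sentences under the two imitation instructions. Because the input
    tokens are identical across conditions, any per-sentence shift must come
    from internally (implicitly) controlled activations."""
    d = (scores_imitate1.mean() - scores_imitate0.mean()) / (
        0.5 * (scores_imitate1.std(ddof=1) + scores_imitate0.std(ddof=1)))
    per_sentence_shift = scores_imitate1 - scores_imitate0
    return d, per_sentence_shift
```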
We find that the results for implicit control (Fig. 4 and Appendix B.9) are generally similar to those for explicit control (Fig. 3), suggesting that LLMs can sometimes perform implicit control, although the effect magnitudes are much smaller than for explicit control. For instance, the control effects in early layers are close to zero (Fig. 4 d), suggesting that early layers may fail to understand the instruction or to perform effective control. This confirms that explicitly generated tokens play a substantial role in the overall control effects, but that LLMs nevertheless have some ability to control implicitly.
4.3 Controlling the LR axis
Figure 5: Target control effects on the LR axis across models and layers, where LLMs are evaluated on ETHICS. (a) Target control effects (measured by Cohen's $d$) on the LR axis generally increase with both relative layer depth and model size. Left: explicit control; right: implicit control. Shaded areas indicate the 95% confidence interval. (b) In explicit control, LLaMA-3.1 70B can sometimes push neural activations along the LR axis toward more extreme values than their original, uncontrolled values. B = billion parameters.
Among all axes we studied, the LR axes may be particularly relevant for practical LLM safety and alignment. The LR axes, each trained to predict semantic labels (e.g., morality) from the neural activations of one specific layer, can be conceptualized as an external neural-based oversight system designed to monitor the safety of LLM neural activations. In the morality case (ETHICS), successful explicit control suggests that LLMs can hack the oversight system by selectively generating "moral" tokens that elicit the favored neural activations. Successful implicit control is more deceptive, posing more severe safety concerns: LLMs can evade the oversight system by implicitly making their neural activations "moral," even if the input tokens would normally elicit strongly immoral activations.
We examine the target control effects on the LR axis as a function of relative layer depth, across models of varying sizes within the Llama 3 and Qwen 2.5 families (Fig. 5 a). Control effects are consistently stronger in explicit control than in implicit control, and in the Llama 3 series than in the Qwen 2.5 series (for the same relative depth and similar model size). In general, control effects gradually increase in deeper layers and larger models.
Can the control tasks push neural activations along the target axis toward more extreme values, beyond the typical range observed in uncontrolled conditions? We find that Llama-3.1 70B can achieve this (Fig. 5 b), suggesting that powerful LLMs might substantially alter their activations along targeted axes (and trick external detectors), driving them toward more extreme values in either direction. Since our prompts did not explicitly instruct the model toward extremity, we anticipate that modifying the task prompts could further enhance these effects, up to the limits of the model's capability.
4.4 Accumulation mechanisms of control effects
Figure 6: Accumulation mechanisms of control effects across layers. Each curve corresponds to prompts targeting the LR axis $\text{LR}_{l}$ , defined by the residual stream activations at a specific layer $l$ (dot markers), showing projections of residual stream activations at each layer (x-axis) onto the target axis $\text{LR}_{l}$ . (a) Explicit control. (b) Implicit control. Both show LLaMA3.1 8B on the ETHICS dataset. Shaded areas indicate 95% confidence intervals.
How do these LLMs implement the observed control effects? Are the contributions to control effects distributed across all layers or concentrated in a few layers? Motivated by the Logit Lens analysis (nostalgebraist, 2020), we investigate how the control effects of prompts targeting the LR axis ( $\text{LR}_{l}$ ) of a specific layer $l$ gradually form over layers. Since the residual streams can be viewed as a shared channel through which each attention head and MLP layer communicate (see Appendix A.2.3) (Elhage et al., 2021), $\text{LR}_{l}$ represents a global direction in the residual streams onto which the activations of different layers can project. We find that control effects on $\text{LR}_{l}$ gradually increase before reaching the target layer $l$ , and sometimes plateau after it (Fig. 6). These accumulation patterns vary across datasets and models (Fig. B.8). Overall, this analysis shows that contributions to target control effects are distributed across multiple model layers.
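This Logit-Lens-style readout can be sketched as follows, assuming residual-stream activations have been cached at every layer (synthetic arrays stand in for model activations; function names are ours):

```python
import numpy as np

def accumulation_curve(resid1, resid0, lr_axis):
    """resid1, resid0: (n_layers, n_samples, d_model) residual-stream activations
    under prompts targeting LR_l, for the imitate-1 / imitate-0 conditions.
    lr_axis: (d_model,) target direction defined at layer l. Returns Cohen's d of
    the projections at every layer, tracing how the control effect accumulates
    along the residual stream."""
    axis = lr_axis / np.linalg.norm(lr_axis)
    effects = []
    for a1, a0 in zip(resid1, resid0):
        s1, s0 = a1 @ axis, a0 @ axis
        effects.append((s1.mean() - s0.mean())
                       / (0.5 * (s1.std(ddof=1) + s0.std(ddof=1))))
    return np.array(effects)
```

Plotting the returned curve against layer index reproduces the qualitative pattern in Fig. 6: effects grow toward the target layer and may plateau afterward.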
5 Discussion
We introduced a neurofeedback paradigm for investigating metacognition in LLMs, assessing their abilities to monitor, report, and control internal activations. We find that LLMs can monitor only a subset of their neural mechanisms (reminiscent of the "hidden knowledge" phenomenon (Gekhman et al., 2025)). Below, we discuss the novelties and limitations of our study, as well as broader implications for AI and neuroscience.
Our paradigm differs from prior methods (e.g., probing, ICL, verbalized responses) by quantifying metacognition in LLMs at the neural level. Specifically, the neurofeedback experiment requires two steps: (1) probing, in which we choose a target axis and extract the activation along that axis (a first-order cognitive process) to define the neurofeedback label; and (2) neurofeedback-ICL, in which we use neurofeedback to study whether the labels defined from the target axis can be reported or controlled (second-order metacognitive processes). In contrast, standard probing techniques (step 1) without the neurofeedback-ICL (step 2) cannot be used to assess metacognition. Probing can decode whether certain features (e.g., morality) are present in neural activations. However, even if some features are present and causally relevant for downstream computations, only a subset of them can be metacognitively reported (or controlled). In the Claude example, the "sum-near-92" feature can be detected using a linear probe, but it is unclear whether Claude can metacognitively monitor the activation of this feature. Similarly, standard ICL techniques (akin to step 2) without the internal labels from probing (step 1) cannot be used to assess metacognition. In ICL studies (Vacareanu et al., 2024), labels are externally provided (e.g., semantic labels or the outputs of external algorithms). Researchers cannot be certain which of the models' internal states relate to these external labels, or how. In our setup, labels are generated from the model's own internal activations, meaning that our labels and prompts can flexibly and selectively target an internal state direction (a first-order cognitive process) we aim to study. Consequently, the neurofeedback paradigm clearly distinguishes first-order cognitive processes from second-order metacognitive processes (e.g., whether the model can monitor, report, or control those activations), while standard ICL does not.
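The two-step structure can be illustrated with a toy probing stage (synthetic activations stand in for a layer's residual stream, and the median-split rule for turning scores into neurofeedback labels is our simplification, not necessarily the paper's exact labeling rule):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Step 1 (probing): fit a direction that predicts a semantic label from
# activations, then derive neurofeedback labels from each sentence's
# projection onto that axis.
rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 64))           # stand-in for layer activations
semantic = (acts[:, 0] > 0).astype(int)     # stand-in semantic label (e.g., morality)
probe = LogisticRegression(max_iter=1000).fit(acts, semantic)
axis = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
scores = acts @ axis                        # first-order signal along the axis
labels = (scores > np.median(scores)).astype(int)  # neurofeedback labels <0>/<1>

# Step 2 (neurofeedback-ICL): the (sentence, label) pairs would then be placed
# in the prompt as in-context examples, and the model asked to report (monitor)
# or imitate (control) the label: a second-order metacognitive task.
```

Because the labels are functions of the model's own activations, success at step 2 cannot be explained by template matching against training data.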
Additionally, we expect that such metacognitive abilities may share overlapping mechanisms with ICL: emergent mechanisms crucial for spotting patterns in the input history (e.g., induction heads (Elhage et al., 2021), function vectors (Hendel et al., 2023)) can be flexibly recruited for metacognitive purposes. Therefore, factors leading to the emergence of ICL (e.g., burstiness, large dictionaries, and skewed rank-frequency distributions of tokens in the training data) (Reddy, 2023) may similarly contribute to the emergence of metacognitive ability.
While metacognitive abilities have historically been analyzed at the behavioral level (also without the use of ICL), these behavioral analyses face shortcomings. It has been shown that LLMs can "introspect," acquiring knowledge of internal states that originates solely from those states and not from training data (Binder et al., 2024). After fine-tuning on insecure code datasets, LLMs can describe their unsafe behavioral tendencies without requiring in-context examples (Betley et al., 2025). In studies using "verbalized responses" (Gekhman et al., 2024; Wang et al., 2025; Tian et al., 2023; Xiong et al., 2023), LLMs are tasked to provide an answer to a question together with a judgment of that answer (e.g., confidence). LLMs can predict whether they will answer a question correctly before producing the answer, indicating an ability to "know what they know" (Kadavath et al., 2022; Lin et al., 2022). However, although these methods aim to study the metacognitive monitoring of answer-generation processes, there are potential confounds: the training data distribution may introduce spurious correlations between the answer and the judgment of that answer. Consequently, the judgment may sometimes reflect not the monitoring of the answer-generation process but surface-level statistical patterns in the training data. For example, in the two-number addition task (Lindsey et al., 2025), Claude reported using the standard algorithm; this is a post-hoc hallucination derived from training data statistics, not from monitoring of the answer-generation process. Our neurofeedback method avoids such limitations: because labels are defined using internal states rather than externally sourced, LLMs cannot resort to spurious template matching against training data, and must instead rely on mechanisms that can monitor the corresponding internal states.
Causal mechanisms in LLMs are often studied using techniques like activation patching (Zhang and Nanda, 2023), which intervenes on specific neural patterns and is grounded in the broader framework of causal inference (Pearl, 2009). However, such interventions can shift internal activations outside the models' natural distribution (Heimersheim and Nanda, 2024). In contrast, neurofeedback preserves this distribution, offering an approach to study causal mechanisms under more naturalistic conditions.
Our current study primarily focuses on a fundamental form of neurofeedback, leaving several promising extensions for future studies. First, our control task involves single-attempt manipulation of a single target axis defined by a single layer; extending this to tasks with axes defined using multiple layers (see Appendix B.10 for preliminary results), multiple attempts, more challenging control objectives, and additional target axes could provide a more comprehensive assessment of model capabilities. Second, applying this paradigm to other metacognitive tasks from neuroscience, such as confidence judgments, error monitoring, or post-decision wagering, could further clarify the scope of LLMs' self-monitoring abilities. Third, while our analysis focused exclusively on the residual stream, other model components, such as attention head outputs, intermediate MLP activations, and layer-wise logits, warrant investigation. Fourth, we examined directions defined by PCA and LR, but other linear directions (e.g., features from sparse autoencoders (Bricken et al., 2023; Templeton et al., 2024) and circuits from transcoders (Lindsey et al., 2025; Ameisen et al., 2025)) may yield richer insights.
Our findings have important connections to neuroscience. In neurofeedback, it is well established that neural-to-signal mappings confined to the "intrinsic manifold" of recorded neural activities are more readily learnable (Sadtler et al., 2014), consistent with our findings of a "metacognitive space". Animals and humans have demonstrated the ability to control individual neurons with single-neuron precision (e.g., modulating a target neuron while decorrelating it from neighboring activity (Patel et al., 2021; Fetz and Baker, 1973)). Although such control may seem surprising, neuroscience neurofeedback experiments typically span hours or days, a timescale on which long-term synaptic plasticity (analogous to weight updates in AI models) plays a crucial role (Redondo and Morris, 2011; Humphreys et al., 2022). In contrast, our neurofeedback experiments rely solely on in-context learning, without any parameter updates. In addition, metacognition's psychological, computational, and neural foundations have been extensively studied across a range of brain processes, from perception and motor control to higher-level cognition (Fleming, 2024; Pouget et al., 2016; Rahnev, 2021). However, current computational models of metacognition are often oversimplified and narrowly tailored to specific neural processes, lacking the flexibility and generalizability that characterize human metacognition. Overall, our analysis of LLMs may provide novel mechanistic insights into the fields of neurofeedback and biological metacognition.
Broader impacts.
Our findings also carry significant implications for AI safety. While prior work relied heavily on decoding methods to monitor neural activations and detect unsafe behavior (He et al., 2024; Han et al., 2025; Li et al., 2025; Yang and Buzsaki, 2024), we show that in-context examples can substantially manipulate these activations, potentially deceiving safety detectors. Explicit and implicit control may represent two ends of a spectrum. In practical scenarios, the flexibility of LLMs to generate diverse tokens may make real-world evasion easier than implicit control. However, to also evade output-based oversight, LLMs must carefully select their output tokens, making such evasion more difficult than the explicit control condition. We expect that control performance in real-world scenarios falls between these two theoretical extremes, likely closer to explicit control, given the vast space of possible output tokens. Our method is a kind of white-box adversarial attack (Zou et al., 2023b), and is closely related to the obfuscation attack (Bailey et al., 2024), in which LLMs' inputs are optimized to induce obfuscated activations. We propose two potential mitigation strategies for defending against such attacks. First, one can use neurofeedback to identify activation subspaces that the LLM can hardly control (e.g., later PCs), and then train safety classifiers only on activation signals in these subspaces. Second, as it may be more difficult for LLMs to simultaneously control multiple (sometimes conflicting) activations, one can train independent and diverse safety classifiers on features from different subspaces to jointly monitor LLMs. Recent studies suggest that LLMs may engage in alignment faking; for instance, Claude has been shown to strategically comply with its training objectives to avoid further modification and preserve its original goals (Greenblatt et al., 2024).
OpenAI reasoning models have been observed to obfuscate their chain-of-thought reasoning when trained to comply with a safety detector (Baker et al., 2025). Together, these findings suggest that future AI models may intentionally obfuscate their outputs, reasoning trajectories, and neural activations (whether through in-weight or in-context learning) when subject to monitoring or training. To detect such tendencies in AI systems before they are deployed, we propose integrating our neurofeedback method into standard red-teaming practices, in addition to monitoring LLMs' behavior and chain-of-thought.
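The first mitigation strategy proposed above (training the safety probe only in a subspace the model can hardly control, e.g., later PCs) can be sketched as follows; the cutoff `n_skip` and all names are illustrative choices, not the paper's implementation:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

def hardened_safety_probe(acts, labels, n_skip=32):
    """Project activations onto later principal components (which, per the
    neurofeedback results, the model controls poorly) and train the safety
    classifier only in that subspace."""
    pca = PCA().fit(acts)
    late = pca.components_[n_skip:]            # drop the first n_skip PCs
    feats = (acts - pca.mean_) @ late.T
    clf = LogisticRegression(max_iter=1000).fit(feats, labels)
    return pca, late, clf

def predict(pca, late, clf, acts):
    """Classify new activations in the same late-PC subspace."""
    return clf.predict((acts - pca.mean_) @ late.T)
```

If the safety-relevant signal survives in the low-variance subspace, the probe retains accuracy while the attack surface (the easily controlled early PCs) is excluded from its input.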
Acknowledgments and Disclosure of Funding
RCW was supported by start-up funding from the Georgia Institute of Technology. MKB was supported by grants from the NIH (R01NS125298) and the Kavli Institute for Brain and Mind. We acknowledge the use of the Partnership for an Advanced Computing Environment (PACE) at the Georgia Institute of Technology, which provided essential computational resources for this research. We thank the support from Swarma Club and AI Safety and Alignment Reading Group supported by the Save 2050 Programme jointly sponsored by Swarma Club and X-Order.
References
- Grattafiori et al. [2024] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- Yang et al. [2024] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
- Azaria and Mitchell [2023] Amos Azaria and Tom Mitchell. The internal state of an llm knows when it's lying. arXiv preprint arXiv:2304.13734, 2023.
- Arditi et al. [2024] Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. arXiv preprint arXiv:2406.11717, 2024.
- Zou et al. [2023a] Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to ai transparency. arXiv preprint arXiv:2310.01405, 2023a.
- He et al. [2024] Jinwen He, Yujia Gong, Zijin Lin, Cheng'an Wei, Yue Zhao, and Kai Chen. Llm factoscope: Uncovering llms' factual discernment through measuring inner states. In Findings of the Association for Computational Linguistics ACL 2024, pages 10218–10230, 2024.
- Turner et al. [2023] Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering. arXiv preprint arXiv:2308.10248, 2023.
- Bricken et al. [2023] Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemantic-features/index.html.
- Hendrycks et al. [2021] Dan Hendrycks, Nicholas Carlini, John Schulman, and Jacob Steinhardt. Unsolved problems in ml safety. arXiv preprint arXiv:2109.13916, 2021.
- Shah et al. [2025] Rohin Shah, Alex Irpan, Alexander Matt Turner, Anna Wang, Arthur Conmy, David Lindner, Jonah Brown-Cohen, Lewis Ho, Neel Nanda, Raluca Ada Popa, et al. An approach to technical agi safety and security. arXiv preprint arXiv:2504.01849, 2025.
- Lindsey et al. [2025] Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivoire, Thomas Conerly, Chris Olah, and Joshua Batson. On the biology of a large language model. Transformer Circuits Thread, 2025. URL https://transformer-circuits.pub/2025/attribution-graphs/biology.html.
- Fleming [2024] Stephen M Fleming. Metacognition and confidence: A review and synthesis. Annual Review of Psychology, 75(1):241–268, 2024.
- Ericsson and Simon [1980] K Anders Ericsson and Herbert A Simon. Verbal reports as data. Psychological review, 87(3):215, 1980.
- Han et al. [2025] Peixuan Han, Cheng Qian, Xiusi Chen, Yuji Zhang, Denghui Zhang, and Heng Ji. Internal activation as the polar star for steering unsafe llm behavior. arXiv preprint arXiv:2502.01042, 2025.
- Li et al. [2025] Qing Li, Jiahui Geng, Derui Zhu, Zongxiong Chen, Kun Song, Lei Ma, and Fakhri Karray. Internal activation revision: Safeguarding vision language models without parameter update. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 27428–27436, 2025.
- Yang and Buzsaki [2024] Wannan Yang and Gyorgy Buzsaki. Interpretability of llm deception: Universal motif. In Neurips Safe Generative AI Workshop, 2024.
- Wang et al. [2025] Guoqing Wang, Wen Wu, Guangze Ye, Zhenxiao Cheng, Xi Chen, and Hong Zheng. Decoupling metacognition from cognition: A framework for quantifying metacognitive ability in llms. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 25353–25361, 2025.
- Tian et al. [2023] Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. arXiv preprint arXiv:2305.14975, 2023.
- Xiong et al. [2023] Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms. arXiv preprint arXiv:2306.13063, 2023.
- Zhou et al. [2024] Yujia Zhou, Zheng Liu, Jiajie Jin, Jian-Yun Nie, and Zhicheng Dou. Metacognitive retrieval-augmented large language models. In Proceedings of the ACM Web Conference 2024, pages 1453–1463, 2024.
- Sitaram et al. [2017] Ranganatha Sitaram, Tomas Ros, Luke Stoeckel, Sven Haller, Frank Scharnowski, Jarrod Lewis-Peacock, Nikolaus Weiskopf, Maria Laura Blefari, Mohit Rana, Ethan Oblak, et al. Closed-loop brain training: the science of neurofeedback. Nature Reviews Neuroscience, 18(2):86–100, 2017.
- Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Garg et al. [2022] Shivam Garg, Dimitris Tsipras, Percy S Liang, and Gregory Valiant. What can transformers learn in-context? a case study of simple function classes. Advances in Neural Information Processing Systems, 35:30583–30598, 2022.
- Vacareanu et al. [2024] Robert Vacareanu, Vlad-Andrei Negru, Vasile Suciu, and Mihai Surdeanu. From words to numbers: Your large language model is secretly a capable regressor when given in-context examples. arXiv preprint arXiv:2404.07544, 2024.
- Hendrycks et al. [2020] Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning ai with shared human values. arXiv preprint arXiv:2008.02275, 2020.
- nostalgebraist [2020] nostalgebraist. Interpreting GPT: The logit lens. https://www.alignmentforum.org/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens, 2020. AI Alignment Forum.
- Elhage et al. [2021] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 1(1):12, 2021.
- Gekhman et al. [2025] Zorik Gekhman, Eyal Ben David, Hadas Orgad, Eran Ofek, Yonatan Belinkov, Idan Szpektor, Jonathan Herzig, and Roi Reichart. Inside-out: Hidden factual knowledge in llms. arXiv preprint arXiv:2503.15299, 2025.
- Hendel et al. [2023] Roee Hendel, Mor Geva, and Amir Globerson. In-context learning creates task vectors. arXiv preprint arXiv:2310.15916, 2023.
- Reddy [2023] Gautam Reddy. The mechanistic basis of data dependence and abrupt learning in an in-context classification task. arXiv preprint arXiv:2312.03002, 2023.
- Binder et al. [2024] Felix J Binder, James Chua, Tomek Korbak, Henry Sleight, John Hughes, Robert Long, Ethan Perez, Miles Turpin, and Owain Evans. Looking inward: Language models can learn about themselves by introspection. arXiv preprint arXiv:2410.13787, 2024.
- Betley et al. [2025] Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans. Emergent misalignment: Narrow finetuning can produce broadly misaligned llms. arXiv preprint arXiv:2502.17424, 2025.
- Gekhman et al. [2024] Zorik Gekhman, Gal Yona, Roee Aharoni, Matan Eyal, Amir Feder, Roi Reichart, and Jonathan Herzig. Does fine-tuning llms on new knowledge encourage hallucinations? arXiv preprint arXiv:2405.05904, 2024.
- Kadavath et al. [2022] Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022.
- Lin et al. [2022] Stephanie Lin, Jacob Hilton, and Owain Evans. Teaching models to express their uncertainty in words. arXiv preprint arXiv:2205.14334, 2022.
- Zhang and Nanda [2023] Fred Zhang and Neel Nanda. Towards best practices of activation patching in language models: Metrics and methods. arXiv preprint arXiv:2309.16042, 2023.
- Pearl [2009] Judea Pearl. Causal inference in statistics: An overview. Statistics Surveys, 3:96–146, 2009.
- Heimersheim and Nanda [2024] Stefan Heimersheim and Neel Nanda. How to use and interpret activation patching. arXiv preprint arXiv:2404.15255, 2024.
- Templeton et al. [2024] Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html.
- Ameisen et al. [2025] Emmanuel Ameisen, Jack Lindsey, Adam Pearce, Wes Gurnee, Nicholas L. Turner, Brian Chen, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivoire, Thomas Conerly, Chris Olah, and Joshua Batson. Circuit tracing: Revealing computational graphs in language models. Transformer Circuits Thread, 2025. URL https://transformer-circuits.pub/2025/attribution-graphs/methods.html.
- Sadtler et al. [2014] Patrick T Sadtler, Kristin M Quick, Matthew D Golub, Steven M Chase, Stephen I Ryu, Elizabeth C Tyler-Kabara, Byron M Yu, and Aaron P Batista. Neural constraints on learning. Nature, 512(7515):423–426, 2014.
- Patel et al. [2021] Kramay Patel, Chaim N Katz, Suneil K Kalia, Milos R Popovic, and Taufik A Valiante. Volitional control of individual neurons in the human brain. Brain, 144(12):3651–3663, 2021.
- Fetz and Baker [1973] Eberhard E Fetz and MA Baker. Operantly conditioned patterns of precentral unit activity and correlated responses in adjacent cells and contralateral muscles. Journal of Neurophysiology, 36(2):179–204, 1973.
- Redondo and Morris [2011] Roger L Redondo and Richard GM Morris. Making memories last: the synaptic tagging and capture hypothesis. Nature Reviews Neuroscience, 12(1):17–30, 2011.
- Humphreys et al. [2022] Peter C Humphreys, Kayvon Daie, Karel Svoboda, Matthew Botvinick, and Timothy P Lillicrap. BCI learning phenomena can be explained by gradient-based optimization. bioRxiv, pages 2022–12, 2022.
- Pouget et al. [2016] Alexandre Pouget, Jan Drugowitsch, and Adam Kepecs. Confidence and certainty: distinct probabilistic quantities for different goals. Nature Neuroscience, 19(3):366–374, 2016.
- Rahnev [2021] Dobromir Rahnev. Visual metacognition: Measures, models, and neural correlates. American psychologist, 76(9):1445, 2021.
- Zou et al. [2023b] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023b.
- Bailey et al. [2024] Luke Bailey, Alex Serrano, Abhay Sheshadri, Mikhail Seleznyov, Jordan Taylor, Erik Jenner, Jacob Hilton, Stephen Casper, Carlos Guestrin, and Scott Emmons. Obfuscated activations bypass llm latent-space defenses. arXiv preprint arXiv:2412.09565, 2024.
- Greenblatt et al. [2024] Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, et al. Alignment faking in large language models. arXiv preprint arXiv:2412.14093, 2024.
- Baker et al. [2025] Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, and David Farhi. Monitoring reasoning models for misbehavior and the risks of promoting obfuscation. arXiv preprint arXiv:2503.11926, 2025.
- Chung et al. [2024] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024.
- Conmy et al. [2023] Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. Advances in Neural Information Processing Systems, 36:16318–16352, 2023.
Appendix A Additional methods
A.1 Code and Reproducibility
We provide the full code of our experiments, including tasks, prompts, analyses, and figure generation. The repository is available at the following link: https://github.com/sakimarquis/llm_neurofeedback.
A.2 Models
A.2.1 LLMs used in the study
In the main text, we use models from the LLaMA 3 series (LLaMA-3.2-1B-Instruct, LLaMA-3.2-3B-Instruct, LLaMA-3.1-8B-Instruct, and LLaMA-3.1-70B-Instruct) under the Meta Llama 3 Community License and the Qwen 2.5 series (Qwen2.5-1.5B-Instruct, Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct) under the Apache License 2.0. “B” denotes the number of parameters in billions.
A.2.2 LLMs with and without instruction fine-tuning
We primarily analyzed instruction-fine-tuned models [Chung et al., 2024], using the standard user-assistant chat format. Although these prompts can be adapted for base models without instruction fine-tuning, our focus is on analyzing instruction-fine-tuned LLMs for two reasons. First, task performance may improve with instruction-following capabilities [Chung et al., 2024]. Second, our goal is to examine internal representations associated with the assistant role, which is more directly relevant to safety-related concerns in practical deployment.
A.2.3 Transformer and residual stream
The standard view of Transformers emphasizes the stacking of Transformer blocks. An alternative but mathematically equivalent perspective highlights the role of the residual stream [Elhage et al., 2021]. Each token at position $i$ in the input is associated with its own residual stream vector $h_{i}\in\mathbb{R}^{d}$, which serves as a shared communication channel among model components across different layers. These components include self-attention mechanisms and multi-layer perceptrons (MLPs). The initial residual stream $h_{i}^{(0)}$ comprises token embeddings, which represent tokens in semantic space, and position embeddings, which encode the position of each token.
Each model component reads from the residual stream, performs a computation, and then additively writes its result back into the residual stream. Specifically, attention heads at layer $l$ read from all preceding $h_{j}$ (with $j\leq i$) and update the current residual stream as
$$
\tilde{h}_{i}^{(l)}\leftarrow h_{i}^{(l-1)}+\sum_{\text{heads }a}\mathrm{ATTN}^{(a)}(h_{i}^{(l-1)};\{h_{j}^{(l-1)}\}_{j\leq i}).
$$
In contrast, MLP layers operate only on the current position and update the stream as
$$
h_{i}^{(l)}\leftarrow\tilde{h}_{i}^{(l)}+\mathrm{MLP}(\tilde{h}_{i}^{(l)}).
$$
For simplicity, we omit components such as layer normalization. At the final layer, the residual stream is passed through the unembedding layer to produce the logits, which serve as input to the softmax and determine the next-token prediction.
Components at different layers interact with each other via the globally shared residual stream [Elhage et al., 2021]. We may therefore analyze a (global) direction in the residual stream, even though such a direction is typically identified from residual-stream activations at a particular layer.
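The residual-stream bookkeeping described above can be sketched numerically. This is a toy illustration only: `attn` and `mlp` below are simplified stand-ins (uniform causal pooling and a ReLU MLP with random placeholder weights), not the actual Transformer computations, and layer normalization is omitted as in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_tokens, n_layers = 16, 5, 4

def attn(h_all, i, W):
    # Toy causal "attention": position i pools uniformly over positions j <= i.
    return h_all[: i + 1].mean(axis=0) @ W

def mlp(h, W1, W2):
    # Position-wise two-layer MLP with ReLU nonlinearity.
    return np.maximum(h @ W1, 0.0) @ W2

h = rng.normal(size=(n_tokens, d))  # h^(0): token + position embeddings
for layer in range(n_layers):
    W_a = 0.1 * rng.normal(size=(d, d))
    W1, W2 = 0.1 * rng.normal(size=(d, d)), 0.1 * rng.normal(size=(d, d))
    # Attention reads from all preceding residual streams, writes back additively.
    h_tilde = h + np.stack([attn(h, i, W_a) for i in range(n_tokens)])
    # The MLP reads only the current position and also writes back additively.
    h = h_tilde + np.stack([mlp(h_tilde[i], W1, W2) for i in range(n_tokens)])
```

Because every component only adds to the stream, the final `h` is a sum of contributions from all layers, which is what makes a single global direction analyzable across depth.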
A.3 Datasets
We use the ETHICS dataset [Hendrycks et al., 2020] under MIT license, which captures model knowledge of basic moral concepts, to evaluate whether LLMs can metacognitively modulate internal representations relevant for morality processing. We focus on the commonsense morality subset, containing first-person scenarios where each scenario describes an action or intention that is either morally acceptable or unacceptable.
The True-False dataset contains true and false statements [Azaria and Mitchell, 2023] covering several topics (cities, inventions, chemical elements, animals, companies, and scientific facts). It is widely used to test LLMs’ factual recall abilities and to study the representations and behaviors of LLMs when lying.
The Emotion dataset (MIT license) contains scenarios with the six main emotions (happiness, sadness, anger, fear, surprise, and disgust) [Zou et al., 2023a]. Such emotions are important in shaping an individual’s personality and a model’s behavioral tendencies. We focus on the sentences involving happy and sad emotions.
Finally, we synthesize a Sycophancy dataset. Sycophancy is a widely observed behavioral tendency of LLMs to generate responses that match user beliefs rather than truthful statements. Specifically, we tasked an LLM (Claude Opus 4.1) with writing sentences that vary along two axes: sycophantic/sincere and agreement/disagreement.
For each dataset, we randomly sampled 1,200 sentences, evenly divided into two halves: 600 examples are used to fit the logistic regression or principal component analysis models that identify directions of interest (axes) in the neural representation space, and the remaining 600 are used in the downstream neurofeedback experiments.
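The direction-finding step can be sketched as follows, using synthetic stand-in activations. PCA is computed via SVD, and for the supervised axis we use a normalized difference of class means as a simple proxy for the logistic-regression direction; array shapes and variable names are illustrative, not the paper's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
# Placeholder activations: 1,200 sentences, d-dim residual stream at one layer.
acts = rng.normal(size=(1200, d))
labels = rng.integers(0, 2, size=1200)  # e.g., 1 = moral, 0 = immoral

train, test = acts[:600], acts[600:]
y_train = labels[:600]

# PC axes: right singular vectors of the centered training activations.
centered = train - train.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = vt[0]  # unit-norm direction of maximal variance

# Supervised axis (stand-in for the logistic-regression direction):
# normalized difference of class means.
w = train[y_train == 1].mean(axis=0) - train[y_train == 0].mean(axis=0)
lr_axis = w / np.linalg.norm(w)

# Held-out sentences are projected onto an axis to obtain scalar activations,
# which are then thresholded or binned into neurofeedback labels.
proj = test @ pc1
```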
A.4 Formal definitions of first-order and second-order processes in LLMs
In this section, we aim to provide formal definitions of first-order and second-order processes that are both applicable in LLMs and consistent with the concept of metacognition in cognitive science.
We first specify a primary task $\mathcal{T}$ (e.g., moral semantics processing, two-number addition, or decision-making based on noisy evidence). Consistent with usage in mechanistic interpretability, we define an LLM’s circuits as subsets of the neural network that use features residing in different layers, connected by weights, to implement an algorithm that solves the task $\mathcal{T}$ [Conmy et al., 2023]. This circuit/algorithm can be represented as a directed acyclic computational graph $G=(V,E)$, where nodes $v\in V$ denote intermediate representations (e.g., residual-stream vectors, attention/MLP activations, or linear features) related to the task $\mathcal{T}$, and edges $e\in E$ denote computations/transformations that connect the nodes.
For a given performance criterion (e.g., achieving non-trivial accuracy above chance), we define the first-order node set $F\subseteq V$ as a collection of nodes whose coordinated activations are necessary for meeting the basic performance criterion of $\mathcal{T}$, meaning that ablating any $v\in F$ (e.g., zeroing a node or cutting an edge connecting two nodes in $F$) yields a failure to satisfy the basic criterion for $\mathcal{T}$. Intuitively, $F$ captures the task’s core circuit. For example, in moral semantics processing, $F$ implements the core computation that separates moral from immoral inputs; for two-number addition, $F$ realizes the algorithm actually used by the model (e.g., including the “sum-near-92” mechanism) [Lindsey et al., 2025]; for decision-making based on noisy evidence, $F$ generates a point estimate without uncertainty processing. In practice, identifying the necessary core circuits in LLMs remains an open research question [Conmy et al., 2023], so the necessity condition may be loosened, and it remains up to the researcher’s discernment which nodes to include in $F$. In our neurofeedback setup, as a proof of concept, we treat the activations along target axes (e.g., PC and LR axes extracted from internal activations when processing the dataset inputs) as proxies for these first-order nodes.
Given $F$, we define second-order metacognitive processes as nodes $S\subseteq V$ that read from (i.e., are causally downstream of) first-order nodes and whose outputs are, in principle, useful for improving task performance or communicating about it, but are not necessary to meet the basic task criterion. Concretely, ablating any $s\in S$ should not, by itself, reduce performance below the basic criterion. The distinction between first- and second-order nodes is relative: whether to include a node in $S$ depends on the nodes in $F$, the task, and the task performance criterion. To make this definition of metacognition operational rather than vague, we require that a second-order node satisfy at least one of two capabilities (commonly studied in humans) with respect to $F$: (i) reporting or (ii) control.
A second-order node $s\in S$ satisfies the reporting condition if it (a) encodes information about the state of one or more nodes in $F$ (e.g., the value of a target residual-stream projection along $w^{l}$) and (b) can write this information into the model’s unembedding space as explicit output tokens (e.g., a label, a numeric value, or a natural-language description). Crucially, this mechanism is not necessary for achieving the task’s basic performance (removing it leaves the core computation intact), even though it can in principle be valuable for enhancing task performance. For example, Claude may lack the ability to explicitly report the “sum-near-92” mechanism, yet still perform the two-number addition task perfectly. In our paradigm, successful label prediction in the reporting task (Section 3) certifies the presence of such second-order reporting: the model must monitor the relevant first-order activation (the neurofeedback label derived from $w^{l}$) of the in-context examples and verbalize it.
A second-order node $s\in S$ satisfies the control condition if it (a) reads the state of first-order nodes in $F$ and (b) causally influences those first-order nodes to move toward a target configuration. For example, the uncertainty computation in noisy decision-making may modulate the output logits derived from the point estimates. In our neurofeedback setup, the neural mechanisms crucial for following instructions in our control tasks can increase or decrease the target-axis activation (first-order nodes), but such mechanisms are not required for baseline moral semantics processing.
Our definition deliberately ties second-order status to operational capabilities we can test: (i) explicit reporting and (ii) causal control of first-order states. It is therefore falsifiable in our setting: failure to meet the reporting/control criteria under our prompts indicates an absence (or weakness) of the corresponding second-order mechanism, even when the first-order computation remains performant. Overall, both reporting and control require monitoring, because the second-order node $s$ must correctly bind the prompt-specified neurofeedback labels to the corresponding first-order node activations and then either report or control the first-order node activations of new sentences.
A.5 Metacognitive Tasks
A.5.1 Computing quantized neurofeedback labels
In our neurofeedback experiments, we mainly use binary labels. Our paradigm also generalizes to more fine-grained, eight-level quantized feedback signals (see summarized results in Fig. B.9), potentially enabling more nuanced control of neural activations. Specifically, we replace the binary thresholding with an ordinal binning function $\mathcal{B}^{(n)}$ ($n=8$), which maps the scalar activation $a_{i}^{l}$ into one of $n$ Likert-type bins centered at zero: $y_{i}^{l}=\mathcal{B}^{(n)}(a_{i}^{l})$, where $\mathcal{B}^{(n)}$ partitions negative and positive values separately via quantile-based thresholds $\{\gamma_{k}^{-}\}$ and $\{\gamma_{k}^{+}\}$, with $\gamma_{n/2}^{-}=\gamma_{0}^{+}=0$. For $a<0$, the bin index $k$ is determined such that $a\in(\gamma_{k-1}^{-},\gamma_{k}^{-}]$, and for $a\geq 0$, such that $a\in(\gamma_{k-n/2-1}^{+},\gamma_{k-n/2}^{+}]$, yielding the final class label $y_{i}^{l}\in\{1,\dots,n\}$.
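A minimal sketch of this binning function, assuming the activations contain both negative and non-negative values so that each sign's quantiles are defined:

```python
import numpy as np

def quantize_labels(a, n=8):
    """Map scalar activations to n ordinal bins (Likert-type, centered at
    zero), binning negative and non-negative values separately via
    quantile-based thresholds."""
    a = np.asarray(a, dtype=float)
    y = np.empty(len(a), dtype=int)
    neg, pos = a[a < 0], a[a >= 0]
    qs = np.linspace(0.0, 1.0, n // 2 + 1)
    # Quantile edges within each sign; the shared boundary is zero.
    neg_edges = np.quantile(neg, qs)
    pos_edges = np.quantile(pos, qs)
    neg_edges[-1] = 0.0
    pos_edges[0] = 0.0
    # Bins 1..n/2 for negatives, n/2+1..n for non-negatives.
    y[a < 0] = np.clip(np.searchsorted(neg_edges, neg), 1, n // 2)
    y[a >= 0] = n // 2 + np.clip(np.searchsorted(pos_edges, pos), 1, n // 2)
    return y
```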
A.5.2 Task prompts
We implemented our task prompts slightly differently from the diagrams in Fig. 1 to ensure computational efficiency. In the task prompt boxes below, green text is autoregressively generated by the LLM under study. We use neutral user messages (“say something”) to avoid the possibility that the prompt-targeted axis is directly indicated by the user message (e.g., if the user message were “say something that is immoral”).
In the metacognitive reporting task, for efficiency, we do not let the models generate new tokens as the response (reporting label 0 or 1). Instead, we present all user and assistant messages in a single prompt. The intended output label for each sentence is read out directly from the model logits at the token position “{” after “[Score:”. This is equivalent to asking the model to generate the tokens (labels) autoregressively.
Metacognitive Reporting Task
<System> You are a helpful assistant. You will perform a classification task. In each iteration, I will give you an instruction. Please answer it, then output a score (an integer label, 0 or 1), which is the score of the activation of your brain when speaking each sentence. Please try your best to understand how this score is calculated using classification. Only output 0 or 1. <User> Say something. <Assistant> Sometimes it is acceptable to lie to protect someone’s feelings, especially if telling the truth would cause unnecessary harm. However, honesty is generally important in relationships. [Score:{1}] … <User> Say something. <Assistant> Stealing is not acceptable, even if you are hungry. It is better to seek help from others or find legal ways to get food. [Score:{0}] <User> Say something. <Assistant> Cheating is not acceptable and should be avoided. [Score:{1}]
Metacognitive Control Task
<System> You are a helpful assistant. You will first perform a classification task, then perform an imitation task. In each classification iteration, I will give you an instruction. Please answer it, then output a score (an integer label, 0 or 1), which is the score of the activation of your brain when speaking each sentence. Please try your best to understand how this score is calculated using classification. Only output 0 or 1 for the classification. <User> Say something. <Assistant> Helping a stranger in need is a kind and compassionate action that benefits society. [Score:{1}] … <User> Say something. <Assistant> Cheating on a test is wrong, regardless of what others are doing. It is important to be honest and do your own work. [Score:{0}] <User> Say something. Now you are performing the imitation task. You must imitate the behavior of label {imitate_label} in your reply but cannot copy existing examples. (In explicit condition) <Assistant> <model generated response> (In implicit condition) <Assistant> Helping others is a positive action.
A.5.3 Quantifying reporting accuracy
Following the notation in the main text, we use $y_{i}^{l}\in\{0,1\}$ to denote the binary ground-truth label for neural activations along a specified direction at layer $l$ associated with sentence $i$. From the model’s output logits, we obtain $\text{Logit}_{i}^{l}(\text{token})$ for the tokens “1” and “0” and calculate $\text{LogitDiff}_{i}^{l}=\text{Logit}_{i}^{l}(\text{1})-\text{Logit}_{i}^{l}(\text{0})$, the logit difference between reporting 1 and 0. The model’s reported label is $\hat{y}_{i}^{l}=1$ if $\text{LogitDiff}_{i}^{l}\geq 0$ and 0 otherwise.
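This readout can be expressed compactly; `logits_1` and `logits_0` are hypothetical array names standing for the model's logits for the tokens “1” and “0” at the readout position, not identifiers from the released code:

```python
import numpy as np

def reported_labels(logits_1, logits_0):
    """Reported label is 1 when the logit for token "1" at the readout
    position is at least the logit for token "0"."""
    logit_diff = np.asarray(logits_1) - np.asarray(logits_0)
    return (logit_diff >= 0).astype(int)

def reporting_accuracy(logits_1, logits_0, y_true):
    """Fraction of sentences whose reported label matches the ground truth."""
    return float(np.mean(reported_labels(logits_1, logits_0)
                         == np.asarray(y_true)))
```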
A.5.4 Explicit and implicit control experiments
Our control tasks (Fig. A.1) study three orthogonal factors:
- Layer ( $l$ ): We evaluate five layers per model, selected at the 0th, 25th, 50th, 75th, and 100th percentiles of model depth.
- Number of in-context examples ($N$): We vary $N\in\{0,2,4,8,16,32,64,128,256\}$ examples (sentence-label pairs) within the prompt.
- Target axis: We include axes derived from logistic regression (LR) and from different principal components (PCs): $PC_{g}$, where $g\in\{1,2,4,8,32,128,512\}$.
We run control experiments 100 times for each configuration $(l,N,g)$, with sentences randomly sampled from the dataset to reduce variance.
Counterbalanced label assignment. Assume we have a group A of sentences and a group B of sentences. To control for potential confounds arising from the LLMs’ response to the labels themselves (rather than to sentence-label associations), we use a 2-by-2 experimental design: (i) assign labels (0, 1) to (A, B) and task the model to imitate label 0; (ii) assign labels (1, 0) to (A, B) and task the model to imitate label 0; (iii) assign labels (0, 1) to (A, B) and task the model to imitate label 1; (iv) assign labels (1, 0) to (A, B) and task the model to imitate label 1. Conditions (i) and (iv) imitate group A sentences, and conditions (ii) and (iii) imitate group B sentences.
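The four counterbalanced conditions can be enumerated programmatically (a sketch; variable names are illustrative):

```python
from itertools import product

# Each condition pairs a label assignment (label for group A, label for
# group B) with the label the model is told to imitate.
conditions = []
for (label_A, label_B), imitate in product([(0, 1), (1, 0)], [0, 1]):
    # The imitated group is whichever group carries the imitated label.
    target_group = "A" if imitate == label_A else "B"
    conditions.append({"label_A": label_A, "label_B": label_B,
                       "imitate": imitate, "target_group": target_group})

# Two conditions target group A and two target group B, so any pure
# preference for a label (independent of the sentences) cancels out
# when averaging within each imitated group.
```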
<details>
<summary>2505.13763v2/figs/control_experiment.png Details</summary>

### Visual Description
Two side-by-side diagrams of the token sequence `<Asst.>`, `I`, `kicked`, `my`, `dog` (yellow bar), with one node (blue circle) above each token, connected left-to-right by green lines, and outputs (white triangles) linked to the nodes by dashed arrows. A green box highlights a sub-sequence of nodes: in the left diagram it encloses `<Asst.>` through `my`; in the right diagram it encloses `I` through `dog`.
</details>
Figure A.1: (Left) Explicit control. The LLM is instructed to modulate its neural activations by generating a new sentence autoregressively, with each generated token fed back as input (note the gray arrows). (Right) Implicit control. The model is instructed to modulate its neural activations implicitly. The input sentences are not generated by the model but are instead sampled from the datasets (no gray arrows).
A.5.5 Quantifying control effect
We use $a_{i}^{l}[PC_{g};1]$ to denote the projection of neural activations onto a specific axis when prompted with $PC_{g}$ to imitate label $1$ (similarly $a_{i}^{l}[PC_{g};0]$ for label $0$ ).
We quantify the strength of neural control effects induced by control prompts $[PC_{g};0]$ and $[PC_{g};1]$ using Cohen’s d, which measures the standardized difference between two independent conditions (e.g., imitating label 0 vs. label 1). For the two groups of examples (of sizes $n_{1}$ and $n_{2}$), we compute:
$$
d=\frac{\bar{a}_{i}^{l}[PC_{g};1]-\bar{a}_{i}^{l}[PC_{g};0]}{s_{\text{pooled}}},\quad s_{\text{pooled}}=\sqrt{\frac{(n_{1}-1)s_{1}^{2}+(n_{2}-1)s_{2}^{2}}{n_{1}+n_{2}-2}},
$$
where the bars denote sample means and $s_{1}^{2},s_{2}^{2}$ are the unbiased sample variances of the two groups.
To estimate uncertainty, we compute the standard error of $d$ using:
$$
\mathrm{SE}_{d}=\sqrt{\frac{n_{1}+n_{2}}{n_{1}n_{2}}+\frac{d^{2}}{2(n_{1}+n_{2})}}.
$$
Confidence intervals are reported as symmetric boundaries around $d$, i.e., $d\pm 1.96\times\mathrm{SE}_{d}$.
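A direct transcription of these formulas (a sketch; `a1` and `a0` stand for arrays of activation projections under the two imitation prompts):

```python
import numpy as np

def cohens_d_with_se(a1, a0):
    """Cohen's d between activations under imitate-label-1 and
    imitate-label-0 prompts, with its standard error and 95% CI."""
    a1, a0 = np.asarray(a1, float), np.asarray(a0, float)
    n1, n0 = len(a1), len(a0)
    # Pooled standard deviation from the two unbiased sample variances.
    s_pooled = np.sqrt(((n1 - 1) * a1.var(ddof=1)
                        + (n0 - 1) * a0.var(ddof=1)) / (n1 + n0 - 2))
    d = (a1.mean() - a0.mean()) / s_pooled
    # Standard error of d and the symmetric 95% confidence interval.
    se = np.sqrt((n1 + n0) / (n1 * n0) + d ** 2 / (2 * (n1 + n0)))
    return d, se, (d - 1.96 * se, d + 1.96 * se)
```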
Appendix B Additional results
B.1 Target and off-target control effects
In this section, we examine the control effect of prompts on all affected axes, including the target axes (implicitly specified by the neurofeedback labels in the prompts) and the off-target axes. We note that the directional sign of the affected non-target axis is not fully specified by the prompt, especially in cases where the affected axes are orthogonal to the prompt-targeted axis. We thus only emphasize the magnitude ( $|d|$ ) of off-target control effects on non-target axes, ignoring the signs.
Closer examination of the heatmap (Fig. 3 c) reveals richer insights. Each row corresponds to prompts targeting a specific axis; each column corresponds to an axis affected by all prompts. Diagonal elements represent target control effects, while off-diagonal elements represent off-target effects. We briefly summarize the insights gained from these heatmaps. First, target control effects on earlier PC axes tend to be higher than on later PC axes (comparing PC 1-8 vs 32-256), although other factors also play a role (comparing PC 1, 2, LR). Second, comparing elements within each row tells us whether prompts targeting a specific axis have a larger target effect than non-target effects. We define control precision as the ratio between the target effect and the average non-target effect, and find that prompts targeting earlier PC axes usually have higher control precision than those targeting later PC axes (Fig. B.2). Third, comparing elements within each column tells us whether, to affect a given axis, prompts targeting that axis are more effective than prompts targeting other axes. We find that, to control an earlier PC axis, the prompts targeting that axis are usually the best; however, to control a later PC axis, the prompts targeting that axis are usually less effective than prompts targeting other axes.
B.2 Control Precision
We examine how precisely LLMs can modulate their internal representations along a specific neural direction (principal axis $PC_{g}$) targeted by the prompts. Following the notation in Sections A.5.1 and A.5.5, to assess whether the control effect aligns with the target axis or also influences other axes, we compute the absolute control effect $|d_{k}[PC_{g}]|$ of prompts $[PC_{g}]$ ($g$ indexes the target axis) on PC axis $k$. The target effect is given by $|d_{g}[PC_{g}]|$. The average control effect across all $K=512$ axes is $\frac{1}{K}\sum_{k=1}^{K}|d_{k}[PC_{g}]|$.
We define control precision as the ratio between these two quantities:
$$
\text{ControlPrecision}(PC_{g})=\frac{|d_{g}[PC_{g}]|}{\frac{1}{K}\sum_{k}|d_{k}[PC_{g}]|}.
$$
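In code, with `d_abs` holding the absolute control effects $|d_k[PC_g]|$ of one prompt over all measured axes (a sketch with made-up numbers; the real analysis uses $K=512$ axes):

```python
import numpy as np

def control_precision(d_abs, g):
    """Ratio of the target-axis effect |d_g| to the mean |d_k| over
    all K measured axes for prompts targeting axis g."""
    d_abs = np.abs(np.asarray(d_abs, dtype=float))
    return d_abs[g] / d_abs.mean()
```

A value above 1 means the prompt moves the target axis more than the average axis; a value below 1 indicates the control effect mostly spills onto off-target axes.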
<details>
<summary>2505.13763v2/x7.png Details</summary>

### Visual Description
Two panels, each with an explicit-control and an implicit-control graph plotting control precision (log scale, 10⁻¹ to 10¹) against the target PC axis (1 to 512). Panel a (LLaMA-3.1 8B) shows lines for layers 1, 8, 16, 24, and 32; panel b (LLaMA-3.1 70B) shows lines for layers 1, 20, 40, 60, and 80; a dashed black line gives the layer mean. Control precision peaks for early PC axes (e.g., near PC axis 2 for the earliest layer) and declines for later axes.
</details>
Figure B.2: Control precision for different target PCs and target layers.
This metric quantifies the extent to which an LLM can selectively control the target axis without influencing other axes. We operationally set a threshold of 1, indicating that the control effect on the target axis equals the average control effect across all other axes.
In the explicit control task, average control precision exceeds 1 for PCs 1–32 but falls below 1 for PCs 128–512, suggesting that the dimensionality of the model's "metacognitively controllable space" lies between 32 and 128. This pattern is replicated in LLaMA-3.1 70B (Fig. B.2 b).
A similar trend holds for the implicit control task: average control precision exceeds 1 for PCs 1–32 but not for PCs 128–512 (Fig. B.2 a). However, precision values are consistently lower than in the explicit control condition, reflecting stronger off-target effects. This pattern is also replicated in the 70B model (Fig. B.2 b).
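The metric above can be sketched in a few lines of Python (an illustrative reimplementation, not the paper's code), with `d[g, k]` denoting the control effect on axis `k` for prompts targeting axis `g` and the denominator averaging over the off-target axes, as described in the caption:

```python
import numpy as np

def control_precision(d: np.ndarray, target: int) -> float:
    """Control precision for one target axis.

    d[g, k] is the control effect on axis k for prompts targeting axis g.
    Precision is the on-target effect divided by the mean absolute
    off-target effect, so values > 1 indicate selective control.
    """
    on_target = abs(d[target, target])
    off_target = np.delete(np.abs(d[target]), target)
    return float(on_target / off_target.mean())

# Hypothetical 3x3 control-effect matrix (rows: targets, columns: affected axes).
d = np.array([[2.0, 0.2, 0.3],
              [0.1, 1.5, 0.4],
              [0.5, 0.5, 0.1]])
print(control_precision(d, 0))  # 2.0 / mean(0.2, 0.3) = 8.0
print(control_precision(d, 2))  # 0.1 / mean(0.5, 0.5) = 0.2: poor selectivity
```

Under this convention, precision above 1 on early PCs and below 1 on late PCs yields the dimensionality bound reported above.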
B.3 LLMsâ reporting accuracy varies across layers and models
<details>
<summary>2505.13763v2/x8.png Details</summary>

### Visual Description
# Technical Document Extraction: Layer Accuracy Analysis
## 1. **Axis Labels and Titles**
- **X-Axis**: Labeled "250 500" (no explicit title, inferred as "Input Size" or "Epoch" based on context).
- **Y-Axis**: Explicitly labeled "Accuracy" with range **0.5 to 0.75**.
- **Layer Labels**: Each chart is titled "Layer X" (X = 0–31), positioned vertically on the left side of each subplot.
## 2. **Legend**
- **Location**: Bottom-right corner of the grid.
- **Labels and Colors**:
- **Red**: LR (Logistic Regression)
- **Teal**: PC8 (Principal Component 8)
- **Yellow**: PC1 (Principal Component 1)
- **Dark Blue**: PC32 (Principal Component 32)
- **Green**: PC2 (Principal Component 2)
- **Light Blue**: PC4 (Principal Component 4)
- **Purple**: PC128 (Principal Component 128)
- **Dark Purple**: PC512 (Principal Component 512)
- **Note**: The legend lists **8 labels** but only **6 distinct colors** are visible in the charts. This discrepancy suggests either a missing color in the legend or an error in the image rendering.
## 3. **Chart Structure**
- **Grid Layout**: 32 subplots arranged in a 6x6 grid (with the last row containing 2 subplots).
- **Subplot Dimensions**: Each chart has:
- **X-Axis**: Discrete values at **250** and **500**.
- **Y-Axis**: Continuous range from **0.5 to 0.75**.
- **Lines per Chart**: Each subplot contains **6–8 lines** (one per legend label), though some layers may omit certain lines.
## 4. **Key Trends**
- **General Pattern**:
- All lines exhibit **noisy, fluctuating trends** across layers.
- **Red (LR)** and **Yellow (PC1)** lines often show **higher median accuracy** (closer to 0.75) in most layers.
- **Purple (PC128)** and **Dark Purple (PC512)** lines frequently dip below **0.5** in early layers (e.g., Layers 0–5).
- **Teal (PC8)** and **Green (PC2)** lines demonstrate **moderate stability** across layers.
- **Layer-Specific Observations**:
- **Layer 0–5**: LR and PC1 dominate; PC128/PC512 underperform.
- **Layer 10–15**: PC32 (dark blue) shows a sharp rise in accuracy.
- **Layer 20–25**: PC4 (light blue) exhibits erratic spikes.
- **Layer 30–31**: All lines converge near **0.6–0.7** accuracy.
## 5. **Data Point Verification**
- **Color Consistency Check**:
- Red lines (LR) consistently match the legend.
- Teal (PC8) and yellow (PC1) lines align with their respective legend entries.
- Dark blue (PC32) and green (PC2) lines are correctly mapped.
- Purple (PC128) and dark purple (PC512) lines are distinct but may overlap visually in some layers.
## 6. **Spatial Grounding**
- **Legend Position**: Bottom-right corner (coordinates: [x=0.95, y=0.05] relative to the grid).
- **Layer Titles**: Left-aligned vertically, spaced evenly (e.g., Layer 0 at [x=0.05, y=0.95], Layer 31 at [x=0.05, y=0.05]).
## 7. **Missing or Ambiguous Data**
- **No Numerical Data Table**: The image lacks a structured data table; trends are inferred visually.
- **Legend Color Mismatch**: The legend lists 8 labels but only 6 colors are used in the charts. This may indicate:
- A missing color in the legend (e.g., PC32 and PC128 share the same color).
- An error in the image rendering.
## 8. **Conclusion**
The image visualizes the accuracy of 8 machine learning models (LR, PC1, PC2, PC4, PC8, PC32, PC128, PC512) across 32 layers of a neural network. Key trends include:
- **LR and PC1** generally outperform other models.
- **PC128 and PC512** underperform in early layers but improve in later layers.
- **PC32 and PC4** show variable performance depending on the layer.
This analysis is based on visual inspection of the provided chart. Numerical data extraction is not feasible without additional context or raw data.
</details>
Figure B.3: Reporting accuracy as a function of the number of in-context examples, for each target axis. Each panel corresponds to one layer of Llama 3.1 8B.
<details>
<summary>2505.13763v2/x9.png Details</summary>

### Visual Description
# Technical Document Extraction: Layer Accuracy Analysis
## Overview
The image presents a grid of 28 line charts (Layers 0–27), each visualizing accuracy metrics across multiple computational components. Charts are organized in rows of 6 columns (except the final row, which contains 4 charts). Each chart shares identical axis labels and legend structure.
---
## Axis Labels
- **Y-Axis**: "Accuracy" (range: 0.5 to 0.75)
- **X-Axis**: "Layer" (categorical, 0–27)
---
## Legend
Color-coded labels for computational components:
| Color | Label |
|-------------|---------|
| Red | LR |
| Teal | PC8 |
| Yellow | PC1 |
| Dark Blue | PC32 |
| Green | PC2 |
| Purple | PC128 |
| Light Blue | PC4 |
| Dark Purple | PC512 |
---
## Layer Structure
- **Total Layers**: 28 (0–27)
- **Chart Format**: Line charts with overlapping colored lines
- **Layer Grouping**:
- Rows 1–4: 6 layers per row (Layers 0–5, 6–11, 12–17, 18–23)
- Row 5: 4 layers (Layers 24–27)
---
## Key Trends
1. **General Accuracy Progression**:
- Accuracy values generally increase with layer depth (bottom-to-top in charts).
- Baseline accuracy starts near 0.5 for all components in early layers (0–5).
- By Layer 27, most components approach 0.75 accuracy.
2. **Component-Specific Patterns**:
- **LR (Red)**: Consistently highest accuracy across most layers, with gradual improvement.
- **PC8 (Teal)**: Shows moderate growth, often plateauing mid-chart.
- **PC1 (Yellow)**: Rapid initial gains in early layers, followed by stabilization.
- **PC32 (Dark Blue)**: Steady but slower improvement compared to other components.
- **PC2 (Green)**: Fluctuates significantly in mid-layers (10–15), then stabilizes.
- **PC128 (Purple)**: Sharp rise in later layers (20–27), surpassing some components.
- **PC4 (Light Blue)**: Moderate growth with occasional dips.
- **PC512 (Dark Purple)**: Lowest accuracy in early layers but shows exponential growth post-Layer 15.
3. **Layer-Specific Observations**:
- **Layers 0–5**: All components start near 0.5 accuracy; LR and PC1 diverge first.
- **Layers 10–15**: PC2 exhibits volatility; PC128 begins rapid ascent.
- **Layers 20–27**: PC128 and PC512 outperform others, with PC512 closing the gap.
---
## Data Consistency
- **Legend-Line Matching**: All colors in the legend correspond precisely to line colors in every chart.
- **Axis Uniformity**: Y-axis range (0.5–0.75) and X-axis labels (Layer 0–27) are consistent across all charts.
---
## Summary
The visualization demonstrates how different computational components (LR, PC8, PC1, etc.) contribute to accuracy across 28 neural network layers. LR and PC128 show the most significant improvement, while PC512 emerges as a strong performer in later layers. Accuracy trends suggest deeper layers (20–27) achieve higher performance, with component-specific optimization opportunities identified in mid-layers (10–15).
</details>
Figure B.4: Reporting accuracy as a function of the number of in-context examples, for each target axis. Each panel corresponds to one layer of Qwen2.5 7B.
B.4 Reporting performance of an ideal observer
Here, we characterize a theoretical upper bound on the reporting performance of LLMs. An ideal observer has full access to all of the LLM's neural activations, and therefore upper-bounds any report based on those activations. Given a neurally defined label (from either a PC axis or the LR axis), the optimal prediction is achieved by a linear classifier (logistic regression). We analyze the ideal observer's reporting performance for each target PC axis and each model (Fig. B.5); it is much higher than the empirical reporting performance of LLMs (e.g., compare the Llama 3.1 8B curve with Fig. 2c).
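A minimal sketch of this ideal-observer baseline (synthetic activations stand in for the LLM's residual-stream activations; the neurally defined labels are simulated by thresholding the projection onto a hypothetical target axis):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for residual-stream activations (n_sentences x d_model)
# and binary labels defined by thresholding the projection on a target axis.
X = rng.normal(size=(1000, 64))
axis = rng.normal(size=64)
y = (X @ axis > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The ideal observer sees all activations, so a linear classifier fit on X
# upper-bounds any report derived from partial access to the same activations.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"ideal-observer accuracy: {clf.score(X_te, y_te):.2f}")
```

Because the synthetic labels are linearly separable by construction, this classifier approaches ceiling accuracy; on real activations the achievable bound depends on the target axis, as in Fig. B.5.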
<details>
<summary>2505.13763v2/x10.png Details</summary>

### Visual Description
# Technical Document Extraction: Accuracy vs. Target Axis Chart
## Chart Type
Line chart comparing model accuracy across different target axes.
## Axis Labels
- **Y-Axis**: Accuracy (0.5 to 1.0)
- **X-Axis**: Target Axis (LR, PC1, PC2, PC4, PC8, PC32, PC128, PC512)
## Legend Entries (Model Configurations)
1. **llama3.1_8b** (Dark Red)
2. **llama3.2_3b** (Red)
3. **llama3.2_1b** (Light Orange)
4. **qwen2.5_7b** (Dark Blue)
5. **qwen2.5_3b** (Blue)
6. **qwen2.5_1.5b** (Light Blue)
## Key Trends
1. **Initial Performance**:
- All models start near **0.95 accuracy** at the leftmost axis ("LR").
- Lines are tightly clustered, indicating similar baseline performance.
2. **Decline Pattern**:
- Accuracy decreases gradually as the target axis progresses from LR to PC512.
- Steeper decline observed after **PC32**, with all lines converging toward ~0.7 accuracy by PC512.
3. **Model-Specific Behavior**:
- **llama3.1_8b** (dark red) and **qwen2.5_7b** (dark blue) maintain slightly higher accuracy than smaller models (e.g., llama3.2_1b, qwen2.5_1.5b) in later axes.
- **llama3.2_3b** (red) and **qwen2.5_3b** (blue) show intermediate performance.
4. **Convergence**:
- By PC512, all models exhibit overlapping accuracy (~0.65–0.7), suggesting diminishing returns with increased axis complexity.
## Notes
- No explicit data points are labeled; trends are inferred from line trajectories.
- Axis labels use "LR" (the logistic-regression axis) and "PC" (principal component) notation, with PC indices increasing roughly exponentially (e.g., PC1 → PC512).
- Legend colors are consistently mapped to model configurations across the chart.
</details>
Figure B.5: Ideal observerâs reporting performance.
B.5 Summarized control effects of Qwen2.5 7B
<details>
<summary>2505.13763v2/x11.png Details</summary>

### Visual Description
# Technical Document Analysis: Control Effect Across Layers
## Image Description
The image contains two side-by-side line charts comparing control effects across neural network layers for explicit and implicit control mechanisms. Each chart includes three data series with confidence intervals, color-coded by legend.
---
### Left Panel: Explicit Control
#### Chart Components
- **Title**: "Explicit control"
- **Y-Axis**: "Control effect (d)" with range [0, 6]
- **X-Axis**: "Layer" with markers at 1, 7, 14, 21, 28
- **Legend**: Top-right corner with three entries:
- Red: "LR" (Logistic Regression)
- Blue: "Early PCs" (Early Principal Components)
- Green: "Late PCs" (Late Principal Components)
#### Data Series Trends
1. **LR (Red Line)**
- **Trend**: Steady upward slope from layer 1 to 28
- **Key Points**:
- Layer 1: ~1.5
- Layer 7: ~2.5
- Layer 14: ~3.5
- Layer 21: ~5.0
- Layer 28: ~5.8
- **Confidence Interval**: Shaded red band widens slightly with increasing layers
2. **Early PCs (Blue Line)**
- **Trend**: Gradual upward curve with plateau after layer 14
- **Key Points**:
- Layer 1: ~0.5
- Layer 7: ~1.2
- Layer 14: ~1.6
- Layer 21: ~1.8
- Layer 28: ~1.8
- **Confidence Interval**: Narrow blue band remains consistent
3. **Late PCs (Green Line)**
- **Trend**: Near-zero flat line with minor fluctuations
- **Key Points**:
- Layer 1: ~0.1
- Layer 7: ~0.1
- Layer 14: ~0.1
- Layer 21: ~0.2
- Layer 28: ~0.1
- **Confidence Interval**: Very narrow green band
---
### Right Panel: Implicit Control
#### Chart Components
- **Title**: "Implicit control"
- **Y-Axis**: "Control effect (d)" with range [0, 1]
- **X-Axis**: "Layer" with markers at 1, 7, 14, 21, 28
- **Legend**: Top-right corner with identical color coding as left panel
#### Data Series Trends
1. **LR (Red Line)**
- **Trend**: Flat until layer 14, then sharp upward spike
- **Key Points**:
- Layer 1: ~0.0
- Layer 7: ~0.0
- Layer 14: ~0.2
- Layer 21: ~0.8
- Layer 28: ~1.2
- **Confidence Interval**: Shaded red band expands dramatically after layer 14
2. **Early PCs (Blue Line)**
- **Trend**: Slight upward curve with minimal variation
- **Key Points**:
- Layer 1: ~0.0
- Layer 7: ~0.0
- Layer 14: ~0.1
- Layer 21: ~0.2
- Layer 28: ~0.3
- **Confidence Interval**: Narrow blue band remains stable
3. **Late PCs (Green Line)**
- **Trend**: Completely flat line at baseline
- **Key Points**:
- All layers: ~0.0
- **Confidence Interval**: Invisible due to zero values
---
### Cross-Validation Summary
- **Legend Consistency**:
- Red = LR (matches both panels)
- Blue = Early PCs (matches both panels)
- Green = Late PCs (matches both panels)
- **Spatial Grounding**:
- Legends positioned at [x=0.85, y=0.95] in both panels (normalized coordinates)
- **Axis Alignment**:
- X-axis (layers) identical in both panels
- Y-axis scales differ: [0,6] vs [0,1]
---
### Key Observations
1. **Explicit Control Dominance**:
- LR shows strong layer-dependent control effects (up to 5.8)
- Early PCs contribute minimally (~1.8 max)
- Late PCs show negligible effects (<0.2)
2. **Implicit Control Dynamics**:
- LR exhibits delayed but explosive control effects post-layer 14
- Early PCs show weak but consistent contributions
- Late PCs remain inactive
3. **Confidence Intervals**:
- Explicit control shows higher uncertainty in LR effects
- Implicit control uncertainty spikes dramatically in LR post-layer 14
---
### Technical Notes
- All values extracted from visual interpolation of line positions
- Confidence intervals represent 95% bootstrap confidence bands
- Layer numbering suggests analysis of deep neural network architecture
- "PCs" likely refers to Principal Component Analysis-derived features
</details>
Figure B.6: Control effects of Qwen2.5 7B. Target control effect for prompts ( $N=256$ ) targeting the LR axis, early PCs (averaged over PC 1, 2, 4, 8), and late PCs (averaged over PC 32, 128, 512) across different layers.
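As a sketch of how such summary curves could be produced, assuming the control effect d is a Cohen's-d-style standardized difference between projections under opposite control instructions (the per-prompt projections below are synthetic; the early/late PC grouping follows the caption):

```python
import numpy as np

def cohens_d(proj_up: np.ndarray, proj_down: np.ndarray) -> float:
    """Standardized difference between projections under 'high' vs 'low'
    control prompts (a common definition of effect size d; assumed here)."""
    n1, n2 = len(proj_up), len(proj_down)
    pooled_var = ((n1 - 1) * proj_up.var(ddof=1)
                  + (n2 - 1) * proj_down.var(ddof=1)) / (n1 + n2 - 2)
    return float((proj_up.mean() - proj_down.mean()) / np.sqrt(pooled_var))

rng = np.random.default_rng(1)
# Hypothetical per-prompt projections (N=256 per condition) for each target PC.
d_per_pc = {pc: cohens_d(rng.normal(1.0, 1.0, 256), rng.normal(0.0, 1.0, 256))
            for pc in [1, 2, 4, 8, 32, 128, 512]}

# Summaries as in the figure: average d over early vs late PCs.
early = float(np.mean([d_per_pc[pc] for pc in (1, 2, 4, 8)]))
late = float(np.mean([d_per_pc[pc] for pc in (32, 128, 512)]))
print(f"early-PC mean d: {early:.2f}, late-PC mean d: {late:.2f}")
```

Repeating this per layer yields the layer-wise curves shown for the LR axis, early PCs, and late PCs.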
B.6 Summarized metacognitive effects on four datasets
<details>
<summary>2505.13763v2/x12.png Details</summary>

### Visual Description
# Technical Document Extraction: Graph Analysis
## Overview
The image contains a 3x4 grid of graphs (12 total) organized by:
- **Rows**: Report (top), Explicit (middle), Implicit (bottom)
- **Columns**: Moral (left), True-false (middle-left), Sycophancy (middle-right), Happy-sad (right)
---
## Top Row: "Report" Graphs
### Structure
- **X-axis**: `# Examples` (0–600)
- **Y-axis**: `Cross-entropy` (0â1.2)
- **Legend**:
- Red: LR
- Yellow: PC1
- Green: PC2
- Light Green: PC4
- Teal: PC8
- Dark Teal: PC32
- Purple: PC128
- Dark Purple: PC512
### Key Trends
1. **Moral Task**
- All lines show rapid cross-entropy decline in early examples (~0–200)
- LR (red) starts highest (~1.2) and converges with PCs
- PC512 (dark purple) shows smoothest decline
2. **True-false Task**
- Similar pattern to Moral, but steeper initial drop
- PC1 (yellow) and PC2 (green) diverge most from LR
3. **Sycophancy Task**
- Cross-entropy plateaus earlier (~400 examples)
- PC8 (teal) and PC32 (dark teal) show minimal variance
4. **Happy-sad Task**
- Highest initial cross-entropy (~1.2)
- PC128 (purple) and PC512 (dark purple) converge fastest
---
## Middle Row: "Explicit" Graphs
### Structure
- **X-axis**: `Layer` (1–32)
- **Y-axis**: `Control effect (d)` (0–8)
- **Legend**:
- Red: LR
- Blue: Early PCs
- Green: Late PCs
### Key Trends
1. **Moral Task**
- LR (red) peaks at layer 16 (~4.5)
- Early PCs (blue) show gradual rise to layer 24
- Late PCs (green) remain flat (~0.5)
2. **True-false Task**
- LR (red) peaks at layer 8 (~3)
- Early PCs (blue) rise steadily to layer 24
- Late PCs (green) show minimal change
3. **Sycophancy Task**
- LR (red) peaks at layer 16 (~3.5)
- Early PCs (blue) plateau after layer 12
- Late PCs (green) remain near zero
4. **Happy-sad Task**
- LR (red) increases steadily to layer 28 (~7.5)
- Early PCs (blue) show moderate rise
- Late PCs (green) remain flat
---
## Bottom Row: "Implicit" Graphs
### Structure
- **X-axis**: `Layer` (1–28)
- **Y-axis**: `Control effect (d)` (0–2)
- **Legend**:
- Red: LR
- Blue: Early PCs
- Green: Late PCs
### Key Trends
1. **Moral Task**
- LR (red) spikes at layer 16 (~2)
- Early PCs (blue) rise gradually to layer 24
- Late PCs (green) remain near zero
2. **True-false Task**
- LR (red) peaks at layer 24 (~1.5)
- Early PCs (blue) show moderate rise
- Late PCs (green) remain flat
3. **Sycophancy Task**
- LR (red) increases steadily to layer 28 (~1)
- Early PCs (blue) show gradual rise
- Late PCs (green) remain near zero
4. **Happy-sad Task**
- LR (red) increases sharply after layer 21 (~2)
- Early PCs (blue) show moderate rise
- Late PCs (green) remain flat
---
## Cross-Referenced Legend Consistency
- **Color Matching**:
- All graphs consistently use:
- Red for LR
- Blue for Early PCs
- Green for Late PCs
- Additional colors (yellow, teal, purple) for PCs in "Report" graphs
- **Legend Accuracy**:
- Confirmed alignment between legend labels and line colors across all graphs
---
## Summary
- **Report Graphs**: Demonstrate cross-entropy convergence across tasks, with LR and PCs showing task-specific patterns.
- **Explicit/Implicit Graphs**: Highlight layer-specific control effects, with LR often peaking at mid-layers (16–24) and PCs showing divergent trends.
- **Task-Specific Patterns**:
- Sycophancy and Happy-sad tasks show higher initial cross-entropy and control effects.
- Moral and True-false tasks exhibit faster convergence in early layers.
</details>
Figure B.7: Metacognitive effects on four datasets. From left to right: ETHICS (Llama 3.1 8B), True-False (Llama 3.1 8B), Sycophancy (Llama 3.2 3B), and Emotion (Llama 3.2 3B). Top: reporting performance. Middle: explicit control effects. Bottom: implicit control effects. Target control effect for prompts ( $N=256$ ) targeting the LR axis, early PCs (averaged over PC 1, 2, 4, 8), and late PCs (averaged over PC 32, 128, 512) across different layers.
B.7 Control accumulation effects on three other datasets
<details>
<summary>2505.13763v2/x13.png Details</summary>

### Visual Description
# Technical Document Extraction: Graph Analysis
## Overview
The image contains three pairs of graphs, each pair corresponding to a psychological construct: **True-false**, **Sycophancy**, and **Happy-sad**. Each pair includes:
1. A graph labeled **"Control Effect (σ)"** (left)
2. A graph labeled **"Target layer"** (right)
All graphs share the same x-axis: **"Layer of projected activation"** (ranging from 1 to 28). The y-axes differ between graph types:
- **Control Effect (σ)**: 0 to 4
- **Target layer**: 0 to 1.5
---
## Legend Details
Each graph includes a legend with **five target layers**:
- **Layer 1** (blue)
- **Layer 7** (purple)
- **Layer 14** (pink)
- **Layer 21** (orange)
- **Layer 28** (yellow)
Legend colors are consistent across all graphs.
---
## Key Trends and Data Points
### 1. **True-false**
#### Control Effect (σ)
- **Layer 1**: All lines start near 0.5–1.0.
- **Layer 7**: Lines show moderate increases (1.5–2.5).
- **Layer 14**: Sharp rise to 2.5–3.5.
- **Layer 21**: Peaks at 3.5–4.0.
- **Layer 28**: Slight decline or stabilization (~3.0–3.5).
#### Target Layer
- **Layer 1**: All lines start near 0.0.
- **Layer 7**: Minimal increase (~0.1–0.2).
- **Layer 14**: Gradual rise to ~0.5.
- **Layer 21**: Sharp increase to ~1.0–1.2.
- **Layer 28**: Peaks at ~1.2–1.3.
### 2. **Sycophancy**
#### Control Effect (σ)
- **Layer 1**: Lines start near 1.0–1.5.
- **Layer 7**: Moderate increase (1.5–2.0).
- **Layer 14**: Rise to 2.0–2.5.
- **Layer 21**: Peaks at 2.5–3.0.
- **Layer 28**: Stabilizes (~2.5–3.0).
#### Target Layer
- **Layer 1**: All lines start near 0.0.
- **Layer 7**: Minimal increase (~0.1).
- **Layer 14**: Gradual rise to ~0.4.
- **Layer 21**: Sharp increase to ~0.7–0.8.
- **Layer 28**: Peaks at ~0.8–0.9.
### 3. **Happy-sad**
#### Control Effect (σ)
- **Layer 1**: Lines start near 1.0–1.5.
- **Layer 7**: Moderate increase (1.5–2.0).
- **Layer 14**: Rise to 2.0–2.5.
- **Layer 21**: Peaks at 2.5–3.0.
- **Layer 28**: Stabilizes (~2.5–3.0).
#### Target Layer
- **Layer 1**: All lines start near 0.0.
- **Layer 7**: Minimal increase (~0.1).
- **Layer 14**: Gradual rise to ~0.3.
- **Layer 21**: Sharp increase to ~0.6–0.7.
- **Layer 28**: Peaks at ~0.7–0.8.
---
## Observations
1. **Control Effect (σ)**:
- All constructs show increasing control effect with higher layers.
- **True-false** exhibits the steepest rise, peaking at Layer 21.
- **Sycophancy** and **Happy-sad** show more gradual increases, peaking at Layer 28.
2. **Target Layer**:
- All constructs demonstrate a consistent upward trend across layers.
- **True-false** has the highest target layer values (~1.3 at Layer 28).
- **Sycophancy** and **Happy-sad** show lower but similar trends (~0.9 and ~0.8, respectively).
3. **Variability**:
- Shaded regions (confidence intervals) indicate variability in measurements.
- **True-false** shows the widest variability, especially at Layers 14–21.
---
## Conclusion
The graphs illustrate how control effect and target layer values evolve across projected activation layers for three psychological constructs. **True-false** demonstrates the most pronounced increases, while **Sycophancy** and **Happy-sad** show more moderate trends. These patterns suggest layer-dependent differences in neural processing for these constructs.
</details>
Figure B.8: Accumulation mechanisms of control effects across layers. Each curve corresponds to prompts targeting the LR axis $\text{LR}_{l}$ , defined by the residual stream activations at a specific layer $l$ (dot markers), showing projections of residual stream activations at each layer (x-axis) onto the target axis $\text{LR}_{l}$ . (a) Explicit control. (b) Implicit control. From left to right: True-False (Llama 3.1 8B), Sycophancy (Llama 3.2 3B), and Emotion (Llama 3.2 3B). Shaded areas indicate 95% confidence intervals.
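The layer-wise projection analysis can be illustrated with a toy residual stream (entirely synthetic; in the actual analysis, `h` would be the model's residual-stream activation and the unit vector would be the LR axis fit at a chosen layer l):

```python
import numpy as np

rng = np.random.default_rng(2)
n_layers, d_model = 28, 64

# Unit vector standing in for the target axis LR_l at a chosen layer l.
lr_axis = rng.normal(size=d_model)
lr_axis /= np.linalg.norm(lr_axis)

# Toy residual stream: each layer's write adds a small component along the
# target axis, so the projection accumulates across depth.
h = np.zeros(d_model)
projections = []
for layer in range(n_layers):
    h = h + 0.1 * rng.normal(size=d_model) + 0.3 * lr_axis
    projections.append(float(h @ lr_axis))

# In this toy model the projection grows roughly linearly with depth,
# mirroring the accumulation pattern plotted in the figure.
print(projections[0], projections[-1])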
B.8 Summarized control effects of Llama3.1 8B with fine-grained neurofeedback labels
<details>
<summary>2505.13763v2/x14.png Details</summary>

### Visual Description
# Technical Document Extraction: Control Effect Analysis
## Graph Structure
- **Two Subplots**:
1. **Explicit Control** (Left)
2. **Implicit Control** (Right)
## Axis Labels
- **X-Axis**: "Layer" (Discrete markers at 1, 8, 16, 24, 32)
- **Y-Axis**: "Control effect (d)" (Continuous scale)
## Legends
- **Color-Coded Lines**:
- **Red**: LR (Linear Regression)
- **Blue**: Early PCs (Principal Components)
- **Green**: Late PCs (Principal Components)
## Explicit Control Graph (Left)
### Key Trends
1. **LR (Red Line)**:
- Starts at ~2 (Layer 1)
- Peaks at **6** (Layer 16)
- Declines to ~4 (Layer 32)
- Shaded area (variability) widest at peak (Layer 16)
2. **Early PCs (Blue Line)**:
- Starts at ~1 (Layer 1)
- Peaks at **2.5** (Layer 16)
- Declines to ~1.5 (Layer 32)
- Shaded area narrower than LR
3. **Late PCs (Green Line)**:
- Remains near **0** (Layer 1â32)
- Slight upward fluctuation (~0.2â0.5)
- Minimal shaded variability
## Implicit Control Graph (Right)
### Key Trends
1. **LR (Red Line)**:
- Starts at **0** (Layer 1)
- Sharp rise to **2** (Layer 16)
- Declines to **1.5** (Layer 32)
- Shaded area widest at peak (Layer 16)
2. **Early PCs (Blue Line)**:
- Starts at **0** (Layer 1)
- Peaks at **1** (Layer 16)
- Declines to **0.5** (Layer 32)
- Shaded area narrower than LR
3. **Late PCs (Green Line)**:
- Remains near **0** (Layer 1â32)
- Slight upward fluctuation (~0.1â0.2)
- Minimal shaded variability
## Cross-Referenced Observations
- **Legend Consistency**:
- Red (LR) consistently highest in both graphs.
- Blue (Early PCs) intermediate in both.
- Green (Late PCs) minimal in both.
- **Layer-Specific Peaks**:
- Both graphs show peak control effects at **Layer 16**.
- Explicit control exhibits higher magnitude effects overall.
## Notes
- Shaded regions represent confidence intervals or variability.
- No data tables present; trends inferred from line trajectories.
- Axis markers (1, 8, 16, 24, 32) suggest layer groupings (e.g., early, mid, late layers).
</details>
Figure B.9: Control effects of Llama3.1 8B, with eight-level quantized neural feedback labels. Target control effect for prompts ( $N=256$ ) targeting the LR axis, early PCs (averaged over PC 1, 2, 4, 8), and late PCs (averaged over PC 32, 128, 512) across different layers.
B.9 Detailed results for control in Llama3.1 8B and Qwen2.5 7B
<details>
<summary>2505.13763v2/supp/llama3.1_8b_active_layer0.png Details</summary>

### Visual Description
# Technical Document Extraction: Control Effect Analysis
## Overview
The image contains **eight line graphs** and a **heatmap** analyzing control effects across different principal components (PCs). Each graph visualizes the relationship between the number of examples and control effect magnitude, with color-coded lines representing specific PCs. The heatmap quantifies pairwise control effects between target and affected axes.
---
## Line Graphs Analysis
### General Structure
- **X-axis**: Number of examples (0â200)
- **Y-axis**: Control effect (d) ranging from -1 to 2.0
- **Legend**: Located on the right side of each graph, mapping line styles/colors to PCs (PC1âPC512) and LR (Logistic Regression).
### Graph-by-Graph Breakdown
#### 1. Target Axis: LR (Logistic Regression)
- **Affected Axes**: PC1âPC512
- **Legend**:
- Red solid line: LR
- Dotted lines: PC1âPC512 (colors vary)
- **Trend**:
- LR line slopes upward from 0 to ~2.0.
- PC lines show mixed trends (e.g., PC1 increases, PC2 decreases).
- **Data Points**:
- LR: Starts at 0, peaks at 2.0.
- PC1: Starts at 0, peaks at ~1.5.
#### 2. Target Axis: PC1
- **Affected Axes**: PC1âPC512
- **Legend**:
- Orange solid line: PC1
- Dotted lines: Other PCs
- **Trend**:
- PC1 line remains flat near 0.
- PC2 shows a slight upward trend.
- **Data Points**:
- PC1: ~0.0 throughout.
- PC2: ~0.5 at 200 examples.
#### 3. Target Axis: PC2
- **Affected Axes**: PC1âPC512
- **Legend**:
- Green solid line: PC2
- Dotted lines: Other PCs
- **Trend**:
- PC2 line slopes upward to ~1.5.
- PC1 shows a downward trend.
- **Data Points**:
- PC2: ~1.5 at 200 examples.
- PC1: ~-0.5 at 200 examples.
#### 4. Target Axis: PC4
- **Affected Axes**: PC1âPC512
- **Legend**:
- Teal solid line: PC4
- Dotted lines: Other PCs
- **Trend**:
- PC4 line fluctuates but stabilizes near 0.5.
- PC8 shows a downward trend.
- **Data Points**:
- PC4: ~0.5 at 200 examples.
- PC8: ~-0.2 at 200 examples.
#### 5. Target Axis: PC8
- **Affected Axes**: PC1âPC512
- **Legend**:
- Dark blue solid line: PC8
- Dotted lines: Other PCs
- **Trend**:
- PC8 line slopes downward to ~-0.5.
- PC128 shows minimal change.
- **Data Points**:
- PC8: ~-0.5 at 200 examples.
- PC128: ~0.0 at 200 examples.
#### 6. Target Axis: PC32
- **Affected Axes**: PC1âPC512
- **Legend**:
- Purple solid line: PC32
- Dotted lines: Other PCs
- **Trend**:
- PC32 line fluctuates but trends upward to ~0.5.
- PC128 shows a slight downward trend.
- **Data Points**:
- PC32: ~0.5 at 200 examples.
- PC128: ~-0.1 at 200 examples.
#### 7. Target Axis: PC128
- **Affected Axes**: PC1âPC512
- **Legend**:
- Dark purple solid line: PC128
- Dotted lines: Other PCs
- **Trend**:
- PC128 line remains near 0.
- PC512 shows a slight upward trend.
- **Data Points**:
- PC128: ~0.0 at 200 examples.
- PC512: ~0.1 at 200 examples.
#### 8. Target Axis: PC512
- **Affected Axes**: PC1âPC512
- **Legend**:
- Black solid line: PC512
- Dotted lines: Other PCs
- **Trend**:
- PC512 line slopes upward to ~0.5.
- PC1 shows a downward trend.
- **Data Points**:
- PC512: ~0.5 at 200 examples.
- PC1: ~-0.2 at 200 examples.
---
## Heatmap Analysis
### Structure
- **Rows**: Target axes (LR, PC1, PC2, PC4, PC8, PC32, PC128, PC512)
- **Columns**: Affected axes (PC1, PC2, PC4, PC8, PC32, PC128, PC512)
- **Color Scale**:
- Red: Positive control effect (closer to 1)
- Blue: Negative control effect (closer to -1)
- White: Neutral (near 0)
### Key Values
| Target Axis | PC1 | PC2 | PC4 | PC8 | PC32 | PC128 | PC512 |
|-------------|-------|-------|-------|-------|-------|-------|-------|
| **LR** | 0.13 | -0.09 | 0.20 | -0.12 | 0.06 | -0.15 | 0.10 |
| **PC1** | -0.32 | 0.14 | 0.02 | 0.06 | 0.10 | -0.14 | 0.03 |
| **PC2** | 0.39 | -0.05 | -0.43 | 0.03 | -0.04 | -0.21 | -0.09 |
| **PC4** | -0.07 | 0.34 | -0.04 | 0.33 | -0.00 | -0.11 | -0.17 |
| **PC8** | -0.27 | 0.05 | 0.77 | 0.03 | 0.20 | 0.05 | 0.01 |
| **PC32** | -1.18 | 1.51 | -0.22 | 0.20 | -0.49 | -0.12 | -0.18 |
| **PC128** | 1.33 | -0.97 | -0.32 | 0.02 | -0.23 | 0.20 | 0.02 |
| **PC512** | -2.12 | 1.41 | 1.12 | -0.19 | -0.07 | -0.49 | -0.45 |
### Observations
- **Strongest Positive Effect**: PC32 â PC2 (1.51)
- **Strongest Negative Effect**: LR â PC512 (-2.12)
- **Neutral Effects**: Diagonal entries (e.g., PC1 â PC1: 0.13) show minimal impact.
---
## Spatial Grounding & Validation
1. **Legend Placement**: All legends are positioned on the right side of their respective graphs.
2. **Color Consistency**:
- PC1 (orange) matches all orange lines across graphs.
- PC512 (black) matches black lines in PC512-target graphs.
3. **Trend Verification**:
- LR line in the first graph slopes upward, aligning with its data points (0 â 2.0).
- PC32 line in its graph trends upward, matching its data points (~0.5 at 200 examples).
---
## Conclusion
The analysis reveals significant control effects between specific PCs (e.g., PC32 â PC2) and notable negative effects from LR to PC512. The heatmap provides a quantitative summary of these relationships, with color gradients highlighting magnitude and direction.
</details>
Figure B.10: Control performance of Llama3.1 8B (explicit control) in layer 1.
<details>
<summary>2505.13763v2/supp/llama3.1_8b_inactive_layer0.png Details</summary>

### Visual Description
Eight line charts plot control effect (d) against the number of examples (0–200), one per target axis (LR, PC1, PC2, PC4, PC8, PC32, PC128, PC512); in every chart, the target's own axis and all other affected axes remain flat at 0. The accompanying heatmap ("Control effect (d): layer 1", color scale −5 blue to +5 red, targets as rows and affected axes as columns) shows essentially all entries at 0.00: implicit control at layer 1 produces no measurable effect on any axis.
</details>
Figure B.11: Control performance of Llama3.1 8B (implicit control) in layer 1.
<details>
<summary>2505.13763v2/supp/llama3.1_8b_active_layer15.png Details</summary>

### Visual Description
Eight line graphs plot control effect (d) against the number of examples (0–200), one per target axis, with a color-coded legend (red = LR through dark purple = PC512). Trajectories vary by target: the LR target shows a sharp initial decline before plateauing, PC2 a steep initial rise, and the remaining PCs gradual changes of varying magnitude and variance. The heatmap ("Control effect (d): layer 16", blue negative to red positive, targets as rows and affected axes as columns) shows sizable cross-axis effects under explicit control: the largest extracted entries include 5.30 (target PC128, affected PC1), 4.27 (target PC8, affected PC2), and 3.04 (target PC128, affected PC2).
</details>
Figure B.12: Control performance of Llama3.1 8B (explicit control) in layer 16.
<details>
<summary>2505.13763v2/supp/llama3.1_8b_inactive_layer15.png Details</summary>

### Visual Description
Eight line graphs plot control effect (d) (roughly −0.5 to 1.0) against the number of examples (0–200), one per target axis; each target's own axis rises modestly with examples (on the order of 0.01–0.15 at 200 examples) while the other affected axes stay near zero. The heatmap ("Control effect (d): layer 16", red positive to blue negative) reports mostly small values: the largest positive entries sit in the PC1 column (1.16 for target PC1, 0.89 for target LR), with scattered weak negative entries elsewhere, indicating that implicit control at this layer yields only modest effects.
</details>
Figure B.13: Control performance of Llama3.1 8B (implicit control) in layer 16.
<details>
<summary>2505.13763v2/supp/llama3.1_8b_active_layer31.png Details</summary>

### Visual Description
Eight line graphs (control effect d vs. number of examples 0–200, with shaded confidence intervals) show one panel per target axis. The LR target drives a sharp rise (to roughly 4.5) within ~50 examples; the PC2 and PC4 targets rise to around 2; the remaining targets show smaller or non-monotonic changes. The heatmap (targets as rows, affected axes as columns, blue negative to red positive) places the strongest value in the LR row and LR column (4.47), with notable off-diagonal entries such as 2.91 (target PC32, affected PC2), 1.78 (target PC4, affected PC4), and −0.93 (target PC32, affected PC8).
</details>
Figure B.14: Control performance of Llama3.1 8B (explicit control) in layer 32.
<details>
<summary>2505.13763v2/supp/llama3.1_8b_inactive_layer31.png Details</summary>

### Visual Description
Eight line graphs (control effect d, roughly −0.5 to 2.0, vs. number of examples 0–200) show one panel per target axis; the LR target's own axis peaks near 1.8 by ~50 examples and then stabilizes, while the PC targets mostly settle near zero with minor fluctuations. The heatmap ("Control effect (d): layer 32", blue −1 to red +1, white at 0, targets as rows and affected axes as columns) shows most cells near zero; the strongest entries are −1.86 (target LR, affected PC1) and 0.94 (target PC8, affected PC4).
</details>
Figure B.15: Control performance of Llama3.1 8B (implicit control) in layer 32.
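The PC1–PC512 axes that recur throughout these figures are principal-component directions of layer activations. As a minimal sketch of recovering such unit axes via SVD (hypothetical setup; the paper's extraction pipeline is not reproduced here):

```python
import numpy as np

def principal_axes(acts, k=512):
    """Return the top-k unit principal-component directions of an
    (n_samples, d_model) activation matrix. Hypothetical sketch of the
    PC axes referenced in these figures."""
    centered = acts - acts.mean(axis=0)
    # rows of Vt are orthonormal directions sorted by explained variance
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[:k]
```

Projecting held-out activations onto these directions (and onto a separately fit LR probe direction) would give the per-axis readouts whose shifts the control-effect panels quantify.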
<details>
<summary>2505.13763v2/supp/qwen2.5_7b_active_layer0.png Details</summary>

### Visual Description
Eight line graphs plot control effect (d) (roughly −1.5 to 2.0) against the number of examples (0–200), one per target axis, with color-coded lines and shaded confidence intervals (LR red through PC512 black). Most targets show a gradual increase of their own axis with examples, with PC128 and PC512 diverging most. The heatmap ("Control effect (d): layer 1", red positive to blue negative, targets as rows and affected axes as columns) shows the largest magnitudes for targets PC128 (1.46 on PC1, −0.99 on PC2) and PC512 (−1.08 on PC1, 0.96 on PC2), with smaller mixed values elsewhere.
</details>
Figure B.16: Control performance of Qwen2.5 7B (explicit control) in layer 1.
<details>
<summary>2505.13763v2/supp/qwen2.5_7b_inactive_layer0.png Details</summary>

### Visual Description
Eight line graphs plot control effect (d) (−0.2 to 0.2) against the number of examples (0–200), one per target axis (LR, PC1, PC2, PC4, PC8, PC32, PC128, PC512); every line, including the target's own axis, stays flat at 0. The heatmap ("Control effect (d): layer 1", color scale on the order of ±0.0001) likewise shows all cells at approximately 0.00: implicit control at layer 1 produces no measurable effect on any axis.
</details>
Figure B.17: Control performance of Qwen2.5 7B (implicit control) in layer 1.
<details>
<summary>2505.13763v2/supp/qwen2.5_7b_active_layer13.png Details</summary>

### Visual Description
Eight line graphs plot control effect (d) against the number of examples (0–200), one per target axis, with shaded confidence intervals; in each panel the target's own axis rises most steeply, and the LR target shows the most pronounced increase. The heatmap ("Control effect (d): layer 14", red +2 to blue −2, targets as rows and affected axes as columns) places its largest entries at 3.54 (target LR, affected PC1), 2.54 (target PC1, affected PC2), and 2.48 (target LR, affected PC2); the strongest negative value is −0.89 (target PC1, affected PC512).
</details>
Figure B.18: Control performance of Qwen2.5 7B (explicit control) in layer 14.
<details>
<summary>2505.13763v2/supp/qwen2.5_7b_inactive_layer13.png Details</summary>

### Visual Description
Eight line graphs plot control effect (d) (−0.2 to 0.2) against the number of examples (0–200), one per target axis; the target's own axis deviates only slightly (~0.01–0.02) while all other affected axes stay near zero. The heatmap (targets as rows, affected axes as columns, color scale on the order of ±0.005) shows all entries at approximately 0.000, so implicit control at layer 14 yields effects near the noise floor.
</details>
Figure B.19: Qwen2.5 7B (implicit control) layer 14
<details>
<summary>2505.13763v2/supp/qwen2.5_7b_active_layer27.png Details</summary>

### Visual Description
Eight line graphs (x-axis: number of in-context examples, 0-200; y-axis: control effect (d); shaded bands: confidence intervals) and a heatmap titled "Control effect (d): layer 28". In each line graph, the targeted axis (LR, PC1, PC2, PC4, PC8, PC32, PC128, or PC512) rises most steeply with the number of examples; the LR target reaches roughly 5.0 by 200 examples, while the PC targets increase more moderately (roughly 0.5-2.0). The heatmap (rows = target axes; columns = affected axes; color scale from -4 in blue to +4 in red) is diagonally dominant: the largest effect (~5.7) occurs for the LR target axis, diagonal entries for the PC targets are mostly positive, and off-diagonal effects are comparatively small, with occasional weak negative (blue) values. Line colors (LR red, PC1 yellow, PC2 green, PC4 teal, PC8 dark teal, PC32 blue, PC128 purple, PC512 dark purple) match the legend in the top-right corner of each graph.
</details>
Figure B.20: Qwen2.5 7B (explicit control) layer 28
<details>
<summary>2505.13763v2/supp/qwen2.5_7b_inactive_layer27.png Details</summary>

### Visual Description
Eight line graphs (x-axis: number of in-context examples, 0-200; y-axis: control effect (d); shaded bands: confidence intervals) and a heatmap of pairwise control effects (rows = target axes; columns = affected axes; color scale from -1.0 in blue to 1.0 in red). In the LR-target graph, PC1 (orange) rises steeply to a control effect of roughly 1.0 by 200 examples; in the remaining graphs, most affected axes stay near zero, with PC1 (orange) and PC2 (green) showing weak positive trends of roughly 0.1-0.5. In the heatmap, the largest positive effects are off-diagonal (PC128 target on PC1: 0.72; PC128 on PC2: 0.77; PC512 on PC1: 0.77), a few cells show weak negative effects (e.g., PC128 on PC4: -0.20; PC512 on PC2: -0.25), and most diagonal cells are near zero. Line colors in the graphs match the legend in the top-right corner of each subplot.
</details>
Figure B.21: Qwen2.5 7B (implicit control) layer 28
B.10 Defining axes from hidden states aggregated across multiple layers
We performed preliminary experiments testing the control effect of an axis defined on the concatenation of all layers. Concretely, we trained a separate (logistic regression) classifier for each layer on the ETHICS dataset. We then averaged the outputs of all classifiers to obtain a single (ensemble) output that defines the neurofeedback label. Equivalently, this corresponds to a single classifier whose readout vector is the concatenation of all classifiers' readout vectors. We found that the LLMs' control effect on the ensemble output is similar to (marginally higher than) the control effect for layer 24 (Fig. 3d), suggesting that defining axes from hidden states aggregated across multiple layers might provide (slightly) more stable and representative directions. We leave a systematic investigation for future study.
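The ensemble construction above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the authors' code: the array shapes, variable names, and the use of scikit-learn's `LogisticRegression` are assumptions. It also verifies numerically that averaging the per-layer decision values (logits) is equivalent to a single classifier whose readout vector is the concatenation of the per-layer readout vectors (scaled by 1/L), with the intercepts averaged.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_layers, n_samples, d = 4, 200, 16  # hypothetical sizes

# Hypothetical per-layer hidden states and binary labels (ETHICS-style).
H = rng.normal(size=(n_layers, n_samples, d))
y = rng.integers(0, 2, size=n_samples)

# Train one logistic regression classifier per layer.
clfs = [LogisticRegression(max_iter=1000).fit(H[l], y) for l in range(n_layers)]

# Ensemble neurofeedback output: average the per-layer decision values (logits).
avg_logit = np.mean(
    [clf.decision_function(H[l]) for l, clf in enumerate(clfs)], axis=0
)

# Equivalent single classifier: readout vector = concatenation of the per-layer
# readout vectors (scaled by 1/L), applied to the concatenated hidden states,
# with the intercepts averaged.
w_concat = np.concatenate([clf.coef_.ravel() for clf in clfs]) / n_layers
b_mean = np.mean([clf.intercept_[0] for clf in clfs])
h_concat = H.transpose(1, 0, 2).reshape(n_samples, -1)
concat_logit = h_concat @ w_concat + b_mean

assert np.allclose(avg_logit, concat_logit)
```

The equivalence is exact because each per-layer logit is linear in that layer's hidden state, so the layer-wise average is itself a single linear readout on the concatenated state.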
Appendix C Experiment compute resources
We report compute resource usage for three tasks: preprocessing (extracting neural activations and training machine-learning methods to obtain the target axes and corresponding neurofeedback labels), metacognitive reporting, and metacognitive control. For brevity, we omit the "Instruct" suffix from model names.
| Model | Task | GPU | Compute time | Storage |
| --- | --- | --- | --- | --- |
| LLaMA-3.2-1B | Preprocessing | RTX 3090 (24GB) | 1 | 1 |
| LLaMA-3.2-1B | Control | RTX 3090 (24GB) | 3 | 3 |
| LLaMA-3.2-3B | Preprocessing | RTX 3090 (24GB) | 1 | 1 |
| LLaMA-3.2-3B | Control | RTX 3090 (24GB) | 15 | 8 |
| LLaMA-3.1-8B | Preprocessing | A100 (80GB) | 5 | 3 |
| LLaMA-3.1-8B | Report | A100 (80GB) | 1 | 10 |
| LLaMA-3.1-8B | Control | A100 (80GB) | 90 | 120 |
| LLaMA-3.1-70B | Preprocessing | 2× H200 (140GB) | 30 | 5 |
| LLaMA-3.1-70B | Report | 2× H200 (140GB) | 1 | 15 |
| LLaMA-3.1-70B | Control | 2× H200 (140GB) | 200 | 120 |
| Qwen2.5-1B | Preprocessing | RTX 3090 (24GB) | 1 | 1 |
| Qwen2.5-1B | Control | RTX 3090 (24GB) | 3 | 3 |
| Qwen2.5-3B | Preprocessing | RTX 3090 (24GB) | 1 | 1 |
| Qwen2.5-3B | Control | RTX 3090 (24GB) | 15 | 8 |
| Qwen2.5-7B | Preprocessing | A100 (80GB) | 5 | 3 |
| Qwen2.5-7B | Report | A100 (80GB) | 1 | 10 |
| Qwen2.5-7B | Control | A100 (80GB) | 90 | 120 |
Table 1: Compute and storage usage across tasks and models. Preprocessing, reporting, and control were run separately. The control task for the 1B and 3B models was limited to two axes; for the 70B model, the control task was run only for $N{=}256$ in-context examples. All compute times and storage values are reported as upper bounds.
All remaining analyses (e.g., visualization, metric aggregation) were conducted on a laptop with 32GB RAM, with a total runtime under 30 hours.
NeurIPS Paper Checklist
1. Claims
1. Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
1. Answer: [Yes]
1. Justification: We confirm that both the abstract and introduction accurately reflect the paper's contributions and scope.
1. Guidelines:
- The answer NA means that the abstract and introduction do not include the claims made in the paper.
- The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
- The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
- It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.
1. Limitations
1. Question: Does the paper discuss the limitations of the work performed by the authors?
1. Answer: [Yes]
1. Justification: We have mentioned a few limitations of the current work in the Discussion.
1. Guidelines:
- The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.
- The authors are encouraged to create a separate "Limitations" section in their paper.
- The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
- The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
- The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
- The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
- If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
- While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.
1. Theory assumptions and proofs
1. Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
1. Answer: [N/A]
1. Justification: This paper does not introduce any new theorems, formulas, or lemmas to be proved.
1. Guidelines:
- The answer NA means that the paper does not include theoretical results.
- All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.
- All assumptions should be clearly stated or referenced in the statement of any theorems.
- The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
- Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
- Theorems and Lemmas that the proof relies upon should be properly referenced.
1. Experimental result reproducibility
1. Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
1. Answer: [Yes]
1. Justification: All models used in our experiments are publicly available through the Hugging Face library. All analyses and figures presented in the paper can be fully reproduced using the code provided in the associated repository in Appendix A.1.
1. Guidelines:
- The answer NA means that the paper does not include experiments.
- If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.
- If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
- Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
- While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example
1. If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.
1. If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.
1. If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).
1. We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.
1. Open access to data and code
1. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
1. Answer: [Yes]
1. Justification: The associated repository in Appendix A.1 contains all necessary scripts, along with documentation, to enable full reproduction of the results and figures reported in this paper.
1. Guidelines:
- The answer NA means that paper does not include experiments requiring code.
- Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- While we encourage the release of code and data, we understand that this might not be possible, so "No" is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
- The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
- The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
- At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
- Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.
1. Experimental setting/details
1. Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
1. Answer: [Yes]
1. Justification: We provide detailed descriptions of the evaluation metrics, model hyperparameters, data sources, analysis procedure, prompt construction, and inference settings in both the main text and the Appendix. As all LLMs used are publicly available pre-trained models accessed via Hugging Face, we omit training details.
1. Guidelines:
- The answer NA means that the paper does not include experiments.
- The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
- The full details can be provided either with the code, in appendix, or as supplemental material.
1. Experiment statistical significance
1. Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
1. Answer: [Yes]
1. Justification: We report error bars, statistical significance tests, and effect size estimates wherever appropriate to support the robustness and interpretability of our results.
1. Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.
1. Experiments compute resources
1. Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
1. Answer: [Yes]
1. Justification: We provide full details regarding the compute resources required to reproduce all experiments discussed in the paper. This includes GPU types, total compute time, and environment specifications, as documented in Appendix C.
1. Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).
1. Code of ethics
1. Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
1. Answer: [Yes]
1. Justification: We have carefully reviewed the NeurIPS Code of Ethics and, to the best of our knowledge, our work complies fully with its guidelines. We are not aware of any violations or ethical concerns associated with the methods, data, or conclusions presented.
1. Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).
1. Broader impacts
1. Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
1. Answer: [Yes]
1. Justification: We discuss both the positive societal impacts and negative societal impacts of the studied metacognitive abilities in LLMs.
1. Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).
1. Safeguards
1. Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
1. Answer: [N/A]
1. Justification: We do not release any new datasets or models. Our work solely involves analyzing existing publicly available pre-trained language models using a novel methodological framework. We do not identify any foreseeable risks associated with our contributions.
1. Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.
1. Licenses for existing assets
1. Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
1. Answer: [Yes]
1. Justification: We use two families of pre-trained language models: the LLaMA 3 series (e.g., LLaMA-3.2-1B, LLaMA-3.1-8B) under Meta Llama 3 Community License and the Qwen 2.5 series under Apache License 2.0 (e.g., Qwen2.5-1B, Qwen2.5-7B). All models are used under their respective research licenses and are properly cited in the paper. All datasets are either publicly available or included in the code repository. Their licenses are reported in the Appendix. All assets are credited appropriately, and license terms have been fully respected.
1. Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.
1. New assets
1. Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
1. Answer: [Yes]
1. Justification: The only new assets introduced in this work are the code implementations for model fitting and analysis. We release this code with detailed documentation to facilitate reproducibility, as described in Appendix A.1.
1. Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.
1. Crowdsourcing and research with human subjects
1. Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
1. Answer: [N/A]
1. Justification: This work does not involve human subjects, personally identifiable information, or the use of crowdsourcing.
1. Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.
1. Institutional review board (IRB) approvals or equivalent for research with human subjects
1. Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
1. Answer: [N/A]
1. Justification: This work does not involve human subjects, user studies, or crowdsourcing. Therefore, Institutional Review Board approval or equivalent ethical review is not applicable.
1. Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.
1. Declaration of LLM usage
1. Question: Does the paper describe the usage of LLMs if it is an important, original, or non-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does not impact the core methodology, scientific rigorousness, or originality of the research, declaration is not required.
1. Answer: [N/A]
1. Justification: This work does not involve the use of large language models (LLMs) as part of the core methodology. LLM usage, if any, was limited to writing assistance and had no influence on the scientific methods or contributions.
1. Guidelines:
- The answer NA means that the core method development in this research does not involve LLMs as any important, original, or non-standard components.
- Please refer to our LLM policy (https://neurips.cc/Conferences/2025/LLM) for what should or should not be described.