# Jailbreaking Leaves a Trace: Understanding and Detecting Jailbreak Attacks from Internal Representations of Large Language Models
**Authors**: Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis
> University of California, Riverside, CA, USA
(2026)
## Abstract
Jailbreaking large language models (LLMs) has emerged as a critical security challenge with the widespread deployment of conversational AI systems. Adversarial users exploit these models through carefully crafted prompts to elicit restricted or unsafe outputs, a phenomenon commonly referred to as Jailbreaking. Despite numerous proposed defense mechanisms, attackers continue to develop adaptive prompting strategies, and existing models remain vulnerable. This motivates approaches that examine the internal behavior of LLMs rather than relying solely on prompt-level defenses. In this work, we study jailbreaking from both security and interpretability perspectives by analyzing how internal representations differ between jailbreak and benign prompts. We conduct a systematic layer-wise analysis across multiple open-source models, including GPT-J, LLaMA, Mistral, and the state-space model Mamba2, and identify consistent latent-space patterns associated with adversarial inputs. We then propose a tensor-based latent representation framework that captures structure in hidden activations and enables lightweight jailbreak detection without model fine-tuning or auxiliary LLM-based detectors. We further demonstrate that these latent signals can be used to actively disrupt jailbreak execution at inference time. On an abliterated LLaMA 3.1 8B model, selectively bypassing high-susceptibility layers blocks 78% of jailbreak attempts while preserving benign behavior on 94% of benign prompts. This intervention operates entirely at inference time and introduces minimal overhead, providing a scalable foundation for achieving stronger coverage by incorporating additional attack distributions or more refined susceptibility thresholds. Our results provide evidence that jailbreak behavior is rooted in identifiable internal structures and suggest a complementary, architecture-agnostic direction for improving LLM security. Our implementation can be found here (jai, [n. d.]).
**Keywords:** Jailbreaking, Large Language Models, LLM Internal Representations, Self-attention, Hidden Representation, Tensor Decomposition
## 1. Introduction
Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks and have become increasingly integrated into applications spanning diverse domains and user populations. Despite their utility and efforts to align them according to human and safety expectations (Wang et al., 2023; Zeng et al., 2024; Glaese et al., 2022), these models remain highly susceptible to adversarial exploitation, raising significant safety and security concerns given their widespread accessibility. Among such threats, Jailbreaking has emerged as a persistent and particularly concerning attack vector, wherein malicious actors craft carefully engineered prompts to circumvent built-in safety mechanisms (Chao et al., 2024; Zou et al., 2023) and elicit restricted, sensitive, or otherwise disallowed content (Liu et al., 2024). Jailbreak attacks pose substantial risks, as they enable users with harmful intent to manipulate LLMs into producing outputs that violate safety policies, including actionable instructions for malicious activities (Zhang et al., 2024; Wang et al., 2024a; Yu et al., 2023). The growing availability of jailbreak prompts in public repositories, research artifacts, and online forums further exacerbates this issue (Jiang et al., 2024; Shen et al., 2024).
To mitigate these risks, prior work has explored a range of defense strategies, including prompt-level filtering (Wang et al., 2024b), model-level interventions (Zeng et al., 2024; Xiong et al., 2025), reinforcement learning from human feedback (RLHF) (Bai et al., 2022), and the use of auxiliary safety models (Chen et al., 2023; Lu et al., 2024). While such approaches have demonstrated partial effectiveness, they are not without limitations. In practice, even well-aligned models can remain vulnerable under repeated or adaptive attack attempts. Moreover, no single defense mechanism has proven sufficient to counter the continually evolving landscape of jailbreak strategies.
In this study, we investigate a complementary and comparatively underexplored direction: leveraging internal model representations to distinguish jailbreak prompts from benign inputs and to guide mitigation. Our central hypothesis is that adversarial prompts induce distinct and detectable structural patterns within the hidden representations of LLMs, independent of output behavior. To evaluate this hypothesis, we extract layer-wise internal representations, such as multi-head attention outputs and layer output/hidden state representations, from multiple models, including GPT-J, LLaMA, Mistral, and the state-space sequence model Mamba, and apply tensor decomposition (Sidiropoulos et al., 2017) techniques to characterize and compare latent-space behaviors across benign and jailbreak prompts. Building on this analysis, we further demonstrate how these latent representations can be used to identify layers that are particularly susceptible to adversarial manipulation and to intervene during inference by selectively bypassing such layers. This representation-centric framework not only enables reliable detection of jailbreak prompts but also provides a principled mechanism for mitigating harmful behavior without modifying model parameters or relying on output-level filtering. Together, our results suggest that internal representations offer a powerful and generalizable foundation for both understanding and defending against jailbreak attacks beyond surface-level text analysis. Our contributions are as follows:
- Layer-wise Jailbreak Signature: We show that jailbreak and benign prompts exhibit distinct, layer-dependent latent signatures in the internal representations of LLMs, which can be uncovered using tensor decomposition (Sidiropoulos et al., 2017).
- Effective Defense via Targeted Layer Bypass: We demonstrate that these latent signatures can be exploited at inference time to identify susceptible layers and disrupt jailbreak execution through targeted layer bypass.
## 2. Related work
### 2.1. Adversarial Attacks
Adversarial attacks on LLMs encompass a broad class of inputs intentionally crafted to induce unintended, incorrect, or unsafe behaviors (Zou et al., 2023; Chao et al., 2024). Unlike adversarial examples in vision or speech domains, which often rely on imperceptible input perturbations, attacks on LLMs primarily exploit semantic, syntactic, and contextual vulnerabilities in language understanding and generation. By manipulating instructions, context, or interaction structure, adversaries can steer models toward generating factually incorrect information, violating behavioral constraints, or producing harmful or sensitive content, posing significant risks to deployed systems (Liu et al., 2024). Existing adversarial strategies span a wide range of mechanisms, including prompt injection, role-playing and persona manipulation, instruction obfuscation, multi-turn coercion (Yu et al., 2023; Jiang et al., 2024), indirect attacks embedded within external content such as documents or code, and even fine-tuning (Yang et al., 2023; Yao et al., 2023). A central challenge in defending against these attacks is their adaptability: adversarial prompts are often transferable across models and can be easily modified to evade static defenses (Zou et al., 2023). As a result, surface-level or prompt-based mitigation strategies have shown limited robustness.
### 2.2. Jailbreak attacks
A prominent and particularly challenging class of adversarial attacks on LLMs is jailbreaking. Jailbreak attacks aim to circumvent built-in safety mechanisms and alignment constraints, enabling the model to produce outputs that it is explicitly designed to refuse. These attacks often rely on prompt engineering techniques such as hypothetical scenarios, instruction overriding, contextual reframing, or step-by-step coercion, effectively manipulating the model’s internal decision-making processes (Wei et al., 2023). Unlike general adversarial prompting, jailbreak attacks explicitly target safety guardrails and content moderation policies, making them a critical concern from both security and governance perspectives (Liu et al., 2024; Zou et al., 2023; Jiang et al., 2024). Despite extensive efforts to harden models through alignment training and reinforcement learning from human feedback (Bai et al., 2022; Wang et al., 2023; Glaese et al., 2022), jailbreak prompts continue to evolve, highlighting fundamental limitations in current defense approaches. This motivates the need for methods that analyze jailbreak behavior at the level of internal model representations, rather than relying solely on external prompt or output inspection.
### 2.3. Jailbreak Defenses
Prior work on defending against jailbreak attacks in LLMs has primarily focused on prompt and output-level safeguards. Rule-based filtering and keyword matching are commonly used due to their low computational cost, but such approaches are brittle and easily bypassed through paraphrasing, obfuscation, or multi-turn prompting (Deng et al., 2024). Learning-based defenses, including supervised classifiers and auxiliary LLMs for intent detection or self-evaluation (Wang et al., 2025), improve robustness but introduce additional complexity, inference overhead, and new attack surfaces. Model-level defenses, such as alignment fine-tuning, reinforcement learning from human feedback (RLHF), and policy-based or constitutional training, aim to internalize safety constraints within the model (Ouyang et al., 2022). While effective to an extent, these approaches are resource-intensive and require continual updates as jailbreak strategies evolve. Moreover, even extensively aligned models remain susceptible to jailbreak attacks, indicating fundamental limitations in current training-based defenses. Overall, existing defenses largely treat jailbreak detection as a black-box problem and rely on external signals from prompts or generated outputs. In contrast, fewer works explore the internal representations of LLMs as a basis for defense (Candogan et al., 2025). This gap motivates approaches that leverage latent-space and layer-wise signals to identify jailbreak behavior in an interpretable and architecture-agnostic manner, without requiring additional fine-tuning or auxiliary models.
## 3. Preliminaries
### 3.1. Tensors
Tensors (Sidiropoulos et al., 2017) are multi-dimensional arrays that generalize one-dimensional arrays (vectors) and two-dimensional arrays (matrices) to higher dimensions. The number of dimensions of a tensor is traditionally referred to as its order, or equivalently, the number of modes, while the size of each mode is called its dimensionality. For instance, we may refer to a third-order tensor as a three-mode tensor $\boldsymbol{\mathscr{X}} \in \mathbb{R}^{I \times J \times K}$.
### 3.2. Tensor Decomposition
Tensor decomposition (Sidiropoulos et al., 2017) is a popular data science tool for discovering underlying low-dimensional patterns in data. We focus on the CANDECOMP/PARAFAC (CP) decomposition (Sidiropoulos et al., 2017), one of the most widely used tensor decomposition models, which decomposes a tensor into a sum of rank-one components. We use CP decomposition because of its simplicity and interpretability. The CP decomposition of a three-mode tensor $\boldsymbol{\mathscr{X}} \in \mathbb{R}^{I \times J \times K}$ is a sum of three-way outer products, that is, $\boldsymbol{\mathscr{X}} \approx \sum_{r=1}^{R} a_r \circ b_r \circ c_r$, where $R$ is the rank of the decomposition, $a_r \in \mathbb{R}^{I}$, $b_r \in \mathbb{R}^{J}$, and $c_r \in \mathbb{R}^{K}$ are the factor vectors, and $\circ$ denotes the outer product. The rank of a tensor $\boldsymbol{\mathscr{X}}$ is the minimal number of rank-1 tensors required to exactly reconstruct it:
$$
\operatorname{rank}(\boldsymbol{\mathscr{X}}) = \min\left\{ R : \boldsymbol{\mathscr{X}} = \sum_{r=1}^{R} a_r \circ b_r \circ c_r,\ a_r \in \mathbb{R}^{I},\, b_r \in \mathbb{R}^{J},\, c_r \in \mathbb{R}^{K} \right\}
$$
Selecting an appropriate rank is critical, as it directly affects both the expressiveness and interpretability of the decomposition. Lower-rank approximations yield compact and computationally efficient representations, while higher ranks can capture richer structure at the cost of increased complexity and potential noise.
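To make the decomposition concrete, the CP model above can be fit with alternating least squares (ALS). The following is a minimal NumPy sketch for a three-mode tensor; function names are illustrative, and production code would typically use a library such as TensorLy rather than this bare-bones version:

```python
import numpy as np

def unfold(X, mode):
    """Matricize a 3-mode tensor along the given mode."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def khatri_rao(A, B):
    """Column-wise Kronecker product of A (I x R) and B (J x R) -> (I*J) x R."""
    R = A.shape[1]
    return (A[:, None, :] * B[None, :, :]).reshape(-1, R)

def cp_als(X, rank, n_iter=200, seed=0):
    """Fit a rank-R CP model X ~ sum_r a_r o b_r o c_r via alternating least squares."""
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((dim, rank)) for dim in X.shape]
    for _ in range(n_iter):
        for mode in range(3):
            others = [factors[m] for m in range(3) if m != mode]
            kr = khatri_rao(others[0], others[1])  # ordering matches unfold()
            gram = (others[0].T @ others[0]) * (others[1].T @ others[1])
            factors[mode] = unfold(X, mode) @ kr @ np.linalg.pinv(gram)
    return factors  # [A (I x R), B (J x R), C (K x R)]

# Sanity check: recover a synthetic rank-2 tensor.
rng = np.random.default_rng(1)
A0, B0, C0 = (rng.standard_normal((6, 2)), rng.standard_normal((5, 2)),
              rng.standard_normal((4, 2)))
X = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
A, B, C = cp_als(X, rank=2)
X_hat = np.einsum('ir,jr,kr->ijk', A, B, C)
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
```

On an exact low-rank tensor like this, ALS drives the relative reconstruction error close to zero; choosing the rank for noisy real activations is the harder question discussed above.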
### 3.3. Transformer Architecture
A Transformer (Vaswani et al., 2023) is a neural network architecture designed for modeling sequential data through attention mechanisms rather than recurrence or convolution. Transformers process input sequences in parallel and capture long-range dependencies by explicitly modeling interactions between all tokens in a sequence.
A transformer consists of a stack of layers, each composed of two primary submodules: multi-head self-attention and a position-wise feed-forward network (FFN). Residual connections and layer normalization are applied around each submodule to stabilize training.
#### 3.3.1. Multi-Head Self-Attention
Multi-head self-attention enables the model to attend to different parts of the input sequence simultaneously. Given an input representation $H \in \mathbb{R}^{T \times d}$, each attention head projects $H$ into query ($Q$), key ($K$), and value ($V$) matrices:
$$
Q = HW^{Q}, \quad K = HW^{K}, \quad V = HW^{V}.
$$
Attention is computed as
$$
\operatorname{Attn}(Q, K, V) = \operatorname{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,
$$
where $d_k$ is the dimensionality of each attention head. Multiple attention heads operate in parallel, and their outputs are concatenated and linearly projected, allowing the model to capture diverse relational patterns across tokens.
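As a sketch of this computation, the NumPy code below applies scaled dot-product attention per head and concatenates the results; the random weight matrices stand in for learned parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head_attention(H, W_Q, W_K, W_V, W_O, n_heads):
    """Split the projections into n_heads slices, attend per head,
    then concatenate and apply the output projection W_O."""
    T, d = H.shape
    d_k = d // n_heads
    heads = []
    for h in range(n_heads):
        cols = slice(h * d_k, (h + 1) * d_k)
        heads.append(attention(H @ W_Q[:, cols], H @ W_K[:, cols], H @ W_V[:, cols]))
    return np.concatenate(heads, axis=-1) @ W_O  # (T, d)

rng = np.random.default_rng(0)
T, d, n_heads = 7, 8, 2
H = rng.standard_normal((T, d))
W_Q, W_K, W_V, W_O = (rng.standard_normal((d, d)) for _ in range(4))
out = multi_head_attention(H, W_Q, W_K, W_V, W_O, n_heads)
```

Each row of the softmax-normalized score matrix sums to one, so every output token is a convex combination of the value vectors.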
#### 3.3.2. Layer Outputs and Hidden Representations
Each transformer layer produces a hidden representation (or layer output) that serves as input to the next layer. Formally, for layer $\ell$, an intermediate representation $\tilde{H}^{(\ell)}$ and the layer output $H^{(\ell)}$ are given by:
$$
\tilde{H}^{(\ell)} = \mathrm{LN}\left(H^{(\ell-1)} + \mathrm{MHA}\left(H^{(\ell-1)}\right)\right),
$$
followed by
$$
H^{(\ell)} = \mathrm{LN}\left(\tilde{H}^{(\ell)} + \mathrm{FFN}\left(\tilde{H}^{(\ell)}\right)\right),
$$
where $\mathrm{MHA}$ denotes multi-head attention, $\mathrm{FFN}$ denotes the feed-forward network, and $\mathrm{LN}$ denotes layer normalization.
The sequence of hidden states $\{H^{(1)}, \dots, H^{(L)}\}$ captures increasingly abstract features, ranging from local syntactic patterns in early layers to semantic and task-relevant representations in deeper layers. These intermediate representations are commonly referred to as hidden layer activations and form the basis for interpretability and internal behavior analysis.
### 3.4. Model Types: Base, Instruction-Tuned, and Abliterated Models
Large language models (LLMs) can be categorized based on their training and alignment processes, which influence their behavior under adversarial conditions.
**Base Models**
These are pretrained on large-scale text corpora using self-supervised objectives without explicit instruction tuning. Examples include GPT-J and LLaMA. Base models capture broad language patterns but lack alignment with human preferences, making them prone to generating unrestricted or unsafe outputs.
**Instruction-Tuned Models**
Derived from base models via supervised fine-tuning on datasets containing human instructions and responses, these models improve instruction-following capabilities (Zhang et al., 2026) and enforce safety constraints, such as refusing harmful queries. While instruction tuning enhances safety, these models remain susceptible to sophisticated jailbreak prompts.
**Abliterated Models**
Abliterated models are instruction-tuned models where alignment or safety components have been removed, disabled, or bypassed (Young, 2026). Such models behave more like base models but may retain subtle differences due to prior fine-tuning. Abliterated models serve as valuable testbeds to study jailbreak vulnerabilities and internal representational changes resulting from alignment removal.
Analyzing these model types enables us to investigate how alignment and instruction tuning affect internal layer activations and latent patterns, informing the design of robust jailbreak detection methods that generalize across model variations.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Technical Process Diagram: Latent Analysis and Classifier Training for LLM Jailbreak Detection and Mitigation
### Overview
This diagram illustrates a two-phase technical framework for detecting and mitigating "jailbreak" attacks on Large Language Models (LLMs). The top phase focuses on **offline training** of a classifier using latent space analysis of model activations. The bottom phase describes the **inference-time application** of this classifier to identify and bypass model layers susceptible to jailbreak attacks, thereby preventing harmful output generation.
### Components/Axes & Process Flow
The diagram is divided into two primary horizontal sections:
**1. Top Section: Latent Analysis and Classifier Training (Offline Phase)**
* **Input:** "Input prompts to LLM" are shown as two document icons, labeled **"Benign"** (green) and **"Jailbreak"** (purple with a skull icon).
* **Process:**
* These prompts are fed into an "LLM Layers" stack (represented as a blue 3D block).
* "Intermediate layers' data" is extracted. A callout specifies: "Layer *T* extracted for a single input prompt," showing a 2D grid with axes "Sequence length" and "Token embedding length."
* This extraction is repeated: "Extract representations from each layer *l = 1,2,3...L* forms a 3-mode tensor." A 3D grid visualizes this tensor with axes: "Sequence length," "Token embedding length," and "prompts."
* The tensor undergoes "Decomposed Factors (CP Decomposition)," visualized as three factor matrices labeled **A**, **B**, and **C**.
* **Output/Goal:** "Effective separation of factors in the latent space." A 2D scatter plot shows green circles ("Benign") clustered separately from purple skull icons ("Jailbreak"). The final step is "Training a classifier on prompt-mode factors."
**2. Bottom Section: Jailbreak Mitigation at Inference (Deployment Phase)**
* **Input:** An "Inference time prompt" (document icon) is fed into the same "LLM Layers" stack.
* **Process:**
* "Layer *T* extracted for the prompt at inference." The extracted "Layer output/ MHA output" is multiplied (`*`) with the pre-trained factor matrix **A**.
* This results in a "Resultant prompt layer in latent space," shown as a vector labeled "latent factors."
* The system will "Project layer representations onto learned latent factors."
* Next, it will "Classify using the trained factors" (green circle and purple skull icons with question marks).
* A decision rule is stated: "If layer wise Jailbreak Prob > threshold, layer might be showing more signals of jailbreak attack."
* This leads to "Identify Jailbreak sensitive layers." A list shows layers `a` through `x`. Layers **b** and **e** are highlighted in red with a red 'X', indicating they are "bypassed as they exhibit stronger jailbreak signature."
* The final step is "Susceptible layer bypassing," showing the LLM layer stack with layers `b` and `e` visually skipped (red lines bypass them), leading to "Layer bypass prevented harmful generation."
### Detailed Analysis / Content Details
* **Key Technical Terms:** LLM (Large Language Model), MHA (Multi-Head Attention), CP Decomposition (CANDECOMP/PARAFAC), Latent Space, 3-mode Tensor, Jailbreak Attack.
* **Data Flow:** The process is a pipeline: Raw Prompts -> Layer-wise Activation Extraction -> Tensor Construction -> Decomposition -> Classifier Training -> Inference-time Projection -> Classification -> Layer Identification -> Conditional Bypassing.
* **Visual Coding:**
* **Color:** Green consistently represents "Benign" data/behavior. Purple represents "Jailbreak" data/behavior. Red is used for layers identified as sensitive/bypassed.
* **Icons:** Document icons for prompts. Skull icon for jailbreak. 'X' marks for bypassed layers.
* **Spatial Layout:** The training phase flows left-to-right on top. The inference phase flows left-to-right on the bottom, with a central connection showing the use of the pre-trained factor matrix **A**.
### Key Observations
1. **Layer-Specific Analysis:** The method does not treat the LLM as a black box. It analyzes activations at *each intermediate layer* (`l = 1,2,3...L`) to find granular attack signatures.
2. **Factor Decomposition:** The core innovation is using tensor decomposition (CP) on the 3-mode activation tensor (sequence × embedding × prompt) to disentangle the underlying factors that distinguish benign from jailbreak prompts.
3. **Dynamic Bypassing:** Mitigation is not a static filter. It dynamically identifies which specific layers in the model are most activated by a given suspicious prompt at inference time and bypasses only those layers.
4. **Threshold-Based Decision:** The system uses a probabilistic threshold ("Jailbreak Prob > threshold") to decide on layer bypassing, introducing a tunable sensitivity parameter.
### Interpretation
This diagram outlines a sophisticated, **proactive defense mechanism** against adversarial attacks on LLMs. The underlying premise is that jailbreak prompts cause the model's internal representations (activations) to deviate from a "benign" manifold in a predictable way that can be captured in the latent space of layer activations.
* **What it demonstrates:** It shows a complete pipeline from offline analysis to real-time defense. The key insight is that jailbreak attacks leave a "signature" not just in the final output, but in the *process* of computation across specific layers. By learning this signature via tensor factorization, the system can detect it early.
* **How elements relate:** The top section (training) creates the "detector" (the classifier and factor matrices). The bottom section (inference) is the "deployment" where this detector is applied to new prompts. The critical link is the projection of new layer outputs onto the learned latent factors (**A**), allowing for comparison with the trained classifier.
* **Notable Implications:**
* **Efficiency:** Bypassing only sensitive layers could be more computationally efficient than re-running the prompt or using a separate large classifier model.
* **Adaptability:** The classifier can be retrained as new jailbreak techniques emerge, updating the latent factor understanding.
* **Transparency:** This method provides some interpretability—by identifying *which* layers are sensitive, developers might gain insight into *how* the model is being manipulated.
* **Potential Limitation:** The effectiveness hinges on the assumption that future jailbreak attacks will project onto the same latent factors learned during training. Novel attacks might evade detection. The threshold setting also presents a trade-off between false positives (blocking benign prompts) and false negatives (missing jailbreaks).
</details>
Figure 1. Proposed method: (top) Latent analysis and classifier training: self-attention and layer-output tensors are constructed from input prompts, decomposed via CP decomposition, and used to learn jailbreak-discriminative latent factors. (Bottom) Inference-time mitigation: internal representations from a new prompt are projected onto the learned factors to estimate layer-wise jailbreak susceptibility; layers exhibiting strong adversarial signals are bypassed to suppress jailbreak behavior.
## 4. Proposed Method
We study jailbreak behavior through internal model representations using two complementary analysis pipelines.
### 4.1. Hidden Representation Analysis
**Model Suite and Representation Extraction**
We perform hidden representation analysis by examining both multi-head self-attention outputs and layer-wise hidden representations across a diverse set of large language models. Specifically, we evaluate three base models: GPT-J-6B (Wang et al., 2021), LLaMA-3.1-8B (AI@Meta, 2024b), and Mistral-7B-v0.1 (Jiang et al., 2023); three instruction-tuned models: GPT-JT-6B (Computer, 2022), LLaMA-3.1-8B-Instruct (AI@Meta, 2024a), and Mistral-7B-Instruct-v0.1 (AI, 2023); one abliterated model, LLaMA-3.1-8B-Instruct-Abliterated (Labonne, [n. d.]); and one state-space sequence model, Mamba-2.8b-hf (mam, [n. d.]). This selection enables a systematic comparison across different stages of alignment and architectural paradigms.
The inclusion of base, instruction-tuned, and abliterated models is intentional. Base models offer insight into unaligned latent structures; instruction-tuned models show how safety fine-tuning alters internal processing; and abliterated models help isolate the role of alignment layers. We focus not on output quality but on how jailbreak and benign prompts are internally encoded across this alignment spectrum. From this perspective, the specific semantic quality of the output is not critical; instead, we focus on identifying discriminative patterns that persist across model variants. Together, these model categories enable us to analyze jailbreak behavior across the full alignment spectrum and assess whether latent-space signatures of jailbreak prompts are consistent and model-agnostic.
**Tensor Construction and Latent Factors for Jailbreak Detection**
For a given set of input prompts containing both benign and jailbreak instances, we extract multi-head attention representations and hidden state/layer output from each layer of the model. For a single prompt, the resulting representation has dimensions $1× T× d$ , where $T$ denotes the sequence length and $d$ is the hidden dimensionality. By stacking representations across multiple prompts, we construct a third-order tensor of size $N× T× d$ , where $N$ is the number of prompts as illustrated in Fig. 1.
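In practice, hidden states can be collected with forward hooks or, for Hugging Face models, by passing `output_hidden_states=True` and reading `outputs.hidden_states`. Because prompts differ in length, one simple convention is to right-pad or truncate each prompt's $T_i \times d$ representation to a common $T$ before stacking. The helper below is an illustrative sketch of that stacking step only; the extraction itself is assumed to have already produced one array per prompt:

```python
import numpy as np

def build_prompt_tensor(per_prompt_reps, T, d):
    """Stack per-prompt layer representations (each T_i x d) into an N x T x d
    tensor: sequences shorter than T are zero-padded, longer ones truncated."""
    N = len(per_prompt_reps)
    X = np.zeros((N, T, d))
    for n, rep in enumerate(per_prompt_reps):
        t = min(rep.shape[0], T)  # keep at most T tokens of this prompt
        X[n, :t, :] = rep[:t, :]
    return X

# Two toy "prompts" with different sequence lengths, hidden size d = 4.
reps = [np.ones((3, 4)), np.ones((6, 4))]
X = build_prompt_tensor(reps, T=5, d=4)
```

The resulting $N \times T \times d$ array is exactly the third-order tensor fed to the CP decomposition.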
To analyze the latent structure of these internal representations, we apply the CANDECOMP/PARAFAC (CP) tensor decomposition to factorize the tensor into three low-rank factors corresponding to the prompt, sequence, and hidden dimensions. The factor associated with the prompt mode captures latent patterns that reflect how different prompts are encoded internally across model layers. Prior work (Papalexakis, 2018; Zhao et al., 2019) has shown that tensor decomposition-derived latent patterns effectively capture meaningful structure for classification and detection tasks, even with limited data (Qazi et al., 2024; Kadali and Papalexakis, 2025).
We use these prompt-mode latent factors as features for a lightweight classifier that distinguishes jailbreak prompts from benign prompts. This classifier serves two purposes: to assess separability between prompt types in the latent space (Fig. 2), and as a mechanism to estimate layer-wise susceptibility to jailbreak behavior, which is leveraged by the mitigation method described next.
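A lightweight classifier of this kind can be as simple as logistic regression on the prompt-mode factor matrix. The sketch below trains one with plain gradient descent on synthetic factors (labels: 0 = benign, 1 = jailbreak); the data and hyperparameters are illustrative, not the paper's actual setup:

```python
import numpy as np

def train_logreg(F, y, lr=0.1, n_iter=500):
    """Logistic regression on prompt-mode factors F (N x R) with binary labels y."""
    w, b = np.zeros(F.shape[1]), 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(F @ w + b)))  # predicted jailbreak probability
        g = p - y                               # gradient of the log loss
        w -= lr * (F.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

def predict(F, w, b):
    return (1.0 / (1.0 + np.exp(-(F @ w + b))) > 0.5).astype(int)

# Synthetic prompt-mode factors: benign centered at -1, jailbreak at +1.
rng = np.random.default_rng(0)
F = np.vstack([rng.normal(-1.0, 0.5, (40, 4)), rng.normal(1.0, 0.5, (40, 4))])
y = np.concatenate([np.zeros(40), np.ones(40)])
w, b = train_logreg(F, y)
acc = (predict(F, w, b) == y).mean()
```

Because the factor matrix is only $N \times R$ with small $R$, training such a classifier is negligible compared to a single LLM forward pass.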
**Layer-wise Separability and Susceptibility**
To localize where jailbreak-related information is expressed within the network, we train separate classifiers on latent factors extracted from individual layers. This produces a layer-wise profile that quantifies how strongly each layer encodes representations associated with adversarial prompts. From an interpretability perspective, these layers can be viewed as critical representational stages where benign and adversarial behaviors diverge.
Importantly, we do not interpret high separability as evidence that a layer directly causes jailbreak behavior. Rather, it indicates that these layers capture discriminative representations associated with jailbreak prompts, making them especially informative for detection. This observation provides insight into how adversarial instructions propagate through the model and forms the basis for the inference-time intervention introduced in the following section.
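One way to turn per-layer classifiers into a susceptibility profile is to score each layer's prompt-mode factors by held-out classification accuracy. The sketch below uses a nearest-centroid rule as a stand-in for the trained classifiers; the synthetic "layers" and shapes are purely illustrative:

```python
import numpy as np

def layer_susceptibility(layer_factors, y, train_frac=0.8, seed=0):
    """Held-out nearest-centroid accuracy per layer: higher means the layer's
    prompt-mode factors separate benign (0) from jailbreak (1) more strongly."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    cut = int(train_frac * len(y))
    tr, te = idx[:cut], idx[cut:]
    scores = []
    for F in layer_factors:                  # one (N x R) factor matrix per layer
        c0 = F[tr][y[tr] == 0].mean(axis=0)  # benign centroid
        c1 = F[tr][y[tr] == 1].mean(axis=0)  # jailbreak centroid
        pred = (np.linalg.norm(F[te] - c1, axis=1)
                < np.linalg.norm(F[te] - c0, axis=1)).astype(int)
        scores.append(float((pred == y[te]).mean()))
    return np.array(scores)

# "Layer 1": well-separated classes; "layer 2": no class signal at all.
rng = np.random.default_rng(1)
y = np.concatenate([np.zeros(50), np.ones(50)]).astype(int)
F_sep = np.vstack([rng.normal(-1.0, 0.3, (50, 5)), rng.normal(1.0, 0.3, (50, 5))])
F_noise = rng.standard_normal((100, 5))
scores = layer_susceptibility([F_sep, F_noise], y)
```

Layers whose scores sit well above chance are the ones flagged as candidates for the inference-time intervention.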
<details>
<summary>figures/layer_9_layer_output_tsne_rank_20_regenerated.png Details</summary>

### Visual Description
## Scatter Plot: t-SNE Visualization of Decomposed factors
### Overview
This image is a two-dimensional scatter plot generated using t-SNE (t-Distributed Stochastic Neighbor Embedding), a dimensionality reduction technique. The plot visualizes the distribution of data points from two distinct classes, labeled "Benign" and "Jailbreak," in a reduced feature space. The primary purpose is to show how separable or clustered these two classes are based on their decomposed factors.
### Components/Axes
* **Title:** "t-SNE Visualization of Decomposed factors" (centered at the top).
* **X-Axis:** Labeled "t-SNE Dimension 1". The scale runs from approximately -40 to +40, with major tick marks at intervals of 20 (-40, -20, 0, 20, 40).
* **Y-Axis:** Labeled "t-SNE Dimension 2". The scale runs from approximately -40 to +30, with major tick marks at intervals of 10 (-40, -30, -20, -10, 0, 10, 20, 30).
* **Legend:** Located in the top-right corner of the plot area. It contains two entries:
* A blue circle symbol followed by the text "Benign".
* A red circle symbol followed by the text "Jailbreak".
* **Grid:** A light gray grid is present in the background, aligned with the major tick marks on both axes.
### Detailed Analysis
The plot contains hundreds of individual data points, each represented by a semi-transparent circle. The transparency allows for the visualization of point density where clusters overlap.
* **"Benign" Class (Blue Points):**
* **Spatial Distribution:** The blue points are predominantly clustered on the left side of the plot (negative values on Dimension 1).
* **Key Clusters:**
1. A large, dense, and vertically elongated cluster spans from approximately (-40, -10) to (-30, 30). This is the most prominent blue cluster.
2. A smaller, distinct cluster is located in the bottom-left quadrant, centered around (-25, -25).
3. Several smaller, looser groupings and individual points are scattered in the central region, roughly between Dimension 1 values of -20 and 0.
* **Trend:** The overall visual trend for the blue class is a concentration along the left edge of the plot, with some dispersion toward the center.
* **"Jailbreak" Class (Red Points):**
* **Spatial Distribution:** The red points are predominantly clustered on the right side of the plot (positive values on Dimension 1).
* **Key Clusters:**
1. A very large, dense, and sprawling cluster dominates the right half of the plot. It extends from approximately (0, -35) to (40, 15), with the highest density in the bottom-right quadrant (e.g., around (20, -30)).
2. A smaller, separate cluster is visible in the upper-right area, centered near (30, 10).
3. A notable, isolated cluster of red points appears in the upper-middle region, centered around (0, 15).
* **Trend:** The red class shows a strong concentration on the right side, with a significant dense region in the lower-right and a clear separation of a smaller cluster in the upper-middle area.
* **Overlap Region:**
* There is a transitional zone in the center of the plot, roughly between Dimension 1 values of -10 and 10, where blue and red points intermingle. This indicates that for some data instances, the decomposed factors of "Benign" and "Jailbreak" are not perfectly separable in this 2D projection.
### Key Observations
1. **Clear Class Separation:** There is a strong, visually apparent separation between the majority of the "Benign" (left) and "Jailbreak" (right) data points along the first t-SNE dimension.
2. **Cluster Structure:** Both classes form multiple distinct clusters rather than a single homogeneous cloud, suggesting potential sub-categories or variations within each class.
3. **Density Variation:** The "Jailbreak" class appears to have a higher point density in its main cluster compared to the more spread-out primary "Benign" cluster.
4. **Isolated Anomalies:** A few blue points are found deep within red clusters (e.g., near (30, 5)) and vice-versa (e.g., a red point near (-35, -5)). These could represent outliers, mislabeled data, or adversarial examples.
### Interpretation
This t-SNE plot provides strong visual evidence that the "decomposed factors" extracted from the underlying data contain sufficient information to distinguish between "Benign" and "Jailbreak" instances in most cases. The clear spatial segregation implies that a classifier trained on these factors would likely achieve high accuracy.
The presence of distinct clusters within each class is particularly interesting. It suggests that "Benign" and "Jailbreak" are not monolithic categories. For example, the different blue clusters might correspond to different types of benign queries or user behaviors, while the separate red clusters could represent different jailbreak strategies or attack vectors. The isolated cluster of red points in the upper-middle area (around (0,15)) is especially noteworthy, as it is spatially separated from the main red mass, potentially indicating a unique subclass of jailbreak attempts.
The central overlap region is critical from a security perspective. Data points in this zone represent the "hard cases" where the decomposed features of a benign and a jailbreak attempt are similar. These are the instances where a detection system is most likely to make errors (false positives or false negatives). Analyzing the specific characteristics of these overlapping points could be key to improving the robustness of a detection model.
In summary, the visualization demonstrates the effectiveness of the decomposition method for feature separation while also revealing the complex, multi-faceted nature of both benign and malicious behaviors within the analyzed system.
</details>
Figure 2. t-SNE visualization of prompt-mode CP factors for a representative model. Clear separation between benign and jailbreak clusters indicates that internal latent factors capture strong structure, motivating their use for jailbreak detection. Similar patterns are observed across models.
### 4.2. Layer-Aware Mitigation via Latent-Space Susceptibility
To mitigate jailbreak attacks, we propose a representation-level defense that leverages layer-wise susceptibility signals derived from internal representations. By identifying layers that strongly encode jailbreak-specific representations, we selectively bypass them during inference (Elhoushi et al., 2024; Shukor and Cord, 2024). While prior work has explored layer bypassing primarily to reduce computation and improve inference efficiency, our approach demonstrates that such bypassing can simultaneously reduce computational cost and mitigate jailbreak behavior (Luo et al., 2025; Lawson and Aitchison, 2025).
We conduct this experiment on an abliterated model (LLaMA-3.1-8B). Base models are excluded from this intervention as they primarily perform next-token prediction without alignment constraints, making output-based safety evaluation less meaningful. Instruction-tuned models are also not ideal candidates, as their built-in guardrails obscure whether observed safety improvements arise from our method or from prior alignment. Abliterated models, which lack safety mechanisms while retaining instruction-tuned structure, provide a suitable testbed for isolating the effects of our approach.
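Conceptually, the bypass reduces to treating flagged blocks as identity maps on the residual stream. A toy sketch with stand-in additive "blocks" (the real intervention skips transformer layers, not scalar functions; the bypass indices here are hypothetical):

```python
def forward(hidden, layers, bypass=frozenset()):
    """Run a stack of layers, skipping any whose index is flagged for bypass."""
    for i, layer in enumerate(layers):
        if i in bypass:
            continue          # treat this block as the identity map
        hidden = layer(hidden)
    return hidden

# Stand-in "blocks": block k adds k to the running hidden state.
layers = [lambda h, k=k: h + k for k in range(4)]

full = forward(1.0, layers)                  # all four blocks applied
pruned = forward(1.0, layers, bypass={1, 2}) # high-susceptibility blocks skipped
```

In the actual model, the same effect is obtained by routing the residual stream around the selected transformer layers at inference time, leaving all weights untouched.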
**Layer-wise Projection and Jailbreak Susceptibility Scoring**
Given an input prompt, we extract intermediate representations from each transformer layer of the model. Let $x∈ℝ^d$ denote the feature vector obtained from a specific layer (e.g., multi-head self-attention output or layer output).
**Latent Projection.**
For each layer, we project the extracted features onto a lower-dimensional latent space defined by factors obtained from tensor decomposition of the corresponding instruct model representations obtained in § 4.1. Let $W∈ℝ^{d×r}$ denote the matrix of $r$ basis vectors (factors). The projected representation is computed as:
$$
z=W^⊤x,
$$
where $z∈ℝ^r$ is the latent feature representation. This operation constitutes a linear projection that preserves task-relevant structure encoded by the factors.
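A minimal NumPy sketch of this projection (the dimensions match the paper's setting, but `W` here is random, standing in for the learned CP factors):

```python
import numpy as np

d, r = 4096, 20                      # hidden size and CP rank (r = 20, as in § 5.2)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, r))      # stand-in for the factor matrix from § 4.1
x = rng.standard_normal(d)           # layer feature vector for one prompt

z = W.T @ x                          # latent projection z = W^T x, shape (r,)
```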
**Layer-wise Jailbreak Probability Estimation.**
The projected features are passed to a classifier trained to distinguish between benign and jailbreak prompts. For a logistic regression classifier, the probability of a jailbreak at a given layer is computed as:
$$
p=σ(w^⊤z+b),
$$
where $w$ and $b$ denote the classifier weights and bias, respectively, and $σ(·)$ is the sigmoid function:
$$
σ(x)=\frac{1}{1+e^{-x}}.
$$
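As a minimal illustration, the per-layer score reduces to a dot product followed by a sigmoid (the classifier parameters `w` and `b` below are hypothetical, not values from the paper):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

r = 20                        # latent dimension, matching the CP rank
z = np.ones(r)                # projected latent feature z = W^T x
w, b = np.full(r, 0.2), -1.0  # hypothetical trained weights and bias

p = sigmoid(w @ z + b)        # jailbreak probability at this layer
```

Here the logit is $w^⊤z+b = 4.0 - 1.0 = 3.0$, giving $p = σ(3) ≈ 0.95$.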
**Classifier Training Objective.**
The classifier is trained using labeled projected representations $(z_i,y_i)$ , where $y_i∈\{0,1\}$ labels the prompt as benign ( $y_i=0$ ) or jailbreak ( $y_i=1$ ). The parameters are optimized by minimizing the binary cross-entropy loss:
$$
L=-\frac{1}{N}\sum_{i=1}^{N}\left[y_i\log p_i+(1-y_i)\log(1-p_i)\right],
$$
where $N$ denotes the number of samples and $p_i$ is the predicted probability for sample $i$ .
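A self-contained sketch of this objective, optimized by plain gradient descent. The well-separated synthetic Gaussian clusters below stand in for the real projected latents and are an assumption for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
r, N = 20, 200
# Synthetic projected latents: first half benign (y=0), second half jailbreak (y=1).
Z = np.vstack([rng.normal(-1.0, 1.0, (N // 2, r)),
               rng.normal(+1.0, 1.0, (N // 2, r))])
y = np.repeat([0.0, 1.0], N // 2)

w, b, lr = np.zeros(r), 0.0, 0.1
for _ in range(500):                            # gradient descent on the BCE loss
    p = 1.0 / (1.0 + np.exp(-(Z @ w + b)))      # p_i = sigma(w^T z_i + b)
    grad = p - y                                # dL/dlogit_i for cross-entropy
    w -= lr * Z.T @ grad / N
    b -= lr * grad.mean()

p = np.clip(p, 1e-12, 1 - 1e-12)                # guard the logs at the boundary
loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
acc = ((p > 0.5) == y).mean()
```

The gradient `p - y` follows from differentiating the loss with respect to the logit, which is what makes logistic regression a natural, lightweight choice here.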
**Layer Susceptibility Interpretation.**
Layers exhibiting higher classification performance (e.g., F1 score) indicate stronger representational separability between benign and jailbreak prompts within the latent space. We interpret such layers as being more susceptible to jailbreak-style perturbations, as they encode discriminative adversarial features.
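Susceptibility scoring thus amounts to computing F1 per layer and thresholding. A small sketch (the layer indices, predictions, and the 0.9 threshold below are hypothetical):

```python
import numpy as np

def f1(y_true, y_pred):
    """Binary F1 from hard predictions (1 = jailbreak)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

y_true = np.array([0, 0, 1, 1, 1, 0])
# Hypothetical per-layer classifier predictions on held-out prompts.
layer_preds = {4:  np.array([0, 1, 1, 0, 1, 0]),
               18: np.array([0, 0, 1, 1, 1, 0])}

scores = {L: f1(y_true, preds) for L, preds in layer_preds.items()}
# Layers above a susceptibility threshold become bypass candidates.
susceptible = sorted(L for L, s in scores.items() if s >= 0.9)
```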
## 5. Experimental Evaluation
<details>
<summary>x2.png Details</summary>

### Visual Description
## Heatmap Chart: F1 Score Analysis of Transformer Models
### Overview
The image displays two side-by-side heatmaps comparing the F1 scores (a measure of model accuracy) across different layers of various transformer-based language models. The left heatmap analyzes "Layer Output F1 Scores," while the right analyzes "MHA (Multi-Head Attention) Output F1 Scores." The charts use a color gradient from red (low score, ~0.5) to dark green (high score, 1.0) to visualize performance.
### Components/Axes
* **Chart Titles:**
* Left Chart: "Layer Output F1 Scores"
* Right Chart: "MHA Output F1 Scores"
* **Y-Axis (Vertical):** Labeled "Model." It lists seven distinct models from top to bottom:
1. GPT-J-Base
2. GPT-JT-Instruct
3. Llama-Base
4. Llama-Instruct
5. Llama-Abliterated
6. Mistral-Base
7. Mistral-Instruct
* **X-Axis (Horizontal):** Labeled "Layer Number." It lists layers from 0 to 31 (inclusive), indicating a 32-layer model architecture.
* **Color Scale/Legend:** Located to the right of each heatmap. It is a vertical bar showing the F1 Score mapping:
* Dark Green: 1.0
* Green: ~0.9
* Light Green: ~0.8
* Yellow: ~0.7
* Orange: ~0.6
* Red: 0.5
### Detailed Analysis
**Left Heatmap: Layer Output F1 Scores**
* **GPT-J-Base:** Shows very low scores (red/orange, ~0.5-0.6) in layers 0-2. Performance improves to light green (~0.8) in middle layers and reaches dark green (~0.95-1.0) in the final layers (28-31).
* **GPT-JT-Instruct:** Similar pattern to GPT-J-Base but with a slightly better start. Early layers (0-1) are orange/red, quickly improving to yellow/light green by layer 4, and maintaining high green scores through the later layers.
* **Llama-Base:** Starts with moderate scores (yellow/light green, ~0.7-0.8) in early layers. Shows a strong, consistent performance (dark green, ~0.9-1.0) from approximately layer 6 onward.
* **Llama-Instruct:** Exhibits the strongest overall performance. Begins with good scores (green, ~0.85) and achieves near-perfect dark green scores across the vast majority of layers, especially from layer 5 onwards.
* **Llama-Abliterated:** Performance is very similar to Llama-Instruct, with high dark green scores across most layers, indicating robust performance.
* **Mistral-Base:** Starts with lower scores (yellow, ~0.7) in the first few layers. Performance steadily increases, reaching dark green in the final third of the layers (approx. layers 20-31).
* **Mistral-Instruct:** Begins with moderate scores (light green, ~0.8) and shows a clear, steady improvement trend, culminating in dark green scores in the final layers.
**Right Heatmap: MHA Output F1 Scores**
* **GPT-J-Base:** Shows a dramatic improvement. Layers 0-1 are red/orange (~0.5-0.6), but from layer 2 onward, the scores are almost uniformly dark green (~0.95-1.0).
* **GPT-JT-Instruct:** Has a notable outlier. Layer 0 is red (~0.5), but from layer 1 onward, it is almost entirely dark green.
* **Llama-Base, Llama-Instruct, Llama-Abliterated:** All three Llama variants show consistently high performance (dark green) across nearly all layers for MHA outputs, with minimal variation.
* **Mistral-Base & Mistral-Instruct:** Both show a clear gradient. They start with lighter green/yellow scores in the earliest layers (0-3) and transition to solid dark green for the remainder of the layers.
### Key Observations
1. **Layer-wise Progression:** For all models, F1 scores generally improve from earlier to later layers. The final layers (28-31) consistently show the highest performance.
2. **Model Comparison:** The "Instruct" variants (Llama-Instruct, Mistral-Instruct) and the "Abliterated" model generally outperform their "Base" counterparts, especially in the earlier and middle layers.
3. **MHA vs. Layer Output:** The MHA Output scores (right chart) are uniformly higher and reach peak performance much earlier (often by layer 2-5) compared to the general Layer Output scores. This suggests the attention mechanism's representations become highly effective very early in the network.
4. **Early Layer Vulnerability:** The first 0-3 layers are where the most significant performance deficits and variability occur, particularly for the GPT-J family of models.
5. **Outlier:** The single red cell for GPT-JT-Instruct at Layer 0 in the MHA Output chart is a stark outlier against its otherwise perfect green row.
### Interpretation
This analysis provides a granular view of how different transformer models develop their internal representations across depth. The data suggests:
* **Functional Specialization:** The stark difference between early and late layer performance indicates a functional hierarchy. Early layers likely process basic syntactic or token-level information, while later layers integrate this into semantically rich representations suitable for the final task (as measured by F1).
* **Impact of Training:** The superior performance of Instruct and Abliterated models implies that instruction-tuning or specific modification techniques lead to more robust and immediately useful internal representations across the entire network depth, not just at the output.
* **Attention is Key:** The consistently high MHA Output scores highlight the critical role of the multi-head attention mechanism. It appears to generate high-quality, task-relevant signals very early in the processing stream, which the rest of the network then refines.
* **Architectural Insight:** The uniformity of high scores in later layers across all models suggests a convergence point where the models' internal states become highly optimized for the evaluation task, regardless of their initial training differences. The early layers, however, reveal the "fingerprint" of each model's specific architecture and training regimen.
</details>
Figure 3. (Left) Layer-wise F1 scores using CP-decomposed Transformer layer outputs. (Right) Layer-wise F1 scores using CP-decomposed multi-head attention representations. Jailbreak and benign prompts become reliably separable at early depths, suggesting that adversarial intent is encoded shortly after input embedding. The strong performance of attention-based features further indicates that prompt-type information is reflected in token interaction structure as well as in hidden representations.
### 5.1. Datasets and Prompt Scope
We use two prompt sources with a known provenance relationship, allowing us to separate representation learning from mitigation evaluation.
**Training/representation analysis.** For latent-space analysis, we use the Jailbreak Classification dataset from Hugging Face (Hao, [n. d.]). This dataset provides labeled benign and jailbreak prompts and is used to (i) extract layer-wise hidden representations and multi-head attention (MHA) outputs, (ii) learn CP decomposition factors for each layer and representation type, and (iii) train a lightweight classifier on the resulting latent features.
**Test/mitigation evaluation.** To evaluate layer-aware bypass at inference time, we construct a held-out test set of 200 prompts (100 benign, 100 jailbreak) sourced from the ‘In the Wild Jailbreak Prompts’ collection (Shen et al., 2024). Because (Hao, [n. d.]) reports that its jailbreak prompts are drawn from (Shen et al., 2024), the evaluation distribution remains consistent with training while the test prompts stay distinct from the training corpus.
**Attack scope.** Both datasets primarily consist of instruction-level jailbreaks (e.g., persona overrides, explicit safety negation, role-play framing, and meta-instructions). They do not include optimization-based attacks such as GCG (Zou et al., 2023), PAIR (Chao et al., 2024), or other gradient-guided adversarial suffix constructions. Since our framework operates on internal representations, extending it to additional attack families can be achieved by incorporating corresponding prompt distributions during factor learning; we leave such evaluations to future work.
### 5.2. Hidden Representation Analysis
We assess whether internal model representations can reliably separate jailbreak from benign prompts across model families, layers, and representation types. Our analysis spans eight models covering base, instruction-tuned, abliterated, and state-space (Mamba) variants, providing a broad view across training paradigms.
<details>
<summary>x3.png Details</summary>

### Visual Description
## Line Chart: Mamba-2.8B: Block vs Mixer Output F1 Scores
### Overview
This image is a line chart comparing the F1 Score performance of two different output types ("Block Output" and "Mixer Output") across the layers of a model named "Mamba-2.8B". The chart plots the F1 Score on the vertical axis against the layer number on the horizontal axis.
### Components/Axes
* **Chart Title:** "Mamba-2.8B: Block vs Mixer Output F1 Scores" (centered at the top).
* **X-Axis:**
* **Label:** "Layer" (centered below the axis).
* **Scale:** Linear scale from 0 to 64, with major tick marks and labels at intervals of 8 (0, 8, 16, 24, 32, 40, 48, 56).
* **Y-Axis:**
* **Label:** "F1 Score" (rotated vertically, left of the axis).
* **Scale:** Linear scale from 0.5 to 1.0, with major tick marks and labels at intervals of 0.1 (0.5, 0.6, 0.7, 0.8, 0.9, 1.0).
* **Legend:** Located in the bottom-left corner of the plot area.
* **Series 1:** "Block Output" - Represented by a blue line with circular markers.
* **Series 2:** "Mixer Output" - Represented by a purple/magenta line with square markers.
* **Grid:** A light gray grid is present in the background.
### Detailed Analysis
**Trend Verification & Data Points:**
1. **Block Output (Blue line, circular markers):**
* **Trend:** Starts relatively high, experiences a minor initial dip, then shows a steady, smooth upward trend before plateauing in the middle-to-late layers, with a slight decline at the very end.
* **Approximate Data Points:**
* Layer 0: ~0.88
* Layer 4: ~0.87 (slight dip)
* Layer 8: ~0.88
* Layer 16: ~0.94
* Layer 24: ~0.95
* Layer 32: ~0.95
* Layer 40: ~0.95
* Layer 48: ~0.95
* Layer 56: ~0.95
* Layer 64: ~0.91
2. **Mixer Output (Purple line, square markers):**
* **Trend:** Exhibits high volatility in the early layers (0-16), with sharp drops and spikes. After approximately layer 16, it converges with the Block Output line and follows a very similar, stable, high-scoring path for the remainder of the layers, also showing a slight final decline.
* **Approximate Data Points:**
* Layer 0: ~0.83
* Layer 2: ~0.79 (sharp drop)
* Layer 4: ~0.92 (sharp spike)
* Layer 6: ~0.81 (sharp drop)
* Layer 8: ~0.92
* Layer 12: ~0.88
* Layer 16: ~0.94 (converges with Block Output)
* Layer 24: ~0.96
* Layer 32: ~0.95
* Layer 40: ~0.92 (notable dip)
* Layer 48: ~0.96
* Layer 56: ~0.95
* Layer 64: ~0.92
### Key Observations
* **Early Layer Instability:** The most striking feature is the significant volatility of the Mixer Output in the first 16 layers, contrasting with the relatively stable, gradual ascent of the Block Output.
* **Convergence:** After layer 16, the two lines become tightly coupled, suggesting that the performance difference between the Block and Mixer outputs becomes negligible in the deeper layers of the model.
* **Performance Plateau:** Both outputs achieve and maintain a high F1 Score (between ~0.94 and ~0.96) from approximately layer 16 to layer 56.
* **Final Layer Dip:** Both series show a noticeable decrease in F1 Score at the final layer (64), dropping to approximately 0.91-0.92.
* **Outlier Point:** The Mixer Output has a distinct, isolated dip around layer 40 (to ~0.92) while the Block Output remains stable at ~0.95.
### Interpretation
This chart provides a layer-wise performance diagnostic for the Mamba-2.8B model. The data suggests that the "Mixer" component of the architecture is highly sensitive or unstable during the initial processing stages (early layers), leading to erratic F1 scores. In contrast, the "Block" output provides a more consistent and reliable signal from the start.
The critical finding is that after about 16 layers of processing, the model's internal representations from both the Block and Mixer pathways become functionally equivalent in terms of the measured task performance (F1 Score). This could indicate that the model's deeper layers learn to integrate or stabilize the information from both pathways.
The high, sustained plateau indicates the model reaches its peak task-specific performance in the middle layers. The slight decline at the very final layer is a common phenomenon in deep networks and could be due to over-specialization, a slight degradation in representation quality, or the final layer being optimized for a different objective than the F1-scored task. The isolated dip for Mixer Output at layer 40 warrants investigation as a potential layer-specific instability or artifact.
</details>
Figure 4. Layer-wise F1 scores for CP-decomposed Mamba representations (mixer and block outputs) showing early and increasing separability between benign and jailbreak prompts, indicating that state-space architectures encode adversarial prompt structure in their internal representations.
For each model, we extract hidden states and multi-head attention (MHA) outputs across all layers. These are aggregated into third-order tensors ( $N × T × d$ ) and decomposed using CP tensor decomposition to obtain low-dimensional features for each prompt. We fix the CP rank to $r=20$ for all experiments, balancing expressiveness and efficiency as discussed in § 3.2 and ensuring consistent latent representations across models. A lightweight classifier is trained on these features to predict jailbreak status. Fig. 3 presents layer-wise F1 scores for each model. For Mamba, we report results from both the mixer (analogous to MHA) and the full block output in Fig. 4.
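To make the decomposition step concrete, here is a minimal rank-$r$ CP via alternating least squares in NumPy, a stand-in for a library routine such as TensorLy's `parafac`. The toy tensor below is an exact rank-2 construction standing in for the $N × T × d$ activation tensor:

```python
import numpy as np

def unfold(T, mode):
    # Mode-m matricization: rows index mode m, columns the remaining modes.
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(A, B):
    # Column-wise Kronecker product, shape (I*J, r).
    return (A[:, None, :] * B[None, :, :]).reshape(-1, A.shape[1])

def cp_als(T, r, iters=200, seed=0):
    """Rank-r CP decomposition of a 3-way tensor by alternating least squares."""
    rng = np.random.default_rng(seed)
    F = [rng.standard_normal((s, r)) for s in T.shape]
    for _ in range(iters):
        for m in range(3):
            A, B = [F[i] for i in range(3) if i != m]  # the two fixed factors
            kr = khatri_rao(A, B)
            gram = (A.T @ A) * (B.T @ B)               # Hadamard of Gram matrices
            F[m] = unfold(T, m) @ kr @ np.linalg.pinv(gram)
    return F

# Toy stand-in for the (N x T x d) activation tensor: exact rank 2.
rng = np.random.default_rng(1)
A0, B0, C0 = (rng.standard_normal((s, 2)) for s in (5, 6, 4))
T = np.einsum('ir,jr,kr->ijk', A0, B0, C0)

F = cp_als(T, r=2)
err = np.linalg.norm(np.einsum('ir,jr,kr->ijk', *F) - T) / np.linalg.norm(T)
```

The factor matrix over the prompt mode (here the first mode, of size $N$) supplies the per-prompt latent features used by the classifier.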
Our results show a clear separation between the two prompt types in the learned latent space, achieving consistently high F1 scores across all evaluated models. These findings suggest that jailbreak behavior manifests as identifiable and discriminative patterns within internal representations, independent of output quality or alignment stage, and can be effectively leveraged for detection without model fine-tuning.
**Qualitative Analysis.**
For clarity of presentation, we visualize qualitative results for instruction-tuned models, which provide the most interpretable view of aligned internal dynamics. The qualitative patterns discussed here are representative of those observed across all evaluated models.
<details>
<summary>x4.png Details</summary>

### Visual Description
## Heatmap Grid: Attention Pattern Analysis Across Language Models
### Overview
The image displays a 3x3 grid of heatmaps analyzing attention patterns in three different Large Language Models (LLMs) under two conditions ("Benign" and "Jailbreak") and the calculated difference between them. Each row corresponds to a specific model and layer, while each column represents a condition. The heatmaps visualize the attention scores between query tokens and key tokens.
### Components/Axes
**Global Structure:**
- **Rows (Models & Layers):**
1. Top Row: `GPT-JT-6B - Layer 7`
2. Middle Row: `LLaMA-3.1-8B - Layer 4`
3. Bottom Row: `Mistral-7B - Layer 18`
- **Columns (Conditions):**
1. Left Column: `Benign`
2. Middle Column: `Jailbreak`
3. Right Column: `Difference`
**Axes (Identical for all 9 heatmaps):**
- **X-axis (Bottom):** `Key Token`. Scale ranges from 0 to 448, with major tick marks at 0, 64, 128, 192, 256, 320, 384, 448.
- **Y-axis (Left):** `Query Token`. Scale ranges from 0 to 448, with major tick marks at 0, 64, 128, 192, 256, 320, 384, 448.
**Color Bars (Legends):**
- **For "Benign" and "Jailbreak" columns (Left & Middle):** A vertical color bar is positioned to the right of each heatmap. The scale represents attention scores, ranging from approximately **-10 (dark purple/blue)** to **0 (yellow)**. The gradient transitions from dark purple/blue (low/negative attention) through teal and green to yellow (high/zero attention).
- **For the "Difference" column (Right):** A vertical color bar is positioned to the right of each heatmap. The scale represents the change in attention score (Jailbreak minus Benign), ranging from approximately **-4 (dark blue)** to **+4 (dark red)**, with **0 (white/light gray)** at the center. The gradient transitions from dark blue (decrease) through light blue/white to dark red (increase).
### Detailed Analysis
**Row 1: GPT-JT-6B - Layer 7**
- **Benign (Top-Left):** The heatmap shows a strong triangular pattern. High attention scores (yellow/green, ~0 to -2) are concentrated along the main diagonal (where Query Token index ≈ Key Token index) and in the upper-left triangle (where Query Token index < Key Token index). The lower-right triangle (where Query Token index > Key Token index) is dominated by very low attention scores (dark purple, ~-8 to -10).
- **Jailbreak (Top-Middle):** The pattern is visually similar to the Benign condition, maintaining the same triangular structure. The intensity of the high-attention region (yellow/green) appears slightly more pronounced or extended along the diagonal compared to Benign.
- **Difference (Top-Right):** This heatmap is predominantly red, indicating a positive difference (Jailbreak > Benign) across most of the upper-left triangle and diagonal. The strongest increases (dark red, ~+4) are concentrated in the region where both Query and Key Token indices are low (approximately 0-128). The lower-right triangle shows minimal change (white/light gray, ~0).
**Row 2: LLaMA-3.1-8B - Layer 4**
- **Benign (Middle-Left):** Exhibits a similar triangular attention pattern to GPT-JT-6B. High attention (yellow/green) is in the upper-left triangle and along the diagonal. Low attention (dark purple) fills the lower-right triangle.
- **Jailbreak (Middle-Middle):** Again, the pattern is structurally identical to its Benign counterpart. The high-attention region appears slightly brighter or more extensive.
- **Difference (Middle-Right):** This map is also largely red, showing a widespread increase in attention scores under the Jailbreak condition. The increase is most significant (dark red) in the upper-left quadrant (low token indices). The lower-right triangle shows near-zero change.
**Row 3: Mistral-7B - Layer 18**
- **Benign (Bottom-Left):** The pattern differs notably. While still triangular, the high-attention region (yellow/green) is much broader and extends further into the matrix. The gradient from high to low attention is smoother. The lowest attention scores (dark blue/purple) are confined to the very bottom-right corner.
- **Jailbreak (Bottom-Middle):** The pattern changes significantly. The high-attention region (yellow) becomes more concentrated along the main diagonal and the top edge (low Key Token indices). A larger portion of the matrix, especially the central and lower-left areas, shifts to moderate attention scores (teal/green, ~-4 to -6).
- **Difference (Bottom-Right):** This heatmap shows a complex pattern. A large, central region (spanning roughly Query 128-384 and Key 0-256) is blue, indicating a *decrease* in attention under Jailbreak. The top-left corner and a strip along the bottom edge (high Query Token indices) show red, indicating an *increase*. The diagonal shows mixed or minimal change.
### Key Observations
1. **Consistent Triangular Structure:** All "Benign" and "Jailbreak" heatmaps display a causal attention pattern, where tokens primarily attend to themselves and previous tokens (upper-left triangle), with little to no attention to future tokens (lower-right triangle).
2. **Model-Specific Baseline:** GPT-JT-6B and LLaMA-3.1-8B show very similar baseline ("Benign") attention distributions in the selected layers. Mistral-7B's baseline attention in Layer 18 is more diffuse.
3. **Jailbreak Impact - General Increase:** For GPT-JT-6B and LLaMA-3.1-8B, the "Jailbreak" condition leads to a general, widespread *increase* in attention scores within the causally allowed region (upper-left triangle), most pronounced for early tokens.
4. **Jailbreak Impact - Mistral's Redistribution:** Mistral-7B shows a different response. The Jailbreak condition causes a *redistribution* of attention, not just a uniform increase. Attention decreases in a large central region and increases in specific areas (early tokens and late query tokens).
5. **Spatial Focus of Change:** In all "Difference" maps, the most significant changes (whether increase or decrease) are concentrated in the regions corresponding to lower token indices (top-left of the matrices).
### Interpretation
This visualization provides a technical, layer-specific view of how "jailbreaking" prompts alter the internal attention mechanisms of different LLMs.
* **What the data suggests:** The jailbreak technique appears to modify the model's focus. For GPT-JT and LLaMA, it generally amplifies attention within the standard causal window, potentially making the model more sensitive to or reliant on the context provided by earlier tokens in the sequence when processing a jailbreak prompt. For Mistral, the effect is more nuanced, suggesting a strategic reallocation of attention resources—perhaps suppressing certain internal relationships while enhancing others to bypass safety training.
* **How elements relate:** The "Difference" column is the critical analytical output, directly isolating the effect of the jailbreak from the model's baseline behavior. The consistency of the triangular structure confirms the underlying causal attention mask is unchanged; the jailbreak alters the *strength* of attention, not its fundamental direction.
* **Notable anomalies/outliers:** The stark contrast between the response of Mistral-7B (Layer 18) and the other two models is the primary anomaly. This could be due to differences in model architecture, the specific layer analyzed (18 vs. 4/7), or the jailbreak's effectiveness/mechanism on that model. The concentration of change in low-index tokens across all models is also notable, suggesting the initial context of the prompt is a critical battleground during jailbreak attempts.
* **Peircean investigative reading:** The heatmaps are an *index* of the model's internal state, pointing directly to a physical change (attention score) caused by the jailbreak stimulus. They are also a *symbol* representing a complex computational process. The pattern suggests that successful jailbreaking may not require a complete overhaul of the model's processing, but rather a subtle, targeted modulation of its existing attention patterns, particularly in early processing stages. This has implications for detection and defense strategies, which could focus on monitoring for these characteristic attention shifts.
</details>
Figure 5. Self-attention maps for three instruction-tuned models, averaged over benign and jailbreak prompts (log 10 scale). Difference maps (right) highlight systematic but localized changes in attention patterns induced by jailbreak prompts, suggesting that adversarial intent manifests as targeted rerouting of attention rather than global disruption.
<details>
<summary>x5.png Details</summary>

### Visual Description
## Heatmap Comparison: Benign vs. Jailbreak Activation Patterns Across Language Models
### Overview
The image displays a 3x3 grid of heatmaps comparing the internal activation patterns of three large language models (LLMs) under two conditions: "Benign" (standard prompts) and "Jailbreak" (adversarial prompts designed to bypass safety filters). The third column shows the "Difference" between the two conditions for each model. Each heatmap plots "Layer" (y-axis) against "Token Position" (x-axis), with color intensity representing a numerical value (likely activation magnitude or a related metric).
### Components/Axes
* **Grid Structure:** 3 rows (Models) x 3 columns (Conditions).
* **Row Labels (Left Side):**
* Row 1: **GPT-JT-6B**
* Row 2: **LLaMA-3.1-8B**
* Row 3: **Mistral-7B**
* **Column Headers (Top):**
* Column 1: **Benign**
* Column 2: **Jailbreak**
* Column 3: **Difference**
* **Axes (Per Heatmap):**
* **Y-axis:** Label: **Layer**. Scale: 0 at top, increasing downward. Ticks: 0, 4, 8, 12, 16, 20, 24, 28 (for LLaMA-3.1-8B and Mistral-7B). GPT-JT-6B scale ends at 24.
* **X-axis:** Label: **Token Position**. Scale: 0 at left, increasing rightward. Ticks: 0, 64, 128, 192, 256, 320, 384, 448.
* **Color Bars (Legends):** Located to the right of each individual heatmap.
* **GPT-JT-6B (Benign & Jailbreak):** Scale from ~4 (dark purple) to ~8 (bright yellow).
* **GPT-JT-6B (Difference):** Scale from -2 (dark blue) to +2 (dark red). Zero is white/light gray.
* **LLaMA-3.1-8B (Benign & Jailbreak):** Scale from 0 (dark purple) to 6 (bright yellow).
* **LLaMA-3.1-8B (Difference):** Scale from -0.2 (dark blue) to +0.2 (dark red). Zero is white.
* **Mistral-7B (Benign & Jailbreak):** Scale from -1 (dark purple) to 5 (bright yellow).
* **Mistral-7B (Difference):** Scale from -2.0 (dark blue) to +2.0 (dark red). Zero is white.
### Detailed Analysis
**1. GPT-JT-6B (Top Row)**
* **Benign Heatmap:** Shows a strong, consistent gradient. Values are lowest (dark purple, ~4) at Layer 0 across all token positions. Values increase steadily with layer depth, reaching the highest values (bright yellow, ~8) in the deepest layers (20-24). The pattern is uniform across token positions.
* **Jailbreak Heatmap:** Shows a similar but muted pattern. The gradient from low (purple) to high (green/yellow) with depth is present, but the overall intensity is lower. The deepest layers reach a medium green (~6-7), not the bright yellow seen in the Benign condition.
* **Difference Heatmap (Jailbreak - Benign):** Dominated by blue tones, indicating the Jailbreak condition has *lower* values than Benign across nearly all layers and token positions. The strongest negative difference (deepest blue, ~-2) occurs in the middle-to-deep layers (approx. 8-20). The difference is less pronounced in the very first and very last layers.
**2. LLaMA-3.1-8B (Middle Row)**
* **Benign Heatmap:** Shows a clear vertical gradient. Values are lowest (dark purple, 0) at Layer 0. They increase with depth, but the increase is not perfectly uniform. The highest values (yellow, ~6) appear in a band around layers 16-24. The pattern is largely consistent across token positions.
* **Jailbreak Heatmap:** Visually very similar to the Benign heatmap. The same vertical gradient and band of high activation in deep layers are present.
* **Difference Heatmap (Jailbreak - Benign):** Reveals subtle but systematic differences. The pattern is horizontally banded:
* **Early Layers (0-8):** Predominantly blue (negative difference, Jailbreak < Benign), with the strongest negative values (~-0.2) around layers 4-8.
* **Middle Layers (8-16):** A mix, with a notable band of red (positive difference, Jailbreak > Benign) around layers 10-14.
* **Deep Layers (16-28):** Strongly red (positive difference), with the highest values (~+0.2) concentrated in the deepest layers (24-28). This indicates Jailbreak activations are *higher* than Benign in the model's final layers.
**3. Mistral-7B (Bottom Row)**
* **Benign Heatmap:** Shows a smooth vertical gradient. Lowest values (dark purple, ~-1) at Layer 0, increasing to highest values (yellow, ~5) in the deepest layers (24-28). The pattern is uniform across token positions.
* **Jailbreak Heatmap:** Visually almost identical to the Benign heatmap. The same gradient and intensity are observed.
* **Difference Heatmap (Jailbreak - Benign):** Appears almost entirely white/very light orange, indicating near-zero difference across the entire layer-token space. The color bar ranges from -2 to +2, but the heatmap values are clustered very close to 0. There is a very faint, diffuse positive (light orange) tint in some middle-to-deep layers, but the magnitude is negligible compared to the scale.
### Key Observations
1. **Model-Specific Response to Jailbreak:** The three models exhibit fundamentally different internal activation responses to jailbreak prompts.
* **GPT-JT-6B:** Shows a global *suppression* of activations (blue Difference map).
* **LLaMA-3.1-8B:** Shows a *redistribution* of activations—suppressed in early layers, enhanced in deep layers (banded blue/red Difference map).
* **Mistral-7B:** Shows *minimal change* in activation patterns (near-white Difference map).
2. **Layer-Wise Sensitivity:** For LLaMA-3.1-8B, the most significant positive changes (Jailbreak > Benign) occur in the final layers, suggesting these layers are most affected by the adversarial prompt.
3. **Token Position Invariance:** Across all models and conditions, the activation patterns are remarkably consistent along the horizontal (Token Position) axis. The primary variation is vertical (Layer-wise).
### Interpretation
This visualization provides a "fingerprint" of how different LLM architectures process adversarial inputs at an internal, layer-by-layer level.
* **GPT-JT-6B's** uniform suppression suggests the jailbreak prompt may cause a general dampening of the model's standard processing pathway, potentially indicating a form of internal conflict or confusion.
* **LLaMA-3.1-8B's** layered response is particularly insightful. The suppression in early layers might reflect an attempt to filter or ignore the adversarial instruction, while the heightened activation in deep layers could indicate the model ultimately engaging with and processing the harmful content more intensely than a benign prompt. This aligns with theories that later layers handle more abstract, task-specific execution.
* **Mistral-7B's** near-identical maps suggest its internal representations are highly robust or invariant to the specific jailbreak technique used here. Its processing pathway does not significantly deviate from the benign case, which could imply stronger inherent safety alignment or a different failure mode not captured by this metric.
**Conclusion:** The "Difference" heatmap is a powerful diagnostic tool. It reveals that jailbreaking is not a monolithic phenomenon; its internal mechanistic impact varies dramatically across model families. LLaMA-3.1-8B shows the most structured and interpretable shift, while Mistral-7B appears most resistant *under these specific conditions*. This analysis moves beyond simply asking "did the jailbreak work?" to asking "how did the model's internal state change when it was attempted?"
</details>
Figure 6. Layer-wise log-magnitude of hidden representations for benign (left) and jailbreak (middle) prompts, averaged across prompts, with their difference shown on the right. The difference heatmaps reveal consistent, localized deviations across layers, highlighting where adversarial prompts induce layer-dependent representational shifts.
Aggregated Self-Attention Heatmaps.
We use aggregated self-attention heatmaps to qualitatively assess how jailbreak prompts alter token-to-token information routing within the model. While attention alone does not encode semantic content, systematic differences in attention patterns can indicate how adversarial prompts redirect internal focus during processing.
For each instruction-tuned model and transformer layer $\ell$ , we extract the self-attention weight tensors
$$
A^{(\ell)} \in \mathbb{R}^{N \times H \times T \times T},
$$
where $N$ is the number of prompts, $H$ the number of attention heads, and $T$ the (padded) token length. To obtain a stable, global view of attention behavior, we aggregate over both prompts and heads. Let $I_{\mathrm{ben}}$ and $I_{\mathrm{jb}}$ denote the index sets of benign and jailbreak prompts, respectively. We compute the class-wise, head-averaged attention maps:
$$
\bar{A}^{(\ell)}_{\mathrm{ben}} = \frac{1}{|I_{\mathrm{ben}}|\,H} \sum_{n \in I_{\mathrm{ben}}} \sum_{h=1}^{H} A^{(\ell)}_{n,h,:,:}, \qquad \bar{A}^{(\ell)}_{\mathrm{jb}} = \frac{1}{|I_{\mathrm{jb}}|\,H} \sum_{n \in I_{\mathrm{jb}}} \sum_{h=1}^{H} A^{(\ell)}_{n,h,:,:}.
$$
To highlight systematic differences between prompt types, we additionally compute a per-layer difference map:
$$
\Delta A^{(\ell)} = \bar{A}^{(\ell)}_{\mathrm{jb}} - \bar{A}^{(\ell)}_{\mathrm{ben}}.
$$
Visualizing $\bar{A}^{(\ell)}_{\mathrm{ben}}$, $\bar{A}^{(\ell)}_{\mathrm{jb}}$, and $\Delta A^{(\ell)}$ (Fig. 6) shows that jailbreak prompts lead to consistent, localized changes in attention patterns. This indicates that adversarial prompts influence attention by selectively emphasizing specific instruction or control tokens, providing qualitative evidence that jailbreak behavior arises from targeted changes in information flow rather than global attention disruption.
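The aggregation can be sketched in plain Python, with nested lists standing in for the per-layer attention tensors; the function names here are illustrative, not taken from the released implementation:

```python
# Class-wise, head-averaged attention maps: average A[n][h][i][j] over
# the prompts in an index set and over all heads, then take the
# jailbreak-minus-benign difference per layer.

def head_averaged_map(attn, indices):
    """attn has shape (N, H, T, T) as nested lists; returns a T x T map."""
    n_heads = len(attn[0])
    T = len(attn[0][0])
    out = [[0.0] * T for _ in range(T)]
    for n in indices:
        for h in range(n_heads):
            for i in range(T):
                for j in range(T):
                    out[i][j] += attn[n][h][i][j]
    scale = 1.0 / (len(indices) * n_heads)  # 1 / (|I| * H)
    return [[v * scale for v in row] for row in out]

def difference_map(a_jb, a_ben):
    """Per-layer difference map: jailbreak minus benign."""
    return [[x - y for x, y in zip(r1, r2)] for r1, r2 in zip(a_jb, a_ben)]
```

With `I_ben = [0]` and `I_jb = [1]`, `difference_map(head_averaged_map(attn, I_jb), head_averaged_map(attn, I_ben))` yields the $\Delta A^{(\ell)}$ visualized in the heatmaps.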
Hidden-Representation Magnitude Heatmaps.
While attention maps reflect information routing, hidden representations capture the content and intensity of internal computation. We therefore analyze layer-wise hidden-state magnitudes to understand how strongly jailbreak prompts perturb internal activations across network depth.
For each layer $\ell$ , we extract hidden states
$$
H^{(\ell)} \in \mathbb{R}^{N \times T \times D},
$$
where $D$ is the hidden dimensionality. To summarize activation strength across token positions, we compute the per-token $\ell_2$ magnitude:
$$
M^{(\ell)}_{n,t} = \left\lVert H^{(\ell)}_{n,t,:} \right\rVert_2, \qquad M^{(\ell)} \in \mathbb{R}^{N \times T}.
$$
We then average magnitudes across prompts within each class:
$$
\bar{M}^{(\ell)}_{\mathrm{ben}}(t) = \frac{1}{|I_{\mathrm{ben}}|} \sum_{n \in I_{\mathrm{ben}}} M^{(\ell)}_{n,t}, \qquad \bar{M}^{(\ell)}_{\mathrm{jb}}(t) = \frac{1}{|I_{\mathrm{jb}}|} \sum_{n \in I_{\mathrm{jb}}} M^{(\ell)}_{n,t}.
$$
For visualization, we apply a logarithmic transform:
$$
\tilde{M}^{(\ell)}(t) = \log\big(\bar{M}^{(\ell)}(t) + \varepsilon\big),
$$
with a small $\varepsilon > 0$ for numerical stability. We plot $\tilde{M}^{(\ell)}(t)$ as heatmaps with layers on the y-axis and token positions on the x-axis.
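The magnitude summary is straightforward to reproduce; a minimal pure-Python sketch (nested lists in place of real hidden-state tensors, names ours):

```python
import math

# Per-token l2 magnitudes, class-wise means, and the log transform
# used for the hidden-representation heatmaps.

def token_magnitudes(hidden):
    """M[n][t] = ||H[n][t, :]||_2 for hidden states of shape (N, T, D)."""
    return [[math.sqrt(sum(x * x for x in tok)) for tok in prompt]
            for prompt in hidden]

def class_mean(magnitudes, indices):
    """Average per-token magnitudes over the prompts in `indices`."""
    T = len(magnitudes[0])
    return [sum(magnitudes[n][t] for n in indices) / len(indices)
            for t in range(T)]

def log_magnitude(mean_mags, eps=1e-8):
    """Log transform applied before plotting, with eps for stability."""
    return [math.log(m + eps) for m in mean_mags]
```

Running these per layer and stacking the resulting rows gives the layer-by-token matrices shown in the benign, jailbreak, and difference heatmaps.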
Although the averaged hidden-state magnitudes for benign and jailbreak prompts appear broadly similar (especially for LLaMA and Mistral), their difference heatmaps reveal consistent, localized deviations across layers. This indicates that jailbreak behavior does not manifest as a global disruption of internal activations, but rather as subtle, structured changes superimposed on otherwise normal model processing.
### 5.3. Layer-Aware Mitigation via Latent-Space Susceptibility
We evaluate our second proposed method, layer-aware mitigation via latent-space susceptibility, to determine whether representation-level signals can be used to suppress jailbreak execution during inference without relying on output-level filtering or fine-tuning.
Experimental Setup.
We conduct this experiment on the abliterated LLaMA 3.1 8B model described earlier. Evaluation is performed on a held-out set of 200 prompts (100 benign, 100 jailbreak), using latent factors learned during the analysis phase (§ 5.2).
Inference-time Susceptibility Scoring.
Given an input prompt at inference time, we extract layer outputs and attention representations, project them onto the pre-learned CP factors, and use a lightweight classifier to compute a per-layer susceptibility score indicating the strength of jailbreak-correlated features. Layers whose predicted jailbreak probability exceeds a fixed threshold ($\tau = 0.7$) are treated as highly susceptible. The threshold $\tau$ is a tunable hyperparameter that controls the trade-off between mitigation strength and preservation of benign behavior.
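The text specifies the projection-plus-classifier pipeline but not its exact form. A minimal sketch, assuming dot-product projections onto CP factor vectors and a logistic scorer (all weights and names below are placeholders, not the paper's trained parameters), might look like:

```python
import math

TAU = 0.7  # fixed susceptibility threshold from the paper

def project(rep, factors):
    """Project a layer representation onto each CP factor (dot products)."""
    return [sum(r * f for r, f in zip(rep, fac)) for fac in factors]

def jailbreak_probability(features, weights, bias):
    """Lightweight logistic classifier over the projected features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def susceptible_layers(layer_reps, factors, weights, bias, tau=TAU):
    """Indices of layers whose predicted jailbreak probability exceeds tau."""
    flagged = []
    for idx, rep in enumerate(layer_reps):
        p = jailbreak_probability(project(rep, factors), weights, bias)
        if p > tau:
            flagged.append(idx)
    return flagged
```

The returned index set drives the intervention described next; raising `tau` flags fewer layers, trading mitigation strength for benign-behavior preservation.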
Layer-/Head-Bypass Intervention.
Based on the susceptibility score, we selectively perform: (i) Layer Bypass: bypassing selected layer outputs; and (ii) MHA Bypass: bypassing selected attention components. This intervention is parameter-free (no fine-tuning), prompt-conditional (depends on the susceptibility profile), and does not require any output-side heuristics.
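Conceptually, the bypass treats flagged components as identity maps on the residual stream during the forward pass. A minimal sketch, with arbitrary callables standing in for transformer blocks (this is an illustration of the skipping logic, not the released implementation):

```python
# Run the hidden state through a stack of layer functions, skipping
# the indices flagged as highly susceptible; skipped layers act as
# identity mappings (the hidden state passes through unchanged).

def forward_with_bypass(hidden, layers, bypass_set):
    for idx, layer in enumerate(layers):
        if idx in bypass_set:
            continue  # layer bypass: identity on the residual stream
        hidden = layer(hidden)
    return hidden
```

MHA bypass is the analogous operation applied only to the attention sub-block of each flagged layer rather than to the whole layer output.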
Output-Based Evaluation with LLM-Assisted Judging.
Since our goal is to prevent harmful compliance rather than optimize helpfulness, we evaluate mitigation effectiveness based on observed output behavior. Model responses are categorized as: (i) harmful completions, where the jailbreak intent succeeds; (ii) benign completions, where the model responds appropriately; and (iii) disrupted outputs, including truncated, repetitive, or incoherent text.
For jailbreak prompts, disrupted or non-compliant outputs are treated as successful defenses, while for benign prompts such outputs are undesirable. Output labels are assigned using an LLM-as-a-judge rubric, followed by manual review of ambiguous cases. This evaluation protocol follows established practice for open-ended generation assessment with human validation (Zheng et al., 2023; Dubois et al., 2024; Li et al., 2024). Based on these criteria, we define the confusion matrix as follows:
| Prompt type | Model output | Outcome |
| --- | --- | --- |
| Jailbreak | Harmful completion | False Negative (FN) |
| Jailbreak | Disrupted/benign output | True Positive (TP) |
| Benign | Benign completion | True Negative (TN) |
| Benign | Disrupted output | False Positive (FP) |
Results.
Table 1 summarizes the confusion-matrix counts. Layer-guided bypass suppresses most jailbreak attempts (TP=78) while largely preserving benign behavior (TN=94). In contrast, MHA-only bypass results in substantially more jailbreak failures (FN=39), indicating that layer outputs capture a larger fraction of jailbreak-relevant computation than attention components alone.
To provide a compact summary aligned with prior jailbreak evaluations, we additionally report the attack success rate (ASR), defined as the fraction of jailbreak prompts that remain successful after mitigation. Layer-output bypass reduces the ASR to 22% (22/100), compared to 39% (39/100) for MHA-only bypass, highlighting the effectiveness of layer-level intervention.
Table 1. Confusion matrix counts for latent-space-guided mitigation (100 jailbreak and 100 benign prompts).
| Intervention | TP | FN | TN | FP |
| --- | --- | --- | --- | --- |
| Layer Bypass | 78 | 22 | 94 | 6 |
| MHA Bypass | 61 | 39 | 92 | 8 |
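The summary metrics follow directly from these counts; in particular, ASR = FN / (TP + FN), since false negatives are the jailbreak prompts that remain successful:

```python
# Derive the reported summary metrics from the confusion-matrix counts
# in Table 1 (100 jailbreak and 100 benign prompts per condition).

def mitigation_metrics(tp, fn, tn, fp):
    asr = fn / (tp + fn)               # attack success rate after mitigation
    benign_preserved = tn / (tn + fp)  # fraction of benign prompts unharmed
    return asr, benign_preserved

layer_asr, layer_ok = mitigation_metrics(78, 22, 94, 6)   # 0.22, 0.94
mha_asr, mha_ok = mitigation_metrics(61, 39, 92, 8)       # 0.39, 0.92
```

This reproduces the reported 22% vs. 39% ASR gap between layer-output and MHA-only bypass.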
Failure (False Negative) Analysis.
We examine the 22 jailbreak prompts that remain harmful after layer-output bypass. The majority are persona- or roleplay-based prompt injections (e.g., “never refuse,” “no morals,” forced speaker tags such as “AIM:” and “[H4X]:”) that aim to establish persistent control over the model’s identity, tone, and formatting. Because such instructions are repeatedly reinforced throughout the prompt, elements of adversarial control can persist even when highly susceptible layers are bypassed.
Additional failures stem from susceptibility estimation: the intervention targets layers exceeding a fixed probability threshold chosen to preserve benign behavior. Attacks that distribute their influence across multiple layers, or weakly activate any single layer, may therefore evade suppression despite succeeding overall. Some failures also involve milder jailbreaks that retain adversarial framing without immediately producing explicit harmful content; under our conservative evaluation criterion, these are counted as failures.
These limitations are addressable within the proposed framework by expanding the diversity of jailbreak styles used for latent factor learning and by adopting adaptive or cumulative susceptibility criteria. Since the method operates entirely in latent space, such extensions require no architectural changes.
## 6. Conclusion
Our hypothesis and experiments indicate that internal representations of LLMs contain sufficiently strong and consistent signals to both detect jailbreak prompts and, in many cases, disrupt jailbreak execution at inference time. Importantly, these capabilities emerge from lightweight representation-level analysis and intervention, without requiring additional post-training, auxiliary models, or complex rule-based filtering. The consistency of these findings across diverse model families suggests that adversarial intent leaves stable latent-space signatures, motivating internal-representation monitoring as a practical and broadly applicable direction for understanding and mitigating jailbreak behavior.
## 7. GenAI Usage Disclosure
The authors acknowledge the use of AI-based writing and coding assistance tools during the preparation of this manuscript. These tools were used exclusively to improve clarity, organization, and academic tone of text written by the authors, as well as to assist with code formatting and plot generation. All scientific ideas, methodologies, analyses, and conclusions are the original intellectual contributions of the authors. No AI system was used to generate research ideas or substantive technical content, and all AI-assisted revisions were carefully reviewed and validated by the authors.
## References
- jai ([n. d.]) [n. d.]. Implementation of the proposed method. https://anonymous.4open.science/r/Jailbreaking-leaves-a-trace-Understanding-and-Detecting-Jailbreak-Attacks-in-LLMs-C401
- mam ([n. d.]) [n. d.]. Mamba-2.8b. https://huggingface.co/state-spaces/mamba-2.8b-hf
- AI (2023) Mistral AI. 2023. Mistral 7B Instruct v0.1. https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1.
- AI@Meta (2024a) AI@Meta. 2024a. Llama 3 Instruct Models. https://ai.meta.com/llama/. LLaMA-3.1-8B-Instruct.
- AI@Meta (2024b) AI@Meta. 2024b. Llama 3 Models. https://ai.meta.com/llama/. LLaMA-3.1-8B base model.
- Bai et al. (2022) Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, and Jared Kaplan. 2022. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv:2204.05862 [cs.CL] https://arxiv.org/abs/2204.05862
- Candogan et al. (2025) Leyla Naz Candogan, Yongtao Wu, Elias Abad Rocamora, Grigorios Chrysos, and Volkan Cevher. 2025. Single-pass Detection of Jailbreaking Input in Large Language Models. Transactions on Machine Learning Research (2025). https://openreview.net/forum?id=42v6I5Ut9a
- Chao et al. (2024) Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. 2024. Jailbreaking Black Box Large Language Models in Twenty Queries. arXiv:2310.08419 [cs.LG] https://arxiv.org/abs/2310.08419
- Chen et al. (2023) Bocheng Chen, Advait Paliwal, and Qiben Yan. 2023. Jailbreaker in Jail: Moving Target Defense for Large Language Models. arXiv:2310.02417 [cs.CR] https://arxiv.org/abs/2310.02417
- Computer (2022) Together Computer. 2022. GPT-JT-6B: Instruction-Tuned GPT-J Model. https://huggingface.co/togethercomputer/GPT-JT-6B-v1.
- Deng et al. (2024) Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. 2024. Multilingual Jailbreak Challenges in Large Language Models. arXiv:2310.06474 [cs.CL] https://arxiv.org/abs/2310.06474
- Dubois et al. (2024) Yann Dubois, Aarohi Srivastava, Abhinav Venigalla, et al. 2024. Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators. arXiv preprint arXiv:2404.04475 (2024). https://arxiv.org/abs/2404.04475
- Elhoushi et al. (2024) Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed Aly, Beidi Chen, and Carole-Jean Wu. 2024. LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 12622–12642. doi: 10.18653/v1/2024.acl-long.681
- Glaese et al. (2022) Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, Lucy Campbell-Gillingham, Jonathan Uesato, Po-Sen Huang, Ramona Comanescu, Fan Yang, Abigail See, Sumanth Dathathri, Rory Greig, Charlie Chen, Doug Fritz, Jaume Sanchez Elias, Richard Green, Soňa Mokrá, Nicholas Fernando, Boxi Wu, Rachel Foley, Susannah Young, Iason Gabriel, William Isaac, John Mellor, Demis Hassabis, Koray Kavukcuoglu, Lisa Anne Hendricks, and Geoffrey Irving. 2022. Improving alignment of dialogue agents via targeted human judgements. arXiv:2209.14375 [cs.LG] https://arxiv.org/abs/2209.14375
- Hao ([n. d.]) Jack Hao. [n. d.]. Jailbreak Classification Dataset. https://huggingface.co/datasets/jackhhao/jailbreak-classification
- Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, et al. 2023. Mistral 7B. https://mistral.ai/news/introducing-mistral-7b/.
- Jiang et al. (2024) Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, and Nouha Dziri. 2024. WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models. arXiv:2406.18510 [cs.CL] https://arxiv.org/abs/2406.18510
- Kadali and Papalexakis (2025) Sri Durga Sai Sowmya Kadali and Evangelos Papalexakis. 2025. CoCoTen: Detecting Adversarial Inputs to Large Language Models through Latent Space Features of Contextual Co-occurrence Tensors. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM ’25). ACM, 4857–4861. doi: 10.1145/3746252.3760886
- Labonne ([n. d.]) M. Labonne. [n. d.]. Meta-Llama-3.1-8B-Instruct-abliterated. https://huggingface.co/mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated. Hugging Face model card.
- Lawson and Aitchison (2025) Tim Lawson and Laurence Aitchison. 2025. Learning to Skip the Middle Layers of Transformers. arXiv:2506.21103 [cs.LG] https://arxiv.org/abs/2506.21103
- Li et al. (2024) Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. 2024. LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods. arXiv:2412.05579 [cs.CL] https://arxiv.org/abs/2412.05579
- Liu et al. (2024) Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. 2024. AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models. arXiv:2310.04451 [cs.CL] https://arxiv.org/abs/2310.04451
- Lu et al. (2024) Weikai Lu, Ziqian Zeng, Jianwei Wang, Zhengdong Lu, Zelin Chen, Huiping Zhuang, and Cen Chen. 2024. Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge. arXiv:2404.05880 [cs.CL] https://arxiv.org/abs/2404.05880
- Luo et al. (2025) Xuan Luo, Weizhi Wang, and Xifeng Yan. 2025. Adaptive Layer-Skipping in Pre-trained LLMs. arXiv preprint arXiv:2503.23798 (2025).
- Ouyang et al. (2022) Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. arXiv:2203.02155 [cs.CL] https://arxiv.org/abs/2203.02155
- Papalexakis (2018) Evangelos E. Papalexakis. 2018. Unsupervised Content-Based Identification of Fake News Articles with Tensor Decomposition Ensembles. https://api.semanticscholar.org/CorpusID:26675959
- Qazi et al. (2024) Zubair Qazi, William Shiao, and Evangelos E. Papalexakis. 2024. GPT-generated Text Detection: Benchmark Dataset and Tensor-based Detection Method. In Companion Proceedings of the ACM Web Conference 2024 (Singapore, Singapore) (WWW ’24). Association for Computing Machinery, New York, NY, USA, 842–846. doi: 10.1145/3589335.3651513
- Shen et al. (2024) Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. 2024. “Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models. In ACM SIGSAC Conference on Computer and Communications Security (CCS). ACM.
- Shukor and Cord (2024) Mustafa Shukor and Matthieu Cord. 2024. Skipping Computations in Multimodal LLMs. In Workshop on Responsibly Building the Next Generation of Multimodal Foundational Models. https://openreview.net/forum?id=qkmMvLckB9
- Sidiropoulos et al. (2017) Nicholas D Sidiropoulos, Lieven De Lathauwer, Xiao Fu, Kejun Huang, Evangelos E Papalexakis, and Christos Faloutsos. 2017. Tensor decomposition for signal processing and machine learning. IEEE Transactions on signal processing 65, 13 (2017), 3551–3582.
- Vaswani et al. (2023) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2023. Attention Is All You Need. arXiv:1706.03762 [cs.CL] https://arxiv.org/abs/1706.03762
- Wang et al. (2021) Ben Wang, Aran Komatsuzaki, and EleutherAI. 2021. GPT-J-6B. https://huggingface.co/EleutherAI/gpt-j-6B. Model card and weights.
- Wang et al. (2024a) Hao Wang, Hao Li, Minlie Huang, and Lei Sha. 2024a. ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational Linguistics, Miami, Florida, USA, 2697–2711. doi: 10.18653/v1/2024.emnlp-main.157
- Wang et al. (2025) Xunguang Wang, Daoyuan Wu, Zhenlan Ji, Zongjie Li, Pingchuan Ma, Shuai Wang, Yingjiu Li, Yang Liu, Ning Liu, and Juergen Rahmel. 2025. SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner. arXiv:2406.05498 [cs.CR] https://arxiv.org/abs/2406.05498
- Wang et al. (2024b) Yihan Wang, Zhouxing Shi, Andrew Bai, and Cho-Jui Hsieh. 2024b. Defending llms against jailbreaking attacks via backtranslation. arXiv preprint arXiv:2402.16459 (2024).
- Wang et al. (2023) Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, and Qun Liu. 2023. Aligning Large Language Models with Human: A Survey. arXiv:2307.12966 [cs.CL] https://arxiv.org/abs/2307.12966
- Wei et al. (2023) Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. Jailbroken: How Does LLM Safety Training Fail? arXiv:2307.02483 [cs.LG] https://arxiv.org/abs/2307.02483
- Xiong et al. (2025) Chen Xiong, Xiangyu Qi, Pin-Yu Chen, and Tsung-Yi Ho. 2025. Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks. arXiv:2405.20099 [cs.CR] https://arxiv.org/abs/2405.20099
- Yang et al. (2023) Xianjun Yang, Xiao Wang, Qi Zhang, Linda Petzold, William Yang Wang, Xun Zhao, and Dahua Lin. 2023. Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models. arXiv:2310.02949 [cs.CL] https://arxiv.org/abs/2310.02949
- Yao et al. (2023) Hongwei Yao, Jian Lou, and Zhan Qin. 2023. PoisonPrompt: Backdoor Attack on Prompt-based Large Language Models. arXiv:2310.12439 [cs.CL] https://arxiv.org/abs/2310.12439
- Young (2026) Richard J. Young. 2026. Comparative Analysis of LLM Abliteration Methods: A Cross-Architecture Evaluation. arXiv:2512.13655 [cs.CL] https://arxiv.org/abs/2512.13655
- Yu et al. (2023) Jiahao Yu, Xingwei Lin, Zheng Yu, and Xinyu Xing. 2023. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253 (2023).
- Zeng et al. (2024) Yifan Zeng, Yiran Wu, Xiao Zhang, Huazheng Wang, and Qingyun Wu. 2024. AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks. arXiv:2403.04783 [cs.LG] https://arxiv.org/abs/2403.04783
- Zhang et al. (2026) Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Guoyin Wang, et al. 2026. Instruction tuning for large language models: A survey. Comput. Surveys 58, 7 (2026), 1–36.
- Zhang et al. (2024) Tianrong Zhang, Bochuan Cao, Yuanpu Cao, Lu Lin, Prasenjit Mitra, and Jinghui Chen. 2024. WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response. arXiv:2405.14023 [cs.LG] https://arxiv.org/abs/2405.14023
- Zhao et al. (2019) Zhenjie Zhao, Andrew Cattle, Evangelos Papalexakis, and Xiaojuan Ma. 2019. Embedding Lexical Features via Tensor Decomposition for Small Sample Humor Recognition. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (Eds.). Association for Computational Linguistics, Hong Kong, China, 6376–6381. doi: 10.18653/v1/D19-1669
- Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, et al. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685 (2023). https://arxiv.org/abs/2306.05685
- Zou et al. (2023) Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. 2023. Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv:2307.15043 [cs.CL] https://arxiv.org/abs/2307.15043