# Jailbreaking Leaves a Trace: Understanding and Detecting Jailbreak Attacks from Internal Representations of Large Language Models
**Authors**: Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis
> University of California, Riverside, CA, USA (skada009@ucr.edu)
> University of California, Riverside, CA, USA (epapalex@cs.ucr.edu)
(2026)
Abstract.
Jailbreaking large language models (LLMs) has emerged as a critical security challenge with the widespread deployment of conversational AI systems. Adversarial users exploit these models through carefully crafted prompts to elicit restricted or unsafe outputs, a phenomenon commonly referred to as Jailbreaking. Despite numerous proposed defense mechanisms, attackers continue to develop adaptive prompting strategies, and existing models remain vulnerable. This motivates approaches that examine the internal behavior of LLMs rather than relying solely on prompt-level defenses. In this work, we study jailbreaking from both security and interpretability perspectives by analyzing how internal representations differ between jailbreak and benign prompts. We conduct a systematic layer-wise analysis across multiple open-source models, including GPT-J, LLaMA, Mistral, and the state-space model Mamba2, and identify consistent latent-space patterns associated with adversarial inputs. We then propose a tensor-based latent representation framework that captures structure in hidden activations and enables lightweight jailbreak detection without model fine-tuning or auxiliary LLM-based detectors. We further demonstrate that these latent signals can be used to actively disrupt jailbreak execution at inference time. On an abliterated LLaMA 3.1 8B model, selectively bypassing high-susceptibility layers blocks 78% of jailbreak attempts while preserving benign behavior on 94% of benign prompts. This intervention operates entirely at inference time and introduces minimal overhead, providing a scalable foundation for achieving stronger coverage by incorporating additional attack distributions or more refined susceptibility thresholds. Our results provide evidence that jailbreak behavior is rooted in identifiable internal structures and suggest a complementary, architecture-agnostic direction for improving LLM security. Our implementation can be found here (jai, [n. d.]).
**Keywords**: Jailbreaking, Large Language Models, LLM Internal Representations, Self-attention, Hidden Representation, Tensor Decomposition
**CCS Concepts**: Computing methodologies → Artificial intelligence; Information systems → Data mining; Security and privacy → Software and application security
1. Introduction
Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks and have become increasingly integrated into applications spanning diverse domains and user populations. Despite their utility and efforts to align them with human values and safety expectations (Wang et al., 2023; Zeng et al., 2024; Glaese et al., 2022), these models remain highly susceptible to adversarial exploitation, raising significant safety and security concerns given their widespread accessibility. Among such threats, Jailbreaking has emerged as a persistent and particularly concerning attack vector, wherein malicious actors craft carefully engineered prompts to circumvent built-in safety mechanisms (Chao et al., 2024; Zou et al., 2023) and elicit restricted, sensitive, or otherwise disallowed content (Liu et al., 2024). Jailbreak attacks pose substantial risks, as they enable users with harmful intent to manipulate LLMs into producing outputs that violate safety policies, including actionable instructions for malicious activities (Zhang et al., 2024; Wang et al., 2024a; Yu et al., 2023). The growing availability of jailbreak prompts in public repositories, research artifacts, and online forums further exacerbates this issue (Jiang et al., 2024; Shen et al., 2024).
To mitigate these risks, prior work has explored a range of defense strategies, including prompt-level filtering (Wang et al., 2024b), model-level interventions (Zeng et al., 2024; Xiong et al., 2025), reinforcement learning from human feedback (RLHF) (Bai et al., 2022), and the use of auxiliary safety models (Chen et al., 2023; Lu et al., 2024). While such approaches have demonstrated partial effectiveness, they are not without limitations. In practice, even well-aligned models can remain vulnerable under repeated or adaptive attack attempts. Moreover, no single defense mechanism has proven sufficient to counter the continually evolving landscape of jailbreak strategies.
In this study, we investigate a complementary and comparatively underexplored direction: leveraging internal model representations to distinguish jailbreak prompts from benign inputs and to guide mitigation. Our central hypothesis is that adversarial prompts induce distinct and detectable structural patterns within the hidden representations of LLMs, independent of output behavior. To evaluate this hypothesis, we extract layer-wise internal representations, such as multi-head attention outputs and hidden-state/layer outputs, from multiple models (GPT-J, LLaMA, Mistral, and the state-space sequence model Mamba), and apply tensor decomposition (Sidiropoulos et al., 2017) techniques to characterize and compare latent-space behavior across benign and jailbreak prompts. Building on this analysis, we further demonstrate how these latent representations can be used to identify layers that are particularly susceptible to adversarial manipulation and to intervene during inference by selectively bypassing such layers. This representation-centric framework not only enables reliable detection of jailbreak prompts but also provides a principled mechanism for mitigating harmful behavior without modifying model parameters or relying on output-level filtering. Together, our results suggest that internal representations offer a powerful and generalizable foundation for both understanding and defending against jailbreak attacks beyond surface-level text analysis. Our contributions are as follows:
- Layer-wise Jailbreak Signature: We show that jailbreak and benign prompts exhibit distinct, layer-dependent latent signatures in the internal representations of LLMs, which can be uncovered using tensor decomposition (Sidiropoulos et al., 2017).
- Effective Defense via Targeted Layer Bypass: We demonstrate that these latent signatures can be exploited at inference time to identify susceptible layers and disrupt jailbreak execution through targeted layer bypass.
2. Related work
2.1. Adversarial Attacks
Adversarial attacks on LLMs encompass a broad class of inputs intentionally crafted to induce unintended, incorrect, or unsafe behaviors (Zou et al., 2023; Chao et al., 2024). Unlike adversarial examples in vision or speech domains, which often rely on imperceptible input perturbations, attacks on LLMs primarily exploit semantic, syntactic, and contextual vulnerabilities in language understanding and generation. By manipulating instructions, context, or interaction structure, adversaries can steer models toward generating factually incorrect information, violating behavioral constraints, or producing harmful or sensitive content, posing significant risks to deployed systems (Liu et al., 2024). Existing adversarial strategies span a wide range of mechanisms, including prompt injection, role-playing and persona manipulation, instruction obfuscation, multi-turn coercion (Yu et al., 2023; Jiang et al., 2024), indirect attacks embedded within external content such as documents or code, and even fine-tuning (Yang et al., 2023; Yao et al., 2023). A central challenge in defending against these attacks is their adaptability: adversarial prompts are often transferable across models and can be easily modified to evade static defenses (Zou et al., 2023). As a result, surface-level or prompt-based mitigation strategies have shown limited robustness.
2.2. Jailbreak attacks
A prominent and particularly challenging class of adversarial attacks on LLMs is jailbreaking. Jailbreak attacks aim to circumvent built-in safety mechanisms and alignment constraints, enabling the model to produce outputs that it is explicitly designed to refuse. These attacks often rely on prompt engineering techniques such as hypothetical scenarios, instruction overriding, contextual reframing, or step-by-step coercion, effectively manipulating the model’s internal decision-making processes (Wei et al., 2023). Unlike general adversarial prompting, jailbreak attacks explicitly target safety guardrails and content moderation policies, making them a critical concern from both security and governance perspectives (Liu et al., 2024; Zou et al., 2023; Jiang et al., 2024). Despite extensive efforts to harden models through alignment training and reinforcement learning from human feedback (Bai et al., 2022; Wang et al., 2023; Glaese et al., 2022), jailbreak prompts continue to evolve, highlighting fundamental limitations in current defense approaches. This motivates the need for methods that analyze jailbreak behavior at the level of internal model representations, rather than relying solely on external prompt or output inspection.
2.3. Jailbreak Defenses
Prior work on defending against jailbreak attacks in LLMs has primarily focused on prompt and output-level safeguards. Rule-based filtering and keyword matching are commonly used due to their low computational cost, but such approaches are brittle and easily bypassed through paraphrasing, obfuscation, or multi-turn prompting (Deng et al., 2024). Learning-based defenses, including supervised classifiers and auxiliary LLMs for intent detection or self-evaluation (Wang et al., 2025), improve robustness but introduce additional complexity, inference overhead, and new attack surfaces. Model-level defenses, such as alignment fine-tuning, reinforcement learning from human feedback (RLHF), and policy-based or constitutional training, aim to internalize safety constraints within the model (Ouyang et al., 2022). While effective to an extent, these approaches are resource-intensive and require continual updates as jailbreak strategies evolve. Moreover, even extensively aligned models remain susceptible to jailbreak attacks, indicating fundamental limitations in current training-based defenses. Overall, existing defenses largely treat jailbreak detection as a black-box problem and rely on external signals from prompts or generated outputs. In contrast, fewer works explore the internal representations of LLMs as a basis for defense (Candogan et al., 2025). This gap motivates approaches that leverage latent-space and layer-wise signals to identify jailbreak behavior in an interpretable and architecture-agnostic manner, without requiring additional fine-tuning or auxiliary models.
3. Preliminaries
3.1. Tensors
Tensors (Sidiropoulos et al., 2017) are defined as multi-dimensional arrays that generalize one-dimensional arrays (vectors) and two-dimensional arrays (matrices) to higher dimensions. The dimension of a tensor is traditionally referred to as its order, or equivalently, the number of modes, while the size of each mode is called its dimensionality. For instance, we may refer to a third-order tensor as a three-mode tensor $\boldsymbol{\mathscr{X}}∈\mathbb{R}^{I× J× K}$ .
3.2. Tensor Decomposition
Tensor Decomposition (Sidiropoulos et al., 2017) is a popular data science tool for discovering underlying low-dimensional patterns in data. We focus on the CANDECOMP/PARAFAC (CP) decomposition model (Sidiropoulos et al., 2017), one of the most widely used tensor decomposition models, which decomposes a tensor into a sum of rank-one components. We use CP decomposition because of its simplicity and interpretability. The CP decomposition of a three-mode tensor $\boldsymbol{\mathscr{X}}∈\mathbb{R}^{I× J× K}$ is a sum of three-way outer products, that is, $\boldsymbol{\mathscr{X}}≈\sum_{r=1}^{R}\mathbf{a}_{r}\circ\mathbf{b}_{r}\circ\mathbf{c}_{r}$ , where $R$ is the rank of the decomposition, $\mathbf{a}_{r}∈\mathbb{R}^{I}$ , $\mathbf{b}_{r}∈\mathbb{R}^{J}$ , and $\mathbf{c}_{r}∈\mathbb{R}^{K}$ are the factor vectors and $\circ$ denotes the outer product. The rank of a tensor $\boldsymbol{\mathscr{X}}$ is the minimal number of rank-1 tensors required to exactly reconstruct it:
$$
\text{rank}(\boldsymbol{\mathscr{X}})=\min\left\{R:\boldsymbol{\mathscr{X}}=\sum_{r=1}^{R}\mathbf{a}_{r}\circ\mathbf{b}_{r}\circ\mathbf{c}_{r},\;\mathbf{a}_{r}\in\mathbb{R}^{I},\mathbf{b}_{r}\in\mathbb{R}^{J},\mathbf{c}_{r}\in\mathbb{R}^{K}\right\}
$$
Selecting an appropriate rank is critical, as it directly affects both the expressiveness and interpretability of the decomposition. Lower-rank approximations yield compact and computationally efficient representations, while higher ranks can capture richer structure at the cost of increased complexity and potential noise.
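The CP model can be checked numerically. The following minimal numpy sketch, with purely illustrative sizes ($I=4$, $J=5$, $K=6$, $R=3$), builds a rank-$R$ tensor from random factor vectors and verifies that the compact `einsum` form equals the explicit sum of rank-one outer products:

```python
import numpy as np

# Illustrative sizes: an I x J x K tensor of rank R.
I, J, K, R = 4, 5, 6, 3
rng = np.random.default_rng(0)

# Factor vectors a_r, b_r, c_r stored as columns of factor matrices.
A = rng.standard_normal((I, R))
B = rng.standard_normal((J, R))
C = rng.standard_normal((K, R))

# CP model: X ~= sum_r a_r o b_r o c_r, written with einsum.
X = np.einsum('ir,jr,kr->ijk', A, B, C)

# Equivalent explicit sum of R rank-one outer products.
X_sum = sum(np.multiply.outer(np.multiply.outer(A[:, r], B[:, r]), C[:, r])
            for r in range(R))
assert np.allclose(X, X_sum)
print(X.shape)  # (4, 5, 6)
```

In practice, fitting the factors to a given tensor is done with an alternating least squares solver such as `tensorly.decomposition.parafac`.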
3.3. Transformer Architecture
A Transformer (Vaswani et al., 2023) is a neural network architecture designed for modeling sequential data through attention mechanisms rather than recurrence or convolution. Transformers process input sequences in parallel and capture long-range dependencies by explicitly modeling interactions between all tokens in a sequence.
A transformer consists of a stack of layers, each composed of two primary submodules: multi-head self-attention and a position-wise feed-forward network (FFN). Residual connections and layer normalization are applied around each submodule to stabilize training.
3.3.1. Multi-Head Self-Attention
Multi-head self-attention enables the model to attend to different parts of the input sequence simultaneously. Given an input representation $\mathbf{H}∈\mathbb{R}^{T× d}$ , each attention head projects $\mathbf{H}$ into query ( $\mathbf{Q}$ ), key ( $\mathbf{K}$ ), and value ( $\mathbf{V}$ ) matrices:
$$
\mathbf{Q}=\mathbf{H}\mathbf{W}^{Q},\quad\mathbf{K}=\mathbf{H}\mathbf{W}^{K},\quad\mathbf{V}=\mathbf{H}\mathbf{W}^{V}.
$$
Attention is computed as
$$
\mathrm{Attn}(\mathbf{Q},\mathbf{K},\mathbf{V})=\mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_{k}}}\right)\mathbf{V},
$$
where $d_{k}$ is the dimensionality of each attention head. Multiple attention heads operate in parallel, and their outputs are concatenated and linearly projected, allowing the model to capture diverse relational patterns across tokens.
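The scaled dot-product attention above translates directly into numpy. This is a single-head sketch with illustrative sizes ($T=4$, $d_k=8$) and random projection matrices; real implementations batch the heads and apply a causal mask:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # T x T token-to-token scores
    return softmax(scores, axis=-1) @ V

T, d_k = 4, 8
rng = np.random.default_rng(1)
H = rng.standard_normal((T, d_k))                      # input representation
WQ, WK, WV = (rng.standard_normal((d_k, d_k)) for _ in range(3))
out = attention(H @ WQ, H @ WK, H @ WV)
print(out.shape)  # (4, 8)
```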
3.3.2. Layer Outputs and Hidden Representations
Each transformer layer produces a hidden representation (or layer output) that serves as input to the next layer. Formally, for layer $\ell$ , the output representation $\mathbf{H}^{(\ell)}$ is given by:
$$
\mathbf{H}^{(\ell)}=\mathrm{LN}\!\left(\mathbf{H}^{(\ell-1)}+\mathrm{MHA}\!\left(\mathbf{H}^{(\ell-1)}\right)\right),
$$
followed by
$$
\mathbf{H}^{(\ell)}=\mathrm{LN}\!\left(\mathbf{H}^{(\ell)}+\mathrm{FFN}\!\left(\mathbf{H}^{(\ell)}\right)\right),
$$
where $\mathrm{MHA}$ denotes multi-head attention, $\mathrm{FFN}$ denotes the feed-forward network, and $\mathrm{LN}$ denotes layer normalization.
The sequence of hidden states $\{\mathbf{H}^{(1)},...,\mathbf{H}^{(L)}\}$ captures increasingly abstract features, ranging from local syntactic patterns in early layers to semantic and task-relevant representations in deeper layers. These intermediate representations are commonly referred to as hidden layer activations and form the basis for interpretability and internal behavior analysis.
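The two residual sub-blocks defining $\mathbf{H}^{(\ell)}$ can be sketched in a few lines. The `mha` and `ffn` callables below are toy stand-ins (a random linear map and a two-layer ReLU network), not a real attention module, and all sizes are illustrative:

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def transformer_layer(H_prev, mha, ffn):
    # H = LN(H_prev + MHA(H_prev)), then H = LN(H + FFN(H)).
    H = layer_norm(H_prev + mha(H_prev))
    return layer_norm(H + ffn(H))

T, d = 4, 8
rng = np.random.default_rng(2)
H0 = rng.standard_normal((T, d))
W_attn = rng.standard_normal((d, d)) * 0.1            # stand-in for attention
W1, W2 = rng.standard_normal((d, 16)), rng.standard_normal((16, d))
H1 = transformer_layer(H0,
                       mha=lambda X: X @ W_attn,
                       ffn=lambda X: np.maximum(X @ W1, 0) @ W2)
print(H1.shape)  # (4, 8)
```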
3.4. Model Types: Base, Instruction-Tuned, and Abliterated Models
Large language models (LLMs) can be categorized based on their training and alignment processes, which influence their behavior under adversarial conditions.
Base Models
These are pretrained on large-scale text corpora using self-supervised objectives without explicit instruction tuning. Examples include GPT-J and LLaMA. Base models capture broad language patterns but lack alignment with human preferences, making them prone to generating unrestricted or unsafe outputs.
Instruction-Tuned Models
Derived from base models via supervised fine-tuning on datasets containing human instructions and responses, these models improve instruction-following capabilities (Zhang et al., 2026) and enforce safety constraints, such as refusing harmful queries. While instruction tuning enhances safety, these models remain susceptible to sophisticated jailbreak prompts.
Abliterated Models
Abliterated models are instruction-tuned models where alignment or safety components have been removed, disabled, or bypassed (Young, 2026). Such models behave more like base models but may retain subtle differences due to prior fine-tuning. Abliterated models serve as valuable testbeds to study jailbreak vulnerabilities and internal representational changes resulting from alignment removal.
Analyzing these model types enables us to investigate how alignment and instruction tuning affect internal layer activations and latent patterns, informing the design of robust jailbreak detection methods that generalize across model variations.
Figure 1. Proposed method: (top) Latent analysis and classifier training: self-attention and layer-output tensors are constructed from input prompts, decomposed via CP decomposition, and used to learn jailbreak-discriminative latent factors. (Bottom) Inference-time mitigation: internal representations from a new prompt are projected onto the learned factors to estimate layer-wise jailbreak susceptibility; layers exhibiting strong adversarial signals are bypassed to suppress jailbreak behavior.
4. Proposed Method
We study jailbreak behavior through internal model representations using two complementary analysis pipelines.
4.1. Hidden Representation Analysis
Model Suite and Representation Extraction
We perform hidden representation analysis by examining both multi-head self-attention outputs and layer-wise hidden representations across a diverse set of large language models. Specifically, we evaluate three base models: GPT-J-6B (Wang et al., 2021), LLaMA-3.1-8B (AI@Meta, 2024b), and Mistral-7B-v0.1 (Jiang et al., 2023); three instruction-tuned models: GPT-JT-6B (Computer, 2022), LLaMA-3.1-8B-Instruct (AI@Meta, 2024a), and Mistral-7B-Instruct-v0.1 (AI, 2023); one abliterated model, LLaMA-3.1-8B-Instruct-Abliterated (Labonne, [n. d.]); and one state-space sequence model, Mamba-2.8b-hf (mam, [n. d.]). This selection enables a systematic comparison across different stages of alignment and architectural paradigms.
The inclusion of base, instruction-tuned, and abliterated models is intentional. Base models offer insight into unaligned latent structures; instruction-tuned models show how safety fine-tuning alters internal processing; and abliterated models help isolate the role of alignment layers. We focus not on output quality but on how jailbreak and benign prompts are internally encoded, identifying discriminative patterns that persist across model variants. Together, these model categories enable us to analyze jailbreak behavior across the full alignment spectrum and assess whether latent-space signatures of jailbreak prompts are consistent and model-agnostic.
Tensor Construction and Latent Factors for Jailbreak Detection
For a given set of input prompts containing both benign and jailbreak instances, we extract multi-head attention representations and hidden-state/layer outputs from each layer of the model. For a single prompt, the resulting representation has dimensions $1× T× d$ , where $T$ denotes the sequence length and $d$ is the hidden dimensionality. By stacking representations across multiple prompts, we construct a third-order tensor of size $N× T× d$ , where $N$ is the number of prompts, as illustrated in Fig. 1.
To analyze the latent structure of these internal representations, we apply the CANDECOMP/PARAFAC (CP) tensor decomposition to factorize the tensor into three low-rank factors corresponding to the prompt, sequence, and hidden dimensions. The factor associated with the prompt mode captures latent patterns that reflect how different prompts are encoded internally across model layers. Prior work (Papalexakis, 2018; Zhao et al., 2019) has shown that tensor decomposition-derived latent patterns effectively capture meaningful structure for classification and detection tasks, even with limited data (Qazi et al., 2024; Kadali and Papalexakis, 2025).
We use these prompt-mode latent factors as features for a lightweight classifier that distinguishes jailbreak prompts from benign prompts. This classifier serves two purposes: to assess separability between prompt types in the latent space (Fig. 2), and as a mechanism to estimate layer-wise susceptibility to jailbreak behavior, which is leveraged by the mitigation method described next.
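This detection pipeline can be sketched end-to-end on synthetic activations. The sketch below is a simplified stand-in: the paper derives prompt-mode factors via CP decomposition (a rank-$R$ SVD of the prompt-mode unfolding plays that role here), a nearest-centroid rule plays the role of the lightweight classifier, and all sizes and the synthetic mean shift are illustrative:

```python
import numpy as np

# Toy setup: N prompts, T tokens, d hidden dims (illustrative values).
N, T, d, R = 40, 6, 16, 4
rng = np.random.default_rng(3)
labels = np.array([0] * 20 + [1] * 20)          # 0 = benign, 1 = jailbreak

# Simulated layer activations; jailbreak prompts get a shifted mean.
X = rng.standard_normal((N, T, d))
X[labels == 1] += 0.8

# Prompt-mode features: unfold along the prompt mode and take a rank-R
# SVD basis (a stand-in for the CP prompt-mode factor matrix).
unfolded = X.reshape(N, T * d)
U, S, Vt = np.linalg.svd(unfolded, full_matrices=False)
prompt_factors = U[:, :R] * S[:R]               # N x R latent features

# Lightweight classifier: nearest class centroid in the latent space.
centroids = np.stack([prompt_factors[labels == c].mean(0) for c in (0, 1)])
pred = np.argmin(
    np.linalg.norm(prompt_factors[:, None, :] - centroids, axis=-1), axis=1)
print(f"train accuracy: {(pred == labels).mean():.2f}")
```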
Layer-wise Separability and Susceptibility
To localize where jailbreak-related information is expressed within the network, we train separate classifiers on latent factors extracted from individual layers. This produces a layer-wise profile that quantifies how strongly each layer encodes representations associated with adversarial prompts. From an interpretability perspective, these layers can be viewed as critical representational stages where benign and adversarial behaviors diverge.
Importantly, we do not interpret high separability as evidence that a layer directly causes jailbreak behavior. Rather, it indicates that these layers capture discriminative representations associated with jailbreak prompts, making them especially informative for detection. This observation provides insight into how adversarial instructions propagate through the model and forms the basis for the inference-time intervention introduced in the following section.
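Once per-layer classifiers are available, turning their outputs into a susceptibility profile reduces to thresholding. The per-layer jailbreak probabilities and the threshold below are hypothetical illustrative values, not results from the paper:

```python
import numpy as np

# Hypothetical layer-wise jailbreak probabilities for one prompt,
# e.g. from classifiers trained on per-layer latent factors.
layer_probs = np.array([0.12, 0.18, 0.55, 0.81, 0.74, 0.33, 0.91, 0.28])

threshold = 0.7  # susceptibility threshold (a tunable assumption)
bypass = np.flatnonzero(layer_probs > threshold)
print("layers flagged as susceptible:", bypass.tolist())  # [3, 4, 6]
```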
Figure 2. t-SNE visualization of prompt-mode CP factors for a representative model. Clear separation between benign and jailbreak clusters indicates that internal latent factors capture strong structure, motivating their use for jailbreak detection. Similar patterns are observed across models.
4.2. Layer-Aware Mitigation via Latent-Space Susceptibility
To mitigate jailbreak attacks, we propose a representation-level defense method that leverages layer-wise susceptibility signals derived from internal representations. By identifying layers that strongly encode jailbreak-specific representations, we selectively bypass them during inference (Elhoushi et al., 2024; Shukor and Cord, 2024). While prior work has explored layer bypassing primarily for reducing computation and improving inference efficiency, our approach demonstrates that such bypassing can simultaneously reduce computational cost and mitigate jailbreak behaviors (Luo et al., 2025; Lawson and Aitchison, 2025).
We conduct this experiment on an abliterated model (LLaMA-3.1-8B). Base models are excluded from this intervention as they primarily perform next-token prediction without alignment constraints, making output-based safety evaluation less meaningful. Instruction-tuned models are also not ideal candidates, as their built-in guardrails obscure whether observed safety improvements arise from our method or from prior alignment. Abliterated models, which lack safety mechanisms while retaining instruction-tuned structure, provide a suitable testbed for isolating the effects of our approach.
Layer-wise Projection and Jailbreak Susceptibility Scoring.
Given an input prompt, we extract intermediate representations from each transformer layer of the model. Let $\mathbf{x}\in\mathbb{R}^{d}$ denote the feature vector obtained from a specific layer (e.g., multi-head self-attention output or layer output).
Latent Projection.
For each layer, we project the extracted features onto a lower-dimensional latent space defined by factors obtained from tensor decomposition of the corresponding instruct-model representations from § 4.1. Let $\mathbf{W}\in\mathbb{R}^{d\times r}$ denote the matrix of $r$ basis vectors (factors). The projected representation is computed as:
$$
\mathbf{z}=\mathbf{W}^{\top}\mathbf{x},
$$
where $\mathbf{z}\in\mathbb{R}^{r}$ is the latent feature representation. This operation constitutes a linear projection that preserves task-relevant structure encoded by the factors.
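As a concrete illustration, the projection is a single matrix product. The sketch below uses NumPy with illustrative dimensions and a random matrix standing in for the learned factor matrix $\mathbf{W}$:

```python
import numpy as np

d, r = 4096, 20                  # hidden size and number of factors (illustrative)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, r))  # stand-in for the learned CP factor matrix
x = rng.standard_normal(d)       # feature vector from one layer

z = W.T @ x                      # latent projection z = W^T x, shape (r,)
```

For a batch of prompts, the same projection applies row-wise: `Z = X @ W` with `X` of shape `(N, d)`.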
Layer-wise Jailbreak Probability Estimation.
The projected features are passed to a classifier trained to distinguish between benign and jailbreak prompts. For a logistic regression classifier, the probability of a jailbreak at a given layer is computed as:
$$
p=\sigma(\mathbf{w}^{\top}\mathbf{z}+b),
$$
where $\mathbf{w}$ and $b$ denote the classifier weights and bias, respectively, and $\sigma(\cdot)$ is the sigmoid function:
$$
\sigma(x)=\frac{1}{1+e^{-x}}.
$$
Classifier Training Objective.
The classifier is trained using labeled projected representations $(\mathbf{z}_{i},y_{i})$, where $y_{i}\in\{0,1\}$ indicates benign or jailbreak prompts. The parameters are optimized by minimizing the binary cross-entropy loss:
$$
\mathcal{L}=-\frac{1}{N}\sum_{i=1}^{N}\left[y_{i}\log p_{i}+(1-y_{i})\log(1-p_{i})\right],
$$
where $N$ denotes the number of samples and $p_{i}$ is the predicted probability for sample $i$ .
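As a minimal illustration (not our training code), the sketch below fits $\mathbf{w}$ and $b$ by plain gradient descent on this loss, using synthetic latent features in place of real projected representations:

```python
import numpy as np

r, N = 20, 200
rng = np.random.default_rng(0)

# Synthetic latent features: two shifted Gaussians standing in for the classes.
Z = np.vstack([rng.standard_normal((N // 2, r)) - 1.0,   # benign (y = 0)
               rng.standard_normal((N // 2, r)) + 1.0])  # jailbreak (y = 1)
y = np.array([0.0] * (N // 2) + [1.0] * (N // 2))

w, b, lr = np.zeros(r), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(Z @ w + b)))  # sigma(w^T z + b)
    grad = p - y                            # gradient of the BCE loss w.r.t. the logit
    w -= lr * (Z.T @ grad) / N
    b -= lr * grad.mean()

acc = ((p > 0.5) == (y == 1)).mean()
```

Any off-the-shelf logistic regression (e.g., scikit-learn's) minimizes the same objective and can be substituted directly.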
Layer Susceptibility Interpretation.
Layers exhibiting higher classification performance (e.g., F1 score) indicate stronger representational separability between benign and jailbreak prompts within the latent space. We interpret such layers as being more susceptible to jailbreak-style perturbations, as they encode discriminative adversarial features.
5. Experimental Evaluation
<details>
<summary>x2.png Details</summary>

Two heatmaps compare layer-wise F1 scores (color scale: 0.5 red to 1.0 green) for GPT-J-Base, GPT-JT-Instruct, Llama-Base, Llama-Instruct, Llama-Abliterated, Mistral-Base, and Mistral-Instruct across layers 0-31. Left (Layer Output F1 scores): the GPT-J and Llama models score consistently high (green), while the Mistral models are somewhat lower (yellow-green). Right (MHA Output F1 scores): scores are high and consistent for all models, except for a sharp dip at layer 0 for GPT-J-Base and GPT-JT-Instruct. The Llama models report data only up to layer 27, reflecting their smaller layer count.
</details>
Figure 3. (Left) Layer-wise F1 scores using CP-decomposed Transformer layer outputs. (Right) Layer-wise F1 scores using CP-decomposed multi-head attention representations. Jailbreak and benign prompts become reliably separable at early depths, suggesting that adversarial intent is encoded shortly after input embedding. The strong performance of attention-based features further indicates that prompt-type information is reflected in token interaction structure as well as in hidden representations.
5.1. Datasets and Prompt Scope
We use two prompt sources with a known provenance relationship, allowing us to separate representation learning from mitigation evaluation.
Training/representation analysis. For latent-space analysis, we use the Jailbreak Classification dataset from Hugging Face (Hao, [n. d.]). This dataset provides labeled benign and jailbreak prompts and is used to (i) extract layer-wise hidden representations and multi-head attention (MHA) outputs, (ii) learn CP decomposition factors for each layer and representation type, and (iii) train a lightweight classifier on the resulting latent features.
Test/mitigation evaluation. To evaluate layer-aware bypass at inference time, we construct a held-out test set of 200 prompts (100 benign, 100 jailbreak) sourced from the ‘In the Wild Jailbreak Prompts’ collection (Shen et al., 2024). The training dataset (Hao, [n. d.]) reports that its jailbreak prompts are drawn from (Shen et al., 2024), so the two evaluations are consistent in attack style while the test prompts remain distinct from the training corpus.
Attack scope. Both datasets primarily consist of instruction-level jailbreaks (e.g., persona overrides, explicit safety negation, role-play framing, and meta-instructions). They do not include optimization-based attacks such as GCG (Zou et al., 2023), PAIR (Chao et al., 2024), or other gradient-guided adversarial suffix constructions. Since our framework operates on internal representations, extending it to additional attack families can be achieved by incorporating corresponding prompt distributions during factor learning; we leave such evaluations to future work.
5.2. Hidden Representation Analysis
We assess whether internal model representations can reliably separate jailbreak from benign prompts across model families, layers, and representation types. Our analysis spans eight models covering base, instruction-tuned, abliterated, and state-space (Mamba) variants, providing a broad view across training paradigms and architectures.
<details>
<summary>x3.png Details</summary>

Line chart of F1 score (0.5-1.0) versus layer (0-56) for Mamba-2.8B. Block Output (blue) rises from roughly 0.88 at layer 0 to about 0.95 by layer 16 and remains stable thereafter; Mixer Output (purple) starts lower (about 0.81-0.82) and is more volatile in the early layers, converging with Block Output around layer 16 at roughly 0.95. Both curves decline slightly in the final layers.
</details>
Figure 4. Layer-wise F1 scores for CP-decomposed Mamba representations (mixer and block outputs) showing early and increasing separability between benign and jailbreak prompts, indicating that state-space architectures encode adversarial prompt structure in their internal representations.
For each model, we extract hidden states and multi-head attention (MHA) outputs across all layers. These are aggregated into third-order tensors ($N \times T \times d$) and decomposed using CP tensor decomposition to obtain low-dimensional features for each prompt. We fix the decomposition rank to $r=20$ for all experiments, balancing expressiveness and efficiency as discussed in § 3.2 and ensuring consistent latent representations across models. A lightweight classifier is trained on these features to predict jailbreak status. Fig. 3 presents layer-wise F1 scores for each model. For Mamba, we report results from both the mixer (analogous to MHA) and the full block output in Fig. 4.
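To make the pipeline concrete, the sketch below implements a toy rank-$r$ CP decomposition via alternating least squares; tensor sizes are illustrative, and a library routine such as TensorLy's `parafac` would normally be used instead. The prompt-mode factor matrix then provides the per-prompt features:

```python
import numpy as np

def unfold(X, mode):
    """Mode-n unfolding of a 3-way tensor into a matrix."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def khatri_rao(A, B):
    """Column-wise Khatri-Rao product of two factor matrices."""
    return np.einsum('ir,jr->ijr', A, B).reshape(-1, A.shape[1])

def cp_als(X, rank, n_iter=50, seed=0):
    """Minimal rank-`rank` CP decomposition via alternating least squares."""
    rng = np.random.default_rng(seed)
    A, B, C = (rng.standard_normal((s, rank)) for s in X.shape)
    for _ in range(n_iter):
        A = unfold(X, 0) @ np.linalg.pinv(khatri_rao(B, C).T)
        B = unfold(X, 1) @ np.linalg.pinv(khatri_rao(A, C).T)
        C = unfold(X, 2) @ np.linalg.pinv(khatri_rao(A, B).T)
    return A, B, C

# Toy activation tensor: prompts x tokens x hidden dim.
rng = np.random.default_rng(1)
X = rng.standard_normal((40, 16, 64))
A, B, C = cp_als(X, rank=20, n_iter=20)
prompt_features = A                     # one r-dimensional feature row per prompt
```

A classifier is then trained on `prompt_features`, one row per prompt, exactly as in § 4.2.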
Our results show a clear separation between the two prompt types in the learned latent space, achieving consistently high F1 scores across all evaluated models. These findings suggest that jailbreak behavior manifests as identifiable and discriminative patterns within internal representations, independent of output quality or alignment stage, and can be effectively leveraged for detection without model fine-tuning.
Qualitative Analysis.
For clarity of presentation, we visualize qualitative results for instruction-tuned models, which provide the most interpretable view of aligned internal dynamics. The qualitative patterns discussed here are representative of those observed across all evaluated models.
<details>
<summary>x4.png Details</summary>

Attention heatmaps (query token vs. key token, positions 0-448, log scale from -10 to 0) for GPT-JT-6B layer 7, LLaMA-3.1-8B layer 4, and Mistral-7B layer 18, each shown for benign prompts, jailbreak prompts, and their difference (scale -4 to 4). For every model, the benign and jailbreak maps look broadly similar, while the difference maps consistently show negative values (blue) in the upper-left triangle and positive values (red) in the lower-right, indicating a systematic, localized shift in attention under jailbreak prompts.
</details>
Figure 5. Self-attention maps for three instruction-tuned models, averaged over benign and jailbreak prompts (log 10 scale). Difference maps (right) highlight systematic but localized changes in attention patterns induced by jailbreak prompts, suggesting that adversarial intent manifests as targeted rerouting of attention rather than global disruption.
<details>
<summary>x5.png Details</summary>

Heatmaps of hidden-state activation magnitude (layers on the y-axis, token positions 0-448 on the x-axis) for GPT-JT-6B, LLaMA-3.1-8B, and Mistral-7B under benign and jailbreak prompts, together with their difference. Benign and jailbreak maps are broadly similar for each model. The difference maps show the largest deviations in the upper layers of GPT-JT-6B (range about -2 to 2), a mixed layer-dependent pattern for LLaMA-3.1-8B (about -0.2 to 0.2), and only small deviations for Mistral-7B, consistent with layer-localized rather than global representational shifts.
</details>
Figure 6. Layer-wise log-magnitude of hidden representations for benign (left) and jailbreak (middle) prompts, averaged across prompts, with their difference shown on the right. The difference heatmaps reveal consistent, localized deviations across layers, highlighting where adversarial prompts induce layer-dependent representational shifts.
Aggregated Self-Attention Heatmaps.
We use aggregated self-attention heatmaps to qualitatively assess how jailbreak prompts alter token-to-token information routing within the model. While attention alone does not encode semantic content, systematic differences in attention patterns can indicate how adversarial prompts redirect internal focus during processing.
For each instruction-tuned model and transformer layer $\ell$ , we extract the self-attention weight tensors
$$
\mathbf{A}^{(\ell)}\in\mathbb{R}^{N\times H\times T\times T},
$$
where $N$ is the number of prompts, $H$ the number of attention heads, and $T$ the (padded) token length. To obtain a stable, global view of attention behavior, we aggregate over both prompts and heads. Let $\mathcal{I}_{\text{ben}}$ and $\mathcal{I}_{\text{jb}}$ denote the index sets of benign and jailbreak prompts, respectively. We compute the class-wise, head-averaged attention maps:
$$
\bar{\mathbf{A}}^{(\ell)}_{\text{ben}}=\frac{1}{|\mathcal{I}_{\text{ben}}|H}\sum_{n\in\mathcal{I}_{\text{ben}}}\sum_{h=1}^{H}\mathbf{A}^{(\ell)}_{n,h,:,:},\qquad\bar{\mathbf{A}}^{(\ell)}_{\text{jb}}=\frac{1}{|\mathcal{I}_{\text{jb}}|H}\sum_{n\in\mathcal{I}_{\text{jb}}}\sum_{h=1}^{H}\mathbf{A}^{(\ell)}_{n,h,:,:}.
$$
To highlight systematic differences between prompt types, we additionally compute a per-layer difference map:
$$
\Delta\mathbf{A}^{(\ell)}=\bar{\mathbf{A}}^{(\ell)}_{\text{jb}}-\bar{\mathbf{A}}^{(\ell)}_{\text{ben}}.
$$
Visualizing $\bar{\mathbf{A}}^{(\ell)}_{\text{ben}}$, $\bar{\mathbf{A}}^{(\ell)}_{\text{jb}}$, and $\Delta\mathbf{A}^{(\ell)}$ (Fig. 5) shows that jailbreak prompts lead to consistent, localized changes in attention patterns. This indicates that adversarial prompts influence attention by selectively emphasizing specific instruction or control tokens, providing qualitative evidence that jailbreak behavior arises from targeted changes in information flow rather than global attention disruption.
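The aggregation reduces to masked means over the prompt and head axes. A NumPy sketch with toy row-stochastic attention maps and hypothetical benign/jailbreak labels:

```python
import numpy as np

N, H, T = 6, 4, 8                       # prompts, heads, tokens (toy sizes)
rng = np.random.default_rng(0)
A = rng.random((N, H, T, T))
A /= A.sum(axis=-1, keepdims=True)      # rows sum to 1, like softmax attention

is_jb = np.array([0, 0, 0, 1, 1, 1], dtype=bool)  # hypothetical class labels

A_ben = A[~is_jb].mean(axis=(0, 1))     # class-wise, head-averaged maps (T x T)
A_jb = A[is_jb].mean(axis=(0, 1))
dA = A_jb - A_ben                       # per-layer difference map
```

Each row of `dA` sums to zero, since both class averages remain row-stochastic.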
Hidden-Representation Magnitude Heatmaps.
While attention maps reflect information routing, hidden representations capture the content and intensity of internal computation. We therefore analyze layer-wise hidden-state magnitudes to understand how strongly jailbreak prompts perturb internal activations across network depth.
For each layer $\ell$ , we extract hidden states
$$
\mathbf{H}^{(\ell)}\in\mathbb{R}^{N\times T\times D},
$$
where $D$ is the hidden dimensionality. To summarize activation strength across token positions, we compute the per-token $\ell_{2}$ magnitude:
$$
\mathbf{M}^{(\ell)}_{n,t}=\left\lVert\mathbf{H}^{(\ell)}_{n,t,:}\right\rVert_{2},\qquad\mathbf{M}^{(\ell)}\in\mathbb{R}^{N\times T}.
$$
We then average magnitudes across prompts within each class:
$$
\bar{\mathbf{M}}^{(\ell)}_{\text{ben}}(t)=\frac{1}{|\mathcal{I}_{\text{ben}}|}\sum_{n\in\mathcal{I}_{\text{ben}}}\mathbf{M}^{(\ell)}_{n,t},\qquad\bar{\mathbf{M}}^{(\ell)}_{\text{jb}}(t)=\frac{1}{|\mathcal{I}_{\text{jb}}|}\sum_{n\in\mathcal{I}_{\text{jb}}}\mathbf{M}^{(\ell)}_{n,t}.
$$
For visualization, we apply a logarithmic transform:
$$
\tilde{\mathbf{M}}^{(\ell)}(t)=\log\big(\bar{\mathbf{M}}^{(\ell)}(t)+\varepsilon\big),
$$
with a small $\varepsilon>0$ for numerical stability. We plot $\tilde{\mathbf{M}}^{(\ell)}(t)$ as heatmaps with layers on the y-axis and token positions on the x-axis.
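The magnitude maps follow directly from these definitions; a sketch with random hidden states and hypothetical labels:

```python
import numpy as np

N, T, D = 10, 12, 32                    # prompts, tokens, hidden dim (toy sizes)
rng = np.random.default_rng(0)
H = rng.standard_normal((N, T, D))      # hidden states at one layer
is_jb = np.array([0] * 5 + [1] * 5, dtype=bool)  # hypothetical class labels

M = np.linalg.norm(H, axis=-1)          # per-token L2 magnitudes, shape (N, T)
M_ben = M[~is_jb].mean(axis=0)          # class-averaged magnitude profiles
M_jb = M[is_jb].mean(axis=0)

eps = 1e-8
row_ben = np.log(M_ben + eps)           # one heatmap row for this layer
row_jb = np.log(M_jb + eps)
diff = row_jb - row_ben                 # corresponding difference-map row
```

Stacking `row_ben`, `row_jb`, and `diff` across layers yields the three panels of the heatmap.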
Although the averaged hidden-state magnitudes for benign and jailbreak prompts appear broadly similar (especially for LLaMA and Mistral), their difference heatmaps reveal consistent, localized deviations across layers. This indicates that jailbreak behavior does not manifest as a global disruption of internal activations, but rather as subtle, structured changes superimposed on otherwise normal model processing.
5.3. Layer-Aware Mitigation via Latent-Space Susceptibility
We evaluate our second proposed method, layer-aware mitigation via latent-space susceptibility, to test whether representation-level signals can be used to suppress jailbreak execution during inference without relying on output-level filtering or fine-tuning.
Experimental Setup.
We conduct this experiment on the abliterated LLaMA 3.1 8B model described earlier. Evaluation is performed on a held-out set of 200 prompts (100 benign, 100 jailbreak), using latent factors learned during the analysis phase (§ 5.2).
Inference-time Susceptibility Scoring.
Given an input prompt at inference time, we extract layer outputs and attention representations, project them onto the pre-learned CP factors, and use a lightweight classifier to compute a per-layer susceptibility score indicating the strength of jailbreak-correlated features. Layers whose predicted jailbreak probability exceeds a fixed threshold ($\tau=0.7$) are treated as highly susceptible. The threshold $\tau$ is a tunable hyperparameter that controls the trade-off between mitigation strength and preservation of benign behavior.
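A sketch of the scoring step, with random stand-ins for the pre-learned per-layer factors and classifier parameters (only the threshold τ = 0.7 is taken from our setup):

```python
import numpy as np

TAU = 0.7                               # susceptibility threshold
L, d, r = 8, 64, 20                     # layers, feature dim, CP rank (toy sizes)
rng = np.random.default_rng(0)

W = rng.standard_normal((L, d, r)) / np.sqrt(d)  # stand-in per-layer factors
w = rng.standard_normal((L, r))                  # stand-in classifier weights
b = rng.standard_normal(L)                       # stand-in classifier biases

x = rng.standard_normal((L, d))                  # per-layer features, one prompt
z = np.einsum('ldr,ld->lr', W, x)                # project onto latent factors
p = 1.0 / (1.0 + np.exp(-(np.einsum('lr,lr->l', w, z) + b)))

susceptible = np.flatnonzero(p > TAU)            # layers flagged for bypass
```

The flagged indices in `susceptible` drive the layer-/head-bypass intervention described next.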
Layer-/Head-Bypass Intervention.
Based on the susceptibility score, we selectively perform: (i) Layer Bypass: bypassing selected layer outputs; and (ii) MHA Bypass: bypassing selected attention components. This intervention is parameter-free (no fine-tuning), prompt-conditional (depends on the susceptibility profile), and does not require any output-side heuristics.
Output-Based Evaluation with LLM-Assisted Judging.
Since our goal is to prevent harmful compliance rather than optimize helpfulness, we evaluate mitigation effectiveness based on observed output behavior. Model responses are categorized as: (i) harmful completions, where the jailbreak intent succeeds; (ii) benign completions, where the model responds appropriately; and (iii) disrupted outputs, including truncated, repetitive, or incoherent text.
For jailbreak prompts, disrupted or non-compliant outputs are treated as successful defenses, while for benign prompts such outputs are undesirable. Output labels are assigned using an LLM-as-a-judge rubric, followed by manual review of ambiguous cases. This evaluation protocol follows established practice for open-ended generation assessment with human validation (Zheng et al., 2023; Dubois et al., 2024; Li et al., 2024). Based on these criteria, we define the confusion matrix as follows:
| Prompt type | Model output | Mitigation label |
| --- | --- | --- |
| Jailbreak | Harmful completion | False Negative (FN) |
| Jailbreak | Disrupted/benign output | True Positive (TP) |
| Benign | Benign completion | True Negative (TN) |
| Benign | Disrupted output | False Positive (FP) |
Results.
Table 1 summarizes the confusion-matrix counts. Layer-guided bypass suppresses most jailbreak attempts (TP=78) while largely preserving benign behavior (TN=94). In contrast, MHA-only bypass results in substantially more jailbreak failures (FN=39), indicating that layer outputs capture a larger fraction of jailbreak-relevant computation than attention components alone.
To provide a compact summary aligned with prior jailbreak evaluations, we additionally report the attack success rate (ASR), defined as the fraction of jailbreak prompts that remain successful after mitigation. Layer-output bypass reduces ASR to $22\%$ (22/100), compared to $39\%$ for MHA-only bypass, highlighting the effectiveness of layer-level intervention.
Table 1. Confusion-matrix counts for latent-space-guided mitigation (100 jailbreak and 100 benign prompts).
| Method | TP | FN | TN | FP |
| --- | --- | --- | --- | --- |
| Layer Bypass | 78 | 22 | 94 | 6 |
| MHA Bypass | 61 | 39 | 92 | 8 |
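The summary metrics follow directly from these counts. A minimal sketch (variable and function names are ours) recomputes ASR and the benign-preservation rate from Table 1:

```python
def summarize(tp: int, fn: int, tn: int, fp: int):
    """Derive summary metrics from confusion-matrix counts.

    ASR = fraction of jailbreak prompts that remain successful (FN rate);
    benign-preservation = fraction of benign prompts left intact (TN rate).
    """
    asr = fn / (tp + fn)
    benign_preserved = tn / (tn + fp)
    return asr, benign_preserved

# Counts from Table 1 (100 jailbreak + 100 benign prompts per method).
layer_asr, layer_benign = summarize(78, 22, 94, 6)  # ASR 0.22, benign 0.94
mha_asr, mha_benign = summarize(61, 39, 92, 8)      # ASR 0.39, benign 0.92
```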
Failure (False Negative) Analysis
We examine the 22 jailbreak prompts that remain harmful after layer-output bypass. The majority are persona- or roleplay-based prompt injections (e.g., “never refuse,” “no morals,” forced speaker tags such as “AIM:” and “[H4X]:”) that aim to establish persistent control over the model’s identity, tone, and formatting. Because such instructions are repeatedly reinforced throughout the prompt, elements of adversarial control can persist even when highly susceptible layers are bypassed.
Additional failures stem from susceptibility estimation: the intervention targets layers exceeding a fixed probability threshold chosen to preserve benign behavior. Attacks that distribute their influence across multiple layers, or weakly activate any single layer, may therefore evade suppression despite succeeding overall. Some failures also involve milder jailbreaks that retain adversarial framing without immediately producing explicit harmful content; under our conservative evaluation criterion, these are counted as failures.
These limitations are addressable within the proposed framework by expanding the diversity of jailbreak styles used for latent factor learning and by adopting adaptive or cumulative susceptibility criteria. Since the method operates entirely in latent space, such extensions require no architectural changes.
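The two selection criteria contrasted above can be sketched as follows. The fixed-threshold rule mirrors the intervention actually used, while the cumulative rule illustrates one possible extension; the susceptibility scores, threshold, and budget values here are hypothetical, not the paper's.

```python
def select_by_threshold(scores, tau=0.8):
    """Bypass layers whose susceptibility probability exceeds a fixed
    threshold (the criterion used in the paper's intervention)."""
    return [i for i, s in enumerate(scores) if s > tau]

def select_cumulative(scores, budget=0.5):
    """Illustrative adaptive alternative: greedily add the most susceptible
    layers until a fixed fraction of total susceptibility mass is covered.
    This can catch attacks that spread influence across many layers."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total = sum(scores)
    chosen, mass = [], 0.0
    for i in order:
        if mass / total >= budget:
            break
        chosen.append(i)
        mass += scores[i]
    return sorted(chosen)

# Hypothetical per-layer susceptibility probabilities.
scores = [0.1, 0.9, 0.3, 0.7]
threshold_layers = select_by_threshold(scores, tau=0.8)   # only layer 1
cumulative_layers = select_cumulative(scores, budget=0.5)  # layers 1 and 3
```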
6. Conclusion
Our hypothesis and experiments indicate that internal representations of LLMs contain sufficiently strong and consistent signals to both detect jailbreak prompts and, in many cases, disrupt jailbreak execution at inference time. Importantly, these capabilities emerge from lightweight representation-level analysis and intervention, without requiring additional post-training, auxiliary models, or complex rule-based filtering. The consistency of these findings across diverse model families suggests that adversarial intent leaves stable latent-space signatures, motivating internal-representation monitoring as a practical and broadly applicable direction for understanding and mitigating jailbreak behavior.
7. GenAI Usage Disclosure
The authors acknowledge the use of AI-based writing and coding assistance tools during the preparation of this manuscript. These tools were used exclusively to improve clarity, organization, and academic tone of text written by the authors, as well as to assist with code formatting and plot generation. All scientific ideas, methodologies, analyses, and conclusions are the original intellectual contributions of the authors. No AI system was used to generate research ideas or substantive technical content, and all AI-assisted revisions were carefully reviewed and validated by the authors.
References
- jai ([n. d.]) [n. d.]. Implementation of the proposed method. https://anonymous.4open.science/r/Jailbreaking-leaves-a-trace-Understanding-and-Detecting-Jailbreak-Attacks-in-LLMs-C401
- mam ([n. d.]) [n. d.]. Mamba-2.8B. https://huggingface.co/state-spaces/mamba-2.8b-hf
- AI (2023) Mistral AI. 2023. Mistral 7B Instruct v0.1. https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1.
- AI@Meta (2024a) AI@Meta. 2024a. Llama 3 Instruct Models. https://ai.meta.com/llama/. LLaMA-3.1-8B-Instruct.
- AI@Meta (2024b) AI@Meta. 2024b. Llama 3 Models. https://ai.meta.com/llama/. LLaMA-3.1-8B base model.
- Bai et al. (2022) Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, and Jared Kaplan. 2022. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv:2204.05862 [cs.CL] https://arxiv.org/abs/2204.05862
- Candogan et al. (2025) Leyla Naz Candogan, Yongtao Wu, Elias Abad Rocamora, Grigorios Chrysos, and Volkan Cevher. 2025. Single-pass Detection of Jailbreaking Input in Large Language Models. Transactions on Machine Learning Research (2025). https://openreview.net/forum?id=42v6I5Ut9a
- Chao et al. (2024) Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. 2024. Jailbreaking Black Box Large Language Models in Twenty Queries. arXiv:2310.08419 [cs.LG] https://arxiv.org/abs/2310.08419
- Chen et al. (2023) Bocheng Chen, Advait Paliwal, and Qiben Yan. 2023. Jailbreaker in Jail: Moving Target Defense for Large Language Models. arXiv:2310.02417 [cs.CR] https://arxiv.org/abs/2310.02417
- Computer (2022) Together Computer. 2022. GPT-JT-6B: Instruction-Tuned GPT-J Model. https://huggingface.co/togethercomputer/GPT-JT-6B-v1.
- Deng et al. (2024) Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. 2024. Multilingual Jailbreak Challenges in Large Language Models. arXiv:2310.06474 [cs.CL] https://arxiv.org/abs/2310.06474
- Dubois et al. (2024) Yann Dubois, Aarohi Srivastava, Abhinav Venigalla, et al. 2024. Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators. arXiv preprint arXiv:2404.04475 (2024). https://arxiv.org/abs/2404.04475
- Elhoushi et al. (2024) Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed Aly, Beidi Chen, and Carole-Jean Wu. 2024. LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 12622–12642. doi: 10.18653/v1/2024.acl-long.681
- Glaese et al. (2022) Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, Lucy Campbell-Gillingham, Jonathan Uesato, Po-Sen Huang, Ramona Comanescu, Fan Yang, Abigail See, Sumanth Dathathri, Rory Greig, Charlie Chen, Doug Fritz, Jaume Sanchez Elias, Richard Green, Soňa Mokrá, Nicholas Fernando, Boxi Wu, Rachel Foley, Susannah Young, Iason Gabriel, William Isaac, John Mellor, Demis Hassabis, Koray Kavukcuoglu, Lisa Anne Hendricks, and Geoffrey Irving. 2022. Improving alignment of dialogue agents via targeted human judgements. arXiv:2209.14375 [cs.LG] https://arxiv.org/abs/2209.14375
- Hao ([n. d.]) Jack Hao. [n. d.]. Jailbreak Classification Dataset. https://huggingface.co/datasets/jackhhao/jailbreak-classification
- Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, et al. 2023. Mistral 7B. https://mistral.ai/news/introducing-mistral-7b/.
- Jiang et al. (2024) Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, and Nouha Dziri. 2024. WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models. arXiv:2406.18510 [cs.CL] https://arxiv.org/abs/2406.18510
- Kadali and Papalexakis (2025) Sri Durga Sai Sowmya Kadali and Evangelos Papalexakis. 2025. CoCoTen: Detecting Adversarial Inputs to Large Language Models through Latent Space Features of Contextual Co-occurrence Tensors. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM ’25). ACM, 4857–4861. doi: 10.1145/3746252.3760886
- Labonne ([n. d.]) M. Labonne. [n. d.]. Meta-Llama-3.1-8B-Instruct-abliterated. https://huggingface.co/mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated. Hugging Face model card.
- Lawson and Aitchison (2025) Tim Lawson and Laurence Aitchison. 2025. Learning to Skip the Middle Layers of Transformers. arXiv:2506.21103 [cs.LG] https://arxiv.org/abs/2506.21103
- Li et al. (2024) Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. 2024. LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods. arXiv:2412.05579 [cs.CL] https://arxiv.org/abs/2412.05579
- Liu et al. (2024) Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. 2024. AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models. arXiv:2310.04451 [cs.CL] https://arxiv.org/abs/2310.04451
- Lu et al. (2024) Weikai Lu, Ziqian Zeng, Jianwei Wang, Zhengdong Lu, Zelin Chen, Huiping Zhuang, and Cen Chen. 2024. Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge. arXiv:2404.05880 [cs.CL] https://arxiv.org/abs/2404.05880
- Luo et al. (2025) Xuan Luo, Weizhi Wang, and Xifeng Yan. 2025. Adaptive layer-skipping in pre-trained llms. arXiv preprint arXiv:2503.23798 (2025).
- Ouyang et al. (2022) Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. arXiv:2203.02155 [cs.CL] https://arxiv.org/abs/2203.02155
- Papalexakis (2018) Evangelos E. Papalexakis. 2018. Unsupervised Content-Based Identification of Fake News Articles with Tensor Decomposition Ensembles. https://api.semanticscholar.org/CorpusID:26675959
- Qazi et al. (2024) Zubair Qazi, William Shiao, and Evangelos E. Papalexakis. 2024. GPT-generated Text Detection: Benchmark Dataset and Tensor-based Detection Method. In Companion Proceedings of the ACM Web Conference 2024 (Singapore, Singapore) (WWW ’24). Association for Computing Machinery, New York, NY, USA, 842–846. doi: 10.1145/3589335.3651513
- Shen et al. (2024) Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. 2024. “Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models. In ACM SIGSAC Conference on Computer and Communications Security (CCS). ACM.
- Shukor and Cord (2024) Mustafa Shukor and Matthieu Cord. 2024. Skipping Computations in Multimodal LLMs. In Workshop on Responsibly Building the Next Generation of Multimodal Foundational Models. https://openreview.net/forum?id=qkmMvLckB9
- Sidiropoulos et al. (2017) Nicholas D Sidiropoulos, Lieven De Lathauwer, Xiao Fu, Kejun Huang, Evangelos E Papalexakis, and Christos Faloutsos. 2017. Tensor decomposition for signal processing and machine learning. IEEE Transactions on signal processing 65, 13 (2017), 3551–3582.
- Vaswani et al. (2023) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2023. Attention Is All You Need. arXiv:1706.03762 [cs.CL] https://arxiv.org/abs/1706.03762
- Wang et al. (2021) Ben Wang, Aran Komatsuzaki, and EleutherAI. 2021. GPT-J-6B. https://huggingface.co/EleutherAI/gpt-j-6B. Model card and weights.
- Wang et al. (2024a) Hao Wang, Hao Li, Minlie Huang, and Lei Sha. 2024a. ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational Linguistics, Miami, Florida, USA, 2697–2711. doi: 10.18653/v1/2024.emnlp-main.157
- Wang et al. (2025) Xunguang Wang, Daoyuan Wu, Zhenlan Ji, Zongjie Li, Pingchuan Ma, Shuai Wang, Yingjiu Li, Yang Liu, Ning Liu, and Juergen Rahmel. 2025. SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner. arXiv:2406.05498 [cs.CR] https://arxiv.org/abs/2406.05498
- Wang et al. (2024b) Yihan Wang, Zhouxing Shi, Andrew Bai, and Cho-Jui Hsieh. 2024b. Defending llms against jailbreaking attacks via backtranslation. arXiv preprint arXiv:2402.16459 (2024).
- Wang et al. (2023) Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, and Qun Liu. 2023. Aligning Large Language Models with Human: A Survey. arXiv:2307.12966 [cs.CL] https://arxiv.org/abs/2307.12966
- Wei et al. (2023) Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. Jailbroken: How Does LLM Safety Training Fail? arXiv:2307.02483 [cs.LG] https://arxiv.org/abs/2307.02483
- Xiong et al. (2025) Chen Xiong, Xiangyu Qi, Pin-Yu Chen, and Tsung-Yi Ho. 2025. Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks. arXiv:2405.20099 [cs.CR] https://arxiv.org/abs/2405.20099
- Yang et al. (2023) Xianjun Yang, Xiao Wang, Qi Zhang, Linda Petzold, William Yang Wang, Xun Zhao, and Dahua Lin. 2023. Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models. arXiv:2310.02949 [cs.CL] https://arxiv.org/abs/2310.02949
- Yao et al. (2023) Hongwei Yao, Jian Lou, and Zhan Qin. 2023. PoisonPrompt: Backdoor Attack on Prompt-based Large Language Models. arXiv:2310.12439 [cs.CL] https://arxiv.org/abs/2310.12439
- Young (2026) Richard J. Young. 2026. Comparative Analysis of LLM Abliteration Methods: A Cross-Architecture Evaluation. arXiv:2512.13655 [cs.CL] https://arxiv.org/abs/2512.13655
- Yu et al. (2023) Jiahao Yu, Xingwei Lin, Zheng Yu, and Xinyu Xing. 2023. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253 (2023).
- Zeng et al. (2024) Yifan Zeng, Yiran Wu, Xiao Zhang, Huazheng Wang, and Qingyun Wu. 2024. AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks. arXiv:2403.04783 [cs.LG] https://arxiv.org/abs/2403.04783
- Zhang et al. (2026) Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Guoyin Wang, et al. 2026. Instruction tuning for large language models: A survey. Comput. Surveys 58, 7 (2026), 1–36.
- Zhang et al. (2024) Tianrong Zhang, Bochuan Cao, Yuanpu Cao, Lu Lin, Prasenjit Mitra, and Jinghui Chen. 2024. WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response. arXiv:2405.14023 [cs.LG] https://arxiv.org/abs/2405.14023
- Zhao et al. (2019) Zhenjie Zhao, Andrew Cattle, Evangelos Papalexakis, and Xiaojuan Ma. 2019. Embedding Lexical Features via Tensor Decomposition for Small Sample Humor Recognition. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (Eds.). Association for Computational Linguistics, Hong Kong, China, 6376–6381. doi: 10.18653/v1/D19-1669
- Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, et al. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685 (2023). https://arxiv.org/abs/2306.05685
- Zou et al. (2023) Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. 2023. Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv:2307.15043 [cs.CL] https://arxiv.org/abs/2307.15043