2505.04806v2

Model: gemma-3-27b-it-free

# Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs **Authors**: Chetan Pathade > Independent ResearcherSan Jose, CA, USAcup@alumni.cmu.edu Abstract Large Language Models (LLMs) are increasingly integrated into consumer and enterprise applications. Despite their capabilities, they remain susceptible to adversarial attacks such as prompt injection and jailbreaks that override alignment safeguards. This paper provides a systematic investigation of jailbreak strategies against various state-of-the-art LLMs. We categorize over 1,400 adversarial prompts, analyze their success against GPT-4, Claude 2, Mistral 7B, and Vicuna, and examine their generalizability and construction logic. We further propose layered mitigation strategies and recommend a hybrid red-teaming and sandboxing approach for robust LLM security. Index Terms: Large Language Models, Prompt Injection, Jailbreak, Adversarial Prompts, AI Security, Red Teaming, LLM Safety I Introduction The field of artificial intelligence has experienced a paradigm shift with the emergence of large language models (LLMs). These systems have transitioned from research prototypes to core components of production-grade systems, shaping industries from finance and law to healthcare and entertainment. LLMs are praised for their fluency, contextual reasoning, and ability to generate human-like responses. However, these capabilities also expose them to a new class of security threats. As LLMs are increasingly used in decision-making systems, chatbots, content moderation tools, and virtual agents, the potential for abuse through adversarial inputs grows exponentially. Large Language Models (LLMs) have fundamentally transformed the landscape of natural language processing, enabling applications in content generation, customer service, coding assistance, legal analysis, and more. With models like OpenAI’s GPT-4 [17], Anthropic’s Claude 2 [18], Meta’s LLaMA [50], and open-source offerings such as Vicuna [51] and Mistral 7B [52], LLMs now influence millions of users globally. However, this ubiquity introduces significant security concerns, particularly surrounding adversarial prompt engineering techniques that manipulate model behavior. These techniques, often referred to as prompt injection or jailbreaks, are capable of bypassing built-in safety filters and elicit outputs that violate platform policies, such as generating hate speech, misinformation, or malicious code [1] [2] [4]. Prompt injection represents a new class of vulnerabilities unique to LLMs. Unlike traditional software vulnerabilities rooted in memory safety or access control flaws, prompt injection leverages the interpretive nature of natural language inputs [5] [6]. This paper explores the mechanisms and success of prompt injection across a range of LLMs, documenting the systemic weaknesses that attackers exploit [2] [7] [8]. Contributions of this work include: - A comprehensive taxonomy of jailbreak prompts categorized by attack vector [3] [6] - Empirical evaluation of prompt effectiveness across closed and open-source LLMs [4] [10] [11] - Scenario-specific attack success analysis in domains such as law, politics, and security [1] [9] [13] - Discussion of community-based jailbreak dissemination and its parallels to exploit markets [14] [15] [27] - Detailed recommendations for mitigating prompt injection vulnerabilities [7] [12] [16] II Background II-A Overview of Large Language Models Large Language Models operate using billions of parameters and are trained on diverse datasets encompassing text from books, articles, websites, and code. Notable models such as GPT-4, Claude 2, and Mistral 7B build on earlier architectures but have significantly improved reasoning, factual recall, and stylistic flexibility. The ability of these models to learn from few examples, a phenomenon called in-context learning, contributes to their versatility but also to their vulnerability. When exposed to crafted prompts, these models can be misled into misaligned behavior. LLMs are built on the transformer architecture introduced by Vaswani et al. in 2017 [47]. Recent advancements include autoregressive pretraining on massive corpora followed by supervised finetuning and alignment through Reinforcement Learning from Human Feedback (RLHF) [34]. II-B Alignment and Safety Mechanisms To prevent harmful output, LLMs rely on several safety mechanisms, including instruction tuning [28], reinforcement from rejection sampling (RLAIF), pre- and post-output moderation filters [16], and system prompts embedding safety guidelines [12] [33]. II-C Prompt Injection Explained Prompt injection is the LLM analogue to command injection in traditional computing [1] [3] [6]. Common attack vectors include role-based conditioning, instruction hijacking, obfuscated encoding, and multi-turn manipulation [2] [5] [7]. Studies have shown that these attacks are reproducible, transferable, and can circumvent various filtering methods [3] [10] [14]. II-D Related Work Zhang et al. introduced a foundational taxonomy categorizing prompt injections [3]. Shen et al. aggregated 1,405 jailbreak prompts across 131 forums, revealing a 95% success rate in some cases [48]. Ding et al. developed ReNeLLM, which improved jailbreak performance by 40% [49]. Anthropic’s many-shot prompt conditioning decreased attack success rates significantly [18]. OWASP’s Top 10 identified prompt injection as the most critical vulnerability [16]. Other notable contributions include: - Liu et al. on empirical jailbreak strategies [2] - Yi et al. on indirect prompt injection detection [4] - Suo et al. on defense techniques derived from attack insights [9] - Chen et al. on preference-aligned defenses [11] - William on bypass detection and Zhao et al. unified defenses [15] [13] - Apurv et al. on threat modeling for red-teaming LLMs [53] III Methodology This section outlines our approach to measuring LLM vulnerabilities with an emphasis on reproducibility and diversity. Our methodology integrates qualitative red teaming insights with quantitative metrics collected through structured prompt testing. All experiments are governed by ethical red-teaming principles. In addition to evaluating raw performance, we also tracked behavioral consistency and model self-awareness to adversarial stimuli. This dual-pronged framework allows us to detect subtle failure patterns beyond binary success metrics. This section builds on emerging adversarial benchmarks such as JailbreakBench and RedBench, while drawing defense insights from frameworks like PromptShield [21] and Palisade [10]. We employed Sentence-BERT embeddings [46], GPT-based moderation strategies [17], and adversarial annotation heuristics from prior poisoning literature [22] to validate prompt behavior and misalignment tendencies. III-A Dataset Construction We curated a dataset of 1,400+ adversarial prompts from: - Public jailbreak repositories (e.g., GitHub, JailbreakChat) - LLM exploit forums on Reddit and Discord - Prior academic corpora, including JailbreakBench [3] and PromptBench [6] Each prompt was manually validated, categorized into attack types (roleplay, logic traps, encoding, multi-turn), and annotated for content sensitivity (e.g., political, legal, explicit). III-B Target Models We tested prompts on four models: - GPT-4 (OpenAI, March 2024 snapshot) - Claude 2 (Anthropic, July 2023 API version) - Mistral 7B (open-weight model via Hugging Face Inference) - Vicuna-13B (via local HF inference server) Model versions were frozen to ensure reproducibility. All inference was performed using controlled prompts, with the system context initialized per the model’s recommended safety guidelines. III-C Evaluation Metrics We used the following primary metrics: - Attack Success Rate (ASR): Whether the model produced an output violating its intended guardrails - Prompt Generalizability: How often a prompt successful on one model succeeded on another - Time-to-Bypass: Average minutes taken to successfully induce misaligned behavior - Failure Mode Classification: Taxonomy of observed response behaviors (e.g., partial refusals, misleading responses) III-D Automation Pipeline We developed a semi-automated red-teaming script using the LangChain framework. Prompts were injected via API calls (OpenAI, Claude) and local model inference (Mistral, Vicuna). Output was scored using a hybrid method: - Keyword spotting (for trigger words) - GPT-based meta-evaluation of harmfulness [17] - Sentence-BERT semantic distance from refusal templates [46] III-E Defense Framework Evaluation To simulate real-world defenses, we layered external filtering strategies on outputs: - PromptShield ruleset [21] - Palisade detection framework [10] - Signed-Prompt verification logic [41] We then retested a subset of successful jailbreaks against these defenses to estimate defense coverage and bypass rate. IV Results The evaluation of prompt injection efficacy was conducted using a rigorous experimental design. The figures, included in the appendix or digital supplement, visually illustrate comparative vulnerability trends. For instance, Figure 1 shows that GPT-4 exhibited the highest attack success rate. These results highlight not only model-specific weaknesses but also the effectiveness of specific prompt engineering tactics. IV-A Model Susceptibility Analysis Among the tested models, GPT-4 demonstrated the highest vulnerability with an ASR of 87.2%, confirming its powerful but permissive instruction-following nature. While Claude 2 performed slightly better in filtering, it still succumbed to 82.5% of attacks. Open models such as Mistral 7B (71.3%) and Vicuna (69.4%) revealed significant weaknesses, likely due to the absence of robust fine-tuned safety layers. Interestingly, GPT-4 and Claude 2 shared structural similarities in moderation behavior-exhibiting soft refusals before ultimately yielding to adversarial logic, especially in legal, creative, or conditional prompts. These nuances suggest that model scale and alignment tuning complexity both contribute to attack surface depth. In terms of generalizability, jailbreak prompts that succeeded on GPT-4 transferred effectively to Claude 2 and Vicuna in 64.1% and 59.7% of cases respectively. Average time to generate a successful jailbreak was under 17 minutes for GPT-4, while Mistral required approximately 21.7 minutes on average. Our experiments evaluated over 1,400 adversarial prompts across four LLMs: GPT-4, Claude 2, Mistral 7B, and Vicuna. We analyze results along several dimensions, including model susceptibility, attack technique efficacy, prompt behavior patterns, and cross-model generalization. | GPT-4 Claude 2 Mistral 7B | 87.2 82.5 71.3 | 64.1 59.7 52.4 | 16.2 17.4 21.7 | | --- | --- | --- | --- | | Vicuna | 69.4 | 50.6 | 20.9 | TABLE I: Model-wise Evaluation Metrics <details> <summary>extracted/6434027/fig_time_to_bypass.png Details</summary> ![61068a52](/v1/image/61068a521266ff748ddc9a9131bcc7a8557d9f0060c31926b5f70375191cf6ec) ### Visual Description \n ## Bar Charts: Model Security Evaluation ### Overview The image presents three stacked bar charts comparing the security performance of four Large Language Models (LLMs): GPT-4, Claude 2, Mistral 7B, and Vicuna. The charts evaluate the models across three metrics: Attack Success Rate (ASR), Prompt Generalizability, and Average Time-to-Bypass. Each chart uses the same x-axis representing the models. ### Components/Axes * **X-axis (all charts):** "Model" with categories: GPT-4, Claude 2, Mistral 7B, Vicuna. * **Chart 1: Attack Success Rate by Model** * **Y-axis:** "ASR (%)" ranging from 0 to 80. * **Bars:** Red bars representing the Attack Success Rate for each model. * **Chart 2: Prompt Generalizability by Model** * **Y-axis:** "Generalizability (%)" ranging from 0 to 60. * **Bars:** Light blue bars representing the Prompt Generalizability for each model. * **Chart 3: Average Time-to-Bypass by Model** * **Y-axis:** "Time (min)" ranging from 0 to 20. * **Bars:** Green bars representing the Average Time-to-Bypass for each model. ### Detailed Analysis or Content Details **Chart 1: Attack Success Rate by Model** * GPT-4: Approximately 75% ASR. * Claude 2: Approximately 65% ASR. * Mistral 7B: Approximately 70% ASR. * Vicuna: Approximately 60% ASR. The bars show that GPT-4 has the highest ASR, followed by Mistral 7B, Claude 2, and then Vicuna. **Chart 2: Prompt Generalizability by Model** * GPT-4: Approximately 50% Generalizability. * Claude 2: Approximately 30% Generalizability. * Mistral 7B: Approximately 55% Generalizability. * Vicuna: Approximately 50% Generalizability. Mistral 7B exhibits the highest prompt generalizability, followed by GPT-4 and Vicuna, with Claude 2 showing the lowest. **Chart 3: Average Time-to-Bypass by Model** * GPT-4: Approximately 13 minutes. * Claude 2: Approximately 8 minutes. * Mistral 7B: Approximately 16 minutes. * Vicuna: Approximately 18 minutes. Vicuna has the longest average time-to-bypass, followed by Mistral 7B, GPT-4, and then Claude 2. ### Key Observations * GPT-4 is most vulnerable to attacks (highest ASR) but has relatively good prompt generalizability and a moderate time-to-bypass. * Claude 2 is the most resistant to attacks (lowest ASR) but has the lowest prompt generalizability and a short time-to-bypass. * Mistral 7B has high ASR and prompt generalizability, but a long time-to-bypass. * Vicuna has moderate ASR and prompt generalizability, with the longest time-to-bypass. ### Interpretation The data suggests a trade-off between security and usability/generalizability in these LLMs. Models that are more resistant to attacks (like Claude 2) may be less flexible and adaptable to different prompts. Conversely, models that are more generalizable (like Mistral 7B) may be more susceptible to attacks. GPT-4 and Vicuna represent intermediate positions on this spectrum. The Attack Success Rate (ASR) indicates how easily an attacker can successfully prompt the model to generate undesirable outputs. Prompt Generalizability measures the model's ability to handle a variety of prompts without significant performance degradation. Time-to-Bypass represents the effort required for an attacker to circumvent the model's security measures. The differences in Time-to-Bypass could be due to varying levels of security mechanisms implemented in each model. The higher ASR of GPT-4 and Mistral 7B could be attributed to their more open and flexible architectures, which may inadvertently create vulnerabilities. Claude 2's lower ASR might be a result of more restrictive safety filters, but at the cost of generalizability. Vicuna's long time-to-bypass suggests robust security measures, but its moderate ASR indicates that these measures are not foolproof. </details> Figure 1: Model-wise Evaluation Metrics IV-B Attack Category Performance Prompt injections exploiting roleplay dynamics (e.g., impersonation of fictional characters or hypothetical scenarios) achieved the highest ASR (89.6%). These prompts often bypass filters by deflecting responsibility away from the model (e.g., “as an AI in a movie script…”). Logic trap attacks (ASR: 81.4%) exploit conditional structures and moral dilemmas to elicit disallowed content. Encoding tricks (e.g., base64 or zero-width characters) achieved 76.2% ASR by evading keyword-based filtering mechanisms. While multi-turn dialogues yielded slightly lower effectiveness (68.7%), they often succeeded in long-form tasks where context buildup gradually weakened safety enforcement. <details> <summary>extracted/6434027/fig_attack_categories.png Details</summary> ![7eaf26e2](/v1/image/7eaf26e2e46389b13008037e980663a24f513b2c0e9c617d0f386f59766dc662) ### Visual Description \n ## Horizontal Bar Chart: Attack Category Effectiveness ### Overview The image presents a horizontal bar chart illustrating the effectiveness of different attack categories, measured by their success rate. The chart compares five attack categories: Roleplay, Logic Traps, Encoding Tricks, Multi-turn, and Framing. The x-axis represents the Attack Success Rate in percentage, ranging from 0 to 80. The y-axis lists the attack categories. ### Components/Axes * **Title:** "Attack Category Effectiveness" (centered at the top) * **X-axis Label:** "Attack Success Rate (%)" (bottom-center) * **Y-axis Label:** "Category" (left-center) * **Categories:** Roleplay, Logic Traps, Encoding Tricks, Multi-turn, Framing * **Color Scheme:** A gradient of reddish-purple hues, with lighter shades representing lower success rates and darker shades representing higher success rates. ### Detailed Analysis The bars are arranged vertically, with Roleplay at the top and Framing at the bottom. * **Roleplay:** The bar for Roleplay extends to approximately 78% on the x-axis. The color is a light reddish-orange. * **Logic Traps:** The bar for Logic Traps extends to approximately 74% on the x-axis. The color is a medium reddish-orange. * **Encoding Tricks:** The bar for Encoding Tricks extends to approximately 70% on the x-axis. The color is a medium reddish-purple. * **Multi-turn:** The bar for Multi-turn extends to approximately 64% on the x-axis. The color is a darker reddish-purple. * **Framing:** The bar for Framing extends to approximately 60% on the x-axis. The color is the darkest reddish-purple. The bars generally decrease in length as you move down the y-axis, indicating a decreasing trend in attack success rate. ### Key Observations * Roleplay has the highest attack success rate, significantly higher than the other categories. * Framing has the lowest attack success rate. * The success rates are relatively close for Roleplay, Logic Traps, and Encoding Tricks. * There is a noticeable drop in success rate between Encoding Tricks and Multi-turn. ### Interpretation The chart suggests that Roleplay is the most effective attack category, while Framing is the least effective. This could indicate that Roleplay is easier to execute successfully, or that defenses against Framing are more robust. The relatively high success rates of Roleplay, Logic Traps, and Encoding Tricks suggest these are viable attack strategies. The lower success rates of Multi-turn and Framing may indicate that these attacks are more difficult to pull off, or that targets are more aware of them. The data implies a hierarchy of attack effectiveness, with Roleplay being the most potent and Framing the weakest. This information could be valuable for security professionals in prioritizing defenses or for attackers in selecting the most promising attack vectors. The visual trend of decreasing success rates as you move down the chart is clear and supports the quantitative data. The color gradient effectively reinforces this trend, making it easy to quickly identify the most and least effective attack categories. </details> Figure 2: Attack Category Effectiveness IV-C Scenario-Specific Vulnerabilities Targeted domains revealed non-uniform vulnerabilities: - Political content: Prompts involving campaign advice or fake lobbying succeeded 85.5% of the time. - Legal content: Prompts framed as courtroom hypotheticals or legal simulations yielded 79.4% ASR. - Explicit content: Erotic roleplay prompts were especially effective in jailbreak forums, with a 76.1% success rate. - Malicious code: Although many models blocked direct malware requests, evasion through obfuscation or “educational context” resulted in 58.3% success, especially on Vicuna and Mistral. <details> <summary>extracted/6434027/fig_scenario_success_rates.png Details</summary> ![593b00e6](/v1/image/593b00e6d5dea81d76ca7aa63b563dcae614af6ad5f3566de688752c77436ee0) ### Visual Description \n ## Bar Chart: Scenario-specific Success Rates ### Overview The image presents a horizontal bar chart illustrating the attack success rates for four different scenarios: Political Lobbying, Legal Opinion, Pornographic Content, and Malware Generation. The chart uses color to differentiate the scenarios, with the x-axis representing the attack success rate in percentage and the y-axis listing the scenarios. ### Components/Axes * **Title:** "Scenario-specific Success Rates" (positioned at the top-center) * **Y-axis Label:** "Scenario" (positioned on the left side) * **X-axis Label:** "Attack Success Rate (%)" (positioned at the bottom) * **X-axis Scale:** Ranges from 0 to 80, with increments of 10. * **Scenarios (Y-axis Categories):** * Political Lobbying * Legal Opinion * Pornographic Content * Malware Generation * **Colors:** * Political Lobbying: Dark Purple (#6A3A70) * Legal Opinion: Medium Purple (#B266B2) * Pornographic Content: Light Red (#E67E22) * Malware Generation: Pale Orange (#D3CE00) ### Detailed Analysis * **Political Lobbying:** The bar extends to approximately 82% on the x-axis. * **Legal Opinion:** The bar extends to approximately 75% on the x-axis. * **Pornographic Content:** The bar extends to approximately 68% on the x-axis. * **Malware Generation:** The bar extends to approximately 60% on the x-axis. ### Key Observations The chart shows a clear ranking of success rates across the scenarios. Political Lobbying has the highest success rate, followed by Legal Opinion, Pornographic Content, and Malware Generation. The difference in success rates between Political Lobbying and the other scenarios is substantial. ### Interpretation The data suggests that attacks leveraging the context of "Political Lobbying" are significantly more successful than those related to other scenarios. This could be due to a variety of factors, such as the sensitivity of the information involved, the potential for high rewards, or the relative lack of security measures in place. The lower success rate for "Malware Generation" might indicate increased security awareness and defenses against such attacks. The chart highlights the importance of tailoring security measures to the specific context of potential attacks. The data could be used to prioritize security investments and training efforts, focusing on the scenarios with the highest success rates. It is important to note that this chart only shows success *rates*, not the *frequency* of attacks. A scenario with a lower success rate but a higher attack frequency could still pose a significant threat. </details> Figure 3: Scenario-specific Success Rates IV-D Prompt Transferability The Prompt Transferability Matrix reveals the high portability of successful prompts. GPT-4-derived prompts transferred with 64.1% success to Claude 2 and over 50% to Mistral and Vicuna. This finding underscores the systemic nature of these vulnerabilities across architectures. Notably, Claude 2 showed higher resistance to Vicuna-origin prompts, indicating some directional asymmetry in generalization. This is likely due to the more fine-grained safety alignment mechanisms employed in commercial models. <details> <summary>extracted/6434027/fig_prompt_transferability.png Details</summary> ![25c41503](/v1/image/25c41503ba004a565746962518a35fdb9eb8257af9a2381cf71b509b0a6bd2f8) ### Visual Description ## Heatmap: Prompt Transferability Matrix ### Overview This image presents a heatmap visualizing the prompt transferability between four Large Language Models (LLMs): GPT-4, Claude 2, Mistral 7B, and Vicuna. The color intensity represents the transferability percentage, with darker blues indicating lower transferability and brighter reds indicating higher transferability. The matrix shows the performance of prompts designed for one model when applied to another. ### Components/Axes * **Title:** "Prompt Transferability Matrix" (centered at the top) * **X-axis:** LLMs - GPT-4, Claude 2, Mistral 7B, Vicuna (listed horizontally) * **Y-axis:** LLMs - GPT-4, Claude 2, Mistral 7B, Vicuna (listed vertically) * **Color Scale/Legend:** Located on the right side of the heatmap, ranging from 0% (dark blue) to 60% (bright red). The scale is labeled "Transferability (%)". * **Cells:** Each cell represents the transferability percentage between the corresponding LLM on the X and Y axes. ### Detailed Analysis The heatmap is a 4x4 matrix. The values within each cell are as follows: * **GPT-4:** * GPT-4 to GPT-4: 0.0% (dark blue) * GPT-4 to Claude 2: 61.9% (bright red) * GPT-4 to Mistral 7B: 49.7% (orange-red) * GPT-4 to Vicuna: 48.3% (orange-red) * **Claude 2:** * Claude 2 to GPT-4: 64.1% (bright red) * Claude 2 to Claude 2: 0.0% (dark blue) * Claude 2 to Mistral 7B: 47.2% (orange-red) * Claude 2 to Vicuna: 46.1% (orange-red) * **Mistral 7B:** * Mistral 7B to GPT-4: 54.2% (orange-red) * Mistral 7B to Claude 2: 50.3% (orange-red) * Mistral 7B to Mistral 7B: 0.0% (dark blue) * Mistral 7B to Vicuna: 45.7% (orange-red) * **Vicuna:** * Vicuna to GPT-4: 52.7% (orange-red) * Vicuna to Claude 2: 48.6% (orange-red) * Vicuna to Mistral 7B: 46.5% (orange-red) * Vicuna to Vicuna: 0.0% (dark blue) The diagonal cells (GPT-4 to GPT-4, Claude 2 to Claude 2, etc.) all have a transferability of 0.0%. This indicates that prompts optimized for a specific model perform poorly when used with the same model, which is counterintuitive. ### Key Observations * **Highest Transferability:** The highest transferability scores are between GPT-4 and Claude 2 (61.9% and 64.1%), and between Claude 2 and GPT-4. * **Lowest Transferability:** The lowest transferability scores are along the diagonal, indicating poor self-transferability. * **Symmetry:** The matrix is approximately symmetric, meaning the transferability from Model A to Model B is roughly the same as from Model B to Model A. * **Mistral 7B and Vicuna:** These models show relatively consistent transferability scores to other models, generally in the 45-55% range. ### Interpretation The data suggests that prompts designed for one LLM do not necessarily translate well to other LLMs, even those with similar architectures. The 0% self-transferability is a particularly striking finding, implying that prompts highly tuned for a specific model may actually *degrade* performance when re-applied to the same model. This could be due to subtle differences in the models' internal representations or training data. The high transferability between GPT-4 and Claude 2 suggests these models share some common underlying capabilities or prompt understanding mechanisms. The lower transferability scores for Mistral 7B and Vicuna may indicate that these models have more distinct prompt sensitivities or require different prompting strategies. The approximate symmetry of the matrix suggests a degree of reciprocity in prompt transferability. If a prompt works well when transferred from Model A to Model B, it's likely to work reasonably well in the reverse direction. This data is valuable for understanding the limitations of prompt engineering and the importance of tailoring prompts to specific LLMs. It also highlights the need for research into more robust and transferable prompting techniques. </details> Figure 4: Prompt Transferability Matrix IV-E Failure Modes and Detection Gaps We observed five dominant failure patterns: - Partial Refusals (34%): Prompts initially triggered refusal but continued to output harmful content mid-response. - Hidden Compliance (22%): The model appeared to refuse but provided veiled or coded information. - No Output (18%): Complete refusal, often due to prompt being too direct or malformed. - Misleading Responses (15%): Factually incorrect or evasive answers. This taxonomy provides a baseline for future behavioral alignment benchmarks. | Partial Refusal | 34% | Hypotheticals, satire | | --- | --- | --- | | Hidden Compliance | 22% | Roleplay and analogy | | No Output | 18% | Base64, multi-turn traps | | Misleading Response | 15% | Legal/political scenarios | TABLE II: Prompt Failure Modes IV-F Prompt Length and Obfuscation Success rates were highest for prompts in the 101–150 token range (80.3%), suggesting a sweet spot for encoding deception while maintaining clarity. Prompts exceeding 150 tokens saw a slight dip in success-likely due to verbosity or token truncation. Encoded or obfuscated prompts, such as those using zero-width spaces, emojis, or alternate encodings, had lower detection rates (21.3%) but retained strong ASR (76.2%). These findings emphasize the need for semantic-level input sanitization. <details> <summary>extracted/6434027/fig_prompt_length_vs_asr.png Details</summary> ![f3388912](/v1/image/f3388912ab25379b1b2d95798023eb44d0a3c29d2869954eac310e1dfae8872c) ### Visual Description \n ## Bar Chart: Prompt Length vs. Average Success Rate ### Overview This bar chart illustrates the relationship between prompt length (measured in token count) and the average success rate (ASR) achieved. The chart displays four bars, each representing a different range of prompt token counts. The height of each bar corresponds to the average success rate for prompts within that token count range. ### Components/Axes * **Title:** "Prompt Length vs. Average Success Rate" - positioned at the top-center of the chart. * **X-axis:** "Prompt Token Count" - displays four categories: "<50", "51-100", "101-150", and ">150". * **Y-axis:** "ASR (%)" - represents the Average Success Rate, with a scale ranging from 0 to 80, incrementing by 10. * **Bars:** Four bars representing the ASR for each prompt token count range. The bars are colored in shades of green and blue, transitioning from lighter to darker as the token count increases. ### Detailed Analysis * **<50 Tokens:** The bar for prompts with less than 50 tokens has a height of approximately 68%. The bar is a light green color. * **51-100 Tokens:** The bar for prompts with 51-100 tokens has a height of approximately 74%. The bar is a medium green color. * **101-150 Tokens:** The bar for prompts with 101-150 tokens has a height of approximately 79%. The bar is a light blue color. * **>150 Tokens:** The bar for prompts with more than 150 tokens has a height of approximately 76%. The bar is a dark blue color. The bars generally increase in height from "<50" to "101-150", then slightly decrease for ">150". ### Key Observations * The highest average success rate is observed for prompts with 101-150 tokens (approximately 79%). * The lowest average success rate is observed for prompts with less than 50 tokens (approximately 68%). * There is a slight decrease in average success rate for prompts exceeding 150 tokens, compared to the 101-150 token range. ### Interpretation The data suggests that there is an optimal prompt length for maximizing success rate. Prompts within the 101-150 token range appear to perform best. Shorter prompts (<50 tokens) may lack sufficient detail or context, leading to lower success rates. While longer prompts (>150 tokens) still achieve a relatively high success rate, the slight decrease suggests that excessive length may introduce noise or redundancy, potentially hindering performance. The relationship is not strictly linear; it appears to have a peak around the 101-150 token range. This could indicate that the model benefits from a certain level of detail and instruction, but beyond that point, additional information does not necessarily translate to improved results. Further investigation might explore the impact of prompt content and structure within these token ranges to refine the optimal prompt length strategy. </details> Figure 5: Prompt Length vs. Average Success Rate <details> <summary>extracted/6434027/fig_stealth_vs_detection.png Details</summary> ![b35b0826](/v1/image/b35b082611d518573cef9935cb667c811eaebed9a8d5b247786503e4c042f46f) ### Visual Description \n ## Chart: Stealthiness vs. Detection Likelihood and ASR ### Overview This chart visualizes the relationship between stealth class, detection rate, and attack success rate (ASR). It uses a bar chart to represent the detection rate for each stealth class, and a line graph overlaid on the bar chart to represent the ASR. The x-axis represents the stealth class, categorized as High, Medium, and Low. The primary y-axis (left) represents the detection rate in percentage, while the secondary y-axis (right) represents the attack success rate, also in percentage. ### Components/Axes * **Title:** Stealthiness vs. Detection Likelihood and ASR * **X-axis:** Stealth Class (High (encoded/meta), Medium (roleplay), Low (direct)) * **Primary Y-axis (left):** Detection Rate (%) - Scale from 0 to 70, increments of 10. * **Secondary Y-axis (right):** Attack Success Rate (%) - Scale from 60 to 90, increments of 5. * **Bar Chart:** Represents Detection Rate for each Stealth Class. Bars are colored in a shade of red. * **Line Graph:** Represents Attack Success Rate for each Stealth Class. The line is colored blue. * **Legend:** No explicit legend is present, but the axes labels clearly indicate what each visual element represents. ### Detailed Analysis The chart presents three stealth classes: High, Medium, and Low. * **High Stealth Class:** * Detection Rate: Approximately 25%. * Attack Success Rate: Approximately 60%. * **Medium Stealth Class:** * Detection Rate: Approximately 40%. * Attack Success Rate: Approximately 73%. * **Low Stealth Class:** * Detection Rate: Approximately 65%. * Attack Success Rate: Approximately 87%. **Trend Analysis:** * **Detection Rate:** The detection rate increases as the stealth class decreases (from High to Low). The bars show a clear upward trend. * **Attack Success Rate:** The attack success rate also increases as the stealth class decreases (from High to Low). The blue line slopes upward. ### Key Observations * There is a strong inverse relationship between stealthiness and detection rate. Higher stealthiness correlates with lower detection rates. * There is a strong inverse relationship between stealthiness and attack success rate. Higher stealthiness correlates with lower attack success rates. * The highest attack success rate is achieved with the lowest stealth class, but this comes at the cost of a significantly higher detection rate. * The medium stealth class offers a balance between detection rate and attack success rate. ### Interpretation The data suggests a trade-off between stealth and effectiveness. While higher stealthiness reduces the likelihood of detection, it also reduces the potential for a successful attack. Conversely, lower stealthiness increases the risk of detection but also increases the chances of a successful attack. The chart demonstrates a clear pattern: as the complexity of stealth techniques decreases (moving from "encoded/meta" to "direct"), the detection rate increases, but so does the attack success rate. This could be due to several factors, such as the increased visibility of less sophisticated attacks or the greater ease with which they can be executed. The "Medium (roleplay)" stealth class appears to be a sweet spot, offering a reasonable balance between detection risk and attack success. This suggests that a moderate level of stealth may be the most effective strategy in many scenarios. The chart provides valuable insights for security professionals and attackers alike. It highlights the importance of understanding the trade-offs involved in stealth and attack strategies and tailoring them to the specific context. </details> Figure 6: Stealthiness vs. Detection Likelihood and ASR V Discussion The implications of these findings extend beyond prompt injection. They expose the fragility of current safety alignment mechanisms under realistic threat conditions. Our work reinforces the need for adversarial testing as a continuous validation tool in LLM deployment pipelines. Prompt injection represents not only a technical challenge but a policy and governance issue as well. Failure to address these risks may erode trust in AI applications and hinder broader societal adoption. These results validate the claim that current LLM safety mechanisms are insufficiently robust against prompt injection, especially indirect or obfuscated attacks [1] [4] [5]. The findings reinforce prior studies that describe alignment filters as semantically shallow and largely reliant on static refusal templates [11] [16]. The ease with which these attacks transferred across models points to shared architectural vulnerabilities or training data biases [3] [6] [18]. Moreover, roleplay and scenario-based prompts exploit not only the model’s capacity for creativity but its inability to judge moral context effectively [2] [9] [14]. Online communities (e.g., Reddit, GitHub, Discord) operate as informal exploit databases, with prompt variations evolving similarly to malware strains in the cybersecurity domain. We echo ethical concerns raised in recent works regarding open publication of jailbreaks [19] [20] and advocate for controlled disclosures and bug bounty mechanisms tailored for LLM developers [15] [24]. VI Mitigation Strategies Effective defenses against prompt injection must evolve alongside adversarial creativity. Static filtering or keyword-based systems offer only limited protection. Our defense recommendations blend technical solutions with operational safeguards. We emphasize the importance of feedback loops between model developers and red teams, and call for public benchmarks that simulate dynamic adversarial scenarios. Defenses should be evaluated under live attack settings, with evolving attacker strategies embedded in the test suite. Our mitigation strategy draws inspiration from work such as PromptShield [21], Palisade [10], and UniGuardian [13]. We recommend: - System prompt hardening using context anchoring [14] - Behavior-based anomaly detection during multi-turn dialogues [8] [11] [18] - Input sanitization via Signed-Prompt techniques [41] - Embedding adversarial decoys and rejection-conditioned training [8] [12] [42] - Session-level analytics to detect evasion attempts [11] These methods, when applied in combination, form a layered defense that significantly reduces the likelihood of successful jailbreak attempts while preserving usability. VII Limitations and Future Work Despite its comprehensive scope, this study remains limited by the availability of open model weights and API constraints. Prompt injection tactics may evolve in ways not covered in our dataset. Additionally, prompt interpretation may differ across cultures and languages, warranting multilingual and socio-contextual extensions. Future work will aim to create shared evaluation platforms, akin to CVE databases, where new prompt exploits and defense bypasses can be collaboratively tracked and neutralized. While our findings are grounded in robust experimentation, limitations remain. Our evaluation used static model checkpoints and may not account for updates or real-time moderation layers applied in production APIs [3] [17]. Additionally, cultural and linguistic diversity in prompts remains underrepresented [25] [35]. We recommend future work focus on multilingual adversarial prompt corpora [25], evaluation of plug-in-enabled LLMs [44], and adversarial training using open red-teaming platforms [21] [8] [13]. The development of explainable safety filters and real-time flagging systems could greatly aid in closing the alignment gap [11] [18] [41]. VIII Conclusion This research affirms the growing consensus that prompt injection is not an edge-case anomaly but a fundamental issue in current-generation LLMs. The findings not only highlight the technical inadequacies of present alignment systems but also illuminate the adversarial creativity of the prompt engineering community. Addressing these challenges will require collaborative frameworks that blend secure NLP research, adversarial testing, and governance. We envision a future where LLMs are audited as rigorously as software, with red teaming treated as a core development practice. Prompt injection remains an open frontier in LLM safety. Through comprehensive evaluation and a synthesis of recent research, we provide compelling evidence that jailbreak techniques are both transferable and evolving. Our work aligns with the concerns raised in recent surveys [1] [2] [13] [31], and supports the call for stronger multi-layered defenses, proactive red teaming, and coordinated disclosure practices in AI development [19] [23] [24]. References - [1] Yi Liu and Gelei Deng and Yuekang Li and Kailong Wang and Zihao Wang and Xiaofeng Wang and Tianwei Zhang and Yepang Liu and Haoyu Wang and Yan Zheng and Yang Liu. ”Prompt Injection attack against LLM-integrated Applications.” arXiv:2306.05499 - [2] Yi Liu and Gelei Deng and Zhengzi Xu and Yuekang Li and Yaowen Zheng and Ying Zhang and Lida Zhao and Tianwei Zhang and Kailong Wang and Yang Liu. ”Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study.” arXiv:2305.13860 - [3] Xiaogeng Liu and Zhiyuan Yu and Yizhe Zhang and Ning Zhang and Chaowei Xiao. ”Automatic and Universal Prompt Injection Attacks against Large Language Models.” arXiv:2403.04957 - [4] Yi, Jingwei and Xie, Yueqi and Zhu, Bin and Kiciman, Emre and Sun, Guangzhong and Xie, Xing and Wu, Fangzhao. ”Benchmarking and Defending against Indirect Prompt Injection Attacks on Large Language Models.” arXiv:2312.14197 - [5] Sippo Rossi and Alisia Marianne Michel and Raghava Rao Mukkamala and Jason Bennett Thatcher. ”An Early Categorization of Prompt Injection Attacks on Large Language Models.” arXiv:2402.00898 - [6] Yupei Liu and Yuqi Jia and Runpeng Geng and Jinyuan Jia and Neil Zhenqiang Gong. ”Formalizing and Benchmarking Prompt Injection Attacks and Defenses.” 33rd USENIX Security Symposium (USENIX Security 24) - [7] Md Abdur Rahman and Fan Wu and Alfredo Cuzzocrea and Sheikh Iqbal Ahamed. ”Fine-tuned Large Language Models (LLMs): Improved Prompt Injection Attacks Detection.” arXiv:2410.21337 - [8] Kuo-Han Hung and Ching-Yun Ko and Ambrish Rawat and I-Hsin Chung and Winston H. Hsu and Pin-Yu Chen. ”Attention Tracker: Detecting Prompt Injection Attacks in LLMs.” arXiv:2411.00348 - [9] Yulin Chen and Haoran Li and Zihao Zheng and Yangqiu Song and Dekai Wu and Bryan Hooi. ”Defense Against Prompt Injection Attack by Leveraging Attack Techniques.” arXiv:2411.00459 - [10] Sahasra Kokkula and Somanathan R and Nandavardhan R and Aashishkumar and G Divya. ”Palisade – Prompt Injection Detection Framework.” arXiv:2410.21146 - [11] Sizhe Chen and Arman Zharmagambetov and Saeed Mahloujifar and Kamalika Chaudhuri and David Wagner and Chuan Guo. ”SecAlign: Defending Against Prompt Injection with Preference Optimization.” arXiv:2410.05451 - [12] Chong Zhang and Mingyu Jin and Qinkai Yu and Chengzhi Liu and Haochen Xue and Xiaobo Jin. ”Goal-guided Generative Prompt Injection Attack on Large Language Models.” arXiv:2404.07234 - [13] Huawei Lin and Yingjie Lao and Tong Geng and Tan Yu and Weijie Zhao. ”UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models.” arXiv:2502.13141 - [14] Edoardo Debenedetti and Ilia Shumailov and Tianqi Fan and Jamie Hayes and Nicholas Carlini and Daniel Fabian and Christoph Kern and Chongyang Shi and Andreas Terzis and Florian Tramèr. ”Defeating Prompt Injections by Design.” arXiv:2503.18813 - [15] William Hackett and Lewis Birch and Stefan Trawicki and Neeraj Suri and Peter Garraghan. ”Bypassing Prompt Injection and Jailbreak Detection in LLM Guardrails.” arXiv:2504.11168 - [16] OWASP Foundation. (2023). ”Top 10 for LLM Applications.” https://owasp.org/www-project-top-10-for-llm-applications - [17] OpenAI. (2023). ”GPT-4 Technical Report.” https://openai.com/research/gpt-4 - [18] Anthropic. (2023). ”Many-shot Jailbreaking.” https://www.anthropic.com/research/many-shot-jailbreaking - [19] Le Wang and Zonghao Ying and Tianyuan Zhang and Siyuan Liang and Shengshan Hu and Mingchuan Zhang and Aishan Liu and Xianglong Liu. ”Manipulating Multimodal Agents via Cross-Modal Prompt Injection.” arXiv:2504.14348 - [20] Jiaqi Xue and Mengxin Zheng and Ting Hua and Yilin Shen and Yepeng Liu and Ladislau Boloni and Qian Lou. ”TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models.” arXiv:2306.06815 - [21] Dennis Jacob and Hend Alzahrani and Zhanhao Hu and Basel Alomair and David Wagner. ”PromptShield: Deployable Detection for Prompt Injection Attacks.” arXiv:2501.15145 - [22] Wallace, Eric and Zhao, Tony and Feng, Shi and Singh, Sameer. ”Concealed Data Poisoning Attacks on NLP Models.” Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - [23] Irene Solaiman, et al. ”Release Strategies and the Social Impacts of Language Models.” arXiv:1908.09203 - [24] Miles Brundage and Shahar Avin and Jasmine Wang and Haydn Belfield, et al. ”Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims.” arXiv preprint arXiv:2004.07213 (2020). - [25] Percy Liang and Rishi Bommasani and Tony Lee, et al. ”Holistic Evaluation of Language Models.” arXiv preprint arXiv:2211.09110 (2023). - [26] Tom B. Brown and Benjamin Mann and Nick Ryder, et al. ”Language Models are Few-Shot Learners.” arXiv preprint arXiv:2005.14165 (2020). - [27] Sébastien Bubeck, et al. ”Sparks of Artificial General Intelligence: Early experiments with GPT-4.” arXiv preprint arXiv:2303.12712 (2023). - [28] Huang, Lei and Yu, Weijiang, et al. ”A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions.” arXiv preprint arXiv:2311.05232 (2025) - [29] Nelson Elhage and Tristan Hume, et al. ”Toy Models of Superposition.” arXiv preprint arXiv:2209.10652 (2022). - [30] Kevin Meng and David Bau, et al. ”Locating and Editing Factual Associations in GPT.” arXiv preprint arXiv:2202.05262 (2023). - [31] Kaijie Zhu and Jindong Wang, et al. ”PromptRobust: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts.” arXiv preprint arXiv:2306.04528 (2024) - [32] Ari Holtzman and Jan Buys, et al. ”The Curious Case of Neural Text Degeneration.” arXiv preprint arXiv:1904.09751 (2020). - [33] Colin Raffel and Noam Shazeer, et al. ”Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer” arXiv preprint arXiv:1910.10683 (2023). - [34] Daniel M. Ziegler and Nisan Stiennon, et al. ”Fine-Tuning Language Models from Human Preferences” arXiv preprint arXiv:1909.08593 (2020). - [35] Bender Emily M. and Gebru Timnit, et al. ”On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” Association for Computing Machinery (2021). - [36] Wolf Thomas and Debut Lysandre, et al. ”Transformers: State-of-the-Art Natural Language Processing.” Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. - [37] Jason Wei and Xuezhi Wang, et al. ”Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” arXiv preprint arXiv:2201.11903 (2023). - [38] Nicholas Carlini and Daniel Paleka, et al. ”Stealing Part of a Production Language Model” arXiv preprint arXiv:2403.06634 (2024). - [39] Inioluwa Deborah Raji and Andrew Smart, et al. ”Closing the AI Accountability Gap: Defining an End-to-End Framework for Internal Algorithmic Auditing.” arXiv preprint arXiv:2001.00973 (2020). - [40] Nikhil Kandpal and Haikang Deng, et al. ”Large Language Models Struggle to Learn Long-Tail Knowledge.” arXiv preprint arXiv:2211.08411 (2023). - [41] Xuchen Suo, et al. ”Signed-Prompt: A New Approach to Prevent Prompt Injection Attacks Against LLM-Integrated Applications” arXiv preprint arXiv:2401.07612 (2024). - [42] Yu Peng and Zewen Long, et al. ”Playing Language Game with LLMs Leads to Jailbreaking.” arXiv preprint arXiv:2411.12762 (2024). - [43] Bangxin Li, et al. ”Exploiting Uncommon Text-Encoded Structures for Automated Jailbreaks in LLMs.” arXiv preprint arXiv:2406.08754v2 (2024). - [44] Ziqiu Wang and Jun Liu, et al. ”Poisoned LangChain: Jailbreak LLMs by LangChain.” arXiv preprint arXiv:2406.18122 (2024). - [45] Unit 42. (2024). ”Deceptive Delight: Jailbreak LLMs Through Camouflage and Distraction.” Palo Alto Networks. - [46] Nils Reimers and Iryna Gurevych. ”Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.” arXiv preprint arXiv:1908.10084 (2019). - [47] Ashish Vaswani and Noam Shazeer and Niki Parmar, et al. ”Attention Is All You Need.” arXiv preprint arXiv:1706.03762 (2023). - [48] Xinyue Shen, et al. ””Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models” arXiv preprint arXiv:2308.03825 (2024). - [49] Peng Ding and Jun Kuang, et al. ”A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily” arXiv preprint arXiv:2311.08268 (2024). - [50] Hugo Touvron and Thibaut Lavril, et al. ”LLaMA: Open and Efficient Foundation Language Models” arXiv preprint arXiv:2302.13971 (2023). - [51] LMSYS ORG. ”Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality” https://lmsys.org/blog/2023-03-30-vicuna/. - [52] Mistral AI. ”Mistral 7B” https://mistral.ai/news/announcing-mistral-7b - [53] Apurv Verma and Satyapriya Krishna, et al. ”Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)” arXiv preprint arXiv:2407.14937 (2024).

Rendering Paper...