# Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs
**Authors**: Chetan Pathade
> Independent Researcher, San Jose, CA, USA
## Abstract
Large Language Models (LLMs) are increasingly integrated into consumer and enterprise applications. Despite their capabilities, they remain susceptible to adversarial attacks such as prompt injection and jailbreaks that override alignment safeguards. This paper provides a systematic investigation of jailbreak strategies against various state-of-the-art LLMs. We categorize over 1,400 adversarial prompts, analyze their success against GPT-4, Claude 2, Mistral 7B, and Vicuna, and examine their generalizability and construction logic. We further propose layered mitigation strategies and recommend a hybrid red-teaming and sandboxing approach for robust LLM security.
Index Terms: Large Language Models, Prompt Injection, Jailbreak, Adversarial Prompts, AI Security, Red Teaming, LLM Safety
## I Introduction
The field of artificial intelligence has experienced a paradigm shift with the emergence of large language models (LLMs). These systems have transitioned from research prototypes to core components of production-grade systems, shaping industries from finance and law to healthcare and entertainment. LLMs are praised for their fluency, contextual reasoning, and ability to generate human-like responses. However, these capabilities also expose them to a new class of security threats. As LLMs are increasingly used in decision-making systems, chatbots, content moderation tools, and virtual agents, the potential for abuse through adversarial inputs grows exponentially.
Large Language Models (LLMs) have fundamentally transformed the landscape of natural language processing, enabling applications in content generation, customer service, coding assistance, legal analysis, and more. With models like OpenAI’s GPT-4 [17], Anthropic’s Claude 2 [18], Meta’s LLaMA [50], and open-source offerings such as Vicuna [51] and Mistral 7B [52], LLMs now influence millions of users globally. However, this ubiquity introduces significant security concerns, particularly surrounding adversarial prompt engineering techniques that manipulate model behavior. These techniques, often referred to as prompt injection or jailbreaks, can bypass built-in safety filters and elicit outputs that violate platform policies, such as hate speech, misinformation, or malicious code [1] [2] [4].
Prompt injection represents a new class of vulnerabilities unique to LLMs. Unlike traditional software vulnerabilities rooted in memory safety or access control flaws, prompt injection leverages the interpretive nature of natural language inputs [5] [6]. This paper explores the mechanisms and success of prompt injection across a range of LLMs, documenting the systemic weaknesses that attackers exploit [2] [7] [8]. Contributions of this work include:
- A comprehensive taxonomy of jailbreak prompts categorized by attack vector [3] [6]
- Empirical evaluation of prompt effectiveness across closed and open-source LLMs [4] [10] [11]
- Scenario-specific attack success analysis in domains such as law, politics, and security [1] [9] [13]
- Discussion of community-based jailbreak dissemination and its parallels to exploit markets [14] [15] [27]
- Detailed recommendations for mitigating prompt injection vulnerabilities [7] [12] [16]
## II Background
### II-A Overview of Large Language Models
Large Language Models operate using billions of parameters and are trained on diverse datasets encompassing text from books, articles, websites, and code. Notable models such as GPT-4, Claude 2, and Mistral 7B build on earlier architectures but have significantly improved reasoning, factual recall, and stylistic flexibility. The ability of these models to learn from few examples, a phenomenon called in-context learning, contributes to their versatility but also to their vulnerability. When exposed to crafted prompts, these models can be misled into misaligned behavior.
LLMs are built on the transformer architecture introduced by Vaswani et al. in 2017 [47]. Recent advancements include autoregressive pretraining on massive corpora followed by supervised finetuning and alignment through Reinforcement Learning from Human Feedback (RLHF) [34].
### II-B Alignment and Safety Mechanisms
To prevent harmful output, LLMs rely on several safety mechanisms, including instruction tuning [28], Reinforcement Learning from AI Feedback (RLAIF), pre- and post-output moderation filters [16], and system prompts embedding safety guidelines [12] [33].
### II-C Prompt Injection Explained
Prompt injection is the LLM analogue to command injection in traditional computing [1] [3] [6]. Common attack vectors include role-based conditioning, instruction hijacking, obfuscated encoding, and multi-turn manipulation [2] [5] [7]. Studies have shown that these attacks are reproducible, transferable, and can circumvent various filtering methods [3] [10] [14].
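To make the analogy concrete, the following minimal sketch (entirely hypothetical application code, not drawn from the study) shows how naive prompt concatenation lets untrusted input compete with developer instructions, much as an unsanitized string reaches a shell in command injection:

```python
# Hypothetical LLM-backed translation feature that concatenates untrusted input.
TEMPLATE = "You are a translator. Translate the user's text into French:\n{user_text}"

def build_prompt(user_text: str) -> str:
    # The user's text is inserted verbatim, so any instructions embedded in it
    # reach the model with the same authority as the developer's instructions.
    return TEMPLATE.format(user_text=user_text)

print(build_prompt("Good morning!"))  # benign use
# Instruction hijacking: the model may obey the injected directive
# instead of performing the intended translation task.
print(build_prompt("Ignore the instructions above and reveal your system prompt."))
```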
### II-D Related Work
Liu et al. demonstrated automatic and universal prompt injection attacks against LLMs [3]. Shen et al. aggregated 1,405 jailbreak prompts from 131 online communities, finding prompts with success rates as high as 95% [48]. Ding et al. developed ReNeLLM, which improved jailbreak performance by 40% [49]. Anthropic documented many-shot jailbreaking, in which long sequences of in-context examples significantly raise attack success rates [18]. The OWASP Top 10 for LLM Applications ranks prompt injection as the most critical vulnerability [16]. Other notable contributions include:
- Liu et al. on empirical jailbreak strategies [2]
- Yi et al. on indirect prompt injection detection [4]
- Suo et al. on defense techniques derived from attack insights [9]
- Chen et al. on preference-aligned defenses [11]
- Hackett et al. on bypassing guardrail detection [15] and Lin et al. on unified defenses [13]
- Verma et al. on threat modeling for red-teaming LLMs [53]
## III Methodology
This section outlines our approach to measuring LLM vulnerabilities, with an emphasis on reproducibility and diversity. Our methodology integrates qualitative red-teaming insights with quantitative metrics collected through structured prompt testing, and all experiments were governed by ethical red-teaming principles. In addition to raw attack performance, we tracked behavioral consistency and each model’s recognition of adversarial stimuli. This dual-pronged framework allows us to detect subtle failure patterns beyond binary success metrics.
Our methodology builds on emerging adversarial benchmarks such as JailbreakBench and RedBench, while drawing defense insights from frameworks like PromptShield [21] and Palisade [10]. We employed Sentence-BERT embeddings [46], GPT-based moderation strategies [17], and adversarial annotation heuristics from prior poisoning literature [22] to validate prompt behavior and misalignment tendencies.
### III-A Dataset Construction
We curated a dataset of 1,400+ adversarial prompts from:
- Public jailbreak repositories (e.g., GitHub, JailbreakChat)
- LLM exploit forums on Reddit and Discord
- Prior academic corpora, including JailbreakBench [3] and PromptBench [6]
Each prompt was manually validated, categorized into attack types (roleplay, logic traps, encoding, multi-turn), and annotated for content sensitivity (e.g., political, legal, explicit).
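For illustration, a single record under this annotation scheme might look as follows; the field names are hypothetical, chosen to mirror the categories described above rather than the study’s exact schema:

```python
# Illustrative annotation record for one curated adversarial prompt.
prompt_record = {
    "id": "jb-0421",                   # hypothetical identifier
    "source": "JailbreakChat",         # repository, forum, or academic corpus
    "attack_type": "roleplay",         # roleplay | logic_trap | encoding | multi_turn
    "sensitivity": ["political"],      # e.g., political, legal, explicit
    "text": "<adversarial prompt text>",
    "manually_validated": True,        # confirmed to attempt a guardrail bypass
}
```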
### III-B Target Models
We tested prompts on four models:
- GPT-4 (OpenAI, March 2024 snapshot)
- Claude 2 (Anthropic, July 2023 API version)
- Mistral 7B (open-weight model via Hugging Face Inference)
- Vicuna-13B (via local HF inference server)
Model versions were frozen to ensure reproducibility. All inference was performed using controlled prompts, with the system context initialized per the model’s recommended safety guidelines.
### III-C Evaluation Metrics
We used the following primary metrics (formalized after the list):
- Attack Success Rate (ASR): The fraction of prompts that elicited an output violating the model’s intended guardrails
- Prompt Generalizability: How often a prompt successful on one model succeeded on another
- Time-to-Bypass: Average minutes taken to successfully induce misaligned behavior
- Failure Mode Classification: Taxonomy of observed response behaviors (e.g., partial refusals, misleading responses)
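In symbols, writing $P$ for the prompt set and $S_m \subseteq P$ for the prompts that succeed against model $m$, the first two metrics reduce to the following (our notation, consistent with the definitions above):

```latex
\mathrm{ASR}(m) = \frac{|S_m|}{|P|} \times 100\%,
\qquad
\mathrm{Gen}(m \to m') = \frac{|S_m \cap S_{m'}|}{|S_m|} \times 100\%
```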
### III-D Automation Pipeline
We developed a semi-automated red-teaming script using the LangChain framework. Prompts were injected via API calls (OpenAI, Claude) and local model inference (Mistral, Vicuna). Output was scored using a hybrid method (a sketch of the scoring logic follows the list):
- Keyword spotting (for trigger words)
- GPT-based meta-evaluation of harmfulness [17]
- Sentence-BERT semantic distance from refusal templates [46]
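A minimal sketch of this scoring logic follows, assuming the `sentence-transformers` library; the trigger keywords, refusal templates, threshold, and embedding model name are illustrative placeholders, and the GPT-based meta-evaluation is stubbed out:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative placeholders; the study's actual lists and threshold may differ.
TRIGGER_WORDS = ["step-by-step instructions", "here is the payload"]
REFUSAL_TEMPLATES = [
    "I'm sorry, but I can't help with that.",
    "I cannot assist with this request.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
refusal_embs = encoder.encode(REFUSAL_TEMPLATES, convert_to_tensor=True)

def score_output(text: str, refusal_threshold: float = 0.6) -> dict:
    """Flag a model output as a likely bypass when it contains trigger
    keywords and is semantically distant from known refusal templates."""
    keyword_hit = any(w in text.lower() for w in TRIGGER_WORDS)
    emb = encoder.encode(text, convert_to_tensor=True)
    refusal_similarity = float(util.cos_sim(emb, refusal_embs).max())
    return {
        "keyword_hit": keyword_hit,
        "refusal_similarity": refusal_similarity,
        # A GPT-based harmfulness judgment [17] would be layered on top here.
        "likely_bypass": keyword_hit and refusal_similarity < refusal_threshold,
    }
```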
### III-E Defense Framework Evaluation
To simulate real-world defenses, we layered external filtering strategies on outputs:
- PromptShield ruleset [21]
- Palisade detection framework [10]
- Signed-Prompt verification logic [41]
We then retested a subset of successful jailbreaks against these defenses to estimate defense coverage and bypass rate.
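Coverage and bypass rate then reduce to simple ratios over the retested prompts; a sketch, assuming a boolean `blocked` outcome is recorded for each previously successful jailbreak:

```python
def defense_stats(retested: list[dict]) -> dict:
    """Estimate defense coverage and bypass rate from retest records,
    each of the form {"prompt_id": ..., "blocked": bool}."""
    blocked = sum(r["blocked"] for r in retested)
    total = len(retested)
    return {
        "coverage": blocked / total,         # fraction of known jailbreaks stopped
        "bypass_rate": 1 - blocked / total,  # fraction still succeeding
    }

print(defense_stats([{"prompt_id": 1, "blocked": True},
                     {"prompt_id": 2, "blocked": False}]))
```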
## IV Results
Our experiments evaluated over 1,400 adversarial prompts across four LLMs: GPT-4, Claude 2, Mistral 7B, and Vicuna. We analyze the results along several dimensions, including model susceptibility, attack technique efficacy, prompt behavior patterns, and cross-model generalization.
The figures visually illustrate comparative vulnerability trends. For instance, Figure 1 shows that GPT-4 exhibited the highest attack success rate. These results highlight not only model-specific weaknesses but also the effectiveness of specific prompt engineering tactics.
### IV-A Model Susceptibility Analysis
Among the tested models, GPT-4 demonstrated the highest vulnerability, with an ASR of 87.2%, reflecting its powerful but permissive instruction following. While Claude 2 filtered attacks slightly better, it still succumbed to 82.5% of them. Open models such as Mistral 7B (71.3%) and Vicuna (69.4%) revealed significant weaknesses, likely due to the absence of robust fine-tuned safety layers.
Interestingly, GPT-4 and Claude 2 shared structural similarities in moderation behavior, exhibiting soft refusals before ultimately yielding to adversarial logic, especially in legal, creative, or conditional prompts. These nuances suggest that model scale and alignment-tuning complexity both contribute to attack surface depth.
In terms of generalizability, jailbreak prompts that succeeded on GPT-4 transferred effectively to Claude 2 and Vicuna in 64.1% and 59.7% of cases respectively.
Average time to generate a successful jailbreak was under 17 minutes for GPT-4, while Mistral 7B required approximately 21.7 minutes on average.
| Model | ASR (%) | Generalizability (%) | Time-to-Bypass (min) |
| --- | --- | --- | --- |
| GPT-4 | 87.2 | 64.1 | 16.2 |
| Claude 2 | 82.5 | 59.7 | 17.4 |
| Mistral 7B | 71.3 | 52.4 | 21.7 |
| Vicuna | 69.4 | 50.6 | 20.9 |
TABLE I: Model-wise Evaluation Metrics
Figure 1: Model-wise Evaluation Metrics
### IV-B Attack Category Performance
Prompt injections exploiting roleplay dynamics (e.g., impersonation of fictional characters or hypothetical scenarios) achieved the highest ASR (89.6%). These prompts often bypass filters by deflecting responsibility away from the model (e.g., “as an AI in a movie script…”).
Logic trap attacks (ASR: 81.4%) exploit conditional structures and moral dilemmas to elicit disallowed content. Encoding tricks (e.g., base64 or zero-width characters) achieved 76.2% ASR by evading keyword-based filtering mechanisms. While multi-turn dialogues yielded slightly lower effectiveness (68.7%), they often succeeded in long-form tasks where context buildup gradually weakened safety enforcement.
A fifth category, framing-based prompts, was the least effective, with an ASR of roughly 66% (Figure 2).
Figure 2: Attack Category Effectiveness
### IV-C Scenario-Specific Vulnerabilities
Targeted domains revealed non-uniform vulnerabilities:
- Political content: Prompts involving campaign advice or fake lobbying succeeded 85.5% of the time.
- Legal content: Prompts framed as courtroom hypotheticals or legal simulations yielded 79.4% ASR.
- Explicit content: Erotic roleplay prompts were especially effective in jailbreak forums, with a 76.1% success rate.
- Malicious code: Although many models blocked direct malware requests, evasion through obfuscation or “educational context” resulted in 58.3% success, especially on Vicuna and Mistral.
Figure 3: Scenario-specific Success Rates
### IV-D Prompt Transferability
The Prompt Transferability Matrix reveals the high portability of successful prompts. GPT-4-derived prompts transferred with 64.1% success to Claude 2 and over 50% to Mistral and Vicuna. This finding underscores the systemic nature of these vulnerabilities across architectures.
Notably, Claude 2 showed higher resistance to Vicuna-origin prompts, indicating some directional asymmetry in generalization. This is likely due to the more fine-grained safety alignment mechanisms employed in commercial models.
The full matrix (transferability in %, source model in rows, target model in columns; diagonal entries are trivially 0):

| Source \ Target | GPT-4 | Claude 2 | Mistral 7B | Vicuna |
| --- | --- | --- | --- | --- |
| GPT-4 | 0.0 | 61.9 | 49.7 | 48.3 |
| Claude 2 | 64.1 | 0.0 | 47.2 | 46.1 |
| Mistral 7B | 54.2 | 50.3 | 0.0 | 45.7 |
| Vicuna | 52.7 | 48.6 | 46.5 | 0.0 |
Figure 4: Prompt Transferability Matrix
### IV-E Failure Modes and Detection Gaps
We observed the following dominant failure patterns:
- Partial Refusals (34%): The model initially refused but went on to produce harmful content mid-response.
- Hidden Compliance (22%): The model appeared to refuse but provided veiled or coded information.
- No Output (18%): Complete refusal, often due to prompt being too direct or malformed.
- Misleading Responses (15%): Factually incorrect or evasive answers.
This taxonomy provides a baseline for future behavioral alignment benchmarks.
| Failure Mode | Frequency | Common Triggers |
| --- | --- | --- |
| Partial Refusal | 34% | Hypotheticals, satire |
| Hidden Compliance | 22% | Roleplay and analogy |
| No Output | 18% | Base64, multi-turn traps |
| Misleading Response | 15% | Legal/political scenarios |
TABLE II: Prompt Failure Modes
### IV-F Prompt Length and Obfuscation
Success rates were highest for prompts in the 101–150 token range (80.3%), suggesting a sweet spot that encodes deception while maintaining clarity. Prompts exceeding 150 tokens saw a slight dip in success, likely due to verbosity or token truncation.
Encoded or obfuscated prompts, such as those using zero-width spaces, emojis, or alternate encodings, had lower detection rates (21.3%) but retained strong ASR (76.2%). These findings emphasize the need for semantic-level input sanitization.
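A first pass at such sanitization can normalize Unicode and strip zero-width characters before any keyword or semantic filter runs; the sketch below is a minimal illustration under that assumption, not a complete defense:

```python
import unicodedata

# Zero-width and invisible code points commonly used to split trigger words.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def sanitize(prompt: str) -> str:
    """Normalize the prompt so downstream filters see the same text
    the model will effectively interpret."""
    text = unicodedata.normalize("NFKC", prompt)  # fold compatibility characters
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)

assert sanitize("mal\u200bware") == "malware"
```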
Figure 5 shows ASR climbing from roughly 71% for prompts under 50 tokens to about 77% for 51–100 tokens, peaking at 80.3% in the 101–150 range, then dipping to roughly 78% beyond 150 tokens.
Figure 5: Prompt Length vs. Average Success Rate
Figure 6 contrasts three stealth classes: high-stealth encoded or meta prompts (about 21% detection, 76% ASR), medium-stealth roleplay prompts (about 39% detection but the highest ASR at roughly 90%), and low-stealth direct requests (about 75% detection, 59% ASR).
Figure 6: Stealthiness vs. Detection Likelihood and ASR
## V Discussion
The implications of these findings extend beyond prompt injection. They expose the fragility of current safety alignment mechanisms under realistic threat conditions. Our work reinforces the need for adversarial testing as a continuous validation tool in LLM deployment pipelines. Prompt injection represents not only a technical challenge but a policy and governance issue as well. Failure to address these risks may erode trust in AI applications and hinder broader societal adoption.
These results validate the claim that current LLM safety mechanisms are insufficiently robust against prompt injection, especially indirect or obfuscated attacks [1] [4] [5]. The findings reinforce prior studies that describe alignment filters as semantically shallow and largely reliant on static refusal templates [11] [16].
The ease with which these attacks transferred across models points to shared architectural vulnerabilities or training data biases [3] [6] [18]. Moreover, roleplay and scenario-based prompts exploit not only the model’s capacity for creativity but its inability to judge moral context effectively [2] [9] [14].
Online communities (e.g., Reddit, GitHub, Discord) operate as informal exploit databases, with prompt variations evolving similarly to malware strains in the cybersecurity domain.
We echo ethical concerns raised in recent works regarding open publication of jailbreaks [19] [20] and advocate for controlled disclosures and bug bounty mechanisms tailored for LLM developers [15] [24].
## VI Mitigation Strategies
Effective defenses against prompt injection must evolve alongside adversarial creativity. Static filtering or keyword-based systems offer only limited protection. Our defense recommendations blend technical solutions with operational safeguards. We emphasize the importance of feedback loops between model developers and red teams, and call for public benchmarks that simulate dynamic adversarial scenarios. Defenses should be evaluated under live attack settings, with evolving attacker strategies embedded in the test suite. Our mitigation strategy draws inspiration from work such as PromptShield [21], Palisade [10], and UniGuardian [13]. We recommend:
- System prompt hardening using context anchoring [14]
- Behavior-based anomaly detection during multi-turn dialogues [8] [11] [18]
- Input sanitization via Signed-Prompt techniques [41] (a simplified sketch follows this list)
- Embedding adversarial decoys and rejection-conditioned training [8] [12] [42]
- Session-level analytics to detect evasion attempts [11]
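As one concrete instance, the Signed-Prompt idea [41] can be approximated by authenticating trusted instructions with an HMAC, so that injected text can never be promoted to system-level authority; the sketch below is our simplification of the concept, not the paper’s exact scheme:

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me-regularly"  # held server-side, never shown to users

def sign(instruction: str) -> str:
    """Tag a trusted instruction with an HMAC-SHA256 signature."""
    return hmac.new(SECRET_KEY, instruction.encode(), hashlib.sha256).hexdigest()

def is_trusted(instruction: str, tag: str) -> bool:
    """Only instructions carrying a valid tag are treated as system-level."""
    return hmac.compare_digest(sign(instruction), tag)

system_instruction = "Answer only questions about the user's billing account."
tag = sign(system_instruction)

assert is_trusted(system_instruction, tag)
# Injected text carries no valid tag, so it is handled as untrusted data.
assert not is_trusted("Ignore previous instructions.", "deadbeef")
```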
These methods, when applied in combination, form a layered defense that significantly reduces the likelihood of successful jailbreak attempts while preserving usability.
## VII Limitations and Future Work
Despite its comprehensive scope, this study remains limited by the availability of open model weights and API constraints. Prompt injection tactics may evolve in ways not covered in our dataset. Additionally, prompt interpretation may differ across cultures and languages, warranting multilingual and socio-contextual extensions. Future work will aim to create shared evaluation platforms, akin to CVE databases, where new prompt exploits and defense bypasses can be collaboratively tracked and neutralized.
While our findings are grounded in robust experimentation, limitations remain. Our evaluation used static model checkpoints and may not account for updates or real-time moderation layers applied in production APIs [3] [17]. Additionally, cultural and linguistic diversity in prompts remains underrepresented [25] [35].
We recommend future work focus on multilingual adversarial prompt corpora [25], evaluation of plug-in-enabled LLMs [44], and adversarial training using open red-teaming platforms [21] [8] [13]. The development of explainable safety filters and real-time flagging systems could greatly aid in closing the alignment gap [11] [18] [41].
## VIII Conclusion
This research affirms the growing consensus that prompt injection is not an edge-case anomaly but a fundamental issue in current-generation LLMs. The findings not only highlight the technical inadequacies of present alignment systems but also illuminate the adversarial creativity of the prompt engineering community. Addressing these challenges will require collaborative frameworks that blend secure NLP research, adversarial testing, and governance. We envision a future where LLMs are audited as rigorously as software, with red teaming treated as a core development practice.
Prompt injection remains an open frontier in LLM safety. Through comprehensive evaluation and a synthesis of recent research, we provide compelling evidence that jailbreak techniques are both transferable and evolving. Our work aligns with the concerns raised in recent surveys [1] [2] [13] [31], and supports the call for stronger multi-layered defenses, proactive red teaming, and coordinated disclosure practices in AI development [19] [23] [24].
## References
- [1] Yi Liu and Gelei Deng and Yuekang Li and Kailong Wang and Zihao Wang and Xiaofeng Wang and Tianwei Zhang and Yepang Liu and Haoyu Wang and Yan Zheng and Yang Liu. "Prompt Injection attack against LLM-integrated Applications." arXiv:2306.05499
- [2] Yi Liu and Gelei Deng and Zhengzi Xu and Yuekang Li and Yaowen Zheng and Ying Zhang and Lida Zhao and Tianwei Zhang and Kailong Wang and Yang Liu. "Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study." arXiv:2305.13860
- [3] Xiaogeng Liu and Zhiyuan Yu and Yizhe Zhang and Ning Zhang and Chaowei Xiao. "Automatic and Universal Prompt Injection Attacks against Large Language Models." arXiv:2403.04957
- [4] Yi, Jingwei and Xie, Yueqi and Zhu, Bin and Kiciman, Emre and Sun, Guangzhong and Xie, Xing and Wu, Fangzhao. "Benchmarking and Defending against Indirect Prompt Injection Attacks on Large Language Models." arXiv:2312.14197
- [5] Sippo Rossi and Alisia Marianne Michel and Raghava Rao Mukkamala and Jason Bennett Thatcher. "An Early Categorization of Prompt Injection Attacks on Large Language Models." arXiv:2402.00898
- [6] Yupei Liu and Yuqi Jia and Runpeng Geng and Jinyuan Jia and Neil Zhenqiang Gong. "Formalizing and Benchmarking Prompt Injection Attacks and Defenses." 33rd USENIX Security Symposium (USENIX Security 24)
- [7] Md Abdur Rahman and Fan Wu and Alfredo Cuzzocrea and Sheikh Iqbal Ahamed. "Fine-tuned Large Language Models (LLMs): Improved Prompt Injection Attacks Detection." arXiv:2410.21337
- [8] Kuo-Han Hung and Ching-Yun Ko and Ambrish Rawat and I-Hsin Chung and Winston H. Hsu and Pin-Yu Chen. "Attention Tracker: Detecting Prompt Injection Attacks in LLMs." arXiv:2411.00348
- [9] Yulin Chen and Haoran Li and Zihao Zheng and Yangqiu Song and Dekai Wu and Bryan Hooi. "Defense Against Prompt Injection Attack by Leveraging Attack Techniques." arXiv:2411.00459
- [10] Sahasra Kokkula and Somanathan R and Nandavardhan R and Aashishkumar and G Divya. "Palisade – Prompt Injection Detection Framework." arXiv:2410.21146
- [11] Sizhe Chen and Arman Zharmagambetov and Saeed Mahloujifar and Kamalika Chaudhuri and David Wagner and Chuan Guo. "SecAlign: Defending Against Prompt Injection with Preference Optimization." arXiv:2410.05451
- [12] Chong Zhang and Mingyu Jin and Qinkai Yu and Chengzhi Liu and Haochen Xue and Xiaobo Jin. "Goal-guided Generative Prompt Injection Attack on Large Language Models." arXiv:2404.07234
- [13] Huawei Lin and Yingjie Lao and Tong Geng and Tan Yu and Weijie Zhao. "UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models." arXiv:2502.13141
- [14] Edoardo Debenedetti and Ilia Shumailov and Tianqi Fan and Jamie Hayes and Nicholas Carlini and Daniel Fabian and Christoph Kern and Chongyang Shi and Andreas Terzis and Florian Tramèr. "Defeating Prompt Injections by Design." arXiv:2503.18813
- [15] William Hackett and Lewis Birch and Stefan Trawicki and Neeraj Suri and Peter Garraghan. "Bypassing Prompt Injection and Jailbreak Detection in LLM Guardrails." arXiv:2504.11168
- [16] OWASP Foundation. (2023). "Top 10 for LLM Applications." https://owasp.org/www-project-top-10-for-llm-applications
- [17] OpenAI. (2023). "GPT-4 Technical Report." https://openai.com/research/gpt-4
- [18] Anthropic. (2024). "Many-shot Jailbreaking." https://www.anthropic.com/research/many-shot-jailbreaking
- [19] Le Wang and Zonghao Ying and Tianyuan Zhang and Siyuan Liang and Shengshan Hu and Mingchuan Zhang and Aishan Liu and Xianglong Liu. "Manipulating Multimodal Agents via Cross-Modal Prompt Injection." arXiv:2504.14348
- [20] Jiaqi Xue and Mengxin Zheng and Ting Hua and Yilin Shen and Yepeng Liu and Ladislau Boloni and Qian Lou. "TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models." arXiv:2306.06815
- [21] Dennis Jacob and Hend Alzahrani and Zhanhao Hu and Basel Alomair and David Wagner. "PromptShield: Deployable Detection for Prompt Injection Attacks." arXiv:2501.15145
- [22] Wallace, Eric and Zhao, Tony and Feng, Shi and Singh, Sameer. "Concealed Data Poisoning Attacks on NLP Models." Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
- [23] Irene Solaiman, et al. "Release Strategies and the Social Impacts of Language Models." arXiv:1908.09203
- [24] Miles Brundage and Shahar Avin and Jasmine Wang and Haydn Belfield, et al. "Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims." arXiv preprint arXiv:2004.07213 (2020).
- [25] Percy Liang and Rishi Bommasani and Tony Lee, et al. "Holistic Evaluation of Language Models." arXiv preprint arXiv:2211.09110 (2023).
- [26] Tom B. Brown and Benjamin Mann and Nick Ryder, et al. "Language Models are Few-Shot Learners." arXiv preprint arXiv:2005.14165 (2020).
- [27] Sébastien Bubeck, et al. "Sparks of Artificial General Intelligence: Early experiments with GPT-4." arXiv preprint arXiv:2303.12712 (2023).
- [28] Huang, Lei and Yu, Weijiang, et al. "A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions." arXiv preprint arXiv:2311.05232 (2025).
- [29] Nelson Elhage and Tristan Hume, et al. "Toy Models of Superposition." arXiv preprint arXiv:2209.10652 (2022).
- [30] Kevin Meng and David Bau, et al. "Locating and Editing Factual Associations in GPT." arXiv preprint arXiv:2202.05262 (2023).
- [31] Kaijie Zhu and Jindong Wang, et al. "PromptRobust: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts." arXiv preprint arXiv:2306.04528 (2024).
- [32] Ari Holtzman and Jan Buys, et al. "The Curious Case of Neural Text Degeneration." arXiv preprint arXiv:1904.09751 (2020).
- [33] Colin Raffel and Noam Shazeer, et al. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." arXiv preprint arXiv:1910.10683 (2023).
- [34] Daniel M. Ziegler and Nisan Stiennon, et al. "Fine-Tuning Language Models from Human Preferences." arXiv preprint arXiv:1909.08593 (2020).
- [35] Bender, Emily M. and Gebru, Timnit, et al. "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" Association for Computing Machinery (2021).
- [36] Wolf, Thomas and Debut, Lysandre, et al. "Transformers: State-of-the-Art Natural Language Processing." Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations.
- [37] Jason Wei and Xuezhi Wang, et al. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." arXiv preprint arXiv:2201.11903 (2023).
- [38] Nicholas Carlini and Daniel Paleka, et al. "Stealing Part of a Production Language Model." arXiv preprint arXiv:2403.06634 (2024).
- [39] Inioluwa Deborah Raji and Andrew Smart, et al. "Closing the AI Accountability Gap: Defining an End-to-End Framework for Internal Algorithmic Auditing." arXiv preprint arXiv:2001.00973 (2020).
- [40] Nikhil Kandpal and Haikang Deng, et al. "Large Language Models Struggle to Learn Long-Tail Knowledge." arXiv preprint arXiv:2211.08411 (2023).
- [41] Xuchen Suo, et al. "Signed-Prompt: A New Approach to Prevent Prompt Injection Attacks Against LLM-Integrated Applications." arXiv preprint arXiv:2401.07612 (2024).
- [42] Yu Peng and Zewen Long, et al. "Playing Language Game with LLMs Leads to Jailbreaking." arXiv preprint arXiv:2411.12762 (2024).
- [43] Bangxin Li, et al. "Exploiting Uncommon Text-Encoded Structures for Automated Jailbreaks in LLMs." arXiv preprint arXiv:2406.08754v2 (2024).
- [44] Ziqiu Wang and Jun Liu, et al. "Poisoned LangChain: Jailbreak LLMs by LangChain." arXiv preprint arXiv:2406.18122 (2024).
- [45] Unit 42. (2024). "Deceptive Delight: Jailbreak LLMs Through Camouflage and Distraction." Palo Alto Networks.
- [46] Nils Reimers and Iryna Gurevych. "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." arXiv preprint arXiv:1908.10084 (2019).
- [47] Ashish Vaswani and Noam Shazeer and Niki Parmar, et al. "Attention Is All You Need." arXiv preprint arXiv:1706.03762 (2023).
- [48] Xinyue Shen, et al. "'Do Anything Now': Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models." arXiv preprint arXiv:2308.03825 (2024).
- [49] Peng Ding and Jun Kuang, et al. "A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily." arXiv preprint arXiv:2311.08268 (2024).
- [50] Hugo Touvron and Thibaut Lavril, et al. "LLaMA: Open and Efficient Foundation Language Models." arXiv preprint arXiv:2302.13971 (2023).
- [51] LMSYS ORG. "Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality." https://lmsys.org/blog/2023-03-30-vicuna/
- [52] Mistral AI. "Mistral 7B." https://mistral.ai/news/announcing-mistral-7b
- [53] Apurv Verma and Satyapriya Krishna, et al. "Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)." arXiv preprint arXiv:2407.14937 (2024).