# Robustness Assessment and Enhancement of Text Watermarking for Google’s SynthID
**Authors**: Xia Han, Qi Li, Jianbing Ni, and Mohammad Zulkernine
> ∗Xia Han and Qi Li contributed equally to this work.
Abstract
Recent advances in LLM watermarking methods such as SynthID-Text by Google DeepMind offer promising solutions for tracing the provenance of AI-generated text. However, our robustness assessment reveals that SynthID-Text is vulnerable to meaning-preserving attacks, such as paraphrasing, copy-paste modifications, and back-translation, which can significantly degrade watermark detectability. To address these limitations, we propose SynGuard, a hybrid framework that combines the semantic alignment strength of Semantic Invariant Robust (SIR) with the probabilistic watermarking mechanism of SynthID-Text. Our approach jointly embeds watermarks at both lexical and semantic levels, enabling robust provenance tracking while preserving the original meaning. Experimental results across multiple attack scenarios show that SynGuard improves watermark recovery by an average of 11.1% in F1 score compared to SynthID-Text. These findings demonstrate the effectiveness of semantic-aware watermarking in resisting real-world tampering. All code, datasets, and evaluation scripts are publicly available at: https://github.com/githshine/SynGuard.
I Introduction
Text watermarking has emerged as a promising solution for tracing the origin of AI-generated content, offering a lightweight, model-agnostic method for content provenance verification [1, 2]. It identifies generated text from surface form alone, without access to the original prompt or underlying model. This makes watermarking especially appealing in open-world scenarios, where black-box models and unknown sources proliferate.
Among existing approaches, Google DeepMind’s SynthID-Text is state-of-the-art [3], notable as the only watermarking method integrated into a real-world product (Google’s Gemini models), a rare industrial deployment in this domain. It embeds imperceptible statistical signals during generation via tournament sampling, departing from earlier post-hoc or green-list based methods [4, 1]. This approach introduces controlled stochasticity in token selection and shows improved detectability in benign settings. However, its resilience to malicious tampering remains underexplored. Previous studies note the fragility of lexical watermarks under meaning-preserving, surface-altering transformations [5, 6]; SynthID-Text, despite advancements, shares this limitation, motivating deeper analysis of its practical robustness.
In this work, we systematically assess SynthID-Text under real-world meaning-preserving transformations: paraphrasing, synonym substitution, copy-paste rearrangement, and back-translation. These attacks preserve semantic content while modifying the lexical or syntactic surface form. Results reveal a critical vulnerability: detection accuracy drops sharply even under light paraphrasing or translation. These findings align with prior concerns, highlighting a gap in current capabilities.
To address this, we propose SynGuard, a hybrid scheme integrating Semantic Invariant Robust (SIR) alignment [6] with SynthID’s token-level probabilistic masking. Our method embeds provenance signals at both lexical and semantic levels: the semantic component guides generation toward SIR-favored contexts (enhancing robustness to synonym and paraphrase attacks), while SynthID’s token logic retains seed-derived randomness (resisting keyless removal).
Unlike prior lexical-only approaches [1, 3], SynGuard adds a semantic signal to detect tampering that preserves meaning but alters surface structure. This hybrid design better balances the false positive rate and tampering robustness. We formalize this via theoretical analysis (Section V-C), showing that semantically consistent transformations rarely suppress SIR-guided scores unless meaning is significantly distorted; to our knowledge, this is among the first formal analyses of watermark resilience under semantic equivalence.
Empirical evaluation across four attacks shows SynGuard improves average F1 by 11.1% over SynthID-Text, performing especially well under paraphrasing and round-trip translation (common in content reposting and cross-lingual reuse). We uncover a new vulnerability axis: back-translation-induced watermark degradation correlates with translation quality, as poorer machine translation distorts signals more even with preserved semantics. This insight introduces new considerations for evaluating robustness across linguistic contexts and highlights the need for multilingual benchmarks.
Our contributions are summarized as follows:
1. Conduct the first comprehensive robustness evaluation of SynthID-Text under four meaning-preserving transformations: paraphrasing, synonym substitution, copy-paste tampering, and back-translation.
1. Propose SynGuard, a hybrid algorithm combining semantic-aware token preferences with token-level probabilistic sampling.
1. Demonstrate SynGuard consistently improves detection robustness, particularly for surface-altered but meaning-preserved content.
1. Reveal back-translation attack vulnerability correlates with machine translation quality, an overlooked axis.
II Related Work
Text watermarking distinguishes AI-generated from human-written text by embedding specific information into text sequences without degrading quality. Based on the stage of insertion during text generation, methods fall into two types [4]: watermarking of existing text and watermarking during generation. The first type adds watermarks by post-processing existing text, typically by reformatting sentences with Unicode characters or altering the lexicon or syntax. Though easy to implement, such watermarks are easy to remove via reformatting or normalization.
Watermarking during generation is achieved by modifying logits during token generation. This approach is more stable, less perceptible, and harder for attackers to detect or remove. A key method is the KGW algorithm [1]: it splits the vocabulary into green and red lists using a pseudorandom seed. Adding a positive bias to green-list tokens makes them more likely to be selected than red ones. This skew enables high-confidence post hoc detection. KGW balances robustness and imperceptibility, underpinning several recent frameworks [7, 8, 9].
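The green/red-list mechanism can be sketched as follows. The function names, SHA-256 seeding, and the default `gamma`/`delta` values are illustrative assumptions, not KGW's exact implementation:

```python
import hashlib
import random

def green_list(prev_token_id: int, vocab_size: int, gamma: float = 0.5) -> set:
    """Split the vocabulary into a 'green' subset seeded by the previous token.

    gamma is the fraction of the vocabulary placed on the green list.
    """
    seed = int(hashlib.sha256(str(prev_token_id).encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)  # pseudorandom, reproducible from the seed
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])

def bias_logits(logits: list, prev_token_id: int, delta: float = 2.0) -> list:
    """Add a positive bias delta to green-list tokens before softmax,
    skewing sampling toward the green list."""
    green = green_list(prev_token_id, len(logits))
    return [z + delta if i in green else z for i, z in enumerate(logits)]
```

A detector then recomputes the green list per position and tests whether the fraction of green tokens is significantly above `gamma`.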
Google DeepMind’s SynthID-Text [3] advances generation-based watermarking by using pseudorandom functions (PRFs) and tournament sampling to guide token generation in a more randomized and less perceptible manner. During the sampling process, each candidate token is assigned $m$ independent $g$-values $(g_{1},...,g_{m})$, and the token with the highest total $g$-value (e.g., the sum of all $g_{i}$) among all candidates is selected. These $g$-values can later be used for watermark detection. This design improves robustness against removal attacks such as truncation and basic paraphrasing.
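A minimal sketch of this selection rule, following the "highest total $g$-value" description above. The hash-based PRF and the parameter `m=30` are illustrative assumptions; the production algorithm differs in detail:

```python
import hashlib
import random

def g_value(l: int, token_id: int, seed: int) -> int:
    """Pseudorandom binary watermark function g_l(x, r) in {0, 1} (hash-based stand-in)."""
    h = hashlib.sha256(f"{l}:{token_id}:{seed}".encode()).digest()
    return h[0] & 1

def tournament_sample(candidate_ids, seed: int, m: int = 30) -> int:
    """Pick the candidate with the highest total g-value; ties broken at random.

    Candidates are assumed to be pre-sampled from the LM distribution p_LM,
    so high-probability tokens may appear several times in the list.
    """
    def score(tid):
        return sum(g_value(l, tid, seed) for l in range(1, m + 1))
    best = max(score(t) for t in candidate_ids)
    winners = [t for t in candidate_ids if score(t) == best]
    return random.choice(winners)
```

Because candidates are drawn from $p_{LM}$ first, the winner is still a plausible token; the tournament only tilts selection toward tokens with high $g$-values.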
Despite these strengths, most generation-time watermarking algorithms, including SynthID-Text, do not incorporate semantic information when adjusting logits. As a result, they remain vulnerable to semantic-preserving adversarial attacks. Recent studies have begun exploring semantic-aware watermarking strategies [6, 10, 11]. A Semantic Invariant Robust watermarking algorithm is introduced [6], which maps extracted semantic features from preceding context into the logit space to guide next-token generation. In this approach, semantic similarity becomes a key indicator for detecting watermarks. While promising in terms of robustness, this method relies on additional language models, which increases computational complexity and resource consumption. Furthermore, enforcing semantic consistency reduces output diversity and naturalness.
III Preliminaries
III-A Large Language Model
A large language model (LLM) $M$ operates over a defined set of tokens, known as the vocabulary $V$ . Given a sequence of tokens $t=[t_{0},t_{1},...,t_{T-1}]$ , also referred to as the prompt, the model computes the probability distribution over the next token $t_{T}$ as $P_{M}(t_{T}\mid t_{:T-1})$ . The model $M$ then samples one token from the vocabulary $V$ according to this distribution and other sampling parameters (e.g., temperature). This process is repeated iteratively until the maximum token length is reached or an end-of-sequence (EOS) token is generated.
This next-token prediction is typically implemented using a neural network architecture called the Transformer [12]. The process involves two main steps:
1. The Transformer computes a vector of logits $z_{T}=M(t_{:T-1})$ over all tokens in $V$, based on the current context $t_{:T-1}$.
1. The softmax function is applied to these logits to produce a normalized probability distribution: $P_{M}(t_{T}\mid t_{:T-1})$ .
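The two steps above can be sketched directly (pure-Python stand-in for the Transformer's output head; temperature handling is the standard logit scaling, an assumption not stated in the text):

```python
import math
import random

def softmax(logits):
    """Step 2: normalized probability distribution from a logit vector."""
    mx = max(logits)
    exps = [math.exp(z - mx) for z in logits]  # subtract max for numerical stability
    s = sum(exps)
    return [e / s for e in exps]

def sample_next_token(logits, temperature=1.0, rng=random):
    """Sample the next token id from P_M(t_T | t_{:T-1});
    lower temperature sharpens the distribution."""
    probs = softmax([z / temperature for z in logits])
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

Generation repeats this loop, appending each sampled token to the context, until an EOS token or the length limit is reached.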
III-B SynthID-Text in LLM Text Watermarking
Text watermarking for LLMs operates mainly at two stages: embedding-level (modifying internal embedding vectors, which is complex and less generalizable) and generation-level (altering token generation via logits adjustment or sampling strategies). Generation-level methods include logits-based approaches (e.g., KGW algorithm [1], biasing logits toward “green list” tokens) and sampling-based approaches (e.g., Christ algorithm [13], using pseudorandom functions to guide sampling without logit modification).
SynthID-Text is a sampling-based algorithm featuring a novel tournament sampling mechanism for token selection. Candidate tokens are sampled from the original LLM-generated probability distribution $p_{LM}$, so higher-probability tokens may appear multiple times in the candidate set. Each candidate token is evaluated using $m$ independent pseudorandom binary watermark functions $g_{1},g_{2},...,g_{m}$. These functions assign a value of 0 or 1 to a token $x \in V$ based on both the token and a random seed $r \in \mathbb{R}$: $g_{l}(x,r) \in \{0,1\}$. The tournament sampling procedure selects the token with statistically high $g$-values across the $m$ functions, while respecting the base LLM distribution. To detect whether a text $t=[t_{1},...,t_{T}]$ is watermarked, the average $g$-value across all tokens and functions is computed:
$$
\text{Score}(t)=\frac{1}{mT}\sum_{i=1}^{T}\sum_{l=1}^{m}g_{l}(t_{i},r_{i}). \tag{1}
$$
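Eq. (1) can be computed directly once the $g$-values are available. In this sketch a hash-based stand-in PRF replaces SynthID's keyed watermark functions, and `m=30` is an assumed value:

```python
import hashlib

def detection_score(token_ids, seeds, m=30):
    """Eq. (1): mean g-value over T tokens and m watermark functions.

    `seeds` holds the per-position random seeds r_i. Watermarked text should
    score noticeably above the ~0.5 expected for unwatermarked text.
    """
    def g(l, tok, r):
        # hash-based stand-in for the pseudorandom binary function g_l(x, r)
        return hashlib.sha256(f"{l}:{tok}:{r}".encode()).digest()[0] & 1
    T = len(token_ids)
    total = sum(g(l, t, r) for t, r in zip(token_ids, seeds) for l in range(1, m + 1))
    return total / (m * T)
```

Detection then reduces to thresholding this score: values near 0.5 are consistent with unwatermarked text, higher values indicate the watermark.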
III-C Text Watermarking Challenges
Compared to watermarking techniques in other media such as images or audio [14, 15, 16, 17], embedding watermarks in text introduces a distinct set of challenges:
Token Budget Constraints: A standard $256\times 256$ image offers over 65K potential pixel positions for embedding watermarks [18]. In contrast, the maximum context length of LLMs like GPT-4 is around 8.2K tokens (with a 32K-token version available under limited access; see https://openai.com/index/gpt-4-research/), which is significantly smaller. This limited capacity makes it harder to embed watermarks that remain imperceptible to human readers, and increases vulnerability to adversarial edits. As a result, watermarking algorithms for text require more careful design to ensure both imperceptibility and robustness.
Perturbation Sensitivity: Text data is highly sensitive to editing [19]. While small pixel changes in an image are often imperceptible to the human eye, even minor alterations in a text, such as character replacements or word substitutions, can be easily noticed by readers or detected by spelling and grammar tools. Moreover, replacing entire words can unintentionally alter the meaning, introduce ambiguity, or degrade sentence fluency.
Vulnerability: Watermarks in text are particularly susceptible to removal through common natural language transformations. An attacker can easily re-edit the content by substituting synonyms, or paraphrasing with new sentence structures [20].
IV Evaluating the Robustness of SynthID-Text
This section presents the experimental settings, evaluation metrics, and results from the robustness analysis of the SynthID-Text watermarking algorithm. Section IV-A outlines the experimental setup, including the backbone model, dataset, and metrics used for evaluation. Sections IV-B through IV-E report SynthID-Text’s performance under four types of text editing attacks: synonym substitution, copy-and-paste, paraphrasing, and back-translation. Finally, Section IV-F summarizes and compares results across all attack types to provide a comprehensive evaluation.
IV-A Experimental Setup
Backbone Model and Dataset. All experiments were conducted using Sheared-LLaMA-1.3B [21], a model further pre-trained from meta-llama/Llama-2-7b-hf https://huggingface.co/meta-llama/Llama-2-7b-hf. The model used is publicly available via HuggingFace https://huggingface.co/princeton-nlp/Sheared-LLaMA-1.3B. For the dataset, we adopt the Colossal Clean Crawled Corpus (C4) [22], which includes diverse, high-quality web text. Each C4 sample is split into two segments: the first segment serves as the prompt for generation, while the second (human-written) segment is used as reference text. These unaltered human texts are treated as control data for evaluating the watermark detector’s false positive rate.
Evaluation Metrics. The robustness of SynthID-Text is evaluated using the following metrics:
- True Positive Rate (TPR): The proportion of watermarked texts correctly identified.
- False Positive Rate (FPR): The proportion of unwatermarked texts incorrectly identified as watermarked.
- F1 Score: The harmonic mean of precision and recall, computed at the best threshold.
- ROC-AUC: The area under the Receiver Operating Characteristic (ROC) curve, measuring overall classification performance across all thresholds.
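The threshold-dependent metrics above can be computed as follows (an illustrative helper; the actual evaluation uses the MarkLLM toolkit):

```python
def classification_metrics(scores_wm, scores_clean, threshold):
    """TPR, FPR, and F1 at a given detection threshold.

    scores_wm / scores_clean are detector scores for watermarked and
    unwatermarked texts; a score >= threshold is classified as watermarked.
    """
    tp = sum(s >= threshold for s in scores_wm)
    fp = sum(s >= threshold for s in scores_clean)
    tpr = tp / len(scores_wm)                     # recall on watermarked texts
    fpr = fp / len(scores_clean)                  # false alarms on clean texts
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * tpr / (precision + tpr)) if precision + tpr else 0.0
    return tpr, fpr, f1
```

Sweeping the threshold over all observed scores and integrating TPR against FPR yields the ROC-AUC; the "best threshold" in the tables is the one maximizing F1.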
Each experiment was conducted using 200 watermarked and 200 unwatermarked samples, each with a fixed length of $T=200$ tokens. All experiments were implemented using the MarkLLM toolkit [23].
IV-B Synonym Substitution Attack
Given an original text sequence, the synonym substitution attack replaces words with their synonyms until a specified replacement ratio $\epsilon$ is reached, or no further substitutions are possible. This maintains semantic fidelity while subtly altering the lexical surface of the text. A well-chosen $\epsilon$ keeps the semantic meaning largely intact, consistent with the attack's objective: to disrupt watermark detection without affecting readability or content.
In this work, synonym replacement is guided by a context-aware language model to ensure substitutions remain semantically appropriate. Specifically, we use WordNet [24], a widely used lexical database of English, to retrieve synonym sets for eligible words. For each target word, a candidate synonym is selected at random using NumPy's random-number generator [25], and the substitution is then refined with BERT-Large [26], which predicts contextually suitable replacements. The process repeats iteratively until the desired substitution ratio $\epsilon$ is reached or no valid substitutions remain, keeping the altered text semantically coherent while maximally disrupting watermark patterns.
Details of the BERT Span Attack. To perform context-aware synonym substitution, BERT-Large https://huggingface.co/google-bert/bert-large-uncased is first used to tokenize the watermarked text. Then, eligible words are iteratively replaced with contextually appropriate synonyms until either the maximum replacement ratio $\epsilon$ is reached or no further substitutions are possible. The substitution process proceeds as follows:
- Randomly select a word that has at least one synonym and replace it with a [MASK] token:
```
Original: "I love programming."
Masked:   "I [MASK] programming."
```
Listing 1: Word Masking
- Feed the masked sentence into the BERT-Large model, which produces a logits vector over the vocabulary using a forward pass.
- Rank all candidate words based on their logits and select the word with the highest probability to replace the masked token.
BERT-Large is chosen for its bidirectional architecture, allowing it to consider both preceding and succeeding context when predicting the masked word. This contextual understanding ensures that substituted words maintain semantic consistency with the original sentence.
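The substitution loop above can be sketched as follows. The synonym map and the `predict` callback (standing in for WordNet lookup and BERT-Large's masked-token prediction) are injected so the sketch stays self-contained and model-free:

```python
import random

def substitute_words(tokens, synonyms, predict, epsilon=0.3, rng=random):
    """Iteratively replace words that have synonyms, up to a ratio epsilon.

    `synonyms` maps a word to its (WordNet-style) synonym set;
    `predict(masked_tokens, i)` returns the contextually best replacement
    for the [MASK] at position i, as a masked LM like BERT-Large would.
    """
    tokens = list(tokens)
    budget = int(epsilon * len(tokens))            # max number of replacements
    eligible = [i for i, w in enumerate(tokens) if synonyms.get(w)]
    rng.shuffle(eligible)                          # random selection order
    for i in eligible[:budget]:
        masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        choice = predict(masked, i)
        if choice in synonyms[tokens[i]]:          # keep only valid synonyms
            tokens[i] = choice
    return tokens
```

Requiring the masked-LM choice to also lie in the WordNet synonym set is what keeps the substitution both contextually fluent and meaning-preserving.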
After applying the synonym substitution strategy to a set of 200 watermarked texts, each with a token length of $T=200$ , the resulting ROC curves are presented in Fig. 1. As shown, the area under the curve (AUC) gradually decreases as the replacement ratio increases. Even with a replacement ratio as high as 0.7, the AUC remains above 0.94, and the corresponding F1 score is relatively high at 0.884, as reported in Table I. These results demonstrate that SynthID-Text exhibits strong robustness against context-preserving lexical substitutions.
<details>
<summary>Synonym_Substitution_attack_for_SynthID-Text.png Details</summary>

ROC curves (TPR vs. FPR) comparing SynthID without attack (AUC = 1.0000) against synonym substitution at increasing replacement ratios: Word-S(Context)-0.3 (AUC = 0.9990), Word-S(Context)-0.5 (AUC = 0.9770), and Word-S(Context)-0.7 (AUC = 0.9493), with a diagonal random-guess baseline. AUC decreases as the replacement ratio grows, but all settings remain well above random guessing.
</details>
(a) Overall ROC curves under synonym substitution with different replacement ratios
<details>
<summary>zoom-in_synonym_substitution_attack_for_SynthID.png Details</summary>

Zoomed-in ROC curves with log-scaled FPR ($10^{-4}$ to $10^{0}$) and TPR from 0.90 to 1.00, showing behavior at very low false positive rates: SynthID (AUC = 1.0000) stays at TPR = 1.00 throughout, Word-S(Context)-0.3 (AUC = 0.9990) nearly matches it, while Word-S(Context)-0.5 (AUC = 0.9770) and Word-S(Context)-0.7 (AUC = 0.9493) rise more gradually.
</details>
(b) Zoomed-in ROC curves under synonym substitution with different replacement ratios
Figure 1: ROC curves of SynthID-Text under synonym substitution attacks with varying replacement ratios.
TABLE I: Watermark detection accuracy under different synonym substitution attack ratios.
| Attack | TPR | FPR | F1 with best threshold |
| --- | --- | --- | --- |
| No attack | 1.0 | 0.0 | 1.0 |
| Word-S(Context)-0.3 | 0.98 | 0.005 | 0.987 |
| Word-S(Context)-0.5 | 0.91 | 0.035 | 0.936 |
| Word-S(Context)-0.7 | 0.82 | 0.035 | 0.884 |
IV-C Copy-and-Paste Attack
Unlike synonym substitution attacks, the copy-and-paste attack does not alter the original watermarked text. Instead, it embeds the watermarked segment within a larger body of human-written or unwatermarked content. This type of attack exploits the fact that detection algorithms typically analyze text holistically; by diluting the watermarked portion, the overall watermark signal becomes weaker and harder to detect.
Prior work [9] has shown that when the watermarked portion comprises only 10% of the total text, the attack can outperform many paraphrasing methods in reducing watermark detectability. In this work, we experiment with different copy-and-paste ratios and evaluate the detection performance to assess robustness.
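The dilution effect admits a back-of-envelope model: since Eq. (1) averages $g$-values over all tokens, surrounding a watermarked segment with natural text (whose $g$-values average about 0.5) pulls the overall score toward the unwatermarked baseline. A minimal sketch, where the 0.6 watermarked mean used below is an assumed illustrative value:

```python
def diluted_score(wm_score, ratio, baseline=0.5):
    """Expected detector score after a copy-paste attack.

    A watermarked segment with mean g-value `wm_score` is embedded in
    `ratio` times as much natural text with mean g-value ~`baseline`;
    the detector averages over the combined token sequence.
    """
    return (wm_score + ratio * baseline) / (1 + ratio)
```

For example, `diluted_score(0.6, 10)` is about 0.509, barely above the 0.5 unwatermarked baseline, which illustrates why detection approaches random guessing at high copy-paste ratios.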
Fig. 2 presents the ROC curves for varying copy-and-paste ratios. The green curve represents the case where the added natural text is ten times longer than the original watermarked text, resulting in an AUC of roughly 0.67, only slightly above random guessing. As shown in Table II, the false positive rate (FPR) for ratio $=10$ reaches 0.53, meaning that more than half of the unwatermarked texts are incorrectly identified as watermarked. As the copy-and-paste ratio increases, detection performance degrades further. When the ratio reaches 20 or higher, the AUC decreases to around or below 0.5, effectively equating to or falling below random guessing.
<details>
<summary>figures/copy-paste_attack_roc.png Details</summary>

ROC curves (TPR vs. FPR) for copy-and-paste attacks: SynthID without attack (AUC = 1.0000), Copy-Paste-5 (AUC = 0.8559), Copy-Paste-10 (AUC = 0.6679), and Copy-Paste-15 (AUC = 0.5883), with a diagonal random-guess baseline. AUC drops steadily as the copy-paste ratio increases, approaching random guessing at the highest ratio shown.
</details>
Figure 2: ROC curves under different copy-and-paste attack ratios. The blue curve represents the original SynthID-Text ROC curve without attack; the gray curve indicates random guessing. Other curves depict results under varying ratios, where the ratio denotes how many times longer the inserted natural text is compared to the original watermarked text.
TABLE II: Watermark detection accuracy under different copy-and-paste attack ratios
| Attack | TPR | FPR | F1 with best threshold |
| --- | --- | --- | --- |
| No attack | 1.0 | 0.005 | 0.9975 |
| Copy-and-Paste-5 | 0.985 | 0.27 | 0.874 |
| Copy-and-Paste-10 | 0.995 | 0.53 | 0.788 |
| Copy-and-Paste-20 | 0.99 | 0.565 | 0.775 |
| Copy-and-Paste-30 | 0.99 | 0.565 | 0.775 |
IV-D Paraphrasing Attack
Paraphrasing attacks aim to modify the structure and wording of a paragraph while preserving its original semantic meaning. This is typically done by rephrasing sentences or altering word choice and sentence order. Therefore, paraphrasing can be characterized along two key dimensions: lexical diversity, which measures variation in vocabulary, and order diversity, which reflects changes in sentence or phrase order.
In this experiment, we adopted the Dipper paraphrasing model [27], which is built on the T5-XXL [22] architecture. Dipper allows fine-tuned control over both lexical and order diversity through configurable parameters. Two levels of lexical diversity were used to conduct the attacks, and the results are shown in Fig. 3.
From the graphs, it can be observed that, compared to the original ROC curve of SynthID-Text without attack in Fig. 3(a), the AUC in Fig. 3(b) and (c) decreases by approximately 0.04–0.05 when only lexical diversity is applied. When both lexical diversity and order diversity are set simultaneously, the AUC declines to 0.91 in Fig. 3(d) from 1.00 in the no-attack setting. The corresponding FPR and F1 scores are presented in Table III. In particular, when lex_diversity=10 and order_diversity=5 (the fourth row), the FPR exceeded 20% and the F1 score dropped to 0.84, indicating a significant reduction in detection accuracy under this paraphrasing condition.
<details>
<summary>figures/SynthID_solo_curve.png Details</summary>

ROC curve (TPR vs. FPR) for SynthID-Text without attack: the curve rises vertically to TPR = 1.0 at FPR = 0.0 and stays there (AUC = 1.0000), i.e., perfect separation, with a diagonal random-guess baseline shown for reference.
</details>
(a) No attack (original SynthID-Text)
<details>
<summary>figures/Dipper-5_synthID.png Details</summary>

ROC curve for SynthID-Text detection under Dipper paraphrasing with lex_diversity = 5. The curve (AUC = 0.9468) rises steeply to roughly (0.05, 0.75), reaches about (0.2, 0.92), and plateaus near 1.0 for FPR values above 0.6, remaining well above the random-guess diagonal.
</details>
(b) Dipper paraphrasing with $lex\_diversity=5$
<details>
<summary>figures/Dipper-10_synthID.png Details</summary>

ROC curve for SynthID-Text detection under Dipper paraphrasing with lex_diversity = 10. The curve (AUC = 0.9589) rises sharply to a TPR of about 0.7 at an FPR of roughly 0.02, reaches approximately (0.2, 0.95), then gradually approaches (1.0, 1.0), staying well above the random-guess diagonal.
</details>
(c) Dipper paraphrasing with $lex\_diversity=10$
<details>
<summary>figures/Dipper-10-5_SynthID.png Details</summary>

ROC curve for SynthID-Text detection under Dipper paraphrasing with lex_diversity = 10 and order_diversity = 5. The curve (AUC = 0.9101) rises steeply at first, reaching about (0.1, 0.75), then climbs more slowly through (0.2, 0.85) and (0.4, 0.93) toward (1.0, 1.0), remaining well above the random-guess diagonal.
</details>
(d) Dipper paraphrasing with $lex\_diversity=10$ and $order\_diversity=5$
Figure 3: ROC curves under paraphrasing attacks with different settings.
Note ∗: Due to hardware limitations in Google Colab Pro (a maximum GPU memory of 40 GB), Dipper could only be run once per session. As a result, the ROC curves were generated in separate runs, requiring a restart between executions, and are presented across multiple graphs.
TABLE III: Watermark detection accuracy under different paraphrasing attack settings
| Attack | TPR | FPR | F1 with best threshold |
| --- | --- | --- | --- |
| No attack | 1.0 | 0.0 | 1.0 |
| Dipper-5 | 0.915 | 0.16 | 0.882 |
| Dipper-10 | 0.92 | 0.125 | 0.8998 |
| Dipper-10-5 | 0.895 | 0.23 | 0.842 |
Note ∗: In this table, Dipper-$x$ denotes that the Dipper model was run with a lexical diversity parameter of $x$, while Dipper-$x$-$y$ indicates a lexical diversity of $x$ and an order diversity of $y$.
IV-E Re-Translation Attack
The re-translation attack involves translating the original watermarked text into a pivot language and then translating it back into the original language. This process preserves the overall meaning, but may disrupt the watermark signal due to intermediate transformations applied by a translation model, as illustrated in Fig. 4.
<details>
<summary>figures/watermark_dilution_through_translation.jpg Details</summary>

Diagram of watermark dilution through round-trip translation. A watermarked English response produced by the LLM ("ZeeMee believe it doesn't tell the whole story. This Redwood City, California-based company has created a platform that lets students bring their stories to life") is translated into Chinese by a translation system and then translated back into English, yielding a paraphrased response ("ZeeMee believes it doesn't tell the full story. The Redwood City, California-based company has created a platform that enables students to bring their own stories to life."). A gradient bar beneath the pipeline shows the watermark strength decaying from strong (original watermarked output) to weak (re-translated output).
</details>
Figure 4: Illustration of watermark dilution through translation
For this experiment, we used nllb-200-distilled-600M (https://huggingface.co/facebook/nllb-200-distilled-600M), a distilled 600M-parameter variant of NLLB-200 [28], a multilingual machine translation model that supports direct translation between 200 languages and is designed for research purposes. Several pivot languages were selected, including French, Italian, Chinese, and Japanese. Since the original dataset consists only of English prompts and human-written English completions, the watermarked outputs were first translated into the pivot language and then re-translated into English to maintain consistency with the original prompt language.
The ROC curves for re-translation attacks using different pivot languages are presented in Fig. 5. The results indicate that the choice of pivot language significantly influences the effectiveness of the attack. French and Italian, both Romance languages, share substantial linguistic similarities with English, which has been heavily influenced by Latin. As a result, the round-trip re-translated texts maintain relatively high AUC scores. In contrast, Chinese differs far more substantially from English, leading to the lowest AUC observed after re-translation. Surprisingly, Japanese produces the highest AUC among all tested pivot languages, even slightly surpassing Italian. This outcome may be attributed to the specific design of English-to-Japanese translation systems. Given the syntactic differences between Japanese and English (such as SOV versus SVO word order), many modern translation tools adopt a linear translation strategy when translating from English to Japanese [29, 30]. This approach attempts to preserve the original sentence structure as much as possible to enhance translation quality. Consequently, round-trip translation through Japanese tends to retain more of the original semantics and structure, making the re-translation attack less effective. Compared to the baseline performance of SynthID-Text without attack, the F1 score under the Chinese re-translation attack drops sharply from 1.00 to 0.711, while Japanese retains the highest F1 score at 0.819, as shown in Table IV.
<details>
<summary>Retranslation_attacks_for_SynthID-Text.png Details</summary>

ROC curves for re-translation attacks on SynthID-Text with different pivot languages, compared against the unattacked baseline: TransAtt-French (AUC = 0.7777), TransAtt-Italian (AUC = 0.8243), TransAtt-Chinese (AUC = 0.7151), TransAtt-Japanese (AUC = 0.8311), and SynthID without attack (AUC = 1.0000). All attacked curves lie above the random-guess diagonal; Chinese degrades detection the most and Japanese the least.
</details>
Figure 5: ROC curves of re-translation attacks on SynthID
TABLE IV: Watermark detection accuracy under re-translation attacks using different pivot languages
| Attack | TPR | FPR | F1 |
| --- | --- | --- | --- |
| No attack | 1.0 | 0.0 | 1.0 |
| Re-trans-French | 0.675 | 0.155 | 0.738 |
| Re-trans-Italian | 0.76 | 0.11 | 0.813 |
| Re-trans-Chinese | 0.675 | 0.225 | 0.711 |
| Re-trans-Japanese | 0.715 | 0.03 | 0.819 |
IV-F Summary
Table V summarizes the watermark detection performance of SynthID-Text under various attack scenarios. For the re-translation attack, we present the result for Chinese as it is one of the three most widely spoken languages in the world.
Without any attack, the algorithm achieves a perfect F1 score of 1.0 and a false positive rate (FPR) of 0.0, demonstrating excellent baseline performance in detecting watermarked text. Under synonym substitution attacks, the F1 score decreases to 0.884, slightly below 0.9, indicating a moderate level of resilience to lexical variation.
For the copy-and-paste attack with a length ratio of 10, the F1 score decreases more substantially to 0.788, while the FPR rises sharply to 0.53. This suggests that simply appending large segments of natural (unwatermarked) text can significantly weaken watermark detectability, even if the original watermarked content remains unchanged. The paraphrasing attack, particularly when it involves both high lexical diversity (lex_diversity = 10) and syntactic reordering (order_diversity = 5), also leads to a notable decrease in robustness: the FPR increases to 0.23, and the F1 score falls to 0.842.
The most severe degradation occurs under the re-translation attack. Translating the watermarked text into Chinese and subsequently back into English results in a significant decline in detection performance: the F1 score falls to 0.711, and the TPR declines to 0.675, only slightly better than random guessing. This highlights the substantial vulnerability of SynthID-Text to semantic-preserving transformations.
These findings suggest that while SynthID-Text remains robust against simple lexical substitutions, it is significantly less effective under complex semantic-preserving attacks such as paraphrasing and round-trip translation, which pose the greatest challenges for reliable watermark detection.
TABLE V: Watermark detection accuracy of SynthID-Text under various attacks
| Attack | TPR | FPR | F1 |
| --- | --- | --- | --- |
| No attack | 1.0 | 0.0 | 1.0 |
| Substitution ( $\epsilon=0.7$ ) | 0.82 | 0.035 | 0.884 |
| Copy-and-Paste (ratio = 10) | 0.995 | 0.53 | 0.788 |
| Paraphrasing (lex_diversity = 10, order_diversity = 5) | 0.895 | 0.23 | 0.842 |
| Re-Translation (Chinese) | 0.675 | 0.225 | 0.711 |
V SynGuard: An Enhanced SynthID-Text Watermarking
Since SynthID-Text embeds watermarks during the text generation process, regenerating or modifying the generated text with another translation or language model may disrupt the original watermarking signals; as a result, the watermark information is prone to being destroyed. This vulnerability is especially apparent in the detection performance under back-translation attacks. The results can be found in Section VI.
In this section, we introduce a novel watermarking method, SynGuard, which combines the Semantic Invariant Robust (SIR) watermarking algorithm [6] with the SynthID-Text tournament sampling mechanism [3].
V-A Watermark Embedding
Watermarking algorithms embed watermarks by modifying logits during the token generation process. SynthID-Text achieves this by using the hash values of preceding tokens along with a secret key $k$ to generate pseudorandom numbers. These numbers are then used to guide the token sampling process. This design, based on pseudorandom functions and a fixed key, makes the watermark difficult to remove unless the attacker has access to both the key and the random seed.
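Conceptually, the hash-and-key mechanism can be sketched in a few lines. The following is a toy construction using SHA-256 over a four-token context window; the function name `g_value`, the window size, and the `layer` argument are all illustrative assumptions, not SynthID-Text's actual pseudorandom function:

```python
import hashlib

def g_value(prev_tokens, candidate, key, layer=0):
    """Toy pseudorandom g-value in {0, 1}: hash a sliding window of the
    preceding token ids together with the secret key, a layer index, and
    the candidate token id. Deterministic given the same inputs, but
    unpredictable to anyone who does not hold the key."""
    msg = f"{key}|{layer}|{prev_tokens[-4:]}|{candidate}".encode()
    return hashlib.sha256(msg).digest()[0] & 1  # one pseudorandom bit

# Same context and key always yield the same bit; changing the key
# re-randomizes the preferences, which is what makes removal hard.
bit = g_value([17, 3, 99, 4, 8], candidate=42, key="secret")
```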
However, if the entire text is regenerated by another language model, such as in the back-translation scenario, the watermark signal can be severely degraded. This vulnerability stems from the fact that SynthID-Text does not incorporate semantic understanding into its watermarking process. By contrast, the SIR algorithm [6] embeds watermark signals by mapping semantic features of preceding tokens to specific token preferences. This semantic-aware approach has demonstrated resilience to meaning-preserving transformations.
To enhance robustness against semantic perturbations, we propose a hybrid approach that integrates SynthID-Text with SIR. This new method, called SynGuard, generates three separate sets of logits at different stages and combines them to form the final logits vector. This vector is then passed through a softmax function to obtain a probability distribution over the vocabulary $V$ . The three component logits are:
- Base LLM logits: Generated directly from the backbone LLM, representing the standard token probabilities.
- SIR logits: Derived from a semantic watermarking model conditioned on the preceding text, encoding semantic consistency.
- SynthID logits: Computed using the pseudorandom watermarking mechanism based on hash values of tokens, a random seed and a secret key.
The overall embedding process is illustrated in Fig. 6, and the detailed procedure is described in Algorithm 1.
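The three-way combination can be sketched in plain Python. This is a toy illustration with a vocabulary of three tokens and made-up numbers; the additive mixing with a $\delta$-weighted SIR term is our assumption about how the components are combined:

```python
import math

def synguard_logits(llm_logits, sir_logits, synthid_logits, delta=0.5):
    """Sum the three per-token logit components (SIR term weighted by the
    semantic weight delta) and softmax over the vocabulary."""
    final = [p + delta * w + g
             for p, w, g in zip(llm_logits, sir_logits, synthid_logits)]
    m = max(final)                           # stabilize the exponentials
    exps = [math.exp(v - m) for v in final]
    z = sum(exps)
    return [e / z for e in exps]             # probability distribution

# Toy vocabulary of three tokens with illustrative component logits.
probs = synguard_logits([2.0, 1.0, 0.5], [0.2, 0.6, 0.2], [1.0, 0.0, 1.0])
```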
<details>
<summary>figures/new_algorithm_structure.jpg Details</summary>

Diagram of the SynGuard embedding pipeline. The prompt and preceding text are fed in parallel to the generative LLM (producing the base LLM logits, the dominant component), to an embedding LLM whose output conditions the SIR watermark model (producing the SIR logits), and to the SynthID component (producing sparse SynthID logits). The three logit vectors are summed into the final logits used for token sampling.
</details>
Figure 6: SynGuard watermark embedding.
Algorithm 1 Watermark Embedding of SynGuard
Require: language model $M$, prompt $x^{\text{prompt}}$, text $t=[t_{0},\dots,t_{T-1}]$, embedding model $E$, watermark model $W$, semantic weight $\delta$, tournament sampler $G$, key $k$, candidate token $x$
1: Generate logits from $M$: $P_{M}(x^{\text{prompt}},t_{:T-1})$;
2: Generate the embedding $E_{:T-1}$;
3: Get the SIR watermark logits $P_{W}(E_{:T-1})$;
4: Get the SynthID-Text watermark logits $P_{G}(x^{\text{prompt}},k,x)$;
5: Compute the final watermarked logits:
$$
P_{\hat{M}}(x^{\text{prompt}},t_{:T-1})=P_{M}(x^{\text{prompt}},t_{:T-1})+\delta\cdot P_{W}(E_{:T-1})+P_{G}(x^{\text{prompt}},k,x);
$$
6: Return $P_{\hat{M}}(t_{T})$.
V-B Watermark Extraction
SynGuard determines whether a given text is watermarked by evaluating both the semantic similarity to the preceding context and the statistical watermark signal encoded as $g$ -values. Intuitively, the more semantically aligned a token is with its context, and the higher its corresponding $g$ -value, the more probable it is that the text was generated by a watermarking algorithm.
Watermark Strength. The probability that a text contains a watermark is quantified by a composite score $s$; a higher $s$ indicates a higher probability that the text is watermarked. Given a text $t=[t_{0},t_{1},\dots,t_{T}]$, we compute two components:
- Semantic similarity score: Let $P_{W}(x_{i},t_{:i-1})$ denote the semantic similarity between token $x_{i}$ and the preceding text, computed using a pretrained semantic watermark model $W$. The normalized semantic score is:
$$
s_{\text{semantic}}=\frac{1}{T}\sum_{i=0}^{T}\left(P_{W}(x_{i},t_{:i-1})-0\right).
$$
- G-value score: Let $g_{l}$ denote the output of the $l$-th SynthID-Text watermarking function. The average $g$-value score is:
$$
s_{\text{g-value}}=\frac{1}{T\cdot m}\sum_{i=0}^{T}\sum_{l=0}^{m}g_{l}(x_{i},t_{:i-1}).
$$
Since $s_{\text{semantic}}\in[-1,1]$ and $s_{\text{g-value}}\in[0,1]$, we normalize $s_{\text{semantic}}$ to the same range by applying a linear transformation. The final score $s$ is computed as:
$$
s=\delta\cdot\frac{s_{\text{semantic}}+1}{2}+(1-\delta)\cdot s_{\text{g-value}}. \tag{2}
$$
Here, $\delta\in[0,1]$ is a hyperparameter that controls the relative weighting between the semantic similarity signal and the token-level watermark signal. A larger $\delta$ places more emphasis on semantic alignment, while a smaller $\delta$ favors the token-sampling randomness.
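Equation (2) is straightforward to implement. The sketch below (the helper name `detection_score` and the per-token scores are illustrative) mirrors the normalization and mixing described above:

```python
def detection_score(semantic_scores, g_values, delta=0.6):
    """Composite score s of Eq. (2): rescale the mean per-token semantic
    similarity from [-1, 1] to [0, 1], then mix it with the mean g-value
    using the semantic weight delta."""
    s_sem = sum(semantic_scores) / len(semantic_scores)   # in [-1, 1]
    s_g = sum(g_values) / len(g_values)                   # in [0, 1]
    return delta * (s_sem + 1) / 2 + (1 - delta) * s_g

# Watermarked-looking text: high semantic alignment, biased g-values.
s = detection_score([0.8] * 10, [0.9] * 10, delta=0.6)
```

A text is then flagged as watermarked when $s$ exceeds a chosen detection threshold.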
V-C Robustness Analysis
To evaluate the robustness of SynGuard, we consider adversaries who attempt to remove or forge the watermark while preserving the underlying semantics. Our hybrid approach combines semantic-awareness from SIR and pseudorandom unpredictability from SynthID, offering both attack robustness and key-based security guarantees.
**Theorem 1**
*Let $t=[t_{0},t_{1},...,t_{T}]$ be a watermarked text and $t^{\prime}$ be a meaning-preserving transformation of $t$ . Then, with high probability, the watermark detection score $s(t^{\prime})$ remains above detection threshold $\tau$ , i.e., the watermark is still detectable.*
*Proof:*
The detection score $s$ is a weighted sum of two components: a semantic alignment score $s_{\text{semantic}}$ and a pseudorandom signature score $s_{\text{g-value}}$ . Because $t^{\prime}$ preserves the meaning of $t$ , the contextual embeddings of $t^{\prime}$ remain close to those of $t$ . Let $E(t_{:i})$ denote the semantic embedding of the prefix up to token $t_{i}$ . Since $t^{\prime}$ has nearly the same context at each position in a semantic sense, we have $\|E(t_{:i})-E(t^{\prime}_{:i})\|$ small for all $i$ . The semantic watermark model $W$ is assumed to be Lipschitz continuous [6]:
$$
|P_{W}(E(t_{:i}))-P_{W}(E(t^{\prime}_{:i}))|\leq L\cdot\|E(t_{:i})-E(t^{\prime}_{:i})\|,
$$
where $L>0$ denotes the Lipschitz constant. In other words, the watermark bias for the next token does not drastically change under a semantically invariant perturbation. Consequently, for each token position $i$ , the semantic preference $P_{W}(x_{i},t_{:i-1})$ assigned by $W$ to the actual token $x_{i}$ in $t^{\prime}$ will be close to the value it was for $t$ . If $t$ was watermarked, most tokens had high semantic preference values (the watermark favored those choices); $t^{\prime}$ , using synonymous or rephrased tokens, will on average still yield high $P_{W}$ values for each token, since the tokens remain well-aligned with a similar context. Thus, for each token $x^{\prime}_{i}$ in $t^{\prime}$ , we get
$$
s^{\prime}_{\text{semantic}}=\frac{1}{T}\sum_{i=1}^{T}P_{W}(x^{\prime}_{i},t^{\prime}_{:i-1})\approx s_{\text{semantic}}-\varepsilon,
$$
for some small $\varepsilon$. The SynthID-Text component uses a secret key $k$ to generate pseudorandom preferences. The transformation is applied without knowledge of $k$, so each edited token contributes a $g$-value that is effectively uniform: in the worst case $s^{\prime}_{\text{g-value}}$ degrades toward $0.5$, its mean for unwatermarked text, whereas tokens in the original watermarked $t$ were biased toward higher $g$-values. Hence, under a semantic-preserving transformation the g-value component may fall toward $0.5$, but $s^{\prime}_{\text{semantic}}$ remains high. The overall score $s(t^{\prime})=\delta·\frac{s^{\prime}_{\text{semantic}}+1}{2}+(1-\delta)· s^{\prime}_{\text{g-value}}$ therefore remains above the threshold $\tau$ provided $\delta$ is sufficiently large. In conclusion, the watermark remains detectable in $t^{\prime}$. ∎
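To make the threshold condition concrete, consider purely illustrative values (not measurements): $\delta=0.7$, a post-attack semantic score $s^{\prime}_{\text{semantic}}=0.6$, and a fully degraded g-value component $s^{\prime}_{\text{g-value}}=0.5$. Then
$$
s(t^{\prime})=0.7\cdot\frac{0.6+1}{2}+0.3\cdot 0.5=0.56+0.15=0.71,
$$
which still clears a detection threshold of $\tau=0.7$. With $\delta=0.3$ the same inputs give $s(t^{\prime})=0.59<\tau$, illustrating why a sufficiently large $\delta$ matters.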
**Theorem 2**
*Let $k$ be the watermark key for SynGuard. For any text $u$ not generated by the watermarking algorithm, the probability that $s(u)>\tau$ is exponentially small in $T$ .*
*Proof.*
This security property stems from the pseudorandom behavior of the SynthID-Text component, which introduces a hidden bias into token selection based on a watermark key $k$. The watermarking model adds a preference signal $g_{k}(x_{i},t_{:i-1})∈[0,1]$ for candidate tokens and combines it with the semantic alignment score $P_{W}$. The detector computes the combined score:
$$
s=\frac{\delta}{T}\sum_{i=1}^{T}\frac{P_{W}(x_{i},t_{:i-1})+1}{2}+\frac{1-\delta}{T}\sum_{i=1}^{T}g_{k}(x_{i},t_{:i-1}).
$$
Now consider an attacker attempting to generate a fake watermarked text without access to $k$:
- Since $g_{k}$ is keyed and pseudorandom, its outputs are statistically independent of the attacker’s choices.
- Therefore, the second term in $s$ , the SynthID component, behaves like uniform noise with expected value $≈ 0.5$ and variance $O(1/T)$ .
- The first term (semantic preference) is not optimized in the attacker’s text either, since only the original watermarker uses $P_{W}$ for guidance.
- Hence, the attacker’s overall score satisfies $s_{\text{fake}}≈ 0.5$, with deviations bounded by concentration inequalities.

Formally, let $Y_{i}=\frac{P_{W}(x_{i},t_{:i-1})+1}{2}$ and $Z_{i}=g_{k}(x_{i},t_{:i-1})$, both taking values in $[0,1]$, and define $X_{i}:=\delta Y_{i}+(1-\delta)Z_{i}$, so that $X_{i}∈[0,1]$. Since $g_{k}$ is pseudorandom and outside the attacker’s control, and $P_{W}$ is optimized only during watermark generation, both terms have expected value approximately $0.5$ over attacker-generated text, so $\mathbb{E}[X_{i}]=0.5$. Assuming $X_{1},...,X_{T}$ are i.i.d., Hoeffding’s inequality gives:
$$
\Pr(s>\tau)=\Pr\left(\frac{1}{T}\sum_{i=1}^{T}X_{i}>\tau\right)\leq e^{-2T(\tau-0.5)^{2}}.
$$

This shows that for any non-watermarked text $u$, the probability of it being misclassified as watermarked (i.e., $s(u)>\tau$) decays exponentially with length $T$. Meanwhile, a genuine watermarked text has both components biased upward (semantic tokens aligned and token scores chosen with positive $g_{k}$ bias), yielding $s_{\text{true}}>\tau$, where $\tau∈(0.6,0.9)$ is the detection threshold. Therefore, false positives (attacker’s text exceeding the threshold) are exponentially rare as $T$ increases. Likewise, removal attempts (via editing tokens) cannot reduce the score below the threshold unless the semantic meaning is also damaged. ∎
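The exponential decay predicted by Hoeffding's inequality can be checked with a small Monte Carlo sketch. This is a toy model in which each per-token score $X_{i}$ is uniform on $[0,1]$ (i.e., no key); function names are illustrative:

```python
import math
import random

def hoeffding_bound(T: int, tau: float) -> float:
    """Hoeffding upper bound on Pr(mean of T i.i.d. [0,1] variables
    with mean 0.5 exceeds tau)."""
    return math.exp(-2.0 * T * (tau - 0.5) ** 2)

def empirical_false_positive_rate(T: int, tau: float,
                                  trials: int = 1000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same event for uniform per-token scores."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(T)) / T
        if mean > tau:
            hits += 1
    return hits / trials
```

At $T=200$ and $\tau=0.7$ the bound evaluates to $e^{-16}\approx 1.1\times 10^{-7}$, and the simulation records no false positives.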
VI Experimental Evaluation
This section presents the experimental settings, evaluation metrics, and results of SynGuard compared to the baselines.
VI-A Experimental Setup
Backbone Model and Dataset. All experiments were conducted using Sheared-LLaMA-1.3B [21], a model pruned and further pre-trained from meta-llama/Llama-2-7b-hf (https://huggingface.co/meta-llama/Llama-2-7b-hf), together with Meta’s OPT-1.3B (https://huggingface.co/facebook/opt-1.3b). Both models are publicly available on HuggingFace. For the dataset, we adopt the Colossal Clean Crawled Corpus (C4) [22], which contains diverse, high-quality web text. Each C4 sample is split into two segments: the first serves as the prompt for generation, while the second (human-written) segment is used as reference text. The quality of the generated text is assessed using Perplexity (PPL), which reflects how fluent and natural the output is. The unaltered human texts serve as control data for measuring the watermark detector’s false positive rate.
Evaluation Metrics. Robustness is evaluated using the following metrics: True Positive Rate (TPR), False Positive Rate (FPR), F1 score, and ROC-AUC. Each experiment uses 200 watermarked and 200 unwatermarked samples, each with a fixed length of $T=200$ tokens, matching the default settings of [23, 5]. All experiments were implemented using the MarkLLM toolkit [23].
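Given raw detection scores, TPR, FPR, and F1 at a fixed threshold can be computed as follows (a minimal sketch, not the MarkLLM implementation):

```python
def detection_metrics(wm_scores, clean_scores, tau):
    """TPR, FPR, and F1 for a detector that flags score > tau as watermarked.

    wm_scores: detection scores of truly watermarked samples.
    clean_scores: detection scores of unwatermarked (human) samples.
    """
    tp = sum(s > tau for s in wm_scores)
    fp = sum(s > tau for s in clean_scores)
    tpr = tp / len(wm_scores)          # recall of the watermarked class
    fpr = fp / len(clean_scores)
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * tpr / (precision + tpr)
          if precision + tpr else 0.0)
    return tpr, fpr, f1
```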
VI-B Main Results
This section uses the F1 score to demonstrate the detection accuracy of SynGuard, and compares it to the baseline methods, SIR and SynthID-Text. The naturalness of the output texts generated by these three algorithms is also evaluated to assess their textual quality.
Detection Accuracy and ROC Curves. Fig. 7 (a) shows that all three algorithms achieve high detection performance, with AUC values above 0.99. From the zoomed-in view in Fig. 7 (b), SynthID-Text achieves the highest AUC of 1.0000. SIR yields the lowest AUC at 0.9971, a noticeable gap from SynthID-Text. SynGuard’s AUC is lower than SynthID-Text’s by only 0.0001, and higher than SIR’s.
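The AUC values in Fig. 7 can be reproduced from raw scores via the rank-based (Mann-Whitney) formulation of ROC-AUC; the sketch below is illustrative rather than the toolkit's implementation:

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the probability that a random watermarked sample scores
    above a random unwatermarked one, counting ties as half a win
    (equivalent to the normalized Mann-Whitney U statistic)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```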
[Figure: figures/roc_curves_of_SynthID_SIR_SIR-SynthID.png — ROC curves (TPR vs. FPR) for SynthID (AUC=1.0000), SIR (AUC=0.9971), and SynGuard (AUC=0.9999); all three lie far above the random-guess diagonal.]
(a) ROC curves of three algorithms.
[Figure: figures/zoom-in_rocs_for_3_algorithms.png — Zoomed-in ROC curves (FPR < 0.1): SynthID holds TPR=1.00 throughout, SynGuard reaches TPR=1.00 by FPR ≈ 0.002, and SIR climbs stepwise from TPR ≈ 0.96.]
(b) Zoom-in ROC curves for the three algorithms.
Figure 7: Comparison and zoomed-in view of ROC curves for three watermarking algorithms: SynthID-Text, SIR, and SynGuard.
TABLE VI: Detection accuracy of SynthID-Text, SIR, and SynGuard.
| Algorithm | TPR | FPR | F1 with best threshold | Running Time(s/it) |
| --- | --- | --- | --- | --- |
| SynthID-Text | 1.0 | 0.0 | 1.0 | 6.09 |
| SIR | 0.98 | 0.015 | 0.9825 | 12.50 |
| SynGuard | 0.995 | 0.0 | 0.9975 | 12.93 |
Text Quality. PPL quantifies a language model’s predictive confidence in text: lower values indicate stronger alignment with the model’s training distribution, though not absolute quality. Fig. 8 reveals nuanced watermarking impacts. SynthID-Text’s watermarked outputs exhibit lower PPL than their unwatermarked counterparts, suggesting its watermarking favors semantically compatible tokens that align with the model’s learned patterns. In contrast, SIR’s watermarked texts show elevated PPL and a broader distribution, indicative of disruptive interventions (e.g., forced token substitutions) that break local coherence and amplify predictive uncertainty. Our proposed SynGuard achieves lower PPL for watermarked texts than SIR, coupled with a compact distribution and minimal outliers. This arises from its hybrid design: SynthID-Text’s fluency-preserving sampling keeps outputs aligned with the model, while stabilization mechanisms curb output variability. Critically, PPL reflects model familiarity rather than intrinsic quality (e.g., logic or novelty), so these results speak to watermarking’s influence on textual conformity to pre-trained distributions.
Time Overhead. Table VI reports TPR, FPR, F1 score, and running time for each method. The proposed SynGuard algorithm achieves an F1 score of 0.9975, just 0.25% below the maximum value of 1. Time-overhead measurements were obtained on a T4 GPU with 15.0 GB of memory on Google Colab. While substantially improving robustness and text quality over SIR, SynGuard does not add significant time overhead: its per-iteration cost (12.93 s/it) is comparable to SIR’s (12.50 s/it).
[Figure: figures/text_quality-PPL.png — Box plots of PPL for watermarked vs. unwatermarked text under SynthID, SIR, and SynGuard. SynthID and SynGuard watermarked outputs show lower, tighter PPL distributions than their unwatermarked counterparts; SIR watermarked outputs show slightly higher PPL.]
Figure 8: Text Quality Comparison Using PPL.
VI-C Robustness Evaluation under Attacks
VI-C1 Synonym Substitution
For the synonym substitution attack, we evaluated performance under varying substitution ratios: $[0,0.3,0.5,0.7]$ . The resulting ROC curves are shown in Fig. 9. Even with a substitution ratio of 0.7, the AUC decreased by only 1.23% and remained above 0.98. As shown in Table VII, the FPR values remained low across all ratios, and the F1 scores consistently exceeded 0.95. These results highlight the strong robustness of SynGuard against synonym substitution attacks.
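A substitution attack of this kind can be emulated schematically as follows; this is a toy sketch in which the synonym table and token-level granularity stand in for the context-aware substitution actually used:

```python
import random

def synonym_substitution(tokens, ratio, synonyms, seed=0):
    """Replace roughly `ratio` of the tokens using a synonym lookup.

    `synonyms` maps a token to candidate replacements; tokens without an
    entry are left untouched, so the realized substitution rate may be
    lower than `ratio` when few tokens have synonyms.
    """
    rng = random.Random(seed)
    out = list(tokens)
    candidates = [i for i, t in enumerate(tokens) if t in synonyms]
    rng.shuffle(candidates)
    for i in candidates[: round(ratio * len(tokens))]:
        out[i] = rng.choice(synonyms[out[i]])
    return out
```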
[Figure: figures/SIR-SynthID-synonym_substitution-roc.png — ROC curves for SynGuard under synonym substitution: no attack (AUC=1.0000), Word-S(Context)-0.3 (AUC=0.9986), Word-S(Context)-0.5 (AUC=0.9942), Word-S(Context)-0.7 (AUC=0.9877).]
(a) ROC curves
[Figure: figures/new_method_word_sub_zoomin.png — Zoomed-in ROC curves (log-scaled FPR): the no-attack curve stays at TPR=1.00, and TPR at low FPR decreases as the substitution ratio rises from 0.3 to 0.7.]
(b) Zoomed-in views
Figure 9: ROC curves of SynGuard under synonym substitution attacks.
TABLE VII: Watermark detection accuracy of SynthID-Text and SynGuard under different synonym substitution attacks
| Attack | SynthID-Text TPR | SynthID-Text FPR | SynthID-Text F1 | SynGuard TPR | SynGuard FPR | SynGuard F1 |
| --- | --- | --- | --- | --- | --- | --- |
| No attack | 1.00 | 0.00 | 1.000 | 1.00 | 0.00 | 1.000 |
| Word-S(Context)-0.3 | 0.98 | 0.005 | 0.987 | 0.98 | 0.01 | 0.985 |
| Word-S(Context)-0.5 | 0.91 | 0.035 | 0.936 | 0.97 | 0.01 | 0.977 |
| Word-S(Context)-0.7 | 0.82 | 0.035 | 0.884 | 0.96 | 0.03 | 0.965 |
VI-C2 Copy-and-Paste
For the copy-and-paste attack, the key parameter is the ratio between the length of the natural (or unwatermarked) text into which the watermarked content is pasted and the length of the original watermarked segment. In this experiment, the watermarked content has a fixed length of $T=200$ . We tested three different length ratios: [5, 10, 15], and the results are presented in Fig. 10.
Compared to synonym substitution, the impact of increasing the length ratio is more pronounced. When the copy-and-paste ratio reaches 10, the AUC already falls below 0.9. The detailed FPRs and F1 scores are listed in Table VIII. Increasing the length ratio from 5 to 10 results in only a slight F1 score decrease of approximately 0.56%. However, further increasing the ratio from 10 to 15 leads to a more substantial reduction of approximately 5%, with the F1 score decreasing to 0.848.
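The attack construction can be sketched at the token level: a watermarked span of length $T$ is spliced into unwatermarked host text whose length is the ratio times $T$ (names are illustrative):

```python
import random

def copy_paste_attack(wm_tokens, host_tokens, ratio, seed=0):
    """Splice a watermarked token span into unwatermarked host text.

    The host is truncated to ratio * len(wm_tokens) tokens (the length
    ratio used in the experiments), and the watermarked span is inserted
    at a random position, so only about 1/(ratio + 1) of the output
    carries the watermark signal.
    """
    rng = random.Random(seed)
    host = host_tokens[: ratio * len(wm_tokens)]
    cut = rng.randrange(len(host) + 1)
    return host[:cut] + wm_tokens + host[cut:]
```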
[Figure: figures/copy-paste_attack_roc.png — ROC curves for SynthID-Text under copy-and-paste attacks: no attack (AUC=1.0000), Copy-Paste-5 (AUC=0.8559), Copy-Paste-10 (AUC=0.6679), Copy-Paste-15 (AUC=0.5883).]
(a) SynthID-Text
[Figure: figures/copy-and-paste_attack_curves.png — ROC curves for SynGuard under copy-and-paste attacks: no attack (AUC=1.0000), Copy-Paste-5 (AUC=0.9299), Copy-Paste-10 (AUC=0.8670), Copy-Paste-15 (AUC=0.7803).]
(b) SynGuard
Figure 10: ROC curves under different copy-and-paste attack ratios for SynthID-Text and SynGuard.
TABLE VIII: Watermark detection accuracy under varying copy-and-paste attack settings
| Attack | SynthID-Text TPR | FPR | F1 | SynGuard TPR | FPR | F1 |
| --- | --- | --- | --- | --- | --- | --- |
| No attack | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 |
| Copy-Paste-5 | 0.985 | 0.245 | 0.883 | 0.95 | 0.17 | 0.896 |
| Copy-Paste-10 | 1.0 | 0.435 | 0.821 | 0.985 | 0.225 | 0.891 |
| Copy-Paste-15 | 0.99 | 0.485 | 0.800 | 0.99 | 0.345 | 0.848 |
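When the positive (watermarked) and negative (unwatermarked) test sets are the same size, precision reduces to TPR/(TPR+FPR), so the F1 entries in these tables follow directly from the reported TPR and FPR. A small sanity-check sketch, assuming balanced classes (the helper name is ours):

```python
def f1_from_rates(tpr: float, fpr: float) -> float:
    """F1 score from TPR/FPR, assuming equally sized positive/negative sets."""
    if tpr == 0.0:
        return 0.0
    precision = tpr / (tpr + fpr)  # equal class sizes cancel out of the ratio
    recall = tpr
    return 2 * precision * recall / (precision + recall)

# Reproduce the SynthID-Text F1 column of Table VIII:
for attack, tpr, fpr in [("No attack", 1.0, 0.0),
                         ("Copy-Paste-5", 0.985, 0.245),
                         ("Copy-Paste-10", 1.0, 0.435),
                         ("Copy-Paste-15", 0.99, 0.485)]:
    print(f"{attack}: F1 = {f1_from_rates(tpr, fpr):.3f}")
```

Running this reproduces the SynthID-Text F1 column of Table VIII to three decimals.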
VI-C3 Paraphrasing
We used the T5 model (https://huggingface.co/google/t5-v1_1-xxl) for tokenization and the Dipper model (https://huggingface.co/kalpeshk2011/dipper-paraphraser-xxl) to perform paraphrasing. Dipper's key parameters are lex_diversity and order_diversity, which control the lexical variation and the reordering of sentences or phrases in the generated text, respectively.
In this paraphrasing attack experiment, we explored combinations of lex_diversity values of 5 and 10 with order_diversity values of 0 and 5. The results are shown in Fig. 11. Increasing either parameter lowers detection accuracy. Even under the most aggressive setting (lex_diversity = 10, order_diversity = 5), however, SynGuard still achieves an AUC above 0.95 and an F1 score exceeding 0.92, as reported in Table IX.
<details>
<summary>figures/rocs_for_paraphrasing_synthID.png Details</summary>

### Visual Description
ROC curves (TPR vs. FPR) for SynthID-Text under paraphrasing attacks. Legend AUCs: No Attack 1.0000 (blue), Dipper-5 0.9468 (green), Dipper-10 0.9589 (orange), Dipper-10-5 0.9101 (red), plus the random-guess diagonal (dashed gray). All attacked curves stay well above the random-guess baseline and are clustered relatively close together; Dipper-10-5 degrades detection the most.
</details>
(a) SynthID-Text
<details>
<summary>figures/rocs_for_paraphrasing_sir_synthID.png Details</summary>

### Visual Description
ROC curves (TPR vs. FPR) for SynGuard under paraphrasing attacks. Legend AUCs: No Attack 1.0000 (blue), Dipper-5 0.9818 (green), Dipper-10 0.9799 (orange), Dipper-10-5 0.9692 (red), plus the random-guess diagonal (dashed gray). The Dipper-5 and Dipper-10 curves are nearly indistinguishable; Dipper-10-5 has the lowest AUC, but all curves remain far above the random-guess baseline.
</details>
(b) SynGuard
Figure 11: ROC curves under various paraphrasing attack settings for SynthID-Text and SynGuard.
TABLE IX: Watermark detection accuracy under different paraphrasing attack settings
| Attack | SynthID-Text TPR | FPR | F1 | SynGuard TPR | FPR | F1 |
| --- | --- | --- | --- | --- | --- | --- |
| No attack | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 |
| Dipper-5 | 0.915 | 0.16 | 0.882 | 0.935 | 0.03 | 0.952 |
| Dipper-10 | 0.92 | 0.125 | 0.900 | 0.94 | 0.03 | 0.954 |
| Dipper-10-5 | 0.895 | 0.23 | 0.842 | 0.90 | 0.05 | 0.923 |
Note: Dipper-$x$ denotes a lexical diversity of $x$; Dipper-$x$-$y$ denotes a lexical diversity of $x$ and an order diversity of $y$.
VI-C4 Back-translation
For the back-translation attack, we employed the nllb-200-distilled-600M model (https://huggingface.co/facebook/nllb-200-distilled-600M) and the googletrans Python library to translate the original English watermarked text into different pivot languages and then back into English. The retranslated text was then used for watermark detection. The resulting ROC curves are shown in Fig. 12, and the results for the two translators are shown in Table X. The results indicate that the effectiveness of a back-translation attack depends on the translator's performance for the pivot language rather than on language-specific characteristics alone. NLLB is a multilingual machine translation model in which a single model handles over 200 languages, whereas Google Translate uses dedicated models for different languages. Among the pivot languages, back-translation through Chinese causes the largest accuracy drop and thus the strongest attack, which is broadly consistent with machine-translation quality for that language pair. Translation between German, French, or Italian and English performs better, resulting in a smaller accuracy drop.
Notably, while some studies [11] argue that the effectiveness of back-translation attacks is directly tied to language-specific characteristics, our findings suggest this claim is too narrow. We contend that attack effectiveness is instead governed by the translator's performance on the pivot language: language-specific characteristics set an upper bound on machine-translation quality, and the richness of the training corpus further shapes that bound. Language-specific characteristics are therefore only an indirect factor in back-translation attacks.
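The back-translation pipeline can be sketched generically: the translator backends are injected as callables, so the same harness works with either NLLB or googletrans. A minimal sketch (the function name and the stand-in translators below are ours, not from the paper's code):

```python
from typing import Callable

def back_translate(text: str,
                   to_pivot: Callable[[str], str],
                   to_english: Callable[[str], str]) -> str:
    """Round-trip watermarked text through a pivot language before detection."""
    return to_english(to_pivot(text))

# With googletrans, the two callables could be built roughly like this (untested):
#   tr = googletrans.Translator()
#   to_pivot   = lambda t: tr.translate(t, src="en", dest="zh-cn").text
#   to_english = lambda t: tr.translate(t, src="zh-cn", dest="en").text

# Offline demo with stand-in "translators":
print(back_translate("Watermarked Text", str.upper, str.lower))  # → watermarked text
```

Injecting the translation functions keeps the attack harness independent of any one translation backend, which is how the NLLB and Google Translate columns of Table X can share one evaluation loop.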
TABLE X: Comparison of SynGuard watermark detection accuracy under back-translation attacks with different translation tools
| Attack | NLLB-200-distilled-600M TPR | FPR | F1 | googletrans TPR | FPR | F1 |
| --- | --- | --- | --- | --- | --- | --- |
| No attack | 0.995 | 0.0 | 0.9975 | 0.995 | 0.0 | 0.9975 |
| Back-trans-German | 0.762 | 0.095 | 0.821 | 0.930 | 0.058 | 0.936 |
| Back-trans-French | 0.735 | 0.070 | 0.814 | 0.930 | 0.053 | 0.938 |
| Back-trans-Italian | 0.832 | 0.130 | 0.848 | 0.928 | 0.070 | 0.929 |
| Back-trans-Chinese | 0.680 | 0.07 | 0.777 | 0.920 | 0.058 | 0.930 |
| Back-trans-Japanese | 0.807 | 0.095 | 0.848 | 0.900 | 0.010 | 0.942 |
<details>
<summary>ROC_curves_for_re-translation_attack_on_SIR-SynthID.png Details</summary>

### Visual Description
ROC curves (TPR vs. FPR) for back-translation attacks on SynGuard using NLLB. Legend AUCs: TransAtt-German 0.8266 (blue), TransAtt-French 0.8104 (orange), TransAtt-Italian 0.8578 (green), TransAtt-Chinese 0.7669 (red), TransAtt-Japanese 0.8590 (purple), plus the random-guess diagonal (dashed gray). The Chinese pivot yields the lowest AUC, i.e. the strongest attack; all curves remain above the random-guess baseline.
</details>
(a) NLLB-200-distilled-600M
<details>
<summary>ROC_curves_for_back-translation_using_Google_Translate.png Details</summary>

### Visual Description
ROC curves (TPR vs. FPR) for back-translation attacks on SynGuard using Google Translate. Legend AUCs: English-German 0.9728 (blue), English-French 0.9760 (orange), English-Italian 0.9743 (green), English-Chinese 0.9705 (red), English-Japanese 0.9768 (purple), plus the random-guess diagonal (dashed gray). The curves nearly overlap and all rise sharply at low FPR, indicating comparable detection performance across pivot languages.
</details>
(b) Google Translator
Figure 12: ROC curves for back-translation on SynGuard using different translation tools.
VI-D SynGuard vs. SynthID-Text
Table XI compares the robustness of SynGuard and SynthID-Text under identical attacks. SynGuard achieves higher F1 scores across all evaluated attacks with the same parameters, with comparable performance in the no-attack scenario. Specifically, SynGuard retains F1 $>$ 0.9 under synonym substitution and paraphrasing, and approximately 0.9 under copy-and-paste, while SynthID-Text drops below 0.9 in all three. For back-translation, the most challenging attack, SynGuard again outperforms SynthID-Text: F1 rises from 0.711 to 0.777 and FPR drops from 0.225 to 0.07. Overall, F1 improves by 9.3%-13%. These results confirm that semantic-aware watermarking enhances detection robustness across token-level (synonym substitution), sentence-level (paraphrasing), and context-level (copy-and-paste) attacks.
Taken together, our proposed SynGuard scheme exhibits computational overhead and robustness against text-tampering attacks comparable to those of SIR, while matching the favorable text quality of SynthID-Text, thereby integrating the strengths of both approaches.
TABLE XI: Comparison of watermark detection performance between SynGuard and SynthID-Text under various attacks
| Method | Parameters | SynGuard TPR | FPR | F1 | SynthID-Text TPR | FPR | F1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| No attack | – | 0.995 | 0.0 | **0.9975** | 1.0 | 0.0 | **1.0** |
| Substitution | $\epsilon=0.7$ | 0.96 | 0.03 | **0.965** | 0.82 | 0.035 | 0.884 |
| Copy-and-Paste | ratio=10 | 0.985 | 0.225 | 0.891 | 0.995 | 0.53 | 0.788 |
| Paraphrasing | lex $=10$, order $=5$ | 0.9 | 0.05 | **0.923** | 0.895 | 0.23 | 0.842 |
| Back-Translation | language=Chinese | 0.680 | 0.07 | 0.777 | 0.675 | 0.225 | 0.711 |
Note ∗: Bold F1 scores indicate values above 0.9, reflecting strong detection performance. Blue-highlighted TPR or FPR values are below 0.6, suggesting performance close to random guessing. Red-highlighted F1 scores represent the lowest values observed across all tested attacks.
VI-E Ablation Study
In this subsection, we investigate how the semantic weight $\delta$ affects the performance of the proposed watermarking algorithm. Based on the F1 score and AUC values from this study, we selected an optimal $\delta$ and used it for the robustness evaluations.
Semantic Weight $\delta$. We introduce a semantic blending factor $\delta \in [0,1]$, referred to as semantic_weight, to interpolate between the semantic score $s_{\text{semantic}}$ and the g-value-based score $s_{\text{g-value}}$. A larger $\delta$ emphasizes semantic coherence, while a smaller $\delta$ gives more weight to the g-value randomness statistics.
The ROC curves under different semantic weight settings are shown in Fig. 13. As $\delta$ increases from 0.1 to 0.7, the AUC improves overall. The zoomed-in view in Fig. 13(b) reveals that the ROC curve for $\delta=0.7$ consistently outperforms the others. From Table XII, we observe that both TPR and F1 score increase as $\delta$ grows. Although the FPR for $\delta=0.7$ is not the lowest, it is only 0.005 higher than that of $\delta=0.5$ and identical to the FPR at $\delta=0.3$. Therefore, in Section VI we adopt $\delta=0.7$ as the default semantic weight for the subsequent robustness evaluations.
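Concretely, the blending described above is a linear interpolation of the two detection statistics. A minimal sketch (the function and argument names are ours, not from the released code):

```python
def blended_score(s_semantic: float, s_gvalue: float, delta: float = 0.7) -> float:
    """Linear interpolation of the semantic and g-value statistics; delta in [0, 1]."""
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta (semantic_weight) must lie in [0, 1]")
    return delta * s_semantic + (1.0 - delta) * s_gvalue

# delta = 0 recovers the pure g-value statistic; delta = 1 the pure semantic score.
print(blended_score(0.9, 0.4))             # weighted toward the semantic score
print(blended_score(0.9, 0.4, delta=0.0))  # → 0.4
```

The default of 0.7 here mirrors the setting selected in the ablation; in practice the blended score would then be compared against a detection threshold.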
<details>
<summary>figures/rocs_for_diff_delta.png Details</summary>

### Visual Description
ROC curves (TPR vs. FPR) for SynGuard under different semantic weights. Legend AUCs: $\delta=0.1$ 0.9966 (blue), $\delta=0.3$ 0.9993 (orange), $\delta=0.5$ 0.9977 (green), $\delta=0.7$ 0.9999 (red), plus the random-guess diagonal (dashed gray). All settings are close to perfect; the $\delta=0.7$ curve rises steepest at very low FPR and attains the highest AUC.
</details>
(a) Regular ROC Curves
(b) Zoom-in ROC Curves
Figure 13: ROC curves under different semantic weight settings ( $\delta$ )
TABLE XII: Watermark detection accuracy of SynGuard under varying semantic weights ( $\delta$ )
| Semantic Weight $\delta$ | TPR | FPR | F1 with best threshold |
| --- | --- | --- | --- |
| 0.0 | 1.0 | 0.0 | 1.0 |
| 0.1 | 0.97 | 0.0 | 0.985 |
| 0.3 | 0.99 | 0.01 | 0.990 |
| 0.5 | 0.99 | 0.005 | 0.992 |
| 0.7 | 1.0 | 0.01 | 0.995 |
| 1.0 | 0.98 | 0.015 | 0.983 |
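The AUC values quoted throughout and the "F1 with best threshold" in Table XII can both be computed directly from the two score populations. A minimal, library-free sketch (function names are ours): AUC via the rank statistic (equivalent to the Mann-Whitney U) and F1 via a threshold sweep.

```python
def auc(pos, neg):
    """AUC as the rank statistic P(score_pos > score_neg); ties count as 1/2."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_threshold_f1(pos, neg):
    """Sweep every observed score as a detection threshold; return the best F1."""
    best = 0.0
    for t in sorted(set(pos) | set(neg)):
        tp = sum(p >= t for p in pos)  # watermarked texts correctly flagged
        fp = sum(n >= t for n in neg)  # clean texts falsely flagged
        if tp == 0:
            continue
        precision, recall = tp / (tp + fp), tp / len(pos)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best

# Toy score populations: detector scores for watermarked vs. clean texts.
watermarked, clean = [0.9, 0.4], [0.5, 0.1]
print(auc(watermarked, clean))                          # → 0.75
print(round(best_threshold_f1(watermarked, clean), 3))  # → 0.8
```

For the full evaluation one would typically use sklearn.metrics.roc_auc_score instead; the rank-statistic form above makes the definition explicit.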
VII Conclusions
This paper evaluates SynthID-Text's robustness across diverse attacks. While SynthID-Text resists simple lexical attacks, it is vulnerable to semantic-preserving transformations such as paraphrasing and back-translation, which severely reduce detection accuracy. To address this, we propose SynGuard, a hybrid algorithm integrating semantic sensitivity with SynthID-Text's probabilistic design. Via a semantic blending factor $\delta$, it balances semantic alignment and sampling randomness, improving robustness and attack resistance. Under no-attack conditions, the two methods perform comparably. In terms of text quality, SynGuard's perplexity is slightly higher than SynthID-Text's but remains lower than that of unwatermarked text, indicating consistent fluency. Across all attacks, SynGuard consistently outperforms SynthID-Text, improving F1 scores by 9.2%–13%, even under pivot-language back-translation attacks, where distortion is worst. These results validate incorporating semantic information into watermarking. Overall, SynGuard offers a more resilient watermarking strategy for large language models, particularly against prevalent semantic-preserving watermark-removal attacks.
References
- [1] J. Kirchenbauer, J. Geiping, Y. Wen, J. Katz, I. Miers, and T. Goldstein, “A watermark for large language models,” in International Conference on Machine Learning. PMLR, 2023, pp. 17061–17084.
- [2] E. N. Crothers, N. Japkowicz, and H. L. Viktor, “Machine-generated text: A comprehensive survey of threat models and detection methods,” IEEE Access, vol. 11, pp. 70977–71002, 2023.
- [3] S. Dathathri, A. See, S. Ghaisas, P.-S. Huang, R. McAdam, J. Welbl, V. Bachani, A. Kaskasoli, R. Stanforth, T. Matejovicova, J. Hayes, and N. Vyas, “Scalable watermarking for identifying large language model outputs,” Nature, vol. 634, no. 8035, pp. 818–823, 2024.
- [4] A. Liu, L. Pan, Y. Lu, J. Li, X. Hu, X. Zhang, L. Wen, I. King, H. Xiong, and P. Yu, “A survey of text watermarking in the era of large language models,” ACM Computing Surveys, vol. 57, no. 2, pp. 1–36, 2024.
- [5] Z. Wang, T. Gu, B. Wu, and Y. Yang, “MorphMark: Flexible adaptive watermarking for large language models,” in ACL 2025, pp. 4842–4860.
- [6] A. Liu, L. Pan, X. Hu, S. Meng, and L. Wen, “A semantic invariant robust watermark for large language models,” in ICLR 2024, 2024.
- [7] X. Zhao, P. V. Ananth, L. Li, and Y. Wang, “Provable robust watermarking for ai-generated text,” in ICLR 2024.
- [8] Z. Hu, L. Chen, X. Wu, Y. Wu, H. Zhang, and H. Huang, “Unbiased watermark for large language models,” in ICLR 2024.
- [9] J. Kirchenbauer, J. Geiping, Y. Wen, M. Shu, K. Saifullah, K. Kong, K. Fernando, A. Saha, M. Goldblum, and T. Goldstein, “On the reliability of watermarks for large language models,” in ICLR 2024.
- [10] J. Ren, H. Xu, Y. Liu, Y. Cui, S. Wang, D. Yin, and J. Tang, “A robust semantics-based watermark for large language model against paraphrasing,” in NAACL 2024, pp. 613–625.
- [11] Z. He, B. Zhou, H. Hao, A. Liu, X. Wang, Z. Tu, Z. Zhang, and R. Wang, “Can watermarks survive translation? on the cross-lingual consistency of text watermark for large language models,” in ACL 2024, pp. 4115–4129.
- [12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
- [13] M. Christ, S. Gunn, and O. Zamir, “Undetectable watermarks for language models,” in COLT 2024, vol. 247, 2024, pp. 1125–1139.
- [14] H. Chen, B. D. Rouhani, C. Fu, J. Zhao, and F. Koushanfar, “Deepmarks: A secure fingerprinting framework for digital rights management of deep learning models,” in ICMR 2019, pp. 105–113.
- [15] T. Qiao, Y. Ma, N. Zheng, H. Wu, Y. Chen, M. Xu, and X. Luo, “A novel model watermarking for protecting generative adversarial network,” Computers & Security, vol. 127, p. 103102, 2023.
- [16] J. Zhang, D. Chen, J. Liao, W. Zhang, H. Feng, G. Hua, and N. Yu, “Deep model intellectual property protection via deep watermarking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 8, pp. 4005–4020, 2021.
- [17] B. Darvish Rouhani, H. Chen, and F. Koushanfar, “Deepsigns: An end-to-end watermarking framework for ownership protection of deep neural networks,” in ASPLOS 2019, pp. 485–497.
- [18] P. Neekhara, S. Hussain, X. Zhang, K. Huang, J. McAuley, and F. Koushanfar, “Facesigns: semi-fragile neural watermarks for media authentication and countering deepfakes,” in ACM Transactions on Multimedia Computing, Communications and Applications, 2024.
- [19] X. Zhao, Y. Wang, and L. Li, “Protecting language generation models via invisible watermarking,” in ICML 2023, vol. 202, pp. 42187–42199.
- [20] S. Qiu, Q. Liu, S. Zhou, and W. Huang, “Adversarial attack and defense technologies in natural language processing: A survey,” Neurocomputing, vol. 492, pp. 278–307, 2022.
- [21] M. Xia, T. Gao, Z. Zeng, and D. Chen, “Sheared llama: Accelerating language model pre-training via structured pruning,” in ICLR 2024.
- [22] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” J. Mach. Learn. Res., 2020.
- [23] L. Pan, A. Liu, Z. He, Z. Gao, X. Zhao, Y. Lu, B. Zhou, S. Liu, X. Hu, L. Wen, I. King, and P. S. Yu, “MarkLLM: An open-source toolkit for LLM watermarking,” in EMNLP 2024, pp. 61–71.
- [24] G. A. Miller, “Wordnet: a lexical database for english,” Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
- [25] C. R. Harris, K. J. Millman, S. J. Van Der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith et al., “Array programming with numpy,” Nature, vol. 585, no. 7825, pp. 357–362, 2020.
- [26] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” 2019. [Online]. Available: https://arxiv.org/abs/1810.04805
- [27] K. Krishna, Y. Song, M. Karpinska, J. Wieting, and M. Iyyer, “Paraphrasing evades detectors of ai-generated text, but retrieval is an effective defense,” Advances in Neural Information Processing Systems, vol. 36, pp. 27469–27500, 2023.
- [28] M. R. Costa-Jussà, J. Cross, O. Çelebi, M. Elbayad, K. Heafield, K. Heffernan, E. Kalbassi, J. Lam, D. Licht, J. Maillard et al., “No language left behind: Scaling human-centered machine translation,” arXiv preprint arXiv:2207.04672, 2022.
- [29] T. Mizowaki, H. Ogawa, and M. Yamada, “Syntactic cross and reading effort in english to japanese translation,” in Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Workshop 1: Empirical Translation Process Research), 2022, pp. 49–59.
- [30] Y. Sekizawa, T. Kajiwara, and M. Komachi, “Improving japanese-to-english neural machine translation by paraphrasing the target language,” in Proceedings of the 4th Workshop on Asian Translation (WAT2017), 2017, pp. 64–69.