# Logits are All We Need to Adapt Closed Models
**Authors**: Gaurush Hiranandani, Haolun Wu, Subhojyoti Mukherjee, Sanmi Koyejo
Abstract
Many commercial Large Language Models (LLMs) are often closed-source, limiting developers to prompt tuning for aligning content generation with specific applications. While these models currently do not provide access to token logits, we argue that if such access were available, it would enable more powerful adaptation techniques beyond prompt engineering. In this paper, we propose a token-level probability reweighting framework that, given access to logits and a small amount of task-specific data, can effectively steer black-box LLMs toward application-specific content generation. Our approach views next-token prediction through the lens of supervised classification. We show that aligning black-box LLMs with task-specific data can be formulated as a label noise correction problem, leading to the Plugin model – an autoregressive probability reweighting model that operates solely on logits. We provide theoretical justification for why reweighting logits alone is sufficient for task adaptation. Extensive experiments with multiple datasets, LLMs, and reweighting models demonstrate the effectiveness of our method, advocating for broader access to token logits in closed-source models. We provide our code at this https URL.
Distribution Shift, Black-box Model, Reweighting, Decoding, Large Language Models
1 Introduction
The rise of Large Language Models (LLMs) has revolutionized generative Artificial Intelligence, yet the most capable models are often closed-source or black-box (Achiam et al., 2023; Bai et al., 2022a). These models generate text based on input prompts but keep their internal weights and training data undisclosed, limiting transparency and customization. Despite these constraints, closed-source LLMs are widely adopted across applications ranging from travel itinerary generation to tax advice, with developers largely relying on prompt optimization to achieve domain-specific outputs.
However, this reliance on prompt engineering is insufficient for specialized tasks, e.g., those requiring brand-specific tone or style. Consider a content writer aiming to generate product descriptions that reflect a brand’s unique identity. Black-box LLMs, trained on broad datasets, often fail to meet such nuanced requirements. With access limited to generated tokens, developers resort to zero-shot (Kojima et al., 2022) or few-shot (Song et al., 2023) prompting techniques. However, if model weights were accessible, advanced techniques like Parameter-Efficient Fine-Tuning (PEFT) using LoRA (Hu et al., 2021), QLoRA (Dettmers et al., 2024), prefix tuning (Li & Liang, 2021), or adapters (Hu et al., 2023a) could be employed for fine-tuning. Yet, due to intellectual property concerns and the high costs of development, most commercial LLMs remain closed-source, and even with API-based fine-tuning options, concerns over data privacy discourage developers from sharing proprietary data.
(Figure 1, image: the prefix “This shirt is …” is fed to both the black-box model and the reweighting model inside the Plugin model. The black-box model assigns its highest probability to “black” (0.3, vs. 0.1 for “adidas”), whereas the reweighting model assigns 0.7 to “adidas”; their element-wise product (0.07 for “adidas” vs. 0.03 for “black”) selects “adidas”, so the next inference step continues with “This shirt is adidas …”.)
Figure 1: Inference phase of the Plugin model. The token probabilities are a product of the probabilities from the black-box model and a reweighting model that denotes label transitioning.
In this paper, we propose a middle ground between general-purpose LLM creators and developers seeking application-specific alignment. We argue that providing access to token logits, in addition to generated text, would enable more effective customization for downstream tasks. Viewing next-token prediction as a classification problem, we draw an analogy between LLMs and supervised classification models. Since decoder-only LLMs are trained to predict the next token given preceding tokens, aligning black-box LLMs to domain-specific data can be reframed as a label noise correction problem in supervised classification. In this analogy, the LLM’s broad training data serves as proxy labels, while application-specific data represents true labels. This can be interpreted as a distribution shift scenario. For example, in “label shift” (Lipton et al., 2018), certain tokens may appear more frequently in application-specific data than in the LLM’s original corpus. In “class-dependent or independent label noise” (Patrini et al., 2017), synonymous expressions or stylistic variations in application data may diverge from those seen during model training.
Inspired by the label noise correction method of Patrini et al. (2017), which estimates a transition matrix to correct class-dependent noise, we adapt this idea to black-box LLM alignment. Unlike prior work that modifies the loss and retrains the model, we lack access to the LLM’s training data and cannot retrain the model. Instead, we estimate an autoregressive transition matrix from application-specific data and use it to reweight token probabilities at inference.
This autoregressive extension is novel, as it accounts for dependencies on previously generated tokens when adjusting logits for the next token. By adapting label noise correction techniques to autoregressive language modeling, we present a practical method to align black-box LLMs using only logits—without requiring access to model weights or original training data.
Our contributions are summarized as follows:
1. We formulate the problem of adapting black-box LLMs for application-specific content generation as a loss correction approach, requiring only token logits at each generation step. This bridges label noise correction in supervised classification with autoregressive language modeling (Sections 2 and 3).
2. We propose an autoregressive probability reweighting framework, enabling token-level probability adjustment during inference. The resulting Plugin model dynamically reweights logits to align generation with task-specific data (Section 4).
3. We provide theoretical guarantees, showing that under mild assumptions, the Plugin model consistently aligns probability estimates with the target distribution given sufficient application-specific samples. To our knowledge, this is the first work to establish such consistency in an autoregressive label noise setting (Section 5).
4. We conduct extensive experiments across four language generation datasets and three black-box LLMs. Our results, supported by multiple ablations, demonstrate that the Plugin model outperforms baselines in adapting black-box LLMs for domain-specific content generation (Section 7). Based on our results, we advocate for publishing token logits alongside outputs in closed-source LLMs.
2 Preliminaries
We begin by establishing the notation. The index set is denoted as $[c]=\{1,...,c\}$ for any positive integer $c$ . Vectors are represented in boldface, for example, $\bm{v}$ , while matrices are denoted using uppercase letters, such as $V$ . The coordinates of a vector are indicated with subscripts, for instance, $v_{j}$ . The all-ones vector is denoted by $\mathbf{1}$ , with its size being clear from the context. The $c$ -dimensional simplex is represented as $\Delta^{c-1}⊂[0,1]^{c}$ . Finally, a sequence $(x_{t},x_{t-1},...,x_{1})$ of size $t$ is denoted by $x_{t:1}$ .
We assume access to language data for the target task, while the black-box LLM, trained on broad world knowledge, is treated as having learned from a noisy version of this data. We seek to adapt the black-box model to align with the task-specific distribution. To formalize this, we extend the label-noise framework from supervised classification (Patrini et al., 2017) to decoder-only language modeling.
Decoder-only models are trained using a next-token prediction objective. At each step, this setup resembles a supervised classification problem with $|V|$ classes, where $V$ is the vocabulary of tokens. Formally, the label space at step $t$ is ${\cal X}_{t}=\{\bm{e}^{i}:i∈[|V|]\}$ , where $\bm{e}^{i}$ denotes the $i$ -th standard canonical vector in $\mathbb{R}^{|V|}$ , i.e., $\bm{e}^{i}∈\{0,1\}^{|V|},\mathbf{1}^{T}\bm{e}^{i}=1$ . The task at each step $t$ is to predict the next token $\bm{x}_{t}$ (denoted as one-hot vector) given a sequence of tokens $\bm{x}_{t-1:1}$ .
One observes examples $(\bm{x}_{t},\bm{x}_{t-1:1})$ drawn from an unknown distribution $p^{*}(\bm{x}_{t},\bm{x}_{t-1:1})=p^{*}(\bm{x}_{t}|\bm{x}_{t-1:1})p^{*}(\bm{x}_{t-1:1})$ over $V× V^{[t-1]}$ , with expectations denoted by $E^{*}_{\bm{x}_{t},\bm{x}_{t-1:1}}$ . Cross-entropy loss is typically used for training over the vocabulary tokens. Assuming access to token logits, and thus the softmax outputs, from the black-box LLM, we interpret the softmax output as a vector approximating the class-conditional probabilities $p^{*}(\bm{x}_{t}|\bm{x}_{t-1:1})$ , denoted as $b(\bm{x}_{t}|\bm{x}_{t-1:1})∈\Delta^{|V|-1}$ .
To quantify the discrepancy between the target label $\bm{x}_{t}=\bm{e}^{i}$ at step $t$ and the model’s predicted output, we define a loss function $\ell:{\cal X}_{t}×\Delta^{|V|-1}→\mathbb{R}$ . A common choice in next-token prediction tasks is the cross-entropy loss:
$$
\ell(\bm{e}^{i},b(\bm{x}_{t}|\bm{x}_{t-1:1}))=-(\bm{e}^{i})^{T}\log b(\bm{x}_{t}|\bm{x}_{t-1:1})=-\log b(\bm{x}_{t}=\bm{e}^{i}|\bm{x}_{t-1:1}). \tag{1}
$$
With some abuse of notation, the loss in vector form $\bm{\ell}:\Delta^{|V|-1}→\mathbb{R}^{|V|}$ , computed on every possible label, is $\bm{\ell}(b(\bm{x}_{t}|\bm{x}_{t-1:1}))=\Big{(}\ell(\bm{e}^{1},b(\bm{x}_{t}|\bm{x}_{t-1:1})),...,\ell(\bm{e}^{|V|},b(\bm{x}_{t}|\bm{x}_{t-1:1}))\Big{)}^{T}.$
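As a small illustration (not from the paper's code), the vector-form cross-entropy can be computed directly from a softmax output; the five-token vocabulary and probabilities below are made up:

```python
import numpy as np

# Toy softmax output b(x_t | x_{t-1:1}) over a made-up 5-token vocabulary.
b = np.array([0.1, 0.1, 0.3, 0.4, 0.1])

# Vector-form loss: entry j holds -log b(x_t = e^j | x_{t-1:1}).
loss_vec = -np.log(b)

# The loss for a particular true token e^i is just the i-th coordinate,
# matching ell(e^i, b) = -(e^i)^T log b in Eq. (1).
i = 2
one_hot = np.eye(5)[i]
assert np.isclose(-one_hot @ np.log(b), loss_vec[i])
```

The one-hot inner product merely selects a coordinate, which is why the scalar and vector forms of the loss coincide entrywise.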
3 Loss Robustness
We extend label noise modeling to the autoregressive language setting, focusing on asymmetric or class-conditional noise. At each step $t$ , the label $\bm{x}_{t}$ in the black-box model’s training data is flipped to $\tilde{\bm{x}}_{t}∈ V$ with probability $p^{*}(\tilde{\bm{x}}_{t}|\bm{x}_{t})$ , while preceding tokens $(\bm{x}_{t-1:1})$ remain unchanged. As a result, the black-box model observes samples from a noisy distribution: $p^{*}(\tilde{\bm{x}}_{t},\bm{x}_{t-1:1})=\sum_{\bm{x}_{t}}p^{*}(\tilde{\bm{x}}_{t}|\bm{x}_{t})p^{*}(\bm{x}_{t}|\bm{x}_{t-1:1})p^{*}(\bm{x}_{t-1:1}).$
We define the noise transition matrix $T_{t}∈[0,1]^{|V|×|V|}$ at step $t$ , where each entry $T_{t_{ij}}=p^{*}(\tilde{\bm{x}}_{t}=\bm{e}^{j}|\bm{x}_{t}=\bm{e}^{i})$ represents the probability of label flipping. This matrix is row-stochastic but not necessarily symmetric.
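To make the row-stochastic structure concrete, here is a numerical sketch (with a made-up 3-token vocabulary) of pushing a clean next-token distribution through $T_{t}$ to obtain the noisy one:

```python
import numpy as np

# Made-up row-stochastic matrix: T[i, j] = p(noisy token e^j | clean token e^i).
T = np.array([[0.9, 0.05, 0.05],
              [0.1, 0.8,  0.1 ],
              [0.0, 0.2,  0.8 ]])
assert np.allclose(T.sum(axis=1), 1.0)   # each row is a probability distribution

# Clean conditional p(x_t | x_{t-1:1}); the noisy conditional is T^T applied
# to it, matching the summation over x_t in the noisy distribution above.
p_clean = np.array([0.5, 0.3, 0.2])
p_noisy = T.T @ p_clean

assert np.isclose(p_noisy.sum(), 1.0)    # still a valid distribution
```

Note the matrix need not be symmetric: here a clean third token never flips to the first, while the reverse flip has probability 0.05.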
To handle asymmetric label noise, we modify the loss $\bm{\ell}$ for robustness. Initially, assuming a known $T_{t}$ , we apply a loss correction inspired by (Patrini et al., 2017; Sukhbaatar et al., 2015). We then relax this assumption by estimating $T_{t}$ directly, forming the basis of our Plugin model approach.
We observe that a language model trained with no loss correction would result in a predictor for noisy labels $b(\tilde{\bm{x}}_{t}|\bm{x}_{t-1:1})$ . We can make explicit the dependence on $T_{t}$ . For example, with cross-entropy we have:
$$
\ell(\bm{e}^{i},b(\tilde{\bm{x}}_{t}|\bm{x}_{t-1:1}))=-\log b(\tilde{\bm{x}}_{t}=\bm{e}^{i}|\bm{x}_{t-1:1})=-\log\sum_{j=1}^{|V|}p^{*}(\tilde{\bm{x}}_{t}=\bm{e}^{i}|\bm{x}_{t}=\bm{e}^{j})b(\bm{x}_{t}=\bm{e}^{j}|\bm{x}_{t-1:1})=-\log\sum_{j=1}^{|V|}T_{t_{ji}}b(\bm{x}_{t}=\bm{e}^{j}|\bm{x}_{t-1:1}), \tag{2}
$$
or in matrix form
$$
\bm{\ell}(b(\tilde{\bm{x}}_{t}|\bm{x}_{t-1:1}))=-\log T_{t}^{\top}b(\bm{x}_{t}|\bm{x}_{t-1:1}). \tag{3}
$$
This loss compares the noisy label $\tilde{\bm{x}}_{t}$ to the noisy predictions averaged via the transition matrix $T_{t}$ at step $t$ . Cross-entropy loss, commonly used for next-token prediction, is a proper composite loss with the softmax function as its inverse link function (Patrini et al., 2017). Consequently, from Theorem 2 of Patrini et al. (2017), the minimizer of the forwardly-corrected loss in Equation (3) on noisy data aligns with the minimizer of the true loss on clean data, i.e.,
$$
\operatorname*{argmin}_{w}\,E^{*}_{\tilde{\bm{x}}_{t},\bm{x}_{t-1:1}}\Big[\bm{\ell}\big(\tilde{\bm{x}}_{t},T_{t}^{\top}b(\bm{x}_{t}|\bm{x}_{t-1:1})\big)\Big]=\operatorname*{argmin}_{w}\,E^{*}_{\bm{x}_{t},\bm{x}_{t-1:1}}\Big[\bm{\ell}\big(\bm{x}_{t},b(\bm{x}_{t}|\bm{x}_{t-1:1})\big)\Big],
$$
where $w$ are the language model’s weights, implicitly embedded in the softmax output $b$ from the black-box model. This result suggests that if $T_{t}$ were known, we could transform the softmax output $b(\bm{x}_{t}\mid\bm{x}_{t-1:1})$ using $T_{t}^{T}$ , use the transformed predictions as final outputs, and retrain the model accordingly. However, since $T_{t}$ is unknown and training data is inaccessible, estimating $T_{t}$ from clean data is essential to our approach.
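A minimal numerical sketch of the forward correction in Eq. (3), with a made-up vocabulary and distributions: the corrected loss evaluates noisy labels against $T_{t}^{\top}b$, and reduces to plain cross-entropy when there is no noise:

```python
import numpy as np

rng = np.random.default_rng(0)
V = 4
T = rng.dirichlet(np.ones(V), size=V)   # made-up row-stochastic T_t
b = rng.dirichlet(np.ones(V))           # clean softmax estimate b(x_t | x_{t-1:1})

# Forward-corrected loss vector (Eq. 3), one entry per possible noisy label.
loss_fwd = -np.log(T.T @ b)

# Sanity check: with no noise (T_t = identity) the correction is ordinary CE.
assert np.allclose(-np.log(np.eye(V).T @ b), -np.log(b))
```

Since each column sum in `T.T @ b` stays in $(0,1)$, every corrected loss entry remains finite and positive.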
3.1 Estimation of Transition Matrix
We assume access to a small amount of target data for the task. Given that the black-box model is expressive enough to approximate $p^{*}(\tilde{\bm{x}}_{t}\mid\bm{x}_{t-1:1})$ (Assumption (2) in Theorem 3 of Patrini et al. (2017)), the transition matrix $T_{t}$ can be estimated from this target data. Considering the supervised classification setting at step $t$ , let $\mathcal{X}_{t}^{i}$ represent all target data samples where $\bm{x}_{t}=\bm{e}^{i}$ and the preceding tokens are $(\bm{x}_{t-1:1})$ . A naive estimate of the transition matrix is: $\hat{T}_{t_{ij}}=b(\tilde{\bm{x}}_{t}=\bm{e}^{j}|\bm{x}_{t}=\bm{e}^{i})=\frac{1}{|\mathcal{X}_{t}^{i}|}\sum_{x∈\mathcal{X}_{t}^{i}}b(\tilde{\bm{x}}_{t}=\bm{e}^{j}|\bm{x}_{t-1:1})$ . While this setup works for a single step $t$ , there are two key challenges in extending it across all steps in the token prediction task:
1. Limited sample availability: The number of samples where $\bm{x}_{t}=\bm{e}^{i}$ and the preceding tokens $(\bm{x}_{t-1},...,\bm{x}_{1})$ match exactly is limited in the clean data, especially with large vocabulary sizes (e.g., $|V|=O(100K)$ for LLaMA (Dubey et al., 2024)). This necessitates modeling the transition matrix as a function of features derived from $\bm{x}_{t-1:1}$ , akin to text-based autoregressive models.
2. Large parameter space: With a vocabulary size of $|V|=O(100K)$ , the transition matrix $T_{t}$ has approximately 10 billion parameters. This scale may exceed the size of the closed-source LLM and cannot be effectively learned from limited target data. Thus, structural restrictions must be imposed on $T_{t}$ to reduce its complexity.
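The naive estimator can be sketched as follows; `softmax_outputs` and `labels` are hypothetical stand-ins for the black-box softmax vectors and the clean next tokens observed in the target data:

```python
import numpy as np

# Naive estimate of T_t: average the black-box softmax over target samples
# whose clean next token is e^i (a sketch with made-up numbers).
V = 4
softmax_outputs = np.array([[0.7, 0.1, 0.1,  0.1 ],
                            [0.5, 0.3, 0.1,  0.1 ],
                            [0.1, 0.8, 0.05, 0.05]])
labels = np.array([0, 0, 1])        # index i such that x_t = e^i per sample

T_hat = np.zeros((V, V))
for i in range(V):
    rows = softmax_outputs[labels == i]
    if len(rows):
        T_hat[i] = rows.mean(axis=0)   # row i: estimate of p(x~_t = e^j | x_t = e^i)
```

Rows for tokens never observed as clean labels (here tokens 2 and 3) stay zero, which is precisely the limited-sample problem noted above.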
To address these challenges, we impose the restriction that the transition matrix $T_{t}$ is diagonal. While various constraints could be applied to simplify the problem, assuming $T_{t}$ is diagonal offers two key advantages. First, it allows the transition matrix—effectively a vector in this case—to be modeled using standard autoregressive language models, such as a GPT-2 model with $k$ transformer blocks, a LLaMA model with $d$ -dimensional embeddings, or a fine-tuned GPT-2-small model. These architectures can be adjusted based on the size of the target data. Second, a diagonal transition matrix corresponds to a symmetric or class-independent label noise setup, where $\bm{x}_{t}=\bm{e}^{i}$ flips to any other class with equal probability in the training data. This assumption, while simplifying, remains realistic within the framework of label noise models.
Enforcing a diagonal structure ensures efficient estimation of the transition matrix while maintaining practical applicability within our framework. Next, we outline our approach for adapting closed-source language models to target data.
4 Proposed Method: The Plugin Approach
To estimate the autoregressive transition vector, we train an autoregressive language model on target data, which operates alongside the black-box model during inference. This model acts as an autoregressive reweighting mechanism, adjusting the token probabilities produced by the black-box model. The combined approach, integrating probabilities from the black-box and reweighting models, is referred to as the Plugin model. The term Plugin is inspired by classification literature, where plugin methods reweight probabilities to adapt to distribution shifts (Koyejo et al., 2014; Narasimhan et al., 2015; Hiranandani et al., 2021). We now detail the training and inference phases, summarized in Algorithm 1 (Appendix A) and illustrated in Figure 1.
4.1 Training the Plugin Model
During each training iteration, a sequence $s$ of $m$ tokens is passed through both the black-box model and the reweighting model to obtain token probabilities $\{\bm{b}_{1},\bm{b}_{2},...,\bm{b}_{m}\}$ and $\{\bm{r}_{1},\bm{r}_{2},...,\bm{r}_{m}\}$ , respectively, where each $\bm{b}_{i},\bm{r}_{i}∈\Delta^{|V|-1}$ . The final token probability from the Plugin model is computed by normalizing the element-wise product of these probabilities:
$$
{\bm{p}}_{i}=\frac{\bm{b}_{i}\odot\bm{r}_{i}}{\|\bm{b}_{i}\odot\bm{r}_{i}\|_{1}}. \tag{4}
$$
The sequence-level cross-entropy loss is given by:
$$
\ell_{s}=-\frac{1}{m}\sum_{i=1}^{m}\bm{e}_{j}^{\top}\log({\bm{p}}_{i}), \tag{5}
$$
where the $j$ -th token appears at the $i$ -th position in the sequence $s$ . During backpropagation, only the reweighting model parameters are updated, while the black-box model remains frozen. This formulation extends naturally to batch training, refining $\bm{r}_{i}$ over iterations to approximate the transition vector governing label shifts in the target data.
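A single forward pass of this objective can be sketched as below; the vocabulary, sequence, and both probability tables are synthetic, and in actual training only the reweighting model producing $\bm{r}_{i}$ would receive gradients:

```python
import numpy as np

rng = np.random.default_rng(1)
V, m = 6, 3                              # made-up vocabulary size / sequence length
b = rng.dirichlet(np.ones(V), size=m)    # black-box token probs, one row per step
r = rng.dirichlet(np.ones(V), size=m)    # reweighting model token probs
targets = np.array([1, 4, 2])            # observed next tokens in the sequence

# Eq. (4): renormalized element-wise product of the two distributions.
p = (b * r) / (b * r).sum(axis=1, keepdims=True)

# Eq. (5): sequence-level cross-entropy on the observed tokens.
loss = -np.log(p[np.arange(m), targets]).mean()
```

Backpropagating `loss` through the reweighting model (with `b` held fixed) is the only update performed; the black-box model is queried but never modified.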
4.2 Inference from the Plugin Model
Given a fully trained reweighting model and access to the black-box model, token generation proceeds autoregressively. At the first step, the black-box model produces token probabilities $\bm{b}_{1}$ , while the reweighting model outputs $\bm{r}_{1}$ . The Plugin model selects the first token as $\bm{x}_{1}=\operatorname*{argmax}_{V}(\bm{b}_{1}\odot\bm{r}_{1}).$ For subsequent steps, given the previously generated tokens $\bm{x}_{t-1:1}$ , we obtain probabilities $\bm{b}_{t}$ from the black-box model and $\bm{r}_{t}$ from the reweighting model. The Plugin model then predicts the next token as: $\bm{x}_{t}=\operatorname*{argmax}_{V}(\bm{b}_{t}\odot\bm{r}_{t})$ .
The process continues until a stopping criterion is met. Note that this manuscript focuses on greedy decoding for inference. Other decoding strategies, such as temperature scaling, top- $p$ sampling, or beam search, can be incorporated by normalizing the element-wise product of probabilities and using it as the final token distribution, as in Equation (4).
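Greedy Plugin decoding reduces to an argmax over the element-wise product at each step. In this sketch, `black_box` and `reweighter` are hypothetical placeholders for the two models' softmax outputs:

```python
import numpy as np

rng = np.random.default_rng(2)
V = 8                      # made-up vocabulary size

def black_box(prefix):     # placeholder for the closed model's softmax output
    return rng.dirichlet(np.ones(V))

def reweighter(prefix):    # placeholder for the trained reweighting model
    return rng.dirichlet(np.ones(V))

prefix, max_len, eos = [], 5, 0
while len(prefix) < max_len:
    scores = black_box(prefix) * reweighter(prefix)  # unnormalized product
    token = int(np.argmax(scores))                   # greedy choice
    if token == eos:
        break
    prefix.append(token)
```

Because the argmax is unaffected by normalization, the product in Eq. (4) need not be renormalized under greedy decoding; sampling-based strategies would renormalize first.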
5 Theoretical Analysis
We establish the convergence properties of Plugin, showing that after $t$ tokens, it accurately estimates the autoregressive noise transition matrix. Modeling the matrix as a function of an unknown parameter $\bm{\theta}_{*}$ , we prove that optimizing the autoregressive loss over token sequences enables consistent estimation of $\bm{\theta}_{*}$ with high probability. To our knowledge, this is the first finite-time convergence analysis for transition matrix estimation under autoregressive noisy loss.
Let $\mathcal{F}^{t-1}$ denote the history of selected tokens up to time $t-1$ . Let an unknown parameter $\bm{\theta}_{*}∈\bm{\Theta}\subseteq\mathbb{R}^{d}$ govern the transition dynamics of label flipping between token pairs. The transition matrix at time $t$ , denoted as $T_{t}(\bm{\theta}_{*}|\mathcal{F}^{t-1})$ , depends on $\bm{\theta}_{*}$ and all previously observed tokens. Before proving our main result, we first make a few assumptions.
**Assumption 5.1**
*Let $T_{t}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{t-1})$ denote the $(i,j)$ -th component of the transition matrix, and let $f_{I_{t}}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{t-1})$ be the transition function that determines the transition from $x_{i}$ to $x_{j}$ , where $I_{t}$ is the $x_{i}$ token selected at time $t$ . Let $x_{i},x_{j}∈\mathbb{R}^{d}$ . We assume that $∇ f_{I_{t}}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{t-1})<\lambda_{0}$ and $∇^{2}f_{I_{t}}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{t-1})<\lambda_{1}$ for some constants $\lambda_{0}>0$ , $\lambda_{1}>0$ and for all steps $t$ .*
Assumption 5.1 states that the transition matrix depends on the history-dependent function $f_{I_{t}}(·)$ with bounded gradient and Hessian, similar to assumptions made in (Singh et al., 2023; Zhang et al., 2024) for other deep models.
**Assumption 5.2**
*We assume the cross-entropy loss (5) is clipped by $\epsilon>0$ and upper bounded as $\ell^{clipped}_{t}≤ C|V|^{2}(Y_{t}-f_{I_{t}}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{t-1}))^{2}$ for any time $t$ , where $Y_{t}$ is the predicted token class, $f_{I_{t}}$ determines the true class and satisfies Assumption 5.1, and $C>0$ is a constant.*
Assumption 5.2 ensures that the clipped log loss is upper bounded by a smoother squared loss. For the remainder of this section, we refer to this squared loss at time $t$ as $\ell_{t}(\bm{\theta})$ . Let the Plugin model minimize the losses $\ell_{1}(\bm{\theta}),\ell_{2}(\bm{\theta}),\ldots,\ell_{t}(\bm{\theta})$ over $t$ iterations. Let ${\widehat{\bm{\theta}}}_{t}=\operatorname*{argmin}_{\bm{\theta}∈\bm{\Theta}}\sum_{s=1}^{t}\ell_{s}(\bm{\theta})$ . At every iteration $t$ , the Plugin algorithm looks into the history $\mathcal{F}^{t-1}$ and samples a token $\bm{x}_{t}\sim\bm{p}_{\hat{\theta}_{t}}=\bm{b}_{t}\odot\bm{r}_{\hat{\theta}_{t}}$ .
Let $\widehat{\cal L}_{t}(\bm{\theta})=\frac{1}{t}\sum_{s=1}^{t}\ell_{s}(\bm{\theta})$ and its expectation ${\cal L}_{t}(\bm{\theta})=\frac{1}{t}\sum_{s=1}^{t}\mathbb{E}_{x_{s}\sim\mathbf{p}_{{\widehat{\bm{\theta}}}_{s-1}}}[\ell_{s}(\bm{\theta})|\mathcal{F}^{s-1}]$ . We impose regularity and smoothness assumptions on the loss function $\ell_{t}(\bm{\theta})$ as stated in Assumption B.1 (Appendix B). We are now ready to state the main theoretical result of the paper.
**Theorem 1**
*Suppose $\ell_{1}(\bm{\theta}),\ldots,\ell_{t}(\bm{\theta}):\mathbb{R}^{|V|}→\mathbb{R}$ are loss functions from a distribution that satisfies Assumptions 5.1, 5.2, and B.1. Define ${\cal L}_{t}(\bm{\theta})=\frac{1}{t}\sum_{s=1}^{t}\mathbb{E}_{x_{s}\sim\mathbf{p}_{{\widehat{\bm{\theta}}}_{s-1}}}[\ell_{s}(\bm{\theta})|\mathcal{F}^{s-1}]$ , where ${\widehat{\bm{\theta}}}_{t}=\operatorname*{argmin}_{\bm{\theta}∈\bm{\Theta}}\sum_{s=1}^{t}\ell_{s}(\bm{\theta})$ . If $t$ is large enough such that $\frac{\gamma\log(dt)}{t}≤ c^{\prime}\min\left\{\frac{1}{C_{1}C_{2}|V|^{4}},\frac{\max_{\bm{\theta}∈\bm{\Theta}}\left({\cal L}_{t}(\bm{\theta})-{\cal L}_{t}(\bm{\theta}_{*})\right)}{C_{2}}\right\}$ , then for a constant $\gamma≥ 2$ and universal constants $C_{1},C_{2},c^{\prime}$ , we have*

$$
\left(1-\rho_{t}\right)\frac{\sigma_{t}^{2}}{t}-\frac{C_{1}^{2}}{t^{\gamma/2}}\;≤\;\mathbb{E}\left[{\cal L}_{t}({\widehat{\bm{\theta}}}_{t})-{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right],
$$

*where $\sigma^{2}_{t}\coloneqq\mathbb{E}\left[\frac{1}{2}\left\|∇\widehat{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right\|_{\left(∇^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right)^{-1}}^{2}\right]$ and $\rho_{t}\coloneqq\left(C_{1}C_{2}+2\eta^{2}\lambda_{1}^{2}\right)\sqrt{\frac{\gamma\log(dt)}{t}}$ .*
Theorem 1 bounds the difference between the estimated and true average loss functions, showing that this gap diminishes as the number of training tokens increases. Since ${\widehat{\bm{\theta}}}_{t}=\operatorname*{argmin}_{\bm{\theta}∈\bm{\Theta}}\sum_{s=1}^{t}\ell_{s}(\bm{\theta})$ , the Plugin model progressively refines its estimate of the unknown parameter $\bm{\theta}_{*}$ . As the transition matrix $T_{t}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{t-1})$ is derived from $f_{I_{t}}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{t-1})$ , which depends on $\bm{\theta}_{*}$ , training on sufficiently many tokens ensures an accurate estimate of each component of $T_{t}(\bm{\theta}_{*}|\mathcal{F}^{t-1})$ .
Our proof reformulates the problem as a sequential hypothesis testing setting to estimate the average loss function ${\cal L}_{t}({\widehat{\bm{\theta}}}_{t})$ using the sequence of losses $\ell_{1}(\bm{\theta}),...,\ell_{t}(\bm{\theta})$ (Naghshvar & Javidi, 2013; Lattimore & Szepesvári, 2020). Unlike prior work (Frostig et al., 2015; Chaudhuri et al., 2015), which assumes i.i.d. losses, the loss at time $t$ in our setting depends on all previous losses. Additionally, Mukherjee et al. (2022) study a different active regression setting without considering cross-entropy loss or transition noise matrices as in Patrini et al. (2017). We provide a brief overview of the proof technique in Remark B.9 (Appendix B), highlighting key novelties.
6 Related Work
Parameter-Efficient Fine-Tuning (PEFT).
PEFT methods adapt LLMs to downstream tasks while minimizing computational overhead. LoRA (Hu et al., 2021) and QLoRA (Dettmers et al., 2024) introduce low-rank updates and quantization for efficient fine-tuning, while prefix tuning (Li & Liang, 2021), adapters (Hu et al., 2023b), and soft prompting (Lester et al., 2021) modify task-specific representations through trainable layers or embeddings. Torroba-Hennigen et al. (2025) further explore the equivalence between gradient-based transformations and adapter-based tuning. However, these methods require access to model weights, gradients, or architecture details, making them unsuitable for closed-source LLMs and inapplicable as baselines in our setup. In contrast, our approach operates solely on token logits, enabling adaptation without modifying the underlying model. Thus, we emphasize that the Plugin model is not an alternative to fine-tuning, but rather an approach uniquely suited to adapting black-box LLMs that provide only logit access.
Steering and Aligning LLMs.
LLM alignment methods primarily use reinforcement learning or instruction tuning. RLHF and DPO (Christiano et al., 2017; Ouyang et al., 2022; Rafailov et al., 2024) optimize model behavior via human preferences, with DPO eliminating reward modeling. Constitutional AI (Bai et al., 2022b) aligns models using self-generated principles, while instruction tuning (Wei et al., 2021; Sanh et al., 2022) adapts them via task-specific demonstrations. Unlike our approach, these methods require model weights and training data, limiting their applicability as baselines in our setup.
Calibration of LLMs.
LLM calibration methods aim to align model confidence with predictive accuracy and adjust confidence scores but do not alter token predictions (Ulmer et al., 2024; Shen et al., 2024; Huang et al., 2024; Kapoor et al., 2024; Zhu et al., 2023; Zhang et al., 2023). In contrast, our method reweights token probabilities at inference, enabling adaptation of black-box LLMs without modifying the model or requiring fine-tuning.
Black-box LLMs.
Prior work explores various approaches for adapting black-box LLMs without fine-tuning, though they differ fundamentally from our method. Gao et al. (2024) infer user preferences through interactive edits but do not adapt models based on past language data. Diffusion-LM (Li et al., 2022) formulates text generation as a non-autoregressive denoising process, whereas our approach reweights token probabilities autoregressively without requiring black-box model weights. Discriminator-based methods (Dathathri et al., 2020; Mireshghallah et al., 2022; Yang & Klein, 2021; Krause et al., 2021) control generation based on predefined attributes, contrasting with our method, which enables free-form text adaptation. DExperts (Liu et al., 2021, 2024) combines expert and anti-expert probabilities; we incorporate a similar probability combining strategy in a modified baseline without the anti-expert component. In-context learning (Long et al., 2023; Dong et al., 2024) offers a common adaptation technique for black-box models and serves as a baseline in our setup.
Table 1: Performance comparison on E2E NLG dataset. We show mean and standard deviation of the metrics over five seeds.
| Model | Method | BLEU | Rouge-1 | Rouge-2 | Rouge-L | METEOR | CIDEr | NIST |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT2-M | Zeroshot | 0.0247 | 0.3539 | 0.1003 | 0.2250 | 0.3015 | 0.0156 | 0.6133 |
| GPT2-M | ICL-1 | 0.0543 ±0.026 | 0.3431 ±0.048 | 0.1299 ±0.033 | 0.2280 ±0.047 | 0.3434 ±0.051 | 0.0260 ±0.042 | 0.7767 ±0.060 |
| GPT2-M | ICL-3 | 0.0750 ±0.035 | 0.3955 ±0.028 | 0.1676 ±0.020 | 0.2649 ±0.052 | 0.3977 ±0.063 | 0.0252 ±0.049 | 0.8993 ±0.076 |
| GPT2-M | NewModel | 0.2377 ±0.011 | 0.5049 ±0.014 | 0.2742 ±0.013 | 0.3902 ±0.006 | 0.4521 ±0.016 | 0.3938 ±0.019 | 1.1927 ±0.069 |
| GPT2-M | WeightedComb | 0.1709 ±0.008 | 0.4817 ±0.020 | 0.2447 ±0.011 | 0.3720 ±0.014 | 0.4071 ±0.025 | 0.3329 ±0.027 | 1.0864 ±0.002 |
| GPT2-M | TempNet | 0.1036 ±0.010 | 0.3425 ±0.016 | 0.1526 ±0.012 | 0.2735 ±0.010 | 0.2615 ±0.016 | 0.4116 ±0.023 | 0.2826 ±0.057 |
| GPT2-M | Plugin (Ours) | 0.1863 ±0.010 | 0.5227 ±0.011 | 0.2612 ±0.013 | 0.3728 ±0.003 | 0.4857 ±0.012 | 0.3544 ±0.013 | 1.1241 ±0.009 |
| GPT2-XL | Zeroshot | 0.0562 | 0.4013 | 0.1636 | 0.2862 | 0.3697 | 0.0187 | 0.5338 |
| GPT2-XL | ICL-1 | 0.0686 ±0.032 | 0.4016 ±0.042 | 0.1404 ±0.052 | 0.2745 ±0.025 | 0.3503 ±0.019 | 0.0353 ±0.015 | 0.7944 ±0.067 |
| GPT2-XL | ICL-3 | 0.0980 ±0.035 | 0.4188 ±0.040 | 0.1923 ±0.046 | 0.2912 ±0.031 | 0.3925 ±0.027 | 0.0250 ±0.017 | 0.9390 ±0.054 |
| GPT2-XL | NewModel | 0.2377 ±0.011 | 0.5049 ±0.014 | 0.2742 ±0.013 | 0.3902 ±0.006 | 0.4521 ±0.016 | 0.3938 ±0.019 | 1.1927 ±0.069 |
| GPT2-XL | WeightedComb | 0.1184 ±0.010 | 0.4237 ±0.016 | 0.1858 ±0.012 | 0.3004 ±0.010 | 0.3776 ±0.016 | 0.1818 ±0.023 | 1.0261 ±0.057 |
| GPT2-XL | TempNet | 0.1325 ±0.013 | 0.4642 ±0.017 | 0.2516 ±0.016 | 0.3021 ±0.022 | 0.4126 ±0.025 | 0.3627 ±0.033 | 0.8027 ±0.047 |
| GPT2-XL | Plugin (Ours) | 0.2470 ±0.009 | 0.5536 ±0.007 | 0.3084 ±0.007 | 0.4213 ±0.008 | 0.5057 ±0.009 | 0.5455 ±0.013 | 1.2736 ±0.051 |
| LLaMA-3.1-8B | Zeroshot | 0.3226 | 0.6917 | 0.4050 | 0.5004 | 0.6041 | 0.9764 | 1.1310 |
| LLaMA-3.1-8B | ICL-1 | 0.3301 ±0.037 | 0.6914 ±0.027 | 0.4126 ±0.026 | 0.5023 ±0.018 | 0.6037 ±0.015 | 0.9715 ±0.057 | 1.1735 ±0.066 |
| LLaMA-3.1-8B | ICL-3 | 0.3527 ±0.033 | 0.6936 ±0.036 | 0.4217 ±0.017 | 0.5127 ±0.017 | 0.6202 ±0.009 | 0.9927 ±0.018 | 1.1672 ±0.047 |
| LLaMA-3.1-8B | NewModel | 0.2452 ±0.008 | 0.5347 ±0.005 | 0.2905 ±0.006 | 0.4097 ±0.005 | 0.4812 ±0.009 | 0.4571 ±0.021 | 1.2281 ±0.041 |
| LLaMA-3.1-8B | WeightedComb | 0.3517 ±0.004 | 0.7040 ±0.004 | 0.4249 ±0.004 | 0.5181 ±0.003 | 0.6206 ±0.002 | 1.0947 ±0.010 | 1.1737 ±0.015 |
| LLaMA-3.1-8B | TempNet | 0.3502 ±0.023 | 0.6927 ±0.006 | 0.4216 ±0.023 | 0.5027 ±0.017 | 0.6124 ±0.019 | 0.9625 ±0.025 | 1.1713 ±0.027 |
| LLaMA-3.1-8B | Plugin (Ours) | 0.3691 ±0.013 | 0.7113 ±0.002 | 0.4374 ±0.004 | 0.5247 ±0.002 | 0.6392 ±0.009 | 1.1441 ±0.030 | 1.1749 ±0.034 |
Table 2: Performance comparison on Web NLG dataset. We show mean and standard deviation of the metrics over five seeds.
| Model | Method | BLEU | Rouge-1 | Rouge-2 | Rouge-L | METEOR | CIDEr | NIST |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT2-M | Zeroshot | 0.0213 | 0.2765 | 0.1014 | 0.1872 | 0.2111 | 0.0479 | 0.2340 |
| GPT2-M | ICL-1 | 0.0317 ±0.013 | 0.3388 ±0.021 | 0.1318 ±0.013 | 0.2346 ±0.019 | 0.2876 ±0.042 | 0.0732 ±0.053 | 0.2715 ±0.042 |
| GPT2-M | ICL-3 | 0.0461 ±0.014 | 0.3388 ±0.018 | 0.1378 ±0.016 | 0.2291 ±0.010 | 0.3408 ±0.027 | 0.0748 ±0.031 | 0.3283 ±0.037 |
| GPT2-M | NewModel | 0.1071 ±0.005 | 0.3260 ±0.010 | 0.1496 ±0.014 | 0.2724 ±0.013 | 0.2642 ±0.008 | 0.4327 ±0.023 | 0.2916 ±0.031 |
| GPT2-M | WeightedComb | 0.0692 ±0.007 | 0.3593 ±0.010 | 0.1568 ±0.008 | 0.2834 ±0.015 | 0.2379 ±0.030 | 0.1916 ±0.028 | 0.2996 ±0.037 |
| GPT2-M | TempNet | 0.1045 ±0.012 | 0.3526 ±0.014 | 0.1526 ±0.014 | 0.2731 ±0.018 | 0.3326 ±0.026 | 0.4237 ±0.033 | 0.3002 ±0.048 |
| GPT2-M | Plugin (Ours) | 0.1280 ±0.007 | 0.4590 ±0.005 | 0.2226 ±0.005 | 0.3515 ±0.006 | 0.3832 ±0.010 | 0.7280 ±0.039 | 0.3060 ±0.017 |
| GPT2-XL | Zeroshot | 0.0317 | 0.2992 | 0.1321 | 0.2417 | 0.1969 | 0.0491 | 0.1826 |
| GPT2-XL | ICL-1 | 0.0510 ±0.024 | 0.3223 ±0.026 | 0.1526 ±0.016 | 0.2562 ±0.031 | 0.2591 ±0.009 | 0.1336 ±0.029 | 0.2235 ±0.033 |
| GPT2-XL | ICL-3 | 0.0744 ±0.016 | 0.3383 ±0.036 | 0.1682 ±0.016 | 0.2651 ±0.028 | 0.3071 ±0.014 | 0.1675 ±0.024 | 0.2550 ±0.021 |
| GPT2-XL | NewModel | 0.1071 ±0.005 | 0.3260 ±0.010 | 0.1496 ±0.014 | 0.2724 ±0.013 | 0.2642 ±0.008 | 0.4327 ±0.023 | 0.2916 ±0.031 |
| GPT2-XL | WeightedComb | 0.0636 ±0.006 | 0.3453 ±0.007 | 0.1666 ±0.003 | 0.2782 ±0.005 | 0.2871 ±0.006 | 0.2460 ±0.005 | 0.2981 ±0.018 |
| GPT2-XL | TempNet | 0.0925 ±0.008 | 0.3357 ±0.009 | 0.1663 ±0.014 | 0.2764 ±0.011 | 0.3025 ±0.009 | 0.4226 ±0.013 | 0.2837 ±0.027 |
| GPT2-XL | Plugin (Ours) | 0.1673 ±0.004 | 0.4616 ±0.007 | 0.2527 ±0.007 | 0.3757 ±0.008 | 0.3895 ±0.007 | 0.8987 ±0.013 | 0.2646 ±0.003 |
| LLaMA-3.1-8B | Zeroshot | 0.1453 | 0.5278 | 0.3030 | 0.3982 | 0.4314 | 0.6991 | 0.2684 |
| LLaMA-3.1-8B | ICL-1 | 0.2166 ±0.031 | 0.5944 ±0.027 | 0.3706 ±0.025 | 0.4667 ±0.013 | 0.5651 ±0.045 | 1.5719 ±0.024 | 0.2462 ±0.038 |
| LLaMA-3.1-8B | ICL-3 | 0.2031 ±0.027 | 0.5937 ±0.019 | 0.3821 ±0.015 | 0.4653 ±0.024 | 0.5682 ±0.046 | 1.3826 ±0.051 | 0.2469 ±0.045 |
| LLaMA-3.1-8B | NewModel | 0.1284 ±0.005 | 0.3506 ±0.009 | 0.1673 ±0.007 | 0.2879 ±0.009 | 0.2921 ±0.008 | 0.4999 ±0.030 | 0.2973 ±0.008 |
| LLaMA-3.1-8B | WeightedComb | 0.1922 ±0.012 | 0.5986 ±0.019 | 0.3612 ±0.012 | 0.4659 ±0.008 | 0.4470 ±0.030 | 1.1855 ±0.075 | 0.2575 ±0.020 |
| LLaMA-3.1-8B | TempNet | 0.2315 ±0.010 | 0.5916 ±0.015 | 0.3794 ±0.012 | 0.4620 ±0.010 | 0.5581 ±0.036 | 1.4826 ±0.043 | 0.2513 ±0.020 |
| LLaMA-3.1-8B | Plugin (Ours) | 0.2542 ±0.004 | 0.6375 ±0.005 | 0.3873 ±0.005 | 0.4869 ±0.007 | 0.5724 ±0.004 | 1.5911 ±0.046 | 0.2590 ±0.003 |
Table 3: Performance comparison on CommonGen dataset. We show mean and standard deviation of the metrics over five seeds.
| Model | Method | BLEU | Rouge-1 | Rouge-2 | Rouge-L | METEOR | CIDEr | NIST |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT2-M | Zeroshot | 0.0153 | 0.2216 | 0.0409 | 0.1527 | 0.2848 | 0.0001 | 0.3686 |
| GPT2-M | ICL-1 | 0.0157 ±0.013 | 0.2580 ±0.024 | 0.0362 ±0.096 | 0.1388 ±0.102 | 0.2871 ±0.107 | 0.0222 ±0.076 | 0.3704 ±0.101 |
| GPT2-M | ICL-3 | 0.0552 ±0.010 | 0.3610 ±0.019 | 0.1248 ±0.045 | 0.2680 ±0.089 | 0.4079 ±0.133 | 0.1366 ±0.125 | 0.5340 ±0.087 |
| GPT2-M | NewModel | 0.1260 ±0.007 | 0.4106 ±0.016 | 0.1683 ±0.013 | 0.3740 ±0.009 | 0.3600 ±0.024 | 0.4570 ±0.058 | 0.7113 ±0.025 |
| GPT2-M | WeightedComb | 0.0567 ±0.005 | 0.3918 ±0.010 | 0.1353 ±0.005 | 0.3280 ±0.010 | 0.2929 ±0.016 | 0.2623 ±0.042 | 0.4353 ±0.028 |
| GPT2-M | TempNet | 0.1248 ±0.015 | 0.4048 ±0.014 | 0.1528 ±0.015 | 0.3526 ±0.014 | 0.3883 ±0.017 | 0.4492 ±0.023 | 0.4037 ±0.058 |
| GPT2-M | Plugin (Ours) | 0.1366 ±0.003 | 0.4533 ±0.007 | 0.1878 ±0.003 | 0.3934 ±0.006 | 0.4095 ±0.011 | 0.5572 ±0.022 | 0.6395 ±0.061 |
| GPT2-XL | Zeroshot | 0.0317 | 0.2992 | 0.1321 | 0.2417 | 0.1969 | 0.0491 | 0.1826 |
| GPT2-XL | ICL-1 | 0.0508 ±0.023 | 0.3201 ±0.035 | 0.1526 ±0.097 | 0.2562 ±0.103 | 0.2591 ±0.089 | 0.1336 ±0.092 | 0.2235 ±0.069 |
| GPT2-XL | ICL-3 | 0.0744 ±0.011 | 0.3383 ±0.014 | 0.1682 ±0.030 | 0.2651 ±0.072 | 0.3071 ±0.073 | 0.1675 ±0.066 | 0.2550 ±0.047 |
| GPT2-XL | NewModel | 0.1260 ±0.007 | 0.4106 ±0.016 | 0.1683 ±0.013 | 0.3740 ±0.009 | 0.3600 ±0.024 | 0.4570 ±0.058 | 0.7113 ±0.025 |
| GPT2-XL | WeightedComb | 0.0614 ±0.020 | 0.3364 ±0.024 | 0.1347 ±0.009 | 0.2969 ±0.019 | 0.2921 ±0.018 | 0.2763 ±0.010 | 0.3352 ±0.051 |
| GPT2-XL | TempNet | 0.1154 ±0.020 | 0.3937 ±0.026 | 0.1482 ±0.017 | 0.3625 ±0.013 | 0.3389 ±0.019 | 0.4376 ±0.018 | 0.5927 ±0.047 |
| GPT2-XL | Plugin (Ours) | 0.1791 ±0.014 | 0.4932 ±0.007 | 0.2288 ±0.004 | 0.4347 ±0.007 | 0.4702 ±0.006 | 0.7283 ±0.012 | 0.6554 ±0.038 |
| LLaMA-3.1-8B | Zeroshot | 0.0643 | 0.2776 | 0.1181 | 0.2488 | 0.3857 | 0.3155 | 0.3347 |
| LLaMA-3.1-8B | ICL-1 | 0.0615 ±0.027 | 0.2697 ±0.033 | 0.1158 ±0.062 | 0.2469 ±0.087 | 0.3822 ±0.069 | 0.3005 ±0.072 | 0.3059 ±0.094 |
| LLaMA-3.1-8B | ICL-3 | 0.0635 ±0.016 | 0.2748 ±0.024 | 0.1225 ±0.018 | 0.3120 ±0.047 | 0.4012 ±0.029 | 0.3250 ±0.022 | 0.3794 ±0.034 |
| LLaMA-3.1-8B | NewModel | 0.0753 ±0.004 | 0.3716 ±0.005 | 0.1122 ±0.003 | 0.3404 ±0.004 | 0.2665 ±0.006 | 0.1919 ±0.015 | 0.6900 ±0.046 |
| LLaMA-3.1-8B | WeightedComb | 0.1789 ±0.005 | 0.3485 ±0.012 | 0.1797 ±0.008 | 0.2981 ±0.012 | 0.3637 ±0.011 | 0.5503 ±0.046 | 0.5450 ±0.020 |
| LLaMA-3.1-8B | TempNet | 0.1524 ±0.008 | 0.3372 ±0.015 | 0.1524 ±0.010 | 0.3298 ±0.017 | 0.3676 ±0.015 | 0.3986 ±0.033 | 0.5286 ±0.023 |
| LLaMA-3.1-8B | Plugin (Ours) | 0.2665 ±0.010 | 0.5800 ±0.002 | 0.3139 ±0.005 | 0.5037 ±0.004 | 0.5829 ±0.003 | 1.0876 ±0.020 | 0.7031 ±0.007 |
Table 4: Performance comparison on Adidas dataset. We show mean and standard deviation of the metrics over five seeds.
| Model | Method | BLEU | Rouge-1 | Rouge-2 | Rouge-L | METEOR | CIDEr | NIST |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT2-M | Zeroshot | 0.0046 | 0.2488 | 0.0189 | 0.1353 | 0.1653 | 0.0312 | 0.6860 |
| GPT2-M | ICL-1 | 0.0088 ±0.054 | 0.2667 ±0.047 | 0.0247 ±0.66 | 0.1358 ±0.041 | 0.1762 ±0.028 | 0.0464 ±0.089 | 0.6793 ±0.078 |
| GPT2-M | ICL-3 | 0.0121 ±0.047 | 0.2693 ±0.028 | 0.0262 ±0.054 | 0.1470 ±0.020 | 0.1806 ±0.030 | 0.0415 ±0.104 | 0.7037 ±0.081 |
| GPT2-M | NewModel | 0.0515 ±0.016 | 0.2690 ±0.014 | 0.0637 ±0.014 | 0.1697 ±0.008 | 0.1918 ±0.013 | 0.0550 ±0.086 | 0.6682 ±0.047 |
| GPT2-M | WeightedComb | 0.0565 ±0.014 | 0.2630 ±0.028 | 0.0495 ±0.018 | 0.1565 ±0.015 | 0.1938 ±0.019 | 0.0585 ±0.088 | 0.6456 ±0.156 |
| GPT2-M | TempNet | 0.0442 ±0.017 | 0.2672 ±0.019 | 0.0482 ±0.022 | 0.1582 ±0.020 | 0.1902 ±0.017 | 0.0525 ±0.031 | 0.6533 ±0.098 |
| GPT2-M | Plugin (Ours) | 0.0486 ±0.006 | 0.2766 ±0.002 | 0.0515 ±0.007 | 0.1684 ±0.005 | 0.1994 ±0.004 | 0.0626 ±0.017 | 0.7919 ±0.024 |
| GPT2-XL | Zeroshot | 0.0075 | 0.2309 | 0.0278 | 0.1438 | 0.1487 | 0.0184 | 0.4956 |
| GPT2-XL | ICL-1 | 0.0109 ±0.039 | 0.2567 ±0.082 | 0.0265 ±0.054 | 0.1519 ±0.038 | 0.1649 ±0.052 | 0.0318 ±0.171 | 0.5133 ±0.162 |
| GPT2-XL | ICL-3 | 0.0295 ±0.037 | 0.2509 ±0.071 | 0.0395 ±0.043 | 0.1536 ±0.039 | 0.1658 ±0.041 | 0.0321 ±0.109 | 0.5176 ±0.116 |
| GPT2-XL | NewModel | 0.0515 ±0.016 | 0.2690 ±0.014 | 0.0637 ±0.014 | 0.1697 ±0.008 | 0.1918 ±0.013 | 0.0550 ±0.086 | 0.6682 ±0.047 |
| GPT2-XL | WeightedComb | 0.0567 ±0.016 | 0.2210 ±0.027 | 0.0714 ±0.015 | 0.1550 ±0.024 | 0.1674 ±0.017 | 0.0183 ±0.117 | 0.4105 ±0.109 |
| GPT2-XL | TempNet | 0.0539 ±0.018 | 0.2598 ±0.026 | 0.0686 ±0.014 | 0.1562 ±0.019 | 0.1863 ±0.029 | 0.0462 ±0.120 | 0.5263 ±0.117 |
| GPT2-XL | Plugin (Ours) | 0.0600 ±0.017 | 0.2710 ±0.025 | 0.0722 ±0.018 | 0.1725 ±0.017 | 0.1995 ±0.018 | 0.1195 ±0.138 | 0.6375 ±0.120 |
| LLaMA-3.1-8B | Zeroshot | 0.0120 | 0.2470 | 0.0318 | 0.1493 | 0.1526 | 0.0424 | 0.5285 |
| LLaMA-3.1-8B | ICL-1 | 0.0220 ±0.044 | 0.2472 ±0.072 | 0.0405 ±0.068 | 0.1434 ±0.057 | 0.1686 ±0.041 | 0.0555 ±0.133 | 0.5078 ±0.142 |
| LLaMA-3.1-8B | ICL-3 | 0.0177 ±0.041 | 0.2385 ±0.065 | 0.0364 ±0.071 | 0.1408 ±0.030 | 0.1712 ±0.029 | 0.0587 ±0.102 | 0.5775 ±0.145 |
| LLaMA-3.1-8B | NewModel | 0.0506 ±0.011 | 0.2700 ±0.011 | 0.0634 ±0.006 | 0.1749 ±0.006 | 0.1995 ±0.009 | 0.0575 ±0.051 | 0.6570 ±0.072 |
| LLaMA-3.1-8B | WeightedComb | 0.0357 ±0.017 | 0.2583 ±0.014 | 0.0661 ±0.015 | 0.1560 ±0.011 | 0.1706 ±0.016 | 0.0745 ±0.086 | 0.5927 ±0.077 |
| LLaMA-3.1-8B | TempNet | 0.0472 ±0.016 | 0.2647 ±0.022 | 0.0625 ±0.012 | 0.1625 ±0.020 | 0.1857 ±0.013 | 0.0586 ±0.103 | 0.5926 ±0.137 |
| LLaMA-3.1-8B | Plugin (Ours) | 0.0611 ±0.018 | 0.2714 ±0.029 | 0.0742 ±0.020 | 0.1759 ±0.019 | 0.1990 ±0.020 | 0.1293 ±0.152 | 0.6361 ±0.134 |
7 Experiments
We divide this section into four parts. Section 7.1 evaluates Plugin on four text generation datasets across three black-box language models. Since the Plugin model is trained on top of black-box models, we refer to black-box models interchangeably as base models. Section 7.2 discusses how Plugin can be applied as a wrapper on top of any prompt-tuning method when logits are accessible. Section 7.3 presents ablation studies analyzing the impact of black-box model quality, Plugin's complexity, and architecture choices. Section 7.4 presents qualitative analyses and case studies.
We evaluate Plugin on four text generation benchmarks: (a) E2E NLG (Dušek et al., 2020), (b) Web NLG (Gardent et al., 2017), (c) CommonGen (Lin et al., 2020), and (d) the Adidas product description dataset (adi, 2023). For the first three datasets, we use the train-validation-test splits from the Transformers library (Wolf, 2020). To introduce distribution shifts, we filter Web NLG's training data to include only infrastructure descriptions, while the validation and test sets retain person descriptions. Similarly, CommonGen's training set is restricted to samples containing the concept man, while its validation and test sets remain unchanged. Details of this setup are in Section 7.4. The Adidas dataset is split into validation and test sets. Data statistics are provided in Table 6, Appendix C.1.
7.1 Text Generation Performance Comparison
We evaluate Plugin on the text generation task using only the validation and test splits of all four datasets, reserving the train split for ablation studies (Section 7.3). Plugin and baseline models are trained on the small validation set, with performance measured on the test set. Additionally, we allocate 40% of the validation data as hyper-validation for cross-validation of hyperparameters.
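The data protocol above can be sketched in a few lines (the 40% hyper-validation fraction is from the text; the function name and the fixed seed are our own illustrative choices):

```python
import random

def split_validation(examples, hyper_frac=0.4, seed=0):
    """Split a small validation set into a training portion (for fitting the
    reweighting model) and a hyper-validation portion (for cross-validating
    hyperparameters), as described in Section 7.1."""
    rng = random.Random(seed)
    idx = list(range(len(examples)))
    rng.shuffle(idx)
    n_hyper = int(hyper_frac * len(examples))
    hyper = [examples[i] for i in idx[:n_hyper]]
    train = [examples[i] for i in idx[n_hyper:]]
    return train, hyper

# Toy usage: 100 examples -> 60 for training, 40 for hyper-validation.
train, hyper = split_validation(list(range(100)))
```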
Performance is reported using seven standard natural language generation metrics: (a) BLEU (Papineni et al., 2002), (b) ROUGE-1 (Lin, 2004), (c) ROUGE-2 (Lin, 2004), (d) ROUGE-L (Lin & Och, 2004), (e) METEOR (Banerjee & Lavie, 2005), (f) CIDEr (Vedantam et al., 2015), and (g) NIST (Doddington, 2002). All experiments are repeated over five random seeds, and we report the mean and standard deviation for each metric.
We compare Plugin with the following baselines: (a) Zeroshot: The black-box model performs text generation directly, without additional adaptation. (b) ICL-1 (Long et al., 2023): One randomly selected validation sample is used as an in-context example. (c) ICL-3 (Long et al., 2023): Three randomly selected validation samples are used as in-context examples. (d) NewModel: A new language model is trained from scratch on the validation data. (e) WeightedComb (Liu et al., 2021): A new model is trained alongside the black-box model, with token probabilities computed as $\alpha\bm{n}+(1-\alpha)\bm{b}$, where $\bm{n}$ denotes the probabilities from the new model, $\bm{b}$ those from the black-box model, and $\alpha$ is cross-validated in $\{0.25,0.50,0.75\}$. (f) TempNet (Qiu et al., 2024): A recent logit-scaling approach that learns a global temperature per input and uniformly scales logits during generation. Since the black-box model weights are inaccessible, fine-tuning-based approaches are not applicable in our setting; nonetheless, we include a comparison with LoRA in Appendix C.4 for completeness, which highlights Plugin's competitiveness despite operating under stricter access constraints than PEFT requires.
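The WeightedComb baseline reduces to a one-line mixture of the two probability vectors; a minimal sketch (function name ours) is:

```python
import numpy as np

def weighted_comb(p_new, p_black_box, alpha):
    """WeightedComb baseline (Liu et al., 2021): mix the new model's token
    probabilities with the black-box model's using a single scalar alpha,
    cross-validated in {0.25, 0.50, 0.75}."""
    p = alpha * np.asarray(p_new) + (1.0 - alpha) * np.asarray(p_black_box)
    return p / p.sum()  # renormalize for numerical safety

# Toy usage over a 3-token vocabulary.
p = weighted_comb([0.7, 0.2, 0.1], [0.1, 0.6, 0.3], alpha=0.5)
```

Because alpha is a single scalar shared by every token and every timestep, the mixture cannot make context-sensitive, per-token adjustments.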
All methods use the same prompts where applicable (Appendix C.2) and employ greedy decoding. The base (black-box) models used are GPT2-M (Radford et al., 2019), GPT2-XL (Radford et al., 2019), and LLaMA-3.1-8B (Dubey et al., 2024). NewModel, WeightedComb, and the reweighting model in Plugin share the same architecture. For GPT-based models, these use a Transformer encoder with one hidden layer and default configurations. For LLaMA-based models, the architecture consists of a Transformer encoder with one hidden layer, 256 hidden size, 1024 intermediate size, and one attention head. Learning rate and weight decay are cross-validated over $\{1e-5,5e-5,1e-4,5e-4,1e-3,5e-3\}$ and $\{0.01,0.1,1,10\}$ , respectively. Models are trained using AdamW with warmup followed by linear decay, and early stopping is applied if the hyper-validation loss does not decrease for five consecutive epochs.
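The schedule and stopping rule described above can be sketched as follows (a simplified illustration; the helper names and the exact decay formula are our assumptions, not the paper's code):

```python
def lr_at_step(step, total_steps, warmup_steps, peak_lr):
    """Warmup followed by linear decay, as used with AdamW for training the
    reweighting models (peak_lr cross-validated as described above)."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)  # linear warmup to peak
    frac = (total_steps - step) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, frac)                   # linear decay to zero

class EarlyStopping:
    """Stop if hyper-validation loss fails to improve for `patience` epochs
    (patience = 5 in the paper's setup)."""
    def __init__(self, patience=5):
        self.patience, self.best, self.bad_epochs = patience, float("inf"), 0

    def step(self, loss):
        if loss < self.best:
            self.best, self.bad_epochs = loss, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience  # True => stop training
```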
As shown in Tables 1–4 (best in bold, second best underlined), Plugin outperforms the baselines across nearly all datasets, black-box models, and evaluation metrics. NewModel occasionally achieves higher NIST scores due to increased repetition of less-frequent input tokens, but this comes at the cost of coherence, as reflected in the other metrics. WeightedComb performs poorly, indicating that a single mixing weight shared across all tokens is a poor modeling choice. TempNet, which learns one temperature per input and scales all logits uniformly during generation, also underperforms. In contrast, Plugin reweights logits at each timestep, enabling finer, context-sensitive adjustments.
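The contrast between global temperature scaling and per-timestep reweighting can be made concrete with a small decoding sketch (an illustration of the idea only; in the paper the adjustment comes from a trained autoregressive reweighting model, whereas here `reweight_fn` is an arbitrary callable):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def greedy_decode(base_logits_fn, reweight_fn, prompt_ids, max_new_tokens):
    """Greedy decoding with a context-sensitive, per-token additive adjustment
    applied to the black-box model's logits at every timestep."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = base_logits_fn(ids)          # black-box token logits
        adjusted = logits + reweight_fn(ids)  # per-token, per-step adjustment
        ids.append(int(np.argmax(softmax(adjusted))))
    return ids

# Toy usage: a fixed base distribution over 4 tokens; the reweighter boosts
# token 2, flipping the greedy choice away from the base model's favorite.
base = lambda ids: np.array([2.0, 1.0, 0.0, 0.0])
boost = lambda ids: np.array([0.0, 0.0, 3.0, 0.0])
out = greedy_decode(base, boost, prompt_ids=[0], max_new_tokens=2)
```

Replacing `logits + reweight_fn(ids)` with `logits / temperature(ids)` recovers TempNet-style scaling, which can only sharpen or flatten the base distribution; it never changes which token has the highest logit.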
[Figure 2 image: paired bar charts (BLEU, Rouge-L) comparing Base vs. Plugin for GPT2-M under zeroshot, 1FT, 2FT, and 5FT fine-tuning; Plugin exceeds Base in every configuration, e.g., BLEU 0.19 vs. 0.025 at zeroshot and 0.33 vs. 0.25 at 5FT.]
Figure 2: Plugin with increasingly fine-tuned GPT2-M models on the E2E NLG dataset. Results demonstrate that as the quality of the base model improves, the performance of the Plugin improves.
[Figure 3 image: paired bar charts (BLEU, Rouge-L) over reweighting-module choices (No plugin, 1-, 2-, 4-, 8-, 12-layer, GPT2 Small); performance jumps from No plugin (BLEU ≈0.02, Rouge-L ≈0.22) to 1-layer (≈0.19, ≈0.39), plateaus through 12 layers, and peaks with GPT2 Small (≈0.29, ≈0.48).]
Figure 3: Performance of GPT2-M with varying reweighting model complexities on E2E NLG (BLEU, ROUGE-L). A single-layer reweighting model yields significant gains, while additional layers degrade performance due to overfitting. Initializing with GPT2-Small as the reweighting model improves performance, demonstrating the benefits of leveraging small pretrained models.
We note that the absolute numbers may not appear competitive with state-of-the-art results, because (a) we restrict to greedy decoding (Section 4.2), and (b) Web NLG and CommonGen use distribution-shifted subsets.
We also conduct a human evaluation on 100 Adidas dataset samples, where three subjects compare outputs from Plugin and ICL-3 using LLaMA-3.1 as the base model. Evaluators select the prediction closest to the ground truth, with Plugin preferred in 81% of cases. Details are in Appendix C.7.
[Figure 4 image: product attributes (name, category, price, color) and the ground-truth description for an Adidas tee; base-model and Plugin outputs; and normalized token-probability charts over “Adidas domain” words at three decoding steps, where Plugin assigns much higher probability than the base model to the generated tokens “keep” (≈0.16 vs. ≈0.02), “comfortable”, and “ambition”.]
Figure 4: Comparison of the adaptation ability of the base model and Plugin on the Adidas dataset. Plugin, enhanced with a reweighting model, generates text that better aligns with the “Adidas domain”. The bottom row illustrates token probabilities for key Adidas-related words at different decoding steps, showing how the reweighting model influences token selection.
7.2 Plugin as a Wrapper
Table 5: Performance comparison of BDPL and BDPL + Plugin.
| Dataset | Method | BLEU | Rouge-L | METEOR | CIDEr | NIST |
| --- | --- | --- | --- | --- | --- | --- |
| E2E NLG | BDPL | 0.2287 | 0.3922 | 0.4628 | 0.4216 | 0.8625 |
| E2E NLG | BDPL + Plugin | 0.4527 | 0.6027 | 0.6214 | 0.7002 | 2.0817 |
| Web NLG | BDPL | 0.1024 | 0.3017 | 0.3527 | 0.4321 | 0.2631 |
| Web NLG | BDPL + Plugin | 0.2137 | 0.5928 | 0.5766 | 1.0826 | 0.6142 |
| CommonGen | BDPL | 0.1023 | 0.2936 | 0.3362 | 0.2517 | 0.4226 |
| CommonGen | BDPL + Plugin | 0.2614 | 0.5241 | 0.5016 | 0.8251 | 0.9261 |
| Adidas | BDPL | 0.0417 | 0.1710 | 0.1826 | 0.0861 | 0.6034 |
| Adidas | BDPL + Plugin | 0.0623 | 0.1759 | 0.2148 | 0.1325 | 0.7024 |
If logit access is available, Plugin can be applied on top of any prompt-based method using that method's best-found prompt; for example, our Zeroshot prompt is reused across methods. We also apply Plugin to the Black-box Discrete Prompt Learning (BDPL) approach of Diao et al. (2022), following their recommended budget of 75 API calls. Table 5 shows results on all datasets with GPT2-XL as the base model. Plugin alone outperforms BDPL (compare with Tables 1–4), and their combination yields further gains, underscoring the utility of logit-level access in strengthening prompt-based methods.
7.3 Ablation Study
We now present ablation studies examining various aspects of the Plugin model. We report results using GPT2-M as the base model on the E2E NLG dataset; observations are similar for other base models and datasets (Appendix C.5).
Impact of Base Model Quality.
We fine-tune GPT2-M for varying epochs, denoted as 1FT (one epoch), 2FT (two epochs), and 5FT (five epochs), and train a Plugin model for each. Figure 2 shows that as the base model’s task-specific quality improves, the Plugin’s performance improves.
Complexity of the Reweighting Model in Plugin.
We train Plugin models with reweighting architectures varying from 1 to 12 transformer layers while keeping other configurations unchanged. Additionally, we train a variant where the reweighting model is initialized with GPT2-Small. As shown in Figure 3, a single-layer reweighting model yields significant improvements over the base GPT2-M model, while additional layers (e.g., 2, 4, 8, 12) offer diminishing returns and slight performance decline due to overfitting on the small validation set of E2E NLG. This suggests that more data is required for learning complex reweighting models. Notably, initializing with a pretrained GPT2-Small substantially improves performance, underscoring the advantage of using small pretrained models for reweighting due to their inherent autoregressive properties.
7.4 Qualitative Analysis and Case Study
Plugin adapting to distribution shift.
We evaluate Plugin on the distribution-shifted Web NLG and CommonGen datasets using LLaMA-3.1-8B as the base model. Web NLG training data contains only Infrastructure concepts, while the validation and test sets include Person concepts. Similarly, CommonGen training data features only man, whereas the validation and test sets contain both man and woman. The base model is fine-tuned on the training data, and Plugin is trained on the validation data using the fine-tuned model as the base. These settings reflect varying, even somewhat adversarial, degrees of domain shift, since the training distributions induce biases (e.g., an overemphasis on infrastructure or male-related concepts) that Plugin must correct during inference.
Using GPT-4o (Hurst et al., 2024) as an evaluator, the fine-tuned Web NLG model generates only 17.99% Person-related sentences, while Plugin increases this to 71.34%. On CommonGen, the fine-tuned model generates 10.37% Woman-related sentences, whereas Plugin improves this to 31.92%. These results highlight Plugin's ability to adapt under distribution shift and mitigate biases in the base model.
Case study: Plugin adapting to a domain (extreme distribution shift).
We examine token probabilities during inference for LLaMA-3.1-8B and Plugin to assess domain adaptation on the Adidas dataset, whose product-centric language and brand-specific tone diverge significantly from the general pretraining distribution of black-box LLMs; this setup can thus be viewed as an extreme distribution shift. After removing stopwords, we extract the top-50 most frequent words, which define the “Adidas domain”. Figure 4 illustrates this adaptation: the first row shows product attributes and ground-truth references; the second row compares outputs from the base model (left) and Plugin (right); the third row visualizes model probabilities for “Adidas domain” words at three decoding steps.
As seen in Figure 4, Plugin dynamically reweights probabilities to align with domain-specific language. At step 23, “keep” is significantly upweighted. At step 48, “comfortable” and “dry” gain prominence over “fit,” which the base model favors. At step 59, “recycled” is preferred by Plugin, aligning with the ground truth, while the base model favors “running” and “products”. This demonstrates that Plugin effectively steers generation toward domain-specific terminology, whereas the base model, trained on broad corpora, lacks inherent domain preference.
Unlike methods that prune or suppress tokens, Plugin softly reweights token probabilities without eliminating any vocabulary candidates. This preserves full vocabulary coverage while amplifying domain-specific terms. To quantify this, we measure the total occurrences of the top-50 “Adidas domain” words in the generated outputs: Plugin’s outputs include these terms at a rate of 25.6%, compared to 13.8% for the base model, indicating substantially improved alignment with domain language.
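The coverage measurement above can be sketched as follows. This is a minimal illustration, not the paper's actual evaluation script: the tiny stopword list, the toy corpus, and the helper names `top_domain_words` and `domain_coverage` are our own assumptions.

```python
from collections import Counter
import re

# Minimal illustrative stopword list; the paper does not specify which list was used.
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "for",
             "with", "is", "are", "that", "this", "on", "you", "your"}

def top_domain_words(corpus, k=50):
    """Return the k most frequent non-stopword tokens, defining the 'domain' vocabulary."""
    tokens = [t for text in corpus
              for t in re.findall(r"[a-z']+", text.lower())
              if t not in STOPWORDS]
    return {w for w, _ in Counter(tokens).most_common(k)}

def domain_coverage(generations, domain_words):
    """Fraction of generated tokens that fall inside the domain vocabulary."""
    tokens = [t for text in generations
              for t in re.findall(r"[a-z']+", text.lower())]
    if not tokens:
        return 0.0
    return sum(t in domain_words for t in tokens) / len(tokens)

corpus = ["Stay dry and comfortable in recycled fabric.",
          "Comfortable fit keeps you dry on every run."]
domain = top_domain_words(corpus, k=5)
print(domain_coverage(["Recycled materials keep you comfortable."], domain))
```

The same coverage statistic can then be compared between base-model and Plugin outputs, mirroring the 13.8% vs. 25.6% comparison reported above.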
8 Conclusion
We propose Plugin, a token-level probability reweighting framework that adapts black-box LLMs using only logits and a small amount of task-specific data. By framing next-token prediction as a label noise correction problem, we obtain both theoretical guarantees and empirical effectiveness across multiple datasets and models. Our findings highlight the potential of logit-based adaptation and advocate for broader access to token logits in closed-source LLMs.
Acknowledgements
HW acknowledges support by Fonds de recherche du Québec – Nature et technologies (FRQNT) and Borealis AI. SK acknowledges support by NSF 2046795 and 2205329, IES R305C240046, the MacArthur Foundation, Stanford HAI, OpenAI, and Google.
Impact Statement
This work introduces a powerful middle ground between fully black-box APIs and fully white-box access to large language models (LLMs), addressing a critical constraint faced by developers: the inability to adapt models when weights and architecture are inaccessible. By leveraging token-level logits—without requiring access to model weights or architecture—our approach enables meaningful adaptation of closed-source LLMs for domain-specific tasks. This has far-reaching implications for both research and industry: it empowers developers to customize models within privacy-preserving, IP-sensitive environments while ensuring greater control, transparency, and safety. Our findings advocate for broader logit access as a scalable, secure, and effective interface—bridging the gap between usability and protection of proprietary models—and open new possibilities for equitable, context-aware language generation in real-world applications.
While Plugin effectively adapts black-box LLMs, it also has limitations. Since it only reweights token probabilities without modifying internal representations or embeddings, it may struggle with tasks requiring deep structural adaptation, such as complex reasoning; further research is needed here. Additionally, although Plugin avoids full fine-tuning, training a separate reweighting model introduces computational overhead compared to prompt tuning or in-context learning, with efficiency depending on the complexity of the reweighting model and the availability of task-specific data.
References
- adi (2023) Adidas us retail products dataset. Kaggle, 2023. URL https://www.kaggle.com/datasets/whenamancodes/adidas-us-retail-products-dataset.
- Achiam et al. (2023) Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Bai et al. (2022a) Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022a.
- Bai et al. (2022b) Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022b.
- Banerjee & Lavie (2005) Banerjee, S. and Lavie, A. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp. 65–72, 2005.
- Chaudhuri et al. (2015) Chaudhuri, K., Kakade, S. M., Netrapalli, P., and Sanghavi, S. Convergence rates of active learning for maximum likelihood estimation. Advances in Neural Information Processing Systems, 28, 2015.
- Christiano et al. (2017) Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
- Dathathri et al. (2020) Dathathri, S., Madotto, A., Lan, J., Hung, J., Frank, E., Molino, P., Yosinski, J., and Liu, R. Plug and play language models: A simple approach to controlled text generation. In International Conference on Learning Representations, 2020.
- Dettmers et al. (2024) Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36, 2024.
- Diao et al. (2022) Diao, S., Huang, Z., Xu, R., Li, X., Lin, Y., Zhou, X., and Zhang, T. Black-box prompt learning for pre-trained language models. arXiv preprint arXiv:2201.08531, 2022.
- Doddington (2002) Doddington, G. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In International conference on Human Language Technology Research, pp. 138–145, 2002.
- Dong et al. (2024) Dong, Q., Li, L., Dai, D., Zheng, C., Ma, J., Li, R., Xia, H., Xu, J., Wu, Z., Chang, B., et al. A survey on in-context learning. In Conference on Empirical Methods in Natural Language Processing, pp. 1107–1128, 2024.
- Dubey et al. (2024) Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- Dušek et al. (2020) Dušek, O., Novikova, J., and Rieser, V. Evaluating the state-of-the-art of end-to-end natural language generation: The e2e nlg challenge. Computer Speech & Language, 59:123–156, 2020.
- Frostig et al. (2015) Frostig, R., Ge, R., Kakade, S. M., and Sidford, A. Competing with the empirical risk minimizer in a single pass. In Conference on learning theory, pp. 728–763. PMLR, 2015.
- Gao et al. (2024) Gao, G., Taymanov, A., Salinas, E., Mineiro, P., and Misra, D. Aligning llm agents by learning latent preference from user edits. arXiv preprint arXiv:2404.15269, 2024.
- Gardent et al. (2017) Gardent, C., Shimorina, A., Narayan, S., and Perez-Beltrachini, L. Creating training corpora for nlg micro-planning. In Annual Meeting of the Association for Computational Linguistics, pp. 179–188, 2017.
- Hiranandani et al. (2021) Hiranandani, G., Mathur, J., Narasimhan, H., Fard, M. M., and Koyejo, S. Optimizing black-box metrics with iterative example weighting. In International Conference on Machine Learning, pp. 4239–4249. PMLR, 2021.
- Hsu et al. (2012) Hsu, D., Kakade, S., Zhang, T., et al. A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 17, 2012.
- Hu et al. (2021) Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- Hu et al. (2023a) Hu, Z., Wang, L., Lan, Y., Xu, W., Lim, E.-P., Bing, L., Xu, X., Poria, S., and Lee, R. LLM-adapters: An adapter family for parameter-efficient fine-tuning of large language models. In Conference on Empirical Methods in Natural Language Processing, pp. 5254–5276, 2023a.
- Hu et al. (2023b) Hu, Z., Wang, L., Lan, Y., Xu, W., Lim, E.-P., Bing, L., Xu, X., Poria, S., and Lee, R. Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models. In Conference on Empirical Methods in Natural Language Processing, pp. 5254–5276, 2023b.
- Huang et al. (2024) Huang, Y., Liu, Y., Thirukovalluru, R., Cohan, A., and Dhingra, B. Calibrating long-form generations from large language models. arXiv preprint arXiv:2402.06544, 2024.
- Hurst et al. (2024) Hurst, A., Lerer, A., Goucher, A. P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.
- Kapoor et al. (2024) Kapoor, S., Gruver, N., Roberts, M., Pal, A., Dooley, S., Goldblum, M., and Wilson, A. Calibration-tuning: Teaching large language models to know what they don’t know. In Workshop on Uncertainty-Aware NLP, pp. 1–14, 2024.
- Kojima et al. (2022) Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35, 2022.
- Koyejo et al. (2014) Koyejo, O. O., Natarajan, N., Ravikumar, P. K., and Dhillon, I. S. Consistent binary classification with generalized performance metrics. Advances in neural information processing systems, 27, 2014.
- Krause et al. (2021) Krause, B., Gotmare, A. D., McCann, B., Keskar, N. S., Joty, S., Socher, R., and Rajani, N. F. Gedi: Generative discriminator guided sequence generation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 4929–4952, 2021.
- Lattimore & Szepesvári (2020) Lattimore, T. and Szepesvári, C. Bandit algorithms. Cambridge University Press, 2020.
- Lester et al. (2021) Lester, B., Al-Rfou, R., and Constant, N. The power of scale for parameter-efficient prompt tuning. In Conference on Empirical Methods in Natural Language Processing, pp. 3045–3059, 2021.
- Li et al. (2022) Li, X., Thickstun, J., Gulrajani, I., Liang, P. S., and Hashimoto, T. B. Diffusion-lm improves controllable text generation. Advances in Neural Information Processing Systems, 35:4328–4343, 2022.
- Li & Liang (2021) Li, X. L. and Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. In Annual Meeting of the Association for Computational Linguistics, pp. 4582–4597, 2021.
- Lin et al. (2020) Lin, B. Y., Zhou, W., Shen, M., Zhou, P., Bhagavatula, C., Choi, Y., and Ren, X. Commongen: A constrained text generation challenge for generative commonsense reasoning. In Findings of the Association for Computational Linguistics, pp. 1823–1840, 2020.
- Lin (2004) Lin, C.-Y. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pp. 74–81, 2004.
- Lin & Och (2004) Lin, C.-Y. and Och, F. J. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Annual meeting of the association for computational linguistics, pp. 605–612, 2004.
- Lipton et al. (2018) Lipton, Z., Wang, Y.-X., and Smola, A. Detecting and correcting for label shift with black box predictors. In International conference on machine learning, pp. 3122–3130. PMLR, 2018.
- Liu et al. (2021) Liu, A., Sap, M., Lu, X., Swayamdipta, S., Bhagavatula, C., Smith, N. A., and Choi, Y. Dexperts: Decoding-time controlled text generation with experts and anti-experts. In Annual Meeting of the Association for Computational Linguistics, 2021.
- Liu et al. (2024) Liu, A., Han, X., Wang, Y., Tsvetkov, Y., Choi, Y., and Smith, N. A. Tuning language models by proxy. arXiv preprint arXiv:2401.08565, 2024.
- Long et al. (2023) Long, Q., Wang, W., and Pan, S. Adapt in contexts: Retrieval-augmented domain adaptation via in-context learning. In Conference on Empirical Methods in Natural Language Processing, pp. 6525–6542, 2023.
- Mireshghallah et al. (2022) Mireshghallah, F., Goyal, K., and Berg-Kirkpatrick, T. Mix and match: Learning-free controllable text generationusing energy language models. In Annual Meeting of the Association for Computational Linguistics, pp. 401–415, 2022.
- Mukherjee et al. (2022) Mukherjee, S., Tripathy, A. S., and Nowak, R. Chernoff sampling for active testing and extension to active regression. In International Conference on Artificial Intelligence and Statistics, pp. 7384–7432. PMLR, 2022.
- Naghshvar & Javidi (2013) Naghshvar, M. and Javidi, T. Active sequential hypothesis testing. 2013.
- Narasimhan et al. (2015) Narasimhan, H., Ramaswamy, H., Saha, A., and Agarwal, S. Consistent multiclass algorithms for complex performance measures. In International Conference on Machine Learning, pp. 2398–2407. PMLR, 2015.
- Ouyang et al. (2022) Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Papineni et al. (2002) Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. Bleu: a method for automatic evaluation of machine translation. In Annual meeting of the Association for Computational Linguistics, pp. 311–318, 2002.
- Patrini et al. (2017) Patrini, G., Rozza, A., Krishna Menon, A., Nock, R., and Qu, L. Making deep neural networks robust to label noise: A loss correction approach. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1944–1952, 2017.
- Qiu et al. (2024) Qiu, Z.-H., Guo, S., Xu, M., Zhao, T., Zhang, L., and Yang, T. To cool or not to cool? temperature network meets large foundation models via dro. arXiv preprint arXiv:2404.04575, 2024.
- Radford et al. (2019) Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- Rafailov et al. (2024) Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024.
- Sanh et al. (2022) Sanh, V., Webson, A., Raffel, C., Bach, S., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Raja, A., Dey, M., et al. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations, 2022.
- Shen et al. (2024) Shen, M., Das, S., Greenewald, K., Sattigeri, P., Wornell, G. W., and Ghosh, S. Thermometer: Towards universal calibration for large language models. In International Conference on Machine Learning, 2024.
- Singh et al. (2023) Singh, S. P., Hofmann, T., and Schölkopf, B. The hessian perspective into the nature of convolutional neural networks. arXiv preprint arXiv:2305.09088, 2023.
- Song et al. (2023) Song, C. H., Wu, J., Washington, C., Sadler, B. M., Chao, W.-L., and Su, Y. Llm-planner: Few-shot grounded planning for embodied agents with large language models. In IEEE/CVF International Conference on Computer Vision, pp. 2998–3009, 2023.
- Sukhbaatar et al. (2015) Sukhbaatar, S., Bruna, J., Paluri, M., Bourdev, L., and Fergus, R. Training convolutional networks with noisy labels. In International Conference on Learning Representations, 2015.
- Torroba-Hennigen et al. (2025) Torroba-Hennigen, L., Lang, H., Guo, H., and Kim, Y. On the duality between gradient transformations and adapters. arXiv preprint arXiv:2502.13811, 2025.
- Ulmer et al. (2024) Ulmer, D., Gubri, M., Lee, H., Yun, S., and Oh, S. J. Calibrating large language models using their generations only. arXiv preprint arXiv:2403.05973, 2024.
- Vedantam et al. (2015) Vedantam, R., Lawrence Zitnick, C., and Parikh, D. Cider: Consensus-based image description evaluation. In IEEE conference on computer vision and pattern recognition, pp. 4566–4575, 2015.
- Wei et al. (2021) Wei, J., Bosma, M., Zhao, V., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2021.
- Wolf (2020) Wolf, T. Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2020.
- Yang & Klein (2021) Yang, K. and Klein, D. Fudge: Controlled text generation with future discriminators. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3511–3535, 2021.
- Zhang et al. (2023) Zhang, H., Zhang, Y.-F., Yu, Y., Madeka, D., Foster, D., Xing, E., Lakkaraju, H., and Kakade, S. A study on the calibration of in-context learning. arXiv preprint arXiv:2312.04021, 2023.
- Zhang et al. (2024) Zhang, Y., Chen, C., Ding, T., Li, Z., Sun, R., and Luo, Z.-Q. Why transformers need adam: A hessian perspective. arXiv preprint arXiv:2402.16788, 2024.
- Zhu et al. (2023) Zhu, C., Xu, B., Wang, Q., Zhang, Y., and Mao, Z. On the calibration of large language models and alignment. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 9778–9795, 2023.
Appendix A Algorithm Details
We provide a summarized form of the training and inference algorithms for the Plugin model below.
Algorithm 1 Training and Inference for the Plugin Model
Input: Black-box model $B$, reweighting model $R$, clean training data $\mathcal{D}$, vocabulary $V$.
Output: Plugin model predictions $\bm{x}_{1:T}$ for a given sequence.
1: Training Phase:
2: for each sequence $s∈\mathcal{D}$ do
3: Compute token probabilities $\{\bm{b}_{1},\bm{b}_{2},...,\bm{b}_{m}\}$ using $B$ .
4: Compute token probabilities $\{\bm{r}_{1},\bm{r}_{2},...,\bm{r}_{m}\}$ using $R$ .
5: Combine probabilities: ${\bm{p}}_{i}=\frac{\bm{b}_{i}\odot\bm{r}_{i}}{\|\bm{b}_{i}\odot\bm{r}_{i}\|_{1}}$ for $i∈[m]$ .
6: Compute sequence loss $\ell_{s}=-\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{|V|}\log({\bm{p}}_{i})\odot\bm{e}_{j}$.
7: Update parameters of $R$ using back-propagation. Freeze $B$ .
8: end for
9: Inference Phase:
10: Initialize sequence $\bm{x}_{1:T}=\{\}$ .
11: for each token position $t=1$ to $T$ do
12: Compute token probabilities $\bm{b}_{t}$ using $B$ .
13: Compute token probabilities $\bm{r}_{t}$ using $R$ .
14: Combine probabilities: ${\bm{p}}_{t}=\frac{\bm{b}_{t}\odot\bm{r}_{t}}{\|\bm{b}_{t}\odot\bm{r}_{t}\|_{1}}$ .
15: Predict token: $\bm{x}_{t}=\operatorname*{argmax}_{V}({\bm{p}}_{t})$ .
16: Append $\bm{x}_{t}$ to $\bm{x}_{1:T}$ .
17: end for
18: Return: $\bm{x}_{1:T}$
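As a concrete sketch of Algorithm 1's inference phase, the snippet below combines two toy next-token distributions via the elementwise product and renormalization of lines 5 and 14, then decodes greedily (lines 10-17). The random stand-ins for $B$ and $R$ are hypothetical placeholders for illustration only, not the actual black-box and reweighting models.

```python
import numpy as np

def combine(b, r):
    """Plugin combination: elementwise product of the two
    distributions, renormalized (Algorithm 1, lines 5 and 14)."""
    p = b * r
    return p / p.sum()

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
V = 8  # toy vocabulary size

def black_box_probs(context):
    """Hypothetical stand-in for the black-box model B."""
    return softmax(rng.normal(size=V))

def reweight_probs(context):
    """Hypothetical stand-in for the reweighting model R."""
    return softmax(rng.normal(size=V))

# Greedy inference loop (Algorithm 1, lines 10-17).
x = []
for t in range(5):
    b_t = black_box_probs(x)
    r_t = reweight_probs(x)
    p_t = combine(b_t, r_t)
    x.append(int(np.argmax(p_t)))
print(x)
```

Because `combine` only rescales probabilities, no vocabulary entry is ever pruned to exactly zero unless one of the two models already assigns it zero mass, matching the "soft reweighting" behavior described in Section 7.4.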
Appendix B Proof of Main Convergence Theorem
We define the following assumption on the smoothness and regularity of the loss function.
**Assumption B.1**
*We assume the following hold with probability $1$:
1. (Convexity of $\ell_{s}$): The loss function $\ell_{s}$ is convex for all $s\in[t]$.
2. (Smoothness of $\ell_{s}$): $\ell_{s}$ is smooth, i.e., its first, second, and third derivatives exist at all interior points of $\bm{\Theta}$.
3. (Regularity conditions):
   1. $\bm{\Theta}$ is compact and $\ell_{s}(\bm{\theta})$ is bounded for all $\bm{\theta}\in\bm{\Theta}$ and all $s\in[t]$.
   2. $\bm{\theta}_{*}$ is an interior point of $\bm{\Theta}$.
   3. $\nabla^{2}\ell_{s}(\bm{\theta}_{*})$ is positive definite for all $s\in[t]$.
   4. There exists a neighborhood $\mathcal{B}$ of $\bm{\theta}_{*}$ and a constant $C_{1}$ such that $\nabla^{2}\ell_{s}(\bm{\theta})$ is $C_{1}$-Lipschitz, i.e., $\left\|\nabla^{2}\ell_{s}(\bm{\theta})-\nabla^{2}\ell_{s}(\bm{\theta}^{\prime})\right\|_{*}\leq C_{1}\left\|\bm{\theta}-\bm{\theta}^{\prime}\right\|_{\nabla^{2}\mathcal{L}_{s}(\bm{\theta}_{*})}$ for $\bm{\theta},\bm{\theta}^{\prime}$ in this neighborhood.
4. (Concentration at $\bm{\theta}_{*}$): $\left\|\nabla\ell_{s}(\bm{\theta}_{*})\right\|_{\left(\nabla^{2}\mathcal{L}_{s}(\bm{\theta}_{*})\right)^{-1}}\leq C_{2}$ holds with probability one.*
**Lemma B.2**
*(Proposition 2 of Hsu et al. (2012)) Let $\mathbf{u}_{1},\ldots,\mathbf{u}_{n}$ be a martingale difference vector sequence (i.e., $\mathbb{E}\left[\mathbf{u}_{i}\mid\mathbf{u}_{1},\ldots,\mathbf{u}_{i-1}\right]=0$ for all $i=1,\ldots,n$) such that
$$
\sum_{i=1}^{n}\mathbb{E}\left[\left\|\mathbf{u}_{i}\right\|^{2}\mid\mathbf{u}_{1},\ldots,\mathbf{u}_{i-1}\right]\leq v\quad\text{and}\quad\left\|\mathbf{u}_{i}\right\|\leq b
$$
for all $i=1,\ldots,n$, almost surely. Then, for all $t>0$,
$$
\operatorname{Pr}\left[\left\|\sum_{i=1}^{n}\mathbf{u}_{i}\right\|>\sqrt{v}+\sqrt{8vt}+(4/3)bt\right]\leq e^{-t}.
$$*
**Lemma B.3**
*The probability that $\|\nabla\widehat{\mathcal{L}}_{t}(\bm{\theta}_{*})\|_{\left(\nabla^{2}\mathcal{L}\left(\bm{\theta}_{*}\right)\right)^{-1}}$ crosses the threshold $\sqrt{\frac{c\gamma\log(dt)}{t}}>0$ is bounded as
$$
\mathbb{P}\left(\|\nabla\widehat{\mathcal{L}}_{t}(\bm{\theta}_{*})\|_{\left(\nabla^{2}\mathcal{L}_{t}\left(\bm{\theta}_{*}\right)\right)^{-1}}\geq C_{2}\sqrt{\frac{c\gamma\log(dt)}{t}}\right)\leq\frac{1}{t^{c\gamma}}.
$$*
*Proof.* Define $\mathbf{u}_{s}\coloneqq\nabla(Y_{s}-f_{I_{s}}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{s-1}))^{2}$. Then $\mathbf{u}_{1},\mathbf{u}_{2},\ldots,\mathbf{u}_{t}$ are random vectors such that
$$
\mathbb{E}\left[\left\|\sum_{s=1}^{t}\mathbf{u}_{s}\right\|_{\left(\nabla^{2}\mathcal{L}_{t}\left(\bm{\theta}_{*}\right)\right)^{-1}}^{2}\,\bigg|\,\mathbf{u}_{1},\ldots,\mathbf{u}_{s-1}\right]=\mathbb{E}\left[\sum_{s=1}^{t}\mathbf{u}_{s}^{\top}\left(\nabla^{2}\mathcal{L}_{t}\left(\bm{\theta}_{*}\right)\right)^{-1}\mathbf{u}_{s}\,\bigg|\,\mathbf{u}_{1},\ldots,\mathbf{u}_{s-1}\right]\leq tC^{2}_{2}.
$$
We also have that $\|\mathbf{u}_{s}\|\leq C_{2}$. Finally, we have that
$$
\mathbb{E}[\nabla_{\bm{\theta}=\bm{\theta}_{*}}\mathbf{u}_{s}]=-2\sum_{s=1}^{t}p_{\widehat{\bm{\theta}}_{s-1}}\left(f_{I_{s}}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{s-1})-f_{I_{s}}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{s-1})\right)\nabla_{\bm{\theta}=\bm{\theta}_{*}}f_{I_{s}}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{s-1})=0.
$$
Then, following Lemma B.2 and setting $\epsilon=c\gamma\log(dt)$, we can show that
$$
\mathbb{P}\left(\left\|\frac{1}{t}\sum_{s=1}^{t}\mathbf{u}_{s}\right\|^{2}_{\left(\nabla^{2}\mathcal{L}_{t}\left(\bm{\theta}_{*}\right)\right)^{-1}}-\mathbb{E}\left[\left\|\frac{1}{t}\sum_{s=1}^{t}\mathbf{u}_{s}\right\|^{2}_{\left(\nabla^{2}\mathcal{L}_{t}\left(\bm{\theta}_{*}\right)\right)^{-1}}\right]>\frac{1}{t}\sqrt{8tC_{2}^{2}\epsilon}+\frac{4C_{2}}{3\epsilon}\right)\leq e^{-\epsilon}.
$$
The claim of the lemma follows. ∎
**Lemma B.4**
*Let the $j$-th row and $k$-th column entry of the Hessian matrix $\nabla^{2}_{\bm{\theta}=\bm{\theta}^{\prime}}(\ell_{s}(\bm{\theta}))$ be denoted $[\nabla^{2}_{\bm{\theta}=\bm{\theta}^{\prime}}(\ell_{s}(\bm{\theta}))]_{jk}$. Then we have that
$$
[\nabla^{2}_{\bm{\theta}=\bm{\theta}^{\prime}}(\ell_{s}(\bm{\theta}))]_{jk}=2\frac{\partial f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{j}}\frac{\partial f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{k}}+2\left(f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})-Y_{s}\right)\frac{\partial^{2}f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{j}\partial\bm{\theta}_{k}}.
$$*
*Proof.* This lemma follows from Frostig et al. (2015) and Mukherjee et al. (2022), adapted to our setting with the squared loss and transition function $f_{I_{s}}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{s-1})$. We evaluate the Hessian $\nabla^{2}_{\bm{\theta}=\bm{\theta}^{\prime}}(\ell_{s}(\bm{\theta}))$ at any $\bm{\theta}^{\prime}\in\bm{\Theta}$, denoting its $j$-th row, $k$-th column entry by $[\nabla^{2}_{\bm{\theta}=\bm{\theta}^{\prime}}(\ell_{s}(\bm{\theta}))]_{jk}$. Then we can show that
$$
[\nabla^{2}_{\bm{\theta}=\bm{\theta}^{\prime}}(\ell_{s}(\bm{\theta}))]_{jk}\coloneqq\frac{\partial}{\partial\bm{\theta}_{j}}\left[\frac{\partial(f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})-Y_{s})^{2}}{\partial\bm{\theta}_{k}}\right]=\frac{\partial}{\partial\bm{\theta}_{j}}\left[2(f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})-Y_{s})\frac{\partial f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{k}}\right].
$$
The claim of the lemma follows. ∎
**Lemma B.5**
*Let the $j$-th row and $k$-th column entry of the Hessian matrix $\nabla^{2}_{\bm{\theta}=\bm{\theta}^{\prime}}(\mathbb{E}[\ell_{s}(\bm{\theta})|\mathcal{F}^{s-1}])$ be denoted $[\nabla^{2}_{\bm{\theta}=\bm{\theta}^{\prime}}(\mathbb{E}[\ell_{s}(\bm{\theta})|\mathcal{F}^{s-1}])]_{jk}$. Then we have that
$$
\left[\nabla^{2}_{\bm{\theta}=\bm{\theta}^{\prime}}\mathbb{E}[\ell_{s}(\bm{\theta})|\mathcal{F}^{s-1}]\right]_{jk}=2\sum_{i=1}^{|V|}p_{\widehat{\bm{\theta}}_{s-1}}(i)\left(\frac{\partial f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{j}}\frac{\partial f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{k}}-\left(f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})-f_{I_{s}}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{s-1})\right)\frac{\partial^{2}f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{j}\partial\bm{\theta}_{k}}\right).
$$*
*Proof.* This lemma follows from Frostig et al. (2015) and Mukherjee et al. (2022), adapted to our setting with the squared loss, transition function $f_{I_{s}}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{s-1})$, and sampling distribution $\mathbf{p}_{\widehat{\bm{\theta}}_{s-1}}$; we include it here for completeness. We evaluate the Hessian $\nabla^{2}_{\bm{\theta}=\bm{\theta}^{\prime}}(\mathbb{E}[\ell_{s}(\bm{\theta})|\mathcal{F}^{s-1}])$ at any $\bm{\theta}^{\prime}\in\bm{\Theta}$, denoting its $j$-th row, $k$-th column entry by $[\nabla^{2}_{\bm{\theta}=\bm{\theta}^{\prime}}(\mathbb{E}[\ell_{s}(\bm{\theta})|\mathcal{F}^{s-1}])]_{jk}$. Then we can show that
$$
\begin{aligned}
\nabla^{2}_{\bm{\theta}=\bm{\theta}^{\prime}}\mathbb{E}[\ell_{s}(\bm{\theta})|\mathcal{F}^{s-1}]&=\nabla^{2}_{\bm{\theta}=\bm{\theta}^{\prime}}\left(f^{2}_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})+\mathbb{E}[Y^{2}_{s}|\mathcal{F}^{s-1}]-2\mathbb{E}[Y_{s}|\mathcal{F}^{s-1}]f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})\right)\\
&=\nabla^{2}_{\bm{\theta}=\bm{\theta}^{\prime}}\sum_{i=1}^{|V|}p_{\widehat{\bm{\theta}}_{s-1}}(i)\left(f^{2}_{i}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})+f^{2}_{i}(\bm{\theta}^{\prime};x_{i},x_{j},\mathcal{F}^{s-1})+\frac{1}{2}-2f_{I_{s}}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{s-1})f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})\right)\\
&=\nabla^{2}_{\bm{\theta}=\bm{\theta}^{\prime}}\sum_{i=1}^{|V|}p_{\widehat{\bm{\theta}}_{s-1}}(i)\left(\left(f_{I_{s}}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{s-1})-f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})\right)^{2}+\frac{1}{2}\right)\\
&=\nabla^{2}_{\bm{\theta}=\bm{\theta}^{\prime}}\sum_{i=1}^{|V|}p_{\widehat{\bm{\theta}}_{s-1}}(i)\left(f_{I_{s}}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{s-1})-f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})\right)^{2}
\end{aligned}
\tag{6}
$$
We now denote the $j$-th row, $k$-th column entry of the Hessian $\nabla^{2}_{\bm{\theta}=\bm{\theta}^{\prime}}\big((f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})-f_{I_{s}}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{s-1}))^{2}\big)$ by $\big[\nabla^{2}_{\bm{\theta}=\bm{\theta}^{\prime}}\big((f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})-f_{I_{s}}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{s-1}))^{2}\big)\big]_{jk}$. Then we can show that
$$
\big[\nabla^{2}_{\bm{\theta}=\bm{\theta}_{*}}\big((f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})-f_{I_{s}}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{s-1}))^{2}\big)\big]_{jk}=2\frac{\partial f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{j}}\frac{\partial f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{k}}-2\left(f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})-f_{I_{s}}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{s-1})\right)\frac{\partial^{2}f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{j}\partial\bm{\theta}_{k}}.
$$
Plugging this back into Equation 6, we get
$$
\left[\nabla^{2}_{\bm{\theta}=\bm{\theta}^{\prime}}\mathbb{E}[\ell_{s}(\bm{\theta})|\mathcal{F}^{s-1}]\right]_{jk}=2\sum_{i=1}^{|V|}p_{\widehat{\bm{\theta}}_{s-1}}(i)\left(\frac{\partial f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{j}}\frac{\partial f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{k}}-\left(f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})-f_{I_{s}}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{s-1})\right)\frac{\partial^{2}f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{j}\partial\bm{\theta}_{k}}\right).
$$
∎
**Lemma B.6**
*The sum of the differences of the Hessians, $\sum_{s=1}^{t}\nabla^{2}_{\bm{\theta}=\bm{\theta}^{\prime}}\ell_{s}(\bm{\theta})-\mathbb{E}\left[\nabla^{2}_{\bm{\theta}=\bm{\theta}^{\prime}}\ell_{s}(\bm{\theta})\mid\mathcal{F}^{s-1}\right]$, is given by
$$
\sum_{s=1}^{t}\nabla^{2}_{\bm{\theta}=\bm{\theta}^{\prime}}\ell_{s}(\bm{\theta})-\mathbb{E}\left[\nabla^{2}_{\bm{\theta}=\bm{\theta}^{\prime}}\ell_{s}(\bm{\theta})\mid\mathcal{F}^{s-1}\right]=\sum_{s=1}^{t}\bigg(-2(Y_{s}-f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1}))\frac{\partial^{2}f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{j}\partial\bm{\theta}_{k}}+2\frac{\partial f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{j}}\frac{\partial f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{k}}-2\sum_{i=1}^{|V|}p_{\widehat{\bm{\theta}}_{s-1}}(i)\frac{\partial f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{j}}\frac{\partial f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{k}}\bigg).
$$*
*Proof.* This lemma follows directly from Lemma B.4 and Lemma B.5. First note that the difference $\nabla^{2}_{\bm{\theta}=\bm{\theta}^{\prime}}\ell_{s}(\bm{\theta})-\mathbb{E}\left[\nabla^{2}_{\bm{\theta}=\bm{\theta}^{\prime}}\ell_{s}(\bm{\theta})\mid\mathcal{F}^{s-1}\right]$ is given by
$$
\begin{aligned}
&\nabla^{2}_{\bm{\theta}=\bm{\theta}^{\prime}}\ell_{s}(\bm{\theta})-\mathbb{E}\left[\nabla^{2}_{\bm{\theta}=\bm{\theta}^{\prime}}\ell_{s}(\bm{\theta})\mid\mathcal{F}^{s-1}\right]\\
&\overset{(a)}{=}2\frac{\partial f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{j}}\frac{\partial f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{k}}+2\left(f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})-Y_{s}\right)\frac{\partial^{2}f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{j}\partial\bm{\theta}_{k}}\\
&\qquad-2\sum_{i=1}^{|V|}p_{\widehat{\bm{\theta}}_{s-1}}(i)\bigg(\frac{\partial f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{j}}\frac{\partial f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{k}}-\left(f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})-f_{I_{s}}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{s-1})\right)\frac{\partial^{2}f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{j}\partial\bm{\theta}_{k}}\bigg)\\
&=-2(Y_{s}-f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1}))\frac{\partial^{2}f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{j}\partial\bm{\theta}_{k}}+2\frac{\partial f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{j}}\frac{\partial f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{k}}\\
&\qquad-2\sum_{i=1}^{|V|}p_{\widehat{\bm{\theta}}_{s-1}}(i)\frac{\partial f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{j}}\frac{\partial f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{k}}
\end{aligned}
\tag{7}
$$
where $(a)$ follows from Lemma B.4 and Lemma B.5. Summing Equation 7 over $s$, we get
$$
\sum_{s=1}^{t}\nabla^{2}_{\bm{\theta}=\bm{\theta}^{\prime}}\ell_{s}(\bm{\theta})-\mathbb{E}\left[\nabla^{2}_{\bm{\theta}=\bm{\theta}^{\prime}}\ell_{s}(\bm{\theta})\mid\mathcal{F}^{s-1}\right]=\sum_{s=1}^{t}\bigg(-2(Y_{s}-f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1}))\frac{\partial^{2}f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{j}\partial\bm{\theta}_{k}}+2\frac{\partial f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{j}}\frac{\partial f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{k}}-2\sum_{i=1}^{|V|}p_{\widehat{\bm{\theta}}_{s-1}}(i)\frac{\partial f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{j}}\frac{\partial f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})}{\partial\bm{\theta}_{k}}\bigg).
$$
The claim of the lemma follows. ∎
**Lemma B.7**
*Let $\widehat{\mathcal{L}}_{t}(\bm{\theta}_{*})=\frac{1}{t}\sum_{s=1}^{t}\ell_{s}(\bm{\theta}_{*})$ and $\nabla^{2}\mathcal{L}_{t}(\bm{\theta}_{*})=\frac{1}{t}\sum_{s=1}^{t}\nabla^{2}\mathbb{E}[\ell_{s}(\bm{\theta}_{*})|\mathcal{F}^{s-1}]$. Then we can bound
$$
\mathbb{P}\left(\lambda_{\max}(\nabla^{2}\widehat{\mathcal{L}}_{t}(\bm{\theta}_{*})-\nabla^{2}\mathcal{L}_{t}(\bm{\theta}_{*}))>\sqrt{\frac{8C|V|^{2}\eta^{2}\lambda^{2}_{1}c\gamma\log(dt)}{t}}\right)\leq\frac{2}{(dt)^{\gamma}},
$$
where $c>0$ is a constant.*
*Proof*
This lemma differs from Frostig et al. (2015); Mukherjee et al. (2022) as it requires a different concentration bound to account for the squared loss in Assumption 5.2 and the vocabulary size. Recall that $\widehat{\cal L}_{t}(\bm{\theta}_{*})=\frac{1}{t}\sum_{s=1}^{t}\ell_{s}(\bm{\theta}_{*})$ and $\nabla^{2}{\cal L}_{s}(\bm{\theta}_{*})=\nabla^{2}\mathbb{E}[\ell_{s}(\bm{\theta}_{*})|\mathcal{F}^{s-1}]$. We define $\nabla^{2}{\cal L}_{t}(\bm{\theta}_{*})=\frac{1}{t}\sum_{s=1}^{t}\nabla^{2}\mathbb{E}[\ell_{s}(\bm{\theta}_{*})|\mathcal{F}^{s-1}]$. Denote $\mathbf{V}_{s}=2\nabla_{\bm{\theta}=\bm{\theta}_{*}}f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})\nabla_{\bm{\theta}=\bm{\theta}_{*}}f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})^{\top}-2\sum_{i=1}^{|V|}p_{{\widehat{\bm{\theta}}}_{s-1}}(i)\nabla_{\bm{\theta}=\bm{\theta}_{*}}f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})\nabla_{\bm{\theta}=\bm{\theta}_{*}}f_{I_{s}}(\bm{\theta};x_{i},x_{j},\mathcal{F}^{s-1})^{\top}$. Then we can show that,
$$
\displaystyle\mathbb{P}\left(\lambda_{\max}(\nabla^{2}\widehat{\cal L}_{t}(\bm{\theta}_{*})-\nabla^{2}{\cal L}_{t}(\bm{\theta}_{*}))>\sqrt{\dfrac{8C^{2}|V|^{4}\eta^{2}\lambda^{2}_{1}c\gamma\log(dt)}{t}}\right)
\displaystyle=\mathbb{P}\left(\lambda_{\max}\left(\nabla^{2}_{\bm{\theta}=\bm{\theta}_{*}}\frac{1}{t}\sum_{s=1}^{t}\ell_{s}(\bm{\theta})-\frac{1}{t}\sum_{s=1}^{t}\nabla^{2}_{\bm{\theta}=\bm{\theta}_{*}}\mathbb{E}[\ell_{s}(\bm{\theta})|\mathcal{F}^{s-1}]\right)>\sqrt{\dfrac{8C^{2}|V|^{4}\eta^{2}\lambda^{2}_{1}c\gamma\log(dt)}{t}}\right)
\displaystyle=\mathbb{P}\left(\lambda_{\max}\left(\frac{1}{t}\sum_{s=1}^{t}\left(\nabla^{2}_{\bm{\theta}=\bm{\theta}_{*}}\ell_{s}(\bm{\theta})-\nabla^{2}_{\bm{\theta}=\bm{\theta}_{*}}\mathbb{E}[\ell_{s}(\bm{\theta})|\mathcal{F}^{s-1}]\right)\right)>\sqrt{\dfrac{8C^{2}|V|^{4}\eta^{2}\lambda^{2}_{1}c\gamma\log(dt)}{t}}\right)
\displaystyle\overset{(a)}{\leq}\mathbb{P}\left(\lambda_{\max}\left(\frac{C|V|^{2}}{t}\sum_{s=1}^{t}\left(Y_{s}-f_{I_{s}}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{s-1})\right)\nabla^{2}_{\bm{\theta}=\bm{\theta}_{*}}f_{I_{s}}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{s-1})\right.\right.
\displaystyle\qquad\left.\left.+\frac{C|V|^{2}}{t}\sum_{s=1}^{t}\mathbf{V}_{s}\right)>\sqrt{\dfrac{8C^{2}|V|^{4}\eta^{2}\lambda^{2}_{1}c\gamma\log(dt)}{t}}\right)
\displaystyle\leq\mathbb{P}\left(\lambda_{\max}\left(\frac{1}{t}\sum_{s=1}^{t}-2\left(Y_{s}-f_{I_{s}}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{s-1})\right)\nabla^{2}_{\bm{\theta}=\bm{\theta}_{*}}f_{I_{s}}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{s-1})\right)>\frac{1}{2}\sqrt{\dfrac{8\eta^{2}\lambda^{2}_{1}c\gamma\log(dt)}{t}}\right)
\displaystyle\qquad+\mathbb{P}\left(\lambda_{\max}\left(\frac{1}{t}\sum_{s=1}^{t}\mathbf{V}_{s}\right)>\frac{1}{2}\sqrt{\dfrac{8\eta^{2}\lambda^{2}_{1}c\gamma\log(dt)}{t}}\right)
\displaystyle\overset{(b)}{\leq}\mathbb{P}\left(\frac{1}{t}\sum_{s=1}^{t}-2\left(Y_{s}-f_{I_{s}}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{s-1})\right)\lambda_{\max}\left(\nabla^{2}_{\bm{\theta}=\bm{\theta}_{*}}f_{I_{s}}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{s-1})\right)>\frac{1}{2}\sqrt{\dfrac{8\eta^{2}\lambda^{2}_{1}c\gamma\log(dt)}{t}}\right)
\displaystyle\qquad+\mathbb{P}\left(\frac{1}{t}\sum_{s=1}^{t}\lambda_{\max}\left(\mathbf{V}_{s}\right)>\frac{1}{2}\sqrt{\dfrac{8\eta^{2}\lambda^{2}_{1}c\gamma\log(dt)}{t}}\right)
\displaystyle\overset{(c)}{\leq}2\exp\left(-\dfrac{t^{2}8\eta^{2}\lambda_{1}^{2}c\gamma\log(dt)}{4t}\cdot\dfrac{1}{2tc\eta^{2}\lambda_{1}^{2}}\right)\overset{(d)}{\leq}2\left(\dfrac{1}{dt}\right)^{\gamma}. \tag{8}
$$
where $(a)$ follows from substituting the value of $\nabla^{2}_{\bm{\theta}=\bm{\theta}_{*}}\ell_{s}(\bm{\theta})-\nabla^{2}_{\bm{\theta}=\bm{\theta}_{*}}\mathbb{E}[\ell_{s}(\bm{\theta})|\mathcal{F}^{s-1}]$ from Lemma B.6, $(b)$ follows by the triangle inequality, $(c)$ follows by the two concentration inequalities stated below, and $(d)$ follows by simplifying the exponent: $\frac{t^{2}8\eta^{2}\lambda_{1}^{2}c\gamma\log(dt)}{4t}\cdot\frac{1}{2tc\eta^{2}\lambda_{1}^{2}}=\gamma\log(dt)$, so the bound equals $2e^{-\gamma\log(dt)}=2(dt)^{-\gamma}$. Denote $Q_{s}=-2\left(Y_{s}-f_{I_{s}}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{s-1})\right)\lambda_{\max}\left(\nabla^{2}_{\bm{\theta}=\bm{\theta}_{*}}f_{I_{s}}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{s-1})\right)$. Also note that $\lambda_{\max}\left(\nabla^{2}_{\bm{\theta}=\bm{\theta}_{*}}f_{I_{s}}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{s-1})\right)\leq\lambda_{1}$ for all times $s$ using Assumption B.1.
$$
\mathbb{P}\left(\sum_{s=1}^{t}-2\left(Y_{s}-f_{I_{s}}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{s-1})\right)\lambda_{\max}\left(\nabla^{2}_{\bm{\theta}=\bm{\theta}_{*}}f_{I_{s}}(\bm{\theta}_{*};x_{i},x_{j},\mathcal{F}^{s-1})\right)\geq\epsilon\right)=\mathbb{P}\left(-\sum_{s=1}^{t}Q_{s}\geq\epsilon\right)
$$
where $(a)$ follows by Markov’s inequality, $(b)$ follows as $Q_{s}$ is conditionally independent given ${\widehat{\bm{\theta}}}_{s-1}$, $(c)$ follows by unpacking the term $t$ times, and $(d)$ follows by taking $\lambda=\epsilon/4t\lambda_{1}^{2}\eta^{2}$ where $\lambda_{1}$ is defined in Assumption 5.1. Next we bound the second term of (8) below.
$$
\mathbb{P}\left(\sum_{s=1}^{t}\lambda_{\max}\left(\mathbf{V}_{s}\right)\geq\epsilon\right)=\mathbb{P}\left(\lambda\sum_{s=1}^{t}\lambda_{\max}\left(\mathbf{V}_{s}\right)\geq\lambda\epsilon\right)=\mathbb{P}\left(e^{\lambda\sum_{s=1}^{t}\lambda_{\max}\left(\mathbf{V}_{s}\right)}\geq e^{\lambda\epsilon}\right)\overset{(a)}{\leq}e^{-\lambda\epsilon}\mathbb{E}\left[e^{\lambda\sum_{s=1}^{t}\lambda_{\max}\left(\mathbf{V}_{s}\right)}\right]
$$
where $(a)$ follows by Markov’s inequality, and $(b)$ follows as $\lambda_{\max}(\mathbf{V}_{s})$ is conditionally independent given ${\widehat{\bm{\theta}}}_{s-1}$. In inequality $(c)$, using the always-valid upper bound of $2\lambda_{1}$, we have that $\mathbb{E}[\lambda_{\max}(\mathbf{V}_{t})]\leq 2\lambda_{1}$. So the term in inequality $(c)$ becomes $e^{-\lambda\epsilon}e^{2\lambda^{2}t\eta^{2}\lambda_{1}^{2}+4t\lambda\lambda_{1}}$. Hence, we can upper bound inequality $(c)$ by a constant $c>0$ such that $\mathbb{E}[e^{\lambda\lambda_{\max}(\mathbf{V}_{t})}\mid{\widehat{\bm{\theta}}}_{t-1}]\leq e^{2\lambda^{2}\lambda_{1}^{2}\eta^{2}}e^{2\lambda\times 2\lambda_{1}}=\exp(2\lambda^{2}\lambda_{1}^{2}\eta^{2}+4\lambda\lambda_{1})\leq\exp(2c\lambda^{2}\lambda_{1}^{2}\eta^{2})$. Inequality $(d)$ follows by unpacking the term $t$ times, and $(e)$ follows by taking $\lambda=\epsilon/4tc\lambda_{1}^{2}\eta^{2}$ with $\lambda_{1}$ defined in Assumption 5.1. ∎
**Lemma B.8**
*Let ${\widehat{\bm{\theta}}}_{t}-\bm{\theta}_{*}=\left(\nabla^{2}\widehat{\cal L}_{t}(\widetilde{\bm{\theta}}_{t})\right)^{-1}\nabla\widehat{\cal L}_{t}(\bm{\theta}_{*})$ where $\widetilde{\bm{\theta}}_{t}$ is between ${\widehat{\bm{\theta}}}_{t}$ and $\bm{\theta}_{*}$. Then we can show that
$$
\left\|{\widehat{\bm{\theta}}}_{t}-\bm{\theta}_{*}\right\|_{\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)}\leq\left\|\left(\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right)^{1/2}\left(\nabla^{2}\widehat{\cal L}_{t}(\widetilde{\bm{\theta}}_{t})\right)^{-1}\left(\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right)^{1/2}\right\|\left\|\nabla\widehat{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right\|_{\left(\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right)^{-1}}.
$$*
*Proof*
We begin with the definition of $\left\|{\widehat{\bm{\theta}}}_{t}-\bm{\theta}_{*}\right\|_{\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)}$ as follows:
$$
\displaystyle\left\|{\widehat{\bm{\theta}}}_{t}-\bm{\theta}_{*}\right\|_{\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)}\overset{(a)}{=}\sqrt{({\widehat{\bm{\theta}}}_{t}-\bm{\theta}_{*})^{\top}\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)({\widehat{\bm{\theta}}}_{t}-\bm{\theta}_{*})}
\displaystyle\overset{(b)}{=}\sqrt{\nabla\widehat{\cal L}_{t}\left(\bm{\theta}_{*}\right)^{\top}\left(\nabla^{2}\widehat{\cal L}_{t}(\widetilde{\bm{\theta}}_{t})\right)^{-1}\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)\left(\nabla^{2}\widehat{\cal L}_{t}(\widetilde{\bm{\theta}}_{t})\right)^{-1}\nabla\widehat{\cal L}_{t}\left(\bm{\theta}_{*}\right)}
\displaystyle\overset{(c)}{\leq}\left\|\left(\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right)^{1/2}\left(\nabla^{2}\widehat{\cal L}_{t}(\widetilde{\bm{\theta}}_{t})\right)^{-1}\left(\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right)^{1/2}\right\|\left\|\nabla\widehat{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right\|_{\left(\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right)^{-1}}
$$
where $(a)$ follows as $\|x\|_{M}=\sqrt{x^{\top}Mx}$, $(b)$ follows as ${\widehat{\bm{\theta}}}_{t}-\bm{\theta}_{*}=\left(\nabla^{2}\widehat{\cal L}_{t}(\widetilde{\bm{\theta}}_{t})\right)^{-1}\nabla\widehat{\cal L}_{t}(\bm{\theta}_{*})$, and $(c)$ follows from the Cauchy–Schwarz inequality. The claim of the lemma follows. ∎
**Remark B.9**
*The proof of Theorem 1 consists of several steps. In the first step we relate $\nabla^{2}\widehat{\cal L}_{t}(\bm{\theta})$ to $\nabla^{2}{\cal L}_{t}(\bm{\theta}_{*})$ for any $\bm{\theta}$ in a ball $\mathcal{B}$ around $\bm{\theta}_{*}$. The ball $\mathcal{B}$ is assumed in B.1 to be a neighborhood where $\nabla^{2}\ell_{s}(\bm{\theta})$ satisfies a Lipschitz property. The assumptions in B.1 in Appendix B are standard and have also been made by Frostig et al. (2015); Chaudhuri et al. (2015); Mukherjee et al. (2022). Using Assumption 5.1 and B.1, we can show that for a large enough sequence of tokens $t$, as stated in Theorem 1, the following hold: (1) $\nabla^{2}{\cal L}_{t}(\bm{\theta}_{*})$ is sandwiched in the positive semidefinite order between scaled multiples of $\nabla^{2}\widehat{\cal L}_{t}(\bm{\theta})$ for any $\bm{\theta}\in\mathcal{B}$, and (2) the empirical error minimizer ${\widehat{\bm{\theta}}}_{t}$ is in the ball $\mathcal{B}$ with probability $1-1/t^{\gamma}$, which is the good event $\mathcal{E}$. Then, using a Taylor series expansion around ${\widehat{\bm{\theta}}}_{t}$, the fact that $\nabla\widehat{\cal L}_{t}({\widehat{\bm{\theta}}}_{t})=0$, and the relation between $\nabla^{2}\widehat{\cal L}_{t}(\bm{\theta})$ and $\nabla^{2}{\cal L}_{t}(\bm{\theta}_{*})$, we obtain an upper bound on $\lVert{\widehat{\bm{\theta}}}_{t}-\bm{\theta}_{*}\rVert_{\nabla^{2}{\cal L}_{t}(\bm{\theta}_{*})}$ in terms of $\lVert\nabla\widehat{\cal L}_{t}(\bm{\theta}_{*})\rVert_{(\nabla^{2}{\cal L}_{t}(\bm{\theta}_{*}))^{-1}}$, which can be shown to decrease with $t$. Further, $\lVert{\widehat{\bm{\theta}}}_{t}-\bm{\theta}_{*}\rVert_{\nabla^{2}{\cal L}_{t}(\bm{\theta}_{*})}$ can also be used to upper bound ${\cal L}_{t}({\widehat{\bm{\theta}}}_{t})-{\cal L}_{t}(\bm{\theta}_{*})$ via another Taylor series expansion. Finally, we bound $\mathbb{E}[{\cal L}_{t}({\widehat{\bm{\theta}}}_{t})-{\cal L}_{t}(\bm{\theta}_{*})]=\mathbb{E}[({\cal L}_{t}({\widehat{\bm{\theta}}}_{t})-{\cal L}_{t}(\bm{\theta}_{*}))I(\mathcal{E})]+\mathbb{E}[({\cal L}_{t}({\widehat{\bm{\theta}}}_{t})-{\cal L}_{t}(\bm{\theta}_{*}))I(\mathcal{E}^{\complement})]$ where $I(\cdot)$ is the indicator. Since $\mathbb{P}(\mathcal{E}^{\complement})\leq 1/t^{\gamma}$, the second term can be bounded as $\max_{\bm{\theta}\in\bm{\Theta}}\left({\cal L}_{t}(\bm{\theta})-{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right)/t^{\gamma}$, while the first term simplifies to $(1+\rho_{t})\sigma_{t}^{2}/t$.*
**Theorem 1**
*(Restatement of main theorem) Suppose $\ell_{1}(\bm{\theta}),\ell_{2}(\bm{\theta}),\ldots,\ell_{t}(\bm{\theta}):\mathbb{R}^{|V|}\to\mathbb{R}$ are loss functions from a distribution that satisfies Assumptions 5.1, 5.2, and B.1. Define ${\cal L}_{t}(\bm{\theta})=\frac{1}{t}\sum_{s=1}^{t}\mathbb{E}_{x_{s}\sim\mathbf{p}_{{\widehat{\bm{\theta}}}_{s-1}}}[\ell_{s}(\bm{\theta})|\mathcal{F}^{s-1}]$, where ${\widehat{\bm{\theta}}}_{t}=\operatorname*{argmin}_{\bm{\theta}\in\bm{\Theta}}\sum_{s=1}^{t}\ell_{s}(\bm{\theta})$. If $t$ is large enough such that $\frac{\gamma\log(dt)}{t}\leq c^{\prime}\min\left\{\frac{1}{C_{1}C_{2}|V|^{4}},\frac{\max_{\bm{\theta}\in\bm{\Theta}}\left({\cal L}_{t}(\bm{\theta})-{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right)}{C_{2}}\right\}$, then for a constant $\gamma\geq 2$ and universal constants $C_{1},C_{2},c^{\prime}$, we can show that
$$
\left(1-\rho_{t}\right)\frac{\sigma_{t}^{2}}{t}-\frac{C_{1}^{2}}{t^{\gamma/2}}\leq\mathbb{E}\left[{\cal L}_{t}({\widehat{\bm{\theta}}}_{t})-{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right]\leq\left(1+\rho_{t}\right)\frac{\sigma_{t}^{2}}{t}+\frac{\max_{\bm{\theta}\in\bm{\Theta}}\left({\cal L}_{t}(\bm{\theta})-{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right)}{t^{\gamma}},
$$
where $\sigma^{2}_{t}\coloneqq\mathbb{E}\left[\frac{1}{2}\left\|\nabla\widehat{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right\|_{\left(\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right)^{-1}}^{2}\right]$ and $\rho_{t}\coloneqq\left(C_{1}C_{2}+2\eta^{2}\lambda_{1}^{2}\right)\sqrt{\frac{\gamma\log(dt)}{t}}$.*
*Proof*
Step 1: We first bound $\left\|\nabla^{2}\widehat{\cal L}_{t}(\bm{\theta})-\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right\|_{*}$ as follows:
$$
\displaystyle\left\|\nabla^{2}\widehat{\cal L}_{t}(\bm{\theta})-\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right\|_{*}\overset{(a)}{\leq}\left\|\nabla^{2}\widehat{\cal L}_{t}(\bm{\theta})-\nabla^{2}\widehat{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right\|_{*}+\left\|\nabla^{2}\widehat{\cal L}_{t}\left(\bm{\theta}_{*}\right)-\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right\|_{*}
\displaystyle\overset{(b)}{\leq}C_{1}\left\|\bm{\theta}-\bm{\theta}_{*}\right\|_{\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)}+\sqrt{\dfrac{8C^{2}|V|^{4}\eta^{2}\lambda_{1}^{2}c\gamma\log(dt)}{t}} \tag{9}
$$
where $(a)$ follows from the triangle inequality, and $(b)$ is due to B.1.3.d and Lemma B.7. Step 2 (Approximation of $\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)$): By choosing a sufficiently small ball $\mathcal{B}_{1}$ of radius $\min\left\{1/\left(10C_{1}\right),\operatorname{diameter}(\mathcal{B})\right\}$, the first term in (9) can be made small for $\bm{\theta}\in\mathcal{B}_{1}$. Also, for sufficiently large $t$, the second term in (9) can be made arbitrarily small (smaller than $1/10$), which occurs if $\sqrt{\frac{\gamma\log(dt)}{t}}\leq\frac{c^{\prime}}{\sqrt{2C^{2}|V|^{4}\eta^{2}\lambda_{1}^{2}}}$. Hence for large $t$ and $\bm{\theta}\in\mathcal{B}_{1}$ we have
$$
\frac{1}{2}\nabla^{2}\widehat{\cal L}_{t}(\bm{\theta})\preceq\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)\preceq 2\nabla^{2}\widehat{\cal L}_{t}(\bm{\theta}) \tag{10}
$$
Step 3 (Show ${\widehat{\bm{\theta}}}_{t}$ is in $\mathcal{B}_{1}$): Fix a $\widetilde{\bm{\theta}}$ between $\bm{\theta}$ and $\bm{\theta}_{*}$ in $\mathcal{B}_{1}$. Apply Taylor’s series approximation:
$$
\widehat{\cal L}_{t}(\bm{\theta})=\widehat{\cal L}_{t}\left(\bm{\theta}_{*}\right)+\nabla\widehat{\cal L}_{t}\left(\bm{\theta}_{*}\right)^{\top}\left(\bm{\theta}-\bm{\theta}_{*}\right)+\frac{1}{2}\left(\bm{\theta}-\bm{\theta}_{*}\right)^{\top}\nabla^{2}\widehat{\cal L}_{t}(\widetilde{\bm{\theta}})\left(\bm{\theta}-\bm{\theta}_{*}\right)
$$
We can further reduce this as follows:
$$
\displaystyle\widehat{\cal L}_{t}(\bm{\theta})-\widehat{\cal L}_{t}\left(\bm{\theta}_{*}\right)\overset{(a)}{=}\nabla\widehat{\cal L}_{t}\left(\bm{\theta}_{*}\right)^{\top}\left(\bm{\theta}-\bm{\theta}_{*}\right)+\frac{1}{2}\left\|\bm{\theta}-\bm{\theta}_{*}\right\|_{\nabla^{2}\widehat{\cal L}_{t}(\widetilde{\bm{\theta}})}^{2}
\displaystyle\overset{(b)}{\geq}\nabla\widehat{\cal L}_{t}\left(\bm{\theta}_{*}\right)^{\top}\left(\bm{\theta}-\bm{\theta}_{*}\right)+\frac{1}{4}\left\|\bm{\theta}-\bm{\theta}_{*}\right\|_{\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)}^{2}
\displaystyle\geq-\left\|\bm{\theta}-\bm{\theta}_{*}\right\|_{\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)}\left\|\nabla\widehat{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right\|_{\left(\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right)^{-1}}+\frac{1}{4}\left\|\bm{\theta}-\bm{\theta}_{*}\right\|_{\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)}^{2}
\displaystyle=\left\|\bm{\theta}-\bm{\theta}_{*}\right\|_{\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)}\left(-\left\|\nabla\widehat{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right\|_{\left(\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right)^{-1}}+\frac{1}{4}\left\|\bm{\theta}-\bm{\theta}_{*}\right\|_{\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)}\right) \tag{11}
$$
where $(a)$ follows as $\left\|\bm{\theta}-\bm{\theta}_{*}\right\|_{\nabla^{2}\widehat{\cal L}_{t}(\widetilde{\bm{\theta}})}^{2}\coloneqq\left(\bm{\theta}-\bm{\theta}_{*}\right)^{\top}\nabla^{2}\widehat{\cal L}_{t}(\widetilde{\bm{\theta}})\left(\bm{\theta}-\bm{\theta}_{*}\right)$, and $(b)$ follows as $\widetilde{\bm{\theta}}$ is between $\bm{\theta}$ and $\bm{\theta}_{*}$ and then using (10). Note that in (11), if the right hand side is positive for some $\bm{\theta}\in\mathcal{B}_{1}$, then $\bm{\theta}$ is not a local minimum. Also, since $\left\|\nabla\widehat{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right\|\to 0$, for a sufficiently small value of $\left\|\nabla\widehat{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right\|$ all points on the boundary of $\mathcal{B}_{1}$ will have values greater than that of $\bm{\theta}_{*}$. Hence, we must have a local minimum of $\widehat{\cal L}_{t}(\bm{\theta})$ that is strictly inside $\mathcal{B}_{1}$ (for $t$ large enough). We can ensure this local minimum condition is achieved by choosing a $t$ large enough so that $\sqrt{\frac{\gamma\log(dt)}{t}}\leq c^{\prime}\min\left\{\frac{1}{C_{1}C_{2}},\frac{\operatorname{diameter}(\mathcal{B})}{C_{2}}\right\}$, using Lemma B.3 (and our bound on the diameter of $\mathcal{B}_{1}$). By convexity, this is the global minimum ${\widehat{\bm{\theta}}}_{t}$, and so ${\widehat{\bm{\theta}}}_{t}\in\mathcal{B}_{1}$ for $t$ large enough. We will assume from here on that $t$ is this large. Step 4 (Bound $\left\|{\widehat{\bm{\theta}}}_{t}-\bm{\theta}_{*}\right\|_{\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)}$): For the ${\widehat{\bm{\theta}}}_{t}$ that minimizes the sum of squared errors, $0=\nabla\widehat{\cal L}_{t}({\widehat{\bm{\theta}}}_{t})$. Again, using Taylor’s theorem, if ${\widehat{\bm{\theta}}}_{t}$ is an interior point, we have:
$$
0=\nabla\widehat{\cal L}_{t}({\widehat{\bm{\theta}}}_{t})=\nabla\widehat{\cal L}_{t}\left(\bm{\theta}_{*}\right)+\nabla^{2}\widehat{\cal L}_{t}(\widetilde{\bm{\theta}}_{t})\left({\widehat{\bm{\theta}}}_{t}-\bm{\theta}_{*}\right) \tag{12}
$$
for some $\widetilde{\bm{\theta}}_{t}$ between $\bm{\theta}_{*}$ and ${\widehat{\bm{\theta}}}_{t}$. Now observe that $\widetilde{\bm{\theta}}_{t}$ is in $\mathcal{B}_{1}$ (since, for $t$ large enough, ${\widehat{\bm{\theta}}}_{t}\in\mathcal{B}_{1}$). Thus it follows from (12) that
$$
{\widehat{\bm{\theta}}}_{t}-\bm{\theta}_{*}=\left(\nabla^{2}\widehat{\cal L}_{t}(\widetilde{\bm{\theta}}_{t})\right)^{-1}\nabla\widehat{\cal L}_{t}\left(\bm{\theta}_{*}\right) \tag{13}
$$
where the invertibility is guaranteed by (10) and the positive definiteness of $\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)$ (by B.1 (3c)). We finally derive the upper bound on $\left\|{\widehat{\bm{\theta}}}_{t}-\bm{\theta}_{*}\right\|_{\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)}$ as follows:
$$
\displaystyle\left\|{\widehat{\bm{\theta}}}_{t}-\bm{\theta}_{*}\right\|_{\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)}\overset{(a)}{\leq}\left\|\left(\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right)^{1/2}\left(\nabla^{2}\widehat{\cal L}_{t}(\widetilde{\bm{\theta}}_{t})\right)^{-1}\left(\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right)^{1/2}\right\|\left\|\nabla\widehat{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right\|_{\left(\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right)^{-1}}
\displaystyle\overset{(b)}{\leq}cC_{2}\sqrt{\frac{\gamma\log(dt)}{t}} \tag{14}
$$
where $(a)$ follows from Lemma B.8, $(b)$ follows from Lemma B.3 and (11), and $c$ is some universal constant. Step 5 (Introducing $\widetilde{\mathbf{z}}_{t}$): Fix a $\widetilde{\mathbf{z}}_{t}$ between $\bm{\theta}_{*}$ and ${\widehat{\bm{\theta}}}_{t}$. Apply Taylor’s series:
$$
{\cal L}_{t}({\widehat{\bm{\theta}}}_{t})-{\cal L}_{t}\left(\bm{\theta}_{*}\right)=\frac{1}{2}\left({\widehat{\bm{\theta}}}_{t}-\bm{\theta}_{*}\right)^{\top}\nabla^{2}{\cal L}_{t}\left(\widetilde{\mathbf{z}}_{t}\right)\left({\widehat{\bm{\theta}}}_{t}-\bm{\theta}_{*}\right) \tag{15}
$$
Now note that both $\widetilde{\bm{\theta}}_{t}$ and $\widetilde{\mathbf{z}}_{t}$ are between ${\widehat{\bm{\theta}}}_{t}$ and $\bm{\theta}_{*}$, which implies $\widetilde{\bm{\theta}}_{t}\to\bm{\theta}_{*}$ and $\widetilde{\mathbf{z}}_{t}\to\bm{\theta}_{*}$ since ${\widehat{\bm{\theta}}}_{t}\to\bm{\theta}_{*}$. Combining (9) and (14) and applying the concentration inequalities gives us
$$
\left\|\nabla^{2}\widehat{\cal L}_{t}(\widetilde{\bm{\theta}}_{t})-\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right\|_{*}\leq\rho_{t},\qquad\left\|\nabla^{2}{\cal L}_{t}\left(\widetilde{\mathbf{z}}_{t}\right)-\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right\|_{*}\leq C_{1}\left\|\widetilde{\mathbf{z}}_{t}-\bm{\theta}_{*}\right\|_{\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)}\leq\rho_{t} \tag{16}
$$
where $\rho_{t}=c\left(C_{1}C_{2}+2\eta^{2}\lambda_{1}^{2}\right)\sqrt{\frac{\gamma\log(dt)}{t}}$. Step 6 (Define $\mathbf{M}_{1,t}$ and $\mathbf{M}_{2,t}$): It follows from the inequality (16) that
$$
\nabla^{2}\widehat{\cal L}_{t}(\widetilde{\bm{\theta}}_{t})\preceq\left(1+\rho_{t}\right)\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)\implies\nabla^{2}\widehat{\cal L}_{t}(\widetilde{\bm{\theta}}_{t})-\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)\preceq\rho_{t}\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right) \tag{17}
$$
Then we can use the inequalities (16) and (17) to show that
$$
\left(1-\rho_{t}\right)\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)\preceq\nabla^{2}\widehat{\cal L}_{t}(\widetilde{\bm{\theta}}_{t})\preceq\left(1+\rho_{t}\right)\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)
$$
Now we define the two quantities $\mathbf{M}_{1,t}$ and $\mathbf{M}_{2,t}$ as follows:
$$
\mathbf{M}_{1,t}:=\left(\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right)^{1/2}\left(\nabla^{2}\widehat{\cal L}_{t}(\widetilde{\bm{\theta}}_{t})\right)^{-1}\left(\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right)^{1/2},\qquad\mathbf{M}_{2,t}:=\left(\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right)^{-1/2}\nabla^{2}{\cal L}_{t}\left(\widetilde{\mathbf{z}}_{t}\right)\left(\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right)^{-1/2}
$$
Step 7 (Lower bound ${\cal L}_{t}({\widehat{\bm{\theta}}}_{t})-{\cal L}_{t}\left(\bm{\theta}_{*}\right)$): Now for the lower bound, it follows from Equation 15 that
$$
{\cal L}_{t}({\widehat{\bm{\theta}}}_{t})-{\cal L}_{t}\left(\bm{\theta}_{*}\right)=\frac{1}{2}\left({\widehat{\bm{\theta}}}_{t}-\bm{\theta}_{*}\right)^{\top}\nabla^{2}{\cal L}_{t}\left(\widetilde{\mathbf{z}}_{t}\right)\left({\widehat{\bm{\theta}}}_{t}-\bm{\theta}_{*}\right)
$$
where, in $(a)$, we define the vector $\mathbf{u}:=\left({\widehat{\bm{\theta}}}_{t}-\bm{\theta}_{*}\right)^{\top}\nabla^{2}{\cal L}_{t}(\bm{\theta}_{*})^{\frac{1}{2}}$. Now, from the definition of $\mathbf{M}_{2,t}$ and the min-max theorem, we can show that
$$
{\cal L}_{t}({\widehat{\bm{\theta}}}_{t})-{\cal L}_{t}\left(\bm{\theta}_{*}\right)\geq\frac{1}{2}\lambda_{\min}\left(\mathbf{M}_{2,t}\right)\mathbf{u}^{\top}\mathbf{u}
$$
where, in $(a)$, we use Equation 13. Step 8: Define $I(\mathcal{E})$ as the indicator that the desired previous events hold, which we can ensure with probability greater than $1-2\left(\dfrac{1}{dt}\right)^{\gamma}$. Then we can show that:
$$
\mathbb{E}\left[{\cal L}_{t}({\widehat{\bm{\theta}}}_{t})-{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right]\geq\mathbb{E}\left[\left({\cal L}_{t}({\widehat{\bm{\theta}}}_{t})-{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right)I(\mathcal{E})\right]
$$
where, in $(a)$, we have $\sigma^{2}_{t}:=\left\|\nabla\widehat{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right\|_{\left(\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right)^{-1}}^{2}$, and $c^{\prime}$ is a universal constant. Step 9: Define the random variable $Z=\left\|\nabla\widehat{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right\|_{\left(\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right)^{-1}}$. With a failure event probability of less than $2\left(\dfrac{1}{dt}\right)^{\gamma}$, for any $z_{0}$ we have:
$$
\mathbb{E}\left[Z^{2}I(\operatorname{not}\mathcal{E})\right]=\mathbb{E}\left[Z^{2}I(\operatorname{not}\mathcal{E})I\left(Z^{2}<z_{0}\right)\right]+\mathbb{E}\left[Z^{2}I(\operatorname{not}\mathcal{E})I\left(Z^{2}\geq z_{0}\right)\right]
$$
where $z_{0}=t^{\gamma/2}\sqrt{\mathbb{E}\left[Z^{4}\right]}$. Step 10 (Upper Bound): For an upper bound we have that:
$$
\mathbb{E}\left[{\cal L}_{t}({\widehat{\bm{\theta}}}_{t})-{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right]=\mathbb{E}\left[\left({\cal L}_{t}({\widehat{\bm{\theta}}}_{t})-{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right)I(\mathcal{E})\right]+\mathbb{E}\left[\left({\cal L}_{t}({\widehat{\bm{\theta}}}_{t})-{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right)I(\operatorname{not}\mathcal{E})\right]
$$
Since the probability of not $\mathcal{E}$ is less than $\dfrac{1}{t^{\gamma}}$, the second term can be bounded as $\max_{\bm{\theta}\in\bm{\Theta}}\left({\cal L}_{t}(\bm{\theta})-{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right)/t^{\gamma}$. Now for an upper bound of the first term, observe that
$$
\mathbb{E}\left[\left({\cal L}_{t}({\widehat{\bm{\theta}}}_{t})-{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right)I(\mathcal{E})\right]\leq\frac{1}{2}\mathbb{E}\left[\left(\lambda_{\max}\left(\mathbf{M}_{1,t}\right)\right)^{2}\lambda_{\max}\left(\mathbf{M}_{2,t}\right)\left\|\nabla\widehat{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right\|_{\left(\nabla^{2}{\cal L}_{t}\left(\bm{\theta}_{*}\right)\right)^{-1}}^{2}I(\mathcal{E})\right]
$$
where $c^{\prime}$ is another universal constant. ∎
Appendix C Experimental Details
C.1 Dataset Statistics
We provide the processed data statistics in Table 6. We highlight that, due to the black-box assumption on the base model, the training set is used only for the ablation and qualitative analyses in Section 7.3 and Section 7.4.
Table 6: Processed Dataset Statistics. Training set is only used for ablation and qualitative analysis due to the black-box model assumption.
| Dataset | Train | Validation | Test |
| --- | --- | --- | --- |
| E2E NLG | 33,525 | 4,299 | 4,693 |
| Web NLG | 2,732 (filtered by categories) | 844 | 720 |
| CommonGen | 1,476 (filtered for “man”) | 2,026 | 1,992 |
| Adidas | — | 745 | 100 |
C.2 Prompts
We now describe the prompts we used for the four datasets and three models.
E2E NLG Dataset
- For the GPT2-M model, we use the prompt: "Given the following aspects of a restaurant, [attributes], a natural language sentence describing the restaurant is:"
- For the GPT2-XL model, the prompt is: "Imagine you are writing a one-sentence description for a restaurant, given the following aspects: [attributes], a human-readable natural language sentence describing the restaurant is:"
- For the LLaMA-3.1-8B model, we use: "Please convert the following attributes into a coherent sentence. Do not provide an explanation."
Web NLG Dataset
- For the GPT2-M model, we use the prompt: "Convert the following facts into a coherent sentence: Facts: [facts] Sentence:"
- For the GPT2-XL model, the prompt is: "You are given the following facts. Facts: [facts] A short, coherent sentence summarizing the facts is:"
- For the LLaMA-3.1-8B model, we use: "Do not provide an explanation or follow-up. Just convert the following facts of an entity into a coherent sentence. Facts: [facts] Sentence:"
CommonGen Dataset
- For the GPT2-M and GPT2-XL models, we use the same prompt: "One coherent sentence that uses all the following concepts: [concepts], is:"
- For the LLaMA-3.1-8B model, we use: "Please write a coherent sentence that uses all the following concepts. Concepts: [concepts] Sentence:"
Adidas Dataset
- For the GPT2-M and GPT2-XL models, we use the same prompt: "Given the following attributes of a product, write a description. Attributes: [attributes] Description:"
- For the LLaMA-3.1-8B model, we use: "Please write a description of this product given the following attributes. Attributes: [attributes] Description:"
For in-context learning, we simply prepend the sentence "Below are a list of demonstrations:" to the prompt, before the demonstration samples.
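As an illustration of how these templates are used, below is a minimal sketch of prompt assembly with the optional in-context-learning prefix. The `build_prompt` helper and the example attribute strings are hypothetical, not the code used in our experiments; only the template and the demonstration header sentence are taken verbatim from above.

```python
def build_prompt(template: str, attributes: str, demonstrations=None) -> str:
    """Fill the [attributes] placeholder; optionally prepend ICL demonstrations."""
    prompt = template.replace("[attributes]", attributes)
    if demonstrations:
        # Header sentence used verbatim for in-context learning.
        header = "Below are a list of demonstrations:"
        prompt = "\n".join([header, *demonstrations, prompt])
    return prompt

# E2E NLG template for GPT2-M (from the list above); attributes are illustrative.
template = ("Given the following aspects of a restaurant, [attributes], "
            "a natural language sentence describing the restaurant is:")
zero_shot = build_prompt(template, "name[Aromi], food[Chinese]")
few_shot = build_prompt(
    template, "name[Aromi], food[Chinese]",
    demonstrations=["name[Cotto], food[Indian] -> Cotto serves Indian food."])
```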
For the qualitative analysis of the distribution shift in Section 7.4, we query GPT-4o with the following prompts. For the Web NLG dataset: "Focus on all the samples, how much percentage is related to ‘‘Person’’?"
For the CommonGen dataset: "Focus on those samples whose target is related to gender, how much percentage is related to ‘‘woman’’?"
C.3 Metrics
We report performance using seven standard metrics commonly used in natural language generation tasks. These are: (a) BLEU (Papineni et al., 2002) (measures n-gram overlap between the generated and reference texts, emphasizing precision), (b) ROUGE-1 (Lin, 2004) (computes unigram recall to measure the overlap between generated and reference texts), (c) ROUGE-2 (Lin, 2004) (extends ROUGE-1 to bigrams, measuring the recall of two-word sequences), (d) ROUGE-L (Lin & Och, 2004) (uses the longest common subsequence to evaluate recall), (e) METEOR (Banerjee & Lavie, 2005) (combines unigram precision, recall, and semantic matching to assess similarity), (f) CIDEr (Vedantam et al., 2015) (measures consensus in n-gram usage across multiple references, with tf-idf weighting), and (g) NIST (Doddington, 2002) (similar to BLEU but weights n-grams by their informativeness, favoring less frequent and meaningful phrases).
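These metrics are computed with standard evaluation packages; as one concrete illustration of the simplest of them, the following is a minimal sketch of the unigram-recall computation behind ROUGE-1. The function name and whitespace tokenization are ours for illustration, not the evaluation code used in our experiments.

```python
from collections import Counter

def rouge1_recall(generated: str, reference: str) -> float:
    """ROUGE-1 recall: fraction of reference unigrams recovered by the
    generation, with counts clipped so repeated tokens are not over-credited."""
    gen_counts = Counter(generated.lower().split())
    ref_counts = Counter(reference.lower().split())
    if not ref_counts:
        return 0.0
    overlap = sum(min(c, gen_counts[tok]) for tok, c in ref_counts.items())
    return overlap / sum(ref_counts.values())
```

A production setup would add stemming and proper tokenization, which is why reported numbers come from standard packages rather than a sketch like this.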
C.4 Performance and Efficiency Comparison with Parameter-Efficient Fine-Tuning
While our work focuses on black-box LLM adaptation where model weights are inaccessible, we include a controlled comparison with Parameter-Efficient Fine-Tuning (PEFT) methods. Specifically, we implement LoRA (Hu et al., 2021) with rank-8 matrices on the query and value projections of GPT2-XL and LLaMA-3.1-8B, and fine-tune the base models using the same task-specific data.
The performance results are shown in Table 7. Taking GPT2-XL as a reference example, Plugin adds a 1-layer autoregressive Transformer with 30.72M parameters, while LoRA (r=8) introduces only 2.46M trainable parameters. However, Plugin requires no modification of the base model and can be deployed post hoc. Despite the access advantage of LoRA, the performance gap is minimal. As for computational efficiency, Plugin requires 196.2B FLOPs (up to 64 decoding steps), while LoRA uses 188.8B FLOPs—a difference of less than 5%. The gap narrows or inverts depending on model configuration. These results suggest that Plugin offers a competitive adaptation solution even under white-box conditions, while maintaining broader applicability in black-box settings.
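The quoted LoRA parameter count can be reproduced from the GPT2-XL architecture (48 transformer layers, hidden size 1600). Below is a back-of-the-envelope sketch, assuming rank-$r$ LoRA factors on square query and value projections; the helper function is illustrative, not our implementation.

```python
def lora_param_count(n_layers: int, d_model: int, rank: int,
                     n_target_proj: int = 2) -> int:
    """Trainable parameters added by LoRA: each adapted d_model x d_model
    projection gets two low-rank factors, A (rank x d_model) and B (d_model x rank)."""
    per_projection = 2 * rank * d_model
    return n_layers * n_target_proj * per_projection

# GPT2-XL: 48 layers, hidden size 1600; LoRA r=8 on query and value projections.
gpt2_xl_lora = lora_param_count(n_layers=48, d_model=1600, rank=8)
print(gpt2_xl_lora)  # 2,457,600, i.e. the 2.46M trainable parameters quoted above
```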
Table 7: Comparison between Plugin and PEFT (LoRA, r=8) on four datasets using GPT2-XL and LLaMA-3.1-8B as base models. We show mean and standard deviation of the metrics over five seeds.
| Model | Method | BLEU | Rouge-1 | Rouge-2 | Rouge-L | METEOR | CIDEr | NIST |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| E2E NLG | | | | | | | | |
| GPT2-XL | Zeroshot | 0.0562 | 0.4013 | 0.1636 | 0.2862 | 0.3697 | 0.0187 | 0.5338 |
| GPT2-XL | LoRA (r=8) | 0.2517 ±0.012 | 0.5712 ±0.010 | 0.3079 ±0.013 | 0.4317 ±0.011 | 0.5162 ±0.014 | 0.5225 ±0.012 | 1.2172 ±0.011 |
| GPT2-XL | Plugin (Ours) | 0.2470 ±0.009 | 0.5536 ±0.007 | 0.3084 ±0.007 | 0.4213 ±0.008 | 0.5057 ±0.009 | 0.5455 ±0.013 | 1.2736 ±0.051 |
| LLaMA-3.1-8B | Zeroshot | 0.3226 | 0.6917 | 0.4050 | 0.5004 | 0.6041 | 0.9764 | 1.1310 |
| LLaMA-3.1-8B | LoRA (r=8) | 0.3702 ±0.016 | 0.7125 ±0.010 | 0.4236 ±0.014 | 0.5345 ±0.012 | 0.6413 ±0.017 | 1.1028 ±0.033 | 1.1827 ±0.035 |
| LLaMA-3.1-8B | Plugin (Ours) | 0.3691 ±0.013 | 0.7113 ±0.002 | 0.4374 ±0.004 | 0.5247 ±0.002 | 0.6392 ±0.009 | 1.1441 ±0.030 | 1.1749 ±0.034 |
| **Web NLG** | | | | | | | | |
| GPT2-XL | Zeroshot | 0.0317 | 0.2992 | 0.1321 | 0.2417 | 0.1969 | 0.0491 | 0.1826 |
| GPT2-XL | LoRA (r=8) | 0.1723 ±0.007 | 0.4604 ±0.010 | 0.2618 ±0.011 | 0.3628 ±0.015 | 0.4012 ±0.017 | 0.9018 ±0.028 | 0.2736 ±0.014 |
| GPT2-XL | Plugin (Ours) | 0.1673 ±0.004 | 0.4616 ±0.007 | 0.2527 ±0.007 | 0.3757 ±0.008 | 0.3895 ±0.007 | 0.8987 ±0.013 | 0.2646 ±0.003 |
| LLaMA-3.1-8B | Zeroshot | 0.1453 | 0.5278 | 0.3030 | 0.3982 | 0.4314 | 0.6991 | 0.2684 |
| LLaMA-3.1-8B | LoRA (r=8) | 0.2638 ±0.008 | 0.6238 ±0.010 | 0.3927 ±0.009 | 0.4726 ±0.009 | 0.5927 ±0.013 | 1.6421 ±0.028 | 0.2379 ±0.008 |
| LLaMA-3.1-8B | Plugin (Ours) | 0.2542 ±0.004 | 0.6375 ±0.005 | 0.3873 ±0.005 | 0.4869 ±0.007 | 0.5724 ±0.004 | 1.5911 ±0.046 | 0.2590 ±0.003 |
| **CommonGen** | | | | | | | | |
| GPT2-XL | Zeroshot | 0.0317 | 0.2992 | 0.1321 | 0.2417 | 0.1969 | 0.0491 | 0.1826 |
| GPT2-XL | LoRA (r=8) | 0.1826 ±0.027 | 0.5027 ±0.010 | 0.2137 ±0.014 | 0.4447 ±0.016 | 0.4726 ±0.009 | 0.7182 ±0.027 | 0.6725 ±0.043 |
| GPT2-XL | Plugin (Ours) | 0.1791 ±0.014 | 0.4932 ±0.007 | 0.2288 ±0.004 | 0.4347 ±0.007 | 0.4702 ±0.006 | 0.7283 ±0.012 | 0.6554 ±0.038 |
| LLaMA-3.1-8B | Zeroshot | 0.0643 | 0.2776 | 0.1181 | 0.2488 | 0.3857 | 0.3155 | 0.3347 |
| LLaMA-3.1-8B | LoRA (r=8) | 0.2736 ±0.018 | 0.5829 ±0.009 | 0.3206 ±0.009 | 0.5026 ±0.012 | 0.5927 ±0.016 | 1.1121 ±0.034 | 0.7926 ±0.028 |
| LLaMA-3.1-8B | Plugin (Ours) | 0.2665 ±0.010 | 0.5800 ±0.002 | 0.3139 ±0.005 | 0.5037 ±0.004 | 0.5829 ±0.003 | 1.0876 ±0.020 | 0.7031 ±0.007 |
| **Adidas** | | | | | | | | |
| GPT2-XL | Zeroshot | 0.0075 | 0.2309 | 0.0278 | 0.1438 | 0.1487 | 0.0184 | 0.4956 |
| GPT2-XL | LoRA (r=8) | 0.0629 ±0.028 | 0.2816 ±0.030 | 0.0719 ±0.029 | 0.1816 ±0.038 | 0.2037 ±0.018 | 0.1231 ±0.126 | 0.6576 ±0.134 |
| GPT2-XL | Plugin (Ours) | 0.0600 ±0.017 | 0.2710 ±0.025 | 0.0722 ±0.018 | 0.1725 ±0.017 | 0.1995 ±0.018 | 0.1195 ±0.138 | 0.6375 ±0.120 |
| LLaMA-3.1-8B | Zeroshot | 0.0120 | 0.2470 | 0.0318 | 0.1493 | 0.1526 | 0.0424 | 0.5285 |
| LLaMA-3.1-8B | LoRA (r=8) | 0.0721 ±0.020 | 0.2697 ±0.031 | 0.0756 ±0.028 | 0.1821 ±0.020 | 0.2023 ±0.038 | 0.1302 ±0.178 | 0.6137 ±0.172 |
| LLaMA-3.1-8B | Plugin (Ours) | 0.0611 ±0.018 | 0.2714 ±0.029 | 0.0742 ±0.020 | 0.1759 ±0.019 | 0.1990 ±0.020 | 0.1293 ±0.152 | 0.6361 ±0.134 |
C.5 Further Quantitative Analysis and Ablation
Following Section 7.3, we present the same ablation analysis using GPT2-M on the remaining three datasets. As shown in Figure 5, the trends mirror those in Figure 2: the Plugin model consistently improves performance even as the base model grows stronger with additional fine-tuning, underscoring the robustness and versatility of our approach. Similarly, Figure 6 confirms the pattern observed in Figure 3: a single-layer reweighting model yields the best performance, while deeper configurations tend to overfit and degrade quality. Across all datasets, initializing the reweighting model with a pretrained GPT2-Small consistently boosts effectiveness.
Figure 5: Performance of applying a single-layer reweighting model across increasingly fine-tuned GPT2-M models on the three datasets. Results demonstrate consistent improvements introduced by our method regardless of the strength of the base model.
Figure 6: Performance of GPT2-M with varying reweighting model complexities on the three datasets, measured by BLEU and Rouge-L. Results demonstrate that a single reweighting layer achieves significant improvements, while increasing the number of layers beyond this leads to performance degradation, likely due to overfitting. Using a pretrained GPT2-Small as the reweighting model largely boosts the performance, highlighting the benefits of leveraging pretrained models.
C.6 Influence of the Architecture of the Reweighting Model in Plugin
We vary the architecture of the reweighting model. A causal transformer layer identical to those used in the base model performs best: it can leverage the base model's logits and aggregate contextual information from prior tokens to better adapt the base model to the new data distribution. Figure 7 reinforces this conclusion: the transformer architecture consistently outperforms both the MLP (two layers with ReLU activations) and the linear layer across all metrics, with higher means and narrower standard-deviation bands. These results highlight the importance of the architectural capacity of transformers for effectively adapting the logits of the black-box base model.
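To make the transformer variant concrete, it can be sketched in PyTorch as below. This is an illustrative sketch, not the paper's implementation: the projection sizes, the additive combination with the base logits, and the name `LogitReweighter` are our assumptions; only the idea of a single causal transformer layer operating solely on the base model's logit sequence comes from the text.

```python
import torch
import torch.nn as nn

class LogitReweighter(nn.Module):
    """Hypothetical 1-layer causal-transformer reweighting head. It consumes
    the base model's logit sequence and emits corrected logits; the hidden
    width and additive residual combination are illustrative choices."""

    def __init__(self, vocab_size, d_model=256, n_heads=4):
        super().__init__()
        self.proj_in = nn.Linear(vocab_size, d_model)
        self.layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.proj_out = nn.Linear(d_model, vocab_size)

    def forward(self, base_logits):  # (batch, seq, vocab)
        seq_len = base_logits.size(1)
        # Causal mask so each position only attends to prior tokens.
        causal = nn.Transformer.generate_square_subsequent_mask(seq_len)
        h = self.layer(self.proj_in(base_logits), src_mask=causal)
        return base_logits + self.proj_out(h)  # reweighted logits
```

At decoding time, the reweighted logits would replace the base logits before sampling; the base model itself is never modified.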
Figure 7: Performance comparison of the weighting model architecture in Plugin. The transformer layer achieves the best performance with consistently higher means and narrower standard deviations. Shaded bands represent the standard deviation around the mean.
C.7 Details for Adidas Qualitative Studies
Human Evaluation.
We conduct a human evaluation on 100 test passages from the Adidas product dataset, comparing outputs generated with and without the reweighting model, using LLaMA-3.1-8B as the base model. Three human evaluators are presented with a ground-truth Adidas product description and two descriptions in random order: one generated with the reweighting layer and one without (for the latter, we use the base model with ICL-3, a much stronger baseline, given the low quality of zero-shot outputs). Evaluators are asked to select the prediction closest to the ground truth. The output generated with the reweighting model is preferred in an average of 80.7 of the 100 cases; outputs from the base model without reweighting are generally short and generic. This demonstrates that our approach effectively adapts a closed model to the distinctive style of the given dataset.
The remainder of this section provides details for the qualitative analysis on the Adidas product description dataset.
Details of Extracting Adidas Style Words.
We describe how we extract the 50 most frequent words in the Adidas product description dataset as the “Adidas style” words. There is no gold-standard way to define the “style” words of a dataset; we extract them through a minimal preprocessing pipeline: converting text to lowercase, removing special characters and numbers, and filtering out common English stopwords. We deliberately preserve the original word forms without lemmatization or stemming to maintain distinct style markers (e.g., keeping “comfortable” distinct from “comfort” and “running” distinct from “run”). After tokenization with NLTK’s word tokenizer, we count word frequencies across all product descriptions and select the top 50 most frequent words. This approach captures the exact vocabulary used in Adidas’ product descriptions, including specific product features.
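The pipeline above can be sketched in a few lines. This is an illustrative sketch only: the paper uses NLTK's word tokenizer and stopword list, whereas here a regex tokenizer and a small hardcoded stopword set stand in so that the snippet is self-contained.

```python
import re
from collections import Counter

# Small illustrative stopword set; the paper uses NLTK's English stopwords.
STOPWORDS = {"the", "a", "an", "and", "or", "with", "for", "of", "to", "in",
             "is", "are", "on", "your", "you", "this", "that", "it", "its"}

def top_style_words(descriptions, k=50):
    """Lowercase, keep letter-only tokens (dropping numbers and special
    characters), filter stopwords, count raw word forms without lemmatization
    or stemming, and return the k most frequent words with their counts."""
    counts = Counter()
    for text in descriptions:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common(k)
```

For example, `top_style_words(descriptions, k=50)` applied to all product descriptions yields the frequency-ranked vocabulary summarized in the corresponding figure.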
Statistics of the frequencies of these top-50 words are shown in Figure 8.
Figure 8: Word Frequency of the Top-50 Words in Adidas Dataset.
Case Studies for Human Side-by-side Evaluation.
Below we provide more case studies on the Adidas product description dataset. For each case, we display the given product attributes, the base model’s zero-shot output, the base model with ICL-3’s output, the Plugin model’s output, and the ground-truth reference.
Case 1. Given Product Attributes:
- name [Cushioned Angle Stripe Low-Cut Socks 6 Pairs]
- category [Accessories]
- price [13]
- color [Multicolor]
Base Model (Zero-shot): Cushioned Angle Stripe Low-Cut Socks 6 Pairs, Multicolor.
Base Model (ICL-3): These adidas socks are made with a cushioned sole for extra comfort and support. The low-cut design is perfect for everyday wear, whether you’re running errands or hitting the gym.
Plugin Model: Cushioned Angle Stripe Low-Cut Socks 6 Pairs are the perfect blend of style and comfort. The low-cut design gives you a sleek, modern look while still providing plenty of support for your feet. Made with high-quality materials, these socks will keep your feet feeling comfortable and looking chic all day long.
Ground-truth Reference: With lots of cushioning in the heel and toe and a soft, stretchy feel, these kids’ socks are big on comfort. Whether they’re bouncing around between classes, practice or after-school activities, their little feet will stay dry thanks to moisture-wicking yarn and a mesh upper. An adidas Badge of Sport and 3-Stripes give the socks a classic adidas feel.
Case 2. Given Product Attributes:
- name [Badge of Sport Tee]
- category [Clothing]
- price [21]
- color [Grey]
Base Model (Zero-shot): A white t-shirt with a badge of the sport you love on it.
Base Model (ICL-3): The adidas Badge of Sport tee is a classic. It’s made with soft cotton and features the iconic 3-Stripes on the shoulders. The crew neck has a ribbed hem for a comfortable fit.
Plugin Model: There’s nothing more classic than a white tee with your favorite team’s logo. This adidas Badge of Sport Tee is an everyday essential that can be worn to the gym or on game days. It’s made with recycled materials, so you’re making a sustainable choice. The soft cotton jersey is lightweight, breathable, and comfortable.
Ground-truth Reference: Put your brand pride on full display. A simple yet versatile look, this long sleeve t-shirt keeps you comfortable while on the move or lounging at home. A contrast adidas Badge of Sport stands out on the chest. This product is made with Primegreen, a series of high-performance recycled materials.
Case 3. Given Product Attributes:
- name [Stretch Cotton Trunk Briefs 3 Pairs]
- category [Clothing]
- price [21]
- color [Grey]
Base Model (Zero-shot): Stretch Cotton Trunk Briefs 3 Pairs, Grey.
Base Model (ICL-3): These adidas men’s trunks are made from a soft cotton blend that feels great against the skin. The stretchy fabric moves with you, while the flatlock seams prevent chafing and irritation. A drawcord waistband allows for an adjustable fit.
Plugin Model: Stretch Cotton Trunk Briefs 3 Pairs. The stretch cotton fabric is comfortable to wear. These briefs are made with a classic design in mind, so you can feel confident and comfortable. They have been designed with a low waist, which gives them a flattering fit and makes them easy to get on.
Ground-truth Reference: Just look at these adidas boxer briefs. They’re made with soft, quick-dry fabric that keeps you feeling fresh and a stay-put design that keeps you covered from your first rep to your last. Get comfortable and give it your all. Our cotton products support sustainable cotton farming. This is part of our ambition to end plastic waste.
Case 4. Given Product Attributes:
- name [adidas x Zoe Saldana Long Sleeve Tee]
- category [Clothing]
- price [28]
- color [Black]
Base Model (Zero-shot): This is a long sleeve tee with the adidas logo on the front and Zoe Saldana’s name on the back. It has a crew neck, short sleeves, and a straight hem. The fabric is made of 100% cotton and is machine washable.
Base Model (ICL-3): This adidas x Zoe Saldana long sleeve tee is made with 100% recycled polyester. The soft, lightweight fabric has a smooth finish and features a crew neck and short sleeves. It’s finished with an adidas logo on the chest.
Plugin Model: This is not your average training t-shirt. A modern, slim fit and the softest fabric make it one of our most comfortable tops to date. The lightweight feel makes this tee perfect for all-day wear. Pair it with a pair of leggings or joggers for that ultimate athleisure look you’re always striving for ambition and comfort.
Ground-truth Reference: Your growth, your goals and the challenges you face along the way are all part of your journey. This adidas tee was created in collaboration with Zoe Saldana to inspire you to move with confidence, no matter where you are on your path. Smooth material lays softly on your body so you can feel comfortable in your own skin. A solid color offers styling versatility. This product is made with recycled content as part of our ambition to end plastic waste.