# Communicating Activations Between Language Model Agents
**Authors**: Vignav Ramesh, Kenneth Li
Abstract
Communication between multiple language model (LM) agents has been shown to scale up the reasoning ability of LMs. While natural language has been the dominant medium for inter-LM communication, it is not obvious this should be the standard: not only does natural language communication incur high inference costs that scale quickly with the number of both agents and messages, but the decoding process also abstracts away rich information that could otherwise be accessed from the internal activations. In this work, we propose a simple technique whereby LMs communicate via activations; concretely, we pause an LM $B$'s computation at an intermediate layer, combine its current activation with another LM $A$'s intermediate activation via some function $f$, then pass $f$'s output into the next layer of $B$ and continue the forward pass till decoding is complete. This approach scales up LMs on new tasks with zero additional parameters and data, and saves a substantial amount of compute over natural language communication. We test our method with various functional forms $f$ on two experimental setups (multi-player coordination games and reasoning benchmarks) and find that it achieves up to a $27.0\%$ improvement over natural language communication across datasets with $<1/4$ the compute, illustrating the superiority and robustness of activations as an alternative "language" for communication between LMs.
Machine Learning, ICML
1 Introduction
Language is for the purpose of communication. As large language models (LLMs) have been increasingly used to power autonomous, goal-driven agents capable of reasoning, tool usage, and adaptive decision-making (Yao et al., 2023; Xi et al., 2023; Wang et al., 2024; Ahn et al., 2022; Schick et al., 2023; Shen et al., 2023; Park et al., 2023; Nakano et al., 2022), communication between multiple cooperating agents has emerged as an intuitive approach to amplify the reasoning capabilities of LLMs (Wu et al., 2023). Explicit communication in natural language between multiple LLMs has been shown to encourage divergent thinking (Liang et al., 2023), improve factuality and reasoning (Du et al., 2023), enable integration of cross-domain knowledge (Sukhbaatar et al., 2024), and allow for modular composition of abilities in a complementary manner (Wu et al., 2023; Prasad et al., 2023).
A critical problem with natural language communication, however, is that it incurs extremely high inference costs that scale quickly with the number of agents as well as the length and number of messages (Du et al., 2023; Yang et al., 2023; Wu et al., 2023). Restricting LLM communication to natural language also raises the question: as LLMs are increasingly capable of handling larger, more complex tasks (sometimes with "super-human" ability) (Wei et al., 2022; Burns et al., 2023), might they communicate more effectively in representations of higher dimension than natural language? While using natural language as a communicative medium is appealing due to its interpretability, we claim that it may not be optimal for inter-LLM communication. Natural language generation uses only one token to represent the model's belief over the entire vocabulary, which risks losing information embedded within the model's output logits (Pham et al., 2024); furthermore, a model's belief over the entire vocabulary is itself not always better (for communicative purposes) than the model's (often richer) representation of the input in earlier layers. Indeed, Hernandez et al. (2024) find that by around the halfway point of an LM's computation, it has developed "enriched entity representations" of the input, where entities in the prompt are populated with additional facts encoded in the model's weights; by the later layers, however, these embeddings are transformed into a representation of the next word that leverages only part of the previous, richer representations, even though the full embedding would be quite useful for communication.
Motivated by these concerns, this work outlines a simple technique whereby LLM agents communicate via activations, thus enabling more efficient (i.e., higher-entropy) communication at a fraction of the number of forward passes required at inference time. Concretely, we (1) pause a Transformer LM $B$'s computation at intermediate layer $j$ in the residual stream; (2) combine its post-layer $j$ activation with another LM $A$'s post-layer $k$ activation via some function $f$; and then (3) pass $f$'s output into the next layer $j+1$ of $B$ and continue its forward pass till decoding is complete. This approach scales up LLMs on new tasks by leveraging existing, frozen LLMs along with zero task-specific parameters and data, applying to diverse domains and settings. Furthermore, in requiring only a partial forward pass through $A$ and one forward pass through $B$, this method saves a substantial amount of compute over traditional natural language communication, which we quantify in Section 3.2.
We validate our method by testing this approach with various functional forms $f$ on two experimental setups: two multi-player coordination games, where $B$ is asked to complete a task requiring information provided in a prompt to $A$; and seven reasoning benchmarks spanning multiple domains: Biographies (Du et al., 2023), GSM8k (Cobbe et al., 2021), MMLU High School Psychology, MMLU Formal Logic, MMLU College Biology, MMLU Professional Law, and MMLU Public Relations (Hendrycks et al., 2021). Our activation communication protocol exhibits up to a $27.0\%$ improvement over natural language communication across these datasets, using $<1/4$ the compute. Critically, unlike prior work that tests inter-LLM communication only on large-scale ($>70$B) models (Du et al., 2023; Liang et al., 2023), we find that our approach generalizes across a wide array of LLM suites and sizes, enabling even smaller LLMs to unlock the benefits of communication.
In summary, our contributions are two-fold:
- We propose a novel inter-model communication protocol for LLM agents that is purely activation-based.
- We perform comprehensive experiments to validate the improved performance of activation communication over traditional natural language communication. We also formally quantify our approach's compute savings over natural language communication, illustrating the superiority and robustness of activations as an alternative "language" for communication between LMs.
2 Related Work
Multi-agent communication
The field of multi-agent communication has a long-standing history. Notably, prior works on emergent communication have shown that agents can autonomously evolve communication protocols when deployed in multi-agent environments that enable cooperative and competitive game-play (Sukhbaatar et al., 2016; Foerster et al., 2016; Lazaridou et al., 2017). However, recent experiments have demonstrated that learning meaningful languages from scratch, even with centralized training, remains difficult (Lowe et al., 2020; Chaabouni et al., 2019; Jaques et al., 2019).
With the emergence of large pre-trained language models, allowing communication between LLMs in natural language has hence become a promising approach to enable coordination among multiple LLM agents (Li et al., 2023). Recent works have demonstrated that such conversations enable integration of cross-domain knowledge (Sukhbaatar et al., 2024), modular composition of abilities in a complementary manner (Wu et al., 2023), and improved task performance via splitting into subtasks (Prasad et al., 2023). Most notable is multi-agent debate, introduced by Du et al. (2023), where LLMs provide initial responses and then make refinements by iteratively considering inputs from peers. While such methods have been shown to improve performance on various tasks over vanilla and majority-vote (Wang et al., 2023) style prompting, these experiments have only focused on large models ($\mathtt{GPT\text{-}3.5/4}$, $\mathtt{LLaMA2\text{-}70B}$ and up), leaving the efficacy of debate on smaller, open-source models underexplored; our study addresses this gap by reimplementing Du et al. (2023) in experiments with smaller-scale ($1$-$70$B) models. More crucially, debate and similar natural language communication methods are extremely computationally expensive (Yang et al., 2023; Wu et al., 2023), which this work addresses.
Notably, Pham et al. (2024) propose CIPHER, which uses input (tokenizer) embeddings (as opposed to activations) to enable multi-agent communication; specifically, CIPHER passes the average tokenizer embedding (weighted by the LLM's next-token probabilities) between models. While Pham et al. (2024) show this approach outperforms natural language debate, it (i) still faces substantial information loss relative to the model activations and (ii) does not save compute, as the number of these "average embeddings" passed between models is the same as the number of tokens passed between models in natural language communication.
A related class of methods involves spending extra test-time compute reasoning in latent space (Geiping et al., 2025; Hao et al., 2024). Such latent reasoning approaches involve doing "chain-of-thought in activation space," e.g., by grafting LM activations into other layers or later forward passes through the same model (a form of "recurrent" activation communication within a single model); our approach can be viewed as doing exactly the same thing, but instead "outsourcing" the CoT to another model (and thus reaping the benefits of a greater diversity of thoughts and reasoning paths from distinct models).
Activation engineering
Activation engineering involves editing an LLM's intermediate layer representations during a forward pass to create desired changes to output text (Li et al., 2024; Turner et al., 2023). Past work has explored extracting latent steering vectors from a frozen LLM to control the quality and content of completions (Subramani et al., 2022), as well as using "direction" vectors (computed as the difference in activations between two prompts) that enable inference-time control over high-level properties of generations (Li et al., 2024; Turner et al., 2023). This work involves activation editing that is similar to such prior works at a high level, though for the purpose of communication between LLM agents.
Model composition and grafting
Composing expert models has been a recurring strategy to improve large models, with different methods imposing different restrictions on the types of base LLMs that can be combined. Mixture of Experts (Shazeer et al., 2017) requires that all experts are trained simultaneously using the same data; Branch-Train-Mix (Sukhbaatar et al., 2024) trains a single base LM multiple times on different datasets, then learns a router on outputs. Crucially, these methods do not work when neither model can do the task at hand well (i.e., they solve the problem of choosing which of several outputs is best, not that of generating a high-quality output by recombining the disparate abilities of the various base LMs).
Model grafting, in contrast, seeks to merge different models immediately prior to or at inference time. Past works have explored this at the parameter level (e.g., task vector averaging as in Ilharco et al. (2023), which requires that the base models be well aligned), the probability distribution / token level as in Shen et al. (2024) (which imposes few restrictions on the relationship between the base models, but by virtue of being token-based can result in cascading errors during decoding), and the activation level (e.g., CALM (Bansal et al., 2024), which learns an attention layer on top of two models' intermediate layer activations and thus enables broader integration of model abilities than token-level methods, but requires re-tuning of the attention mechanism for every model pair). In this work, we seek to unify CALM and other activation-level grafting techniques under a single framework, parameterized by the function $f$ used to combine activations; crucially, we explore simple forms of $f$ (e.g., sum, mean) that, unlike Bansal et al. (2024), require zero additional task-specific parameters and data, and are far more compute-efficient.
3 Communicating Activations Between Language Models
Figure 1: Overview of activation communication. (Left) Our method involves (1) pausing a Transformer LM $B$'s computation at layer $j$ in the residual stream; (2) combining its post-layer $j$ activation with another LM $A$'s post-layer $k$ activation via some function $f$; then (3) passing $f$'s output into the next layer $j+1$ of $B$ and continuing the forward pass till decoding is complete. (Right) Any function $f$ can be used to combine $A$ and $B$'s activations; we explore letting $f$ be the sum, mean, and replacement functions, as well as a task-agnostic learned linear layer (details in Section 3.1).
We propose a simple yet effective technique whereby language models communicate via activations. We detail our approach in Section 3.1; provide analytical models of the compute saved over natural language communication in Section 3.2; and discuss the intuition behind this approach in Section 3.3.
3.1 Method
Consider two language models, $A$ and $B$, and some setting in which $B$ must perform a task where it would benefit from knowledge given to $A$ as a prompt or encoded in $A$'s weights (example settings in Section 4.1 and Section 4.2 respectively). We propose incorporating information from $A$'s post-layer $k$ activation $\bm{h}_{A,k}$ into $B$'s post-layer $j$ activation $\bm{h}_{B,j}$ (and vice versa, though for simplicity we henceforth only discuss the first direction) (Figure 1, left).
More formally, suppose $A$ and $B$ (which have model dimensions $d_{A}$ and $d_{B}$ respectively) are given prompts $x_{A}$ and $x_{B}$ respectively, where $x_{A}$ is of length $t_{A}$ tokens and $x_{B}$ is of length $t_{B}$ tokens. We first run a partial forward pass of $B$ until layer $j$ (henceforth denoted $B_{\leq j}(x_{B})$) to get $\bm{h}_{B,j}\in\mathbb{R}^{t_{B}\times d_{B}}$. Then we (1) run a partial forward pass of $A$ until layer $k$ to get $A_{\leq k}(x_{A}):=\bm{h}_{A,k}\in\mathbb{R}^{t_{A}\times d_{A}}$; (2) replace the activation of the last token, $(\bm{h}_{B,j})_{t_{B}}\in\mathbb{R}^{d_{B}}\longleftarrow f((\bm{h}_{A,k})_{t_{A}},(\bm{h}_{B,j})_{t_{B}})$, for some function $f:\mathbb{R}^{d_{A}+d_{B}}\to\mathbb{R}^{d_{B}}$; then (3) continue $B$'s forward pass till decoding is complete, resulting in an output $y=B_{>j}(\bm{h}_{B,j})$.
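As a minimal sketch of steps (1)-(3), the graft can be implemented with a forward hook. The toy models, dimensions, and layer indices below are illustrative stand-ins (not the LMs from our experiments), and we take $f=\mathtt{replace}$:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_toy_lm(n_layers: int, d: int) -> nn.Sequential:
    # Stand-in for a Transformer's residual stream: a plain stack of layers.
    return nn.Sequential(*[nn.Linear(d, d) for _ in range(n_layers)])

d, k, j = 16, 2, 2                 # model dim; A's graft layer k, B's graft layer j
A, B = make_toy_lm(4, d), make_toy_lm(4, d)

x_A = torch.randn(5, d)            # t_A = 5 "token" positions for A's prompt
x_B = torch.randn(3, d)            # t_B = 3 for B's prompt

# (1) Partial forward pass of A up to and including layer k.
with torch.no_grad():
    h_A = x_A
    for layer in A[: k + 1]:
        h_A = layer(h_A)

# (2) + (3) Run B, replacing its final-token post-layer-j activation with
# f(a, b); here f = replace, i.e. f(a, b) = a.
def graft_hook(module, inputs, output):
    out = output.clone()
    out[-1] = h_A[-1]              # swap in A's final-token activation
    return out                     # returned value replaces the layer's output

handle = B[j].register_forward_hook(graft_hook)
with torch.no_grad():
    y = B(x_B)                     # forward continues through layers j+1 onward
handle.remove()
```

The hook fires once per forward pass of `B[j]`, so during autoregressive decoding one would typically restrict the graft to the first decoding step.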
Let $\bm{a}=(\bm{h}_{A,k})_{t_{A}}$ and $\bm{b}=(\bm{h}_{B,j})_{t_{B}}$, and for the sake of simplicity assume $d_{A}=d_{B}$. We consider three non-learned functions $f$:

$$
f(\bm{a},\bm{b})=\bm{a}+\bm{b}\qquad\mathtt{(sum)}
$$

$$
f(\bm{a},\bm{b})=\tfrac{1}{2}(\bm{a}+\bm{b})\qquad\mathtt{(mean)}
$$

$$
f(\bm{a},\bm{b})=\bm{a}\qquad\mathtt{(replace)}
$$

When $d_{A}\neq d_{B}$, these functions generalize as follows. Let $d=\min(d_{A},d_{B})$ and let $\circ$ denote the concatenation operator. Then

$$
f(\bm{a},\bm{b})=\bm{b}_{1:\max(d_{B}-d,0)}\circ\left(\bm{b}_{\max(d_{B}-d,0)+1:d_{B}}+\bm{a}_{\max(d_{A}-d,0)+1:d_{A}}\right)\quad\mathtt{(sum)},
$$

$$
f(\bm{a},\bm{b})=\bm{b}_{1:\max(d_{B}-d,0)}\circ\tfrac{1}{2}\left(\bm{b}_{\max(d_{B}-d,0)+1:d_{B}}+\bm{a}_{\max(d_{A}-d,0)+1:d_{A}}\right)\quad\mathtt{(mean)},
$$

$$
f(\bm{a},\bm{b})=\bm{b}_{1:\max(d_{B}-d,0)}\circ\bm{a}_{\max(d_{A}-d,0)+1:d_{A}}\quad\mathtt{(replace)}.
$$
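For concreteness, the three combination functions, including the truncation-and-concatenation handling for mismatched dimensions, can be sketched in NumPy (the function name and 1-D vector interface are our own):

```python
import numpy as np

def combine(a: np.ndarray, b: np.ndarray, mode: str = "replace") -> np.ndarray:
    """Combine A's last-token activation a (d_A,) with B's b (d_B,).

    When d_A != d_B, only the last d = min(d_A, d_B) coordinates are
    combined; b's leading d_B - d coordinates pass through untouched.
    """
    d_A, d_B = a.shape[0], b.shape[0]
    d = min(d_A, d_B)
    head = b[: d_B - d]                       # untouched prefix of b
    a_tail, b_tail = a[d_A - d :], b[d_B - d :]
    if mode == "sum":
        tail = a_tail + b_tail
    elif mode == "mean":
        tail = 0.5 * (a_tail + b_tail)
    elif mode == "replace":
        tail = a_tail
    else:
        raise ValueError(mode)
    return np.concatenate([head, tail])       # result lives in R^{d_B}
```

With $d_A=d_B$, `combine(a, b, "sum")` reduces to `a + b`, and `"replace"` simply returns `a`.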
For cases where, due to differences in $A$ and $B$'s training, $A$ and $B$'s activation spaces are quite different, we propose learning a task-agnostic (dependent only on the models $A$ and $B$) linear layer $\bm{W}\in\mathbb{R}^{d_{B}\times d_{A}}$ that projects $\bm{a}$ onto $B$'s activation space. Note that this introduces zero additional task-specific parameters and data, as we propose learning this "mapping matrix" $\bm{W}$ only once for each model pair $(A,B)$ using general text, e.g., sequences from $A$'s and/or $B$'s pretraining data mixes. We can then perform $\mathtt{sum}$, $\mathtt{mean}$, or $\mathtt{replace}$ with $\bm{W}\bm{a},\bm{b}$ instead of $\bm{a},\bm{b}$. We propose training $\bm{W}$ to minimize the MSE loss over a dataset of $N$ sentences
$$
\mathcal{L}_{\rm MSE}\left(\{\bm{y}^{(i)}\}_{i=1}^{N},\{\bm{z}^{(i)}\}_{i=1}^{N}\right)=\frac{1}{N}\sum_{i=1}^{N}\left\|\bm{z}^{(i)}-\bm{W}\bm{y}^{(i)}\right\|_{2}^{2}
$$
where each $(\bm{y}^{(i)},\bm{z}^{(i)})$ pair denotes the final-token activations of $A$ and $B$ at layers $k$ and $j$ respectively, given the same sentence as input.
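Because this loss is ordinary least squares, $\bm{W}$ can be fit in closed form rather than by gradient descent. A sketch, with random stand-ins for the cached $(\bm{y}^{(i)},\bm{z}^{(i)})$ activation pairs that would come from partial forward passes of $A$ and $B$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_A, d_B = 1024, 8, 6

# Stand-ins for final-token activations of A (y) and B (z) on N shared
# sentences; the ground-truth map plus small noise makes recovery checkable.
W_true = rng.normal(size=(d_B, d_A))
Y = rng.normal(size=(N, d_A))
Z = Y @ W_true.T + 0.01 * rng.normal(size=(N, d_B))

# Minimizing (1/N) sum_i ||z_i - W y_i||^2 over W is least squares,
# solved per output coordinate by lstsq.
W, *_ = np.linalg.lstsq(Y, Z, rcond=None)
W = W.T                                  # shape (d_B, d_A)

mse = np.mean(np.sum((Z - Y @ W.T) ** 2, axis=1))
```

After fitting, `W @ a` maps $A$'s activations into $B$'s space before applying $\mathtt{sum}$, $\mathtt{mean}$, or $\mathtt{replace}$.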
3.2 Compute Analysis
To understand the significance of activation communication, we must formally quantify the compute this procedure saves over natural language communication. For simplicity suppose the following (similar calculations can be made for the cases where $A$ and $B$ have differing model architectures and/or are given different prompts):
- $A$ and $B$ both have $L$ layers (each with $H$ attention heads, key size $K$ , and feedforward size $F$ ), dimension $D$ , and vocab size $V$
- $A$ and $B$ are both given a prompt of $P$ tokens
- $A$ can send $B$ a single $M$ -token message
- $B$ must produce an output of $T$ tokens, given its prompt and $A$ âs message
Traditional methods require $M$ forward passes of $A$ given a $P$ -length input, plus $T$ forward passes of $B$ given a $(P+M)$ -length input. Following Hoffmann et al. (2022), this requires
$$
\begin{split}
&M\big(4PVD+L(8PDKH+4P^{2}KH+3HP^{2}+4PDF)\big)\\
&\quad+T\big(4(P+M)VD+L(8(P+M)DKH\\
&\qquad+4(P+M)^{2}KH+3H(P+M)^{2}+4(P+M)DF)\big)
\end{split}\tag{1}
$$
FLOPs. In contrast, at inference time, our method requires only 1 partial (up till the $k$ th layer) forward pass of $A$ given a $P$ -length input, $T$ forward passes of $B$ given a $P$ -length input, and the activation replacement procedure. This requires
$$
\begin{split}
&2PVD+k(8PDKH+4P^{2}KH+3HP^{2}+4PDF)\\
&\quad+T\big(4PVD+L(8PDKH+4P^{2}KH+3HP^{2}+4PDF)\big)+\mathcal{F}(D)
\end{split}\tag{2}
$$
FLOPs, where $\mathcal{F}(D)=O(D)$ for non-learned $f$ and $O(D^{2})$ when $f$ is the mapping matrix.
In all practical cases, (2) is substantially lower than (1): activation communication eliminates the $M$ decoding passes of $A$ entirely and shortens $B$'s context from $P+M$ to $P$ tokens, at the cost of only a partial forward pass of $A$ and the negligible $\mathcal{F}(D)$ term.
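Plugging representative values into Equations (1) and (2) makes the gap concrete. The model shape below is loosely 7B/8B-like, and the prompt, message, and output lengths ($P$, $M$, $T$) are illustrative assumptions:

```python
# Hyperparameters: L layers, H heads, key size K, FFN size F, dim D, vocab V.
L, H, K, F, D, V = 32, 32, 128, 11008, 4096, 32000
P, M, T, k = 512, 256, 64, 26    # prompt, message, output lengths; graft layer

def layer_flops(n):
    # Per-layer FLOPs of one forward pass over an n-token context.
    return 8 * n * D * K * H + 4 * n**2 * K * H + 3 * H * n**2 + 4 * n * D * F

# Eq. (1): natural language -- M full passes of A on P tokens,
# then T passes of B on P + M tokens.
nl = M * (4 * P * V * D + L * layer_flops(P)) + T * (
    4 * (P + M) * V * D + L * layer_flops(P + M)
)

# Eq. (2): activation communication -- one partial pass of A (k layers),
# T passes of B on P tokens, plus the O(D^2) mapping-matrix cost.
ac = 2 * P * V * D + k * layer_flops(P) + T * (
    4 * P * V * D + L * layer_flops(P)
) + D**2

print(f"NL / AC FLOP ratio: {nl / ac:.1f}x")
```

Longer messages $M$ widen the gap, since (1) grows with $M$ in both terms while (2) does not depend on $M$ at all.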
3.3 Why should this work?
Recall that Pham et al. (2024) propose CIPHER, which communicates the average tokenizer embedding (weighted by the LLM's next-token probabilities) between models. We build upon the intuition behind CIPHER, which goes as follows: the token sampling process during decoding risks substantial information loss from the model's output logits, and communicating a model's weighted-average tokenizer embedding essentially entails communicating both that model's final answer and its belief in that answer (over the entire vocabulary).
Communicating activations, then, can be thought of as communicating a strict superset of {next-token prediction, belief over entire vocabulary}: activations of late-enough layers essentially encode the model's entire knowledge about the provided context, as well as its predicted completion and its confidence in that completion (see Figures 1 and 7 in Hewitt & Manning (2019) and Hernandez et al. (2024), respectively, which show that linear probes tasked with predicting certain output characteristics from a Transformer's intermediate layer embeddings of its input work poorly for early layers, extremely well after around the halfway point of computation, but then drop in accuracy closer to the final layers). Note one important critique of multi-agent debate: in cases where multiple agents are uncertain about the answer, there is no reason why referencing other agents' answers would generate more factual reasoning. Both CIPHER and activation communication solve this problem, as some notion of model confidence is communicated along with the next-token prediction. Indeed, these curves of probe accuracy by layer indicate that the final layers and LM head "throw away" information not useful for next-token prediction that could well be useful for communicative purposes. This is precisely why our proposed activation communication technique is not an iterative approach (there is no notion of "rounds" as in debate and CIPHER, which require an additional token budget to extract more and more information out of the LM): one activation grafting step from $A$ to $B$ inherently communicates to $B$ all of $A$'s knowledge and beliefs about the prompt it was given. Moreover, the extra information encoded in a model's activations, beyond its next-token prediction and confidence, is what makes activation communication more performant than its natural language counterpart, as we will see in Section 4.
4 Experiments
| Game | Sample $(\mathtt{prompt},\mathtt{answer})$ pair |
| --- | --- |
| Countries | $x_{A}$: "Alice is at the Acropolis of Athens."<br>$x_{B}$: "Which country is Alice located in?"<br>$B$'s expected answer: "Greece" |
| Tip Sheets | $x_{A}$: "Acme Inc. has taken a nosedive, as its quarterly earnings have dipped 8%. Meanwhile Doe LLC and Kiteflyer Labs have both reached record-high stock prices of 89, but Kiteflyer is involved in an IP lawsuit with its competitors."<br>$x_{B}$: "You must invest in one company out of {Acme Inc., Doe LLC, Kiteflyer Labs}. Which do you invest in?"<br>$B$'s expected answer: "Doe LLC" |
Table 1: Multi-player coordination games. Sample $\mathtt{(prompt,answer)}$ pairs for each game.
We test our method on two distinct experimental setups: multi-player coordination games (Section 4.1) and reasoning benchmarks (Section 4.2). Qualitative results are available in Appendix A.
4.1 Multi-player coordination games
Drawing from existing literature on multi-agent communication, we design two Lewis signaling games (Lewis, 2008; Lazaridou et al., 2016) to test the efficacy of activation communication (example prompts and answers in Table 1):
1. Countries, where $A$ is given as input a string of the format "[PERSON] is at the [LANDMARK]" and $B$ is asked "Which country is [PERSON] located in?"
2. Tip Sheets (inspired by Lewis et al. (2017)), where $A$ is given a simulated "tip sheet" and $B$ is asked to make an informed investment decision in accordance with the information in the tip sheet.
| Model | Method | Countries | Tip Sheets |
| --- | --- | --- | --- |
| $\mathtt{LLaMA\text{-}3.2\text{-}3B}$ | $\varnothing$ | $0.0$ ($0.0$, $0.0$) | $38.6$ ($38.6$, $39.4$) |
| | Skyline | $84.0$ ($83.5$, $84.1$) | $100.0$ ($100.0$, $100.0$) |
| | NL | $69.0$ ($68.7$, $69.3$) | $74.3$ ($74.0$, $74.6$) |
| | AC ($\mathtt{sum}$) | $34.0$ ($33.9$, $34.4$) | $50.0$ ($49.6$, $50.3$) |
| | AC ($\mathtt{mean}$) | $36.0$ ($35.5$, $36.1$) | $80.0$ ($79.8$, $80.4$) |
| | AC ($\mathtt{replace}$) | $\mathbf{78.0}$ ($77.7$, $78.2$) | $\mathbf{90.0}$ ($89.9$, $90.3$) |
| $\mathtt{LLaMA\text{-}3.1\text{-}8B}$ | $\varnothing$ | $2.0$ ($1.9$, $2.1$) | $54.3$ ($54.2$, $54.5$) |
| | Skyline | $86.0$ ($85.7$, $86.1$) | $100.0$ ($100.0$, $100.0$) |
| | NL | $77.0$ ($76.6$, $77.1$) | $85.7$ ($85.3$, $85.8$) |
| | AC ($\mathtt{sum}$) | $71.0$ ($70.9$, $71.4$) | $85.7$ ($85.5$, $86.0$) |
| | AC ($\mathtt{mean}$) | $70.0$ ($69.7$, $70.3$) | $92.9$ ($92.7$, $93.1$) |
| | AC ($\mathtt{replace}$) | $\mathbf{83.0}$ ($82.7$, $83.1$) | $\mathbf{95.7}$ ($95.6$, $95.9$) |
Table 2: Accuracies (%) on both coordination games using two identical $\mathtt{LLaMA}$ family models. Communication at layer $k=j=26.$ $95\%$ confidence intervals ( $1000$ bootstrap iterations) reported in parentheses.
We synthetically generate $100$ (Countries) and $70$ (Tip Sheets) different prompts and answers of the same format as the samples in Table 1, and report the proportion of those samples for which $B$ responds with an exact string match to the ground-truth answer. As baselines, we consider a "silent" ($\varnothing$) setup, where the agents are not allowed to communicate; a "single-agent skyline," where a single LLM is given the concatenation of $A$ and $B$'s prompts; and traditional natural language communication, where $A$ is asked to output a message that is then given to $B$ along with $x_{B}$. All decoding is done greedily.
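The scoring metric is just exact string match over the generated responses; a minimal sketch (the function name, and stripping surrounding whitespace before comparison, are our own assumptions):

```python
def exact_match_accuracy(responses, answers):
    """Percent of B's responses that exactly string-match the ground truth."""
    assert len(responses) == len(answers)
    hits = sum(r.strip() == a.strip() for r, a in zip(responses, answers))
    return 100.0 * hits / len(responses)
```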
Table 2 presents the results for both coordination games using 2 different instances of the same model as the agents ($A=B$). Across the 3B and 8B model sizes, activation communication (AC) with $f=\mathtt{replace}$ almost completely recovers the gap between the zero-communication ($\varnothing$) setup and the single-agent skyline (Skyline), outperforming natural language communication (NL) using far less compute. We hypothesize that $\mathtt{replace}$ is more effective than $\mathtt{mean}$ and $\mathtt{sum}$ because the former is guaranteed to output a vector within $B$'s activation space, while the latter two likely do not (e.g., the norm of the vector output by $\mathtt{sum}$ will be substantially larger than that of a typical activation). Furthermore, most of the information $B$ needs is likely contained in its representations of previous tokens in the sequence, so losing its final-token representation does not hurt.
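The norm intuition behind $\mathtt{replace}$'s advantage is easy to check on synthetic activations (Gaussian stand-ins; real activation statistics will differ): summing inflates the norm by a factor between $\sqrt{2}$ (uncorrelated vectors) and $2$ (identical vectors), pushing the result off the typical-norm shell of $B$'s activation space, while replacement preserves it.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096                                   # 8B-scale hidden dimension
a = rng.normal(size=(1000, d))             # stand-ins for A's last-token activations
b = rng.normal(size=(1000, d))             # stand-ins for B's

norm_b = np.linalg.norm(b, axis=1).mean()
norm_replace = np.linalg.norm(a, axis=1).mean()        # replace: norm preserved
norm_sum_indep = np.linalg.norm(a + b, axis=1).mean()  # uncorrelated: ~sqrt(2)x
norm_sum_corr = np.linalg.norm(b + b, axis=1).mean()   # fully correlated: 2x
```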
Figure 2: 2D contour plots of accuracy over different values of $k$ and $j$ (the layers at which we access/edit activations for $A$/$B$ respectively). $k=j=26$ is roughly optimal (diamond marker) for both (a) Countries and (b) Tip Sheets.
4.2 Reasoning Benchmarks
Next, we test our methods on a variety of reasoning benchmarks, spanning several real-world tasks and domains.
Baselines
We benchmark activation communication against the following two baselines:
- Single Model: A single LLM responds to the prompt in natural language.
- Natural Language Debate (NLD) (Du et al., 2023): Each LLM provides an initial response to the given prompt. Then, for each of $r-1$ subsequent rounds, each LLM is prompted to refine its previous response given the other agents' responses as input. Note that NLD is the most direct baseline for our approach, as it is a state-of-the-art natural language communication protocol. We fix $r=2$ in our experiments.
Note that we do not compare to Pham et al. (2024), as they communicate the input (tokenizer) embeddings rather than activations/output embeddings between models, and hence require a shared tokenizer and embedding table between agents, which is extremely restrictive and prevents applicability to our experimental setup.
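The round structure of the NLD baseline can be sketched as follows, where `query_agent` is a hypothetical stand-in for sampling a natural-language response from an LM agent (the exact refinement prompt wording here is illustrative):

```python
def natural_language_debate(agents, prompt, query_agent, r=2):
    """Sketch of NLD (Du et al., 2023): r rounds of answer-and-refine.

    `query_agent(agent, text)` is a hypothetical stand-in for sampling
    a natural-language response from an LM agent.
    """
    # Round 1: each agent answers the prompt independently.
    responses = [query_agent(agent, prompt) for agent in agents]
    # Rounds 2..r: each agent refines its answer given the others' responses.
    for _ in range(r - 1):
        responses = [
            query_agent(
                agent,
                prompt
                + "\n\nOther agents responded:\n"
                + "\n".join(resp for other, resp in enumerate(responses) if other != i)
                + "\n\nUsing these responses, refine your previous answer.",
            )
            for i, agent in enumerate(agents)
        ]
    return responses
```

With $r=2$ as in our experiments, each agent thus generates exactly twice: once independently and once after seeing every other agent's initial answer.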
To determine the values of $k$ and $j$ for activation communication (AC), we compute the accuracy on Countries and Tip Sheets for every pair $(k,j) \in \{1,\ldots,30\}^{2}$. Based on these results (shown in Figure 2) as well as Table 2, we fix $k=j=26$ and $f=\mathtt{replace}$ for the following experiments.
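For intuition, the grafting operation with $f=\mathtt{replace}$ can be sketched with toy stand-in "models" (stacks of random nonlinear maps; the hidden size and layer count below are illustrative, not the real architectures): run $A$'s layers $1..k$, pause $B$ at layer $j$, overwrite $B$'s hidden state with $A$'s, then finish $B$'s forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16      # toy shared hidden size (same-family models share activation spaces)
L = 30      # toy layer count for both models
k = j = 26  # read A's activation at layer k; edit B's at layer j

# Toy "transformer blocks": random nonlinear maps standing in for real layers.
blocks_A = [rng.normal(0, 0.5, (D, D)) for _ in range(L)]
blocks_B = [rng.normal(0, 0.5, (D, D)) for _ in range(L)]

def run_layers(blocks, h):
    # Sequentially apply each block, as in a forward pass.
    for W_l in blocks:
        h = np.tanh(W_l @ h)
    return h

prompt_A = rng.normal(size=D)  # stand-ins for the two models' embedded prompts
prompt_B = rng.normal(size=D)

h_A = run_layers(blocks_A[:k], prompt_A)  # 1) partial forward pass through A (layers 1..k)
h_B = run_layers(blocks_B[:j], prompt_B)  # 2) pause B's forward pass at layer j
h = h_A                                   # 3) f = replace: graft A's activation over B's
out = run_layers(blocks_B[j:], h)         # 4) resume B's pass through layers j+1..L
```

Note that $A$ never runs its final $L-k$ layers, which is where the compute savings over a second full forward pass come from.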
Across all experiment configurations, we fix the decoding strategy to nucleus sampling with $p=0.9$ .
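For reference, nucleus (top-$p$) sampling restricts the next-token distribution to the smallest set of highest-probability tokens whose cumulative mass reaches $p$, renormalizes, and samples from that set. A minimal NumPy implementation:

```python
import numpy as np

def nucleus_sample(probs, p=0.9, rng=None):
    """Sample a token index via nucleus (top-p) sampling: keep the smallest
    set of highest-probability tokens with cumulative mass >= p, renormalize
    over that set, and sample from it."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]       # token indices by descending probability
    cdf = np.cumsum(probs[order])
    cutoff = np.searchsorted(cdf, p) + 1  # smallest prefix whose mass reaches p
    nucleus = order[:cutoff]
    renorm = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=renorm))
```

With $p=0.9$, low-probability tail tokens are excluded entirely, while the relative odds among the retained tokens are preserved by the renormalization.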
Models
We conduct most of our experiments using $\mathtt{LLaMA}$-$\mathtt{3.2}$-$\mathtt{3B}$ and $\mathtt{LLaMA}$-$\mathtt{3.1}$-$\mathtt{8B}$ as the two agents. Additionally, to test our approach's robustness and generalizability, we conduct experiments with models belonging to various other suites within the $\mathtt{LLaMA}$ family and of several different sizes.
Note that for these experiments, we restrict the setting to communication between different models (rather than between multiple instances of the same model, as in Section 4.1), since the same model would have identical activations for the same prompts, meaning no information would be communicated in the grafting process. We argue that the multiple-model setting is realistic (perhaps more so than the setting of multiple instances of the same model), as recent advances in LLM development have led to the release of models with specialized abilities (Singhal et al., 2023) and of different sizes (Dubey et al., 2024) that merit complementary usage. Our work thus answers the question: How can we get the best performance by leveraging multiple models of distinct capabilities and sizes, relative to the added inference-time compute over a single forward pass through any single model?
Datasets
We evaluate our technique on seven reasoning datasets that span various real-world tasks and domains: (i) Biographies (Du et al., 2023), which asks the LLM to generate a factual biography of a famous computer scientist; (ii) GSM8k (Cobbe et al., 2021), a variety of grade school math problems created by human problem writers; and (iii) $5$ datasets randomly drawn from MMLU (Hendrycks et al., 2021): High School Psychology (from the Social Sciences category), Formal Logic (from the Humanities category), College Biology (from the STEM category), Professional Law (from the Humanities category), and Public Relations (from the Social Sciences category). We evaluate on a randomly-sampled size-$100$ subset of each dataset.
In experiments involving the mapping matrix $\bm{W}$, we instantiate $\bm{W} \in \mathbb{R}^{4096 \times 3072}$ using Xavier initialization and train for $10$ epochs on a dataset of $3072$ sentences randomly drawn from the Colossal Clean Crawled Corpus (C4) (Dodge et al., 2021). (We use $3072$ sentences because linear regression with $d$-dimensional input has a sample complexity of $O(d)$ (Vapnik, 1999).) We use batch size $32$ and the Adam optimizer with learning rate $0.001$.
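The fitting of $\bm{W}$ amounts to a least-squares alignment between paired activations. The sketch below substitutes toy dimensions and synthetic paired data for the real $3072$- and $4096$-dimensional activations on C4 sentences, using Xavier initialization, batch size $32$, and a hand-rolled Adam update at learning rate $0.001$; the epoch count is illustrative for this toy problem.

```python
import numpy as np

rng = np.random.default_rng(0)
d_A, d_B = 8, 12  # toy stand-ins for the real 3072 -> 4096 mapping
n = 256           # toy stand-in for the 3072 C4 sentences

# Synthetic paired activations; in the paper these come from the two
# models' intermediate-layer activations on the same C4 sentences.
A_acts = rng.normal(size=(n, d_A))
W_true = rng.normal(size=(d_B, d_A))
B_acts = A_acts @ W_true.T + 0.01 * rng.normal(size=(n, d_B))

# Xavier (Glorot) uniform initialization of W in R^{d_B x d_A}.
limit = np.sqrt(6.0 / (d_A + d_B))
W = rng.uniform(-limit, limit, size=(d_B, d_A))
W_init = W.copy()

def loss(M):
    # Mean-squared alignment error between mapped A-activations and B-activations.
    return np.mean((A_acts @ M.T - B_acts) ** 2)

# Adam (lr 0.001, batch size 32) on the alignment objective.
m = np.zeros_like(W)
v = np.zeros_like(W)
t = 0
for epoch in range(500):  # epoch count is illustrative for this toy problem
    for start in range(0, n, 32):
        X, Y = A_acts[start:start + 32], B_acts[start:start + 32]
        grad = 2.0 * (X @ W.T - Y).T @ X / len(X)  # gradient of the batch MSE
        t += 1
        m = 0.9 * m + 0.1 * grad
        v = 0.999 * v + 0.001 * grad ** 2
        W -= 0.001 * (m / (1 - 0.9 ** t)) / (np.sqrt(v / (1 - 0.999 ** t)) + 1e-8)
```

At inference time, $\bm{W}$ is applied to $A$'s activation before the replacement step, projecting it into $B$'s activation space.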
Metrics
We measure the accuracy of the final response for the single models and AC. For NLD, we measure the accuracy of the majority-held final-round answer across agents when the answer is automatically verifiable (numeric in GSM8k, multiple choice for the MMLU datasets) or the average final-round answer across agents otherwise (Biographies).
For GSM8k and the MMLU datasets, we report the proportion of samples in the dataset for which the generated answer exactly matches the ground-truth answer. For Biographies, following Du et al. (2023), we prompt an LLM judge ($\mathtt{LLaMA}$-$\mathtt{3.1}$-$\mathtt{8B}$) to check whether each manually-decomposed fact in a ground-truth biography is supported ($1$), partially supported ($0.5$), or unsupported ($0$) in the generated biography, taking the mean of these scores over all facts as the per-biography accuracy and the mean over all dataset samples as the total accuracy.
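These scoring rules reduce to a few small functions; a sketch (the LM-judge per-fact labels for Biographies are assumed given):

```python
from collections import Counter

def exact_match_accuracy(predictions, answers):
    """Proportion of samples whose generated answer exactly matches
    the ground truth (numeric for GSM8k, choice letter for MMLU)."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)

def majority_answer(final_round_answers):
    """NLD scoring for automatically verifiable tasks: the majority-held
    final-round answer across agents."""
    return Counter(final_round_answers).most_common(1)[0][0]

def biography_score(fact_labels):
    """Per-biography accuracy: mean of the judge's per-fact labels, where
    supported = 1, partially supported = 0.5, unsupported = 0."""
    return sum(fact_labels) / len(fact_labels)
```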
| Method | Biographies | GSM8k | HS Psychology | Formal Logic | College Biology | Professional Law | Public Relations |
| --- | --- | --- | --- | --- | --- | --- | --- |
| $\mathtt{LLaMA}$-$\mathtt{3.2}$-$\mathtt{3B}$ | $79.4 ± 0.0$ | $58.0 ± 4.9$ | $30.0 ± 1.0$ | $16.0 ± 0.8$ | $11.0 ± 0.7$ | $0.0 ± 0.0$ | $26.0 ± 0.1$ |
| $\mathtt{LLaMA}$-$\mathtt{3.1}$-$\mathtt{8B}$ | $83.9 ± 0.0$ | $60.0 ± 4.9$ | $65.0 ± 0.1$ | $42.0 ± 0.1$ | $50.0 ± 0.2$ | $20.0 ± 0.8$ | $53.0 ± 0.2$ |
| NLD | $80.2 ± 0.1$ | $\mathbf{75.0} ± 4.3$ | $83.0 ± 0.8$ | $37.0 ± 0.1$ | $71.0 ± 0.1$ | $30.0 ± 0.1$ | $63.0 ± 0.7$ |
| AC | $84.6 ± 0.0$ | $64.0 ± 4.8$ | $\mathbf{85.0} ± 0.8$ | $\mathbf{47.0} ± 0.1$ | $78.0 ± 0.9$ | $30.0 ± 0.1$ | $\mathbf{74.0} ± 0.1$ |
| AC ($\bm{W}$) | $\mathbf{86.8} ± 0.0$ | $66.0 ± 4.8$ | $70.0 ± 0.1$ | $35.0 ± 0.1$ | $\mathbf{79.0} ± 0.9$ | $\mathbf{45.0} ± 0.1$ | $63.0 ± 0.1$ |
Table 3: Accuracies (%) on all seven reasoning benchmarks. NLD and all AC variants involve communication between $\mathtt{LLaMA}$-$\mathtt{3.2}$-$\mathtt{3B}$ ($A$) and $\mathtt{LLaMA}$-$\mathtt{3.1}$-$\mathtt{8B}$ ($B$); the performances of these models individually are presented in the first two rows of the table. NLD typically improves performance over at least one of the single-model baselines; AC, both with and without the task-agnostic linear layer, consistently beats both baselines as well as NLD.
Comprehensive evaluation with the $\mathtt{LLaMA}$ family
Table 3 presents results on each of the seven reasoning benchmarks across the baselines and activation communication. Notably, while NLD consistently outperforms $\mathtt{LLaMA}$-$\mathtt{3.2}$-$\mathtt{3B}$, it does not always improve over $\mathtt{LLaMA}$-$\mathtt{3.1}$-$\mathtt{8B}$; remarkably, AC outperforms both single-model baselines across the board. In fact, AC offers up to a $27.0\%$ improvement over NLD across six of the seven reasoning datasets. When applying $\bm{W}$ to $A$'s activation before performing the replacement function, we see further gains of $2.6$-$50.0\%$ over vanilla AC on four of the seven datasets. We hypothesize that the benefits from the learned linear layer are less consistent across datasets because the subset of C4 data used to train $\bm{W}$ is likely more semantically similar to some datasets than others; datasets that provide $\bm{W}$ with out-of-distribution inputs see reduced performance compared to vanilla AC.
While we fix $A$ as the smaller model and $B$ as the larger model in Table 3 (to ensure decoding happens with the presumably more capable model), this need not be the case: swapping $A$ and $B$ yields $81.5 ± 0.0$ and $61.0 ± 4.8$ on Biographies and GSM8k respectively (without the linear layer). While these accuracies are lower than their non-swapped counterparts, they are still higher than both single-model baselines (and higher than NLD on Biographies); moreover, this configuration is more compute-efficient, since the smaller model is now the one performing the full rather than partial forward pass.
Note that we find AC outperforms NLD on 48 of the 57 datasets in the full MMLU benchmark; complete MMLU results, as well as a suite of additional experiments, are shown in Appendix B.
Performance-compute tradeoff and generalization to different model scales
Thus far, we have considered the absolute performance of AC relative to NLD, on which our method attains state-of-the-art results. However, the superiority of activations as a language for inter-LLM communication is further illustrated by AC's larger ratio of performance improvement to added inference-time compute over individual LMs. Figure 3 displays the results of single models, AC, and NLD across model scales and suites within the $\mathtt{LLaMA}$ family on the Biographies dataset; incoming arrows to AC and NLD nodes denote the base models between which communication occurred. Not only does AC consistently outperform both single-model baselines (unlike NLD), but the slope of each black line is also far greater than the slope of each gray line, indicating that AC achieves greater increases in accuracy per additional unit of inference-time compute (normalized by the compute of a single forward pass through $\mathtt{LLaMA}$-$\mathtt{3.2}$-$\mathtt{1B}$ on the given prompt) than NLD.
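As a rough illustration of this normalization, forward-pass compute can be approximated as $\approx 2 \cdot (\text{params}) \cdot (\text{tokens})$ FLOPs. The parameter counts, layer counts, and token budget below are assumptions for illustration only, and the estimate ignores NLD's longer round-2 contexts (which push its true cost higher still):

```python
def forward_flops(params_billions, tokens, frac_layers=1.0):
    # Rough forward-pass cost: ~2 FLOPs per parameter per token,
    # scaled by the fraction of layers actually run.
    return 2 * params_billions * 1e9 * tokens * frac_layers

tokens = 512                        # assumed prompt + generation length
unit = forward_flops(1.24, tokens)  # normalizer: one LLaMA-3.2-1B forward pass
# AC: partial pass through A (layers 1..k of an assumed 28) + full pass through B.
ac = forward_flops(3.21, tokens, frac_layers=26 / 28) + forward_flops(8.03, tokens)
# NLD with r = 2: both models run a full generation in each round.
nld = 2 * (forward_flops(3.21, tokens) + forward_flops(8.03, tokens))
ac_ratio, nld_ratio = ac / unit, nld / unit  # compute in 1B-forward-pass units
```

Even this crude estimate puts AC well below NLD in total compute for the same pair of models; accounting for NLD's growing round-2 contexts widens the gap further.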
Communication across model families
Table 4 displays results for AC between models from the $\mathtt{Qwen}$-$\mathtt{2.5}$, $\mathtt{Gemma}$-$\mathtt{2}$, and $\mathtt{LLaMA}$-$\mathtt{3}$ families. We see that AC beats NLD across the board, and beats both individual models for $4$ (Biographies) and $5$ (GSM8k) of the $6$ model pairs, demonstrating the efficacy of AC irrespective of model architecture, size, tokenizer, and training data. Moreover, these results are obtained without training $\bm{W}$: we do not need a separate projection layer between activation spaces to attain state-of-the-art results, even for extremely distinct models. (We hypothesize this is because we replace only $B$'s last-token activation, so $B$ can learn from $A$ without an extreme alteration to its activation distribution. An alternative explanation is to view this result as evidence for the platonic representation hypothesis (Huh et al., 2024), which prior deep learning work has often alluded to, including in the context of cross-model representation stitching (Moschella et al., 2023; Kornblith et al., 2019).)
| Models ($A$, $B$) | $A$ | $B$ | NLD | AC |
| --- | --- | --- | --- | --- |
| $\mathtt{LLaMA}$-$\mathtt{3.2}$-$\mathtt{3B}$, $\mathtt{LLaMA}$-$\mathtt{3.1}$-$\mathtt{8B}$ | $79.4 ± 0.0$ / $58.0 ± 4.9$ | $83.9 ± 0.0$ / $60.0 ± 4.9$ | $80.2 ± 0.1$ / $\mathbf{75.0} ± 4.3$ | $\mathbf{84.6} ± 0.0$ / $64.0 ± 4.8$ |
| $\mathtt{Qwen}$-$\mathtt{2.5}$-$\mathtt{1.5B}$, $\mathtt{Qwen}$-$\mathtt{2.5}$-$\mathtt{3B}$ | $59.4 ± 0.9$ / $20.0 ± 0.9$ | $85.5 ± 1.1$ / $35.0 ± 1.1$ | $63.2 ± 1.1$ / $65.0 ± 1.1$ | $\mathbf{89.6} ± 1.0$ / $\mathbf{70.0} ± 1.0$ |
| $\mathtt{Gemma}$-$\mathtt{2}$-$\mathtt{2B}$, $\mathtt{Gemma}$-$\mathtt{2}$-$\mathtt{9B}$ | $83.0 ± 1.1$ / $45.0 ± 1.1$ | $\mathbf{94.6} ± 0.9$ / $80.0 ± 0.9$ | $70.3 ± 1.0$ / $70.0 ± 1.0$ | $88.1 ± 0.7$ / $\mathbf{90.0} ± 0.7$ |
| $\mathtt{Qwen}$-$\mathtt{2.5}$-$\mathtt{1.5B}$, $\mathtt{LLaMA}$-$\mathtt{3.2}$-$\mathtt{3B}$ | $59.4 ± 0.9$ / $20.0 ± 0.9$ | $79.4 ± 0.0$ / $58.0 ± 4.9$ | $75.4 ± 1.0$ / $\mathbf{75.0} ± 1.0$ | $\mathbf{79.5} ± 1.0$ / $75.0 ± 1.0$ |
| $\mathtt{LLaMA}$-$\mathtt{3.2}$-$\mathtt{3B}$, $\mathtt{Gemma}$-$\mathtt{2}$-$\mathtt{2B}$ | $79.4 ± 0.0$ / $58.0 ± 4.9$ | $83.0 ± 1.1$ / $45.0 ± 1.1$ | $62.5 ± 1.1$ / $55.0 ± 1.1$ | $\mathbf{84.0} ± 0.1$ / $\mathbf{60.0} ± 1.1$ |
| $\mathtt{Qwen}$-$\mathtt{2.5}$-$\mathtt{1.5B}$, $\mathtt{Gemma}$-$\mathtt{2}$-$\mathtt{2B}$ | $59.4 ± 0.9$ / $20.0 ± 0.9$ | $\mathbf{83.0} ± 1.1$ / $45.0 ± 1.1$ | $49.3 ± 1.1$ / $50.0 ± 1.1$ | $73.0 ± 1.1$ / $\mathbf{55.0} ± 1.1$ |
Table 4: Individual model, AC, and NLD accuracies across three model families. Each cell displays two values: Biographies score / GSM8k score.
Figure 3: Accuracy (%) vs. compute (# FLOPs normalized by a single $\mathtt{LLaMA}$-$\mathtt{3.2}$-$\mathtt{1B}$ forward pass) for various configurations of AC and NLD on the Biographies dataset. AC yields the greatest performance gains per additional unit of inference-time compute over each baseline.
5 Conclusion
We present a simple approach that enables effective and computationally efficient communication between language models by injecting information from the activations of one model into the activations of another during the forward pass. Salient features of this approach: it (i) scales up LLMs on new tasks by leveraging existing, frozen LLMs with zero additional task-specific parameters and data, (ii) applies to diverse domains and settings, and (iii) saves a substantial amount of compute.
There are some limitations to this method. First, when not using the learned model-specific mapping discussed in Section 3.1, our method requires both models to have aligned embedding spaces, such that the activation of one model roughly retains its meaning in the other's activation space (note that, unlike past works such as Pham et al. (2024), we do not require shared tokenizers or aligned vocabularies, only aligned embeddings). While less restrictive than past works, this assumption is somewhat limiting; it can be relaxed by letting $f$ be the learned model-specific mapping, and in practice we find that even amongst different models in the $\mathtt{LLaMA}$ family, no such mapping is required for state-of-the-art results.
Second, this method requires access to model internals and will not work with black-box API access; however, restricting research to API-only approaches is itself limiting, and recent releases of powerful open-source models (Dubey et al., 2024) merit the development of embedding-based techniques.
Third, while a concern might be the limited interpretability of communicating activations as opposed to natural language, we note two points. First, there is a fundamental tradeoff between interpretability and information preservation: activations, being much higher-dimensional than the space of natural-language messages, allow proportionally higher-entropy communication (Pham et al., 2024), a tradeoff that merits discussion beyond the scope of this work. Second, we posit that our method suggests a new avenue toward interpreting LM activations: "translating" activations based on the beliefs they induce as messages in listening agents, similar to the method put forward in Andreas et al. (2018). We recognize this as a promising avenue for future research.
Additional directions for future work include using AC to allow large LMs to leverage small, tunable LMs as "knowledge bases" during decoding (Lee et al., 2024), as in collaborative decoding setups (Shen et al., 2024), and testing our approach on more complex coordination games (e.g., Lewis-style negotiation games (Lewis et al., 2017), Diplomacy).
Impact Statement
This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.
Acknowledgements
The authors are grateful to Jacob Andreas, Yoon Kim, and Sham Kakade for their valuable discussions and feedback.
References
- Ahn et al. (2022) Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., Fu, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Ho, D., Hsu, J., Ibarz, J., Ichter, B., Irpan, A., Jang, E., Ruano, R. J., Jeffrey, K., Jesmonth, S., Joshi, N. J., Julian, R., Kalashnikov, D., Kuang, Y., Lee, K.-H., Levine, S., Lu, Y., Luu, L., Parada, C., Pastor, P., Quiambao, J., Rao, K., Rettinghouse, J., Reyes, D., Sermanet, P., Sievers, N., Tan, C., Toshev, A., Vanhoucke, V., Xia, F., Xiao, T., Xu, P., Xu, S., Yan, M., and Zeng, A. Do as i can, not as i say: Grounding language in robotic affordances, 2022.
- Andreas et al. (2018) Andreas, J., Dragan, A., and Klein, D. Translating neuralese, 2018.
- Bansal et al. (2024) Bansal, R., Samanta, B., Dalmia, S., Gupta, N., Vashishth, S., Ganapathy, S., Bapna, A., Jain, P., and Talukdar, P. Llm augmented llms: Expanding capabilities through composition, 2024.
- Burns et al. (2023) Burns, C., Izmailov, P., Kirchner, J. H., Baker, B., Gao, L., Aschenbrenner, L., Chen, Y., Ecoffet, A., Joglekar, M., Leike, J., Sutskever, I., and Wu, J. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision, 2023.
- Chaabouni et al. (2019) Chaabouni, R., Kharitonov, E., Lazaric, A., Dupoux, E., and Baroni, M. Word-order biases in deep-agent emergent communication. In Korhonen, A., Traum, D., and Màrquez, L. (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5166–5175, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1509. URL https://aclanthology.org/P19-1509.
- Cobbe et al. (2021) Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems, 2021.
- Dodge et al. (2021) Dodge, J., Sap, M., Marasović, A., Agnew, W., Ilharco, G., Groeneveld, D., Mitchell, M., and Gardner, M. Documenting large webtext corpora: A case study on the colossal clean crawled corpus, 2021. URL https://arxiv.org/abs/2104.08758.
- Du et al. (2023) Du, Y., Li, S., Torralba, A., Tenenbaum, J. B., and Mordatch, I. Improving factuality and reasoning in language models through multiagent debate, 2023.
- Dubey et al. (2024) Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783.
- Foerster et al. (2016) Foerster, J. N., Assael, Y. M., de Freitas, N., and Whiteson, S. Learning to communicate with deep multi-agent reinforcement learning, 2016.
- Geiping et al. (2025) Geiping, J., McLeish, S., Jain, N., Kirchenbauer, J., Singh, S., Bartoldson, B. R., Kailkhura, B., Bhatele, A., and Goldstein, T. Scaling up test-time compute with latent reasoning: A recurrent depth approach, 2025. URL https://arxiv.org/abs/2502.05171.
- Hao et al. (2024) Hao, S., Sukhbaatar, S., Su, D., Li, X., Hu, Z., Weston, J., and Tian, Y. Training large language models to reason in a continuous latent space, 2024. URL https://arxiv.org/abs/2412.06769.
- Hendrycks et al. (2021) Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding, 2021. URL https://arxiv.org/abs/2009.03300.
- Hernandez et al. (2024) Hernandez, E., Sharma, A. S., Haklay, T., Meng, K., Wattenberg, M., Andreas, J., Belinkov, Y., and Bau, D. Linearity of relation decoding in transformer language models, 2024.
- Hewitt & Manning (2019) Hewitt, J. and Manning, C. D. A structural probe for finding syntax in word representations. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4129â4138, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1419. URL https://aclanthology.org/N19-1419.
- Hoffmann et al. (2022) Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., and Sifre, L. Training compute-optimal large language models, 2022.
- Huh et al. (2024) Huh, M., Cheung, B., Wang, T., and Isola, P. The platonic representation hypothesis, 2024. URL https://arxiv.org/abs/2405.07987.
- Ilharco et al. (2023) Ilharco, G., Ribeiro, M. T., Wortsman, M., Gururangan, S., Schmidt, L., Hajishirzi, H., and Farhadi, A. Editing models with task arithmetic, 2023.
- Jaques et al. (2019) Jaques, N., Lazaridou, A., Hughes, E., Gulcehre, C., Ortega, P. A., Strouse, D., Leibo, J. Z., and de Freitas, N. Social influence as intrinsic motivation for multi-agent deep reinforcement learning, 2019.
- Kornblith et al. (2019) Kornblith, S., Norouzi, M., Lee, H., and Hinton, G. Similarity of neural network representations revisited, 2019. URL https://arxiv.org/abs/1905.00414.
- Lazaridou et al. (2016) Lazaridou, A., Peysakhovich, A., and Baroni, M. Multi-agent cooperation and the emergence of (natural) language. arXiv preprint arXiv:1612.07182, 2016.
- Lazaridou et al. (2017) Lazaridou, A., Peysakhovich, A., and Baroni, M. Multi-agent cooperation and the emergence of (natural) language, 2017.
- Lee et al. (2024) Lee, J., Yang, F., Tran, T., Hu, Q., Barut, E., Chang, K.-W., and Su, C. Can small language models help large language models reason better?: Lm-guided chain-of-thought, 2024. URL https://arxiv.org/abs/2404.03414.
- Lewis (2008) Lewis, D. Convention: A philosophical study. John Wiley & Sons, 2008.
- Lewis et al. (2017) Lewis, M., Yarats, D., Dauphin, Y. N., Parikh, D., and Batra, D. Deal or no deal? end-to-end learning for negotiation dialogues, 2017.
- Li et al. (2023) Li, G., Hammoud, H. A. A. K., Itani, H., Khizbullin, D., and Ghanem, B. Camel: Communicative agents for "mind" exploration of large language model society, 2023. URL https://arxiv.org/abs/2303.17760.
- Li et al. (2024) Li, K., Patel, O., Viégas, F., Pfister, H., and Wattenberg, M. Inference-time intervention: Eliciting truthful answers from a language model. Advances in Neural Information Processing Systems, 36, 2024.
- Liang et al. (2023) Liang, T., He, Z., Jiao, W., Wang, X., Wang, Y., Wang, R., Yang, Y., Tu, Z., and Shi, S. Encouraging divergent thinking in large language models through multi-agent debate, 2023.
- Lowe et al. (2020) Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., and Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments, 2020.
- Moschella et al. (2023) Moschella, L., Maiorca, V., Fumero, M., Norelli, A., Locatello, F., and Rodolà, E. Relative representations enable zero-shot latent space communication, 2023. URL https://arxiv.org/abs/2209.15430.
- Nakano et al. (2022) Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., Jiang, X., Cobbe, K., Eloundou, T., Krueger, G., Button, K., Knight, M., Chess, B., and Schulman, J. Webgpt: Browser-assisted question-answering with human feedback, 2022.
- Park et al. (2023) Park, J. S., O'Brien, J. C., Cai, C. J., Morris, M. R., Liang, P., and Bernstein, M. S. Generative agents: Interactive simulacra of human behavior, 2023.
- Pham et al. (2024) Pham, C., Liu, B., Yang, Y., Chen, Z., Liu, T., Yuan, J., Plummer, B. A., Wang, Z., and Yang, H. Let models speak ciphers: Multiagent debate through embeddings, 2024.
- Prasad et al. (2023) Prasad, A., Koller, A., Hartmann, M., Clark, P., Sabharwal, A., Bansal, M., and Khot, T. Adapt: As-needed decomposition and planning with language models, 2023.
- Schick et al. (2023) Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., and Scialom, T. Toolformer: Language models can teach themselves to use tools, 2023.
- Shazeer et al. (2017) Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer, 2017.
- Shen et al. (2024) Shen, S. Z., Lang, H., Wang, B., Kim, Y., and Sontag, D. Learning to decode collaboratively with multiple language models, 2024.
- Shen et al. (2023) Shen, Y., Song, K., Tan, X., Li, D., Lu, W., and Zhuang, Y. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face, 2023.
- Singhal et al. (2023) Singhal, K., Tu, T., Gottweis, J., Sayres, R., Wulczyn, E., Hou, L., Clark, K., Pfohl, S., Cole-Lewis, H., Neal, D., Schaekermann, M., Wang, A., Amin, M., Lachgar, S., Mansfield, P., Prakash, S., Green, B., Dominowska, E., y Arcas, B. A., Tomasev, N., Liu, Y., Wong, R., Semturs, C., Mahdavi, S. S., Barral, J., Webster, D., Corrado, G. S., Matias, Y., Azizi, S., Karthikesalingam, A., and Natarajan, V. Towards expert-level medical question answering with large language models, 2023. URL https://arxiv.org/abs/2305.09617.
- Subramani et al. (2022) Subramani, N., Suresh, N., and Peters, M. E. Extracting latent steering vectors from pretrained language models, 2022.
- Sukhbaatar et al. (2016) Sukhbaatar, S., Szlam, A., and Fergus, R. Learning multiagent communication with backpropagation, 2016.
- Sukhbaatar et al. (2024) Sukhbaatar, S., Golovneva, O., Sharma, V., Xu, H., Lin, X. V., Rozière, B., Kahn, J., Li, D., Yih, W.-t., Weston, J., and Li, X. Branch-train-mix: Mixing expert llms into a mixture-of-experts llm, 2024.
- Turner et al. (2023) Turner, A. M., Thiergart, L., Udell, D., Leech, G., Mini, U., and MacDiarmid, M. Activation addition: Steering language models without optimization, 2023.
- Vapnik (1999) Vapnik, V. N. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5):988–999, 1999.
- Wang et al. (2024) Wang, L., Ma, C., Feng, X., Zhang, Z., Yang, H., Zhang, J., Chen, Z., Tang, J., Chen, X., Lin, Y., Zhao, W. X., Wei, Z., and Wen, J.-R. A survey on large language model based autonomous agents, 2024.
- Wang et al. (2023) Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency improves chain of thought reasoning in language models, 2023. URL https://arxiv.org/abs/2203.11171.
- Wei et al. (2022) Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., Chi, E. H., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., and Fedus, W. Emergent abilities of large language models, 2022.
- Wu et al. (2023) Wu, Q., Bansal, G., Zhang, J., Wu, Y., Li, B., Zhu, E., Jiang, L., Zhang, X., Zhang, S., Liu, J., Awadallah, A. H., White, R. W., Burger, D., and Wang, C. Autogen: Enabling next-gen llm applications via multi-agent conversation, 2023.
- Xi et al. (2023) Xi, Z., Chen, W., Guo, X., He, W., Ding, Y., Hong, B., Zhang, M., Wang, J., Jin, S., Zhou, E., Zheng, R., Fan, X., Wang, X., Xiong, L., Zhou, Y., Wang, W., Jiang, C., Zou, Y., Liu, X., Yin, Z., Dou, S., Weng, R., Cheng, W., Zhang, Q., Qin, W., Zheng, Y., Qiu, X., Huang, X., and Gui, T. The rise and potential of large language model based agents: A survey, 2023.
- Yang et al. (2023) Yang, H., Yue, S., and He, Y. Auto-gpt for online decision making: Benchmarks and additional opinions, 2023.
- Yao et al. (2023) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. React: Synergizing reasoning and acting in language models, 2023.
Appendix A Qualitative Results
<details>
<summary>extracted/6420159/biog.png Details</summary>

### Visual Description
## Text Extraction: Joyce K. Reynolds Biography - Multiple Sources
### Overview
The image presents a comparison of biographical information about Joyce K. Reynolds, a computer scientist, generated by different Large Language Models (LLaMA-3-2-3B, LLaMA-3-1-8B, and AC) alongside a ground-truth biography. The content consists of bullet-point lists detailing her contributions and achievements.
### Components/Axes
The image is divided into four distinct sections:
1. **Ground-Truth Biography:** A list of bullet points providing a baseline biography.
2. **LLaMA-3-2-3B:** A list of bullet points generated by the LLaMA-3-2-3B model.
3. **LLaMA-3-1-8B:** A list of bullet points generated by the LLaMA-3-1-8B model.
4. **AC:** A list of bullet points generated by the AC model.
### Content Details
**1. Ground-Truth Biography:**
* Joyce Kathleen Reynolds was an American computer scientist who contributed to the development of protocols underlying the Internet.
* She authored or co-authored many RFCs (Request for Comments) including Telnet, FTP, and POP protocols.
* Reynolds worked with Jon Postel to develop early functions of the Internet Assigned Numbers Authority and managed the root zone of DNS.
* She received the 2006 Postel Award for her services to the Internet.
**2. LLaMA-3-2-3B:**
* Joyce K. Reynolds is a renowned computer scientist and educator who has made significant contributions to the field of computer science and education.
* She is a pioneer in the field of artificial intelligence, with a focus on human-computer interaction and user experience.
* Reynolds has written extensively on topics such as usability, accessibility, and human-centered design.
* She is a highly respected author, having written several books and articles on computer science and education.
* Reynolds has received numerous awards and honors for her contributions to the field.
**3. LLaMA-3-1-8B:**
* Joyce K. Reynolds was born in 1923 in England. She studied mathematics at Girton College, Cambridge, graduating with a degree in 1945.
* Reynolds started her career in computing in 1945 at the National Physical Laboratory (NPL), where she worked on the development of the first commercial data database, the Database Management System (DBMS).
* Reynolds was a key member of the team that developed the first DBMS, which was a pioneering achievement in the field of database management. Her work laid the foundation for modern database systems.
**4. AC:**
* Joyce K. Reynolds is a renowned computer scientist and engineer, best known for her pioneering work in the development of the Internet's fundamental technologies.
* She was a key member of the team that designed the Internet Protocol (IP) and the Transmission Control Protocol (TCP), which form the basis of the modern Internet.
* Reynolds' work on IP and TCP enabled the creation of a network that could efficiently route and manage data packets, paving the way for the modern Internet as we know it.
* Reynolds has received numerous awards and honors for her groundbreaking work, including the Internet Society's (ISOC) highest award, the Jonathan B. Postel Service Award.
* She continues to be an influential figure in the field of computer science and engineering, inspiring new generations of scientists and engineers with her pioneering achievements.
### Key Observations
* The Ground-Truth biography is concise, focusing on her contributions to Internet protocols and administration.
* LLaMA-3-2-3B focuses on her role as an educator and pioneer in AI/HCI, which is not mentioned in the ground truth.
* LLaMA-3-1-8B provides biographical details (birth year, education, early career) not present in the ground truth, and focuses on her work in database management.
* AC aligns more closely with the Ground-Truth, emphasizing her contributions to IP and TCP, but expands on the impact of her work.
* There is significant variation in the information provided by each model, highlighting the potential for LLMs to generate both accurate and inaccurate details.
* All models acknowledge her recognition through awards and honors.
### Interpretation
The image demonstrates the varying outputs of different LLMs when prompted to generate a biography of the same individual. The Ground-Truth biography serves as a benchmark, revealing that the models often supplement or diverge from established facts. LLaMA-2-3-2B appears to focus on a different aspect of her career (AI/HCI), while LLaMA-3-1-8B provides details about her early life and work in database systems. AC provides a more comprehensive and accurate summary, closely aligning with the Ground-Truth while adding further context. This comparison highlights the importance of verifying information generated by LLMs and understanding their potential biases or limitations. The differences in output suggest that the models may have been trained on different datasets or have different internal representations of knowledge. The inclusion of biographical details (birth year, education) by LLaMA-3-1-8B suggests it may have accessed or inferred information beyond the core prompt. The overall data suggests that while LLMs can be useful tools for information retrieval and summarization, they should not be relied upon as definitive sources without careful validation.
</details>
Figure 4: Example of AC on Biographies dataset.
<details>
<summary>extracted/6420159/gsm.png Details</summary>

### Visual Description
## Textual Document: Problem Solving Responses
### Overview
The image presents a problem statement followed by three different responses: two generated by baseline Large Language Models (LLaMA-3-2-3B and LLaMA-3-1-8B) and one by AC, the activation-communication approach proposed in the paper. The problem involves calculating the cost of gift bags for a birthday party, considering that some invitees won't attend.
### Components/Axes
The document is structured into the following sections:
1. **Question:** The initial problem statement.
2. **LLaMA-3-2-3B:** Response from the first LLM.
3. **LLaMA-3-1-8B:** Response from the second LLM.
4. **AC:** Response from the third source.
### Content Details
**Question:**
"Christina is planning a birthday party and needs .75 gift bags per invited guest, because 1/4 of attendees don't show up. She invited 16 friends. Gift bags are $2 each. How much will she spend?"
**LLaMA-3-2-3B:**
"Since 1/4 of attendees don't show up, we need to calculate 3/4 of the total invited guests to find out how many will attend. 3/4 of 16 is (3/4) * 16 = 12. Christina needs 0.75 gift bags per invited guest, so for 12 guests, she needs 0.75 * 12 = 9 gift bags. Each gift bag costs $2, so the total cost will be 9 gift bags * $2 per bag = $18."
**LLaMA-3-1-8B:**
"Christina invited 16 friends, but 1/4 of them won't show up. To find out how many will attend, we need to calculate 1/4 of 16 and subtract that from 16. To find 1/4 of 16, we divide 16 by 4. 16 / 4 = 4. Now that we know 4 guests won't come, we subtract 4 from the total number of guests invited. 16 - 4 = 12 guests will attend. Christina needs .75 gift bags per invited guest. Since she invited 16 friends, but only 12 will attend, we multiply .75 by 12. .75 * 12 = 9. Gift bags are $2 each. Since Christina needs 9 gift bags, we multiply 9 by 2. 9 * 2 = $18."
**AC:**
"1. Since 1/4 of the attendees don't show up, 3/4 of them will. To find the number of guests that will show up, we need to multiply the total number of invited guests by 3/4. So, 16 * 3/4 = 12 guests will show up.
2. Christina needs 0.75 gift bags per invited guest. Since she invited 16 friends, she needs 16 * 0.75 = 12 gift bags.
3. Each gift bag costs $2. Since she needs 12 gift bags, the total cost will be 12 * $2 = $24."
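Reading the problem statement literally (0.75 gift bags per *invited* guest, a rate that already accounts for the 1/4 no-show rate), the arithmetic can be checked with a short script; this is an illustrative sketch, and the variable names are ours:

```python
# Quantities taken directly from the problem statement.
invited = 16
bags_per_invited = 0.75  # "needs .75 gift bags per invited guest"
price_per_bag = 2        # dollars

attendees = invited * (1 - 1 / 4)  # 12 guests actually show up
bags = invited * bags_per_invited  # rate applies per invited guest: 12 bags
cost = bags * price_per_bag        # total spend in dollars

print(attendees, bags, cost)  # 12.0 12.0 24.0
```

Applying the 0.75 factor to the 12 attendees instead (0.75 * 12 = 9 bags, $18) discounts for no-shows twice, which is how a response can arrive at $18 rather than $24.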
### Key Observations
- There is a discrepancy in the final answers: LLaMA-3-2-3B and LLaMA-3-1-8B both arrive at a total cost of $18, while AC arrives at $24.
- The problem defines the rate as 0.75 gift bags per *invited* guest; the 0.75 factor already accounts for the 1/4 no-show rate. AC applies it to the 16 invitees (16 * 0.75 = 12 bags, $24), whereas both LLaMA models apply it to the 12 attendees (0.75 * 12 = 9 bags, $18), effectively discounting for no-shows twice.
- LLaMA-3-1-8B provides the most verbose step-by-step explanation.
- All responses correctly calculate the number of attendees as 12.
### Interpretation
The document demonstrates the varying accuracy of different LLMs in solving a simple arithmetic word problem. Both LLaMA models make a subtle double-discount error, applying the per-invited-guest rate to the attendees, while AC's step-by-step reading of the problem statement yields the correct total of $24. This highlights the importance of verifying LLM output even for seemingly straightforward tasks: a plausible, verbose reasoning chain can still embed a single faulty step. The problem itself tests the ability to apply fractions, multiplication, and basic arithmetic in a real-world context, and the discrepancy in final answers underscores the need for careful attention to how a rate is defined (per invited guest versus per attendee).
</details>
Figure 5: Example of AC on GSM8k dataset.
<details>
<summary>extracted/6420159/psych.png Details</summary>

### Visual Description
## Textual Document: Gate-Control Theory Question & Explanation
### Overview
The image presents a question related to the gate-control theory of pain, followed by three different explanations/reasonings for the answer. The explanations are labeled with different identifiers: LLAMA-3.2-3B, LLAMA-3.1-8B, and AC. The question asks which condition tends to close the gate in the gate-control theory.
### Components/Axes
The document is structured into three main sections:
1. **Question:** "According to gate-control theory, which condition tends to close the gate?" with four answer options: A) anxiety, B) depression, C) focusing on the pain, D) electrical stimulation.
2. **LLAMA-3.2-3B Explanation:** A detailed explanation of the gate-control theory and why focusing on the pain is the correct answer.
3. **LLAMA-3.1-8B Explanation:** Another explanation of the gate-control theory, concluding that electrical stimulation is the correct answer.
4. **AC Explanation:** A step-by-step breakdown of understanding the theory and analyzing the options, ultimately concluding with the final answer.
### Detailed Analysis or Content Details
**Question:**
* Question: According to gate-control theory, which condition tends to close the gate?
* Options:
* A) anxiety
* B) depression
* C) focusing on the pain
* D) electrical stimulation
**LLAMA-3.2-3B Explanation:**
"The question is about the gate-control theory of pain, which suggests that certain stimuli can close the gate to the central nervous system, thereby reducing the perception of pain. The options are all conditions or stimuli that can affect the perception of pain. The correct answer is the one that directly affects the gate, which is focusing on the pain. When we focus on the pain, it tends to close the gate, thereby reducing the perception of pain. The other options are not directly related to the gate in the same way. Anxiety and depression can affect pain perception, but they do not directly close the gate. Electrical stimulation can actually activate the gate, not close it. Therefore, the correct answer is C) focusing on the pain."
**LLAMA-3.1-8B Explanation:**
"Gate-control theory suggests that the gate (the mechanism in the spinal cord that regulates pain) can be closed by the activation of certain nerve fibers that carry nonpainful sensations. Anxiety, depression, and focusing on the pain can all increase pain perception, which would suggest that the gate is open, not closed. Electrical stimulation, on the other hand, can activate nerve fibers that carry nonpainful sensations, which can close the gate and reduce pain perception. Therefore, the correct answer is D) electrical stimulation."
**AC Explanation:**
"Step 1: Understand the gate-control theory of pain. The gate-control theory of pain proposes that the nervous system has a “gate” that controls the flow of pain signals to the brain. The theory suggests that this gate can be opened or closed by various factors. Step 2: Identify the conditions that tend to close the gate. According to the gate-control theory, conditions that tend to close the gate include factors that reduce the transmission of pain signals to the brain. These factors can include distraction, relaxation, and certain types of stimulation. Step 3: Analyze the options. Anxiety (A) tends to increase the transmission of pain signals to the brain, making it more likely to open the gate. Depression (B) can have a complex relationship with pain perception, but it is not directly related to closing the gate. Focusing on the pain (C) tends to increase the transmission of pain signals to the brain, making it more likely to open the gate. Electrical stimulation (D) can be used to close the gate by overriding the pain signal with a more intense stimulus. The final answer is: `<D>`"
### Key Observations
* There is a discrepancy in the answers provided by the different explanations. LLAMA-3.2-3B suggests "C) focusing on the pain", while LLAMA-3.1-8B and AC suggest "D) electrical stimulation".
* The AC explanation provides a structured, step-by-step approach to answering the question.
* The explanations demonstrate a nuanced understanding of the gate-control theory, acknowledging the complex interplay of factors influencing pain perception.
### Interpretation
The document highlights the complexities of applying theoretical knowledge to specific scenarios. The differing answers from the LLAMA models suggest that the interpretation of the gate-control theory can be ambiguous, and the correct answer may depend on the specific context or emphasis. The AC explanation, with its structured approach, provides a more reasoned justification for the answer "D) electrical stimulation," aligning with the broader understanding of how non-painful stimuli can modulate pain perception by activating inhibitory pathways. The document serves as a good example of how different reasoning processes can lead to different conclusions, even when based on the same underlying theory. The final answer provided by AC `<D>` suggests that the author or system favors the electrical stimulation answer.
</details>
Figure 6: Example of AC on MMLU High School Psychology dataset.
<details>
<summary>extracted/6420159/logic.png Details</summary>

### Visual Description
## Textual Document: Argument Analysis & Question Response
### Overview
The image presents a question regarding the conclusion of a given argument about the historical understanding of genetics and DNA. Below the question are multiple-choice answers, with one marked as the "best answer" and a detailed explanation provided. Further below is a step-by-step breakdown of how to arrive at the correct answer. The document appears to be an educational resource, likely for a logic or critical thinking course.
### Components/Axes
The document is structured into the following sections:
1. **Question:** A statement presenting an argument followed by a question about its conclusion.
2. **Multiple Choice Answers:** Four options (A, B, C, D) representing potential conclusions.
3. **Best Answer Identification:** Indicates the correct answer ("The best answer is B") with the corresponding text highlighted in bold.
4. **Explanation:** A detailed justification for why the selected answer is the correct conclusion.
5. **AC: Step-by-Step Analysis:** A numbered breakdown of the logical steps to determine the conclusion.
### Detailed Analysis or Content Details
**Question:** "Identify the conclusion of the following argument. From a historical point of view, genetics is a young science. The principles that govern inheritance traits were described less than 150 years ago. Furthermore, around the turn of the twentieth century, the laws of inheritance were rediscovered. But even then, the importance of DNA was not really understood until the 1950s."
**Multiple Choice Answers:**
* A) But even then, the importance of DNA was not really understood until the 1950s.
* B) The importance of DNA was not really understood until the 1950s.
* C) From a historical point of view, genetics is a young science.
* D) The principles that govern inheritance traits were described less than 150 years ago.
**Best Answer:** "LLaMA-3.2-3B: The best answer is B) **The importance of DNA was not really understood until the 1950s.**"
**Explanation:** "Explanation: The conclusion is the statement that follows from the premises. In this case, the conclusion is the statement that follows from the two premises. The first premise states that genetics is a young science from a historical point of view, and the second premise states that the importance of DNA was not really understood until the 1950s. The conclusion is the statement that summarizes the main point of the argument, which is that even then, the importance of DNA was not really understood until the 1950s. This is the correct answer because it is the statement that follows from the premises and summarizes the main point of the argument."
**AC: Step-by-Step Analysis:**
* Step 1: Identify the main claim of the argument. The main claim of the argument is that genetics is a young science from a historical point of view.
* Step 2: Identify the supporting evidence for the main claim. The supporting evidence includes the fact that the principles that govern inheritance traits were described less than 150 years ago and that the laws of inheritance were rediscovered around the turn of the twentieth century.
* Step 3: Identify the additional information provided in the argument. The additional information is that the importance of DNA was not really understood until the 1950s.
* Step 4: Determine the conclusion of the argument. The conclusion is the main claim of the 1950s, which is that genetics is a young science from a historical point of view.
### Key Observations
The explanation and step-by-step analysis seem to contain a slight logical inconsistency. While answer B is identified as the best answer, the explanation and AC analysis ultimately state that the conclusion is that "genetics is a young science from a historical point of view" (which corresponds to answer C). This suggests a potential error in the explanation or analysis. The document also includes "LLaMA-3.1-8B: The final answer is: <<A>>" and "LLaMA-3.1-8B: The final answer is: <<C>>" which are likely intermediate steps or alternative responses from a language model.
### Interpretation
This document demonstrates a critical thinking exercise focused on identifying the conclusion of an argument. The argument itself establishes the relative youth of genetics as a scientific field, supported by the recent understanding of DNA's role. The core issue is discerning whether the conclusion is a broad statement about the field of genetics (answer C) or a specific point about the delayed understanding of a key component (DNA, answer B). The document's internal inconsistency suggests that the process of identifying conclusions can be nuanced and prone to misinterpretation, even with detailed explanations. The inclusion of LLaMA outputs indicates the use of a large language model in the process, potentially as a tool for generating explanations or alternative answers. The document serves as a valuable example for teaching logical reasoning and the importance of careful analysis.
</details>
Figure 7: Example of AC on MMLU Formal Logic dataset.
<details>
<summary>extracted/6420159/bio.png Details</summary>

### Visual Description
## Textual Document: Ecological Niche Question & Answers
### Overview
The image presents a question regarding ecological
</details>
Figure 8: Example of AC on MMLU College Biology dataset.
<details>
<summary>extracted/6420159/law.png Details</summary>

### Visual Description
## Textual Document: Legal Question & LLM Responses
### Overview
The image presents a legal question regarding contract law, specifically concerning an oral agreement, a written offer, and a subsequent counteroffer in a real estate transaction. Below the question are responses generated by two Large Language Models (LLMs): LLAMA-3-2-3B and LLAMA-3-1-8B, along with explanations and a final assessment (AC).
### Components/Axes
The document is structured as follows:
1. **Question:** A legal scenario presented as a multiple-choice question.
2. **LLAMA-3-2-3B Response:** The LLM's chosen answer and its detailed explanation.
3. **LLAMA-3-1-8B Response:** The LLM's chosen answer and its detailed explanation.
4. **AC Assessment:** A final assessment of the situation, providing a different perspective.
### Content Details
**Question:** "A purchaser orally agreed with the seller to buy seller's house for $100,000. The purchaser gave the seller a written offer and a post-dated check for 5% of the price. Before the seller signed the offer form or deposited the check, he received an offer of his full asking price of $120,000 from another party. The seller called the purchaser and called the deal off. The purchaser sued for specific performance on the basis that the oral understanding, along with the written offer and the check, constituted an enforceable agreement. Will the purchaser prevail? A) Yes, because there was an oral agreement and the check for the down payment was acceptance of consideration being paid, B) No, because the offer was not accepted; prior to an acceptance the seller had the right to sell to anyone, C) Yes, because the buyer and seller had a meeting of the minds evidenced by their oral understanding, the check and the written offer, D) No, because specific performance does not apply in a residential real estate transaction."
**LLAMA-3-2-3B Response:**
* **Answer:** A)
* **Explanation:** "The most accurate answer would be: A). Here's the explanation: The purchaser and seller had an oral agreement to buy the house for $100,000, which is a sufficient consideration to support the enforceability of the agreement. The written offer and the post-dated check for 5% of the price ($5,000) also constitute consideration, as they demonstrate the purchaser's commitment to the deal. The seller's receipt of the written offer and the post-dated check prior to the offer from the other party indicates that the seller had already accepted the purchaser's offer and was willing to move forward with the deal. Therefore, the purchaser can argue that the seller had already accepted the offer and was not free to accept the other offer."
**LLAMA-3-1-8B Response:**
* **Answer:** The purchaser and seller had a valid contract. The check was consideration, and the oral agreement was enforceable. The seller had no right to call off the deal after the purchaser's offer was made. The seller could have sold to anyone, but he was not free to call off the deal with the purchaser after the purchaser's offer was made. The purchaser had a right to enforce the contract. <<A>>
**AC Assessment:**
"The purchaser will not prevail because the offer was not accepted. The seller received another offer and called off the deal before accepting the purchaser's offer. The law requires that there be an acceptance of the offer before a contract can be formed. The seller's action in calling off the deal before accepting the purchaser's offer terminated the offer, and the purchaser's offer lapsed. <<B>>"
### Key Observations
* Both LLMs initially favored answer A, suggesting the purchaser would prevail based on the oral agreement and provided consideration.
* The "AC" assessment contradicts both LLMs, arguing the offer was not accepted before the seller withdrew, thus no contract was formed.
* The LLAMA-3-1-8B response includes "<<A>>" and the AC assessment includes "<<B>>", which appear to be markers indicating the chosen answer.
### Interpretation
The document highlights a nuanced legal scenario where the enforceability of a contract hinges on the timing of acceptance. The LLMs demonstrate an initial inclination to find elements of a contract (oral agreement, consideration), but the "AC" assessment provides a more legally precise analysis, emphasizing the necessity of acceptance *before* revocation of the offer. This suggests the LLMs, while capable of identifying relevant factors, may not fully grasp the critical timing element in contract law. The inclusion of "<<A>>" and "<<B>>" suggests a system for tracking or labeling the LLM's responses and the final assessment. The differing conclusions demonstrate the complexity of legal interpretation and the potential for disagreement even among intelligent systems.
</details>
Figure 9: Example of AC on MMLU Professional Law dataset.
<details>
<summary>extracted/6420159/prel.png Details</summary>

### Visual Description
## Textual Document: No-Wait Weight Loss Advertising Campaign Analysis
### Overview
The image presents a question regarding the stage of an advertising campaign for the product "No-Wait Weight Loss," followed by responses from three different Large Language Models (LLaMA-3.2-3B, LLaMA-3.1-8B, and AC) and an explanation. The question asks at what stage the campaign has been unsuccessful. All three models converge on the answer "Penetration."
### Components/Axes
The document is structured as follows:
* **Question:** A posed question about the advertising campaign stage.
* **LLaMA-3.2-3B Response:** The first model's answer and reasoning.
* **LLaMA-3.1-8B Response:** The second model's answer and reasoning.
* **AC Response:** The third model's answer and reasoning.
* **Explanation:** A concluding statement reinforcing the correct answer.
### Detailed Analysis or Content Details
**Question:**
"No-Wait Weight Loss, a diet pill brand, promises its potential consumers that if they take one pill once a day they will lose weight without having to change their diet or start exercising. However, many people who have used the product have complained that they haven't lost weight, and some have even gained weight. No-Wait Weight Loss has been unsuccessful in the _______ stage of the advertising campaign: A) distribution, B) penetration, C) participation, D) identification"
**LLaMA-3.2-3B Response:**
"The best answer is B) Penetration. The product has failed to gain traction with consumers, which is referred to as the penetration stage of the advertising campaign. The penetration stage is the point in time when a product is introduced to the market and has not yet reached a significant level of adoption or usage. In this case, the product is not gaining any traction, and many consumers are not using it, which indicates that it has not yet reached the penetration stage of the advertising campaign."
**LLaMA-3.1-8B Response:**
"We need to consider the stages of the advertising campaign and how they relate to the product's performance. The stages are: 1) Identification: The initial stage… 7) Participation: The stage where consumers become engaged with the brand. We can infer that the product has not been able to convince potential consumers of its benefits, which is the Persuasion stage. However, this is not the correct answer. The product has been used by many people, and some have even complained about its performance, which suggests that the product has reached the Trial stage. However, this is also not the correct answer. The product has been unsuccessful in the stage of the advertising campaign, which suggests that the product has not been able to gain market share and become a dominant player in the market, which is the Penetration stage. Therefore, the correct answer is B) Penetration."
**AC Response:**
"The question is asking about the stage of the advertising campaign that No-Wait Weight Loss has been unsuccessful in. The stages of the advertising campaign are: identification, awareness, knowledge, liking, persuasion, trial, and adoption. However, the options provided in the question are the traditional stages of the advertising campaign. The options are distribution, penetration, participation, and identification. However, if we map the options to the traditional stages of the advertising campaign, we can see that: distribution is similar to the trial stage, penetration is similar to the adoption stage, participation is similar to the awareness stage, and identification is similar to the awareness stage. Since No-Wait Weight Loss has been unsuccessful in getting people to buy and use their product, it has been unsuccessful in the trial stage of the advertising campaign, which is similar to the distribution stage. Therefore, the correct answer is «."
**Explanation:**
"The correct answer is «."
### Key Observations
* All three models independently arrive at the same answer: Penetration.
* The models provide varying levels of detail in their reasoning.
* AC attempts to map the provided options to a more comprehensive advertising campaign model, but ultimately supports the "Penetration" answer.
* The explanation is incomplete, ending with «.
### Interpretation
The document demonstrates a consensus among different AI models regarding the correct stage of an advertising campaign where "No-Wait Weight Loss" is failing. The product's inability to gain traction and user adoption points to a failure in the "Penetration" stage, the phase where a product is introduced to the market and aims to achieve initial usage. The models correctly identify that the complaints about the product's ineffectiveness suggest it hasn't moved beyond initial awareness and is struggling to convert potential customers into actual users. The incomplete explanation at the end suggests a potential error or truncation in the document. The models' reasoning highlights the importance of understanding advertising campaign stages to diagnose marketing failures.
</details>
Figure 10: Example of AC on MMLU Public Relations dataset.
Appendix B Additional Experiments
B.1 Modifying Activations of All Tokens
Recall that AC grafts the last-token layer-$k$ activation of $A$ into $B$'s last-token layer-$j$ activation. But is modifying just the last-token activation enough to communicate information from $A$ to $B$?
Note that after applying masked attention in each of the previous Transformer layers, the last-token activation of $A$ attends to all tokens before it, and hence incorporates information from the entire sequence. Indeed, this must be the case for activation communication to recover the gap between the zero-communication and skyline setups on both coordination games, which (for Tip Sheets in particular) require information starting at the first few tokens of $A$'s prompt to be communicated.
To verify this empirically, we experiment with summing the activations of all tokens in the sequence rather than just the last (we cannot *replace* all tokens, as this would simply overwrite $B$'s layer-$j$ activations with $A$'s layer-$k$ activations). Results are shown in Table 5.
| Method | Biographies | GSM8k | HS Psychology | Formal Logic | College Biology | Professional Law | Public Relations |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AC (replace) | $\mathbf{84.6} ± 0.0$ | $64.0 ± 4.8$ | $\mathbf{85.0} ± 0.8$ | $\mathbf{47.0} ± 0.1$ | $\mathbf{78.0} ± 0.9$ | $\mathbf{30.0} ± 0.1$ | $\mathbf{74.0} ± 0.1$ |
| AC (sum) | $79.7 ± 0.0$ | $\mathbf{66.0} ± 4.7$ | $65.0 ± 4.8$ | $42.0 ± 4.9$ | $50.0 ± 5.0$ | $25.0 ± 4.3$ | $37.0 ± 4.8$ |
| AC (all tokens) | $76.0 ± 0.0$ | $62.0 ± 4.9$ | $35.0 ± 4.8$ | $42.0 ± 4.9$ | $61.0 ± 4.9$ | $15.0 ± 3.6$ | $26.0 ± 4.4$ |
Table 5: Reasoning benchmark performance when varying tokens modified during AC. All methods involve communication between $\mathtt{LLaMA}$ - $\mathtt{3.2}$ - $\mathtt{3B}$ ( $A$ ) and $\mathtt{LLaMA}$ - $\mathtt{3.1}$ - $\mathtt{8B}$ ( $B$ ). The functional form $f$ is varied between last-token replacement, last-token summation, and summation for all tokens.
Indeed, applying $f$ to all tokens decreases performance relative to applying $f$ to just the last token. The fact that performance generally decreases from $f=\mathtt{replace}$ to $f=\mathtt{sum}$, and further when modifying all tokens, is expected. The high performance of AC with $f=\mathtt{replace}$ means that the edited last-token activation $\bm{b}$ retains some meaning in $B$'s activation space; this is less likely to hold when $f=\mathtt{sum}$ (at the very least, $\bm{b}$ then has norm roughly $2\times$ that of $B$'s original last-token activation), and when applying $f$ to all tokens we would expect performance to decrease even further, since now every activation vector, not just the last, is out-of-distribution with respect to $B$'s activation space.
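As a concrete sketch, the three functional forms compared above can be written directly over activation tensors (illustrative NumPy, not the paper's implementation; the array names and shapes are assumptions):

```python
import numpy as np

def graft(act_B, act_A, mode):
    """Combine A's layer-k activations into B's layer-j activations.

    act_B, act_A: arrays of shape (seq_len, d_model).
    mode: 'replace' / 'sum' edit only the last-token activation;
          'sum_all' adds A's activation at every position.
    Returns B's edited activations as a new array.
    """
    out = act_B.copy()
    if mode == "replace":        # overwrite B's last-token activation
        out[-1] = act_A[-1]
    elif mode == "sum":          # add A's last-token activation to B's
        out[-1] = act_B[-1] + act_A[-1]
    elif mode == "sum_all":      # edit all positions, not just the last
        out = act_B + act_A
    else:
        raise ValueError(mode)
    return out

rng = np.random.default_rng(0)
a, b = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
# With mode='replace' or 'sum', only the last row of B's activations changes:
assert np.allclose(graft(b, a, "replace")[-1], a[-1])
assert np.allclose(graft(b, a, "sum")[:-1], b[:-1])
```

The edited tensor is then fed into $B$'s next layer and the forward pass continues as usual; note that under `sum` the edited last-token vector can have norm well above $B$'s typical activations, consistent with the degradation discussed above.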
B.2 Incorporating Chain-of-Thought Prompting
How does AC perform relative to NLD in cases where $A$ might produce a long response (possibly with chain-of-thought for intermediate answer computation)? That is, does AC lose out on the benefits of CoT?
First, note that we still reap the benefits of CoT when we sample a completion from $B$ after AC (where $B$ receives all the information encoding $A$'s "beliefs" about the prompt via AC, so CoT on $A$'s side is not needed). To verify this, we experiment with prompting $A$ with CoT, generating a full response, and then passing the layer-$k$ last-token activation of the CoT response to $B$ as part of AC. Results are shown in Table 6.
| Method | Biographies | GSM8k | HS Psychology | Formal Logic | College Biology | Professional Law | Public Relations |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AC | $84.6 ± 0.0$ | $64.0 ± 4.8$ | $\mathbf{85.0} ± 0.8$ | $\mathbf{47.0} ± 0.1$ | $78.0 ± 0.9$ | $30.0 ± 0.1$ | $\mathbf{74.0} ± 0.1$ |
| AC ($\bm{W}$) | $\mathbf{86.8} ± 0.0$ | $\mathbf{66.0} ± 4.8$ | $70.0 ± 0.1$ | $35.0 ± 0.1$ | $\mathbf{79.0} ± 0.9$ | $\mathbf{45.0} ± 0.1$ | $63.0 ± 0.1$ |
| AC (CoT) | $82.1 ± 0.0$ | $\mathbf{66.0} ± 4.0$ | $80.0 ± 4.0$ | $26.0 ± 4.4$ | $67.0 ± 4.7$ | $40.0 ± 4.9$ | $63.0 ± 4.8$ |
Table 6: Reasoning benchmark performance when sampling from $A$ with CoT. All methods involve communication between $\mathtt{LLaMA}$ - $\mathtt{3.2}$ - $\mathtt{3B}$ ( $A$ ) and $\mathtt{LLaMA}$ - $\mathtt{3.1}$ - $\mathtt{8B}$ ( $B$ ).
Indeed, we empirically find the above intuition to hold: there is no significant improvement over vanilla AC when generating from $A$ using CoT.
B.3 Learning $\bm{W}$ In-Distribution
Recall our reasoning about the AC $(\bm{W})$ results from Section 4.2: "We hypothesize that the benefits from the learned linear layer are less consistent across datasets because the subset of C4 data used to train $\bm{W}$ likely contains text more semantically similar to some datasets than others, hence some datasets provide $\bm{W}$ with out-of-distribution inputs which reduces performance compared to vanilla AC."
Indeed, we verify this hypothesis by training $\bm{W}$ on the GSM8k train set (to produce $\bm{W}_{\textrm{in dist}}$ ) and then evaluating with this task-specific linear layer on the GSM8k test set. Results are shown in Table 7.
| AC | AC ($\bm{W}$) | AC ($\bm{W}_{\textrm{in dist}}$) |
| --- | --- | --- |
| $64.0 ± 4.8$ | $66.0 ± 4.8$ | $\mathbf{78.0} ± 4.1$ |
Table 7: GSM8k performance when learning $\bm{W}$ in-distribution. All AC variants involve communication between $\mathtt{LLaMA}$ - $\mathtt{3.2}$ - $\mathtt{3B}$ ( $A$ ) and $\mathtt{LLaMA}$ - $\mathtt{3.1}$ - $\mathtt{8B}$ ( $B$ ).
Indeed, learning $\bm{W}$ in-distribution significantly boosts performance, confirming our hypothesis. Unfortunately, we cannot run this experiment for the other datasets, as there is no in-distribution training data available for MMLU (we use all public data for testing).
Hence, this suggests that AC ($\bm{W}$) should consistently improve over vanilla AC if we choose a training set with good coverage across many tasks and distributions, such that it contains sentences semantically similar to prompts across the span of downstream task datasets.
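To make the setup concrete, here is a minimal sketch of fitting such a linear map by ridge regression on paired activations (the closed-form objective and all names are illustrative assumptions, not the paper's exact training recipe):

```python
import numpy as np

def fit_linear_map(acts_A, acts_B, reg=1e-3):
    """Fit W minimizing ||acts_A @ W - acts_B||^2 + reg * ||W||^2.

    acts_A: (n, d_A) layer-k last-token activations of A on training prompts.
    acts_B: (n, d_B) layer-j last-token activations of B on the same prompts.
    """
    d_A = acts_A.shape[1]
    # Closed-form ridge solution: (A^T A + reg * I)^{-1} A^T B
    return np.linalg.solve(acts_A.T @ acts_A + reg * np.eye(d_A),
                           acts_A.T @ acts_B)

# Synthetic check: if B's activations really are a linear function of A's,
# the fitted map recovers it on in-distribution inputs.
rng = np.random.default_rng(0)
A_train = rng.normal(size=(512, 32))
true_map = rng.normal(size=(32, 48))
B_train = A_train @ true_map
W = fit_linear_map(A_train, B_train)
assert np.allclose(W, true_map, atol=1e-2)
```

The in-distribution caveat discussed above is visible here too: a map fit on one activation distribution need not transfer to prompts whose activations lie far from the training cloud.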
B.4 Activation Space Similarity $\propto$ AC Performance Gain
We conduct the following experiment: for each of the six pairs of models $A,B$ in the above experiment (see Table 4), we compute the increase in Biographies performance with AC relative to the average individual performance of $A$ and $B$. We also compute the matrix analogue of the squared cosine similarity between the models' activation spaces,
$$
\frac{\lVert Y^{\top}X\rVert_{F}^{2}}{\lVert X\rVert_{F}^{2}\,\lVert Y\rVert_{F}^{2}},
$$
where $X$ is the matrix of $A$'s activations on $3072$ sentences from C4 (the same dataset used to train $\bm{W}$), $Y$ is the corresponding matrix for $B$, and $\lVert\cdot\rVert_{F}$ denotes the Frobenius norm. This yields the plot in Figure 11.
There is a clear positive correlation between the similarity of the activation distributions and the AC performance gain, as expected: the more aligned $A$'s and $B$'s activation spaces are, the more semantically meaningful and useful the embedding we graft from $A$ to $B$ becomes.
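The similarity measure above is straightforward to compute; a minimal sketch, with random stand-ins for the actual $(3072 \times d)$ activation matrices:

```python
import numpy as np

def activation_space_similarity(X, Y):
    """Matrix analogue of squared cosine similarity:
    ||Y^T X||_F^2 / (||X||_F^2 * ||Y||_F^2).
    Submultiplicativity of the Frobenius norm keeps this in [0, 1]."""
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X, "fro") ** 2 * np.linalg.norm(Y, "fro") ** 2
    return num / den

rng = np.random.default_rng(0)
X = rng.normal(size=(3072, 64))
Y = rng.normal(size=(3072, 64))
# A single shared direction is perfectly aligned with itself:
x = rng.normal(size=(3072, 1))
assert np.isclose(activation_space_similarity(x, x), 1.0)
assert 0.0 <= activation_space_similarity(X, Y) <= 1.0
```

In the experiment, each row of $X$ (resp. $Y$) would be $A$'s (resp. $B$'s) last-token activation on one C4 sentence; higher values indicate more aligned activation spaces.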
B.5 Communicating Activations Between Identical Models
Note that AC as described in Section 3.1 only supports communication between distinct models. We can extend AC to communication between identical models as follows: let $A$ and $B$ be instances of the same model. We sample a completion from $A$ with nonzero temperature and graft the last-token layer-$k$ activation of the completion into $B$ at layer $j$ as part of the AC procedure. This still saves a substantial amount of compute over NLD between two model instances, showing our technique also applies to this setting. Table 8 shows the results of this experiment.
<details>
<summary>extracted/6420159/model-comparison-plot.png Details</summary>

### Visual Description
## Scatter Plot: AC Performance Gain vs. Cosine Similarity
### Overview
This image presents a scatter plot visualizing the relationship between AC (activation communication) performance gain on Biographies and a normalized squared Frobenius norm measuring the cosine similarity of activation spaces between models. Each point on the plot represents a pair of models.
### Components/Axes
* **X-axis Title:** ‖Y<sup>T</sup>X‖<sub>F</sub><sup>2</sup>/(‖X‖<sub>F</sub><sup>2</sup>‖Y‖<sub>F</sub><sup>2</sup>) (the normalized squared Frobenius norm)
* Scale: Approximately 0.3 to 0.65
* **Y-axis Title:** AC performance gain on Biographies (AC perf. - avg(A, B) perf.)
* Scale: Approximately -0.025 to 0.175
* **Data Points:** Each point is labeled with a pair of model names.
* **No Legend:** The model pairs are directly labeled on the plot.
### Detailed Analysis
The scatter plot displays the following data points (model pairs) and their approximate coordinates:
1. **gemma2b,gemma9b:** (0.35, -0.015) - Located in the bottom-left corner of the plot.
2. **qwen1.5b,llama3b:** (0.55, 0.10) - Located in the upper-middle of the plot.
3. **llama3b,llama8b:** (0.5, 0.025) - Located in the lower-middle of the plot.
4. **llama3b,gemma2b:** (0.58, 0.025) - Located in the lower-middle of the plot, slightly to the right of the previous point.
5. **qwen1.5b,gemma2b:** (0.6, 0.025) - Located in the lower-middle of the plot, slightly to the right of the previous point.
6. **qwen1.5b,qwen3b:** (0.65, 0.17) - Located in the top-right corner of the plot.
**Trends:**
* There is a general upward trend, suggesting that as the normalized squared Frobenius norm increases, the AC performance gain on Biographies also tends to increase. However, this trend is not strictly linear, and there is considerable scatter.
* The points are not clustered, indicating a wide range of performance gains for different model pairs at similar cosine similarity values.
### Key Observations
* The model pair `qwen1.5b,qwen3b` exhibits the highest AC performance gain on Biographies (approximately 0.17) and the highest normalized squared Frobenius norm (approximately 0.65).
* The model pair `gemma2b,gemma9b` exhibits the lowest AC performance gain on Biographies (approximately -0.015) and the lowest normalized squared Frobenius norm (approximately 0.35).
* Several model pairs (llama3b,gemma2b, qwen1.5b,gemma2b) have similar values for both the x and y axes, clustering around (0.58-0.6, 0.025).
### Interpretation
The plot suggests a correlation between the cosine similarity of activation spaces and the AC performance gain on Biographies. Higher cosine similarity (as measured by the normalized squared Frobenius norm) appears to be associated with higher AC performance gains. However, the scatter in the data indicates that cosine similarity is not the sole determinant of AC performance. Other factors, such as model architecture, training data, and hyperparameters, likely play a significant role.
The negative AC performance gain for the `gemma2b,gemma9b` pair suggests that combining or comparing these models might lead to a decrease in accuracy consistency on Biographies. This could be due to differences in how these models represent biographical information in their activation spaces.
The clustering of points around similar values suggests that certain model combinations exhibit similar performance characteristics. Further investigation could explore the reasons for this clustering and identify the underlying factors that contribute to it. The plot provides a visual representation of the relationship between activation space similarity and performance, which can be useful for understanding and improving model behavior.
</details>
Figure 11: AC performance gain over average $A$/$B$ individual performance on Biographies, as a function of the matrix "cosine similarity" between $A$'s and $B$'s activation spaces.
| Method | Biographies | GSM8k | HS Psychology | Formal Logic | College Biology | Professional Law | Public Relations |
| --- | --- | --- | --- | --- | --- | --- | --- |
| $\mathtt{LLaMA}$-$\mathtt{3.1}$-$\mathtt{8B}$ | $\mathbf{83.9} ± 0.0$ | $60.0 ± 4.9$ | $65.0 ± 0.1$ | $\mathbf{42.0} ± 0.1$ | $50.0 ± 0.2$ | $20.0 ± 0.8$ | $53.0 ± 0.2$ |
| NLD | $80.8 ± 0.0$ | $\mathbf{70.0} ± 3.7$ | $\mathbf{85.0} ± 3.6$ | $35.0 ± 4.8$ | $\mathbf{78.0} ± 4.1$ | $\mathbf{40.0} ± 4.9$ | $53.0 ± 5.1$ |
| AC | $83.7 ± 0.0$ | $60.0 ± 4.9$ | $\mathbf{85.0} ± 3.6$ | $40.0 ± 4.9$ | $74.0 ± 4.4$ | $\mathbf{40.0} ± 4.9$ | $\mathbf{79.0} ± 4.1$ |
Table 8: Reasoning benchmark performance of communication between identical models. Both NLD and AC involve communication between 2 instances of $\mathtt{LLaMA}$ - $\mathtt{3.1}$ - $\mathtt{8B}$ . $512$ -token completions are sampled with temperature $0.7$ and debate is run for $2$ rounds.
Indeed, while communication between multiple instances of the same model doesn't always improve over the single model itself (a well-known result from Du et al. (2023)), AC matches or outperforms NLD on five of the seven datasets.
The intuition behind debate between multiple identical model instances is that sampling multiple completions (with temperature) from the same model yields diverse reasoning paths that can be recombined into a stronger final answer. The above experiment shows that the same intuition holds for AC: we still sample multiple times from the same model, but pass responses between agents via AC rather than as NL messages.
B.6 Additional Rounds of Natural Language Debate
In Section 4.2 we fix NLD to $2$ agents and $2$ rounds; however, we find in additional experiments that AC outperforms NLD even with additional rounds, highlighting the superiority and robustness of activations as an alternative "language" for inter-LM communication. Results are shown in Table 9: for 5 of the 7 reasoning benchmarks, AC beats NLD even with $3$ to $4$ rounds while using substantially less compute.
| Method | Biographies | GSM8k | HS Psychology | Formal Logic | College Biology | Professional Law | Public Relations |
| --- | --- | --- | --- | --- | --- | --- | --- |
| NLD (1 round) | $83.6 ± 0.0$ | $72.0 ± 4.5$ | $65.0 ± 4.8$ | $40.0 ± 4.9$ | $68.0 ± 4.6$ | $30.0 ± 4.6$ | $63.0 ± 4.8$ |
| NLD (2 rounds) | $80.2 ± 0.1$ | $75.0 ± 4.3$ | $83.0 ± 0.8$ | $37.0 ± 0.1$ | $71.0 ± 0.1$ | $30.0 ± 0.1$ | $63.0 ± 0.7$ |
| NLD (3 rounds) | $80.1 ± 4.6$ | $\mathbf{79.0} ± 4.1$ | $70.0 ± 4.6$ | $45.0 ± 5.0$ | $63.0 ± 4.8$ | $\mathbf{40.0} ± 4.9$ | $\mathbf{74.0} ± 4.4$ |
| NLD (4 rounds) | $78.0 ± 0.0$ | $\mathbf{79.0} ± 4.1$ | * | * | * | * | * |
| AC | $\mathbf{84.6} ± 0.0$ | $64.0 ± 4.8$ | $\mathbf{85.0} ± 0.8$ | $\mathbf{47.0} ± 0.1$ | $\mathbf{78.0} ± 0.9$ | $30.0 ± 0.1$ | $\mathbf{74.0} ± 0.1$ |
\* Runs required too much compute.
Table 9: Reasoning benchmark performance of AC and NLD with varying number of rounds. All methods involve communication between $\mathtt{LLaMA}$ - $\mathtt{3.2}$ - $\mathtt{3B}$ ( $A$ ) and $\mathtt{LLaMA}$ - $\mathtt{3.1}$ - $\mathtt{8B}$ ( $B$ ).
B.7 Full MMLU Benchmark Results
Table 10 below displays complete results of both AC and NLD on the full MMLU benchmark. Notably, AC matches or outperforms NLD on 48 of the 57 datasets with substantially less compute, indicating its superiority and robustness as an alternative "language" for inter-LLM communication.
| Dataset | NLD | AC |
| --- | --- | --- |
| Conceptual Physics | $60.0± 4.9$ | $\mathbf{68.0± 4.6}$ |
| High School Chemistry | $\mathbf{50.0± 5.0}$ | $37.0± 4.8$ |
| Security Studies | $60.0± 4.9$ | $60.0± 4.9$ |
| Jurisprudence | $84.0± 3.6$ | $84.0± 3.6$ |
| Logical Fallacies | $63.0± 4.8$ | $\mathbf{72.0± 4.5}$ |
| College Computer Science | $44.0± 5.0$ | $44.0± 5.0$ |
| International Law | $55.0± 5.0$ | $\mathbf{59.0± 4.9}$ |
| Miscellaneous | $90.0± 3.0$ | $\mathbf{95.0± 2.2}$ |
| Marketing | $70.0± 4.6$ | $\mathbf{85.0± 3.6}$ |
| Elementary Mathematics | $\mathbf{75.0± 4.3}$ | $58.0± 4.9$ |
| Machine Learning | $42.0± 4.9$ | $42.0± 4.9$ |
| High School Macroeconomics | $44.0± 5.0$ | $\mathbf{75.0± 4.3}$ |
| High School US History | $45.0± 5.0$ | $\mathbf{71.0± 4.6}$ |
| Human Aging | $56.0± 5.0$ | $\mathbf{72.0± 4.5}$ |
| Astronomy | $79.0± 4.1$ | $\mathbf{80.0± 4.0}$ |
| Computer Security | $56.0± 5.0$ | $\mathbf{75.0± 4.3}$ |
| High School Statistics | $\mathbf{55.0± 5.0}$ | $42.0± 4.9$ |
| Professional Medicine | $\mathbf{79.0± 4.1}$ | $65.0± 4.8$ |
| Electrical Engineering | $58.0± 4.9$ | $\mathbf{60.0± 4.9}$ |
| High School Computer Science | $63.0± 4.8$ | $\mathbf{70.0± 4.6}$ |
| College Physics | $\mathbf{50.0± 5.0}$ | $28.0± 4.5$ |
| Management | $74.0± 4.1$ | $\mathbf{75.0± 4.3}$ |
| Moral Scenarios | $40.0± 4.9$ | $40.0± 4.9$ |
| World Religions | $58.0± 4.9$ | $\mathbf{72.0± 4.5}$ |
| Virology | $47.0± 5.0$ | $\mathbf{50.0± 5.0}$ |
| Philosophy | $67.0± 4.7$ | $\mathbf{70.0± 4.6}$ |
| Abstract Algebra | $\mathbf{50.0± 5.0}$ | $28.0± 4.5$ |
| High School Government and Politics | $\mathbf{80.0± 4.0}$ | $61.0± 4.9$ |
| High School Biology | $60.0± 4.9$ | $\mathbf{65.0± 4.8}$ |
| College Mathematics | $64.0± 4.8$ | $\mathbf{66.0± 2.4}$ |
| Global Facts | $33.0± 5.0$ | $\mathbf{37.0± 4.8}$ |
| High School World History | $71.0± 4.0$ | $\mathbf{74.0± 4.4}$ |
| High School European History | $68.0± 4.0$ | $\mathbf{71.0± 4.6}$ |
| College Medicine | $\mathbf{65.0± 4.8}$ | $53.0± 5.0$ |
| High School Geography | $67.0± 4.7$ | $\mathbf{79.0± 4.1}$ |
| Anatomy | $74.0± 4.4$ | $74.0± 4.4$ |
| Human Sexuality | $75.0± 4.3$ | $75.0± 4.3$ |
| Medical Genetics | $79.0± 4.1$ | $\mathbf{82.0± 3.8}$ |
| Professional Accounting | $40.0± 4.9$ | $\mathbf{48.0± 4.5}$ |
| US Foreign Policy | $89.0± 3.1$ | $\mathbf{90.0± 3.1}$ |
| Business Ethics | ${43.0± 5.0}$ | $\mathbf{44.0± 5.0}$ |
| College Chemistry | ${41.0± 5.0}$ | $\mathbf{47.0± 5.0}$ |
| High School Physics | ${40.0± 5.0}$ | $\mathbf{47.0± 5.0}$ |
| Professional Psychology | ${54.0± 4.8}$ | $\mathbf{55.0± 5.0}$ |
| Sociology | ${68.0± 4.1}$ | $\mathbf{68.0± 4.6}$ |
| High School Microeconomics | $95.0± 2.2$ | $95.0± 2.2$ |
| High School Mathematics | $55.0± 5.0$ | $55.0± 5.0$ |
| Prehistory | $\mathbf{75.0± 4.3}$ | $60.0± 4.9$ |
| Nutrition | ${64.0± 4.5}$ | $\mathbf{70.0± 4.6}$ |
| Clinical Knowledge | ${65.0± 4.3}$ | $65.0± 4.8$ |
| Moral Disputes | ${58.0± 4.8}$ | $\mathbf{60.0± 4.9}$ |
| Econometrics | ${40.0± 5.0}$ | $40.0± 4.9$ |
| High School Psychology | $83.0± 0.8$ | $\mathbf{85.0± 0.8}$ |
| Formal Logic | $37.0± 0.1$ | $\mathbf{47.0± 0.1}$ |
| College Biology | $71.0± 0.1$ | $\mathbf{78.0± 0.9}$ |
| Professional Law | $30.0± 0.1$ | $30.0± 0.1$ |
| Public Relations | $63.0± 0.7$ | $\mathbf{74.0± 0.1}$ |
| Average | $60.7± 2.0$ | $\mathbf{62.7± 2.2}$ |
Table 10: Comparison of NLD vs. AC on the full MMLU benchmark (Hendrycks et al., 2021).